The world is filled with situations where one size does not fit all – shoes, health care, the desired number of sprinkles on a fudge sundae, to name a few. You can add data pipelines to the list.
Traditionally, a data pipeline handles the connectivity to business applications, controls the requests and flow of data into new computing environments, and then manages the steps needed to clean, organize, and present a refined data product to consumers, inside or outside the enterprise walls. These results have become indispensable in helping decision makers drive their business forward.
Lessons learned from Big Data
Everyone is familiar with the Big Data success stories: How companies like Netflix build pipelines that manage more than a petabyte of data every day, or how Meta analyzes over 300 petabytes of clickstream data on its analytics platforms. It is easy to assume that we have already solved all the hard problems when we reach this scale.
Unfortunately, it’s not that simple. Just ask anyone who works with operational data pipelines – they’ll be the first to tell you that one size definitely doesn’t fit all.
For operational data, which is the data that underpins the core parts of a business such as finance, supply chain and HR, organizations routinely fail to deliver value from analytics pipelines. That is true even if they were designed in a way similar to Big Data environments.
Why? Because they’re trying to solve a fundamentally different data challenge with essentially the same approach, and it’s not working.
The problem is not the size of the data, but how complex it is.
Leading social or digital streaming platforms often store large data sets as a series of simple, ordered events. One line of data is captured in a data pipeline for a user watching a TV show, and another records every “Like” button clicked on a social media profile. All this data is processed through data pipelines at tremendous speed and scale using cloud technology.
The datasets themselves are large, and that’s okay because the underlying data is extremely ordered and clear to begin with. The highly organized structure of clickstream data means that billions upon billions of records can be analyzed in a short amount of time.
Data pipelines and ERP platforms
For operational systems, such as Enterprise Resource Planning (ERP) platforms that most organizations use to run their essential day-to-day processes, it is a completely different data landscape.
Since their introduction in the 1970s, ERP systems have evolved to optimize every ounce of performance to capture raw transactions from the business environment. Every sales order, ledger entry and supply chain inventory needs to be captured and processed as quickly as possible.
To achieve this performance, ERP systems have evolved to manage tens of thousands of individual database tables that track business data elements and even more relationships between those objects. This data architecture is effective in ensuring that a customer’s or supplier’s records are consistent over time.
But, as it turns out, what’s great for transaction speed in that business process is usually not so great for analytics performance. Instead of the clean, simple, and well-organized tables that modern web applications create, it’s a spaghetti-like mess of data, spread across a complex, real-time, mission-critical application.
For example, analyzing a single financial transaction for a company’s books may require data from over 50 different tables in the backend ERP database, often with multiple lookups and calculations.
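To make the join burden concrete, here is a minimal sketch using Python's built-in sqlite3 module. The four tables, their columns, and the exchange rate are hypothetical and chosen only for illustration; a real ERP schema would spread the same lookup across far more tables.

```python
import sqlite3

def build_demo_db():
    """Create a tiny in-memory schema standing in for normalized ERP tables.

    All table names, columns, and values are invented for this sketch.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE account (account_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE cost_center (cc_id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE fx_rate (currency TEXT PRIMARY KEY, to_usd REAL);
        CREATE TABLE ledger_entry (
            entry_id INTEGER PRIMARY KEY,
            account_id INTEGER REFERENCES account(account_id),
            cc_id INTEGER REFERENCES cost_center(cc_id),
            currency TEXT REFERENCES fx_rate(currency),
            amount REAL
        );
        INSERT INTO account VALUES (1, 'Travel'), (2, 'Software');
        INSERT INTO cost_center VALUES (10, 'EMEA'), (20, 'APAC');
        -- Made-up rate, picked so the arithmetic is easy to follow.
        INSERT INTO fx_rate VALUES ('USD', 1.0), ('EUR', 2.0);
        INSERT INTO ledger_entry VALUES
            (100, 1, 10, 'EUR', 50.0),
            (101, 2, 20, 'USD', 30.0);
    """)
    return conn

def describe_entry(conn, entry_id):
    """Resolve one ledger entry into readable form: three joins for one row."""
    return conn.execute("""
        SELECT a.name, c.region, l.amount * f.to_usd
        FROM ledger_entry AS l
        JOIN account AS a ON a.account_id = l.account_id
        JOIN cost_center AS c ON c.cc_id = l.cc_id
        JOIN fx_rate AS f ON f.currency = l.currency
        WHERE l.entry_id = ?
    """, (entry_id,)).fetchone()
```

Even this toy version needs three joins and a calculation to describe a single entry. Scaling the same pattern to dozens of tables is what makes the analytical queries slow and hard to maintain.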
To answer questions that span hundreds of tables and relationships, business analysts must write increasingly complex queries that often take hours to produce results. Unfortunately, these queries often fail to return answers in time, leaving the business flying blind at a critical point in the decision-making process.
To address this, organizations try to evolve the design of their data pipelines further, routing data into increasingly simplified business views so that queries become less complex and easier to run.
This could work in theory, but it comes at the cost of simplifying the data itself. Instead of enabling analysts to ask and answer questions with data, this approach often summarizes or reshapes the data to increase performance. That means analysts can get quick answers to pre-defined questions and wait longer for everything else.
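The trade-off can be shown with a small Python sketch. The order records and field names are hypothetical. A summary built for one pre-defined question answers it quickly, but the detail needed for a different question is no longer in the summary.

```python
from collections import defaultdict

# Hypothetical raw order records, invented for this sketch.
RAW_ORDERS = [
    {"region": "EMEA", "product": "widget", "amount": 100},
    {"region": "EMEA", "product": "gadget", "amount": 50},
    {"region": "APAC", "product": "widget", "amount": 70},
]

def total_by(orders, key):
    """Sum order amounts grouped by the given field of the raw records."""
    totals = defaultdict(int)
    for order in orders:
        totals[order[key]] += order["amount"]
    return dict(totals)

# The simplified "business view": revenue by region, computed in the pipeline.
REGION_SUMMARY = total_by(RAW_ORDERS, "region")

def can_answer_from_summary(summary, dimension_values):
    """Return True only if every requested group exists as a key in the summary."""
    return all(value in summary for value in dimension_values)
```

Here `REGION_SUMMARY` answers "revenue by region" immediately, but `can_answer_from_summary(REGION_SUMMARY, ["widget", "gadget"])` is false: a question about products has to go back to `RAW_ORDERS`, which in practice means going back to the source system.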
With inflexible data pipelines, asking new questions means going back to the source system, which is time-consuming and quickly becomes expensive. If something changes in the ERP application, the pipeline breaks completely.
Rather than using a static pipeline model that cannot respond effectively to highly interconnected data, it is important to design for this level of connectivity from the start.
Instead of making the piping ever smaller to break up the problem, the design should embrace these connections. In practice, that means addressing the root cause behind the pipeline itself: Making data available to users without the time and cost associated with expensive analytical queries.
Each linked table in a complex analysis puts additional pressure on both the underlying platform and those tasked with maintaining business performance by tuning and optimizing those queries. To rethink the approach, one needs to look at how everything is optimized when the data is loaded – but, importantly, before any queries are run. This is generally referred to as query acceleration, and it provides a useful shortcut.
This query acceleration approach can deliver results many times faster than traditional data analytics. It achieves this without the data having to be remodeled or summarized in advance. Because the entire dataset is scanned and prepared at load time, before any queries run, there are fewer constraints on how queries can be answered. This also improves query utility by delivering the full extent of the raw business data available for exploration.
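As a loose illustration of the load-time idea, and not a description of how any particular product implements it, the sketch below indexes a relationship once when the data is loaded. Later queries then resolve it with a dictionary lookup instead of a fresh join, while the raw rows stay available. The class, records, and field names are all hypothetical.

```python
# Hypothetical sample data, invented for this sketch.
CUSTOMERS = [
    {"id": 1, "segment": "retail", "country": "DE"},
    {"id": 2, "segment": "wholesale", "country": "JP"},
]
ORDERS = [
    {"customer_id": 1, "amount": 40},
    {"customer_id": 2, "amount": 25},
    {"customer_id": 1, "amount": 10},
]

class PreparedDataset:
    """Keeps raw rows intact and pays the relationship cost once, at load time."""

    def __init__(self, orders, customers):
        # Load-time work: build the customer lookup a single time.
        self.customers_by_id = {c["id"]: c for c in customers}
        # Raw detail is retained, so no question is ruled out in advance.
        self.orders = orders

    def total_by_customer_attr(self, attr):
        """Sum order amounts by any customer attribute, chosen at query time."""
        totals = {}
        for order in self.orders:
            value = self.customers_by_id[order["customer_id"]][attr]
            totals[value] = totals.get(value, 0) + order["amount"]
        return totals
```

Because nothing was summarized away, the same prepared dataset can answer `total_by_customer_attr("segment")` and `total_by_customer_attr("country")` without any change to the loading step.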
By questioning the fundamental assumptions of how we collect, process and analyze our operational data, it is possible to simplify and streamline the steps needed to move from expensive, fragile data pipelines to faster business decisions. Remember: One size does not fit all.
Nick Jewell is senior director of product marketing at Incorta.