Leading organizations differentiate themselves by analyzing massive amounts of interrelated data to predict business outcomes. High-volume, complex analytics requires detailed data (not summaries), because influencing an individual’s actions means tracking and analyzing that individual’s unique interactions with your company. Traditional analytic systems and traditional databases cannot meet today’s need for predictive analytics on massive amounts of data.
It’s easy to overlook data movement when thinking about analytics and analytic performance. However, as data volumes increase, the simple act of moving data to an analytic engine dramatically degrades overall performance. To illustrate, a major credit card company takes two weeks to build its analysis files, while an insurance company needs six days for the same task. For many large-data analyses, moving data consumes far more time than all other activities combined. In this post I will compare traditional systems, composed of physically separate database and compute servers, with various forms of contemporary analytic data warehouses, and note the types of analytics typically available on each.
Recognizing that database servers were not built for complex analytics, vendors paired a compute server with the database server. These traditional two-server analytic systems extract data from the database (either the data warehouse or the transactional system) and move them to another server, where model building, model validation, and scoring take place. Moving a big data set from the database server to the analytic server is very inefficient and creates a large lag between the time data are created and the time they are analyzed. Beyond performance, this architecture brings many challenges, including increased network load, extra overhead for analysts, demand for redundant infrastructure, data governance and synchronization issues, and data security concerns.
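To make the data-movement penalty concrete, here is a minimal sketch of that extract-then-score pattern in Python. The warehouse file, table, and column names are hypothetical, and any database driver and modeling library could stand in for sqlite3 and scikit-learn; the point is simply that every byte crosses the network before any analytics begins.

```python
# A minimal sketch of the traditional two-server, extract-then-score pattern
# (hypothetical table, columns, and files; not tied to any specific product).
import sqlite3
import pandas as pd
from sklearn.linear_model import LogisticRegression

# 1. Extract: every row and every column crosses the network to the analytic server.
conn = sqlite3.connect("warehouse.db")                # hypothetical warehouse copy
df = pd.read_sql("SELECT * FROM transactions", conn)  # full-table transfer

# 2. Model building starts only after the copy completes.
features = df[["amount", "merchant_risk", "days_since_last"]]  # hypothetical columns
labels = df["is_fraud"]
model = LogisticRegression().fit(features, labels)

# 3. Scoring also happens on the analytic server, far from where the data live.
df["fraud_score"] = model.predict_proba(features)[:, 1]
```

Everything after step 1 waits on that full-table transfer, which is exactly the lag described above.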
The next generation of analytic servers was driven by the need to minimize data movement. Most data warehouse vendors have built what they call in-database analytics. The main innovation was collocation of the compute and database engines, which eliminates the need to copy data to another server for analysis. However, data must still be moved from disk to memory before the real analytics can happen. Moreover, the data transfers are not optimized – these systems must move entire tables even if only a subset of rows and columns is necessary to perform the analysis. And in many cases these data warehouses only offer SQL-based in-database analytics such as MIN, MAX, AVERAGE, and MEDIAN.
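As a rough illustration of that limitation, the sketch below uses an in-memory SQLite table (the table and column names are hypothetical) to contrast what such a warehouse can push down, a handful of SQL aggregates, with the full-table copy that anything richer typically requires.

```python
# Sketch contrasting what stays in-database (simple SQL aggregates) with what does not.
# An in-memory SQLite table stands in for the warehouse; names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO transactions VALUES (?, ?)",
                 [(1, 10.0), (1, 25.0), (2, 7.5), (2, 12.5)])

# SQL-based in-database analytics: the aggregate runs where the data live,
# and only the tiny result set ever leaves the engine.
in_db = conn.execute("""
    SELECT customer_id, MIN(amount), MAX(amount), AVG(amount)
    FROM transactions
    GROUP BY customer_id
""").fetchall()

# Anything richer than these built-ins typically means materializing the whole
# table in memory first, even if only a few columns are actually needed.
full_copy = conn.execute("SELECT * FROM transactions").fetchall()

print(in_db)          # e.g. [(1, 10.0, 25.0, 17.5), (2, 7.5, 12.5, 10.0)]
print(len(full_copy)) # every row copied out of the engine
```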
In terms of performance, the in-database stream processing architecture rises to the top. This architecture, found in the IBM Netezza data warehouse, eliminates the need to copy data into memory: data are analyzed as they stream off disk, minimizing data movement and data volume prior to scoring. Data minimization is accomplished with three capabilities: zone map technology and two types of filter technology. Zone map acceleration exploits the natural ordering of rows in a data warehouse to avoid scanning blocks of rows that are not relevant to the analytic query. Next, the project and restrict engines eliminate unneeded columns and rows, respectively. The IBM Netezza data warehouse appliance delivers its performance advantage because it performs complex analytics as data stream off disk.
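To show how those three capabilities compound, here is a toy Python sketch; the zone map structure, predicate, and row layout are my own illustration of the idea, not Netezza internals. Irrelevant blocks are skipped first, then rows are restricted and columns projected before anything reaches scoring.

```python
# Toy sketch of the three data-minimization steps: zone maps, restrict, project.
# All structures and values here are illustrative, not the Netezza implementation.

# Each zone map entry summarizes a block of contiguous rows with min/max values.
zones = [
    {"rows": range(0, 1000),    "min_date": "2011-01-01", "max_date": "2011-03-31"},
    {"rows": range(1000, 2000), "min_date": "2011-04-01", "max_date": "2011-06-30"},
]
query_from, query_to = "2011-05-01", "2011-05-31"

# 1. Zone maps: skip whole blocks whose min/max range cannot satisfy the query.
candidates = [z for z in zones
              if not (z["max_date"] < query_from or z["min_date"] > query_to)]

def stream_rows(zone):
    """Stand-in for rows streaming off disk, one zone at a time."""
    for i in zone["rows"]:
        date = "2011-05-15" if i % 2 == 0 else "2011-04-15"
        yield {"txn_id": i, "date": date, "amount": float(i), "notes": "unused text"}

needed_columns = ("txn_id", "amount")
minimized_stream = []

for zone in candidates:
    for row in stream_rows(zone):
        # 2. Restrict engine: drop rows that fail the query predicate.
        if not (query_from <= row["date"] <= query_to):
            continue
        # 3. Project engine: keep only the columns the analysis needs.
        minimized_stream.append({c: row[c] for c in needed_columns})

# Only this minimized stream ever reaches the scoring step.
print(len(minimized_stream), "of", sum(len(z["rows"]) for z in zones), "rows retained")
```

In this toy run only one of the two zones is scanned, half of its rows survive the restrict step, and two of the four columns survive projection, so the scoring step sees a small fraction of the raw bytes on disk.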
Many large-data analytics processes are slow because of data movement. Traditional two-server solutions must move data within the database server and then over a network to the analytic server. General-purpose data warehouses eliminate data movement across the network by collocating the database and analytic engines, but are still hampered by copying data from disk to memory before scoring. High-performance analytic servers use a stream processing architecture to eliminate unnecessary data movement. In other words, with the IBM Netezza stream processing architecture, the credit card and insurance companies could reclaim much of the two weeks and six days, respectively, that they currently spend building analysis files. I hope this blog post helps you see that not all in-database analytics solutions are optimized for large-scale data. I welcome your feedback and I’m happy to field questions.