Tuesday, July 03, 2012

Why should compression only work for read-only data?


A number of our competitors make bold compression claims, like “we provide 4X, or even 10X+ compression”. What they do not tell you is that you have to choose between multiple different compression levels, and/or compression libraries/types. They also do not mention that you cannot use their compression on tables that have inserts, updates, and/or deletes occurring. Nor do they mention the overhead that you can expect to see if you turn on their compression.

Let’s look at these three points in a little more detail.

In today’s world of reduced budgets, one of the easiest ways to save money with a data warehouse is to reduce the amount of disk space required to store the data. To this end, nearly all data warehouse solutions offer some form of database compression. In order to use compression with many of our competitors the table must be an Append-only table. 

Append-only tables have the following limitations:
            •Cannot UPDATE rows in an append-only table
•Cannot DELETE rows from an append-only table
•Cannot ALTER TABLE...ADD COLUMN to an append-only table
•Cannot add indexes to an append-only table.

These limitations are because of the way these vendors have implemented their ‘database compression’. While Netezza has built in algorithms specifically designed for database usage, others uses a compression library that “compresses” the data file as it is being written to disk, and then uses the same library to un-compress the file as it reads it from disk. For anyone that has used a tool like gzip, WinZip, pkzip, WinRar, etc. you all know how slow these tools are, and how much CPU cycles they use. This is the same consideration and overhead you will have with these other vendors if you use their compression. In fact this overhead can be so bad that some customers who have presented at our competitors’ conferences have talked about tests where a single query running on a table with compression used up over 95% of the CPU, while the same query against the same data in a table that was not compressed used less than 5% of the CPU.


On top of the performance impact, there is also the DBA overhead. With one competitor the DBA has to choose between 3 types of compression (i.e. compression libraries) and 9 different levels of compression, each of which work better for different data types. That is 27 different combinations/permutations that the DBA has to choose between, for each table.

With IBM Netezza compression is always on, there is no “better algorithm” for different tables, and because of the way the Netezza architecture works, when you get 4X compression with your data in Netezza, you see an associated 4X improvement in performance, for all types of workloads, not just reads. 

No comments: