Thursday, June 28, 2012

FUD Competitors Are Spreading About Netezza


Recently I was made aware of some FUD (fear, uncertainty, and doubt) that a competitor has been sending to our current and prospective clients. This FUD contained a number of gross inaccuracies, as well as some “points” that made me scratch my head and wonder how much this competitor actually understands data warehousing and analytics.

This competitor claimed that Netezza scanned the entire table for all SQL / Analytic operations.

This claim is absolutely not true. While Netezza does not have indexes that the DBA must create and maintain, it does automatically build and maintain zone maps for all tables in the database. These zone maps contain the minimum and maximum values for every column in every extent of each table. So, before a query starts reading from disk, Netezza compares the query's predicates to the zone maps to determine which table extents can be skipped and which need to be read.

For example, if you want to calculate the total sales of Red Ford Mustangs in June 2011, Netezza can skip any table extent that does not have data for June 2011. So, for a database with 7 years of sales history, it can skip any extent whose maximum date falls before June 2011 or whose minimum date falls after June 2011. This eliminates 98% or more of the required I/O.
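To make this concrete, here is a sketch of the kind of query being described; the table and column names are purely illustrative:

-- Illustrative schema: a "sales" fact table with make, model, color,
-- sale_date, and sale_price columns (hypothetical names).
SELECT SUM(sale_price) AS total_sales
FROM   sales
WHERE  make      = 'Ford'
AND    model     = 'Mustang'
AND    color     = 'Red'
AND    sale_date BETWEEN '2011-06-01' AND '2011-06-30';
-- Before scanning, the sale_date predicate is compared to each extent's
-- zone map; any extent whose maximum sale_date is before 2011-06-01 or
-- whose minimum sale_date is after 2011-06-30 is skipped without being read.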

Our competitor claims that their range partitioning would automatically eliminate processing of all the months other than June 2011 and is a better solution. While their range partitioning will eliminate the I/O like Netezza does, partitioning brings a whole bunch of effort to their solution that they do not talk about. In their solution you create a table, and then you create "inheritance children", one per partition. So for a table with 7 years of data, that is 84 monthly partitions and 84 tables (the base table plus the 83 inheritance children). That might not seem too bad, but there's more. If you have a primary key, foreign key, index, constraint, or permission on the table, you need to apply it to the table and each of its inheritance children; it is not a global operation. So, for these 84 tables with 3 user groups having different permissions, a primary key, a constraint, and one non-unique index, that is 84 * (3 + 1 + 1 + 1), or 504, DDL statements to set up and maintain over time.
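To show what that looks like in practice, here is a rough sketch of one month of a PostgreSQL-style inheritance scheme; every table, column, and group name below is illustrative rather than taken from any particular product:

-- Base table (column names are hypothetical).
CREATE TABLE sales (
    sale_id    bigint,
    sale_date  date,
    make       varchar(30),
    model      varchar(30),
    color      varchar(20),
    sale_price numeric(10,2)
);

-- One of the monthly inheritance children; this CREATE TABLE is repeated
-- for every month of history.
CREATE TABLE sales_2011_06 (
    CHECK (sale_date >= DATE '2011-06-01' AND sale_date < DATE '2011-07-01')
) INHERITS (sales);

-- Following the post's count, each table also needs its own grants,
-- primary key, constraint, and index: six statements repeated for the
-- base table and each child.
GRANT SELECT ON sales_2011_06 TO reporting_users;
GRANT SELECT ON sales_2011_06 TO finance_users;
GRANT ALL    ON sales_2011_06 TO etl_users;
ALTER TABLE sales_2011_06 ADD PRIMARY KEY (sale_id);
ALTER TABLE sales_2011_06 ADD CHECK (sale_price >= 0);
CREATE INDEX sales_2011_06_date_idx ON sales_2011_06 (sale_date);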

And on top of that, their bulk loader is not partition-aware, so you need to write a rule for each table/inheritance child, adding 84 more DDL statements to the list.
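For illustration, one of those insert-redirection rules might look roughly like this, again using the hypothetical names from the sketch above:

-- One rule per month so inserts against the base table land in the
-- correct child; repeated 84 times and updated as new months are added.
CREATE RULE sales_insert_2011_06 AS
    ON INSERT TO sales
    WHERE (NEW.sale_date >= DATE '2011-06-01' AND NEW.sale_date < DATE '2011-07-01')
    DO INSTEAD
        INSERT INTO sales_2011_06 VALUES (NEW.*);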

In Netezza you write one statement to create the table, versus 588 DDL statements for the exact same table and data in this competitor's product.
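By way of contrast, a sketch of the single Netezza statement for the same hypothetical table (the distribution column is an assumption; the zone maps themselves require no DDL at all):

-- One statement: zone maps are built and maintained automatically,
-- with no partitions, rules, or per-partition grants to manage.
CREATE TABLE sales (
    sale_id    bigint,
    sale_date  date,
    make       varchar(30),
    model      varchar(30),
    color      varchar(20),
    sale_price numeric(10,2)
)
DISTRIBUTE ON (sale_id);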

I’ll respond to more of the claims this competitor has been making in my next few posts.

3 comments:

gprimos1 said...

Netezza Zone Maps are basically a combination of the best of both partitioning and indexing. If you have your data organized right, you can minimize the data read off disk, as you mention. Beyond the DDL argument, though, traditional database indexes and partitions have significant disadvantages compared to the zone map approach.

With partitions, every implementation I have seen requires a fixed number of partitions (years, months). If I only need a single day, I still have to scan a month's worth of extents off disk. Netezza zone maps allow me to organize on date so I only have to read a single day off disk. In addition, traditional partitions are hard to maintain, usually requiring you to explicitly create new partitions for new time periods. Zone Maps just happen based on how you organize your data. You can even defer organizing for more recent data so you don't have to incur the cost at load time. For that matter, zone maps can be used if you are trying to filter data based on seconds! Try to do that with most partitioning schemes.

Indexes are even more illustrative of the advantage zone maps offer. Most indexes are useful only for single-record retrieval. The problem is the cost of maintaining and storing and reading the indexes. While there are differences in implementation (B-tree, bitmap, etc), they all boil down to mapping a single index value to a single record. If you have a billion rows in your fact table, you are going to have a billion rows in your (thin) index. Zone maps are always going to be an order of magnitude smaller in size since they are at the extent level, and so the maintenance costs are trivial.

Also, if you are processing a large percentage of the data, indexes are no help at all. If your data is not organized (sorted) on the indexed column, then the values you are looking for are likely to be spread throughout the extents on disk. That means you have to read the entire table anyway, with the additional cost of having to read the index first!

If there is an approach other than zone maps that will actually get you better performance when processing large amounts of data, I would like to hear it, but I can't imagine one myself.

Alex Buzunov said...

Hi,
Stats gathering/grooming is still a challenge in NZ for large tables, even with zone maps.
Without partitions, it locks the whole table, making it unavailable to users.
How do you make sure zone maps cover incremental inserts/updates without grooming and stats gathering?