Monday, July 09, 2012
Indexes do NOT make a warehouse agile
I took a few days off to visit the Finger Lakes in New York (you really should go there if you like hiking and/or wine) and came back to an overflowing email inbox. Some of these emails were from clients who have been receiving more correspondence from our competitors claiming superiority over IBM Netezza.
One of these claims was that their solution is more nimble and able to handle broader workloads because it has indexes and aggregates. This one made me think for a few minutes, but I still feel that the Netezza approach, where there is no need for indexes, is a far better solution.
While an index or aggregate can be used to improve performance for one or more queries, the upfront table, aggregate, and index design/build phase will cause other systems to take MANY TIMES longer to get up and running efficiently than IBM Netezza appliances. In fact, some of our competitors’ customers have openly talked about months-long implementation cycles, while IBM Netezza customers talk about being up and running in 24 hours…
Instead of days or weeks of planning their data models, IBM Netezza appliance customers simply copy their table definitions from their existing tables, load the data, and start running their queries/reports/analytics. There is absolutely no need to create any indexes (and then have to choose between up to 19 different types of indexes) or aggregates. Where other data warehouse solutions require weeks of planning, IBM Netezza appliances are designed to deliver results almost immediately.
Today’s data warehouse technologies have made it possible to collect and merge very large amounts of data. Systems that require indexes are fine for creating historical reports, because you simply run the same report over and over again. But today’s business users need answers promptly. The answer to one question determines the next question they are going to ask, that answer determines the next, and so on. This thought process is known as “train of thought” analysis, and it can lead to the competitive advantages required in the economy of today and tomorrow.
Outside of IBM Netezza data warehouse appliances, standard operating procedure is for users to extract a small sample of data, move it out of the data warehouse to a dedicated analytics server, and then run the analytics against that sample. This is required because those systems cannot support ad-hoc analytic queries without exhausting system resources and impacting all other users. Even if these other systems had the right indexes all of the time, they would still need to move massive amounts of data into memory before processing it.
The small sample size allows the analysis to complete in a reasonable amount of time, but restricting the analysis to a small subset of the data makes it harder to spot (and act on) the trends within the full data set. We discussed this above with the baseball example, but it applies to everyone.
Credit cards are another example. With credit card numbers and identities being stolen every day, the card companies need to detect misuse immediately in order to limit their liability and prevent loss. While there are quick “indicators” of fraud, like a new credit card being used for the first time at a pay phone at an airport, most indicators come from correlating more than one transaction. For example, a card cannot be used in Kansas and then in Orlando 7 minutes later, unless one of the transactions is a web transaction or the card number was entered manually because it was taken over the phone. So, if a card was physically swiped in Kansas and then in Orlando within a few hours, the account has almost certainly been compromised. Now, if the fraud detection application only looked at every 10th, or every 100th, transaction, it would miss at least one, if not both, of these transactions most of the time.
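To make the sampling problem concrete, here is a minimal Python sketch of the Kansas/Orlando check. It is only an illustration, not how any real fraud system or IBM Netezza works: the `Swipe` record, the `find_conflicts` and `sample_every` helpers, the three-hour window, and the made-up transaction stream are all assumptions introduced for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Swipe:
    """One physical card swipe (hypothetical record layout)."""
    card: str
    location: str
    at: datetime


def find_conflicts(swipes, window=timedelta(hours=3)):
    """Return pairs of consecutive swipes of the same card that happen
    in different locations within `window` (assumed to be 3 hours)."""
    last_by_card = {}
    conflicts = []
    for swipe in sorted(swipes, key=lambda s: s.at):
        prev = last_by_card.get(swipe.card)
        if (prev is not None
                and prev.location != swipe.location
                and swipe.at - prev.at <= window):
            conflicts.append((prev, swipe))
        last_by_card[swipe.card] = swipe
    return conflicts


def sample_every(swipes, n):
    """Keep only every n-th transaction, as a sampling system would."""
    return swipes[::n]


# Made-up stream: 100 swipes, 7 minutes apart, each on a different card.
start = datetime(2012, 7, 9, 12, 0)
swipes = [
    Swipe(f"card-{i}", "Denver", start + timedelta(minutes=7 * i))
    for i in range(100)
]
# One card is swiped in Kansas and then in Orlando 7 minutes later.
swipes[41] = Swipe("stolen", "Kansas", swipes[41].at)
swipes[42] = Swipe("stolen", "Orlando", swipes[42].at)

full_hits = find_conflicts(swipes)                       # looks at every swipe
sampled_hits = find_conflicts(sample_every(swipes, 10))  # every 10th swipe only

print(len(full_hits), len(sampled_hits))
```

In this made-up stream, the check over all transactions flags the stolen card, while the every-10th sample never sees either swipe and flags nothing. Because the rule needs both transactions, a sample that drops even one of them cannot raise the alert.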