Friday, July 13, 2012
A discussion of data models for analytics
Some of our competitors recommend that you use Third Normal Form (3NF) for your data structures, believing it is the optimal architecture for the ad hoc queries that underpin today's decision support and analytical processing. While 3NF can save storage space, it makes queries harder to write and slower to execute. A big downside of 3NF for data warehousing is that it forces the database to join tables for most queries. Joins can be a performance pitfall because they move large volumes of data around the system. To speed up these queries, DBAs on these other databases create and maintain aggregates and/or indexes across tables. In fact, a single table can have three, four, five or even more aggregates/indexes if it is joined to other tables on different columns. It is important to realize that these aggregates/indexes require advance knowledge of the queries, reports and analytics that will be run within the system, now and in the future.
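To make the join-versus-denormalization trade-off concrete, here is a minimal sketch using Python's built-in sqlite3 module. The schema and table names are illustrative, not taken from any particular product: a 3NF layout must join `orders` to `customers` to answer a simple revenue-by-region question, while a denormalized copy of the same data answers it from a single table scan.

```python
import sqlite3

# Illustrative 3NF schema: customer attributes live in their own table,
# so an analytic query must join back to it.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                     customer_id INTEGER REFERENCES customers(customer_id),
                     amount REAL);
INSERT INTO customers VALUES (1, 'EAST'), (2, 'WEST');
INSERT INTO orders VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0);
""")

# The 3NF version of "revenue by region" requires a join.
joined = cur.execute("""
    SELECT c.region, SUM(o.amount)
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.region ORDER BY c.region
""").fetchall()

# Denormalized alternative: region is repeated on every order row,
# trading storage space for a join-free scan.
cur.executescript("""
CREATE TABLE orders_denorm (order_id INTEGER, region TEXT, amount REAL);
INSERT INTO orders_denorm
    SELECT o.order_id, c.region, o.amount
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id;
""")
flat = cur.execute("""
    SELECT region, SUM(amount) FROM orders_denorm
    GROUP BY region ORDER BY region
""").fetchall()

print(joined)  # [('EAST', 150.0), ('WEST', 75.0)]
print(flat)    # same answer, no join needed
```

At this toy scale the join is free, but on a warehouse with billions of order rows, the join is exactly the data movement the paragraph above describes, and the denormalized copy is the kind of extra structure DBAs end up maintaining.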
Think of it this way—if you are heading out for a Sunday drive, and have no destination in mind, how can you use a navigation system to give you directions?
Another issue that many of our competitors' customers report is that they must create a duplicate copy of their data in a dimensional model to meet their desired response times. In many cases this dimensional model contains many star schemas, as well as several copies of the fact tables holding data at different levels of detail. The problem with this approach is that application developers and business users must be fully aware of what data these new tables and aggregates really contain. If they aren't, they can inadvertently make a business decision based on an old snapshot of the data, or on a small, skewed sample of it. In addition, if a user mistakenly picks the wrong "table," that query might take 10 or even 100 times longer to run than it would if they completely understood the model.
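The stale-snapshot risk above can be sketched in a few lines. This is a hypothetical example, again using sqlite3 with invented table names: a detail fact table is loaded continuously, while a pre-computed monthly aggregate is only rebuilt periodically. A user who queries the aggregate without knowing its refresh schedule gets an out-of-date answer.

```python
import sqlite3

# Hypothetical star-schema fragment: order-grain detail plus a
# monthly aggregate that was built before the latest load.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE sales_fact (sale_date TEXT, amount REAL);
CREATE TABLE sales_fact_monthly (month TEXT, total REAL);

-- First load, then the aggregate refresh runs:
INSERT INTO sales_fact VALUES ('2012-07-01', 100.0);
INSERT INTO sales_fact_monthly
    SELECT '2012-07', SUM(amount) FROM sales_fact;

-- A later load arrives AFTER the aggregate was built:
INSERT INTO sales_fact VALUES ('2012-07-02', 50.0);
""")

detail = cur.execute(
    "SELECT SUM(amount) FROM sales_fact WHERE sale_date LIKE '2012-07%'"
).fetchone()[0]
agg = cur.execute(
    "SELECT total FROM sales_fact_monthly WHERE month = '2012-07'"
).fetchone()[0]

print(detail)  # 150.0 -- the current truth
print(agg)     # 100.0 -- an old snapshot of the data
```

Two tables, two different answers to the same business question: the user has to know which table holds current data, which is exactly the awareness burden the paragraph describes.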
Because IBM Netezza appliances do not need indexes or pre-computed aggregates to perform well, there is no need to tune the data model for them. In fact, many customers simply copy the data model directly from their old data mart/warehouse or from their transactional system into the IBM Netezza appliance and leave it as-is. They then load a single copy of the data and start running their queries.