Friday, July 13, 2012

A discussion of data models for analytics


Some of our competitors recommend that you use 3rd Normal Form (3NF) for their data structures, as they believe that is the optimal architecture for the ad hoc queries that form the basis for decision support and analytical processing of today. While 3NF can save storage space, it makes queries harder to write, and slower to execute. A big down side of 3NF for data warehousing is that it causes the database to join tables for most queries. “Joins” can be a performance pitfall because they force large volumes of data to be moved around the system. To speed up these queries, DBAs using these other databases create and maintain aggregates and/or indexes across tables. In fact, some tables can have 3, 4, 5 or even more aggregates/ indexes if they are joined to tables using different columns. It is important to realize that these aggregates/indexes require knowledge of the queries, reports and analytics that are going to be run within the system, now and in the future.
Think of it this way—if you are heading out for a Sunday drive, and have no destination in mind, how can you use a navigation system to give you directions?

Another issue that many of our competitors’ customers report is that they need to create a duplicate copy of their data in a dimensional model in order to meet their desired response times. In many cases this dimensional model will contain many star schemas, as well as a number of copies of the fact tables containing different levels of data. The issue with this approach is that the application developer and the business users must be fully aware of what data these new tables and aggregates really contain. If they aren’t aware, they can inadvertently make a business decision based on an old snapshot of the data, or a small, skewed sample of the data. In addition, if a user mistakenly picks the wrong “table,” that query might take 10 or even 100 times longer to run that it would if they completely understood the model.

Because IBM Netezza appliances do not need indexes or pre-computed aggregates to perform well, there is no need to tune the data model for IBM Netezza appliances. In fact, many customers simply copy the data model directly from their old data mart/warehouse or from their transactional system into the IBM Netezza appliance and leave it as-is. They then simply load a single copy of the data and start running their queries.

3 comments:

fajar uno said...

tulisan yang anda buat sangat menarik, saya juga punya tulisan yang menarik, kamu bisa kunjungi di http://repository.gunadarma.ac.id/bitstream/123456789/2979/1/78.pdf

Unknown said...

Very interesting piece Dwaine. I've been analyzing different types of software for asset tracking to help my company improve our data management operations... This was very enlightening to read as I begin my search for best practices. Thanks for the great content.

Naz said...

I really appreciate information shared above. It’s of great help. If someone want to learn Online (Virtual) instructor lead live training in Big Data Hadoop and Spark Developer, kindly contact us http://www.maxmunus.com/contact
MaxMunus Offer World Class Virtual Instructor led training on Netezza Training. We have industry expert trainer. We provide Training Material and Software Support. MaxMunus has successfully conducted 100000+ trainings in India, USA, UK, Australlia, Switzerland, Qatar, Saudi Arabia, Bangladesh, Bahrain and UAE etc.
For Demo Contact us.
Sangita Mohanty
MaxMunus
E-mail: sangita@maxmunus.com
Skype id: training_maxmunus
Ph:(0) 9738075708 / 080 - 41103383
http://www.maxmunus.com/