Thursday, May 03, 2012
Performance in an Ad-Hoc Environment
I start off my Introduction to Netezza (or Netezza 101) presentations with the question “Would you still use Google if it took 7 days and 3 people to get your answer?”, and most people scratch their heads at first, wondering why I ask. Let me explain.
In my opinion, Google is a lot like today’s analytic tools and applications. When you use Google, you give it some search terms and see what it comes up with. If you are lucky, the information you want is on the first page, but maybe it is on page two or three. Or maybe the search results are not what you were looking for, so you change the search terms completely, or add or remove some terms and search again. Eventually you find what you are looking for.
Today’s train of thought analytics is a lot like Google. You never know what someone is going to ask. You don't know where the result of one query will lead, or what the results will ultimately be. Unlike the report that gets the total sales for the past month, there is no “map” to guide train of thought analytics. IBM Netezza provides the unique ability to move beyond just reporting, and allows users to uncover the answers they are looking for.
IBM Netezza is the best solution for train of thought analytics for a number of reasons:
- IBM Netezza does NOT require indexes to perform optimally
- IBM Netezza allows analytics across all of your data, not just a subset (I talked about this in an earlier post)
In my opinion, a database that relies on indexes (or even worse, 19 different types of indexes) and aggregates to perform well is ill-suited for train of thought analytics. Now, these databases can run without indexes, as long as you take a very small sample of the data to run against… but then you run into incorrect and/or poor results that will not help drive more value, and may in fact hurt the bottom line.
Let’s use an example to show why a system that needs indexes is not suited for train of thought, or any ad-hoc, analytics. Given the following tables (simplified to the columns used in this discussion):
create table car_sales(
  vin char(17), region varchar(20), make varchar(20), model varchar(20),
  color varchar(12), delivery_date date);
create table dealer_sales(
  vin char(17), price decimal(10,2), cost decimal(10,2),
  sale_date date, salesperson_id integer);
create table sales_people(
  salesperson_id integer, name varchar(40), dealer_id integer);
If you want the total of all sales, then you need only access the dealer_sales table. There are two things that can make this query faster: a materialized view, or an index containing only the price column of the dealer_sales table. In this case the rows in the dealer_sales table are 102 bytes wide. If you created an index on the price column, the index would be roughly 1/8th the width of the table, so the query should run roughly 8 times faster. After seeing this query run a few times, a good DBA will create this index, but that is too late for the people who already ran their analytic queries.
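As a concrete sketch of the total-sales query, here it is against an in-memory SQLite database (the schema and sample rows are invented for illustration; only the price column matters here, which is exactly why a narrow single-column index pays off on an index-dependent database):

```python
import sqlite3

# Hypothetical data; the real dealer_sales table would hold millions of rows.
con = sqlite3.connect(":memory:")
con.execute("create table dealer_sales(vin text, price real, cost real)")
con.executemany("insert into dealer_sales values (?, ?, ?)",
                [("VIN1", 25000.0, 21000.0),
                 ("VIN2", 18000.0, 16500.0),
                 ("VIN3", 32000.0, 27000.0)])

# Total of all sales: the query touches only the price column, even
# though every row scanned is 102 bytes wide without an index.
total = con.execute("select sum(price) from dealer_sales").fetchone()[0]
print(total)  # 75000.0
```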
Now, what if the regional sales manager wants to track total sales by region at any given time? To calculate this you need to join the dealer_sales and car_sales tables, and you would probably want to create a join index (or an aggregate, depending on which database you are using) to allow that to be done efficiently. The only column common to both tables is the VIN column, and it is not even one of the columns referenced in the query (price and region, and probably date as well, so the manager can look at the current day’s sales, as well as the current week, month, year to date, etc.).
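A sketch of that regional query (again with an invented SQLite schema: price lives in dealer_sales, region in car_sales, and VIN is the only shared column, so the join key never appears in the result):

```python
import sqlite3

# Hypothetical schema and data, following the article's example.
con = sqlite3.connect(":memory:")
con.execute("create table car_sales(vin text, region text)")
con.execute("create table dealer_sales(vin text, price real)")
con.executemany("insert into car_sales values (?, ?)",
                [("VIN1", "Northeast"), ("VIN2", "Northeast"), ("VIN3", "South")])
con.executemany("insert into dealer_sales values (?, ?)",
                [("VIN1", 25000.0), ("VIN2", 18000.0), ("VIN3", 32000.0)])

# Total sales by region: the join runs on vin even though the output
# shows only region and price, which is what a join index would optimize.
rows = con.execute("""
    select c.region, sum(d.price)
    from dealer_sales d join car_sales c on c.vin = d.vin
    group by c.region order by c.region
""").fetchall()
print(rows)  # [('Northeast', 43000.0), ('South', 32000.0)]
```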
If I were a regional manager, I would also want to optimize the delivery of new cars to the dealers where they are most likely to sell at the highest profit margin. There are a number of ways to do this, each with increasing potential to drive more sales if current trends continue:
1) Compute the average net profit (price - cost) for every region, and ship more cars to the highest-margin regions
· This would require at least two join indexes, and likely other secondary indexes on the price and cost columns in dealer_sales.
2) Compute the average net profit for each make (i.e. Ford, Chevy, Dodge, Acura, etc.) in each region, and ship more cars of a given make to the regions where it is most profitable
· This would require the same indexes as above, but also an index on the make column in the car_sales table.
3) Compute the average net profit for each make and model, and ship more cars of a given model to the regions where it is most profitable
· This would require the same indexes as above, but also an index on the make and model columns in the car_sales table.
4) But maybe different regions also have a preference for color, as in the northeastern US, where white does not show the dirt from the sand and salt on the roads in winter. If we combine the color preference with the make and model preferences above, we can tailor each dealer's inventory to its customer base.
· The color preference query would require the same indexes as above, but also an index on the color column in the car_sales table.
5) Taking this even further, let’s look at inventory optimization. What if we could eliminate the need to keep cars on the lot for months before they sell? Historical sales might indicate that in Maine there is a three-week period at the beginning of May when sales run at three times the normal rate. I would want to make sure I have enough cars on the lot there (optimized by make, model and color, of course) to meet this demand, but I would not want to carry that much inventory all year long, as it ties up a lot of money.
· This would require more tuning of the data model with an index on the delivery_date column in the car_sales table.
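Notice that steps 1 through 4 are all the same aggregate with a progressively wider GROUP BY, and every added grouping column is what drives the demand for another index. A hedged sketch (SQLite, with invented data):

```python
import sqlite3

# Hypothetical data covering steps 1-4 above.
con = sqlite3.connect(":memory:")
con.execute("create table car_sales(vin text, region text, make text, model text, color text)")
con.execute("create table dealer_sales(vin text, price real, cost real)")
con.executemany("insert into car_sales values (?, ?, ?, ?, ?)",
                [("VIN1", "Northeast", "Ford", "F-150", "White"),
                 ("VIN2", "Northeast", "Ford", "F-150", "Red"),
                 ("VIN3", "South", "Acura", "MDX", "Black")])
con.executemany("insert into dealer_sales values (?, ?, ?)",
                [("VIN1", 30000.0, 26000.0),
                 ("VIN2", 29000.0, 26000.0),
                 ("VIN3", 45000.0, 40000.0)])

# Step 1: average net profit per region.
by_region = con.execute("""
    select c.region, avg(d.price - d.cost)
    from dealer_sales d join car_sales c on c.vin = d.vin
    group by c.region order by c.region
""").fetchall()
print(by_region)  # [('Northeast', 3500.0), ('South', 5000.0)]

# Steps 2-4 simply widen the grouping to make, model and color;
# step 5 would add a filter on the delivery_date column.
by_mix = con.execute("""
    select c.region, c.make, c.model, c.color, avg(d.price - d.cost)
    from dealer_sales d join car_sales c on c.vin = d.vin
    group by c.region, c.make, c.model, c.color
""").fetchall()
print(len(by_mix))  # one row per distinct region/make/model/color combination
```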
Now, you can say that after all of this, you have all of the indexes and/or aggregates you might ever need. But that is only true for these queries. What about the local dealership manager who wants to see which salesperson has sold the most cars in the past month, or has the highest average profit margin per sale?
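Those two dealership questions would look something like this (SQLite sketch with invented salespeople and fixed dates so the "past month" cutoff is reproducible; note they touch yet another table, sales_people, that none of the earlier indexes cover):

```python
import sqlite3

# Hypothetical tables; salesperson_id links dealer_sales to sales_people.
con = sqlite3.connect(":memory:")
con.execute("create table sales_people(id integer, name text)")
con.execute("""create table dealer_sales(
                   vin text, salesperson_id integer,
                   price real, cost real, sale_date text)""")
con.executemany("insert into sales_people values (?, ?)",
                [(1, "Alice"), (2, "Bob")])
con.executemany("insert into dealer_sales values (?, ?, ?, ?, ?)",
                [("VIN1", 1, 25000.0, 22000.0, "2012-04-20"),
                 ("VIN2", 1, 30000.0, 24000.0, "2012-04-25"),
                 ("VIN3", 2, 40000.0, 33000.0, "2012-04-28")])

# Who sold the most cars in the past month (cutoff date fixed here)?
top_count = con.execute("""
    select p.name, count(*) as cars
    from dealer_sales d join sales_people p on p.id = d.salesperson_id
    where d.sale_date >= '2012-04-03'
    group by p.name order by cars desc limit 1
""").fetchone()
print(top_count)  # ('Alice', 2)

# Who has the highest average profit margin per sale?
top_margin = con.execute("""
    select p.name, avg(d.price - d.cost) as margin
    from dealer_sales d join sales_people p on p.id = d.salesperson_id
    group by p.name order by margin desc limit 1
""").fetchone()
print(top_margin)  # ('Bob', 7000.0)
```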
This is train of thought analytics, where one answer drives the next question. It is like going for a drive with no destination in mind. You cannot use MapQuest or Google Maps, since you do not know where you are going.
The revolutionary design of the IBM Netezza data warehouse provides exceptional performance without the need for indexes. The IBM Netezza architecture is based on a fundamental computer science principle: when operating on large data sets, do not move data unless absolutely necessary. IBM Netezza data warehouse appliances fully exploit this principle by utilizing commodity components called field programmable gate arrays (FPGAs) to filter out extraneous data as early in the data stream as possible, and as fast as data streams off the disk. This process of data elimination close to the data source removes I/O bottlenecks and frees up downstream components (CPU, memory and network) from processing superfluous data, and is therefore expected to have a significant multiplier effect on system performance.