I start off my Introduction to Netezza
(or Netezza 101) presentations with the question “Would you still use Google if
it took 7 days and 3 people to get your answer?”, and most people scratch their
heads at first, wondering why I ask this. Let me explain.
In my opinion, Google is a lot like
today’s analytic tools/applications. When you use Google, you give it some
search terms, and see what it comes up with. If you are lucky, the information
you want is on the first page, but maybe it is on page two or three. Or maybe
the search results are not what you were looking for, so you change the search
terms completely, or add/remove some terms and search again. Eventually you
find what you are looking for.
Today’s train of thought analytics is
a lot like Google. You never know what someone is going to ask. You don't know
where the result of one query will lead, or what the results will ultimately
be. Unlike the report that gets the total sales for the past month, there is no
“map” to guide train of thought analytics. IBM Netezza provides the unique
ability to move beyond just reporting, and allows users to uncover the answers
they are looking for.
IBM Netezza is the best solution for
train of thought analytics for a number of reasons:
- IBM Netezza does NOT require indexes to perform optimally
- IBM Netezza allows analytics across all of your data, not just a subset (I talked about this in an earlier post[1])
In my opinion, a database that relies
on indexes (or even worse, 19 different types of indexes) and aggregates to
perform well is ill suited for train of thought analytics. Now, these databases
can run without indexes, as long as you take a very small sample of the data to
run against. But then you run into incorrect and/or poor results that will not
help drive more value, and may in fact hurt the bottom line.
Let’s use an example to show why a
system that needs indexes is not suited for train of thought, or any ad-hoc,
analytics. Given the following tables:
create table car_sales(
    delivery_date date,
    make varchar(45),
    model varchar(45),
    model_year char(4),
    color varchar(20),
    VIN varchar(45),
    selling_dealer varchar(45));

create table dealer_sales(
    VIN varchar(45),
    cost decimal(10,2),
    sales_price decimal(10,2),
    sales_person varchar(45));

create table sales_people(
    sales_person varchar(45),
    dealer varchar(45),
    region varchar(10));
If you want the total of all sales,
then you need only access the dealer_sales table. There are two things that
will make this query faster: a materialized view, or an index containing
only the sales_price column of the dealer_sales table. In this case the rows in
the dealer_sales table are 102 bytes wide. If you created an index on the
sales_price column, the index would be roughly 1/8th of the width of the table,
so the query should run roughly 8 times faster.
After seeing this query run a few times, a good DBA will create this
index, but it is too late for the people who already ran their analytic
queries.
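As a sketch, on an index-based database the query and the after-the-fact index might look like this (the index name and syntax are illustrative, not any particular vendor's):

    -- The analytic query: total of all sales
    select sum(sales_price)
    from dealer_sales;

    -- The narrow index a DBA might add after the fact, so the
    -- database can scan it instead of the 102-byte rows
    create index ix_dealer_sales_price
        on dealer_sales (sales_price);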
Now, what if the regional sales manager
wants to track the total sales by region at any given time? To calculate this
you need to join the dealer_sales and car_sales tables (and sales_people, which
is where the region lives), and you would probably want to create a join index
(or an aggregate, depending on which database you are using) to allow that to
be done efficiently. The only column common to dealer_sales and car_sales is
the VIN column, and that is not even one of the columns you are referencing in
the query (sales_price and region, and probably delivery_date as well, so he or
she can look at the current day’s sales, as well as the current week, month,
year to date, etc.).
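As an illustrative sketch, one reasonable reading of this schema is to pick up the region through sales_people; delivery_date is the only date column available, so it stands in for the sale date:

    -- Total sales by region for the current day; widen the date
    -- range for week, month, or year-to-date views
    select sp.region, sum(ds.sales_price) as total_sales
    from dealer_sales ds
        join car_sales cs on cs.VIN = ds.VIN
        join sales_people sp on sp.sales_person = ds.sales_person
    where cs.delivery_date = current_date
    group by sp.region;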
If I were a regional manager, I would
also want to optimize the delivery of new cars to the dealers where they are
most likely to sell at the highest profit margin. Now, there are a number of
ways to do this, and each has an increasing potential to drive more sales if
the current trends continue (a sketch of the underlying query follows the list):
1) Compute the average net profit (sales_price - cost) for every region, and ship more cars to the most profitable regions than to the others.
- This would require at least two join indexes, and likely other secondary indexes on the sales_price and cost columns in dealer_sales.
2) Compute the average net profit for each make (e.g. Ford, Chevy, Dodge, Acura), and ship more cars of that make to its dealers in the regions where it earns the most.
- This would require the same indexes as above, plus an index on the make column in the car_sales table.
3) Compute the average net profit for each make and model, and ship more cars of that model to the dealers in the regions where it earns the most.
- This would require the same indexes as above, plus an index on the make and model columns in the car_sales table.
4) But maybe different regions also have a preference for color, like the northeastern US, where white does not show the dirt from the sand and salt on the roads in the winter. If we combine the color preference with the make and model preference from above, then we can tailor the inventory of each dealer to its customer set.
- The color preference query would require the same indexes as above, plus an index on the color column in the car_sales table.
5) Taking this even further, let’s look at inventory optimization. What if we could eliminate the need to keep cars on the lot for months before they sell? The historical sales might indicate that in Maine there is a three-week period at the beginning of May when sales run at three times the normal rate. I would want to make sure the dealers there have enough cars on the lot (optimized by make, model and color, of course) to meet this demand, but I do not want to carry that much inventory all year long, as it ties up a lot of money.
- This would require more tuning of the data model, with an index on the delivery_date column in the car_sales table.
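Here is a sketch of the query behind options 1 through 4; the grouping columns simply widen at each step (table and column names come from the schema above, everything else is illustrative):

    -- Average net profit by region, make, model and color (option 4);
    -- drop grouping columns to get the coarser options 1 through 3
    select sp.region, cs.make, cs.model, cs.color,
           avg(ds.sales_price - ds.cost) as avg_net_profit
    from dealer_sales ds
        join car_sales cs on cs.VIN = ds.VIN
        join sales_people sp on sp.sales_person = ds.sales_person
    group by sp.region, cs.make, cs.model, cs.color
    order by avg_net_profit desc;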
Now, you can say that after all of
this you have all of the indexes and/or aggregates you might ever need. But that
is only true for these queries. What about the local dealership manager who
wants to see which salesperson has sold the most cars in the past month, or has
sold the cars with the highest average profit margin?
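A sketch of those two questions in SQL (the dealer name is hypothetical, and delivery_date again stands in for the sale date):

    -- Salespeople at one dealership over the past month, ranked
    -- by cars sold; order by avg_margin for the second question
    select ds.sales_person,
           count(*) as cars_sold,
           avg(ds.sales_price - ds.cost) as avg_margin
    from dealer_sales ds
        join car_sales cs on cs.VIN = ds.VIN
        join sales_people sp on sp.sales_person = ds.sales_person
    where sp.dealer = 'Main Street Motors'   -- hypothetical dealer
      and cs.delivery_date >= current_date - 30
    group by ds.sales_person
    order by cars_sold desc;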
This is train of thought analytics,
where one answer drives the next question. It is like going for a drive with no
destination in mind. You cannot use MapQuest or Google Maps, since you do not
know where you are going.
The revolutionary design
of the IBM Netezza data warehouse provides exceptional performance without the
need for indexes. The IBM Netezza architecture is based on a fundamental
computer science principle: when operating on large data sets, do not move data
unless absolutely necessary. IBM Netezza data warehouse appliances fully
exploit this principle by using commodity components called field
programmable gate arrays (FPGAs) to filter out extraneous data as early in the
data stream as possible, as fast as the data streams off the disk. This
elimination of data close to the source removes I/O bottlenecks and frees
downstream components (CPU, memory and network) from processing superfluous
data, which has a significant multiplier effect on system performance.