Dwaine Snow's Thoughts on Databases and Data Management

Thursday, June 28, 2012

FUD Competitors are Spreading on Netezza

Recently I was made aware of some FUD (fear, uncertainty, and doubt) that a competitor has been sending to our current and prospective clients. This FUD contained a number of gross inaccuracies, as well as some “points” that really made me scratch my head and wonder how much this competitor really understands data warehousing and analytics.

This competitor claimed that Netezza scanned the entire table for all SQL / Analytic operations.

This claim is absolutely not true. While Netezza does not have indexes that the DBA must create and maintain, it does automatically build and maintain zone maps for all tables in the database. These zone maps contain the minimum and maximum value for all columns in every extent within each table. So, before the query starts reading from disk, it looks at the predicates in the query, and compares them to the zone maps to determine which table extents can be skipped and which need to be read.

For example, if you want to calculate the total sales of Red Ford Mustangs in June 2011, Netezza can skip any table extent that does not have data for June 2011. So, for a database with 7 years of sales history, it can skip any extent that has a maximum that is less, or a minimum that is greater than, June 2011. This eliminates 98% or more of the I/O required.

Our competitor claims that their range partitioning would automatically eliminate processing on all of the months other than June 2011 and is a better solution. While their range partitioning will eliminate the I/O like Netezza, there is a whole bunch of effort that partitioning brings to their solution that they do not talk about. In their solution you create a table, and then you create "inheritance children", one per partition. So for a table with 7 years of data, that is 84 monthly partitions, and 84 tables (the base table plus the 83 inheritance children). That might not seem too bad, but there's more. If you have a primary key, foreign key, index, constraint, or permission on the table, you need to apply it to the table and each of its inheritance children, it is not a global operation. So, for these 84 tables with 3 user groups with different permissions, a primary key, a constraint, and one non-unique index, that would be 84 * (3 + 1 + 1 + 1) or 504 DDL statements to set this up, and to maintain over time.

And on top of that, their bulk loader is not partition aware, so you need to write a rule for each table/inheritance child, adding 84 more DDL statements to the list.

In Netezza you write one statement to create the table, vs. 588 DDL statements for the exact same table and data in this competitor.

I’ll respond to some more of the claims this competitor has been making over my next few posts.

Sunday, May 20, 2012

Argumentative Colleagues and IBM Netezza reducing runtime from 7 hours to 8 seconds....

I had an “interesting discussion” with someone during my IBM Netezza Futures presentation last week at the IDUG (International DB2 User’s Group) conference. At the end of the presentation, one of my IBM colleagues was trying to argue that Netezza’s streaming architecture must not be able to handle multi-table joins because “it does all of its work in the FPGA”. I tried to explain that he was incorrect, and the way that Netezza does joins so efficiently, but my colleague didn’t seem to listen, and asked the same thing again.

Although some people reading my post might think that I am writing this to try to explain to my IBM colleagues how they should act in customer events, I am not.

The best part of this discussion was when my colleague asked the same question a second time, one of our customers stood up and said “What Dwaine said about the performance of IBM Netezza is exactly right”. The customer then went on to say how impressed his company is with the proof of concept (PoC) they just completed with the IBM DB2 Analytics Accelerator. I will explain the IBM DB2 Analytics Accelerator soon, but what the customer went on to say is that they have queries that were taking 7 hours on their DB2 for z/OS system (and using MIPS all that time) that are now running on the IBM DB2 Analytics Accelerator in 8 seconds and using basically no MIPS. I could not have paid for a better ending to my session.

The IBM DB2 Analytics Accelerator is an IBM Netezza 1000 system piggy backing along side a DB2 for z/OS system. The DB2 for z/OS optimizer will offload large, complex queries from the DB2 system onto the IBM DB2 Analytics Accelerator, much like a materialized view or materialize query table, but keep the short transactional queries running on DB2 for z/OS.

So with absolutely no change to their application, data loading procedures, or any extra tuning, this customer was able to run their analytics over 3,000 times faster. That’s the power of the IBM Netezza appliance and the IBM DB2 Analytics Accelerator.

And, it would still be nice if my IBM colleagues learned how to act at customer events.

Tuesday, May 15, 2012

Increasing Profits with the "Best Offer"

My colleague James Kobielus has been talking about “next best” action recently[1][2], and I also talked about this in my post on the importance of agility for analytic applications[3]. I have been talking to a number of customers recently who want to take this beyond next best, to the best offer.

In my blog I talked about a person who had made a number of purchases, and the retailer was trying to influence the customers next purchase by providing them with a coupon for an item that they might be interested in based on their purchasing history. But, how much more could a company make if they could influence their customers’ purchases before they are made, pushing them to more profitable items during the ordering process.

Think of a distributor of auto parts as an example. As the body shops and garages are placing orders, wither online or over the phone, as they enter an item’s description or SKU, what if you could offer a different part, that has the same specifications and same look and feel, but it costs less, and has a higher profit margin for you. Well, that would be a win-win situation, the garage gets a less expensive part that they can sell to the customer for less (like this would really happen, right?) but the distributor also makes a higher profit.

This take more than just a data warehouse / analytic application. And this type of example is where IBM is uniquely positioned to help you out. You cannot build this “system” with just a data warehouse.

You will need an analytics warehouse that can determine the most profitable items. You may need IBM InfoSphere BigInsights to examine your parts catalogs to find which parts have the same specifications and can be substituted for each other. You will probably need a rules storage mechanism that will record the compatible items from BigInsights and also from your internal sales team who have the deep knowledge of the items they sell. Then you need to have both your online ordering system and your internal order entry system must be intelligent enough to pop-up an offer whenever it sees one of the item numbers or SKUs that have been identified as "replaceable".

While you might be able to call a customer back if you can save them thousands of dollars or more, normally these offers need to be made as the order is being entered/taken, not afterwards.

Bring the power if the IBM Big Data family and our other Information Management and WebSphere solutions to your business, and drive more profit.

[1] http://thinking.netezza.com/blog/next-best-action-healthcare-save-improve-lives-through-applied-analytics

[2] http://thinking.netezza.com/blog/next-best-action-smarter-planet-think-global-optimize-local

[3] http://dsnowondb2.blogspot.com/2012/05/importance-of-agility-for-analytic.html

Monday, May 14, 2012

Thoughts on "in memory" data warehousing / analytics

“In Memory” seems to be the latest buzz term in the database and analytics arena. If in memory was able to live up to the hype it would be a great thing, but there is still a lot of hype involved.

In memory database vendors claim that “Reporting requirements have a large impact on I/O time and network resources when dealing with large volumes of data. In order to minimize this, a company needs optimized file and database access methods. In-memory overcomes this process by shifting all the data and workload into memory.” [1]

But this is not the only way to eliminate the I/O bottleneck that can plague traditional databases when they are used for data warehousing and analytics. IBM Netezza solves the I/O bottleneck by pushing the work down to the I/O channel, where it acts on the data as it comes off disk, and only sends the data that matches the query across the server inter-connect.

Because IBM Netezza operates on the data as it streams off of the disks, it is not plagued data not fitting in memory. In an in-memory database, the entire database must fit in memory, and any temporary tables, scratch tables, overflowed sorts, etc. must also fit entirely within the memory the system has been configured to use. As data volume grow, and the “warehouse” becomes more and more visible, there will be more users, and as the number of users increases, the chance that 2, 3, or even 20 or more people will be running queries at the same time, and may be sorting or ordering data using temporary tables. Or, a new user may point and click their way around Business Objects and build a huge query that joins 5 or 10 tables. All of these will cause large temporary tables to be created, all at once, potentially filling up the memory on the system.

So, what happens if an in memory databases uses up all of the system memory (RAM) on a server? Well according to an SAP employee, the system will crash.[2] Does that seem like a good solution to you?

Because of issues like this, most in memory databases are used to run reports and/or analytics on a sample of the data, to restrict the data size to something manageable and affordable. And, as I explained a couple weeks ago, using data sampling for analytics is a bad idea, that will severely limit, if not completely eliminate the value you can derive from your data. Data that you store (some where else, since it cannot fit in memory) and do not analyze is a liability, only when you analyze your data does it become an asset. IBM Netezza ensures that all of your data is a business asset that you can derive the most value from.

There are a number of other points that I want to make on “in memory” databases, but I’ll talk about them in my upcoming posts.

[1] In Memory Analytuics, Retrieved 5/14/2012 from http://www.yellowfinbi.com/Document.i4?DocumentId=104879

[2] SAP Community Network, retrieved 5/14/2012 from http://forums.sdn.sap.com/thread.jspa?threadID=2125871

Wednesday, May 09, 2012

Come see me at IDUG next week in Denver

I will be presenting IBM Netezza futures next week at the IDUG North America conference. Drop by and say Hi.... Visit the IDUG Site for more info.

Tuesday, May 08, 2012

New Paper - Teradata and IBM Netezza Compared

Check out this paper comparing Teradata and IBM Netezza.

Click here to download the paper

Friday, May 04, 2012

It's Time to End Node Rage

Check out this paper and the videos to see why IBM Netezza appliance help "End Node Rage"

End Node Rage

Thursday, May 03, 2012

Performance in an Ad-Hoc Environment

I start off my Introduction to Netezza (or Netezza 101) presentations with the question “Would you still use Google if it took 7 days and 3 people o get your answer”, and most people scratch their head at first wondering why I ask this. Let me explain.

In my opinion, Google is a lot like today’s analytic tools/applications. When you use Google, you give it some search terms, and see what it comes up with. If you are lucky, the information you want is on the first page, but maybe it is on page two or three. Or maybe the search results are not what you were looking for, so you change the search terms completely, or add/remove some terms and search again. Eventually you find what you are looking for.

Today’s train of thought analytics is a lot like Google. You never know what someone is going to ask. You don't know where the result of one query will lead, or what the results will ultimately be. Unlike the report that gets the total sales for the past month, there is no “map” to guide train of thought analytics. IBM Netezza provides the unique ability to move beyond just reporting, and allows users to uncover the answers they are looking for.

IBM Netezza is the best solution for train of thought analytics for a number of reasons:

IBM Netezza does NOT require indexes to perform optimally
IBM Netezza allows analytics across all of your data, not just a subset (I talked about this in an earlier post[1])

In my opinion a database that relies on indexes (or even worse, 19 different types of indexes) and aggregates to perform well is ill suited for train of though analytics. Now, these databases can run without indexes, as long as you take a very small sample of the data to run against… But then you run into incorrect and/or poor results that will not help drive more value, and may in fact hurt the bottom line.

Let’s use an example to show why a system that needs indexes is not suited for train of thought, or any ad-hoc, analytics. Given the following tables:

create table car_sales(

delivery_date date,

make varchar(45),

model varchar(45),

model_year char(4),

color varchar(20),

VIN varchar(45),

selling_dealer varchar(45))

create table dealer_sales(

VIN varchar(45),

cost decimal (10,2)

sales_price decimal(10,2),

sales_person varchar(45))

create table sales_people(

sales_person varchar(45),

dealer varchar(45)

region varchar(10))

If you want the total of all sales then you need only access the dealer_sales table. There are two things that will make this query faster, a materialized view or an index that just contains only the price column in the dealer_sales table. In this case the rows in table dealer_sales are 102 bytes wide. If you created an index on the price column, the index would be roughly 1/8^th of the width of the table, so the query should run roughly 8 times faster. After seeing this query run a few times, a good DBA will create this index, but it is too late for the people who already ran their analytic queries.

Now, what if the regional sales manager wants to track the total sales by region at any given time. In order to calculate this you need to join the dealer_sales and car_sales tables, and you probably would want to create a join index (or an aggregate depending on which database you are using) to allow that to be done efficiently. The only column in these tables that is common across both tables is the VIN column, and that is not one of the columns that you are even referencing in the query (price and region, and probably date as well so he or she can look at the current day’s sales, as well as the current week, month, year to date, etc.).

If I were a regional manager I would also want to do some optimization of the delivery of new cars to the dealers where they are most likely to sell at the highest profit margin. Now, there are a number of ways to do this, and each has an increasing potential to drive more sales, if the current trends continue:

1) Compute the average net profit (sales_price – cost) for every region and ship more cars to that region than to other regions

· This would require at least two join indexes, and likely other secondary indexes on the sales_price and cost columns in dealer_sales.

2) Compute the average net profit for each make (i.e. Ford, Chevy, Dodge, Acura, etc.), and ship more of that make of car to the dealers for that make in that region than to other regions

· This would require the same indexes as above, but also an index on the make column in the car_sales table.

3) Compute the average net profit for each make and model, and ship more of that model of car to the dealers for that make in that region than to other regions

· This would require the same indexes as above, but also an index on the make and model columns in the car_sales table.

4) But, maybe different regions also have a preference for color, like in the north eastern US where white does not show the dirt from the sand and salt on the roads in the winter. If we combine the color preference with the make and model preference from above, then we can tailor the inventory of each dealer to their customer set.

· The color preference query would require the same indexes as above, but also an index on the color column in the car_sales table.

5) Taking this even further, let’s look at inventory optimization. What if we can eliminate the need to keep cars on the lot for months before they sell. The historical sales might indicate that in Maine there is a three week period at the beginning of May when their sales are three times their normal rate for a two week period. Well, then I would want to make sure I have enough cars on the lot there (optimized by make, model and color of course) to meet this need, but I do not want to keep this much inventory all year long as it ties up a lot of money.

· This would require more tuning of the data model with an index on the delivery_date column in the car_sales table.

Now, you can say that after all of this, you have all of the indexes and/or aggregates you might ever need. But that is only true for these queries. What about the local dealership manager who wants to see which salesman has sold the most cars in the past month, or sold the car with the highest average profit margin.

This is train of thought analytics, where one answer drives the next question. It is like going for a drive with no destination in mind. You cannot use MapQuest or Google Maps, since you do not know where you are going.

The revolutionary design of the IBM Netezza data warehouse provides exceptional performance without the need for indexes. The IBM Netezza architecture is based on a fundamental computer science principle: when operating on large data sets, do not move data unless absolutely necessary. IBM Netezza data warehouse appliances fully exploit this principle by utilizing commodity components called field programmable gate arrays (FPGAs) to filter out extraneous data as early in the data stream as possible, and as fast as data streams off the disk. This process of data elimination close to the data source removes I/O bottlenecks and frees up downstream components (CPU, memory and network) from processing superfluous data, and is therefore expected to have a significant multiplier effect on system performance.

[1] http://dsnowondb2.blogspot.com/2012/05/importance-of-agility-for-analytic.html

Wednesday, May 02, 2012

IBM Analytics help SUNY MS Research Team Improve Patient Care

http://thinking.netezza.com/press_release/leading-global-multiple-sclerosis-research-center-taps-ibm-analytics-improve-patient

IBM Netezza Delivers the Best TCO for Analytics Applications

More that 150 customers have compared IBM Netezza to Teradata, as well as other competitors, and they chose Netezza for a number of reasons. But the one underlying factor that most of these customers have told us, is how fast the IBM Netezza system paid for itself. This payback or return on investment (ROI) can come form many sources, including:

· A low up front cost

· Drastically reduced administration on the IBM Netezza system

· The ability to run analytics on all of your data (not a sample)

· The ability to run queries/reports that never finished on your existing system

While many people focus on the upfront cost of an item, in my opinion it is far more important to examine the total cost of ownership for that item. Think of it in the context of buying a new car. If Car-T costs $22,000 and Car-N costs $26,000 then that is a difference of $4,000. But, there is more to the costs of owning a car than just the purchase price. You also have insurance (think of that like upgrade protection/yearly maintenance when comparing it to a data warehouse system), and maintenance (the cost of the DBAs to keep the system running), as well as the cost of gas (the cost of electricity, cooling, etc.). If, for example, Car-T gets 12 mile per gallon (mpg) and Car-N gets 28 mpg, and you usually drive 15,000 mile per year. Then at a cost of $3.95 per gallon, it would cost you $4937 per year for gas for Car-T and only $2116 per year for Car-N. So, in this case Car-N would pay for the price difference in under 2 years, and every year after that you would reap the benefits of that choice.

Now, let’s make this car example match more closely to what data warehouse customers have to consider. In this case Car-T would be a 2-seater, cost $25,000 to buy, get 10 mpg, require premium gas, and have a top speed of 85 miles per hour (mph). Car-N would cost $15,000, have room for a family of 5, get 30 mpg using regular gas, and have a top speed of 300 mph.

If you can get the answers you want 3 times or more faster, at about half the up front cost and at one third of the ongoing costs, why would you choose anything else? Customers like MediaMath examined technologies form a number of different vendors, and chose Netezza because “Netezza’s appliance approach would be less costly than other options.”[1] And analysts (the consumer reports of the IT industry) agree. In their report, The Total Economic Impact Of IBM’s Netezza Data Warehouse Appliance With Advanced Analytics[2], Forrester examined a Netezza customer’s use of the IBM Netezza appliance, and based on their calculations the system paid for itself in under a year.

With the cost of DBAs (and gas) escalating, don't keep feeding your gas guzzler, take a test drive of a new IBM Netezza appliance and see how fast it really is, and how quickly it will pay for itself.

[1] http://www.google.com/url?sa=t&rct=j&q=nucleus%20research%20mediamath&source=web&cd=1&ved=0CDEQFjAA&url=http%3A%2F%2Fthinking.netezza.com%2Fsites%2Fdefault%2Ffiles%2Fdocument%2FNucleus_Research_ROI.pdf&ei=USyET4DgLIbV0QGO8sTnBw&usg=AFQjCNGqcbnE5tJxTsiwMB71uTMc5rCXOw&cad=rja

[2] http://thinking.netezza.com/pdf/total-economic-impact-ibms-netezza-data-warehouse-appliance-advanced-analytics-forrester-total

Tuesday, May 01, 2012

The Importance of Agility for Analytic Applications

In the context of data warehousing, agility means that the system can quickly and easily adapt to and accommodate; changes in data volumes, new data sources, new subject areas, new applications and/or new users. In order for a data warehouse to be able to do this, it needs to be able to run any query against any data model/schema. It must also be able to process all of the data, no mater what the query or what table(s) are being accessed, and do so quickly – without impacting the other users of the system.

You can accomplish this is a number of ways, including:

· Over-building the original data warehouse to be able to handle some incremental growth

· Restricting access to data or data sampling

· Adding more resources for the new users, data and applications as they come on board.

But normally there are a number of issues with these approaches,

· Clients do not want to spend 2-3 times as much as they need to up front so they can accommodate some future growth that may or may not occur

· Data sampling means that there is a good chance that the important data may be missed

· Adding more resources to an exiting Teradata system can be a long, arduous, and costly process[1]

In our opinion, it is far more effective, from a cost and effort, as well as overall performance of the system perspective, to augment the Teradata system with IBM Netezza data warehouse appliances where you can run the new applications without impacting the current users at all. Rather than wait for weeks for the new system to arrive, the data model to be tweaked for the new application, the data to be moved, and the database to be tuned, why not roll in an IBM Netezza appliance, copy the data model (schema) as is, load the data, and be up and running in hours?

As my colleague Nancy Kopp-Hensley discussed in her article "Consolidate Smarter with the Data Warehouse Ecosystem", we had a client that became challenged with query performance with their applications, and yet they were anxious to roll out some new applications in their sales and marketing divisions. Over time, they became challenged with query performance on their applications, and yet they were anxious to roll out some new applications in sales and marketing. And to top it off the business needed these new applications on-line right away. Rather than frustrate the business with a long timeline, which would have included first tuning the EDW to fix the existing problems before even starting the expansion, they chose to offload the new applications to a Netezza appliance. The result? Queries ran 24 times faster and they were able to achieve a much lower total cost of ownership (TCO).

You could also move your deep analytic applications to the IBM Netezza platform, and run against the entire data set, not just the last week’s data, or a sample of the data from the last year like in the EDW. This will provide more accurate results and predictions that will help drive more value to the organization. Consider an example, you are trying to predict what a shopper can be influenced to buy, given a coupon. Let’s say that the shopper has bought the following items in the past month:

1. Topographical Map of Alaska

2. The book “Hiking Alaska”

3. Tent

4. Back pack

5. Sleeping bag

6. Compass

7. Portable GPS

In their current shopping expedition they are buying a pair of hiking boots. Looking at the list of what they are buying, we might hazard a guess that they are looking to start hiking, but we do not know where, or know what else they might need. So, let’s sample their historical purchases, and see what we can come up with. Even with a 20% sample (which is much larger than normal) we might retrieve the tent and compass. We still do not know where they are going, so we might offer them a coupon for a sleeping bag. But we see that they already have one.

If the sample had included the book and the backpack instead, we now have an idea they might be going to Alaska, so maybe we should offer them a portable GPS for 20% off. This could be bad in a couple of ways… If the offer is for the same GPS they bought, they are likely to return the one they have and re-buy it, which just cut into the profit. If the offer is for a newer, better GPS and a price close to the price of what they just paid, then they may return the old one, or if they bought it just outside of the 30 day return window, you are likely to have an unhappy customer on your hands. This example shows why it is important to have fast analytics on all of your data, not just a “representative sample”, and this is what you can get by augmenting your EDW with an IBM Netezza data warehouse appliance.

[1] Teradata Customer Story - Overstock.com retrieved 05/17/2011 from http://tinyurl.com/6mdktgm -Since removed from the site