Tuesday, September 25, 2012

Sorry for the lack of posts

I was asked to write a series of posts for the IBM Expert Integrated System blog area.  Check out my first post here 

Monday, August 06, 2012

The Downside of Down Casing

ANSI standard naming in databases is to Upper Case the names of tables and columns in the database. So, for the table users (see the statement below), the table name and the two column names should be stored in the database as USERS, USERID, and NAME.

create table users (UserID int, Name Char(60))

Now, to make things “easy”, ANSI standard databases also Upper Case references to tables and columns automatically. So the statement select userid from users would be automatically converted to select USERID from USERS as it is optimized, so that it will not fail.

In DB2 and Netezza if you run select userid from users or select USERID from users, or select UserID from Users , you get all users in the table. No matter what mix of case you use for the column named UserID, you get the same results, unless you enclose the table name or column name in quotes. If the name is enclosed in quotes, then the case is preserved, and must match exactly.

So, for the table users2 created like

create table “USERS2” (“USERID” int, “NAME” char(60))

Could be accessed in DB2 and Netezza using any of the following SQL statement, because of the way DB2 adheres to ANSI standards and Upper Cases the names.

         select USERID from USERS2
         select userid from users2
         select “USERID” from “USERS2”
         select “USERID” from users2

For databases that down case the table and column names, 3 of the above 4 statements would fail on the USERS2 table, and only the statement select “USERID” from “USERS2” would run.

Isn’t the way that DB2 and Netezza work a lot more intuitive, and a lot easier? And since you do not need to worry about the way that the SQL was written in your existing application, this is a lot less work to make your existing applications and BI report run.

Why cause more work for yourself?

Tuesday, July 17, 2012

Nucleus research reports a 241% ROI using Big Data to enable larger, more complex analytics

Monday, July 16, 2012

Adding a 4th V to BIG Data - Veracity

I talked a week or so ago about IBM’s 3 V’s of Big Data. Maybe it is time to add a 4th V, for Veracity.

Veracity deals with uncertain or imprecise data. In traditional data warehouses there was always the assumption that the data is certain, clean, and precise. That is why so much time was spent on ETL/ELT, Master Data Management, Data Lineage, Identity Insight/Assertion, etc.

However, when we start talking about social media data like Tweets, Facebook posts, etc. how much faith can or should we put in the data. Sure, this data can be used as a count toward your sentiment, but you would not count it toward your total sales and report on that.

Two of the now 4 V’s of Big Data are actually working against the Veracity of the data. Both Variety and Velocity hinder the ability to cleanse the data before analyzing it and making decisions.

Due to the sheer velocity of some data (like stock trades, or machine/sensor generated events), you cannot spend the time to “cleanse” it and get rid of the uncertainty, so you must process it as is - understanding the uncertainty in the data. And as you bring multi-structured data together, determining the origin of the data, and fields that correlate becomes nearly impossible.

When we talk Big Data, I think we need to define trusted data differently than we have in the past. I believe that the definition of trusted data depends on the way you are using the data and applying it to your business. The “trust” you have in the data will also influence the value of the data, and the impact of the decisions you make based on that data.

Friday, July 13, 2012

A discussion of data models for analytics

Some of our competitors recommend that you use 3rd Normal Form (3NF) for their data structures, as they believe that is the optimal architecture for the ad hoc queries that form the basis for decision support and analytical processing of today. While 3NF can save storage space, it makes queries harder to write, and slower to execute. A big down side of 3NF for data warehousing is that it causes the database to join tables for most queries. “Joins” can be a performance pitfall because they force large volumes of data to be moved around the system. To speed up these queries, DBAs using these other databases create and maintain aggregates and/or indexes across tables. In fact, some tables can have 3, 4, 5 or even more aggregates/ indexes if they are joined to tables using different columns. It is important to realize that these aggregates/indexes require knowledge of the queries, reports and analytics that are going to be run within the system, now and in the future.
Think of it this way—if you are heading out for a Sunday drive, and have no destination in mind, how can you use a navigation system to give you directions?

Another issue that many of our competitors’ customers report is that they need to create a duplicate copy of their data in a dimensional model in order to meet their desired response times. In many cases this dimensional model will contain many star schemas, as well as a number of copies of the fact tables containing different levels of data. The issue with this approach is that the application developer and the business users must be fully aware of what data these new tables and aggregates really contain. If they aren’t aware, they can inadvertently make a business decision based on an old snapshot of the data, or a small, skewed sample of the data. In addition, if a user mistakenly picks the wrong “table,” that query might take 10 or even 100 times longer to run that it would if they completely understood the model.

Because IBM Netezza appliances do not need indexes or pre-computed aggregates to perform well, there is no need to tune the data model for IBM Netezza appliances. In fact, many customers simply copy the data model directly from their old data mart/warehouse or from their transactional system into the IBM Netezza appliance and leave it as-is. They then simply load a single copy of the data and start running their queries.

Wednesday, July 11, 2012

Why Analytics is like Wine Tasting

I mentioned yesterday that I was in the Finger Lakes last week. Well, we stopped by some wineries while we were there and it is always interesting to read the tasting notes on the back of the bottle, or on the tasting sheet that the winery gives to you.

I was thinking last night about this and how great it would be if all of the data we, as an organization, create or consume came with its own “tasting notes”.  Just imagine a new set of transactions arriving from your OLTP systems, and they come with a tag that says something like “This data shows a correlation between the purchase of steaks and seasoning salt.” Wouldn’t that make the job of the data scientist / data analyst so much easier?

We also went to one winery in particular, and noted that the wine maker did not have any tasting notes, and only described his wine as “Like your favorite pair of slippers” or something like that. After talking about this for a while, we found that we actually liked this approach better. Rather than tasting and searching for what the wine maker told us he or she tasted, we were able to develop our own impression, and detect tastes on our own. Without being directed to a particular smell or taste, we used our own nose and palette to decide what we tasted, and what we liked. In the end we bought more bottles from this winery than we did from any of the other wineries that we visited.

You might be scratching your head and wonder where I am going here, so let me explain. I believe that analytics needs to be more like the second case above. You should not start with any preconceived notions on your data based on what others tell you. Analyze the data, detect correlations/patterns/trends on your own, and then check the tasting notes if you want to.  

 The goal of analytics should be to find NEW information that you can act upon, not simply find the same thing that someone else already found.

Tuesday, July 10, 2012

IBM's Big Data Platform - Saving One Life at a Time

Having been working on parts of IBM’s Big Data platform for the past year or more, I am continually impressed with the value that IBM brings to our clients.

When we talk Big Data at IBM, we talk about the three V’s: Variety, Velocity, and Volume.

Volume is pretty simple. We all understand that were going from the terabytes to petabytes and into a zettabytes world. I think most of us understand today just how much data is out there now and whats coming over the next few years.

The variety aspect is something kind of new to us in the data warehousing world, and it means that analytics are no longer just be for structured data, and on top of that, analytics on structured data doesn’t have to be in a traditional database any longer. The Big Data era is characterized by the absolute need and desire to explore and analyze all of the data that organization produce. Because most of the data we produce today is unstructured, we need to fold in unstructured data analytics as well as structured.

If you look at a Facebook post or a Tweet, they may come in a structured format (JSON), but the true value, and the part we need to analyze, is in the unstructured part. And that unstructured part is the text of your tweet or your Facebook status/post.

Finally, theres velocity. We at IBM consider velocity as being how fast the data arrives at the enterprise, and of course, its going to lead to the question, and how long does it take you to analyze it and act on it?

It is important to keep in mind that a Big Data problem could involve only one of these characteristics, or all of them. And in fact, most of our clients see that a closed loop mechanism, normally involving more than one of our Big Data solutions, is the best way to tackle their problem.

The neonatal ward at a well known Hospital is a prime example of this. Hospital equipment issues an alert when a vital sign goes out of range – prompting the hospital staff to take action immediate.  However many life threatening conditions take hours or days to reach critical levels, delaying possible life saving treatments. Often signs that something is wrong begin to appear long before the situation becomes serious enough to trigger an alert, and even a skilled nurse or physician might not be able to spot and interpret these trends in time to avoid serious complications. Complicating this is the fact that many of these warning indicators are hard to detect and it’s next to impossible to understand their interaction and implications until a threshold has been breached.

For example, nosocomial infection, a life threatening illness contracted in hospitals. Research has shown that signs of this infection can appear 12-24 hours before overt trouble/distress is spotted and normal ranges exceeded. Making things more complex, in a baby where this infection has set in, heart rates stay completely normal (i.e. it doesn’t rise and fall throughout the day like it does for a healthy baby). In addition, the pulse also stays within acceptable limits. The information needed to detect the information is present, it is very subtle and hard to detect. In a neonatal ward, the ability to absorb and reflect upon all of the data being presented is beyond human capacity, there is just too much data.

By analyzing historical data, and developing correlations and understanding of the indicators of this and other heath conditions, the Doctors and researchers were able to develop a set of rules (or set of conditions) that indicate a patient is suffering from a specific malady, like nosocomial infection. The monitors (which can produce 1,000+ reading per second) feed their reading into IBM’s InfoShpere Streams where it is checked on the fly. The data is checked against healthy ranges, and also against other values for the past 72 hours, and if there are any rules that are breached, an alert is generated. For example, if a child’s heart rate has not changed for the past 4 hours and their temperature is above 99 degrees, then that is a good indicator that they may be suffering from nosocomial infection.

And as the researchers continue to study more and more historical data in their data warehouse and Hadoop clusters, when they detect more correlations, they can dynamically update the rules that are being checked on the real time streaming data.

Monday, July 09, 2012

Indexes do NOT make a warehouse agile

I took a few days off to visit the Finger Lakes in New York (you really should go there if you like hiking and/or wine) and came back to an overflowing email inbox. Some of these emails were from clients that have been receiving more correspondence from our competitors with claims of their superiority over IBM Netezza.

One of these claims was that their solution is more nimble and able to handle broader workloads because they have indexes and aggregates. This one made me think for a few minutes, but I still feel that the Netezza approach where there is no need for indexes is still a far better solution.

While an index or aggregate can be used to improve/optimize performance for one or more queries, the upfront table, aggregate, and index design/build phase will cause other systems to take MANY TIMES longer to get up and running efficiently than IBM Netezza appliances. In fact, some of our competitors’ customers have openly talked about months long implementation cycles, while IBM Netezza customers talk about being up and running in 24 hours…
Instead of days or weeks of planning their data models, IBM Netezza appliance customers simply copy their table definitions from their existing tables, load the data, and start running their queries/reports/analytics. There is absolutely no need to create any indexes (and then have to choose between up to 19 different types of indexes) or aggregates. Where other data warehouse solutions require weeks of planning, IBM Netezza appliances are designed to deliver results almost immediately.
Today’s data warehouse technologies made it possible to collect and merge very large amounts of data. Systems that require indexes are fine for creating historical reports because you simply run the same report over and over again. But today’s business users need answers promptly. The answer to one question will determine the next question that they are going to ask, and that answer the next, and so on. This thought process is known as “train of thought” analysis, and this can lead to competitive advantages required in the economy of today and tomorrow.
Outside of IBM Netezza data warehouse appliances, standard operating procedure is for users to extract a small sample of data, move it out of the data warehouse to another server, and then run the analytics against that sample.  This is required because the systems cannot support these ad-hoc analytic reports without completely exhausting the system resources, and impacting all other users. Even if these other systems had the right indexes all of the time, they would still need to move massive amounts of data into memory before processing it. This has led to users with other data warehouse solutions sampling their data and copying a small sample to a dedicated analytics server.
The small sample size allows the analysis to complete in a reasonable amount of time, but by restricting this analysis to a small subset of the data, it becomes harder to spot (and act on) the trends found within it.  We discussed this above with the baseball example, but it applies to everyone. Credit cards are another example, with credit card numbers and identities being stolen every day, the card companies need to detect these misuses immediately in order to limit their liability and prevent loss. While there are quick “indicators” of fraud, like a new credit card being used at a pay phone at an airport for the first time, most indicators come from correlating more than one transaction. For example, a card cannot be used in Kansas and then in Orlando 7 minutes later, unless one of the transactions is a web transaction or the card number was manually entered because it was taken over the phone. So, if a card was physically swiped in Kansas and then in Orlando less than a few hours later, the account must have been compromised. Now, if the fraud detection application only looked at every 10, or 100, transactions, they would miss at least one, if not both, of these transactions most of the time.

Tuesday, July 03, 2012

Why should compression only work for read-only data?

A number of our competitors make bold compression claims, like “we provide 4X, or even 10X+ compression”. What they do not tell you is that you have to choose between multiple different compression levels, and/or compression libraries/types. They also do not mention that you cannot use their compression on tables that have inserts, updates, and/or deletes occurring. Nor do they mention the overhead that you can expect to see if you turn on their compression.

Let’s look at these three points in a little more detail.

In today’s world of reduced budgets, one of the easiest ways to save money with a data warehouse is to reduce the amount of disk space required to store the data. To this end, nearly all data warehouse solutions offer some form of database compression. In order to use compression with many of our competitors the table must be an Append-only table. 

Append-only tables have the following limitations:
            •Cannot UPDATE rows in an append-only table
•Cannot DELETE rows from an append-only table
•Cannot ALTER TABLE...ADD COLUMN to an append-only table
•Cannot add indexes to an append-only table.

These limitations are because of the way these vendors have implemented their ‘database compression’. While Netezza has built in algorithms specifically designed for database usage, others uses a compression library that “compresses” the data file as it is being written to disk, and then uses the same library to un-compress the file as it reads it from disk. For anyone that has used a tool like gzip, WinZip, pkzip, WinRar, etc. you all know how slow these tools are, and how much CPU cycles they use. This is the same consideration and overhead you will have with these other vendors if you use their compression. In fact this overhead can be so bad that some customers who have presented at our competitors’ conferences have talked about tests where a single query running on a table with compression used up over 95% of the CPU, while the same query against the same data in a table that was not compressed used less than 5% of the CPU.

On top of the performance impact, there is also the DBA overhead. With one competitor the DBA has to choose between 3 types of compression (i.e. compression libraries) and 9 different levels of compression, each of which work better for different data types. That is 27 different combinations/permutations that the DBA has to choose between, for each table.

With IBM Netezza compression is always on, there is no “better algorithm” for different tables, and because of the way the Netezza architecture works, when you get 4X compression with your data in Netezza, you see an associated 4X improvement in performance, for all types of workloads, not just reads. 

Monday, July 02, 2012

Addressing more crazy competitor claims

Among some of the other “claims” that our competitor made about Netezza is that is can only load at the rate of 2TB/hr. First off, this is false. The current generation of the Netezza platform can load at over 5TB/hr. But, the real question I ask is "Does this really matter after you have your system up and running?”

After the initial loading of the database from the existing system(s), very few companies load more than a couple hundred GB to a couple TB per day, and most do not even approach these daily or even monthly load volumes. Even Netezza’s biggest customers who have PetaByte sized systems do not find Netezza’s load speed to be an issue.

Now let’s look at the claims this competitor is making in more detail, and peel back the layers of the onion.  This competitor claims that they can load at double the Netezza’s load speed of 5TB/hr. But they leave out a number of important factors when they make this claim.

What about compression?
Netezza can load at a rate of 5TB/hr and compress the data at the same time. This competitor can only load at their claimed compression rate if compression is not used. So, if you want to compress the data, how fast can you really load on their platform?  They use a library based compression algorithm that uses CPU cycles to basically “zip” the data pages as they are written to disk, using significant CPU cycles in the system that cannot then be used to format the data into pages, build indexes, etc.

What about partitioned tables?
This competitor needs tables to be partitioned in order to provide good performance, but in order to load a table this competitor has to have a “rule” for each data partition, and then each row that is being loaded must be compared to the rules to know which partition it should be loaded into. If the row should be loaded into one of the first couple of ranges, then there is little extra processing, but all of the latest data will have to be checked against many rules, slowing down the processing of these rows, and definitely slowing down the load process.

What about indexes?
This competitor admits in their manuals that they also need indexes in order to perform well. But, each index incrementally slows down load performance.

Netezza does not need indexes to perform well, so does not suffer from decreased load speed because of indexes, or table partitioning

What about pre-processing?
Netezza can load at the same 5TB/hr rate with no pre-processing of the input data file. This same competitor can only load at their claimed faster rate if their appliance includes an option “integration” or ETL module where the servers pre-process the data and then send it to the data modules to be loaded. Without the integration module, the load file would need to be placed on the shared file system (accessible from all modules in their appliance) and then the load speed is really only 2TB/hr based on published validation reports of their architecture and procedures. And again, this 2TB/hr is without compression, or table partitioning.

Friday, June 29, 2012

Uncovering the truth about more outrageous competitor claims

Another of the claims this competitor made is that Netezza cannot handle the data ingest from a point of sale (POS) type system because so it could not handle the data being sent from the hundreds or thousands of POS systems at the same time. They go on to claim that Netezza cannot handle more than a specific, ridiculously small, number of concurrent writes to try to give credence to this argument.

In my opinion, this point shows a lack of knowledge of real data warehousing and analytics. IBM Netezza has a large number of retail customers who feed their POS data into Netezza all the time. But, this data normally always comes into the data center in some non-database type of communication protocol. The data is then extracted from these POS feeds, and has meta data management and identity assertion algorithms applied against it, because the data may be coming from many different stores, even different “named” stores where the same item may have different SKUs. Only then is the cleansed data loaded into the warehouse, it is not loaded directly from the hundreds/thousands of POS applications/registers.

IBM Netezza absolutely supports trickle feed and real/near-real time updates from this type of data stream process, as well as direct replication from other relational databases.

And, if you are looking for the ultimate in real time reporting and analytics on your POS data, IBM has the system for you. The IBM DB2 Analytics Accelerator is an IBM Netezza system connected directly to a DB2 for z/OS system using the zEnterprise connection. In this configuration, the transactional applications still run against the tried and true DB2 for z/OS system, and reporting/analytic queries get transparently routed through the DB2 for z/OS system to the Netezza system, to offload the processing and resource usage, and ultimately run much faster. DB2 for z/OS systems run many of the world’s largest scale OLTP applications, and this brings them the power of real time analytics without the need to create extra indexes, aggregates, etc. in their DB2 for z/OS system which are needed to allow the reports/analytics to run quickly enough, but also have a detrimental effect on the transactional performance.

Thursday, June 28, 2012

FUD Competitors are Spreading on Netezza

Recently I was made aware of some FUD (fear, uncertainty, and doubt) that a competitor has been sending to our current and prospective clients. This FUD contained a number of gross inaccuracies, as well as some “points” that really made me scratch my head and wonder how much this competitor really understands data warehousing and analytics.

This competitor claimed that Netezza scanned the entire table for all SQL / Analytic operations.

This claim is absolutely not true. While Netezza does not have indexes that the DBA must create and maintain, it does automatically build and maintain zone maps for all tables in the database. These zone maps contain the minimum and maximum value for all columns in every extent within each table. So, before the query starts reading from disk, it looks at the predicates in the query, and compares them to the zone maps to determine which table extents can be skipped and which need to be read.

For example, if you want to calculate the total sales of Red Ford Mustangs in June 2011, Netezza can skip any table extent that does not have data for June 2011. So, for a database with 7 years of sales history, it can skip any extent that has a maximum that is less, or a minimum that is greater than, June 2011. This eliminates 98% or more of the I/O required.

Our competitor claims that their range partitioning would automatically eliminate processing on all of the months other than June 2011 and is a better solution. While their range partitioning will eliminate the I/O like Netezza, there is a whole bunch of effort that partitioning brings to their solution that they do not talk about. In their solution you create a table, and then you create "inheritance children", one per partition. So for a table with 7 years of data, that is 84 monthly partitions, and 84 tables (the base table plus the 83 inheritance children). That might not seem too bad, but there's more. If you have a primary key, foreign key, index, constraint, or permission on the table, you need to apply it to the table and each of its inheritance children, it is not a global operation. So, for these 84 tables with 3 user groups with different permissions, a primary key, a constraint, and one non-unique index, that would be 84 * (3 + 1 + 1 + 1) or 504 DDL statements to set this up, and to maintain over time.

And on top of that, their bulk loader is not partition aware, so you need to write a rule for each table/inheritance child, adding 84 more DDL statements to the list.

In Netezza you write one statement to create the table, vs. 588 DDL statements for the exact same table and data in this competitor.

I’ll respond to some more of the claims this competitor has been making over my next few posts.

Sunday, May 20, 2012

Argumentative Colleagues and IBM Netezza reducing runtime from 7 hours to 8 seconds....

I had an “interesting discussion” with someone during my IBM Netezza Futures presentation last week at the IDUG (International DB2 User’s Group) conference. At the end of the presentation, one of my IBM colleagues was trying to argue that Netezza’s streaming architecture must not be able to handle multi-table joins because “it does all of its work in the FPGA”.  I tried to explain that he was incorrect, and the way that Netezza does joins so efficiently, but my colleague didn’t seem to listen, and asked the same thing again.

Although some people reading my post might think that I am writing this to try to explain to my IBM colleagues how they should act in customer events, I am not. 

The best part of this discussion was when my colleague asked the same question a second time, one of our customers stood up and said “What Dwaine said about the performance of IBM Netezza is exactly right”. The customer then went on to say how impressed his company is with the proof of concept (PoC) they just completed with the IBM DB2 Analytics Accelerator. I will explain the IBM DB2 Analytics Accelerator soon, but what the customer went on to say is that they have queries that were taking 7 hours on their DB2 for z/OS system (and using MIPS all that time) that are now running on the IBM DB2 Analytics Accelerator in 8 seconds and using basically no MIPS. I could not have paid for a better ending to my session.

The IBM DB2 Analytics Accelerator is an IBM Netezza 1000 system piggy backing along side a DB2 for z/OS system. The DB2 for z/OS optimizer will offload large, complex queries from the DB2 system onto the IBM DB2 Analytics Accelerator, much like a materialized view or materialize query table, but keep the short transactional queries running on DB2 for z/OS. 

So with absolutely no change to their application, data loading procedures, or any extra tuning, this customer was able to run their analytics over 3,000 times faster. That’s the power of the IBM Netezza appliance and the IBM DB2 Analytics Accelerator.

And, it would still be nice if my IBM colleagues learned how to act at customer events. 

Tuesday, May 15, 2012

Increasing Profits with the "Best Offer"

My colleague James Kobielus has been talking about “next best” action recently[1][2], and I also talked about this in my post on the importance of agility for analytic applications[3]. I have been talking to a number of customers recently who want to take this beyond next best, to the best offer.

In my blog I talked about a person who had made a number of purchases, and the retailer was trying to influence the customers next purchase by providing them with a coupon for an item that they might be interested in based on their purchasing history. But, how much more could a company make if they could influence their customers’ purchases before they are made, pushing them to more profitable items during the ordering process.

Think of a distributor of auto parts as an example. As the body shops and garages are placing orders, wither online or over the phone, as they enter an item’s description or SKU, what if you could offer a different part, that has the same specifications and same look and feel, but it costs less, and has a higher profit margin for you. Well, that would be a win-win situation, the garage gets a less expensive part that they can sell to the customer for less (like this would really happen, right?) but the distributor also makes a higher profit.

This take more than just a data warehouse / analytic application. And this type of example is where IBM is uniquely positioned to help you out. You cannot build this “system” with just a data warehouse.

You will need an analytics warehouse that can determine the most profitable items. You may need IBM InfoSphere BigInsights to examine your parts catalogs to find which parts have the same specifications and can be substituted for each other. You will probably need a rules storage mechanism that will record the compatible items from BigInsights and also from your internal sales team who have the deep knowledge of the items they sell. Then you need to have both your online ordering system and your internal order entry system must be intelligent enough to pop-up an offer whenever it sees one of the item numbers or SKUs that have been identified as "replaceable".

While you might be able to call a customer back if you can save them thousands of dollars or more, normally these offers need to be made as the order is being entered/taken, not afterwards.

Bring the power if the IBM Big Data family and our other Information Management and WebSphere solutions to your business, and drive more profit.

[1] http://thinking.netezza.com/blog/next-best-action-healthcare-save-improve-lives-through-applied-analytics
[2] http://thinking.netezza.com/blog/next-best-action-smarter-planet-think-global-optimize-local
[3] http://dsnowondb2.blogspot.com/2012/05/importance-of-agility-for-analytic.html