Hi, and welcome to my blog. I have been working for IBM, and with DB2, for the past 22 years, and I recently started working with our new colleagues from Netezza. Although I work for IBM, the views expressed here are my own and not necessarily those of IBM or its affiliates. The views and opinions expressed by visitors to this blog are theirs and do not necessarily reflect mine.
Tuesday, July 17, 2012
Nucleus Research reports a 241% ROI from using Big Data to enable larger, more complex analytics
Monday, July 16, 2012
Adding a 4th V to BIG Data - Veracity
I talked a week or so ago about IBM’s 3 V’s of Big Data. Maybe
it is time to add a 4th V, for Veracity.
Veracity deals with uncertain or imprecise data. In traditional data warehouses there was always the assumption that the data was certain, clean, and precise. That is why so much time was spent on ETL/ELT, Master Data Management, Data Lineage, Identity Insight/Assertion, and so on.
However, when we start talking about social media data like Tweets, Facebook posts, etc., how much faith can or should we put in the data? Sure, this data can be counted toward your sentiment measures, but you would not count it toward your total sales and report on that.
Two of the now four V’s of Big Data are actually working against the veracity of the data: both Variety and Velocity hinder the ability to cleanse the data before analyzing it and making decisions.
Due to the sheer velocity of some data (like stock trades, or machine/sensor-generated events), you cannot spend the time to “cleanse” it and get rid of the uncertainty, so you must process it as is, understanding the uncertainty in the data. And as you bring multi-structured data together, determining the origin of the data and which fields correlate becomes nearly impossible.
When we talk Big Data, I think we need to define trusted
data differently than we have in the past. I believe that the definition of
trusted data depends on the way you are using the data and applying it to your
business. The “trust” you have in the data will also influence the value of the
data, and the impact of the decisions you make based on that data.
Friday, July 13, 2012
A discussion of data models for analytics
Some of our competitors recommend that you use 3rd Normal Form (3NF) for your data structures, as they believe that is the optimal architecture for the ad hoc queries that form the basis of today's decision support and analytical processing. While 3NF can save storage space, it makes queries harder to write and slower to execute. A big downside of 3NF for data warehousing is that it forces the database to join tables for most queries. Joins can be a performance pitfall because they force large volumes of data to be moved around the system. To speed up these queries, DBAs using these other databases create and maintain aggregates and/or indexes across tables. In fact, some tables can have 3, 4, 5, or even more aggregates/indexes if they are joined to other tables using different columns. It is important to realize that these aggregates/indexes require knowledge of the queries, reports, and analytics that are going to be run within the system, now and in the future.
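To make the join problem concrete, here is a small illustrative sketch (using Python's built-in sqlite3, with made-up table and column names, not any particular vendor's schema). The same business question needs two joins against the 3NF tables, but only a single-table scan against a denormalized copy:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- 3NF: facts and dimensions split across narrow tables
CREATE TABLE sales    (sale_id INTEGER, customer_id INTEGER, product_id INTEGER, amount REAL);
CREATE TABLE customer (customer_id INTEGER, region TEXT);
CREATE TABLE product  (product_id INTEGER, category TEXT);
-- Denormalized copy: one wide table, no joins needed
CREATE TABLE sales_flat (sale_id INTEGER, region TEXT, category TEXT, amount REAL);

INSERT INTO sales      VALUES (1, 10, 100, 25.0), (2, 11, 101, 40.0);
INSERT INTO customer   VALUES (10, 'East'), (11, 'West');
INSERT INTO product    VALUES (100, 'Grills'), (101, 'Patio');
INSERT INTO sales_flat VALUES (1, 'East', 'Grills', 25.0), (2, 'West', 'Patio', 40.0);
""")

# 3NF: two joins just to total sales by region and category
q_3nf = """
SELECT c.region, p.category, SUM(s.amount)
FROM sales s
JOIN customer c ON c.customer_id = s.customer_id
JOIN product  p ON p.product_id  = s.product_id
GROUP BY c.region, p.category;
"""

# Denormalized: the same question answered from a single table scan
q_flat = """
SELECT region, category, SUM(amount)
FROM sales_flat
GROUP BY region, category;
"""

print(con.execute(q_3nf).fetchall())   # [('East', 'Grills', 25.0), ('West', 'Patio', 40.0)]
print(con.execute(q_flat).fetchall())  # same answer, no joins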
Think of it
this way—if you are heading out for a Sunday drive, and have no destination in
mind, how can you use a navigation system to give you directions?
Another issue that many of our competitors’ customers report is that they need to create a duplicate copy of their data in a dimensional model in order to meet their desired response times. In many cases this dimensional model will contain many star schemas, as well as a number of copies of the fact tables containing different levels of data. The issue with this approach is that the application developers and the business users must be fully aware of what data these new tables and aggregates really contain. If they aren’t aware, they can inadvertently make a business decision based on an old snapshot of the data, or on a small, skewed sample of the data. In addition, if a user mistakenly picks the wrong “table,” that query might take 10 or even 100 times longer to run than it would if they completely understood the model.
Because IBM Netezza appliances do not need indexes or pre-computed aggregates to perform well, there is no need to tune the data model for them. In fact, many customers simply copy the data model directly from their old data mart/warehouse or from their transactional system into the IBM Netezza appliance and leave it as-is. They then simply load a single copy of the data and start running their queries.
Wednesday, July 11, 2012
Why Analytics is like Wine Tasting
I mentioned yesterday that I was in the Finger Lakes last
week. Well, we stopped by some wineries while we were there and it is always
interesting to read the tasting notes on the back of the bottle, or on the
tasting sheet that the winery gives to you.
I was thinking last night about this and how great it would
be if all of the data we, as an organization, create or consume came with its
own “tasting notes”. Just imagine a new
set of transactions arriving from your OLTP systems, and they come with a tag that
says something like “This data shows a correlation between the purchase of
steaks and seasoning salt.” Wouldn’t that make the job of the data scientist /
data analyst so much easier?
We also went to one winery where the winemaker did not provide any tasting notes, and only described his wine as “like your favorite pair of slippers,” or something like that. After talking about this for a while, we found that we actually liked this approach better. Rather than tasting and searching for what the winemaker told us he or she tasted, we were able to develop our own impressions and detect tastes on our own. Without being directed to a particular smell or taste, we used our own noses and palates to decide what we tasted, and what we liked. In the end we bought more bottles from this winery than we did from any of the other wineries that we visited.
You might be scratching your head and wondering where I am going with this, so let me explain. I believe that analytics needs to be more like the second case above. You should not start with preconceived notions about your data based on what others tell you. Analyze the data, detect correlations/patterns/trends on your own, and then check the tasting notes if you want to.
The goal of analytics
should be to find NEW information that you can act upon, not simply find the
same thing that someone else already found.
Tuesday, July 10, 2012
IBM's Big Data Platform - Saving One Life at a Time
Having been working on parts of IBM’s Big Data platform for
the past year or more, I am continually impressed with the value that IBM
brings to our clients.
When we talk Big Data at IBM, we talk about the three V’s:
Variety, Velocity, and Volume.
Volume is pretty simple. We all understand that we’re going from a terabytes world to a petabytes world, and on to a zettabytes world. I think most of us understand today just how much data is out there now and what’s coming over the next few years.
The variety aspect is something kind of new to us in the data warehousing world, and it means that analytics is no longer just for structured data; on top of that, analytics on structured data doesn’t have to happen in a traditional database any longer. The Big Data era is characterized by the absolute need and desire to explore and analyze all of the data that organizations produce. Because most of the data we produce today is unstructured, we need to fold in unstructured data analytics as well as structured.
If you look
at a Facebook post or a Tweet, they may come in a structured format (JSON), but
the true value, and the part we need to analyze, is in the unstructured part.
And that unstructured part is the text of your tweet or your Facebook status/post.
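To illustrate the point, here is a tiny sketch (the field names are made up, not an exact Twitter or Facebook payload): the structured wrapper parses trivially, but the value lives in the free-form text field.

import json

# Hypothetical social media record: structured JSON around unstructured text
raw = '{"id": 12345, "created_at": "2012-07-10T14:03:00Z", "user": "jsnow", ' \
      '"text": "Just tried the new BBQ seasoning salt on my steak - amazing!"}'

post = json.loads(raw)          # the structured part parses trivially
unstructured = post["text"]     # the unstructured part is where sentiment, topics, etc. live

print(unstructured)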
Finally, there’s velocity. We at IBM consider velocity to be how fast the data arrives at the enterprise, and of course, that leads to the question: how long does it take you to analyze it and act on it?
It is important to keep in mind that a Big Data problem
could involve only one of these characteristics, or all of them. And in fact,
most of our clients see that a closed loop mechanism, normally involving more
than one of our Big Data solutions, is the best way to tackle their problem.
The neonatal ward at a well-known hospital is a prime example of this. Hospital equipment issues an alert when a vital sign goes out of range, prompting the hospital staff to take immediate action. However, many life-threatening conditions take hours or days to reach critical levels, delaying possible life-saving treatments. Often the signs that something is wrong begin to appear long before the situation becomes serious enough to trigger an alert, and even a skilled nurse or physician might not be able to spot and interpret these trends in time to avoid serious complications. Complicating this is the fact that many of these warning indicators are hard to detect, and it’s next to impossible to understand their interactions and implications until a threshold has been breached.
Take, for example, nosocomial infection, a life-threatening illness contracted in hospitals. Research has shown that signs of this infection can appear 12-24 hours before overt trouble/distress is spotted and normal ranges are exceeded. Making things more complex, in a baby where this infection has set in, the heart rate stays completely normal (i.e., it doesn’t rise and fall throughout the day like it does in a healthy baby). In addition, the pulse also stays within acceptable limits. The information needed to detect the infection is present, but it is very subtle and hard to spot. In a neonatal ward, the ability to absorb and reflect upon all of the data being presented is beyond human capacity; there is just too much data.
By analyzing historical data, and developing correlations and an understanding of the indicators of this and other health conditions, the doctors and researchers were able to develop a set of rules (or set of conditions) that indicate a patient is suffering from a specific malady, like nosocomial infection. The monitors (which can produce 1,000+ readings per second) feed their readings into IBM’s InfoSphere Streams, where they are checked on the fly. The data is checked against healthy ranges, and also against other values from the past 72 hours, and if any rules are breached, an alert is generated. For example, if a child’s heart rate has not changed for the past 4 hours and their temperature is above 99 degrees, that is a good indicator that they may be suffering from nosocomial infection.
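To give a feel for what such a rule looks like, here is a minimal plain-Python sketch of the heart-rate/temperature check described above. It is not InfoSphere Streams code, and the window length, tolerance, and thresholds are illustrative assumptions only.

from collections import deque

WINDOW_SECONDS = 4 * 60 * 60    # look back 4 hours
TEMP_LIMIT_F   = 99.0           # temperature threshold from the example above
FLAT_TOLERANCE = 1.0            # bpm of variation still considered "unchanged" (assumed)

readings = deque()              # (timestamp_seconds, heart_rate_bpm, temperature_f)

def check_reading(ts, heart_rate, temperature):
    """Add one monitor reading; return True if the example rule fires."""
    readings.append((ts, heart_rate, temperature))

    # Keep only the last 4 hours of readings in the window
    while readings and readings[0][0] < ts - WINDOW_SECONDS:
        readings.popleft()

    rates = [hr for _, hr, _ in readings]
    heart_rate_flat = max(rates) - min(rates) <= FLAT_TOLERANCE
    # Assumes readings have been streaming in continuously for at least 4 hours;
    # otherwise the window does not yet represent 4 hours of history.
    return heart_rate_flat and temperature > TEMP_LIMIT_F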
And as the researchers continue to study more and more historical data in their data warehouse and Hadoop clusters, and detect more correlations, they can dynamically update the rules that are being checked against the real-time streaming data.
Monday, July 09, 2012
Indexes do NOT make a warehouse agile
I took a
few days off to visit the Finger Lakes in New York (you really should go there
if you like hiking and/or wine) and came back to an overflowing email inbox.
Some of these emails were from clients that have been receiving more
correspondence from our competitors with claims of their superiority over IBM
Netezza.
One of these claims was that their solution is more nimble and able to handle broader workloads because they have indexes and aggregates. This one made me think for a few minutes, but I still feel that the Netezza approach, where there is no need for indexes, is a far better solution.
While an
index or aggregate can be used to improve/optimize performance for one or more
queries, the upfront table, aggregate, and index design/build phase will cause
other systems to take MANY TIMES longer to get up and running efficiently than
IBM Netezza appliances. In fact, some of our competitors’ customers have openly
talked about months long implementation cycles, while IBM Netezza customers
talk about being up and running in 24 hours…
Instead of days or weeks of planning their data models, IBM Netezza
appliance customers simply copy their table definitions from their existing
tables, load the data, and start running their queries/reports/analytics. There
is absolutely no need to create any indexes (and then have to choose between up
to 19 different types of indexes) or aggregates. Where other data warehouse
solutions require weeks of planning, IBM Netezza appliances are designed to
deliver results almost immediately.
Today’s data warehouse technologies have made it possible to collect and merge very large amounts of data. Systems that require indexes are fine for creating historical reports, because you simply run the same report over and over again. But today’s business users need answers promptly. The answer to one question determines the next question they are going to ask, and that answer the next, and so on. This thought process is known as “train of thought” analysis, and it can lead to the competitive advantages required in the economy of today and tomorrow.
Outside of IBM Netezza data warehouse appliances, standard operating procedure is for users to extract a small sample of data, move it out of the data warehouse to another server, and then run the analytics against that sample. This is required because those systems cannot support these ad hoc analytic reports without completely exhausting the system resources and impacting all other users. Even if these other systems had the right indexes all of the time, they would still need to move massive amounts of data into memory before processing it, which is why users of other data warehouse solutions end up sampling their data and copying a small subset to a dedicated analytics server.
The small sample
size allows the analysis to complete in a reasonable amount of time, but by
restricting this analysis to a small subset of the data, it becomes harder to
spot (and act on) the trends found within it.
We discussed this above with the baseball example, but it applies to everyone.
Credit cards are another example. With credit card numbers and identities being stolen every day, the card companies need to detect these misuses immediately in order to limit their liability and prevent loss. While there are quick “indicators” of fraud, like a new credit card being used at a pay phone at an airport for the first time, most indicators come from correlating more than one transaction. For example, a card cannot be used in Kansas and then in Orlando 7 minutes later, unless one of the transactions is a web transaction or the card number was manually entered because it was taken over the phone. So, if a card was physically swiped in Kansas and then in Orlando less than a few hours later, the account must have been compromised. Now, if the fraud detection application only looked at every 10th, or every 100th, transaction, it would miss at least one, if not both, of these transactions most of the time.
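Here is a toy sketch of that kind of correlation check, written in plain Python. The two-hour travel threshold, city names, and transaction format are made-up assumptions, not any card company's actual rules.

from datetime import datetime, timedelta

MIN_TRAVEL_TIME = timedelta(hours=2)   # assumed minimum time to travel between distant cities

last_swipe = {}   # card_number -> (city, timestamp) of the last card-present swipe

def check_swipe(card_number, city, timestamp, card_present=True):
    """Return True if this transaction looks fraudulent."""
    suspicious = False
    if card_present and card_number in last_swipe:
        prev_city, prev_time = last_swipe[card_number]
        if prev_city != city and timestamp - prev_time < MIN_TRAVEL_TIME:
            suspicious = True       # the card cannot physically be in both places
    if card_present:
        last_swipe[card_number] = (city, timestamp)
    return suspicious

# Example: swiped in Kansas, then physically swiped in Orlando 7 minutes later
t0 = datetime(2012, 7, 9, 10, 0)
print(check_swipe("4111-...", "Wichita, KS", t0))                         # False
print(check_swipe("4111-...", "Orlando, FL", t0 + timedelta(minutes=7)))  # True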
Tuesday, July 03, 2012
Why should compression only work for read-only data?
A number of our competitors make bold compression claims,
like “we provide 4X, or even 10X+ compression”. What they do not tell you is
that you have to choose between multiple different compression levels, and/or
compression libraries/types. They also do not mention that you cannot use their
compression on tables that have inserts, updates, and/or deletes occurring. Nor
do they mention the overhead that you can expect to see if you turn on their
compression.
Let’s look at these three points in a little more detail.
In today’s world of reduced budgets, one of the easiest ways to save money with a data warehouse is to reduce the amount of disk space required to store the data. To this end, nearly all data warehouse solutions offer some form of database compression. However, to use compression with many of our competitors’ products, the table must be an append-only table.
Append-only tables have the following limitations:
•Cannot UPDATE rows in an append-only table
•Cannot DELETE rows from an append-only table
•Cannot ALTER TABLE...ADD COLUMN on an append-only table
•Cannot add indexes to an append-only table
These limitations exist because of the way these vendors have implemented their ‘database compression’. While Netezza has built-in algorithms specifically designed for database usage, others use a compression library that “compresses” the data file as it is being written to disk, and then use the same library to un-compress the file as it is read from disk. Anyone who has used a tool like gzip, WinZip, pkzip, WinRAR, etc. knows how slow these tools are, and how many CPU cycles they use. This is the same consideration and overhead you will have with these other vendors if you use their compression. In fact, this overhead can be so bad that some customers who have presented at our competitors’ conferences have talked about tests where a single query running on a compressed table used up over 95% of the CPU, while the same query against the same data in an uncompressed table used less than 5% of the CPU.
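If you want a feel for that overhead, here is a rough, back-of-the-envelope sketch that times a general-purpose library (Python's zlib, standing in for the "zip-style" approach described above) compressing and decompressing a block of synthetic data. It is only an illustration of library overhead, not a reproduction of any vendor's numbers.

import time
import zlib

# ~17 MB of repetitive, synthetic data loosely mimicking warehouse rows
page = (b"2012-07-17|STORE042|SKU%05d|19.99\n" % 123) * 500_000

start = time.perf_counter()
compressed = zlib.compress(page, level=6)
compress_time = time.perf_counter() - start

start = time.perf_counter()
zlib.decompress(compressed)
decompress_time = time.perf_counter() - start

print(f"original:   {len(page)/1e6:.1f} MB")
print(f"compressed: {len(compressed)/1e6:.1f} MB ({len(page)/len(compressed):.1f}x)")
print(f"compress:   {compress_time:.3f} s of pure CPU")
print(f"decompress: {decompress_time:.3f} s of pure CPU")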
On top of the performance impact, there is also the DBA overhead. With one competitor, the DBA has to choose between 3 types of compression (i.e., compression libraries) and 9 different levels of compression, each of which works better for different data types. That is 27 different combinations that the DBA has to choose between, for each table.
With IBM Netezza, compression is always on and there is no “better algorithm” to choose for different tables. And because of the way the Netezza architecture works, when you get 4X compression of your data in Netezza, you see an associated 4X improvement in performance, for all types of workloads, not just reads.
Monday, July 02, 2012
Addressing more crazy competitor claims
Among the other “claims” that our competitor made about Netezza is that it can only load at a rate of 2TB/hr. First off, this is false. The current generation of the Netezza platform can load at over 5TB/hr. But the real question I ask is, “Does this really matter after you have your system up and running?”
After the initial loading of the database from the existing system(s), very few companies load more than a couple hundred GB to a couple of TB per day, and most do not even approach these daily, or even monthly, load volumes. Even Netezza’s biggest customers, who have petabyte-sized systems, do not find Netezza’s load speed to be an issue.
Now let’s look at the claims this competitor is making in more detail, and peel back the layers of the onion. This competitor claims that they can load at double Netezza’s load speed of 5TB/hr. But they leave out a number of important factors when they make this claim.
What about compression?
Netezza can load at a rate of 5TB/hr and compress the data at the same time. This competitor can only load at their claimed rate if compression is not used. So, if you want to compress the data, how fast can you really load on their platform? They use a library-based compression algorithm that basically “zips” the data pages as they are written to disk, consuming significant CPU cycles that cannot then be used to format the data into pages, build indexes, etc.
What about partitioned tables?
This competitor needs tables to be partitioned in order to provide good performance, but in order to load a table this competitor has to have a “rule” for each data partition, and each row that is being loaded must be compared against those rules to determine which partition it should be loaded into. If the row belongs in one of the first couple of ranges, then there is little extra processing, but all of the latest data will have to be checked against many rules, slowing down the processing of those rows, and definitely slowing down the load process.
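Here is a simplified sketch of that rule-based routing. The partition boundaries and row format are made up for illustration; the point is simply that the newest rows, which make up most of a typical load, are the ones that get compared against every rule.

from datetime import date

# Monthly range partitions, oldest first: partition i holds rows whose
# sale_date falls before boundaries[i] (and on/after the previous boundary).
boundaries = [date(2012, m, 1) for m in range(2, 8)]   # Feb .. Jul 2012

def route_row(sale_date):
    """Compare the row against each partition rule in turn, as described above."""
    checks = 0
    for i, upper in enumerate(boundaries):
        checks += 1
        if sale_date < upper:
            return i, checks              # older rows exit after a rule or two
    return len(boundaries), checks        # newest rows fail every rule first

# Freshly loaded data is mostly recent, so most rows pay the maximum number of checks
print(route_row(date(2012, 2, 15)))   # (1, 2)  -- old row, cheap to route
print(route_row(date(2012, 7, 2)))    # (6, 6)  -- new row, checked against every rule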
What about indexes?
This competitor admits in their manuals that they also need indexes in order to perform well. But each index incrementally slows down load performance.
Netezza does not need indexes to perform well, so it does not suffer from decreased load speed due to indexes or table partitioning.
What about pre-processing?
Netezza can load at the same 5TB/hr rate with no pre-processing of the input data file. This same competitor can only load at their claimed faster rate if their appliance includes an optional “integration” or ETL module, where those servers pre-process the data and then send it to the data modules to be loaded. Without the integration module, the load file would need to be placed on the shared file system (accessible from all modules in their appliance), and then the load speed is really only 2TB/hr, based on published validation reports of their architecture and procedures. And again, this 2TB/hr is without compression or table partitioning.