Monday, July 16, 2012
I talked a week or so ago about IBM’s 3 V’s of Big Data. Maybe it is time to add a 4th V, for Veracity.
Veracity deals with uncertain or imprecise data. In traditional data warehouses there was always the assumption that the data was certain, clean, and precise. That is why so much time was spent on ETL/ELT, Master Data Management, Data Lineage, Identity Insight/Assertion, etc.
However, when we start talking about social media data like Tweets, Facebook posts, etc., how much faith can or should we put in the data? Sure, this data can be counted toward your sentiment score, but you would not count it toward your total sales and report on that.
Two of the now 4 V’s of Big Data are actually working against the Veracity of the data. Both Variety and Velocity hinder the ability to cleanse the data before analyzing it and making decisions.
Due to the sheer velocity of some data (like stock trades, or machine/sensor generated events), you cannot spend the time to “cleanse” it and get rid of the uncertainty, so you must process it as is - understanding the uncertainty in the data. And as you bring multi-structured data together, determining the origin of the data, and fields that correlate becomes nearly impossible.
When we talk Big Data, I think we need to define trusted data differently than we have in the past. I believe that the definition of trusted data depends on the way you are using the data and applying it to your business. The “trust” you have in the data will also influence the value of the data, and the impact of the decisions you make based on that data.
Friday, July 13, 2012
Some of our competitors recommend that you use 3rd Normal Form (3NF) for their data structures, as they believe that is the optimal architecture for the ad hoc queries that form the basis for today's decision support and analytical processing. While 3NF can save storage space, it makes queries harder to write and slower to execute. A big downside of 3NF for data warehousing is that it forces the database to join tables for most queries. Joins can be a performance pitfall because they force large volumes of data to be moved around the system. To speed up these queries, DBAs using these other databases create and maintain aggregates and/or indexes across tables. In fact, some tables can have 3, 4, 5 or even more aggregates/indexes if they are joined to other tables using different columns. It is important to realize that these aggregates/indexes require knowledge of the queries, reports and analytics that are going to be run within the system, now and in the future.
Think of it this way—if you are heading out for a Sunday drive, and have no destination in mind, how can you use a navigation system to give you directions?
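To make the join overhead concrete, here is a minimal sketch in Python using the standard sqlite3 module. The schema, table names, and data are invented for illustration; the point is that even a simple ad hoc question against a normalized model forces the database to join the fact table to every dimension it touches.

```python
import sqlite3

# Hypothetical 3NF schema: the sales "facts" are split across normalized tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE product  (prod_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE sale     (sale_id INTEGER PRIMARY KEY,
                       cust_id INTEGER, prod_id INTEGER, amount REAL);
INSERT INTO customer VALUES (1, 'East'), (2, 'West');
INSERT INTO product  VALUES (10, 'Grill'), (11, 'Spice');
INSERT INTO sale     VALUES (100, 1, 10, 250.0), (101, 2, 11, 8.5);
""")

# In 3NF, even "sales by region and category" needs two joins,
# which is exactly the data movement the text describes.
rows = conn.execute("""
    SELECT c.region, p.category, SUM(s.amount)
    FROM sale s
    JOIN customer c ON c.cust_id = s.cust_id
    JOIN product  p ON p.prod_id = s.prod_id
    GROUP BY c.region, p.category
""").fetchall()
print(rows)
```

On two dimensions this is trivial; on a real warehouse schema the same question can fan out into five or ten joins, which is where the index/aggregate tuning burden comes from.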
Another issue that many of our competitors’ customers report is that they need to create a duplicate copy of their data in a dimensional model in order to meet their desired response times. In many cases this dimensional model will contain many star schemas, as well as a number of copies of the fact tables containing different levels of data. The issue with this approach is that the application developers and the business users must be fully aware of what data these new tables and aggregates really contain. If they aren’t aware, they can inadvertently make a business decision based on an old snapshot of the data, or a small, skewed sample of the data. In addition, if a user mistakenly picks the wrong “table,” that query might take 10 or even 100 times longer to run than it would if they completely understood the model.
Because IBM Netezza appliances do not need indexes or pre-computed aggregates to perform well, there is no need to tune the data model for IBM Netezza appliances. In fact, many customers simply copy the data model directly from their old data mart/warehouse or from their transactional system into the IBM Netezza appliance and leave it as-is. They then simply load a single copy of the data and start running their queries.
Wednesday, July 11, 2012
I mentioned yesterday that I was in the Finger Lakes last week. Well, we stopped by some wineries while we were there and it is always interesting to read the tasting notes on the back of the bottle, or on the tasting sheet that the winery gives to you.
I was thinking last night about this and how great it would be if all of the data we, as an organization, create or consume came with its own “tasting notes”. Just imagine a new set of transactions arriving from your OLTP systems, and they come with a tag that says something like “This data shows a correlation between the purchase of steaks and seasoning salt.” Wouldn’t that make the job of the data scientist / data analyst so much easier?
We also went to one winery in particular, and noted that the wine maker did not have any tasting notes, and only described his wine as “like your favorite pair of slippers” or something like that. After talking about this for a while, we found that we actually liked this approach better. Rather than tasting and searching for what the wine maker told us he or she tasted, we were able to develop our own impressions, and detect tastes on our own. Without being directed to a particular smell or taste, we used our own noses and palates to decide what we tasted, and what we liked. In the end we bought more bottles from this winery than we did from any of the other wineries that we visited.
You might be scratching your head and wondering where I am going with this, so let me explain. I believe that analytics needs to be more like the second case above. You should not start with preconceived notions about your data based on what others tell you. Analyze the data, detect correlations/patterns/trends on your own, and then check the “tasting notes” if you want to.
The goal of analytics should be to find NEW information that you can act upon, not simply find the same thing that someone else already found.
Tuesday, July 10, 2012
Having been working on parts of IBM’s Big Data platform for the past year or more, I am continually impressed with the value that IBM brings to our clients.
When we talk Big Data at IBM, we talk about the three V’s: Variety, Velocity, and Volume.
Volume is pretty simple. We all understand that we’re going from a terabyte world to a petabyte, and even a zettabyte, world. I think most of us understand today just how much data is out there now and what’s coming over the next few years.
The variety aspect is somewhat new to those of us in the data warehousing world. It means that analytics is no longer just for structured data, and on top of that, analytics on structured data no longer has to happen in a traditional database. The Big Data era is characterized by the absolute need and desire to explore and analyze all of the data that organizations produce. Because most of the data we produce today is unstructured, we need to fold in unstructured data analytics alongside structured.
If you look at a Facebook post or a Tweet, they may come in a structured format (JSON), but the true value, and the part we need to analyze, is in the unstructured part. And that unstructured part is the text of your tweet or your Facebook status/post.
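A tiny sketch of that point, using Python's standard json module and a made-up, heavily simplified payload (real Tweet and Facebook objects carry many more fields): parsing the structured envelope is trivial, and the analytic work only begins once you reach the free-form text inside it.

```python
import json

# Hypothetical, simplified tweet payload. The envelope is structured JSON,
# but the value the analyst cares about is the unstructured "text" field.
raw = '{"id": 1, "user": {"name": "alice"}, "text": "Loving the Finger Lakes!"}'

tweet = json.loads(raw)   # reading the structured wrapper: one line of code
text = tweet["text"]      # the unstructured part: this is where analysis starts
print(text)
```

The structured fields (id, user) are easy to store and index; sentiment, topics, and intent all have to be mined out of that last string.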
Finally, there’s velocity. We at IBM consider velocity to be how fast the data arrives at the enterprise, and of course it leads naturally to the question: how long does it take you to analyze that data and act on it?
It is important to keep in mind that a Big Data problem could involve only one of these characteristics, or all of them. And in fact, most of our clients see that a closed loop mechanism, normally involving more than one of our Big Data solutions, is the best way to tackle their problem.
The neonatal ward at a well-known hospital is a prime example of this. Hospital equipment issues an alert when a vital sign goes out of range, prompting the hospital staff to take action immediately. However, many life-threatening conditions take hours or days to reach critical levels, delaying potentially life-saving treatments. Often the signs that something is wrong begin to appear long before the situation becomes serious enough to trigger an alert, and even a skilled nurse or physician might not be able to spot and interpret these trends in time to avoid serious complications. Complicating this is the fact that many of these warning indicators are hard to detect, and it’s next to impossible to understand their interaction and implications until a threshold has been breached.
Consider nosocomial infection, a life-threatening illness contracted in hospitals. Research has shown that signs of this infection can appear 12-24 hours before overt trouble or distress is spotted and normal ranges are exceeded. Making things more complex, in a baby where this infection has set in, the heart rate stays completely steady (i.e., it doesn’t rise and fall throughout the day like it does for a healthy baby), and the pulse also stays within acceptable limits. The information needed to detect the infection is present, but it is very subtle and hard to detect. In a neonatal ward, the ability to absorb and reflect upon all of the data being presented is beyond human capacity; there is just too much data.
By analyzing historical data, and developing correlations and an understanding of the indicators of this and other health conditions, the doctors and researchers were able to develop a set of rules (or set of conditions) that indicate a patient is suffering from a specific malady, like nosocomial infection. The monitors (which can produce 1,000+ readings per second) feed their readings into IBM’s InfoSphere Streams, where the data is checked on the fly. The data is checked against healthy ranges, and also against other values from the past 72 hours, and if any rules are breached, an alert is generated. For example, if a child’s heart rate has not changed for the past 4 hours and their temperature is above 99 degrees, then that is a good indicator that they may be suffering from nosocomial infection.
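That kind of rule can be sketched in a few lines. This is an illustrative toy, not InfoSphere Streams: the function name, the hourly granularity, and the thresholds are all assumptions made for the example, loosely following the rule described above.

```python
# Toy version of the rule above: alert when the heart rate shows no
# variation over a 4-reading window while the latest temperature is
# above 99 F. Window size and thresholds are assumptions for the sketch.
WINDOW = 4

def flat_hr_and_fever(readings):
    """readings: sequence of (hour, heart_rate, temp_f) tuples, oldest first.
    Returns True when the last WINDOW heart rates are identical and the
    most recent temperature exceeds 99 F."""
    window = list(readings)[-WINDOW:]
    heart_rates = {hr for (_, hr, _) in window}
    latest_temp = window[-1][2]
    return len(window) == WINDOW and len(heart_rates) == 1 and latest_temp > 99.0

# An unnaturally steady heart rate plus a low-grade fever trips the rule.
stream = [(1, 120, 98.4), (2, 120, 98.9), (3, 120, 99.3), (4, 120, 99.6)]
print(flat_hr_and_fever(stream))
```

In a real streaming engine this check would run continuously over a sliding window of sensor data rather than a Python list, but the shape of the rule is the same.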
And as the researchers continue to study more and more historical data in their data warehouse and Hadoop clusters, whenever they detect new correlations they can dynamically update the rules that are being checked against the real-time streaming data.
Monday, July 09, 2012
I took a few days off to visit the Finger Lakes in New York (you really should go there if you like hiking and/or wine) and came back to an overflowing email inbox. Some of these emails were from clients that have been receiving more correspondence from our competitors with claims of their superiority over IBM Netezza.
One of these claims was that their solution is more nimble and able to handle broader workloads because they have indexes and aggregates. This one made me think for a few minutes, but I still feel that the Netezza approach, where there is no need for indexes, is a far better solution.
While an index or aggregate can be used to improve/optimize performance for one or more queries, the upfront table, aggregate, and index design/build phase will cause other systems to take MANY TIMES longer to get up and running efficiently than IBM Netezza appliances. In fact, some of our competitors’ customers have openly talked about months long implementation cycles, while IBM Netezza customers talk about being up and running in 24 hours…
Instead of days or weeks of planning their data models, IBM Netezza appliance customers simply copy their table definitions from their existing tables, load the data, and start running their queries/reports/analytics. There is absolutely no need to create any indexes (and then have to choose between up to 19 different types of indexes) or aggregates. Where other data warehouse solutions require weeks of planning, IBM Netezza appliances are designed to deliver results almost immediately.
Today’s data warehouse technologies have made it possible to collect and merge very large amounts of data. Systems that require indexes are fine for creating historical reports, because you simply run the same report over and over again. But today’s business users need answers promptly. The answer to one question determines the next question they are going to ask, and that answer determines the next, and so on. This thought process is known as “train of thought” analysis, and it can lead to the competitive advantages required in the economy of today and tomorrow.
Outside of IBM Netezza data warehouse appliances, standard operating procedure is for users to extract a small sample of data, move it out of the data warehouse to another server, and then run the analytics against that sample. This is required because those systems cannot support ad hoc analytic reports without completely exhausting the system resources and impacting all other users. Even if these other systems had the right indexes all of the time, they would still need to move massive amounts of data into memory before processing it.
The small sample size allows the analysis to complete in a reasonable amount of time, but by restricting the analysis to a small subset of the data, it becomes harder to spot (and act on) the trends found within it. We discussed this above with the baseball example, but it applies to everyone. Credit cards are another example: with credit card numbers and identities being stolen every day, the card companies need to detect these misuses immediately in order to limit their liability and prevent loss. While there are quick “indicators” of fraud, like a new credit card being used at a pay phone at an airport for the first time, most indicators come from correlating more than one transaction. For example, a card cannot be used in Kansas and then in Orlando 7 minutes later, unless one of the transactions is a web transaction or the card number was manually entered because it was taken over the phone. So, if a card was physically swiped in Kansas and then in Orlando less than a few hours later, the account must have been compromised. Now, if the fraud detection application only looked at every 10th, or every 100th, transaction, it would miss at least one, if not both, of these transactions most of the time.
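The Kansas-to-Orlando check is essentially an "impossible travel" rule, and it can be sketched directly. Everything here is an assumption for illustration: the coordinates, the 500 mph speed ceiling, and the function names are invented, and a real fraud system would fold in the web/phone-order exceptions the text mentions.

```python
from datetime import datetime
from math import radians, sin, cos, asin, sqrt

def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in statute miles."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 3956 * 2 * asin(sqrt(a))

def impossible_travel(swipe_a, swipe_b, max_mph=500.0):
    """swipe_* = (timestamp, lat, lon) for two card-PRESENT transactions.
    Flags the pair when the implied travel speed exceeds max_mph."""
    (t1, la1, lo1), (t2, la2, lo2) = sorted((swipe_a, swipe_b))
    hours = (t2 - t1).total_seconds() / 3600.0
    if hours == 0:
        return True  # two places at the same instant
    return miles_between(la1, lo1, la2, lo2) / hours > max_mph

kansas  = (datetime(2012, 7, 13, 10, 0), 39.05, -95.68)  # Topeka, KS (approx.)
orlando = (datetime(2012, 7, 13, 10, 7), 28.54, -81.38)  # 7 minutes later
print(impossible_travel(kansas, orlando))  # roughly 1,100 miles in 7 minutes
```

Note that the rule only works if the detection system sees both swipes, which is exactly the argument against sampling: score every 10th transaction and you will usually see at most one side of the pair.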
Tuesday, July 03, 2012
A number of our competitors make bold compression claims, like “we provide 4X, or even 10X+ compression”. What they do not tell you is that you have to choose between multiple different compression levels, and/or compression libraries/types. They also do not mention that you cannot use their compression on tables that have inserts, updates, and/or deletes occurring. Nor do they mention the overhead that you can expect to see if you turn on their compression.
Let’s look at these three points in a little more detail.
In today’s world of reduced budgets, one of the easiest ways to save money with a data warehouse is to reduce the amount of disk space required to store the data. To this end, nearly all data warehouse solutions offer some form of database compression. With many of our competitors, however, in order to use compression the table must be an append-only table.
Append-only tables have the following limitations:
• Cannot UPDATE rows in an append-only table
• Cannot DELETE rows from an append-only table
• Cannot ALTER TABLE...ADD COLUMN on an append-only table
• Cannot add indexes to an append-only table
These limitations exist because of the way these vendors have implemented their “database compression.” While Netezza has built-in algorithms specifically designed for database usage, others use a compression library that “compresses” the data file as it is being written to disk, and then uses the same library to un-compress the file as it reads it from disk. Anyone who has used a tool like gzip, WinZip, pkzip, WinRAR, etc. knows how slow these tools are, and how many CPU cycles they use. This is the same overhead you will have with these other vendors if you use their compression. In fact, this overhead can be so bad that some customers who have presented at our competitors’ conferences have talked about tests where a single query running on a compressed table used up over 95% of the CPU, while the same query against the same data in an uncompressed table used less than 5% of the CPU.
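You can get a feel for this library-style trade-off with Python's zlib, a DEFLATE implementation in the same family as gzip/pkzip. The "page" contents, size, and the three levels shown are invented for the sketch; the point is simply that every page pays a measurable CPU cost on write and again on read.

```python
import time
import zlib

# One fake data "page": repetitive warehouse-style rows compress well.
page = b"2012-07-13|ORD-000123|WIDGET|4|19.99\n" * 200

for level in (1, 6, 9):  # typical fast / default / maximum settings
    t0 = time.perf_counter()
    packed = zlib.compress(page, level)
    elapsed = time.perf_counter() - t0
    ratio = len(page) / len(packed)
    print(f"level {level}: {ratio:.1f}x smaller, {elapsed * 1e6:.0f} us CPU per page")

# The cost is paid again on every read: decompression must round-trip exactly.
assert zlib.decompress(zlib.compress(page, 6)) == page
```

Multiply that per-page cost by millions of pages per query and the 95%-CPU anecdote above stops looking surprising.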
On top of the performance impact, there is also the DBA overhead. With one competitor, the DBA has to choose between 3 types of compression (i.e., compression libraries) and 9 different levels of compression, each of which works better for different data types. That is 27 different combinations/permutations that the DBA has to choose between, for each table.
With IBM Netezza, compression is always on; there is no “better algorithm” to pick for different tables, and because of the way the Netezza architecture works, when you get 4X compression on your data in Netezza, you see an associated 4X improvement in performance, for all types of workloads, not just reads.
Monday, July 02, 2012
Among some of the other “claims” that our competitor made about Netezza is that it can only load at the rate of 2TB/hr. First off, this is false. The current generation of the Netezza platform can load at over 5TB/hr. But the real question I ask is, “Does this really matter after you have your system up and running?”
After the initial loading of the database from the existing system(s), very few companies load more than a couple hundred GB to a couple TB per day, and most do not even approach these daily or even monthly load volumes. Even Netezza’s biggest customers, who have petabyte-sized systems, do not find Netezza’s load speed to be an issue.
Now let’s look at the claims this competitor is making in more detail, and peel back the layers of the onion. This competitor claims that they can load at double Netezza’s load speed of 5TB/hr. But they leave out a number of important factors when they make this claim.
What about compression?
Netezza can load at a rate of 5TB/hr and compress the data at the same time. This competitor can only load at their claimed rate if compression is not used. So, if you want to compress the data, how fast can you really load on their platform? They use a library-based compression algorithm that essentially “zips” the data pages as they are written to disk, consuming significant CPU cycles that then cannot be used to format the data into pages, build indexes, etc.
What about partitioned tables?
This competitor needs tables to be partitioned in order to provide good performance, but in order to load a table, this competitor has to have a “rule” for each data partition, and each row being loaded must be compared against those rules to determine which partition it should be loaded into. If a row belongs in one of the first couple of ranges, there is little extra processing, but all of the latest data has to be checked against many rules, slowing down the processing of those rows, and definitely slowing down the load.
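The per-row routing cost is easy to picture with a sketch. The partition names and date boundaries below are assumptions for the example; the structure, a list of range rules checked in declared order, is the relevant part.

```python
from datetime import date

# Hypothetical range-partition rules, oldest first, checked in order.
PARTITIONS = [
    ("p2012_q1", date(2012, 1, 1), date(2012, 4, 1)),
    ("p2012_q2", date(2012, 4, 1), date(2012, 7, 1)),
    ("p2012_q3", date(2012, 7, 1), date(2012, 10, 1)),
]

def route(row_date):
    """Return the target partition for a row, scanning rules in order."""
    for name, lo, hi in PARTITIONS:
        if lo <= row_date < hi:
            return name
    raise ValueError(f"no partition covers {row_date}")

# The newest data only matches the LAST rule, so every recent row walks
# past all the earlier rules first, which is the load slowdown described.
print(route(date(2012, 7, 10)))
```

With three quarterly ranges the scan is cheap; with years of daily partitions, every freshly arriving row pays the longest possible walk through the rule list.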
What about indexes?
This competitor admits in their manuals that they also need indexes in order to perform well. But, each index incrementally slows down load performance.
Netezza does not need indexes or table partitioning to perform well, so it does not suffer from decreased load speed because of them.
What about pre-processing?
Netezza can load at the same 5TB/hr rate with no pre-processing of the input data file. This same competitor can only load at their claimed faster rate if their appliance includes an option “integration” or ETL module where the servers pre-process the data and then send it to the data modules to be loaded. Without the integration module, the load file would need to be placed on the shared file system (accessible from all modules in their appliance) and then the load speed is really only 2TB/hr based on published validation reports of their architecture and procedures. And again, this 2TB/hr is without compression, or table partitioning.