Monday, July 16, 2012
Adding a 4th V to BIG Data - Veracity
I talked a week or so ago about IBM’s 3 V’s of Big Data. Maybe it is time to add a 4th V, for Veracity.
Veracity deals with uncertain or imprecise data. In traditional data warehouses, the assumption was always that the data is certain, clean, and precise. That is why so much time was spent on ETL/ELT, Master Data Management, Data Lineage, Identity Insight/Assertion, etc.
However, when we start talking about social media data like Tweets, Facebook posts, etc., how much faith can or should we put in the data? Sure, this data can count toward your sentiment analysis, but you would not count it toward your total sales and report on that.
Two of the now 4 V’s of Big Data are actually working against the Veracity of the data. Both Variety and Velocity hinder the ability to cleanse the data before analyzing it and making decisions.
Due to the sheer velocity of some data (like stock trades, or machine/sensor generated events), you cannot spend the time to “cleanse” it and get rid of the uncertainty, so you must process it as is, understanding the uncertainty in the data. And as you bring multi-structured data together, determining the origin of the data and which fields correlate becomes nearly impossible.
When we talk Big Data, I think we need to define trusted data differently than we have in the past. I believe that the definition of trusted data depends on the way you are using the data and applying it to your business. The “trust” you have in the data will also influence the value of the data, and the impact of the decisions you make based on that data.
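To make the idea of use-dependent trust a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Record` type, the `trust` score between 0 and 1, the source names, and the threshold values in `MIN_TRUST` are all hypothetical, not part of any product or standard. The point is only that the same data set can be "trusted enough" for one use and not for another.

```python
from dataclasses import dataclass

# Hypothetical minimum trust required for each use of the data.
# The numbers are illustrative only, not recommendations.
MIN_TRUST = {"sentiment": 0.2, "sales_report": 0.95}


@dataclass
class Record:
    source: str   # where the record came from, e.g. "twitter"
    trust: float  # assumed veracity score in the range [0, 1]
    value: float  # the measurement carried by the record


def usable(records, use):
    """Return only the records trusted enough for the given use."""
    threshold = MIN_TRUST[use]
    return [r for r in records if r.trust >= threshold]


records = [
    Record("twitter", 0.30, 1.0),
    Record("facebook", 0.25, -1.0),
    Record("pos_system", 0.99, 120.0),
]

# Low-trust social data is acceptable for a sentiment count...
sentiment_inputs = usable(records, "sentiment")
# ...but only the highly trusted source qualifies for a sales report.
sales_inputs = usable(records, "sales_report")
```

With these made-up scores, all three records qualify for sentiment, while only the point-of-sale record qualifies for the sales report. In practice the hard part is assigning the trust score in the first place, which this sketch simply takes as given.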