Monday, May 14, 2012
Thoughts on "in memory" data warehousing / analytics
“In Memory” seems to be the latest buzz term in the database and analytics arena. If in memory was able to live up to the hype it would be a great thing, but there is still a lot of hype involved.
In memory database vendors claim that “Reporting requirements have a large impact on I/O time and network resources when dealing with large volumes of data. In order to minimize this, a company needs optimized file and database access methods. In-memory overcomes this process by shifting all the data and workload into memory.” 
But this is not the only way to eliminate the I/O bottleneck that can plague traditional databases when they are used for data warehousing and analytics. IBM Netezza solves the I/O bottleneck by pushing the work down to the I/O channel, where it acts on the data as it comes off disk, and only sends the data that matches the query across the server inter-connect.
Because IBM Netezza operates on the data as it streams off of the disks, it is not plagued data not fitting in memory. In an in-memory database, the entire database must fit in memory, and any temporary tables, scratch tables, overflowed sorts, etc. must also fit entirely within the memory the system has been configured to use. As data volume grow, and the “warehouse” becomes more and more visible, there will be more users, and as the number of users increases, the chance that 2, 3, or even 20 or more people will be running queries at the same time, and may be sorting or ordering data using temporary tables. Or, a new user may point and click their way around Business Objects and build a huge query that joins 5 or 10 tables. All of these will cause large temporary tables to be created, all at once, potentially filling up the memory on the system.
So, what happens if an in memory databases uses up all of the system memory (RAM) on a server? Well according to an SAP employee, the system will crash. Does that seem like a good solution to you?
Because of issues like this, most in memory databases are used to run reports and/or analytics on a sample of the data, to restrict the data size to something manageable and affordable. And, as I explained a couple weeks ago, using data sampling for analytics is a bad idea, that will severely limit, if not completely eliminate the value you can derive from your data. Data that you store (some where else, since it cannot fit in memory) and do not analyze is a liability, only when you analyze your data does it become an asset. IBM Netezza ensures that all of your data is a business asset that you can derive the most value from.
There are a number of other points that I want to make on “in memory” databases, but I’ll talk about them in my upcoming posts.