Wednesday, October 19, 2011

Not All In-Database Analytics Are Created Equal

Leading organizations differentiate themselves by analyzing massive amounts of interrelated data to predict business outcomes. High-volume, complex data analytics requires detailed data (not summaries) because influencing an individual’s actions requires that you track and analyze their unique interactions with your company.  Traditional analytic systems and traditional databases cannot meet today’s need for predictive analytics on massive amounts of data.

It’s easy to overlook data movement when thinking about analytics and analytic performance. However, as data volumes increase, the simple act of moving data to an analytic engine dramatically decreases overall performance. To illustrate, a major credit card company takes two weeks to build its analysis files, while an insurance company needs six days for the same task. For many large-data analyses, moving data consumes far more time than all other activities combined. Below I compare traditional systems, composed of physically separate database and compute servers, with various forms of contemporary analytic data warehouses, and note the types of analytics typically available on each.
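To get a feel for why data movement dominates, here is a back-of-envelope calculation. The numbers (a 50 TB extract, a 10 Gb/s link, 70% effective throughput) are illustrative assumptions, not figures from the companies mentioned above:

```python
# Back-of-envelope: time just to move an analysis file over a network.
# All numbers below are illustrative assumptions.

def transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Hours to move `data_tb` terabytes over a `link_gbps` gigabit/s link."""
    bits = data_tb * 8e12                      # terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600

# A 50 TB extract over a 10 Gb/s link at 70% efficiency:
print(f"{transfer_hours(50, 10):.1f} hours")   # prints "15.9 hours"
```

Even before any modeling or scoring begins, the extract alone costs the better part of a day, and that cost is paid again every time the analysis file is rebuilt.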

Recognizing that database servers were not built for complex analytics, vendors paired a compute server with the database server. These traditional two-server analytic systems extract data from the database (either the data warehouse or the transactional database system) and move them onto another server, where they perform model building, model validation, and scoring processes. Moving a big data set from the database server to the analytic server is very inefficient and results in a large lag between the time data are created and their analysis. Beyond performance, this architecture has many challenges, including increased network load, overhead for analysts, demand for redundant infrastructure, data governance and synchronization issues, and data security concerns.

The next generation of analytic servers was driven by the need to minimize data movement. Most data warehouse vendors have built what they call in-database analytics. The main innovation was collocation of the compute and database engines to eliminate the need to copy data to another server for analysis. However, data must still be moved from disk to memory before the real analytics can happen. Moreover, the data transfers are not optimized – these systems must move entire tables even if only a subset of rows and columns is necessary to perform the analysis. And in many cases these data warehouses only offer SQL-based in-database analytics like MIN, MAX, AVERAGE, and MEDIAN.
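The idea behind SQL-based in-database aggregation is that the database engine computes the summary itself, so only one result row crosses the boundary to the application. A minimal sketch using Python's built-in sqlite3 as a stand-in for a warehouse engine (and note that even here, MEDIAN is not a standard SQL function — which is part of the point about how limited these built-in analytics are):

```python
import sqlite3

# SQL-based in-database aggregation: the engine computes MIN/MAX/AVG
# itself, so only a single summary row is returned to the application.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txns (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO txns VALUES (?, ?)",
                 [(1, 10.0), (1, 25.0), (2, 5.0), (2, 40.0)])

row = conn.execute(
    "SELECT MIN(amount), MAX(amount), AVG(amount) FROM txns"
).fetchone()
print(row)  # (5.0, 40.0, 20.0)
```

Simple aggregates like these push well into SQL; predictive model building and scoring generally do not, which is why vendors had to go further than this.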

In terms of performance, the in-database stream processing architecture rises to the top. This architecture, found in the IBM Netezza data warehouse, eliminates the need to copy data to memory: data are analyzed as they stream off disk, minimizing data movement and data volume prior to scoring. Data minimization is accomplished with three capabilities: zone map technology and two types of filter technology. Zone map acceleration exploits the natural ordering of rows in a data warehouse to avoid scanning rows that are not relevant to the analytic query. Next, project and restrict engines eliminate columns and rows, respectively. The IBM Netezza data warehouse appliance delivers unbeatable performance because it performs complex analytics as data streams off disk.
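A toy sketch of these three data-minimization ideas (purely illustrative, not Netezza's actual implementation): a zone map skips whole blocks whose stored min/max range cannot match the predicate, the restrict step drops non-matching rows, and the project step keeps only the columns the query needs:

```python
# Toy model: each block records the min/max of its date column (zone map)
# and holds rows of (order_date, customer_id, amount, region).
blocks = [
    {"min_date": 20110101, "max_date": 20110131,
     "rows": [(20110105, 1, 100.0, "east"), (20110120, 2, 80.0, "west")]},
    {"min_date": 20110201, "max_date": 20110228,
     "rows": [(20110210, 3, 55.0, "east"), (20110225, 1, 95.0, "east")]},
]

def scan(blocks, date_from, date_to):
    out = []
    for b in blocks:
        # Zone map: skip the whole block if its date range cannot match.
        if b["max_date"] < date_from or b["min_date"] > date_to:
            continue
        for r in b["rows"]:
            if date_from <= r[0] <= date_to:   # restrict: filter rows
                out.append((r[1], r[2]))       # project: keep 2 of 4 columns
    return out

# Query February only: the January block is never scanned at all.
print(scan(blocks, 20110201, 20110228))  # [(3, 55.0), (1, 95.0)]
```

The net effect is that only the qualifying rows and columns ever flow downstream to the analytics, rather than entire tables.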

Many large-data analytics processes lack performance due to data movement. Traditional two-server solutions must move data within the database server and then over a network to the analytic server. General purpose data warehouses eliminate data movement across a network by collocating database and analytics servers, but are still hampered by copying data from disk to memory before scoring. High performance analytics servers take advantage of a stream processing architecture to eliminate unnecessary data movement. In other words, by using the IBM Netezza stream processing architecture, the credit card and insurance companies could reclaim most of the two weeks and six days, respectively, that they now spend building analysis files. I hope this blog post helps you see that not all in-database analytics solutions are optimized for large-scale data. I welcome your feedback and I’m happy to field questions.

Friday, June 17, 2011

If you had to go to the store to buy something important and you had to choose between a store with one cashier or one with 1000, which would you choose?

You might think this analogy is a little crazy when thinking about computers and computer software, but let me explain a little more. 

In today’s global, dynamic environment, agile and pervasive analytics and business intelligence are critical to success. Whether you are a retailer who wants to cut shrinkage by finding employees who are using discarded receipts to process returns without merchandise, an insurance firm that wants to limit liability by not insuring too many properties within flood regions, a financial firm that wants to detect fraudulent charges quickly, or a line-of-business executive who wants to find all of the open opportunities in your territory, you want access to the information you need, when you need it. You do not want to have to wait in line behind everyone else in your organization just to run your query or report.

But wait is what you will have to do if you buy one of the new HP/Microsoft data warehouse appliances. Over the past couple of months Microsoft and HP announced three new data warehouse appliances and have given them cool names like the “Business Decision Appliance” and “Business Data Warehouse Appliance”. Unless you are the only employee in your business, these appliances are not for you. Microsoft’s and HP’s own websites say that these appliances are optimal for “light concurrency”.

Why spend your time loading data into one of these appliances, only to wait in line for the results you need to run your business? If you buy a data warehouse appliance that supports only light concurrency, you’re stuck waiting in line to get answers to your questions, which can often take over 24 hours to run. If you want a high-concurrency appliance, where these same queries run in minutes, consider Netezza.

IBM Netezza’s high-performance data warehouse appliances are purpose built to make advanced analytics on data simpler, faster, and accessible to everyone. These data warehouse appliances are designed specifically to allow people across the enterprise to run complex analytics on very large data volumes, orders of magnitude faster than competing solutions. Customers are able to easily and cost-effectively scale their business intelligence and analytical infrastructure to leverage deeper insights from growing data volumes throughout the organization.

Nielsen gathers information from multiple sources and offers its clients a complete understanding of what consumers watch, listen to, browse, and buy. Its analytics infrastructure is based on Netezza, and its end-user clients run close to a million queries a day, 50 times faster than on their previous systems. As Nielsen’s Senior VP of Application Development has said, “when you’re able to get deep insights in 10 seconds instead of 24 hours, you can do remarkable things with the business”.

Sunday, May 15, 2011

Oracle Throws Another Jab at HP

If you are running your business on HP Itanium servers and Oracle software, what can you do?  Do you have to move to Oracle/Sun servers and Oracle Exadata?

Oracle drops Itanium support at customers’ expense
In 2008 Larry Ellison announced the new Oracle Database Machine and Exadata Storage Servers based on HP hardware. In January of 2010, after its acquisition of Sun, Oracle immediately dropped support for HP hardware and told all customers that they had to move to the Sun/Oracle Exadata system. In March of 2011 Oracle threw another jab at HP by announcing that it was stopping all future support for its software on Itanium processors, the foundation of many of HP’s most popular servers.

You decide what software and hardware you want to run (not Oracle)
Because of Oracle’s track record with HP servers and storage, many customers are concerned about the future of the systems and applications that they are using to run their businesses. Oracle would have these customers believe that they need to move the application servers, databases, etc. to Sun hardware so that they can continue to run their applications, but that is absolutely not true, and I’ll tell you why.

Fear Not the Oracle
IBM WebSphere and DB2 software both run on Itanium processors. WebLogic works great with DB2 and WebSphere supports Oracle Database. So you have a number of options, and none of them require you to immediately rip and replace all of your servers. If you are running Oracle WebLogic or the Oracle Database on an Itanium based server, you could:
  1. Replace WebLogic with IBM WebSphere which supports Itanium processors and continue to run on the same servers
  2. Replace Oracle Database with DB2 which supports Itanium processors and continue to run on the same servers
  3. Replace WebLogic with IBM WebSphere and Oracle Database with DB2 and consolidate them onto a single Power7 server,  reducing your data center footprint and increasing performance

You don’t even need to do this in a big bang approach. You choose which part of your application landscape to leave on HP Itanium and which part you might consider moving to another platform. You choose which application server and database to use and what platforms you want to run them on. Most importantly, you can make the right moves and not disrupt your entire business.  (Read the executive take on these options here). 

And if you must change server platform, consider IBM
If you are being forced to change server platforms, then consider IBM as one of your options. IBM offers the industry’s leading server platforms (Sun comes in a distant and dwindling third place in market share). When you combine IBM’s commitment to meeting its clients’ needs with its pace-setting performance and reliability, you provide your organization with the best option for future stability and growth. Running IBM software on IBM servers is the best option of all!

IBM can help take the pain away
Migration to IBM WebSphere and DB2 is painless and very low risk. Even if you were to move to an x86-based HP or Sun server and not change any of the software, you would need to recompile and rebuild your application. Take a look at your options and the cost and risk associated with each, and then look at the track record of the companies involved. Assess the predicament you are in and ask how you got here. It looks to me like Oracle unilaterally put your company into this situation. It’s time to distance yourself from the culprit.

Tuesday, May 10, 2011

When IBM Innovates, Everyone Benefits - Oracle Makes Everyone Pay

When Oracle beta customers were testing Oracle Database 11gR2, many were praising the newfangled columnar compression that helped reduce their databases to a more manageable size. Many of these customers ran the beta code on their existing test systems with their own test data to see what benefits they would get when they upgraded to the latest release.

Imagine their surprise when Oracle thanked them all for their loyalty and testing, then restricted the use of hybrid columnar compression to "Exadata only" systems, even though the beta showed that this capability is built into the Oracle Database software and has no reliance on Exadata at all.

Oracle will even let you back up a tablespace on Exadata that contains columnar-compressed data and restore it to a non-Exadata server. You cannot query or access this data after the restore, but if you alter the table and uncompress it, you can. This also shows that the Oracle Database can read and understand the columnar-compressed data.

IBM on the other hand makes enhancements available to existing customers, on their existing platforms. When index, XML, and temporary table compression were introduced in DB2 9.7, all existing DB2 customers could immediately take advantage of these enhancements when they upgraded to this release.

Who would you rather do business with?  The company that innovates and makes these enhancements available to all customers, or the one that adds features, but restricts access to only those that buy new hardware and specialized software licenses that are not even needed for the feature? 

Monday, May 02, 2011

IBM benchmarks against today's latest and greatest. Oracle benchmarks too - against yesterday's best

Be careful what you believe – Google is your friend

Before you take what you read to heart, check the facts. In the past month or so, Oracle has been making a lot of noise about LinkShare’s migration from a DB2 data warehouse to Oracle Exadata. While LinkShare did not explicitly mention that it had been running on an older DB2 system (with older hardware), the articles do say that “A Google search of past LinkShare coverage turned up several article references to a conventional DB2 database deployment in a clustered Linux environment.”

If you read Oracle’s press releases when they discuss the performance of Exadata, you would be led to believe that “Exadata met that benchmark out of the gate”. But if we dig a little deeper, Google shows us that LinkShare employed the services of the Pythian Group to help with the migration. And the Pythian Group provided “LinkShare with consulting and technical expertise for the planning, configuration, deployment, management, administration and ongoing operational support of their migration project. This includes re-engineering the database, adjusting the data model, redefining table structures, creating new indexing schemes and re-writing and tuning SQL queries, among other tasks.”

I might be in the minority here, but “out of the gate” does not mean after paying a highly skilled consulting team for months to re-engineer the whole database schema to work on Oracle RAC / Exadata and to rewrite and tune queries that would not otherwise run fast enough.

At IBM we know that our workload optimized systems are the easiest to use and the fastest in the industry. We compare ourselves to the latest and greatest competitive offerings all the time, not to five-year-old systems running software that is three or more releases behind the times. Check out this link for an interview with Steve Mills where he discusses one of these tests.

In my opinion, the proof is in the pudding. Do not trust press releases, and do not let vendors run benchmarks on their own site. Always test with your own data and your own workloads before you believe the claims.

Wednesday, March 23, 2011

Oracle Drops Development on Itanium

A couple of weeks ago I wrote about Oracle’s price increase on Itanium servers. Today Oracle announced it will stop development on Itanium processors altogether.

If you are running Oracle on Itanium (whether you are using HP-UX or Linux), you have an option: move to DB2, stay on your Itanium-based servers, pay less for database licensing, and run faster.

Check out my previous post on saving money by moving to DB2 here.

DB2 9.7 offers out of the box Oracle PL/SQL and SQL*Plus compatibility so that you can simply move your application off of Oracle and onto DB2.  You can read more about this capability here.

Tuesday, March 01, 2011

Are you paying too much for Oracle - You bet...

Today, IBM is launching a new IBM DB2 vs. Oracle Database advertising campaign. This campaign will run in both print and online media.

I like it :-)

Monday, February 28, 2011

Why You Need to Partition the Database and Applications To Scale with Oracle RAC/Exadata

On Friday I talked about the fundamental difference between Oracle RAC / Exadata and DB2 pureScale. And now I want to dive deeper into why RAC applications need to be cluster-aware to perform and scale well.

Let’s use a small example to show the differences. Let’s consider a 2-server (node/member if you are RAC or pureScale) cluster and a database that is being accessed by applications connecting to these servers.

In the RAC case, if a user sends a request to server 1 to update a row, say for customer Smith, server 1 must get that row from the database into its own memory before it can work on that row (i.e., apply the transaction). Then another user sends a request to server 2 asking it to update the row for customer Jones in the database. First server 2 must read that row into memory, and then it can work on it. So far there are no issues, but let’s go on.

Now what happens if another user wants to update the data for customer Jones but is routed to server 1? In this case server 1 doesn’t have the row; it only has the row for customer Smith. So server 1 sends a message over to server 2 asking it to send the row for customer Jones. Once server 1 has a copy of the row for customer Jones, it can then work on that transaction. Now server 1 has both rows (Jones and Smith), so if a transaction affecting either customer comes to it, it can be processed right away.

The problem now is that any transaction (for customer Smith or Jones) that goes to server 2 requires that server to go to server 1 to get the resource since it has no rows that it can work on directly.

As transactions are randomly distributed between the two servers (in order to balance the workload), the rows for the customers must be sent back and forth between the two servers. This results in very inefficient use of resources: too much network traffic and a lot of messages between the two servers to coordinate access to data. This limits the scalability of a RAC cluster and also hurts performance. To make RAC scale you have to find the bottlenecks and remove them. In most cases the bottleneck is too much data being shipped back and forth between nodes (difficult to find in the first place, because you now have to look in many different places across the cluster to find the hot spots). To solve the problem you have to repartition your application and your database to make it scale.
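The ping-pong effect above can be sketched with a toy simulation (purely illustrative; this is not Oracle's actual cache-fusion protocol). Each row lives in exactly one server's memory at a time, and a transaction routed to the other server forces a transfer:

```python
import random

def count_transfers(n_txns: int, n_rows: int = 2, seed: int = 42) -> int:
    """Count cross-server row transfers in a toy 2-server shared-disk cluster."""
    random.seed(seed)
    owner = {}                              # row -> server currently holding it
    transfers = 0
    for _ in range(n_txns):
        row = random.randrange(n_rows)      # which customer's row is touched
        server = random.randrange(2)        # random routing, for load balance
        if owner.get(row, server) != server:
            transfers += 1                  # row must be shipped between servers
        owner[row] = server
    return transfers

# With random routing, roughly half of all transactions require a transfer:
print(count_transfers(10_000))
```

The fraction of transactions needing a transfer stays near one half no matter how many transactions you run, which is exactly why the fix is to partition the workload so transactions for a given row always land on the same server.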

DB2 pureScale, on the other hand, provides near-linear scalability out to over 100 members (servers) with no partitioning of the application or the database.

Friday, February 25, 2011

Hey Oracle Customers - Moving to DB2 and pureScale is easier and cheaper than moving to Exadata

So, what is the best upgrade path from a single instance of Oracle?

Oracle says moving to Exadata is as easy as 1-2-3!
If you are an existing Oracle customer, you have probably been getting a lot of pressure to move to Oracle’s shiny new toy, Exadata. You have probably been hearing that you can consolidate all of your databases onto a single Exadata system. But, it is not as easy as it seems!!!

The Oracle upgrade is harder than just moving data
If your existing applications are running on a single instance of Oracle (i.e. not on Real Application Clusters – aka RAC), then there is a lot more involved than simply moving your data. In order to get good performance on Oracle RAC (and Exadata is an Oracle RAC cluster with specialized I/O servers) you need to modify your database schemas and applications to make them RAC-aware. 

DB2 pureScale makes it quick and easy to upgrade
DB2 pureScale on the other hand provides transparent application scalability, so you can quickly move your data and applications to DB2, and not have to worry about making changes to the schema and application to make them cluster aware.

The difference between RAC and DB2 pureScale
The reason that RAC requires cluster awareness and DB2 pureScale does not is due to the fundamental differences in their architectures. While both use a shared disk mechanism for scale out, that is the only real similarity. Oracle uses a distributed locking mechanism in RAC, while DB2 uses a centralized locking mechanism in pureScale.

Actual work involved to move to Exadata versus pureScale

Task                                               Oracle Exadata    DB2 pureScale
Move database and schema                           Days to weeks     Days to weeks
Re-partition the database                          Weeks to months   Not required
Modify the application to partition data access    Weeks to months   Not required
SQL remediation                                    Couple of days    Couple of days
Test and tune                                      Multiple weeks    Multiple weeks
Total time                                         Months            Weeks

The data movement, test and tuning time will be similar, but the time to “fix” the application will be significantly longer with Oracle RAC and Exadata than with DB2 pureScale.

On Monday I will dig into the details of why you need to partition your database and application to make it RAC-aware.

Tuesday, February 15, 2011

HP Itanium Customers can save money by switching to DB2

Oracle expects HP Itanium customers to pay more to run the Oracle database software, but reduced the price for its own Sun hardware.

Customers can save money on license and maintenance fees by moving their applications off Oracle to DB2 9.7 or SQL Server.

What is the most cost effective way to move off the Oracle database?

In December Oracle hiked the price of their database software on HP Itanium based servers, leaving customers with two choices: Pay up, or pay to move to a different database. Since HP has partnered so closely with Microsoft over the past few years, you would think that SQL Server might be a natural place for these customers to move. But many of these customers are running Linux, not Windows, and require an enterprise ready database server. In addition, the migration from one database to another has been a long and arduous path that can cost as much or more than customers might save in the lower license costs.

With DB2 9.7 for Linux, UNIX, and Windows, the world of database migrations has undergone a paradigm shift. DB2 9.7 can run Oracle PL/SQL and Sybase T-SQL with little to no change because it includes a “compatibility layer”. But this compatibility layer is not a translation; the support is built directly into the DB2 database engine itself, so there is no loss of speed due to translation.

Now, a customer that wants to consolidate multiple databases, or move off of a database due to skyrocketing costs, can quickly evaluate its applications to determine what statements, if any, might need to change to run on DB2. The customer can then quickly move the database schema and data into DB2, turn on DB2’s self-tuning memory to tune the system, and start running on DB2.
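That first evaluation step can be as simple as scanning the application's SQL for Oracle-specific constructs worth reviewing. The sketch below is a hypothetical helper, not an official DB2 migration tool; the pattern list and sample statements are illustrative:

```python
import re

# Hypothetical pre-migration scan: flag Oracle-specific constructs in
# application SQL that may still deserve review, even with DB2 9.7's
# compatibility layer enabled. Patterns here are illustrative only.
ORACLE_PATTERNS = {
    "CONNECT BY":      r"\bCONNECT\s+BY\b",   # hierarchical queries
    "(+) outer join":  r"\(\+\)",             # old-style Oracle outer joins
    "ROWNUM":          r"\bROWNUM\b",
    "DBMS_* package":  r"\bDBMS_\w+",
}

def scan_sql(statements):
    """Return (statement_index, construct_name) pairs worth a second look."""
    findings = []
    for i, sql in enumerate(statements):
        for name, pat in ORACLE_PATTERNS.items():
            if re.search(pat, sql, re.IGNORECASE):
                findings.append((i, name))
    return findings

stmts = [
    "SELECT * FROM emp WHERE ROWNUM <= 10",
    "SELECT e.name, d.name FROM emp e, dept d WHERE e.dept_id = d.id(+)",
    "SELECT count(*) FROM orders",
]
print(scan_sql(stmts))  # [(0, 'ROWNUM'), (1, '(+) outer join')]
```

A scan like this gives a rough inventory of the work before the migration starts; plain SQL such as the third statement typically needs no attention at all.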

Not only does DB2 9.7 drastically lower the cost of migration and allow you to migrate in days or weeks rather than months, it also reduces risk: very little code needs to change, so existing test cases change very little, and there is little chance of introducing bugs because developers keep coding in the tools and language they are used to.

Don't just take my word for it; analysts at Forrester and Gartner agree.