Oracle Security in the News - Again
There is a lot of discussion lately about Oracle’s attention (or lack of attention) to security. In his blog,
George Ou believes that Oracle is in denial about their security issues. On the 31st of August 2004, Oracle released a security update (
Alert 68 info from Oracle) to address a large number of security flaws in their database server product. The patches had been a long time in coming even one of Oracle’s staunchest backers
Lisa Vaas says that Oracle customers fully expected that these patches would actually fix the problems. Unfortunately this is not the case, and a lot of people’s secret information is being left unsecured because of this.
In an open letter to Oracle,
David Litchfield expresses his opinion on the patch set, and what is not included. And just last week Oracle security was again in the news when Joshua Wright of the SANS Institute and Dr Carlos Cid of the Information Security Group at the Royal Holloway, University of London published a paper stating that
Oracle’s Password System makes it "straightforward" to recover user's password.
I know that no software is perfect, but I would hope that when a company is told that their software can leave user’s data vulnerable, that they would do everything in their power to fix the problems.
Illuminata says - DB2 is The Database for the Masses
In this
report dated October 4, 2005, analyst Jonathan Eunice at Illuminata shows why DB2 stands alone for enterprises' data coordination, integration and analytic requirements.
Turning off Comments Since Most are SPAM
It seems that most comments on my blog so far are SPAM, so I am going to hide the comments for now. If you have comments about what I have said, or suggestions for future topics for me to write about, please feel free to
email me.
Neat Freeware DB2 Monitoring Tool
Here is a link to a neat
DB2 performance monitoring tool that shows active connections with lock information, SQL statements in the statement cache, database snapshots, database object information, database history, etc. This tool can help resolve locking problems, sorting issues, and identify the worst performing queries
Survey Says ... DB2 Provides the Best Development Tools
Recently
Evans Data Corp. conducted a survey in which they asked database developers to rate the tools of each of the top databases that they use. Eight databases (both open source and proprietary) were evaluated on the basis of eleven different tools (such as modeling tools, debuggers, audit tools, and so on) as part of Evans Data’s new Developers' Choice series.
DB2 got the best overall scores, ahead of Oracle AND Microsoft SQL Server. You can find
the report here.
DBA Checklist Part 6 - Monthy Activities
Look for Indicators of exceptional growthReview your tables and table spaces to see how much they have grown in the past month. By knowing how fast the tables and table spaces are growing, and how much space is still available, you can detect potential space issues before they happen.
You can retrieve the size of the table space and the amount of space available using the statement below.
select substr(tablespace_name,1,120) as TBSPC_NAME, used_pages, free_pages, from table (snapshot_tbs_cfg ('sample', -1) ) as snapshot_tbs_cfg You can examine how big each of your tables is by looking at the system catalog tables. As long as your statistics are current, this information will be accurate. To get the size of your tables use the statement:
select tabname, npages from syscat.tables where tabname not like ‘SYS%’NOTE: If statistics have not been captured for a table, it will have a value of –1 for npages.
If you create a history table to store this information over time so you can examine the space usage for your tables and table spaces over time.
Project Future Performance based on projected growthYou should compare the information you have been gathering on the system level CPU, memory, network, and disk utilization as well as the DB2 object information that you have been gathering to identify trends that could lead to contention or a shortage of any of these resources in the future.
Based on your analysis of the above information, you can then plan for these situations before they happen and take actions to prevent these situations from occurring.
DBA Checklist Part 5 - Weekly Procedures
Look for new objectsIt is important to know if people are creating new tables, indexes, stored procedures, etc. in your production database. New objects typically indicate that a new application has been installed on the server and any new applications and/or objects will impact the operational characteristics of your system.
In addition, new objects will consume space within the database, so it is important to identify these objects before they grow too large and could potentially fill a table space. If these objects are not created by a DBA, they very likely may have been created in the wrong table space, which can cause space and/or performance issues.
There are a few alternatives available to check for any new objects within the system:
- Run db2look and write the report to a file every week.
- Every week check for differences between the new output and the previous week’s output.
- Select object names from SYSCAT.TABLES, SYSCAT.INDEXES, SYSCAT.PROCEDURES
- Every week check for differences between the new output and the previous week’s output.
For any differences, you should then determine the CREATOR of the object from the catalog table and track the information back to the person that created the object.
Look for new or changed applicationsOnce you have optimized your database based on your current workload, there is nothing more frustrating than getting a call that the database is not performing well, and finding that the poor performance was caused by a new application, or changes to existing applications that no one told you about. Unfortunately, this happens all to often. By monitoring your database for new and/or changed applications you can hopefully detect these changes before they cause performance problems.
To look for new applications you can use the
list applications show detail command. If you redirect the output of this command to a file and keep these files for a period of time, you can examine the files every week to see if a new application name suddenly appears in the output.
To look for changes applications you can examine the SQL that is running on your system over time, and look for new SQL that has not been run previously. To do this you can create a table as follows:
create table SQLstmts ( stmt varchar(200), tstamp timestamp not null with default)You can then retrieve the SQL statements from the current package cache and insert them into a table for analysis using the following statement:
insert into SQlstmts (stmt)select substr(stmt_text,1,200) as SQL_Stmtfrom table (snapshot_dyn_sql ('sample', -1) ) as snapshot_dyn_sqlYou can then examine this table for any SQL statements that have not been executed previously using the statement:
select distinct stmt, count(stmt),tstamp from sqlstmtsgroup by stmt, tstampIn the output of this statement, any statement with a count of 1, and the timestamp column showing the current date is one that has not been run previously.
Look for tables and indexes needing REORGAs you insert, update and delete rows in your tables, the data in the tables may need to be reorged to:
- Re-cluster the data in the order of your most important index
- Remove free space interspersed throughout the table
- Remove over flow records
The
reorgchk tool will examine your tables and indicate which tables may need to be
reorged. You can run the
reorgchk tool against a single table, all user tables, all tables in a specific schema, or all system catalog tables. You can also indicate whether the tool should use the current statistics in the catalog tables as a basis for its examination, or gather new statistics first.
To run the
reorgchk tool against all of your tables, and ensure you are using the current statistics, you can use the command:
reorgchk update statistics on table userYou should redirect the output of this command to a file for further examination.
When examining the output of the
reorgchk tool you should examine the F1, F2 and F3 columns for your tables, and the F4, F5, F6, F7, and F8 columns for your indexes. If there is an asterisks (*) in one of these columns, that indicates that DB2 has calculated that your current table and/or indices currently breach that threshold.
It is important to note that for tables, if you see an asterisks in any of the columns, then you typically should
reorg the table. However, since many tables have more than one index, by definition if one of them is 100% clustered, then the other indices will not be clustered. Therefore you need to examine the index portion of the
reorgchk output in more detail and consider all of the indexes on the table when determining whether or not to
reorg the index.
The calculations for the measures used by
reorgchk are:
F1 examines the percentage of rows that are overflow records. When this is greater than 5% there will be an asterisks (*) in the F1 column of the output.
F2 examines the percentage of used space on the data pages. When this is less than 70% there will be an asterisks (*) in the F2 column of the output.
F3 examines the percentage of pages that contain data that contain some records. When this is less than 80% there will be an asterisks (*) in the F3 column of the output.
F4 examines the cluster ratio, i.e. the percentage of rows in the table that are in the same order as the index. When this is less than 80% there will be an asterisks (*) in the F4 column of the output.
F5 examines the percentage of space that is used on each index page used for index keys. When this is less than 50% there will be an asterisks (*) in the F6 column of the output.
F6 examines the number of keys that can be stored on each index level. When this is less than 100 there will be an asterisks (*) in the F6 column of the output.
F7 examines the percentage of record IDs (keys0 on a page that have been marked as deleted. When this is more than 20% there will be an asterisks (*) in the F7 column of the output.
F8 examines the percentage of empty leaf pages in the index. When this is more than 20% there will be an asterisks (*) in the F8 column of the output.
When reorganizing a table you can optionally specify which index that DB2 should cluster the data based on. To
reorg the ORG table based on the ORGX index, you would use the command
reorg table org index orgx Look for tables and indexes needing RUNSTATSThe DB2 optimizer uses the database statistics to determine the optimal access plans for your SQL statements. When you make significant changes to the amount of data in your tables, or to the data within the table, you should use the
runstats tool to capture new statistics and store them in the system catalogs. You should also make sure that you capture statistics for any new table or index.
To capture statistics for the ORG table, and its index described above, you can use the command
runstats on table <schema>.org with distribution and detailed indexes allYou can check for any tables without statistics using the statement:
select tabname from syscat.tables where stats_time is nullYou can check for any indices without statistics using the statement:
select indname from syscat.indexes where stats_time is nullYou can also look for tables and indexes that have not statistics that are over 30 days old using the statements:
select tabname from syscat.tables where stats_time < current timestamp – 30 daysselect indname from syscat.indexes where stats_time < current timestamp – 30 daysArchive all Alert Logs and DB2DIAG.LOG files On Windows you can save the event log to another file in the Event Viewer by selecting the Action menu, and then choosing the Save Log File As… option. You can then purge the entries from the log by selecting the Action menu, and then choosing the Clear All Events option.
For the DB2DIAG.LOG file as well as the administration notification log file on Linux and UNIX, you should compress these files, and name then with the current date in the file name as well.
On Linux or UNIX, you can tar the notify.* and db2diag.log files together, and then use either gzip or compress to reduce the size of the resulting file.
Check for Software Updateshttp://www.ibm.com
DBA Checklist Part 4 - Daily Tasks
Verify that all instances are up and runningThis can be done in a number of ways:
- Use Health Center
- export/set DB2INSTANCE=instancename
- and run db2start
- Attach to all instances
- On UNIX or Linux, run ps –ef | grep db2sysc
- Verify there is one db2sysc process for each instance
- On Windows, check that the service for all DB2 instances are started
The attach method can be easily scripted as long as all of the instances (i.e. NODEs) are cataloged on your workstation.
To use the ps command on UNIX and Linux you will first need to telnet into each of the servers.
Verify that all databases are active and/or consistentThe definition of consistent can be confusing, and how it is reported by the GET DB CFG command often causes questions.
By definition, a database is consistent when all committed transactions have been written to disk, and any uncommitted transactions are not on disk. When a database is running with applications connected to it, there will be transactions that have changed pages, that may have committed, but the changed pages may not have been flushed from the buffer pool to disk. Also, there may be transactions that were rolled back but their changes were flushed to disk. In this case the GET DB CFG will report that the database is inconsistent, but in fact it is fine. Therefore simply getting the database configuration information for all databases is not sufficient.
A good method, because it will also make inconsistent databases consistent and therefore reduce the times for future connect requests, is to connect to all databases and make sure that you can get a connection. This also can be easily scripted, as long as all of the databases are cataloged on your workstation.
Look for any new Notification Log and/or DB2DIAG.LOG entriesIt is also important to make sure that there were no problems that occurred over night. In Version 7 all errors and messages were written to the DB2DIAG.LOG. Because of this, many of the messages in the log file were not useful to most DBAs. In Version 8 the messages have now been separated into two logs. The notification log contains messages intended for DBAs and the DB2DIAG.LOG file is intended for the DB2 service team in the event that you need to report a problem with DB2.
On Windows, the Notification log is written to the Application Event Log and can be viewed using the Event Viewer by choosing the Application log and looking for events written by the application named DB2.
On Linux and UNIX the log is written to a file named <instance_ID>.nfy that is located in the directory specified by the DIAGPATH instance level configuration parameter. To view the notification log you can:
- Connect to each of the servers using telnet or remote terminal services.
- For each instance, go to the DIAGPATH directory.
- At the command prompt:
- Run the tail command on the notification log to dump the last 100 entries
- Edit the file and examine the most recent entries at the bottom of the file.
Check that the previous night’s backups were successfulThere is nothing worse than having a problem on your system, and deciding to restore the most recent backup, and finding that the backup was not taken or is not complete. Therefore it is important to check that the previous night’s backup(s) were successful and that they have been stored in a safe location.
The first step is to ensure that the backups were successful. This can be done using the List History command as follows:
list history backup all for <db_name>This can be scripted so that is run for all databases after the backups complete, and the report emailed to you. You can then simply verify the report each morning.
In the event that the whole server goes down for a sustained period of time, you may need to revert to your disaster recovery plan, and restore the database to another server, and maybe in another location. Therefore it is important that the backup images be stored in a safe site, not only on the server where the backup is taken. This can be easily accomplished by copying the backup image to a LAN drive, an NFS mounted drive or to a tape device.
Verify database logs have been archived successfullyIf your database is read only, or can be rebuilt from scratch easily, you likely do not have log retain enabled so you can skip this step. However for those transactional databases where you can not lose any committed transactions, it is important to make sure log retain is enabled, and that the logs are being archived successfully so that the database can be rebuilt and the transactions replayed in the event of a disaster.
While recovery may be the primary reason for verifying the logs are being archived successfully, there is another important reason. If the logs are not archived, they will remain in the LOGPATH. Since the LOGPATH is normally on a file system with a set size, if the log files are not being archived, as new logs are created the file system will be used up. Eventually the file system may fill up and when this occurs DB2 will be unable to create any more log files and will therefore stop.
When a userexit is called to archive a log file, it will write information to two places. The first place will be the userexit audit log where an entry will be written for every archive log request received by the userexit. In the event of an error during the userexit processing, a message will be written to the userexit error log file as well. These log files are in the LOGPATH and are named ARCHIVE.LOG and USEREXIT.ERR respectively.
To examine these logs you can easily write a script to grab the last 50 to 100 lines from these files (using the tail command) for all instances and email them to you. Then you can examine them along with the recovery history information each morning.
Study DB2Nothing is more valuable in the long run than that a DBA who is widely experienced, and as widely read as possible. This study should include DBA manuals, magazines, newsgroups and mailing lists.
The comp.databases.ibm-db2 news group is a great place to learn from, and share information with, your fellow DBAs.
DBA Checklist Part 3 - OS Tools
Operating System ToolsDatabase tools / snapshots by themselves typically will not give you the complete picture of your system performance. For example, a database may be 100% optimally tuned, but will not be able to perform well if I/O contention is occurring on the server. Therefore it is important to look at the complete picture to make sure the entire
system is performing well.
Below is a list of Operating System tools and what they can help you to monitor.
ps eww <pid> list the environment variables for the specified process
sar –r <interval> <number> view paging/swapping information
sar –u <interval> <number> view CPU activity
sar –P ALL <interval> <number> view CPU activity for each CPU individually for an SMP machine.
vmstat <interval> <number> view process, CPU, memory usage statistics
vmstat –s view system summary information
iostat –d view disk usage info, since the system was last rebooted
iostat hdisk0 <interval><count> view disk usage info for a specific disk, since the system was last rebooted
svmon –I <interval> <count> Capture info on the virtual memory usage
filemon detail file access information for the system
filemon –o <output.file> -v Monitor filesystem performance, write to output file
trcoff Turn off the trace above.
ipcs –a –m –q List all IPCs (memory Q’s, and shared memory)
ipcrm –q <QID> Remove the specified message queue
ipcrm –m <Memory ID> Remove the specified shared memory segment
slibclean Removed unused modules from the kernel and library memory
tprof –p ksh –x sleep 100 Monitor system for 100 seconds
tprof –p db2sysc –x sleep 100 Monitor db2 engine for 100 seconds
dump –H <library> Dump loader headers for the executable
dump –a <library> Dump archive headers
xmperf A graphical system monitoring tool for AIX Performance toolbox.
Vmtune Tune kernel memory usage
Schedtune Set scheduler parameters, i.e. timeslicing, etc.
lsdev -Cc memory Display the memory on the system
genkld view the list of libraries loaded into shared segments
DBA Checklist - Part 2 DB2 Tools
In Version 8, DB2 introduced two new features to help you monitor the health of your DB2 systems: the Health Monitor and the Health Center. These tools add a
management by exception capability to DB2 Universal Database by alerting you to potential system health issues. This enables you to address health issues before they become real problems that affect your system’s performance.
The Health Monitor runs on the DB2 server and continually monitors the health of the DB2 instance and databases. If the Health Monitor detects that a user-defined threshold has been exceeded (for example, the available log space has dropped below a set percentage of the total space available), or if it detects an abnormal state for an object (for example, the DB2 instance is no longer running), the Health Monitor will raise an alert.
When an alert is raised two things can occur:
- The alert notification will be sent.
- This can be sent by e-mail or to a pager
- Pre-configured actions can be taken.
- A CLP script or a Task Center task can be executed.
A
health indicator is a system characteristic that the Health Monitor checks. The
Health Monitor comes with a set of predefined thresholds for these health indicators. The Health Monitor checks the state of your system against these health-indicator thresholds when determining whether to issue an alert. Using the Health Center, commands, or APIs, you can customize the threshold settings of these health indicators, and define who should be notified and what script or task should be run if an alert is issued.
The Health Center provides the graphical interface to the Health Monitor. You use it to configure the Health Monitor, and to see the rolled up alert state of your instances and database objects. Using the Health Monitor’s drill-down capability, you can access details about current alerts and obtain a list of recommended actions that describe how to resolve the alert.
- Snapshot Monitors / SQL Snapshot Functions
DB2 maintains data about its operation, its performance, and the applications that are accessing it. This data is maintained as the database manager runs, and can provide important performance and troubleshooting information. For example, you can find out:
- The number of applications connected to a database, their status, and which SQL statements each application is executing, if any.
- Information that shows how well the database manager and database are configured, and helps you to tune them.
- When deadlocks occurred for a specified database, which applications were involved, and which locks were in contention.
- The list of locks held by an application or a database. If the application cannot proceed because it is waiting for a lock, there is additional information on the lock, including which application is holding it.
- The list of SQL statements executed against a particular database, how many times they were executed, how many sorts were performed on behalf of the statement and the total amount of CPU time used by each statement.
- The number of sorts that have occurred, and the number currently in progress.
Because the monitors do add some overhead to the system, the
monitor switches can be enabled or disabled independently. They can also be set for the entire instance, and all databases in the instance, or can be set within a database session. If the monitor switches are enabled within a session, they are only ‘active” for that session, and a snapshot taken from another session will not capture the monitor information. If the switches are enabled using the DB2 instance configuration parameters, they are enabled for all sessions, unless explicitly turned off within a session.
To set the monitor switches within a session, use the UPDATE MONITOR SWITCHES command or the sqlmon() API.
For example, to enable buffer pool monitoring, turn on the monitor switch using the following command:
update monitor switches using bufferpool onYou can access the data that the database manager maintains either by taking a snapshot or by using an event monitor. You can take a snapshot in one of the following ways:
- Using the GET SNAPSHOT command from the command line
- Calling the SQL Snapshot function
- Using the Control Center
- Write your own application, that makes the sqlmonss() API call.
Once an event monitor has been created and activated, it will collect information about the database and any database applications when the specified event occurs. An event is basically a change in database activity such as:
- A database connection / disconnect
- A deadlock, or lock timeout
- A statement execution
- A transaction start or finish
An event monitor is created based on the type of event or event that you want it to detect and record. For example, a
deadlock event monitor waits for a deadlock to occur; and when one does occur it will collect and record information about the applications and locks involved in the deadlock condition.
Event monitors are created using the CREATE EVENT MONITOR statement and will collect event information only when they are active. An event monitor is activated and deactivated using the SET EVENT MONITOR STATE statement. The EVENT_MON_STATE function will return the current state of the specified event monitor.
When the CREATE EVENT MONITOR statement is executed, the definition of the event monitor is created and stored in the system catalog tables.
- SYSCAT.EVENTMONITORS: Event monitors defined for the database.
- SYSCAT.EVENTS: Events types being monitored for the database.
- SYSCAT.EVENTTABLES: The names of the target tables for table event monitors.
The DB2 DBA Checklist
While databases are becoming more and more self-aware and self-healing, they still require some monitoring to keep them running as efficiently as possible. Just like your car, a database requires a tune-up every once in a while to keep it running optimally.
Over the next few days I will talk about checks or tasks that should be run against the databases at different intervals. The first set of checks or tasks that I will talk about should be run every day to make sure there are no current or imminent problems. The second set should be run weekly to check for issues or problems that may have occurred during the week or are likely to occur in the coming week(s). The final set of checks or tasks need not be run every day or week, but should be run monthly to keep the system running without problems, and to prevent further issues in the event that a problem does occur.
Monitoring the SystemThere are a number of reasons that you should monitor your databases, however, the main reason is to ensure that no problems currently exist with the system or are imminent. It is always better to detect a problem and take actions to prevent it from happening that to have to react to a problem once it has happened. By monitoring your DB2 database systems as described in this article, you should be able to detect many problems before they happen, and maintain the performance of your system.
Monitoring Tools AvailableYou will typically need to combine DB2 and operating system monitoring in order to get the complete picture of what is happening on the database server. DB2 tools alone normally do not give the complete picture.
When capturing the information for analysis, make sure that the DB2 and operating system information is captured at the same time, as you cannot correlate information captured at different times.
When monitoring the system, you should also capture/take the snapshots over a period of time. Taking them for only a one or two minute period will not give a real view of the system activity. I suggest that you take the snapshot every one to 2 minutes, for a period of at least one hour. For example, to capture the CPU, memory and other operating system usage information we would use the tool
vmstat.
The parameters of the
vmstat command are as follows.
Parameter 1 The interval, in seconds, at which the tools captures the system information
Parameter 2 The number of times that the tool should capture the system information
To capture the vmstat information every minute for one hour and write the output to the file named vmstat.out, use the following command:
vmstat 60 60 > vmstat.outTo capture the iostat information every two minutes for one hour and write the output to the file named iostat.out, use the following command:
iostat 120 30 > iostat.outAlso make sure to capture the snapshots are normal/average workload times as well as peak workload times. While it is important to ensure the normal workloads are handled efficiently, it is also important to ensure that the system can handle the peak workloads without overloading the server.
DB2 ToolsDB2 has a number of tools that can be used to monitor the activity of the databases and instances. These include:
- The Health Monitor / Health Center
- Snapshot Monitors / SQL Snapshot Functions
- Event Monitors
There are also other tools and logs available that provide information about the databases and instances including:
- The administration notification log
- This is a separate file on Linux and UNIX and incorporated into the Event Log on Windows
- DB2DIAG.LOG
- Memory Visualizer
- db2pd
Tomorrow I will talk in more detail about the database tools, and them after that will talk about the Operating systems tools.
Joining in DB2 UDB
Joining in DB2 UDBIn previous version of DB2, the DB2 optimizer would normally choose between two different join methods: a nested loop join and a merge join. When the DB2_HASH_JOIN registry variable was set to YES, the optimizer was also able to consider using a hash join when optimizing the access plan. Since hash joins can significantly improve the performance of certain queries, especially in DSS environments where the queries are normally quite large and complex, hash joins are always available in DB2 UDB Version 8, but are only considered when the optimization level is five or higher.
Join methodsWhen joining two tables, no matter which join method is being used, one table will be selected to be the outer table and the other table will be the inner table. The optimizer decides which will be the outer table and which will be the inner table based on the calculated cost and the join method selected. The outer table will accessed first and will only be scanned once. The inner table may be scanned multiple times, depending on the type of join and the indexes that are present on the tables. It is also important to remember that even though an SQL statement may join more than two tables, the optimizer will only join two tables at a time and keep the intermediate results if necessary.
Nested loop joinWhen performing a nested loop join, for each qualifying row in the outer table there are two methods that can be used to find the matching rows in the inner table:
- Scan the entire inner table.
- Read every row in the inner table and for each row determine if it should be joined with the row from the outer table.
- Perform an index lookup of the joined column(s) on the inner table.
- This is possible when the predicate used for the join includes a column that is contained in an index on the inner table, and can dramatically reduce the number of rows accessed in the inner table.
In nested loop join the decision on which is the outer table and which is the inner table is very important because the outer table is only scanned once, and the inner table is accessed once for every qualifying row in the outer table. The optimizer uses a cost based model to decide which table will play which role in the join. Some of the factors taken into account by the optimizer when making this decision include:
- Table size
- Buffer pool size
- Predicates
- Ordering requirements
- The presence of indexes
Note: The joined columns cannot be Long Varchar or LOB columns.Merge joinMerge joins require that the SQL statement contain an equality join predicate (i.e., a predicate of the form table1.column = table2.column). A merge scan join also requires that the input tables be sorted on the joined columns. This can be achieved by scanning an existing index or by sorting the tables before proceeding with the join.
With a merge join, both of the joined tables will be scanned at the same time looking for matching rows. In a merge join both the outer and inner tables will be scanned only once unless there are duplicate values in the outer table, in which case some parts of the inner table may be scanned again for each duplicated value. Because the tables are generally scanned only once, the decision on which is the outer and which is the inner table is somewhat less important than with a nested loop join, however because of the possibility of duplicate values, the optimizer will attempt to choose the table with fewer duplicate values as the outer table.
Note: The joined columns cannot be Long Varchar or LOB columns.Hash joinHash joins require one or more equality join predicates between the joined tables in which the data types for each of the joined columns must be the same. In the case of CHAR data types, even the length of the column must be the same. In the case of DECIMAL data types, the precision and scale must be the same. The fact that hash joins can handle more than one equality predicate between the joined tables is a distinct advantage over merge joins, which can handle only one equality predicate.
For hash joins, the inner table (also known as the build table) is scanned first, and the qualifying rows are copied into memory buffers. These buffers are divided into partitions based on a hash code computed from the column(s) in the join predicate(s). If there is not enough space in memory to hold the entire build table, some partitions will be written into temporary tables. Then the outer table (also known as the probe table) is scanned. For each row in the probe table the same hashing algorithm is applied to the join column(s). If the hash code obtained matches the hash code of a row in the build table, the actual values in the join columns will be compared. If the partition that matches the probe table row is in memory the comparison can take place immediately. If the partition was written to a temporary table, the probe row will also be written into a temporary table. Finally, the temporary tables containing rows from the same partitions are processed to match the rows. Because of the advantages of keeping the build table in memory, the optimizer will normally choose the smaller table as the build table to avoid having to create a temporary table.
In intra-partition parallelism is enabled the hash join may be executed in parallel. When a hash join is executed in parallel the build table is dynamically partitioned into multiple parallel tuple streams, and each stream is processed by a separate task to enter the build tuples into memory. At the end of processing the build table streams, the hash join process adjusts the contents of memory and performs any needed movement of partitions into or out of memory. Next, multiple parallel tuple streams of the probe table are processed against the in memory partitions and may be spilled for any tuples from hash join partitions that were spilled to temporary tables. Finally, the spilled partitions are processed in parallel, with each task processing one or more of the spilled partitions.
Webcast Availabile from IDUG
The International DB2 Users Group (IDUG®) is pleased to partner with Computer Associates to deliver a free webcast by award-winning presenter Maria Sarikos as she helps convert mainframe DB2 DBA skills to DB2 for Linux, UNIX, and Windows DBA skills. This live web cast will run approximately 60 minutes and participants will have the opportunity to pose questions to the presenter during the Webcast. Capacity is limited, so register early.