
Sunday, September 26, 2010

DataStage and Slowly Changing Dimensions

by Unknown  |  in Datastage at  6:51 PM

Slowly Changing Dimensions describe the behavior of changing information within the dimension tables of a data warehouse. Business users may or may not decide to preserve history in the data warehouse tables. To preserve information within the data warehouse, each data warehouse element (e.g., Customer First Name, Customer Telephone Number, Salesman ID, Product Department) is described as one of three types. Ralph Kimball, in The Data Warehouse Toolkit (Wiley, 1996), describes slowly changing dimensions in the following way:
Type 1: Overwriting the old values in the dimension record and thereby losing the ability to track the old history.
Type 2: Creating an additional dimension record at the time of the change with the new attribute values, and thereby segmenting history very accurately between the old description and the new description.
Type 3: Creating new “current” fields within the original dimension record to record the new attribute values, while keeping the original attribute values as well, thereby being able to describe history both forward and backward from the change either in terms of the original attribute values or in terms of the current attribute values.
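The three types can be sketched in ordinary Python on a dictionary standing in for a dimension row. This is an illustration only, not DataStage code; the column names (`DIM_KEY`, `CURRENT_FLAG`, the `_LASTVAL` suffix) are hypothetical:

```python
from copy import deepcopy

def apply_type1(dim_row, col, new_val):
    """Type 1: overwrite in place; the old value is lost."""
    dim_row[col] = new_val
    return dim_row

def apply_type2(dim_table, old_row, changes, next_key):
    """Type 2: expire the current row and insert a new current row."""
    old_row["CURRENT_FLAG"] = "N"
    new_row = deepcopy(old_row)
    new_row.update(changes)
    new_row["DIM_KEY"] = next_key      # new surrogate key for the new version
    new_row["CURRENT_FLAG"] = "Y"
    dim_table.append(new_row)
    return new_row

def apply_type3(dim_row, col, new_val):
    """Type 3: keep the prior value in a companion 'last value' column."""
    dim_row[col + "_LASTVAL"] = dim_row[col]
    dim_row[col] = new_val
    return dim_row
```

A Type 2 change yields two rows (old flagged "N", new flagged "Y"), while Types 1 and 3 modify the existing row in place.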

DataStage easily handles all three types of Slowly Changing Dimensions within the DataStage transform. Because of this simplicity, no special features or “gizmos” are required for the basic functionality and the “road is clear” to add the more complex functionality that is often required for other transformations.

SCDs can require a number of relatively complex data management decisions. DataStage affords the luxury of being able to implement complex strategies for handling not only the three SCD types above, but also other SCD processes that have been postulated in recent data warehousing literature.



We will analyze slowly changing dimensions through a simple, practical example. Three records need to be loaded into a data warehouse dimension table: the first two update existing records, while the third has never been loaded. The data modeler mixes all three types of SCDs throughout the dimension. Last week he said that if a record triggers a Type 2 change, the old records' Type 1 and Type 3 fields should not be touched. This week he has changed his mind: the old records' Type 1 and Type 3 fields should be updated, but the stored prior values of the Type 3 fields should be left alone.

You start a simple DataStage job and add the stages shown below. It has a source stage for your three new records, a transform stage for the logic and lookups, and a write back to the database dimension table (Insert or update). The job looks like this:



You take a few seconds to point the Source_Records stage at the incoming file and see the three records. DataStage View Data provides quick insight into the source while ensuring that you can connect. The records in this example have descriptive variables for each type of SCD. Some have new values and some have old.


You point the Dimension_Lookup stage at your database table. This happens to be an Oracle native OCI database stage; it could be a native connection to DB2, Informix, Sybase, and so on; ODBC; or fast, flexible DataStage hash tables. View Data tests the connection and shows the first 1,000 (in this case 2) records at the same time.




There are a lot of options in maintaining SCD Types 1, 2, and 3. End users, data administrators, and business requirements all may dictate one or multiple actions, even for the same data. The dimension record is identified in four ways:
Natural Key: The natural key comes from the OLTP system. It is the key the incoming input record uses for the lookup.
Dim Key Artificial: A surrogate key assigned by the DataStage routine for each new record. Used as the unique key, and as the parent key for fact tables.
Dim Key Art Orig: Tracks the original surrogate key for the particular dimension record. Used to select or update all of the dimension record's versions as a set.
Current Flag: A simple flag denoting the most current dimension record; it is also used in the lookup.
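As an illustration, a dimension row carrying these four identifiers might look like the following (all values hypothetical):

```python
dim_row = {
    "NATURAL_KEY": "CUST-1001",  # from the OLTP system; the lookup key
    "DIM_KEY": 57,               # surrogate key for this version of the record
    "DIM_KEY_ORIG": 12,          # surrogate key of the original version
    "CURRENT_FLAG": "Y",         # marks the most current version
}
```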

Your job demonstrates a simple form of dimensional maintenance. A data input (a sequential file in this case--database connections, FTP, other transformation process, or pipes would be just as easy) inputs new or updated information into the dimension transformation. You drag-and-drop to join the stages with data links:



1.      Dimension_Source Link: A data input record is read from the source.
2.      Old_Dimension_Lookup Link: A lookup (a native database connection in this case) seeks an existing dimension record with the same natural key as the input.
3.      Type_3_or_1_Update Link: If no Type 2 SCD variables have changed but Type 1 or Type 3 SCDs have, send the record down this Update link. No CURRENT_FLAGs change.
4.      Type_2_Current_Flag_Update_to_N Link: If Type 2 SCD variables have changed (causing an insert of a new CURRENT record), update the existing records to set their CURRENT flags to “N”. We have chosen to also update any Type 1 and current Type 3 SCD values in the old records for consistency.
5.      Type_2_or_New_Insert Link: If no existing dimension record is found, or if a changed field is a Type 2 SCD, send the record down this Insert link and set the new record’s CURRENT_FLAG to “Y”.
6.      Database_Errors Link: If a database error is detected on the Type_2_or_New_Insert link, save the entire record, with the surrogate key and the database error codes, to a flat file.
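The routing among links 3, 4, and 5 can be sketched as a small Python function. This is a sketch of the constraint logic described above, not DataStage syntax; the link names mirror the list:

```python
def route(lookup_rec, type2_changed, type31_changed):
    """Return the output link(s) one input record should take.

    lookup_rec is None when no dimension record with the same
    natural key was found.
    """
    links = []
    if lookup_rec is None or type2_changed:
        # Brand-new record, or a Type 2 change: insert a new current row.
        links.append("Type_2_or_New_Insert")
    if lookup_rec is not None and type2_changed:
        # Type 2 change: expire the existing current rows.
        links.append("Type_2_Current_Flag_Update_to_N")
    if lookup_rec is not None and not type2_changed and type31_changed:
        # Only Type 1/3 changes: update in place, flags untouched.
        links.append("Type_3_or_1_Update")
    return links
```

A record with no changes at all takes no link, which matches the "no action is taken" case described later.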

The lookup work is performed in the Process_SCDs Transform stage shown below. The designer’s transformation job is made simple by separating the logic of what information is stored and what action is to be taken. Step 1: Mapping, Step 2: database logic.


1.      How do the columns map from source to target? First, the columns are mapped from the sources on the left to the targets on the right. The incoming source (the Dimension_Source link) looks up the old dimension record using the Natural_Key. The Derivations on the right show which source and column each field is taken from for the Output Links to the target. The job will either update or insert a record depending upon whether a lookup record is found and which values have changed.

This mapping is very visual and shows replacement by a Type 1 SCD and Type 3 SCD along with the storage of the old value for the Type 3 SCD in the variable SCD_3_LastVal. In the case of a Type 2 SCD, all columns for the Insert are populated from the source record except for an automatic new key value for the dimension table. Only one or two SQL statements are generated for any particular record.

2. Which columns are SCDs?  Stage variables hold values within the transform stage for further processing. Stage variables can be used to save information from record to record, to calculate a value used in many later calculations, or to serve as a local accumulator. Stage variables easily provide the logic for what to do with each SCD. Because stage variables support all of the complex logic of DataStage, more complex logic than a simple equality test can be used to determine whether an SCD variable has actually changed. The logic for the four variables is shown below:

Type2: Any SCD 2 variable change?
NewRecord: Was a lookup record not found?
Type_3_or_1_Update: Did an SCD 3 or SCD 1 variable change?
NewKeyNeeded: A simple test determines whether to ask for a new surrogate key. In this way, surrogate keys are not wasted and DBAs researching missing record keys are not confused.





Other variables are added into the OR clauses of Type2 or Type_3_or_1_Update depending upon the treatment desired. If the record is found and all old and new SCD values are equal, no action is taken.
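The four stage variables above can be sketched as one Python function. This is illustrative only; it uses simple equality tests, whereas (as noted above) DataStage stage variables can carry more complex change-detection logic. `lookup` is `None` when no dimension record was found:

```python
def stage_variables(source, lookup, scd2_cols, scd31_cols):
    """Compute the four stage variables for one input record.

    source and lookup are dicts of column -> value; scd2_cols and
    scd31_cols list which columns are Type 2 and Type 3/1 SCDs.
    """
    new_record = lookup is None
    # Any Type 2 column changed? (Only meaningful when a record was found.)
    type2 = (not new_record) and any(
        source[c] != lookup[c] for c in scd2_cols)
    # Type 1/3 changes matter only when no Type 2 change forces an insert.
    type_3_or_1_update = (not new_record) and (not type2) and any(
        source[c] != lookup[c] for c in scd31_cols)
    # Ask for a surrogate key only when a new row will actually be inserted,
    # so keys are not wasted.
    new_key_needed = new_record or type2
    return type2, new_record, type_3_or_1_update, new_key_needed
```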

3. Update vs. Insert: Finally, the hierarchy between INSERTs and UPDATEs needs to be detailed. The DataStage Constraint logic details the simple logic for this process:



The database stage displays update actions for modifying the database: insert, update, truncate, insert then update, and so forth. An extra link is included for database errors. For added database integrity, the developer may decide to use DataStage Transaction Control to ensure that, for each record, either all database operations are applied or all are rolled back within a single transaction.
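The all-or-nothing behavior that Transaction Control provides can be illustrated with generic database code. The sketch below uses Python's sqlite3 (not DataStage) purely to show the pattern: the expiring UPDATE and the new INSERT for a Type 2 change either both commit or both roll back:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim (dim_key INTEGER PRIMARY KEY, name TEXT, current_flag TEXT)")
conn.execute("INSERT INTO dim VALUES (1, 'old value', 'Y')")
conn.commit()

# Expire the old current row and insert the new one as a single unit of work.
try:
    with conn:  # the with-block is one transaction: commit, or rollback on error
        conn.execute("UPDATE dim SET current_flag = 'N' WHERE dim_key = 1")
        conn.execute("INSERT INTO dim VALUES (2, 'new value', 'Y')")
except sqlite3.Error:
    pass  # on failure both statements roll back, leaving the dimension intact
```

If either statement fails, the dimension never ends up with two current rows or none.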






The input data, original dimension, and final dimension table data look as shown below. The first record is replaced by the third record as an SCD 2 change. The second record is changed in place by the SCD 1 and SCD 3 values, the third is the replacement for the SCD 2, and the fourth is a brand-new record.


That’s it! You have built a powerful dimensional maintenance process that can be crafted to the complex logic that all data warehouses will need without going outside of DataStage.


Now that the process for Slowly Changing Dimensions has been built, issues of production support and performance must be considered. DataStage jobs are easily tuned for production throughput and power. The developer can easily move links to other stage types to take advantage of several DataStage features that increase power and decrease demands on the database.

The changes are detailed in the job below. The steps to develop it are shown:

Limit database I/O for dimensional lookups: The dimension table itself is loaded into a high-performance hash table local to the DataStage ETL server. Only one SELECT is sent against the database to select records into the index. Because the hash index is persistent, this step may be performed at any time before the main job runs, or as a dependent step of this job.
Use the index, not the table: UPDATEs and INSERTs are performed on the index, not on the dimension table itself. The original job had to make INSERTed records available to later records with the same key (to avoid inserting duplicate records), a highly inefficient pattern when every insert goes to the database.
Load the dimension in bulk: A fast bulk-load database process updates the dimension table after the job has run. Separate bulk-load processes update or insert the different sets of records. This is an optional step; DataStage can also maintain production speed by loading the dimension table in parallel while the ETL process continues to use the indexes for fact records.
Build restart or reversing capabilities: The ability to restart the job at a given row number can be built in through DataStage job parameters. When the job is run, Job Parameter values can be changed in the GUI Director, the Scheduler, or from the operating system command line.
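The first two tuning steps above (a local hash index fed by one SELECT, with per-record lookups against it) and the restart-by-row-number idea can be sketched in Python. The function and column names are illustrative, not DataStage APIs:

```python
def build_hash_index(rows):
    """One SELECT's worth of dimension rows becomes an in-memory index
    keyed by the natural key."""
    return {row["NATURAL_KEY"]: row for row in rows}

def lookup(index, natural_key):
    """Per-record lookups hit the local index, not the database."""
    return index.get(natural_key)

def rows_from(records, start_row=1):
    """Restart support: skip rows already processed in a prior run."""
    for n, rec in enumerate(records, start=1):
        if n >= start_row:
            yield rec
```

Inserted rows go into the index too, so a later input record with the same natural key sees them without a database round trip.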




DataStage dimension maintenance is very straightforward. Working with SCDs is a simple, three-step process within the DataStage development environment.

The power of the basic DataStage process makes “gizmos” for managing SCDs superfluous, and actually counterproductive when the developer tries to process slowly changing dimensions “out of the box” with other alternatives. A recent customer put it succinctly: “It would have taken a huge amount of effort to do what we wanted when the Wizard only did what it wanted to!”

Some of the ways to do Type 2 SCDs include updating the key in the dimension and changing the fact table. An alternative is to keep the key in the fact table and add and adjust the keys in the dimension (there are multiple ways to do this). Alternatives may be chosen because of volume, end-user requirements and preferences, the source of the dimension information, and so on.

DataStage jobs are also easily modified from basic prototype jobs to robust production applications. Features such as Viewing Data in sources and targets and always-available Inline Debugging are invaluable in making these applications production-ready.

