
Sunday, September 26, 2010

Data Generation Using DataStage


This note assumes some familiarity with the DataStage transformation engine. DataStage is normally used to process multiple input files and/or table selections, perform lookups for related data, and process data transformations for loading to one or more target database tables or files. Support for data generation is provided through the following process:

·         The DataStage Transformer stage needs either an input link or stage variables; the stage stops automatically when no more rows are output.
·         Use the Transformer stage as a source stage with one or more output links. No input links should be specified.
·         Set up a stage variable (it only needs to be declared, not necessarily used).
·         Within the Transformer, place a Constraint on the output link(s) using @OUTROWNUM for a specified number of rows or, optionally, a stage variable that you increment/decrement (a minimal sketch follows this list).
·         Define your derivations with your test data.
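For illustration, here is a minimal sketch of the two pieces a generated-data Transformer needs. The stage variable name RowCounter is illustrative, not taken from the job shown below:

RowCounter = RowCounter + 1    (stage variable derivation; initial value 0)

@OUTROWNUM < 1000    (Constraint on the output link, using the built-in output row counter)

RowCounter < 1000    (equivalent Constraint using the stage variable instead)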


Simple Demonstration


Below is a simple Job to produce test data rows.  The DataStage Transformer stage on the left (GenerateSomeData) contains the logic for data field generation, while the Sequential File Stage on the right (“Target”) represents a very simple target.



Double-clicking the Transformer stage opens the screen below. Normally, input data sources would be shown on the left, but no inputs are required here. A stage variable must be in place for data generation to work, though it does not need to be used or referenced. The number of rows created is controlled by the Transformer Constraint. As shown below, the Constraint on the WriteGeneratedData output link is @OUTROWNUM < 1000; that is, stop once the outgoing row number counter reaches 1000.



Include Internal Data in Test Suite


Here is a job in which a data generation stage drives another Transformer stage to look up specific row data. The TestFile hash index contains 10 rows, keyed 0 through 9, where field1 = FirstName and field2 = LastName. The goal is, for every input row up to 999, to take the last character (0-9) and use it as the key to “look up” the FirstName/LastName.



Here is the actual data in the hash lookup file:



This is the second Transformer stage, where the lookup is done (notice the substring [Len(WriteGeneratedData.Id),1] in the key derivation, which picks off the last character of the input row to use as the lookup key).
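Written out, the key derivation uses the standard DataStage BASIC substring syntax text[start,length], which extracts length characters beginning at position start:

WriteGeneratedData.Id[Len(WriteGeneratedData.Id),1]

For Id = 123, for example, Len returns 3, so the expression picks out the single character "3", which matches key 3 in the hash file.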


Random Number Generation


A critical feature for a data generation tool is the ability to generate random numbers for the test data produced by a DataStage job.


The NumericRandomGenerator routine generates random numbers within a specified range.

NumericRandomGenerator(Offset,NumberRange,RepeatSeed)
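The original post does not show the routine body. As a rough sketch, and assuming the semantics implied by the worked example later in the post (Offset is the center value and NumberRange the spread on either side of it), a server-job BASIC implementation might look like:

If RepeatSeed <> 0 Then Randomize(RepeatSeed)   ;* seed once for repeatable runs
Ans = Offset + Rnd(2 * NumberRange + 1) - NumberRange   ;* Rnd(n) returns 0..n-1, so Ans falls in Offset-NumberRange .. Offset+NumberRange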

Pseudorandom ranges of numbers can be very important in building test suites of data. A pseudorandom range of numbers is multiple sets of numbers in different ranges, but random within each range. An example would be:

Range      Sample values
1-25       4, 13, 7, 22
26-50      44, 29, 27, 31
51-75      67, 65, 54, 69
76-100     98, 79, 84, 85



Suites of pseudorandom number ranges can be created using the Offset variable and Stage variables. A transformation using this process is shown below:
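The screenshot is not reproduced here, but the idea can be sketched with one stage variable feeding the Offset argument. The name RangeCenter is illustrative; it cycles through the centers of the four ranges shown above:

RangeCenter = Mod(@OUTROWNUM - 1, 4) * 25 + 13    ;* stage variable: 13, 38, 63, 88, 13, ...

NumericRandomGenerator(RangeCenter, 12, 0)    ;* column derivation: a value within 12 of the center

Row by row this yields a value in 1-25, then 26-50, then 51-75, then 76-100, repeating every four rows.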


DataStage allows the developer to control the complexity of random and pseudorandom number generation.

The DataStage Aggregator and Pivot stages were used in a job to test the randomness of the NumericRandomGenerator. The test generated a large volume of random numbers and checked the resulting distribution against the routine's parameters.


The output of the job is shown below:


Random Date Generation


A related feature to random number generation is random date generation. Random date generation takes a range of dates between which random dates are output. The DataStage routine DateRandomGenerator accepts and produces dates in any of the following formats:

Database Date Type            Date Format      Example Data
DataStage Internal Numeric    D or Internal    55555
ODBC                          ODBC             2001-01-01 00:00:00.000
Oracle OCI                    OCI              55555
Sybase OC                     OC               55555
Informix CLI                  CLI              55555
DB2 DB-Connect                DB2              55555


The syntax is DateRandomGenerator(StartDate,EndDate,Offset,DateFormat,RepeatSeed)

Argument      Description                                                    Example
StartDate     Starting date according to the format above                    2001-01-01 or 12846
EndDate       Ending date according to the format above                      2001-01-01 or 12846
Offset        Number of days offset forward from the dates                   365
DateFormat    Date format designator according to the chart above            ODBC
RepeatSeed    Repeating-randomness start seed; should only be used once at   0
              the beginning of the process. 0 if no repeating randomness
              is desired.
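The routine body is likewise not shown in the original post. A rough sketch for the ODBC format, using the standard BASIC Iconv/Oconv date conversions (the conversion code and the StartInt/EndInt names are illustrative):

If RepeatSeed <> 0 Then Randomize(RepeatSeed)
StartInt = Iconv(StartDate[1,10], "D-YMD[4,2,2]")    ;* date portion to an internal day number
EndInt = Iconv(EndDate[1,10], "D-YMD[4,2,2]")
RandDay = StartInt + Rnd(EndInt - StartInt + 1) + Offset    ;* random day in the range, shifted by Offset
Ans = Oconv(RandDay, "D-YMD[4,2,2]") : " 00:00:00.000"    ;* back to ODBC-style text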




One example requiring random number generation is building a data warehouse fact table within a star schema. A star schema fact table represents the joining of several dimension tables with measures such as quantity and extended price.

In building test data, a distribution across products, time, and customers that approximates reality is important. The test data generator may need to create more sales on weekdays than on weekends, while randomly distributing sales across the weekdays.

DataStage has several features that help model complex data generation requirements:
·         Powerful hash index features that allow the data modeler to build a representative mix without changing data sources.
·         The ability to rerun with slightly different parameters, for example a separate model for large purchasers and another for small purchasers. This keeps each model simple and verifiable while building up a larger model.
·         Combining random numbers with hash indexes allows real customers and products to be assigned to the generated test data. Cross-reference tables can be used to match customers with their product preferences or with other products as cross-selling opportunities.


Hash tables can be used in a simple example of product selection. If the average quantity of a product sold is stored in the hash table, it can be used as the basis for a random selection. A call such as

NumericRandomGenerator(Offset,NumberRange,RepeatSeed)
NumericRandomGenerator(AvgQuantity, AvgQuantity/2, 0)

is used to provide a range of Quantity values from 50% to 150% of AvgQuantity (AvgQuantity plus or minus half of AvgQuantity). For AvgQuantity = 10, for example, this yields quantities from 5 to 15.

Several copies of a product can be stored in the hash table for different models of usage. If Product A is ordered twice as often as Product B, it can be stored twice in the product table if both are selected at the same time. If 90% of the orders only receive one quantity while 10% get hundreds, 10 records can be stored with different average quantities. If a product is only ordered in conjunction with another product, the product table can be split into two hash indexes and the second only be accessed in conjunction with the product selected in the first.
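To make the weighting concrete, a product hash file along the following lines (the keys and values are illustrative) would make Product A twice as likely to be selected as Product B when lookup keys are drawn uniformly at random, while the two copies of A carry different average quantities to model the mix of small and bulk orders:

Key    ProductId    AvgQuantity
0      A            2
1      A            200
2      B            5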

A sample job is shown below:


The Generate_Random_Keys stage creates random entries with customer keys and product keys in the range of 0 to [the total number], where [the total number] is the maximum key number for each customer or product.

The hash indexes have a unique number added in addition to the natural key. As discussed above, there may be multiple copies of popular products and customers to increase their frequency in the resulting file.

Only the first product is selected in the “random” hash table search. Products related to the first, such as a cable set for a computer, are selected using keys carried over from the First_Product search.

This process can be made as complex as the ETL processes DataStage typically models. More stages can be added to select products up to an average amount per customer order, to select typical product purchases for a customer, to model growth, or to add further functional processes.

