This note assumes some familiarity with the DataStage transformation engine. DataStage is normally used to process multiple input files and/or table selections, perform lookups for related data, and process data transformations for loading to one or more target database tables or files. Support for data generation is provided through the following process:
· The DataStage Transformer Stage needs either an input link OR Stage Variables! The stage automatically stops when no rows are output.
· Use the Transformer Stage as a source stage with 1 or more output links. No input links should be specified.
· Setup a stage variable (Just needs to be declared, not necessarily used)
· Within the Transformer, place a Constraint on the output link(s) using @OUTROWNUM for a specified number of rows or, optionally the stage variable is you increment/decrement it.
· Define your derivations with your test data.
Simple Demonstration
Below is a simple Job to produce test data rows. The DataStage Transformer stage on the left (GenerateSomeData) contains the logic for data field generation, while the Sequential File Stage on the right (“Target”) represents a very simple target.
Double-clicking the Transformer stage opens the screen below. Normally, input data sources would be shown on the left—but there are no inputs required! There needs to be a stage variable in place for data generation to work, but it does not need to be used or referenced. The number of rows created is controlled by the Transformer Constraint. As shown in the WriteGeneratedData Derivation below, the Constraint for the WriteGeneratedData data link is @OUTROWNUM < 1000—or, stop when the outgoing row number counter reaches 1000.
Include Internal Data in Test Suite
Here is a job with a Data Generation stage driving another Transformer stage to lookup specific row data. The TestFile hash index contains 10 rows 0 – 9 where field1 = FirstName, field2 = LastName The goal is for every input row up to 999 take the last character 0-9 and use it as the key to “lookup” the FirstName/LastName.
Here is the actual data in the hash lookup file;
This is the second Transformer Stage where the lookup is being done (notice the substring: [Len(WriteGeneratedData.Id),1] in the key derivation to pick off the last character of the input row for the lookup key.
Random Number Generation
A critical feature for a data generation tool is the ability to generate random numbers for data entered into a DataStage job.
The NumericRandomGenerator routine generates random numbers for a range of numbers.
NumericRandomGenerator(Offset,NumberRange,RepeatSeed)
Pseudorandom ranges of numbers can be very important in building test suites of data. A pseudorandom range of numbers is multiple sets of numbers in different ranges, but random within that range. An example would be
1-25
|
26-50
|
51-75
|
76-100
|
4,13,7,22
|
44,29,27,31
|
67,65,54,69
|
98,79,84,85
|
Suites of pseudorandom number ranges can be created using the Offset variable and Stage variables. A transformation using this process is shown below:
DataStage allows the developer to control the complexity of random and pseudorandom
The DataStage Aggregator and Pivot stages were used in a job to test the randomness of the NumericRandomGenerator. A test was developed to run high numbers of random numbers and test the parameters of the process.
The output of the job is shown below:
Random Date Generation
A related feature to random number generation is random date generation. Random date generation uses a range of dates between which random dates are output. The DataStage Routine DateRandomGenerator uses and creates dates in any of the following formats:
Database Date Type
|
Date Format
|
Example Data
|
DataStage Internal Numeric
|
D or Internal
|
55555
|
ODBC
|
ODBC
|
2001-01-01 00:00:00.000
|
Oracle OCI
|
OCI
|
55555
|
Sybase OC
|
OC
|
55555
|
Informix CLI
|
CLI
|
55555
|
DB2 DB-Connect
|
DB2
|
55555
|
The syntax is DateRandomGenerator(StartDate,EndDate,Offset,DateFormat,RepeatSeed)
Argument | Description | Example |
StartDate
|
Starting Date according to above format
|
2001-01-01 or 12846
|
EndDate
|
Ending Date according to above format
|
2001-01-01 or 12846
|
Offset
|
Number of days offset forward from dates
|
365
|
DateFormat
|
Date Format Designator according to above chart
|
ODBC
|
RepeatSeed
|
Random number repeating start seed. Should only be used once at beginning of process. 0 if no repeating randomness desired.
|
0
|
One example requiring random number generation is building a data warehouse fact table within a star schema. A star schema fact table represents the joining of several dimension tables with measures such as quantity and extended price.
In building test data, a distribution across products, time, and customers approximating a realistic slice of reality is important. The test data generator may need to create more sales for weekdays than weekends but may randomly distribute sales across weekdays.
DataStage has several features that will help model complex data generation requirements:
· Powerful hash index features that allow the data modeler to build a representative mix without changing data sources.
· Ability to rerun with slightly different parameters—a separate model for large purchasers from small purchasers. This allows each model to be simple and verifiable while building a larger model.
· Combining random numbers with hash indexes allow assigning real customers and products to the generated test data. Cross-reference tables can be used to match customers with their product preferences or other products as cross-selling opportunities.
Hash tables can be used in a simple example of product selection. If the average quantity of a product sold is stored in the hash table, it can be used as the basis for a random selection. A function such as
NumericRandomGenerator(Offset,NumberRange,RepeatSeed)
NumericRandomGenerator(AvgQuantity,AvgQuantity/2,0)
Is used to provide a range of Quantity numbers from 50% to 150% of the AvgQuantity.
Several copies of a product can be stored in the hash table for different models of usage. If Product A is ordered twice as often as Product B, it can be stored twice in the product table if both are selected at the same time. If 90% of the orders only receive one quantity while 10% get hundreds, 10 records can be stored with different average quantities. If a product is only ordered in conjunction with another product, the product table can be split into two hash indexes and the second only be accessed in conjunction with the product selected in the first.
A sample job is shown below:
The Generate_Random_Keys stage creates random entries with customer keys and product keys in the range of 0-[the total number], where [the total number] is the maximum number for each customer or product.
The hash indexes have a unique number added in addition to the natural key. As discussed above, there may be multiple copies of popular products and customers to increase their frequency in the resulting file.
Only the First_Product selection is selected in the “random” hash table search. Products related to the first, such as a cable set for a computer, is selected from keys brought from the First_Product search.
This process can be made as complex as the ETL processes DataStage typically models. More stages can be added to select products up to an average amount per customer order, selecting typical product purchases for a customer, modeling growth, or adding more functional processes.
I would like to know if you could help me out. I have a a file with 2 columns (TABLENAME and CLOB) and in the TABLENAME there are "Providers". I would like to know how to set them at number 5000 and then increment them, per provider. So if I have 5 providers per file, it would be 5000, 5001, 5002, 5003, 5004. This is a daily file so we will get a file everyday. So Day 2, the 1st provider must be 5005 since 5004 was the last file from the example. Any help would be great. Thanks
ReplyDelete