
Tuesday, June 9, 2015

How to FTP a File to Mainframe Using DataStage Parallel

by Unknown  |  in DI at  1:09 PM
I had a requirement where we needed to send more than one set of files to a Mainframe using DataStage; the catch was that each file had a different set of columns and a different length.

I spent quite a bit of time coming up with a job that can process more than one file with different layouts and lengths.

Here you go….





Sequential File Stage :- (screenshot)

Transformer :- (screenshot)

FTP Plug-in Stage :- (screenshot)

You may be wondering why I defined the format as delimited when all mainframe files are fixed width. The reason is that if you define it as fixed width, you have to give the exact record length. Instead, I made it delimited and set the delimiter to #$% (you can use even more characters), a string that never occurs in the data. This makes the job read each line as-is into a single column and send the file to the Mainframe unchanged.



Note :-- I didn't do any data manipulation during the FTP, which is why I read the entire file as-is and sent it to FTP. If you want to do data manipulation, you need to bring in the file layout, which ties the job to that particular layout and makes it unsuitable for files with other layouts.
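If you want to sanity-check the same transfer outside DataStage, the standard Windows FTP client can do it. This is just a minimal sketch: the host, user, password, local file and mainframe dataset name are all placeholders, and the SITE parameters (record length/format) depend on your z/OS FTP server:

@echo off
:: Build an FTP command script, then run the transfer non-interactively.
:: MYHOST, MYUSER, MYPASSWORD and the dataset name are placeholders.
SET FtpScript=%TEMP%\ftp_mainframe.txt

>  %FtpScript% ECHO user MYUSER MYPASSWORD
>> %FtpScript% ECHO quote site lrecl=80 recfm=fb
>> %FtpScript% ECHO put C:\data\extract1.txt 'MYHLQ.EXTRACT1.DATA'
>> %FtpScript% ECHO quit

:: -n suppresses auto-login (the script supplies it); -s feeds the commands.
ftp -n -s:%FtpScript% MYHOST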

Sunday, February 24, 2013

Datastage DSX Files Export Script

by Unknown  |  in DI at  8:04 PM

@echo off
:: -----------------------------------------------------------------
:: DataStageExport.bat
:: -----------------------------------------------------------------
:: This batch script exports all the jobs in a project to DSX files in a directory.
:: It must be run from a DataStage client machine, and the variables below should be
:: modified to fit your environment.  Extra command-line parameters were avoided to
:: simplify the export and allow the command to be customized to a particular environment.
:: -----------------------------------------------------------------
:: Required Components:
::     dsExport.exe (Windows Version)
:: -----------------------------------------------------------------
:: Host is server name
:: User is username to use to attach to DataStage
:: Password   is password to use to attach to DataStage
:: ExportDir is the directory above the backed up project
:: ProjectDir is the directory below ExportDir
:: DSExportCmd is the path of the Export command on the client
:: -----------------------------------------------------------------
:: Keep everything that is set here local to this script.
:: -----------------------------------------------------------------
    SETLOCAL
:: -----------------------------------------------------------------
:: Test that the required command line parameters are provided.
:: -----------------------------------------------------------------
    IF "%1"=="" GOTO Syntax
    IF "%2"=="" GOTO Syntax
    IF "%3"=="" GOTO Syntax
    IF "%4"=="" GOTO Syntax

SET Host=%1
SET Project=%2
SET User=%3
SET Password=%4

SET BaseDir=c:\DataStage\DirD
SET ExportDir=%BaseDir%\Backups
SET ProjectDir=%ExportDir%\%Project%
SET JobList=%ExportDir%\%Project%\JobList.txt

SET DSExportCmd=C:\Progra~1\Ascential\DataStage7.5.1\DsExport.exe

:: -----------------------------------------------------------------
:: Get the current Date
:: -----------------------------------------------------------------
    FOR /f "tokens=2-4 delims=/ " %%a in ('DATE/T') do SET DsxDate=%%c%%a%%b
:: -----------------------------------------------------------------
:: Get the current Time
:: -----------------------------------------------------------------
    FOR /f "tokens=1* delims=:" %%a in ('ECHO.^|TIME^|FINDSTR "[0-9]"') do (SET DsxTime=%%b)
:: -----------------------------------------------------------------
:: Set delimiters so that current time can be broken down into components
:: then execute FOR loop to parse the DsxTime variable into Hr/Min/Sec/Hun.
:: -----------------------------------------------------------------
    SET delim1=%DsxTime:~3,1%
    SET delim2=%DsxTime:~9,1%
    FOR /f "tokens=1-4 delims=%delim1%%delim2% " %%a in ('echo %DsxTime%') do (
        set DsxHr=%%a
        set DsxMin=%%b
        set DsxSec=%%c
        set DsxHun=%%d
    )
:: -----------------------------------------------------------------
:: If provided directory is missing an ending \, append it.
:: Validate %ProjectDir%'s existence.
:: -----------------------------------------------------------------
    if exist %ProjectDir%\ set ProjectDir=%ProjectDir%\
    if NOT exist %ProjectDir% GOTO BadMain
:: -----------------------------------------------------------------
:: Set the log file name to improve readability of code.
:: -----------------------------------------------------------------
    SET LogFileName=%ProjectDir%DataStageExport_bat_%DsxDate%_%DsxHr%_%DsxMin%_%DsxSec%.log

:: -----------------------------------------------------------------
:: Main
:: -----------------------------------------------------------------
SET ArchiveDir=Archive_%DsxDate%_%DsxHr%_%DsxMin%_%DsxSec%

    cd %ProjectDir%
    mkdir %ArchiveDir%
    MOVE *.dsx %ArchiveDir%
    MOVE *.log %ArchiveDir%

:: -----------------------------------------------------------------
:: Announce to log of this program's run.
:: -----------------------------------------------------------------
    ECHO. > %LogFileName%
    ECHO DataStage Export ran on %DsxDate% %DsxHr%:%DsxMin%:%DsxSec% with the following parameters >> %LogFileName%
    ECHO Host=%Host% >> %LogFileName%
    ECHO User=%user% >> %LogFileName%
    ECHO ExportDir=%ExportDir% >> %LogFileName%
    ECHO ProjectDir=%ProjectDir% >> %LogFileName%
    ECHO DSExportCmd=%DSExportCmd% >> %LogFileName%
    ECHO JobList=%JobList% >> %LogFileName%
    ECHO. >> %LogFileName%

:: -----------------------------------------------------------------
:: dsExport.exe /H=hostname /U=username /P=password /O=omitflag /NUA project|/ALL|/ASK dsx_pathname1 dsx_pathname2
:: -----------------------------------------------------------------
    for /F "tokens=1" %%i in (%JobList%) do (
        REM Remember the current job name so :ProjFail can report it after the loop breaks.
        SET CurrentJob=%%i

        ECHO Exporting %%i to Project: %Project% on Host: %Host%

        echo "%DSExportCmd% /H=%Host% /U=%User% /P=%Password% /JOB=%%i %Project% %%i.dsx" >> %LogFileName%
        %DSExportCmd% /H=%Host% /U=%User% /P=%Password% /JOB=%%i %Project% %%i.dsx >> %LogFileName%
        REM %ERRORLEVEL% expands when the block is parsed, so test with IF ERRORLEVEL.
        IF ERRORLEVEL 1 GOTO ProjFail

        ECHO. >> %LogFileName%
        ECHO *** Completed Export for Project: %Project% on Host: %Host% >> %LogFileName%
        ECHO     from File: %ProjectDir%%%i >> %LogFileName%
        ECHO. >> %LogFileName%
    )

    GOTO EXITPT

:: -----------------------------------------------------------------
:: Report proper syntax usage.
:: -----------------------------------------------------------------
:Syntax
    echo.
    echo Usage: DataStageExport.bat Host Project User Password
    echo.
    GOTO ENDPOINT
:: -----------------------------------------------------------------
:: Report that directory is non-existent.
:: -----------------------------------------------------------------
:BadMain
    echo.
    echo Bad/Non-existing directory: %ProjectDir%
    echo.
    echo Please ensure that you have permission to access/create directories
    echo and files.  Also ensure that directory listed exists.
    echo.

    echo. >> %LogFileName%
    echo Bad/Non-existing directory: %ProjectDir% >> %LogFileName%
    echo. >> %LogFileName%

    GOTO EXITPT
:: -----------------------------------------------------------------
:: ERROR: a file failed to be Exported.
:: -----------------------------------------------------------------
:ProjFail
    ECHO.
    ECHO *** ERROR:  Failed to Export Project: %Project% on Host: %Host%
    ECHO.
    ECHO Please ensure that nobody else is accessing this server while you
    ECHO are running this Export script.
    ECHO.

    ECHO. >> %LogFileName%
    ECHO *** ERROR:  Failed to Export File: %ProjectDir%%CurrentJob% Project: %Project% on Host: %Host% >> %LogFileName%
    ECHO. >> %LogFileName%

GOTO ENDPOINT
:: -----------------------------------------------------------------
:EXITPT
    ECHO. >> %LogFileName%

:: -----------------------------------------------------------------
:ENDPOINT
    ENDLOCAL
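A hypothetical invocation, assuming the script is saved as DataStageExport.bat and that JobList.txt (one job name per line) already exists under the project's backup folder; the server name, project, credentials and job names below are placeholders:

DataStageExport.bat dsserver MyProject dsadm secret

:: With the hard-coded BaseDir above, the job list is expected at
:: c:\DataStage\DirD\Backups\MyProject\JobList.txt, one job name per line, e.g.:
::     LoadCustomerDim
::     LoadSalesFact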

Tuesday, January 15, 2013

Datastage Job Report Script

by Unknown  |  in Datastage at  4:37 AM

@echo off
:: Required Components:
::     dsjob.exe (Windows Version)
:: -----------------------------------------------------------------
:: Command Line Parameters:
:: 1. Host
:: 2. User
:: 3. Password
:: 4. Project
:: -----------------------------------------------------------------
:: Keep everything that is set here local to this script.
:: -----------------------------------------------------------------
    SETLOCAL
:: -----------------------------------------------------------------
:: Test for command line parameters.
:: -----------------------------------------------------------------
    IF "%1"=="" GOTO Syntax
    IF "%2"=="" GOTO Syntax
    IF "%3"=="" GOTO Syntax
    IF "%4"=="" GOTO Syntax
:: -----------------------------------------------------------------
:: Set parameters.
:: -----------------------------------------------------------------
    SET Host=%1
    SET User=%2
    SET Password=%3
    SET Project=%4
:: -----------------------------------------------------------------
:: Hard-coded values.  Dependent on each computer.
:: -----------------------------------------------------------------
    SET Designer=C:\Progra~1\Ascential\DataStage\dsdesign.exe
    SET DsJob=C:\Progra~1\Ascential\DataStage\dsjob.exe
    SET JobList=DsJobReportList.txt
    SET ProjectList=ProjectList.txt
    SET DSLog=DsJobReportLog
    SET BackupDir=c:\DataStage\KimD\Jobs\%Project%
:: -----------------------------------------------------------------
:: Get the current Date
:: -----------------------------------------------------------------
    FOR /f "tokens=2-4 delims=/ " %%a in ('DATE/T') do SET DsxDate=%%c%%a%%b
:: -----------------------------------------------------------------
:: Get the current Time
:: -----------------------------------------------------------------
    FOR /f "tokens=1* delims=:" %%a in ('ECHO.^|TIME^|FINDSTR "[0-9]"') do (SET DsxTime=%%b)
:: -----------------------------------------------------------------
:: Set delimiters so that current time can be broken down into components
:: then execute FOR loop to parse the DsxTime variable into Hr/Min/Sec/Hun.
:: -----------------------------------------------------------------
    SET delim1=%DsxTime:~3,1%
    SET delim2=%DsxTime:~9,1%
    FOR /f "tokens=1-4 delims=%delim1%%delim2% " %%a in ('echo %DsxTime%') do (
        set DsxHr=%%a
        set DsxMin=%%b
        set DsxSec=%%c
        set DsxHun=%%d
    )
:: -----------------------------------------------------------------
:: If provided directory is missing an ending \, append it.
:: Validate %BackupDir%'s existence.
:: -----------------------------------------------------------------
    if exist %BackupDir%\ set BackupDir=%BackupDir%\
    if NOT exist %BackupDir% GOTO BadMain
:: -----------------------------------------------------------------
:: Set the log file name to improve readability of code.
:: -----------------------------------------------------------------
    SET LogFileName=%BackupDir%%DSLog%-%DsxDate%-%DsxHr%_%DsxMin%_%DsxSec%.log
:: -----------------------------------------------------------------
:: Announce to log of this program's run.
:: -----------------------------------------------------------------
    ECHO. > %LogFileName%
    ECHO DsJobReport ran on %DsxDate% %DsxHr%:%DsxMin%:%DsxSec% with the following parameters >> %LogFileName%
    ECHO Host=%Host% >> %LogFileName%
    ECHO User=%User% >> %LogFileName%
    ECHO BackupDir=%BackupDir%%DsxDate%\ >> %LogFileName%
    ECHO Designer=%Designer% >> %LogFileName%
    ECHO DsJob=%DsJob% >> %LogFileName%
    ECHO ProjectList=%ProjectList% >> %LogFileName%
    ECHO JobList=%JobList% >> %LogFileName%
    ECHO DSLog=%DSLog% >> %LogFileName%
    ECHO. >> %LogFileName%
:: -----------------------------------------------------------------
:: Pull project information (and validate parameters).
:: -----------------------------------------------------------------
    %DsJob% -server %Host% -user %User% -password %Password% -lprojects > %ProjectList%
    IF NOT %ERRORLEVEL%==0 GOTO BadParam
:: -----------------------------------------------------------------
:: Pull job list
:: -----------------------------------------------------------------
    %DsJob% -server %Host% -user %User% -password %Password% -ljobs %Project% > %JobList%
    IF NOT %ERRORLEVEL%==0 GOTO BadParam
:: -----------------------------------------------------------------
:: Report valid projects to log.
:: -----------------------------------------------------------------
    ECHO Jobs found on %Host%: >> %LogFileName%
    type %JobList% >> %LogFileName%
:: -----------------------------------------------------------------
:: Create subdirectory within Backup Directory
:: Validate ability to create %BackupDir%%DsxDate%\.
:: -----------------------------------------------------------------
    if exist %BackupDir%%DsxDate%\ GOTO DirCont
    ECHO ***  Creating: %BackupDir%%DsxDate%\
    mkdir %BackupDir%%DsxDate%\
    IF NOT %ERRORLEVEL%==0 GOTO BadDir
    ECHO. >> %LogFileName%
    ECHO ***  Created: %BackupDir%%DsxDate%\ >> %LogFileName%
    ECHO. >> %LogFileName%
:: -----------------------------------------------------------------
:DirCont
    ECHO.
    ECHO Saving Job Report on %Host% ...
    ECHO.
:: -----------------------------------------------------------------
:: Begin Backup Loop
::
:: NOTE: %ERRORLEVEL% is expanded when the FOR block is parsed, not per iteration,
::       so the loop below uses IF ERRORLEVEL instead.
:: -----------------------------------------------------------------
    for /F "tokens=1" %%i in (%JobList%) do (
        REM Remember the current job name so :ProjFail can report it after the loop breaks.
        SET CurrentJob=%%i
        ECHO Saving Job Report on Job %%i
        ECHO %Designer% /H=%Host% /U=%User% /P=%Password% %Project% %%i /R /RP=%BackupDir%%DsxDate% /RX >> %LogFileName%
        %Designer% /H=%Host% /U=%User% /P=%Password% %Project% %%i /R /RP=%BackupDir%%DsxDate% /RX >> %LogFileName%
        IF ERRORLEVEL 1 GOTO ProjFail

        ECHO. >> %LogFileName%
        ECHO *** Completed Job Report for Job: %%i on Host: %Host% Project: %Project% >> %LogFileName%
        ECHO     to File: %BackupDir%%DsxDate%\%%i.html >> %LogFileName%
        ECHO. >> %LogFileName%
    )
:: -----------------------------------------------------------------
::    ECHO *** Export completed successfully for projects:
::    type %TempFile%
:: -----------------------------------------------------------------
    GOTO EXITPT
:: -----------------------------------------------------------------
:: ERROR: a job failed to be exported.
:: -----------------------------------------------------------------
:ProjFail
    ECHO.
    ECHO *** ERROR:  Failed to Export Job: %CurrentJob% on Host: %Host% on Project: %Project%
    ECHO.
    ECHO Please ensure that nobody else is accessing this server while you
    ECHO are running this backup script.
    ECHO.

    ECHO. >> %LogFileName%
    ECHO *** ERROR:  Failed to Export Job: %CurrentJob% on Host: %Host% on Project: %Project% >> %LogFileName%
    ECHO. >> %LogFileName%
GOTO EXITPT
:: -----------------------------------------------------------------
:: Report that parameters are not valid on screen and in the log file.
:: Note: The password is not reported, for better security.
:: -----------------------------------------------------------------
:BadParam
    echo.
    echo Invalid parameters - Unable to access Server.
    echo.
    echo. >> %LogFileName%
    echo Invalid parameters - Unable to access Server. >> %LogFileName%
    echo. >> %LogFileName%
GOTO Syntax
:: -----------------------------------------------------------------
:: Report that directory is non-existent.
:: -----------------------------------------------------------------
:BadMain
    echo.
    echo Bad/Non-existing directory: %BackupDir%
    echo.
    echo Please ensure that you have permission to access/create directories
    echo and files.  Also ensure that directory listed exists.
    echo.
    echo. >> %LogFileName%
    echo Bad/Non-existing directory: %BackupDir% >> %LogFileName%
    echo. >> %LogFileName%
GOTO EXITPT
:: -----------------------------------------------------------------
:: Report that program does not have privs to create directory.
:: -----------------------------------------------------------------
:BadDir
    echo.
    echo Unable to create subdirectory: %BackupDir%%DsxDate%\
    echo.
    echo Please ensure that you have permission to access/create directories
    echo and files.
    echo.
    echo. >> %LogFileName%
    echo Unable to create subdirectory: %BackupDir%%DsxDate%\ >> %LogFileName%
    echo. >> %LogFileName%
GOTO EXITPT
:: -----------------------------------------------------------------
:: Report proper syntax usage.
:: -----------------------------------------------------------------
:Syntax
    echo.
    echo Usage: DsJobReport.bat Host User Password Project
    echo.
GOTO ENDPOINT
:: -----------------------------------------------------------------
:EXITPT
    ECHO. >> %LogFileName%
:: -----------------------------------------------------------------
:ENDPOINT
    ENDLOCAL
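A hypothetical invocation; note that this script takes its parameters in the order Host User Password Project, and the server name, credentials and project below are placeholders:

DsJobReport.bat dsserver dsadm secret MyProject

:: Job report HTML files land under c:\DataStage\KimD\Jobs\MyProject\<yyyymmdd>\<JobName>.html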

Tuesday, January 8, 2013

Ton of Data Warehousing and Datastage Interview Questions

by Unknown  |  in Q&A at  1:32 AM

·         Types of Stages in DS? Explain with Examples
·         What are active stages and passive stages?
·         Can you filter data in hashed file? (No)
·         Difference between sequential and hashed file?
·         How do you populate time dimension?
·         Can we use target hashed file as lookup? (Yes)
·         What is Merge Stage?
·         What is Job Sequencer?
·         What are stages in sequences?
·         How do you pass parameters?
·         What parameters you used in your project?
·         What are log tables?
·         What is job controlling?
·         Facts and dimension tables?
·         Conformed dimensions?
·         Difference between OLTP and OLAP?
·         Difference between star schema and snowflake schema?
·         What are hierarchies? Examples?
·         What are materialized views?
·         What is aggregation?
·         What is surrogate key? Is it used for both fact and dimension tables?
·         Why do you go for oracle sequence generator rather than datastage routine?
·         Flow of data in datastage?
·         Initial loading and incremental loading?
·         What is SCD? Types?
·         How do you develop SCD type2 in your project?
·         How do you load dimension data and fact data? Which is first?
·         Difference between oracle function and procedure?
·         Difference between unique and primary key?
·         Difference between union and union all?
·         What is minus operator?
·         What is audit table?
·         If there is a large hash file and a smaller Oracle table, and you are looking up from a transformer in different jobs, which will be faster?
·         Tell me about SCD’s?
·         How did you implement SCD in your project?
·         What are derivations in transformer?
·         How do you use surrogate key in reporting?
·         Log view in DataStage vs. logs in Informatica - which is clearer?
·         How does pivot stage work?
·         What is surrogate key? What is the importance of it? How did you implement it in your project?
·         In total, how many jobs did you develop, and how many lookups did you use?
·         How do constraints in a transformer work?
·         How will you declare a constraint in DataStage?
·         How will you handle rejected data?
·         Give me some performance tips in DataStage?
·         Can we use sequential file as a lookup?
·         How does a hashed file stage perform a lookup?
·         Why can’t we use sequential file as a lookup?
·         What is data warehouse?
·         What is ‘Star-Schema’?
·         What is ‘Snowflake-Schema’?
·         What is difference between Star-Schema and Snowflake-Schema?
·         What is mean by surrogate key?
·         What is ‘Conformed Dimension’?
·         What is Factless Fact Table?
·         When will we use connected and unconnected lookup?
·         Which cache supports connected and unconnected lookup?
·         What is the difference between SCD Type2 and SCD Type3?
·         What is difference between data mart and data warehouse?
·         What is composite key?
·         What is surrogate key? When you will go for it?
·         What is dimensional modeling?
·         What are SCD and SGT? Difference between them? Example of SGT from your project.
·         How do you import your source and targets? What are the types of sources and targets?
·         What do Active Stages and Passive Stages mean in DataStage?
·         What is the difference between Informatica and DataStage? Which do you think is best?
·         What are the stages you used in your project?
·         What do you mean by parallel processing?
·         What is difference between Merge Stage and Join Stage?
·         What is difference between Copy Stage and Transformer Stage?
·         What is difference between ODBC Stage and OCI Stage?
·         What is difference between Lookup Stage and Join Stage?
·         What is difference between Change Capture Stage and Difference Stage?
·         What is difference between Hashed file and Sequential File?
·         What are different Joins used in Join Stage?
·         How you decide when to go for join stage and lookup stage?
·         What is partition key? Which key is used in round robin partition?
·         How do you handle SCD in datastage?
·         What are Change Capture Stage and Change Apply Stages?
·         How many streams can you give to the transformer?
·         What is primary link and reference link?
·         What is a routine? What are before and after subroutines? Are they run before/after a job or a stage?
·         What is a Config File? Does each job have its own config file, or is only one needed?
·         What is a Node?
·         What is IPC Stage? How does it increase performance?
·         What is a Sequential buffer?
·         What are Link Partitioner and Link Collector?
·         What performance tuning have you done in your project?
·         Did you do scheduling? How? Can you schedule a job at the end of every month? How?
·         What is a job sequence? Have you run any jobs with one?
·         What is status view? Why do you clear it? If you clear the status view, what is done internally?
·         What is a hashed file? What are the types of hashed files? Which did you use? What is the default?
·         What is the main advantage of a hashed file? What is the difference between them (static and dynamic)?
·         What are containers? Give example from your project.
·         What are parameters and parameter file?
·         How do you convert columns to rows and rows to columns in DataStage? (Using Pivot Stage.)
·         What is Pivot Stage?
·         What is the execution flow of constraints, derivations and variables in the transformer stage? What are these?
·         How do you eliminate duplicates in DataStage? Can you use a hashed file for it?
·         If the 1st and 8th records are duplicates, which will be skipped? Can you configure it?
·         How do you import and export DataStage jobs? What is the file extension? (See each component while importing and exporting.)
·         How do you rate yourself in DataStage?
·         Explain DataStage Architecture?
·         What is repository? What are the repository items?
·         What is difference between routine and transform?
·         When do you write routines?
·         What is the complex situation you faced in DataStage?
·         What are system variables? Which system variables did you use in your project?
·         What are the different DataStage functions used in your project?
·         Difference between star schema and snowflake schema?
·         What are conformed, degenerate and junk dimensions?
·         What are conformed facts?
·         Different type of facts and their examples?
·         What are approaches in developing data warehouse?
·         Different types of hashed files?
·         What are routines and transforms? How you used in your project?
·         Difference between Data Mart and Data Warehouse?
·         What is surrogate key? How do you generate it?
·         What are environment variables and global variables?
·         How do you improve the performance of the job?
·         What is SCD? How did you develop SCD Type 1 and SCD Type 2?
·         How do you generate surrogate key in datastage?
·         What is job sequence?
·         What are plug-ins?
·         How much data do you get every day?
·         What is the biggest table and size in your schema or in your project?
·         What is the size of data warehouse (by loading data)?
·         How do you improve the performance of the hashed file?
·         What is IPC Stage?
·         What are the different types of stages and used in your project?
·         What are the operations you can do in IPC Stage and transformer stage?
·         What is merge stage? How do you merge two flat files?
·         What is difference between ODBC and ORACLE OCI stage?
·         What is the difference between a sequential file and a hashed file?
·         Can you use a sequential file as a source to a hashed file? Have you done it? What error will it give?
·         Why does a hashed file improve performance?
·         Can the aggregator and transformer stages be used for sorting data? How?
·         How many input links can you give to a transformer?
·         Definition of Slowly Changing Dimensions? Types?
·         What are Iconv and Oconv functions?
·         What is the advantage of using the OCI stage as compared to the ODBC stage?
·         What is the difference between Interprocess and inprocess? Which one is the best?

Thursday, January 3, 2013

Datastage Interview Questions and Answers

by Unknown  |  in Q&A at  9:18 PM

Question: Dimension Modeling types along with their significance
Answer:
Data Modelling is broadly classified into 2 types.
A) E-R Diagrams (Entity - Relationships).
B) Dimensional Modelling.

Question: What are the 2 types of Dimensional Modelling?
Answer:
A) Star Schema - Simple & Much Faster. Denormalized form.
B) Snowflake Schema - Complex with more Granularity. More normalized form.

Question: Importance of Surrogate Key in Data warehousing?
Answer:
Surrogate Key is a Primary Key for a Dimension table. The main advantage of using it is
that it is independent of the underlying database, i.e. the Surrogate Key is not affected
by changes going on in the database.

Question: Differentiate Database data and Data warehouse data?
Answer:
Data in a Database is
A) Detailed or Transactional.
B) Both Readable and Writable.
C) Current.
Data in a Data warehouse, by contrast, is summarized, mostly read-only, and historical.

Question: What is the flow of loading data into fact & dimensional tables?
Answer:
Fact table - Table with Collection of Foreign Keys corresponding to the Primary
Keys in Dimensional table. Consists of fields with numeric values.
Dimension table - Table with Unique Primary Key.
Load - Data should first be loaded into the dimension table. Based on the primary key
values in the dimension table, data should then be loaded into the Fact table.

Question: Orchestrate Vs Datastage Parallel Extender?
Answer:
Orchestrate itself is an ETL tool with extensive parallel processing capabilities, running
on UNIX platforms. Datastage used Orchestrate with Datastage XE (beta version of 6.0) to
incorporate the parallel processing capabilities. Ascential then purchased Orchestrate and
integrated it with Datastage XE, releasing a new version, Datastage 6.0, i.e. Parallel
Extender.

Question: Differentiate Primary Key and Partition Key?
Answer:
Primary Key is a combination of unique and not null. It can be a collection of key values,
called a composite primary key. Partition Key is just a part of the Primary Key. There are
several methods of partitioning such as Hash, DB2, Random etc. While using Hash partitioning
we specify the Partition Key.

Question: What are Stage Variables, Derivations and Constraints?
Answer:
Stage Variable - An intermediate processing variable that retains its value between rows
and doesn’t pass the value into a target column.
Constraint - A condition, either true or false, that controls the flow of data through a link.
Derivation - An expression that specifies the value to be passed on to the target column.

Question: What is the default cache size? How do you change the cache size if
needed?
Answer:
Default cache size is 256 MB. We can increase it by going into the Datastage Administrator,
selecting the Tunables tab and specifying the cache size there.

Question: What is Hash file stage and what is it used for?
Answer:
Used for look-ups. It is like a reference table. It is also used in place of ODBC or OCI
tables for better performance.

Question: What are types of Hashed File?
Answer:
Hashed File is classified broadly into 2 types.
A) Static - Sub divided into 17 types based on Primary Key Pattern.
B) Dynamic - sub divided into 2 types
i) Generic
ii) Specific
Default Hashed file is "Dynamic - Type Random 30 D"

Question: What are Static Hash files and Dynamic Hash files?
Answer:
The names themselves suggest what they mean. In general we use Type-30 dynamic hash
files. The data file has a default size of 2 GB, and the overflow file is used if the data
exceeds the 2 GB size.

Question: What is the Usage of Containers? What are its types?
Answer:
Container is a collection of stages used for the purpose of Reusability.
There are 2 types of Containers.
A) Local Container: Job Specific
B) Shared Container: Used in any job within a project.

Question: How do you execute a DataStage job from the command line prompt?
Answer:
Using "dsjob" command as follows.
dsjob -run -jobstatus projectname jobname
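A slightly fuller sketch in Windows batch: run a job with a parameter and wait for its status. The server, credentials, project, job and parameter name are placeholders; -server/-user/-password, -run, -param and -jobstatus are standard dsjob switches, but the exit-code mapping varies by version (commonly 1 = ran OK, 2 = finished with warnings, 3 = aborted), so verify it on yours:

@echo off
SET DsJob=C:\Progra~1\Ascential\DataStage\dsjob.exe

:: -jobstatus waits for the job and returns its finishing status as the exit code.
%DsJob% -server dsserver -user dsadm -password secret ^
    -run -jobstatus -param SRC_FILE=C:\data\input.txt MyProject MyJob

IF NOT "%ERRORLEVEL%"=="1" ECHO Job did not finish OK (status %ERRORLEVEL%).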

Question: What are the command line functions that import and export the DS
jobs?
Answer:
dsimport.exe - imports the DataStage components.
dsexport.exe - exports the DataStage components.
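For example, following the same command syntax the export script above uses (host, credentials, project and job name are placeholders; dsimport's switches vary slightly by version, so check its built-in usage text):

:: Export a single job to a .dsx file
C:\Progra~1\Ascential\DataStage7.5.1\DsExport.exe /H=dsserver /U=dsadm /P=secret /JOB=MyJob MyProject MyJob.dsx

:: Import the .dsx back into a (possibly different) project
C:\Progra~1\Ascential\DataStage7.5.1\dsimport.exe /H=dsserver /U=dsadm /P=secret MyProject MyJob.dsx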

Question: How do you run a Shell Script within the scope of a DataStage job?
Answer:
By using the "ExecSH" command in the Before/After job properties.

Question: What are OConv () and Iconv () functions and where are they used?
Answer:
IConv() - Converts a string to an internal storage format.
OConv() - Converts an expression to an output format.

Question: How do you handle Date conversions in Datastage? Convert mm/dd/yyyy
format to yyyy-dd-mm?
Answer:
We use
a) "Iconv" function - Internal Conversion.
b) "Oconv" function - External Conversion.
The function to convert mm/dd/yyyy format to yyyy-dd-mm is
Oconv(Iconv(Fieldname,"D/MDY[2,2,4]"),"D-YDM[4,2,2]")

Question: Types of Parallel Processing?
Answer:
Parallel Processing is broadly classified into 2 types.
a) SMP - Symmetrical Multi Processing.
b) MPP - Massive Parallel Processing.

Question: What does a Config File in parallel extender consist of?
Answer:
Config file consists of the following.
a) Number of Processes or Nodes.
b) Actual Disk Storage Location.

Question: Functionality of Link Partitioner and Link Collector?
Answer:
Link Partitioner: It actually splits data into various partitions or data flows using various
Partition methods.
Link Collector: It collects the data coming from the partitions, merges it into a single
data flow and loads it to the target.

Question: What is Modulus and Splitting in Dynamic Hashed File?
Answer:
In a Hashed File, the size of the file keeps changing randomly.
If the size of the file increases, it is called "Modulus".
If the size of the file decreases, it is called "Splitting".

Question: Types of views in Datastage Director?
Answer:
There are 3 types of views in Datastage Director:
a) Job View - Dates of jobs compiled.
b) Status View - Status of the job's last run.
c) Log View - Warning messages, event messages, program-generated messages.

Question: Have you ever been involved in updating DS versions (like DS 5.X)? If so, tell
us some of the steps you took in doing so.
Answer:
Yes.
The following are some of the steps:
1. Definitely take a backup of the whole project(s) by exporting the project as a .dsx file.
2. Use the same parent folder for the new version so that old jobs using hard-coded file
paths continue to work.
3. After installing the new version import the old project(s) and you have to compile
them all again. You can use 'Compile All' tool for this.
4. Make sure that all your DB DSN's are created with the same name as old ones. This
step is for moving DS from one machine to another.
5. In case you are just upgrading your DB from Oracle 8i to Oracle 9i, there is a tool on
the DS CD that can do this for you.
6. Do not stop the 6.0 server before the upgrade, version 7.0 install process collects
project information during the upgrade. There is NO rework (recompilation of
existing jobs/routines) needed after the upgrade.

Question: How did you handle reject data?
Answer:
Typically a Reject link is defined and the rejected data is loaded back into the data
warehouse. So a Reject link has to be defined for every Output link from which you wish to
collect rejected data. Rejected data is typically bad data like duplicate primary keys or
null rows where data is expected.

Question: What other performance tuning have you done in your last project
to increase the performance of slowly running jobs?
Answer:
• Staged the data coming from ODBC/OCI/DB2UDB stages (or any database) on the server
using Hash/Sequential files, for optimum performance and for data recovery in case the
job aborts.
• Tuned the OCI stage's 'Array Size' and 'Rows per Transaction' numerical values for
faster inserts, updates and selects.
• Tuned the 'Project Tunables' in Administrator for better performance.
• Used sorted data for the Aggregator.
• Sorted the data as much as possible in the DB and reduced the use of DS-Sort, for
better performance of jobs.
• Removed unused data from the source as early as possible in the job.
• Worked with the DB admin to create appropriate indexes on tables, for better
performance of DS queries.
• Converted some of the complex joins/business logic in DS to stored procedures for
faster execution of the jobs.
• If an input file has an excessive number of rows and can be split up, use standard
logic to run jobs in parallel.
• Before writing a routine or a transform, make sure that the required functionality is
not already in one of the standard routines supplied in the sdk or ds utilities categories.
• Constraints are generally CPU intensive and take a significant amount of time to
process. This may be the case if the constraint calls routines or external macros, but if
it is inline code then the overhead will be minimal.
• Try to have the constraints in the 'Selection' criteria of the jobs themselves. This
will eliminate unnecessary records before joins are made.
• Tuning should occur on a job-by-job basis.
• Use the power of the DBMS.
• Try not to use a Sort stage when you can use an ORDER BY clause in the database.
• Using a constraint to filter a record set is much slower than performing a SELECT …
WHERE….
• Make every attempt to use the bulk loader for your particular database. Bulk loaders
are generally faster than using ODBC or OLE.

Question: Tell me one situation from your last project where you faced a
problem, and how did you solve it?
Answer:
1. Jobs in which data was read directly from OCI stages were running extremely slow. I
had to stage the data before sending it to the transformer to make the jobs run faster.
2. A job aborted in the middle of loading some 500,000 rows. You have the option of
either cleaning/deleting the loaded data and then running the fixed job, or running the
job again from the row at which it aborted. To make sure the load was proper, we opted
for the former.

Question: Tell me the environment in your last projects
Answer:
Give the OS of the Server and the OS of the Client of your most recent project.

Question: How did you connect to DB2 in your last project?
Answer:
Most of the time the data was sent to us in the form of flat files; the data was dumped
and sent to us. In some cases, where we needed to connect to DB2 for look-ups, we used
ODBC drivers to connect to DB2 (or) DB2-UDB, depending on the situation and availability.
Certainly DB2-UDB is better in terms of performance, as native drivers are always better
than ODBC drivers. 'iSeries Access ODBC Driver 9.00.02.02' - ODBC driver to connect to
AS400/DB2.

Question: What are Routines and where/how are they written and have you written
any routines before?
Answer:
Routines are stored in the Routines branch of the DataStage Repository, where you can
create, view or edit them.
The following are different types of Routines:
1. Transform Functions
2. Before-After Job subroutines
3. Job Control Routines

Question: How did you handle an 'Aborted' sequencer?
Answer:
In almost all cases we have to delete the data it inserted from the DB manually, fix
the job, and then run the job again.

Question: What are Sequencers?
Answer:
Sequencers are job control programs that execute other jobs with preset Job parameters.

Question: Read the String functions in DS
Answer:
Functions like [] -> the sub-string function and ':' -> the concatenation operator.
Syntax:
string [ [ start, ] length ]
string [ delimiter, instance, repeats ]
For example, "DataStage"[1,4] returns "Data", and "Data" : "Stage" returns "DataStage".
Question: What will you do in a situation where somebody wants to send you a file and
use that file as an input or reference, and then run the job?
Answer:
• Under Windows: Use the 'WaitForFileActivity' stage under the Sequencers and then run
the job. You could schedule the sequencer around the time the file is expected to arrive.
• Under UNIX: Poll for the file. Once the file has arrived, start the job or sequencer
depending on the file.

Question: What is the utility you use to schedule the jobs on a UNIX server other
than using Ascential Director?
Answer:
Use the crontab utility along with the dsjob command-line utility, passing the proper parameters.



Question: How would you call an external Java function that is not supported by
DataStage?
Answer:
Starting from DS 6.0 we have the ability to call external Java functions using a Java
package from Ascential. In this case we can even use the command line to invoke the
Java function, write the return values from the Java program (if any) to a file, and use
that file as a source in a DataStage job.
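A minimal batch sketch of that command-line approach; the jar, class name and output path are hypothetical:

@echo off
:: Invoke the external Java function and capture its stdout in a file.
SET JavaOut=C:\data\java_output.txt
java -cp C:\libs\myfuncs.jar com.example.MyFunction arg1 arg2 > %JavaOut%

:: A DataStage job can then read %JavaOut% with a Sequential File stage.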

Question: How will you determine the sequence of jobs to load into data warehouse?
Answer:
First we execute the jobs that load the data into Dimension tables, then Fact tables, then
load the Aggregator tables (if any).

Question: The above might raise another question: why do we have to load the
dimension tables first, then the fact tables?
Answer:
As we load the dimensional tables the keys (primary) are generated and these keys
(primary) are Foreign keys in Fact tables.

Question: Does the selection of 'Clear the table and Insert rows' in the ODBC stage
send a Truncate statement to the DB, or does it do some kind of Delete logic?
Answer:
There is no TRUNCATE on ODBC stages. It is Clear table blah blah and that is a delete
from statement. On an OCI stage such as Oracle, you do have both Clear and Truncate
options. They are radically different in permissions (Truncate requires you to have alter
table permissions where Delete doesn't).

Question: How do you rename all of the jobs to support your new File-naming
conventions?
Answer:
Create an Excel spreadsheet with new and old names. Export the whole project as a dsx.
Write a Perl program, which can do a simple rename of the strings looking up the Excel
file. Then import the new dsx file probably into a new project for testing. Recompile all
jobs. Be aware that job names may also be referenced in your job control jobs or
Sequencer jobs, so you have to make the necessary changes to those Sequencers as well.

Question: When should we use ODS?
Answer:
DWHs are typically read-only and batch-updated on a schedule.
ODSs are maintained in more real time, trickle-fed constantly.


Question: In how many places can you call Routines?
Answer:
You can call them in four places:
1. Transform of a routine
   a. Date Transformation
   b. Upstring Transformation
2. Transform of the Before & After Subroutines
3. XML transformation
4. Web-based transformation

Question: What is the Batch Program and how is it generated?
Answer: A batch program is generated at run time and maintained by Datastage itself, but
you can easily change it on the basis of your requirements (Extraction, Transformation,
Loading). Batch programs are generated depending on your job's nature, either a simple
job or a sequencer job; you can see this program under the job control option.

Question: Suppose 4 jobs are controlled by a sequencer (job 1, job 2, job 3, job 4).
Job 1 has 10,000 rows, but after the run only 5,000 rows have been loaded into the target
table, the rest are not loaded, and the job aborts. How can you sort out the problem?
Answer:
Suppose the job sequencer synchronizes or controls the 4 jobs but job 1 has a problem. In
this situation you should go to the Director and check what type of problem is showing:
a data type problem, a warning message, a job failure or a job abort. A job failure means
a data type problem or a missing column action. Go to the Run window -> Tracing ->
Performance, or in your target table -> General -> Action, where there are two options:
(i) On Fail -- Commit, Continue
(ii) On Skip -- Commit, Continue
First check how much data has already been loaded, then select the On Skip option with
Continue to skip the rows already loaded; for the remaining data that was not loaded,
select On Fail with Continue. Run the job again and you should get a success message.

Question: What happens if RCP is disabled?
Answer:
In such a case, OSH has to perform the import and export every time the job runs, and the
processing time of the job increases.


Question: What is the difference between the Filter stage and the Switch stage?
Answer:
There are two main differences, and probably some minor ones as well. The two main
differences are as follows:
1) The Filter stage can send one input row to more than one output link. The Switch
stage cannot - the C switch construct has an implicit break in every case.
2) The Switch stage is limited to 128 output links; the Filter stage can have a
theoretically unlimited number of output links. (Note: this is not a challenge!)

Question: How can I achieve constraint-based loading using DataStage 7.5? My target
tables have interdependencies, i.e. primary key/foreign key constraints. I want my
primary key tables to be loaded first and then my foreign key tables, and the primary key
tables should be committed before the foreign key tables are executed. How can I go
about it?
Answer:
1) Create a Job Sequencer to load your tables in sequential mode.
In the sequencer, call all the Primary Key table loading jobs first, followed by the
Foreign Key table jobs; trigger the Foreign Key table load jobs only when the Primary
Key load jobs run successfully (i.e. an OK trigger), as in the sketch below.
2) To improve the performance of the job, you can disable all the constraints on the
tables and load them. Once loading is done, check the integrity of the data; rows that
do not meet it are raised as exception data and cleansed.
This is only a suggestion; normally, when loading with constraints enabled, performance
goes down drastically.
3) If you use Star schema modeling, when you create the physical DB from the model you
can delete all constraints, and referential integrity will be maintained in the ETL
process by referring to all your dimension keys while loading the fact tables. Once all
dimension keys are assigned to a fact, then dimension and fact can be loaded together.
At the same time, RI is maintained at the ETL process level.
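A minimal batch sketch of approach 1 driven by dsjob instead of a sequencer, approximating the OK trigger by checking the exit status; the server, credentials, project and job names are placeholders, and the status-code mapping (1 = ran OK on common versions) should be verified for your dsjob version:

@echo off
SET DsJob=C:\Progra~1\Ascential\DataStage\dsjob.exe
SET Conn=-server dsserver -user dsadm -password secret

:: Load the primary key (dimension) tables first and wait for the status.
%DsJob% %Conn% -run -jobstatus MyProject LoadPrimaryKeyTables
IF NOT "%ERRORLEVEL%"=="1" GOTO Fail

:: Only when the PK load finished OK, load the foreign key (fact) tables.
%DsJob% %Conn% -run -jobstatus MyProject LoadForeignKeyTables
IF NOT "%ERRORLEVEL%"=="1" GOTO Fail
GOTO :EOF

:Fail
ECHO Load stopped: previous job did not finish OK (status %ERRORLEVEL%).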

Question: How do you merge two files in DS?
Answer: Either use the Copy command as a Before-job subroutine if the metadata of the 2
files is the same, or create a job to concatenate the 2 files into one if the metadata is
different.

Question: How do you eliminate duplicate rows?
Answer: DataStage provides a Remove Duplicates stage in Enterprise Edition. Using that
stage we can eliminate duplicates based on a key column.

Question: How do you pass a filename as a parameter to a job?
Answer: During job development we can create a parameter 'FILE_NAME', and the value can
be passed while running the job, e.g. from the Director or the dsjob command line.
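For instance, from the dsjob command line (server, credentials, project, job and path are placeholders; -param is the standard switch for supplying job parameters):

:: Pass the FILE_NAME job parameter at run time.
C:\Progra~1\Ascential\DataStage\dsjob.exe -server dsserver -user dsadm -password secret ^
    -run -param FILE_NAME=C:\data\input.txt MyProject MyJob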


Question: Is there a mechanism available to export/import individual DataStage
ETL jobs from the UNIX command line?
Answer: Try dscmdexport and dscmdimport. They won't handle the "individual job"
requirement: you can only export full projects from the command line.
You can find the export and import executables on the client machine, usually someplace
like: C:\Program Files\Ascential\DataStage.
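A hypothetical dscmdexport call exporting a whole project (host, credentials, project and target path are placeholders; check the executable's built-in usage text for your version):

:: Export the full project to a single .dsx file.
"C:\Program Files\Ascential\DataStage\dscmdexport.exe" /H=dsserver /U=dsadm /P=secret MyProject C:\backup\MyProject.dsx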

Question: What is the difference between the JOIN stage and the MERGE stage?
Answer:
JOIN: Performs join operations on two or more data sets input to the stage and then
outputs the resulting data set.
MERGE: Combines a sorted master data set with one or more sorted update data sets.
The columns from the records in the master and update data sets are merged so that the
output record contains all the columns from the master record plus any additional
columns from each update record that are required.
A master record and an update record are merged only if both of them have the same
values for the merge key column(s) that we specify. Merge key columns are one or more
columns that exist in both the master and update records.


