
Showing posts with label DI. Show all posts

Wednesday, July 22, 2015

Data Warehouse Testing

by Unknown  |  in DW at  5:24 PM

                                Testing in a data warehouse (DW) is always considered a time-consuming and tedious process; it requires a good understanding of the data model being tested and of the underlying data and business requirements. DW testing can be broken down into smaller strategies:

ETL Phase:-
                ETL is the most important phase, where source data is transformed into target data by implementing business logic. Care must be taken to test it thoroughly and to make sure functional testing is carried out for all the transformation logic implemented.

Historic Data Load/Daily Data Load/Performance Testing:
                Historic data load testing makes sure the system has caught up with current data and is ready to accept new incoming data; it also needs to verify that the new system is in sync with the existing database/system.
                Daily data load testing ensures that the system loads properly and on time with no impact to the business; performance testing should be conducted to ensure the system is capable of processing multiple files and loading data on time.

Regression Testing:-
                Regression testing should be conducted when you have new feeds coming into the system, changes to existing logic, or defect fixes, to ensure that the new updates have not broken any existing functionality.

Best Practices:-
1)      Focus on data quality
2)      Identify critical business functionality
3)      Identify performance bottlenecks
4)      Identify key business components for frequent regression testing
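As a minimal illustration of the data-quality focus above, a reconciliation test can compare source and target row counts and check the natural key for nulls and duplicates. This is a sketch only; the table and column names are hypothetical, and an in-memory SQLite database stands in for the real source and target.

```python
import sqlite3

def reconcile(conn, source_table, target_table, natural_key):
    """Compare row counts and check natural-key quality between source and target."""
    cur = conn.cursor()
    src_count = cur.execute(f"SELECT COUNT(*) FROM {source_table}").fetchone()[0]
    tgt_count = cur.execute(f"SELECT COUNT(*) FROM {target_table}").fetchone()[0]
    null_keys = cur.execute(
        f"SELECT COUNT(*) FROM {target_table} WHERE {natural_key} IS NULL"
    ).fetchone()[0]
    dup_keys = cur.execute(
        f"SELECT COUNT(*) FROM (SELECT {natural_key} FROM {target_table} "
        f"GROUP BY {natural_key} HAVING COUNT(*) > 1)"
    ).fetchone()[0]
    return {"source_rows": src_count, "target_rows": tgt_count,
            "null_keys": null_keys, "duplicate_keys": dup_keys}

# Demonstration with hypothetical staging and dimension tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_customer (cust_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE dim_customer (cust_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO stg_customer VALUES (?, ?)", [(1, "A"), (2, "B")])
conn.executemany("INSERT INTO dim_customer VALUES (?, ?)", [(1, "A"), (2, "B")])
result = reconcile(conn, "stg_customer", "dim_customer", "cust_id")
```

The same four checks (counts, nulls, duplicates) are usually the first things worth automating for frequent regression runs.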

Testing Tools:
                There are no standard DW testing tools available in the market; in most cases testing is done manually. Having said that, QuerySurge has come onto the market and promises to be an automated testing tool for DW testing.

Please visit http://www.querysurge.com/solutions/data-warehouse-testing for more information

Friday, July 10, 2015

Accessing Big Data using DI tools

by Unknown  |  in DI at  3:04 PM
Companies are investing a lot of money to understand the data they have accumulated over many years and the value it can potentially provide.
Hadoop plays a major role in processing/handling Big Data. Hadoop's HDFS is simply a file system in which data files are distributed across multiple computer systems (nodes).
A Hadoop cluster is a set of computer systems which function as the file system.
A single file in Hadoop can be spread over an indefinite number of nodes in the Hadoop cluster.
In theory, there is no limit to the amount of data which the file system can store since it is always possible to add more nodes.
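To make the "spread over many nodes" point concrete: HDFS splits a file into fixed-size blocks (128 MB by default in recent Hadoop versions) and scatters replicas of each block across the nodes. A rough back-of-the-envelope sketch, assuming the default block size and 3x replication:

```python
import math

def hdfs_blocks(file_size_bytes, block_size=128 * 1024 * 1024, replication=3):
    """Return the number of HDFS blocks a file occupies and the raw storage consumed."""
    blocks = math.ceil(file_size_bytes / block_size)
    raw_bytes = file_size_bytes * replication  # every block is replicated
    return blocks, raw_bytes

# A 1 TB file with the default 128 MB block size and 3x replication:
blocks, raw = hdfs_blocks(1024 ** 4)
# 8192 blocks and 3 TB of raw cluster storage, spread across however many nodes exist
```

Adding nodes adds capacity linearly, which is why the file system has no practical upper bound.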

Datastage :-
DataStage has a stage called the Big Data File stage (BDFS) which allows DataStage to read and write from Hadoop.
Before we can utilize this stage in a DataStage job, we have to configure the environment correctly. The following pre-requirements have to be met:

Verify that the Hadoop (BigInsights) cluster is up and running correctly. The status of BigInsights can be checked either from the BigInsights console or from the command line.
Add the BigInsights library path to the dsenv file.
Find out the required connection details to the BigInsights cluster.
BDFS Cluster Host
BDFS Cluster Port Number
BDFS User: User name to access files
BDFS Group: Group name for permissions – Multiple groups can be listed.
The Big Data File stage functions similarly to the Sequential File stage. It can be used as either a source or a target in a job. Other than the required connection properties to the HDFS, the stage has the same exact properties as the Sequential File stage (i.e. First line is column names, Reject mode, Write mode, etc.)


Informatica:-
            Informatica has a handful of Big Data products which allow Informatica customers to process/access data in a Hadoop environment.
Power Exchange Connector:-
            The PowerExchange connector has an inbuilt Hadoop connector which allows you to connect to Hadoop directly.
Informatica Big Data Edition:-
            This edition provides an extensive library of prebuilt transformation capabilities on Hadoop, including
data type conversions and string manipulations, high performance cache-enabled lookups, joiners, sorters,
routers, aggregations, and many more.

Other functionality provided:-
·         Data profiling on Hadoop
·         Data Parsing

·         Entity Extraction and Data Classification

Tuesday, June 9, 2015

How to FTP a File to Mainframe Using DataStage Parallel

by Unknown  |  in DI at  1:09 PM
I had a requirement where we needed to send more than one set of files to a mainframe using DataStage; the catch was that each file had a different set of columns and lengths.

I spent quite a bit of time coming up with a job which can process more than one file with different layouts and lengths.

Here you go….





Sequential File Stage :-



 

Transformer :-
 


FTP Plug in Stage:-



 


You may be wondering why I defined the format as delimited when all mainframe files are fixed width. The reason is that if you define it as fixed width, you have to give the exact length of each file. Instead, I made it delimited and set the delimiter to #$% (you can use even more characters), a string which never occurs in the data, so the job reads each line as-is and sends the file to the mainframe unchanged.
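For comparison, the same "send the lines as-is and let the mainframe allocate a fixed-width dataset" approach can be scripted outside DataStage with Python's standard ftplib. The host, credentials, dataset name, and the LRECL/BLKSIZE values below are placeholders, not values from the original job.

```python
from ftplib import FTP

def mainframe_site_commands(lrecl, blksize, recfm="FB"):
    """Build the SITE commands that tell a mainframe FTP server how to
    allocate the target dataset (record format and record/block lengths)."""
    return [
        f"SITE RECFM={recfm}",
        f"SITE LRECL={lrecl}",
        f"SITE BLKSIZE={blksize}",
    ]

def send_to_mainframe(host, user, password, local_path, dataset, lrecl):
    """Send a file line by line, without parsing columns, to a mainframe dataset."""
    ftp = FTP(host)
    ftp.login(user, password)
    for cmd in mainframe_site_commands(lrecl, lrecl * 10):
        ftp.sendcmd(cmd)
    with open(local_path, "rb") as fh:
        # Quoting the dataset name makes the server treat it as fully qualified.
        ftp.storlines(f"STOR '{dataset}'", fh)
    ftp.quit()

# Example call (placeholder values):
# send_to_mainframe("mvs.example.com", "user", "pw", "out.dat", "MY.PDS.FILE", 80)
```

Because `storlines` transfers record by record, no column layout is needed, matching the delimiter trick described above.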



Note :-- I didn't do any data manipulation during the FTP; that's why I read the entire file as-is and sent it. If you want to do data manipulation, you need to bring in the file layout, which ties the job to processing that particular layout and makes it unsuitable for other layouts.

Monday, May 18, 2015

Talend Interview Questions

by Unknown  |  in Q&A at  6:34 PM
This is a collection of Talend interview questions which I gathered from across the net.


2.    Difference between tAggregateRow and tAggregateSortedRow.
4.    How to improve the performance of Talend job having complex design?
5.    Difference between built in schema and Repository.
6.    What is the declaration of method which we define in system routine?
7.    What is XMS and XMX parameter in Talend?
8.    How to resolve heap space issue in Talend ?
9.    How to do the exception handling in Talend?
10.  What is Default join for tMap.
11.  What are the different lookup patterns available with Talend?
12.  What is the basic requirement while updating a particular table?
13.  How to generate surrogate key by using Talend?
14.  What is the use of Expression editor in Talend?
15.  How to debug a particular Talend job.
16.  What is context variable and context group?
17.  How to pass the variables from parent job to child job and from child job to Parent?
18.  How to forcefully exit the job.
19.  Explain the use of tContextload.
20.  How to execute multiple queries by using Talend?
21.  How to do the multithreading while executing the Job?
22.  What is hashmap in Talend and how to use it?
23.  How to do a full join in Talend? Explain the steps.
24.  How to do a right outer join in Talend? Explain the steps.
25.  How do the ELT database components differ from the ETL database components?
27.  How to pass data from parent job to child jobs through tRunjob component ?
28.  How to load context variables dynamically?
29.  How to Share DB Connection in Talend?
30.  Skip Header rows and footer rows before load.
31.  What is Incremental Load? Describe using Talend.
32.  How can you pass a value from parent job to child job in Talend?
33.  How to call stored procedure and function in Talend Job
34.   What is Talend ?
35.   What is difference between ETL and ELT components of Talend ?
36.   How to deploy talend projects ?
37.   What are types of available version of Talend ?
38.   How to implement versioning for talend jobs ?
39.   What is tMap component ?
40.   What is the difference between tMap and tJoin components?
41.   Which component is used to sort the data?
42.   How to perform aggregate operations/functions on data in talend ?
43.   What types of joins are supported by tMap component ?
44.   How to schedule a talend job ?
45.   How to run a talend job as a web service ? 
46.   How to Integrate SVN with Talend ? 
47.   How to run talend jobs on Remote server ? 
48.   How to pass data from parent job to child jobs through trunjob component ?
49.   How to load context variables dynamically from file/database ?
50.   How to run talend jobs in Parallel ?
51.   What is Context variables ? 
52.   How to export a talend job ?
53.   Difference between TOS & TIS & ESB
54.    Difference between tJava, tjavarow & tJavaFlex
55.   How you perform Exception Handling without using component like tjava,tjavarow?
56.   How you design SCD component
57.   Transfer of context from main job to child job

What exactly is Data Lineage? and Data Lineage tools

by Unknown  |  in DW at  6:24 PM
Metadata management has become a key area for companies to keep track of the information passing through many gates and to understand its value and how it changes from one BU to another. 
Why Data Lineage :-
Let's imagine that a user has complained about a customer having multiple records with different customer information. In the current world, understanding the root cause of this without metadata information involves SMEs (subject matter experts) and takes ages to back-track. 
Data lineage can answer these questions by tracing the data path (its "lineage") upstream from target back to source, capturing the original source, the data flow transformations, and the target information as well.
How to Track :-
This lineage should be presented in a visual format, preferably with options for viewing at a summary level with an option to drill down for individual column and process details.
Knowing the original source, and understanding “what happens” to the data as it flows to a report helps boost confidence in the results and the overall business intelligence infrastructure. 




Data Lineage Tools :-

Ab Initio
Analytixds
Data Lineage 
DQ On Demand
IBM Ascential
IBM InfoSphere Business Information Exchange
IBM InfoSphere Metadata Workbench
Informatica MetaData Manager (PDF download)
Informatica On Demand
Microsoft SSIS
Oracle DW Builder
Talend
Uniserv

Wednesday, October 1, 2014

DataStage Version Differences

by Unknown  |  in DI at  2:12 AM

1. Design & Runtime Performance Changes:

DS8.5: Implemented by Internal code change. Design and Runtime performance is better than 8.1, 40% performance improvement in job open, save, compile etc.
DS8.7: Improvements in Xmeta. Significant Performance improvement in Job Open, Save, Compile etc.
DS9.1: No Change

2. PX Engine Performance Changes :
DS8.5: Not Exist
DS8.7: Improved partition/sort insertion algorithm. XML parsing performance is improved by 3x or more for large XML files.
DS9.1: No Change

3. Added View Job Log in Designer client
DS8.5: Not Exist
DS8.7: New Feature has been added (Menu -> View -> Job Log) .
             Job log is now viewed in Designer client.
DS9.1: No Change

4. Added Stop/Reset buttons In Designer Client:
DS8.5: Not Exist
DS8.7: Stop/ Reset button added to Compile and Run buttons for the DS jobs.
DS9.1: No Change

5. Interactive Parallel Job Debugging
DS8.5: Not Exist
DS8.7: Breakpoints with conditional logic per link and node. 
          (Link -> Right-click -> Toggle Breakpoint)
          The running job can be continued or aborted using multiple breakpoints with conditional logic per link and node (row data or job parameter values can be examined by the breakpoint conditional logic).
DS9.1: No Change


6. Added Vertical Pivot : Pivot Enterprise Stage
DS8.5: Extended the existing horizontal parallel pivot; the Pivot stage was enhanced to support vertical pivoting (mapping multiple input rows with a common key to a single output row containing multiple columns).
DS8.7: No Change
DS9.1: No Change


7. Balance Optimization :
DS8.5: Balanced Optimization automatically redesigns the job to maximize performance by minimizing the amount of input and output performed and by balancing the processing against the source, intermediate, and target environments. Balanced Optimization lets you take advantage of the power of the databases without becoming an expert in native SQL.
DS8.7: No Change
DS9.1: Balanced Optimization for Hadoop.


8. Transformer Enhancements
DS8.5: Looping in the transformer: multiple output rows can be produced from a single input row.
         1. New input cache: SaveInputRecord(), GetSavedInputRecord().
         2. New system variables: @ITERATION, @Loop Count, @EOD (end-of-data flag for the last row).
         3. Functions: LastRowInGroup(InputColumn).
         4. More null-handling options.
DS8.7: No Change
DS9.1: New transformation expressions have been added. EREPLACE: a function to replace a substring in an expression with another substring. If no occurrence is specified, every occurrence of the substring is replaced.
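As a rough illustration (Python, not DataStage code), EREPLACE's behaviour can be mimicked like this; the occurrence/begin handling is an assumption based on the description above, with occurrence unspecified (or below 1) meaning "replace everything":

```python
def ereplace(expression, substring, replacement, occurrence=0, begin=1):
    """Mimic of an EREPLACE-style function, for illustration only.

    occurrence < 1 replaces every occurrence of substring; otherwise only
    `occurrence` occurrences are replaced, starting from the `begin`-th
    (1-based) occurrence."""
    if occurrence < 1:
        return expression.replace(substring, replacement)
    pieces = expression.split(substring)
    out = pieces[0]
    for i in range(1, len(pieces)):
        # The join between pieces[i-1] and pieces[i] is the i-th occurrence.
        sep = replacement if begin <= i < begin + occurrence else substring
        out += sep + pieces[i]
    return out

# ereplace("a-b-c-d", "-", "+")        -> "a+b+c+d"  (all occurrences)
# ereplace("a-b-c-d", "-", "+", 1, 2)  -> "a-b+c-d"  (one, starting at the 2nd)
```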


9. Big Data File Stage :
DS8.5: Not Exist
DS8.7: Big Data File Stage for Big Data sources (Hadoop Distributed File System-HDFS).
DS9.1: New enhancements:
1. The IBM Big Data solution integrates and manages the full variety, velocity, and volume of data.
2. New Hadoop-based Big Data support ("Any to Big Data").
3. Big Data integration with DataStage.

10. Added Java Integration Stage
DS8.5: Not Exist
DS8.7: Not Exist
DS9.1: A Java Integration stage that invokes Java code and creates the baseline for upcoming big data source support.

11. Added Encryption Techniques
DS8.5: Not Exist
DS8.7: Items encrypted for security reasons:
1. Strongly encrypted credential files for command line utilities.
2. Strongly encrypted job parameter files for dsjob command.
3. Encryption Algorithm and Customization.
DS9.1: No Change

12. Added Dual-stack protocol Support
DS8.5: Not Exist
DS8.7: IPv6 Support: Information Server is fully compatible with IPv6 addresses and can support dual-stack protocol implementations. (Env  Variable: APT_USE_IPV4.)
DS9.1: No Change

13. Added Unstructured text stage
DS8.5: Not Exist
DS8.7: Not Exist
DS9.1: Excel read capabilities on all platforms with rich features to support ranges, multiple worksheets and New Unstructured data read.

14. Added DBMS Connector Boost
DS8.5: Not Exist
DS8.7: Not Exist

DS9.1: New big buffer optimizations which has increased bulk load performance in DB2 and Oracle Connector by more than 50% in many cases.

Thursday, July 17, 2014

Extracting Data From PDF

by Unknown  |  in DI at  2:26 AM
Have you ever tried extracting data from a PDF? You know how hard and time-consuming that is. Anyway, to make our lives easier, there is an open source tool available which allows you to extract data from PDFs.

The tool is Tabula
Website -->  http://tabula.nerdpower.org/

Tabula allows you to extract that data into a CSV or Microsoft Excel spreadsheet using a simple, easy-to-use interface. Tabula works on Mac, Windows and Linux.

Thursday, July 10, 2014

Data Integration Life Cycle

by Unknown  |  in DW at  6:09 AM
The data integration life cycle spans the following areas, and each of them has to follow a set of processes to get it right:
Business Requirement
Functional Design
Mapping Document
Technical Design
Development
Testing
Deployment
Production Support

Business Requirement/Functional Design Process



Mapping Document/Technical Design Process

Development/Testing/Deployment/Production Support







Thursday, July 3, 2014

ETL Rules of Thumb for Loading Dimensions and Facts

by Unknown  |  in DW at  6:47 AM
I would recommend following these simple rules for loading dimension and fact tables:

Dimensions:

1)      No duplicate values in Natural Keys.
2)      A dummy record (-1) has to be present in the table; if the table is truncate-and-load, then inserting the dummy row should be the first step.
3)     The Natural key shouldn't be blank except the dummy record.

Facts:

1)      Any fact field that is not mapped from source should always be 0.
2)      Any fact field that is mapped from source should always be NON-0 value.
3)      All mapped fact key values should tie up with dimensions.
4)      If no date value is received from source then it will be resolved to 12/31/4444 or any other date you like

5)      If no values for Measures are received from source then it will be resolved to 0 .
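The fact-loading defaults above (rules 4 and 5) can be sketched as a small defaulting routine. This is a sketch under assumed names; the column names and row shape are illustrative only.

```python
from datetime import date

UNKNOWN_DATE = date(4444, 12, 31)  # rule 4: sentinel for missing source dates

def apply_fact_defaults(row, measure_cols, date_cols):
    """Apply the rules above: missing measures default to 0,
    missing dates resolve to the sentinel date."""
    out = dict(row)
    for col in measure_cols:
        if out.get(col) is None:
            out[col] = 0
    for col in date_cols:
        if out.get(col) is None:
            out[col] = UNKNOWN_DATE
    return out

# Hypothetical fact row with a missing measure and a missing date:
row = {"sales_amt": None, "qty": 5, "order_dt": None}
clean = apply_fact_defaults(row, ["sales_amt", "qty"], ["order_dt"])
# clean["sales_amt"] == 0 and clean["order_dt"] == date(4444, 12, 31)
```

Rule 3 (key values tying up with dimensions) is a join-time check and belongs with the reconciliation tests rather than row-level defaulting.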


There are many rules that could apply; I recommend these as the important ones.

Monday, April 28, 2014

What is NewSQL

by Unknown  |  in DI at  2:42 AM
I explained the importance of NoSQL in my previous post; today I am going to explain NewSQL.



What is NewSQL ?
              As you will have understood from my previous post, traditional RDBMS databases are struggling to keep up with the new unstructured, large-volume data and with the time required to make any change to their schema structure. NoSQL databases came along as an alternative option for businesses to store and process that data.

          NewSQL databases are the group of modern relational database management systems that seek to provide the same scalable performance as NoSQL systems for online transaction processing (read-write) workloads while still maintaining the ACID guarantees of a traditional database system.


Though NewSQL databases differ in their internal architectures and processing, they are all grouped under one umbrella and call themselves NewSQL databases.


Tuesday, March 25, 2014

How to setup a high speed data load from SSIS package to Netezza database

by Unknown  |  in Other at  9:06 AM
1.  On the SSIS OS install the NPS OLE DB driver by running:
nzoledbsetup.exe (for 32 bit systems)
OR
nzoledbsetup64.exe (for 64 bit systems). 


2.  In SSIS connection manager create connections to your data source(s) and target.  Use Netezza OLE DB provider when defining connection to Netezza.

3.  In SSIS define the data source as per normal (flat file, ODBC source, OLE DB source, etc...)

4.  In SSIS use the “OLE DB Destination” object to send data to the target, Netezza.

5.  In the OLE DB Destination object properties set the following parameter values so that SSIS invokes Netezza’s high speed loader properly.  This dialog box is shown in the screenshot below.
Set AccessMode = OpenRowSet using FastLoad
Set AlwaysUseDefaultCodePage = true
Set FastLoadMaxInsertCommitSize = 0



Tuesday, March 18, 2014

Extracting SSIS Package Metadata

by Unknown  |  in Other at  2:32 AM

Retrieving the definitions of the SSIS Packages


SELECT  p.[name] as [PackageName]
   ,[description] as [PackageDescription]
   ,case [packagetype]
    when 0 then 'Undefined'
    when 1 then 'SQL Server Import and Export Wizard'
    when 2 then 'DTS Designer in SQL Server 2000'
    when 3 then 'SQL Server Replication'
    when 5 then 'SSIS Designer'
    when 6 then 'Maintenance Plan Designer or Wizard'
   end  as [PackageType]
   ,case [packageformat]
    when 0 then 'SSIS 2005 version'
    when 1 then 'SSIS 2008 version'
   end as [PackageFormat]
   ,l.[name] as [Creator]
   ,p.[createdate]
   ,CAST(CAST(packagedata AS VARBINARY(MAX)) AS XML) PackageXML
          FROM      [msdb].[dbo].[sysssispackages]  p
          JOIN  sys.syslogins      l
          ON  p.[ownersid] = l.[sid]



Extracting connection strings from an SSIS Package

;WITH XMLNAMESPACES ('www.microsoft.com/SqlServer/Dts' AS pNS1,
                     'www.microsoft.com/SqlServer/Dts' AS DTS) -- declare XML namespaces
SELECT c.name,
   SSIS_XML.value('./pNS1:Property[@pNS1:Name="DelayValidation"][1]',
      'varchar(100)') AS DelayValidation,
   SSIS_XML.value('./pNS1:Property[@pNS1:Name="ObjectName"][1]',
      'varchar(100)') AS ObjectName,
   SSIS_XML.value('./pNS1:Property[@pNS1:Name="Description"][1]',
      'varchar(100)') AS Description,
   SSIS_XML.value('pNS1:ObjectData[1]/pNS1:ConnectionManager[1]/pNS1:Property[@pNS1:Name="Retain"][1]',
      'varchar(MAX)') AS Retain,
   SSIS_XML.value('pNS1:ObjectData[1]/pNS1:ConnectionManager[1]/pNS1:Property[@pNS1:Name="ConnectionString"][1]',
      'varchar(MAX)') AS ConnectionString
FROM (SELECT id,
             CAST(CAST(packagedata AS VARBINARY(MAX)) AS XML) PackageXML
      FROM [msdb].[dbo].[sysssispackages]) PackageXML
CROSS APPLY PackageXML.nodes('/DTS:Executable/DTS:ConnectionManager') SSIS_XML (SSIS_XML)
INNER JOIN [msdb].[dbo].[sysssispackages] c ON PackageXML.id = c.id

© Copyright 2015 Big Data - DW & BI.