
Friday, July 24, 2015

Tableau 9 Features and a 3 GB Data Experiment

by Unknown  |  in Tools at  10:44 AM

Tableau :-
            Tableau is a market leader in the BI software world. It allows business users to understand their data, then visualize it and create interactive, sharable dashboards. Tableau is simple enough that even a non-technical Excel user can learn it, yet powerful enough to handle the most complex analytical problems. Securely sharing your findings with others takes only seconds.

Tableau 9 has features that let you import data from basic sources such as Excel and text files.







Tableau also has built-in connectors to get data from Big Data platforms, SAP, SAS, cloud services and many other interfaces.


Accessing data Live or Extract :-
      Once you have connected to the data source, Tableau gives you two options for keeping the connection.
Live :-
        The live option keeps you connected to the data source, so you can access live data and see the results immediately in your dashboard; the performance of the report/dashboard depends on the system you are connected to.

Extract:-
         The extract option pulls the data from the source and stores it locally for your use. Extracts are best for handling large amounts of data, and performance is generally better.




Working with 3GB Data on Tableau :-
                                                    I experimented with Tableau on a 3 GB data set. It took around 8 minutes to load the data into the sheet, but once the data was loaded the performance was really good. It takes a little time to get the data into memory; after that it ran fast.











Though it takes time to load the data initially, it performs really well once the data is in memory, and the data stays there until you close the data source connection.
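Tableau's extract engine itself isn't scriptable from this post, but the "slow first load, fast everything after" pattern is easy to reproduce with a rough analogy in Python/pandas (the file name and column names below are made up for illustration):

```python
import time
import pandas as pd

# Hypothetical large CSV; path and column names are placeholders for illustration only.
start = time.time()
df = pd.read_csv("sales_3gb.csv")              # one-time cost: pull everything into memory
print(f"initial load: {time.time() - start:.1f}s")

start = time.time()
summary = df.groupby("region")["amount"].sum() # fast: runs against the in-memory frame
print(f"aggregation:  {time.time() - start:.1f}s")
print(summary.head())
```

Just like a Tableau extract, the expensive step is paid once up front; every interaction after that works against data already held in memory.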

Wednesday, July 22, 2015

Data Warehouse Testing

by Unknown  |  in DW at  5:24 PM

                                Testing a data warehouse (DW) is always considered a time-consuming and tedious process. It requires a good understanding of the data model being tested, of the underlying data, and of the business requirements. DW testing can be broken into smaller strategies:

ETL Phase:-
                ETL is the most important phase, where source data is transformed into target data by applying business logic. Care must be taken to test it thoroughly and to make sure functional testing is carried out for every transformation rule implemented.
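As a very simple, hedged illustration of this kind of functional check in Python (the table and column names stg_orders/dw_orders/amount are invented for the example; real tests would run against the actual source and target databases):

```python
import sqlite3

# Toy source and target databases standing in for the real staging and DW systems.
src = sqlite3.connect(":memory:")
tgt = sqlite3.connect(":memory:")
src.execute("CREATE TABLE stg_orders (order_id INTEGER, amount REAL)")
tgt.execute("CREATE TABLE dw_orders  (order_id INTEGER, amount REAL)")
src.executemany("INSERT INTO stg_orders VALUES (?, ?)", [(1, 10.0), (2, 25.5)])
tgt.executemany("INSERT INTO dw_orders  VALUES (?, ?)", [(1, 10.0), (2, 25.5)])

# Row-count check: every source row should have landed in the target.
src_count = src.execute("SELECT COUNT(*) FROM stg_orders").fetchone()[0]
tgt_count = tgt.execute("SELECT COUNT(*) FROM dw_orders").fetchone()[0]
assert src_count == tgt_count, f"row counts differ: {src_count} vs {tgt_count}"

# Measure reconciliation: totals should still match after the transformation.
src_sum = src.execute("SELECT SUM(amount) FROM stg_orders").fetchone()[0]
tgt_sum = tgt.execute("SELECT SUM(amount) FROM dw_orders").fetchone()[0]
assert abs(src_sum - tgt_sum) < 0.01, f"amount totals differ: {src_sum} vs {tgt_sum}"

print("ETL reconciliation checks passed")
```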

Historic Data Load/ Daily Data Load/Performance Testing:
                Historic data load testing makes sure the system is caught up with current data and ready to accept new incoming data; testing also needs to confirm that the new system is in sync with the existing database/system.
                Daily data load testing ensures the system is loaded properly and on time with no impact to the business. Performance testing is conducted to ensure the system can process multiple files and load the data on time.

Regression Testing:-
                Regression testing should be conducted whenever new feeds come into the system, existing logic changes, or defects are fixed, to ensure the updates have not broken any existing functionality.

Best Practices:-
1)      Focus on data quality
2)      Identify critical business functionality
3)      Identify performance bottlenecks
4)      Identify key business components for frequent regression testing

Testing Tools:
                There are no standard DW testing tools in the market; in most cases testing is done manually. Having said that, QuerySurge has come up in the market and promises to be an automated testing tool for DW testing.

Please visit http://www.querysurge.com/solutions/data-warehouse-testing for more information

Friday, July 10, 2015

Accessing Big Data using DI tools

by Unknown  |  in DI at  3:04 PM
Companies are investing a lot of money to understand the data they have accumulated over many years and the value it can potentially provide.
Hadoop plays a major role in processing and handling Big Data. Hadoop's HDFS is simply a file system in which data files are distributed across multiple computer systems (nodes).
A Hadoop cluster is a set of computer systems which function as the file system.
A single file in Hadoop can be spread over any number of nodes in the Hadoop cluster.
In theory, there is no limit to the amount of data the file system can store, since it is always possible to add more nodes.
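As a small, hedged illustration of working with HDFS (the local file name and HDFS directory below are placeholders, and this assumes the standard "hdfs dfs" command line is available on the machine), a file can be copied into the cluster and listed back like this:

```python
import subprocess

# Placeholder paths for illustration only.
local_file = "sales_2015.csv"
hdfs_dir = "/user/etl/landing"

# Copy the file into HDFS; the file system splits it into blocks and
# distributes the blocks across the data nodes for us.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", local_file, hdfs_dir], check=True)

# Confirm the file is now visible in the cluster.
subprocess.run(["hdfs", "dfs", "-ls", hdfs_dir], check=True)
```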

Datastage :-
DataStage has a stage called the Big Data File stage (BDFS) which allows DataStage to read from and write to Hadoop.
Before we can use this stage in a DataStage job, we have to configure the environment correctly. The following pre-requisites have to be met:

·         Verify that the Hadoop (BigInsights) cluster is up and running correctly. The status of BigInsights can be checked either from the BigInsights console or from the command line.
·         Add the BigInsights library path to the dsenv file.
·         Find out the required connection details for the BigInsights cluster:
o   BDFS Cluster Host
o   BDFS Cluster Port Number
o   BDFS User: user name used to access files
o   BDFS Group: group name for permissions – multiple groups can be listed.
The Big Data File stage functions similarly to the Sequential File stage. It can be used as either a source or a target in a job. Other than the required connection properties for HDFS, the stage has exactly the same properties as the Sequential File stage (i.e. First line is column names, Reject mode, Write mode, etc.)


Informatica:-
            Informatica has a handful of Big Data products that allow Informatica customers to process and access data in a Hadoop environment.
Power Exchange Connector:-
            PowerExchange has a built-in Hadoop connector which allows you to connect to Hadoop directly.
Informatica Big Data Edition:-
            This edition provides an extensive library of prebuilt transformation capabilities on Hadoop, including data type conversions and string manipulations, high-performance cache-enabled lookups, joiners, sorters, routers, aggregations, and many more.

Other functionality provided:-
·         Data profiling on Hadoop
·         Data Parsing

·         Entity Extraction and Data Classification

Sunday, June 14, 2015

Internet of Things (IoT) Players

by Unknown  |  in IoT at  7:04 PM
The Internet of Things (IoT) is one of the hottest topics in every industry; many organizations want to understand it, invest in it, learn it and implement IoT.

Many IT professionals (including me) want to learn this skill and get hands-on experience with IoT.


Leading IoT Players :-

ARM
IBM
CISCO
Google
Bosch
Ericsson
GE
Intel
Microsoft
ThingWorx
Axeda
PTC

You can learn about and find some good IoT projects at
http://www.iot-a.eu/public

Tuesday, June 9, 2015

How to FTP a File to Mainframe Using DataStage Parallel

by Unknown  |  in DI at  1:09 PM
I had a requirement where we needed to send more than one set of files to a mainframe using DataStage; the catch was that each file had a different set of columns and a different record length.

I spent quite a bit of time coming up with a job that can process more than one file with different layouts and lengths.

Here you go….





Sequential File Stage :-



 

Transformer :-
 


FTP Plug-in Stage:-



 


You may be wondering why I defined the files as delimited when all mainframe files are fixed width. The reason is that if you define them as fixed width you have to give the exact record length for each layout. Instead, I made the format delimited and set the delimiter to #$%# (you can use an even longer string), a sequence that never occurs in the data, so the job reads each line as a single column as-is and sends the file to the mainframe unchanged.



Note :-- I didn't do any data manipulation during the FTP; that's why I read the entire file as-is and sent it to FTP. If you want to do data manipulation, you need to bring in the file layout, which makes the job specific to a particular layout and not suitable for other layouts.
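For readers who want to see the same idea outside DataStage, here is a minimal sketch in Python using the standard ftplib module (host, credentials, dataset name and the SITE parameters are placeholders, not values from the job above): the file is read line by line with no column parsing and pushed to the mainframe as-is.

```python
from ftplib import FTP

# Placeholder connection details for illustration only.
HOST, USER, PASSWORD = "mainframe.example.com", "user1", "secret"

with FTP(HOST) as ftp, open("extract_file.txt", "rb") as fh:
    ftp.login(USER, PASSWORD)
    # Optionally tell the z/OS FTP server how to allocate the target dataset.
    ftp.sendcmd("SITE LRECL=120 RECFM=FB")
    # storlines sends the file one line at a time, exactly as it was read.
    ftp.storlines("STOR 'PROD.OUTBOUND.FILE1'", fh)
```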

Thursday, June 4, 2015

BI Report Formatting Guidelines

by Unknown  |  in SAP BO at  11:09 AM
Query Tab
Ø  If there is only one query, name it ‘Query_report name’.  If there are multiple queries, each should have the ‘Query_’ prefix followed by something that reflects the query’s purpose (e.g. the name of the fact table it uses).
Report Tab
Ø  If there is only one tab, it should match the report title.  If there are multiple tabs, each should briefly reflect the purpose of its sub-report.
Report Header
Ø  Report Title – Arial, Bold, Left Aligned, (should match report name, font sizes may vary)
Ø  NO Refresh Date!
Report Block
Ø  Column Headers – Arial, 9 (font size may vary), Bold, Wrapped, Auto Height, Measures Right Justified, Dimensions Left Justified
Ø  Table Data – Arial, 8 (font size may vary), Measures Right Justified, Dimensions Left Justified
Note ----> Font can be changed based on your requirement
Page Header / Footer
Ø  Nothing to be placed in the header, choose to not display it or make it smaller
Ø  Footer should contain the page number – Arial, 10, Right Justified
o   Displayed as:     Page 1 of 5
o   Formula ="Page "+Page()+" of "+NumberOfPages()
Paper Size / Page Orientation
Pick from these options:
Paper Size      Page Orientation
Letter          Portrait/Landscape
Letter          Landscape
Legal           Landscape

Printing
Ø  Ask BI&A Functional Analyst about the default print output for the report and tailor the report for best viewing in that output
o   i.e. PDF or Excel

§  Adjust margins accordingly

Monday, May 18, 2015

Talend Interview Questions

by Unknown  |  in Q&A at  6:34 PM
This is a collection of Talend questions that I gathered from across the net.


2.    Difference between tAggregaterow and tAggregatesortedrow.
4.    How to improve the performance of Talend job having complex design?
5.    Difference between built in schema and Repository.
6.    What is the declaration of method which we define in system routine?
7.    What is XMS and XMX parameter in Talend?
8.    How to resolve heap space issue in Talend ?
9.    How to do the exception handling in Talend?
10.  What is Default join for tMap.
11.  What are the different lookup patterns available with Talend?
12.  What is the basic requirement while updating a particular table?
13.  How to generate surrogate key by using Talend?
14.  What is the use of Expression editor in Talend?
15.  How to debug a particular Talend job.
16.  What is context variable and context group?
17.  How to pass the variables from parent job to child job and from child job to Parent?
18.  How to forcefully exit the job.
19.  Explain the use of tContextload.
20.  How to execute multiple queries by using Talend?
21.  How to do the multithreading while executing the Job?
22.  What is hashmap in Talend and how to use it?
23.  How to do the full join in Talend.Explain the steps.
24.  How to do the right outer join in Talend? Explain the steps?
25.  How do the ELT database components differ from the ETL database components?
27.  How to pass data from parent job to child jobs through tRunjob component ?
28.  How to load context variables dynamically?
29.  How to Share DB Connection in Talend?
30.  Skip Header rows and footer rows before load.
31.  What is Incremental Load? Describe using Talend.
32.  How can you pass a value from parent job to child job in Talend?
33.  How to call stored procedure and function in Talend Job
34.   What is Talend ?
35.   What is difference between ETL and ELT components of Talend ?
36.   How to deploy talend projects ?
37.   What are types of available version of Talend ?
38.   How to implement versioning for talend jobs ?
39.   What is tMap component ?
40.   What is difference between tMap and tJoin components ?
41.   Which component is used to sort that data ?
42.   How to perform aggregate operations/functions on data in talend ?
43.   What types of joins are supported by tMap component ?
44.   How to schedule a talend job ?
45.   How to run a talend job as a web service ?
46.   How to Integrate SVN with Talend ? 
47.   How to run talend jobs on Remote server ? 
48.   How to pass data from parent job to child jobs through trunjob component ?
49.   How to load context variables dynamically from file/database ?
50.   How to run talend jobs in Parallel ?
51.   What is Context variables ? 
52.   How to export a talend job ?
53.   Difference between TOS & TIS & ESB
54.    Difference between tJava, tJavaRow & tJavaFlex
55.   How you perform Exception Handling without using component like tjava,tjavarow?
56.   How you design SCD component
57.   Transfer of context from main job to child job

What exactly is Data Lineage? and Data Lineage tools

by Unknown  |  in DW at  6:24 PM
Metadata management has become a key area for companies to keep track of the information passing through many gates, understand its value, and see how it changes from one BU to another.
Why Data Lineage :-
Let's imagine a user has complained about a customer having multiple records with different customer information. Understanding the root cause of this, in a world without metadata information, involves SMEs (subject matter experts) and takes ages to back-track.
Data lineage can answer these questions by tracing the data path (its “lineage”) upstream from the target back to the source, capturing the original source, the data flow transformations, and the target information as well.
How to Track :-
This lineage should be presented in a visual format, preferably viewable at a summary level with the option to drill down to individual column and process details.
Knowing the original source, and understanding “what happens” to the data as it flows to a report helps boost confidence in the results and the overall business intelligence infrastructure. 
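As a toy illustration of what a lineage answer looks like (the table/column names and the mapping below are entirely invented), column-level lineage can be thought of as a directed graph that is walked upstream from a target column back to its original sources:

```python
# Invented column-level lineage: target column -> list of upstream columns.
lineage = {
    "DW.CUSTOMER_DIM.CUST_NAME": ["STG.CUSTOMER.NAME"],
    "STG.CUSTOMER.NAME": ["SRC.CRM.CUSTOMERS.FULL_NAME"],
    "SRC.CRM.CUSTOMERS.FULL_NAME": [],
}

def trace_upstream(column, graph):
    """Walk upstream from a target column back to its original sources."""
    path = [column]
    for parent in graph.get(column, []):
        path.extend(trace_upstream(parent, graph))
    return path

print(" <- ".join(trace_upstream("DW.CUSTOMER_DIM.CUST_NAME", lineage)))
# DW.CUSTOMER_DIM.CUST_NAME <- STG.CUSTOMER.NAME <- SRC.CRM.CUSTOMERS.FULL_NAME
```

Real lineage tools build and visualize this graph automatically from ETL metadata rather than by hand.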




Data Lineage Tools :-

Ab Initio
Analytixds
Data Lineage 
DQ On Demand
IBM Ascential
IBM InfoSphere Business Information Exchange
IBM InfoSphere Metadata Workbench
Informatica MetaData Manager (PDF download)
Informatica On Demand
Microsoft SSIS
Oracle DW Builder
Talend
Uniserv

© Copyright 2015 Big Data - DW & BI