Each year, companies spend billions of dollars implementing enterprise applications such as CRM or ERP systems, or trying to integrate customer or product data across business units and applications. Industry estimates suggest these projects fail or go over budget 65-75% of the time. Often, the root of these failures is a fundamental misunderstanding of the completeness, accuracy, and validity of the source data that is essential to the initiative.
By identifying data problems before starting a data-driven project, you can drastically reduce the risk of project failure. Data profiling checks for small anomalies in the data that may require cleanup before the real production data is delivered. Finding these anomalies early spares the data warehouse team unwelcome surprises after going into production.
What is data profiling?
Data profiling is the systematic, up-front analysis of the content of a data source, all the way from counting the bytes and checking cardinalities up to the most thoughtful diagnosis of whether the data can meet the high-level goals of the data warehouse.
Put another way:
Data Profiling is a process whereby one
examines the data available in an existing database and
collects statistics and information about that data. The purpose of these
statistics may be to:
- find out whether existing data can easily be used for other purposes
- give metrics on data quality, including whether the data conforms to company standards
- assess the risk involved in integrating data for new applications, including the challenges of joins
- apply Six Sigma methodologies to enterprise data by tracking data quality
- assess whether metadata accurately describes the actual values in the source database
- understand data challenges early in any data-intensive project, so that late-project surprises are avoided; finding data problems late in the project can incur time delays and cost overruns
Data profiling practitioners divide this
analysis into a series of tests, starting with individual fields and ending
with whole suites of tables comprising extended databases.
Individual fields are checked to see that their
contents agree with their basic data definitions and domain declarations. It is
especially valuable to see how many rows have null values, or have contents
that violate the domain definition. For example, if the domain definition is “telephone number”, then alphanumeric entries clearly represent a problem. The
best data profiling tools count, sort, and display the entries that violate
data definitions and domain declarations. This process is called Column
Profiling.
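As a rough illustration, the sketch below shows what a simple column-profiling check might look like in Python with pandas. The table name, column name, and phone-number domain rule are assumptions invented for the example, not part of any particular profiling tool.

    # Minimal column-profiling sketch (illustrative; the column name and the
    # phone-number domain rule are assumptions, not taken from any real tool).
    import pandas as pd

    def profile_phone_column(df: pd.DataFrame, column: str) -> dict:
        """Count nulls, cardinality, and entries violating a simple phone-number domain."""
        values = df[column]
        non_null = values.dropna().astype(str)
        # Assumed domain rule: optional '+', then 7-15 digits, spaces, dashes, or parentheses.
        violations = non_null[~non_null.str.match(r"^\+?[\d\s\-()]{7,15}$")]
        return {
            "rows": len(values),
            "nulls": int(values.isna().sum()),
            "distinct": int(non_null.nunique()),
            "domain_violations": int(len(violations)),
            "violation_samples": violations.head(10).tolist(),
        }

    # e.g. profile_phone_column(pd.read_csv("customers.csv"), "telephone_number")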
Moving beyond single fields, data profiling
then describes the relationships discovered between fields in the same table.
Fields that implement a key to the data table can be displayed, together with
higher-level many-to-one relationships that implement hierarchies. Checking what
should be the key of a table is especially helpful because the violations
(duplicate instances of the key field) are either serious errors, or reflect a
business rule that has not been incorporated into the ETL design. This process
is called Dependency Profiling.
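A minimal sketch of such a key check, assuming a pandas DataFrame and hypothetical table and column names, might look like this:

    # Candidate-key check sketch (table and column names are assumptions).
    import pandas as pd

    def find_duplicate_keys(df: pd.DataFrame, key_columns: list) -> pd.DataFrame:
        """Return key values that occur more than once; an empty result means the
        supposed key really does identify rows uniquely."""
        counts = df.groupby(key_columns, dropna=False).size().reset_index(name="row_count")
        return counts[counts["row_count"] > 1].sort_values("row_count", ascending=False)

    # e.g. find_duplicate_keys(orders, ["order_id"])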
Relationships between tables are also checked
in the data profiling step, including assumed foreign key to primary key
relationships and the presence of parents without children.
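One simple way to approximate these cross-table checks is sketched below, again assuming pandas DataFrames and hypothetical column names; dedicated profiling tools perform this analysis far more efficiently and thoroughly.

    # Referential-integrity check sketch (table and column names are assumptions).
    import pandas as pd

    def check_foreign_key(child: pd.DataFrame, child_fk: str,
                          parent: pd.DataFrame, parent_pk: str) -> dict:
        """Count child rows whose foreign key has no matching parent,
        and parent rows that have no children at all."""
        orphans = child[~child[child_fk].isin(parent[parent_pk])]
        childless = parent[~parent[parent_pk].isin(child[child_fk])]
        return {"orphan_child_rows": len(orphans), "childless_parents": len(childless)}

    # e.g. check_foreign_key(order_lines, "order_id", orders, "order_id")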
Redundancy Profiling, another kind of data profiling, identifies overlapping values between tables.
This is typically used to identify candidate foreign keys within tables, to
validate attributes that should be foreign keys (but that may not have
constraints to enforce integrity), and to identify other areas of data
redundancy. For example, redundancy analysis could show the analyst that the ZIP field in table A contains the same values as the ZIP_CODE field in table B 80% of the time.
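One possible way to compute that kind of overlap figure is sketched below; the exact definition of overlap used here (the fraction of non-null values in one column that also appear in the other) is an assumption made for the example.

    # Value-overlap sketch for redundancy profiling (column names follow the
    # ZIP/ZIP_CODE example above; the overlap definition is an assumption).
    import pandas as pd

    def value_overlap(a: pd.Series, b: pd.Series) -> float:
        """Fraction of non-null values in column a that also appear in column b."""
        a_values = a.dropna()
        if a_values.empty:
            return 0.0
        return float(a_values.isin(set(b.dropna())).mean())

    # e.g. value_overlap(table_a["ZIP"], table_b["ZIP_CODE"])  # -> roughly 0.80 in the example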
Finally, data profiling can be custom-programmed to check complex business rules unique to a business, such as verifying that all the preconditions have been met for granting approval of a major funding initiative.
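As a last, purely illustrative sketch, such custom checks can be expressed as named rules applied to a table; the rules and column names shown are hypothetical.

    # Custom business-rule check sketch (the rules and column names are hypothetical).
    import pandas as pd

    def check_rules(df: pd.DataFrame, rules: dict) -> dict:
        """Apply each named rule (a function returning a boolean Series) and count violations."""
        return {name: int((~rule(df)).sum()) for name, rule in rules.items()}

    # Hypothetical preconditions for approving a funding initiative:
    # rules = {
    #     "sponsor_assigned": lambda d: d["sponsor_id"].notna(),
    #     "budget_signed_off_before_start": lambda d: d["budget_approved_date"] <= d["start_date"],
    # }
    # check_rules(initiatives, rules)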