Each year, companies spend billions of dollars implementing enterprise applications such as CRM or ERP systems, or trying to integrate customer or product data across business units and applications. Industry estimates suggest these projects fail or go over budget 65-75% of the time. Often, the root of these failures is a fundamental misunderstanding of the completeness, accuracy, and validity of the source data that is essential to the initiative.
By identifying data problems before starting a data-driven project, you can drastically reduce the risk of project failure. Data profiling checks for small anomalies in the data that may require cleanup before the real production data is delivered. Finding these anomalies early spares the data warehouse team unwelcome surprises after going into production.
What is data profiling?
Data profiling is the systematic, up-front analysis of the content of a data source, all the way from counting the bytes and checking cardinalities up to the most thoughtful diagnosis of whether the data can meet the high-level goals of the data warehouse.
Put another way:
Data Profiling is a process whereby one
examines the data available in an existing database and
collects statistics and information about that data. The purpose of these
statistics may be to:
- find out whether existing data can easily be used for other purposes
- give metrics on data quality, including whether the data conforms to company standards
- assess the risk involved in integrating data for new applications, including the challenges of joins
- apply Six Sigma methodologies to enterprise data by tracking data quality
- assess whether metadata accurately describes the actual values in the source database
- understand data challenges early in any data-intensive project, so that late-project surprises are avoided; finding data problems late in the project can incur time delays and cost overruns
Data profiling practitioners divide this
analysis into a series of tests, starting with individual fields and ending
with whole suites of tables comprising extended databases.
Individual fields are checked to see that their
contents agree with their basic data definitions and domain declarations. It is
especially valuable to see how many rows have null values, or have contents
that violate the domain definition. For example, if the domain definition is “telephone number”, then alphanumeric entries clearly represent a problem. The
best data profiling tools count, sort, and display the entries that violate
data definitions and domain declarations. This process is called Column
Profiling.
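As a rough illustration, the sketch below shows what a simple column-profiling check might look like in Python with pandas. The table name, column name, and phone-number domain rule are assumptions invented for the example, not part of any particular profiling tool.

    # Minimal column-profiling sketch (illustrative; the column name and the
    # phone-number domain rule are assumptions, not taken from any real tool).
    import pandas as pd

    def profile_phone_column(df: pd.DataFrame, column: str) -> dict:
        """Count nulls, cardinality, and entries violating a simple phone-number domain."""
        values = df[column]
        non_null = values.dropna().astype(str)
        # Assumed domain rule: optional '+', then 7-15 digits, spaces, dashes, or parentheses.
        violations = non_null[~non_null.str.match(r"^\+?[\d\s\-()]{7,15}$")]
        return {
            "rows": len(values),
            "nulls": int(values.isna().sum()),
            "distinct": int(non_null.nunique()),
            "domain_violations": int(len(violations)),
            "violation_samples": violations.head(10).tolist(),
        }

    # e.g. profile_phone_column(pd.read_csv("customers.csv"), "telephone_number")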
Moving beyond single fields, data profiling
then describes the relationships discovered between fields in the same table.
Fields that implement a key to the data table can be displayed, together with
higher-level many-to-one relationships that implement hierarchies. Checking what
should be the key of a table is especially helpful because the violations
(duplicate instances of the key field) are either serious errors, or reflect a
business rule that has not been incorporated into the ETL design. This process
is called Dependency Profiling.
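A minimal sketch of such a key check, assuming a pandas DataFrame and hypothetical table and column names, might look like this:

    # Candidate-key check sketch (table and column names are assumptions).
    import pandas as pd

    def find_duplicate_keys(df: pd.DataFrame, key_columns: list) -> pd.DataFrame:
        """Return key values that occur more than once; an empty result means the
        supposed key really does identify rows uniquely."""
        counts = df.groupby(key_columns, dropna=False).size().reset_index(name="row_count")
        return counts[counts["row_count"] > 1].sort_values("row_count", ascending=False)

    # e.g. find_duplicate_keys(orders, ["order_id"])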
Relationships between tables are also checked
in the data profiling step, including assumed foreign key to primary key
relationships and the presence of parents without children.
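One simple way to approximate these cross-table checks is sketched below, again assuming pandas DataFrames and hypothetical column names; dedicated profiling tools perform this analysis far more efficiently and thoroughly.

    # Referential-integrity check sketch (table and column names are assumptions).
    import pandas as pd

    def check_foreign_key(child: pd.DataFrame, child_fk: str,
                          parent: pd.DataFrame, parent_pk: str) -> dict:
        """Count child rows whose foreign key has no matching parent,
        and parent rows that have no children at all."""
        orphans = child[~child[child_fk].isin(parent[parent_pk])]
        childless = parent[~parent[parent_pk].isin(child[child_fk])]
        return {"orphan_child_rows": len(orphans), "childless_parents": len(childless)}

    # e.g. check_foreign_key(order_lines, "order_id", orders, "order_id")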
Redundancy Profiling, another kind of data profiling, identifies overlapping values between tables.
This is typically used to identify candidate foreign keys within tables, to
validate attributes that should be foreign keys (but that may not have
constraints to enforce integrity), and to identify other areas of data
redundancy. For example, redundancy analysis could show the analyst that the ZIP field in table A contains the same values as the ZIP_CODE field in table B 80% of the time.
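One possible way to compute that kind of overlap figure is sketched below; the exact definition of overlap used here (the fraction of non-null values in one column that also appear in the other) is an assumption made for the example.

    # Value-overlap sketch for redundancy profiling (column names follow the
    # ZIP/ZIP_CODE example above; the overlap definition is an assumption).
    import pandas as pd

    def value_overlap(a: pd.Series, b: pd.Series) -> float:
        """Fraction of non-null values in column a that also appear in column b."""
        a_values = a.dropna()
        if a_values.empty:
            return 0.0
        return float(a_values.isin(set(b.dropna())).mean())

    # e.g. value_overlap(table_a["ZIP"], table_b["ZIP_CODE"])  # -> roughly 0.80 in the example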
Finally, data profiling can be custom-programmed to check complex business rules unique to a business, such as verifying that all the preconditions have been met for granting approval of a major funding initiative.
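As a last, purely illustrative sketch, such custom checks can be expressed as named rules applied to a table; the rules and column names shown are hypothetical.

    # Custom business-rule check sketch (the rules and column names are hypothetical).
    import pandas as pd

    def check_rules(df: pd.DataFrame, rules: dict) -> dict:
        """Apply each named rule (a function returning a boolean Series) and count violations."""
        return {name: int((~rule(df)).sum()) for name, rule in rules.items()}

    # Hypothetical preconditions for approving a funding initiative:
    # rules = {
    #     "sponsor_assigned": lambda d: d["sponsor_id"].notna(),
    #     "budget_signed_off_before_start": lambda d: d["budget_approved_date"] <= d["start_date"],
    # }
    # check_rules(initiatives, rules)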