
Tuesday, September 28, 2010

The Deciding Factors for ETL Tools


Effective data extract, transform and load (ETL) processes represent the number one success factor for your data warehouse project and can absorb up to 70 percent of the time spent on a typical warehousing project. ETL tools promise quick results, improved manageability and metadata integration with other common design and implementation tools. However, due to the potentially huge amounts of money involved in a tool decision, choosing the correct ETL tool for your project can present a daunting challenge. With a bit of internal questioning in advance, followed by a careful review of your key needs against the choices available on the market, you should be able to choose the most effective ETL tool for your project.
ETL tools perform, as you may guess, at least three specific functions, all of which focus on the movement of data from one place (file type, server, location, etc.) or system to another. More encompassing than a simple file copy process, this class of software generally reads data from an input source (flat file, relational table, message queue, etc.); passes the stream of information through either an engine- or code-based process to modify, enhance, or eliminate data elements based on the instructions of the job; and then writes the resultant data set back out to a flat file, relational table, etc. As you may have guessed, these three steps are known as extraction, transformation and loading, respectively.
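As a rough illustration of those three steps, here is a minimal sketch using only the Python standard library; the source file, column names, and staging table are hypothetical placeholders, not a reference to any particular tool.

```python
import csv
import sqlite3

# Extract: read rows from a delimited source file (hypothetical name and layout).
with open("accounts.csv", newline="") as src:
    rows = list(csv.DictReader(src))

# Transform: modify, enhance, or eliminate data elements per the job's rules.
cleaned = [
    {"account_id": int(r["account_id"]), "state": r["state"].strip().upper()}
    for r in rows
    if r["state"].strip()              # eliminate records with no state code
]

# Load: write the resultant data set out to a relational target (staging table).
con = sqlite3.connect("warehouse.db")
con.execute("CREATE TABLE IF NOT EXISTS account_stg (account_id INTEGER, state TEXT)")
con.executemany("INSERT INTO account_stg VALUES (:account_id, :state)", cleaned)
con.commit()
con.close()
```

A real ETL engine adds scheduling, restartability and metadata around this same basic read-modify-write pattern.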
You may also see these tools classified as data cleansing tools, though here we should make a careful distinction. While data cleansing can certainly encompass the ETL functions, most true data cleansing tools are not architected to perform true ETL and, vice versa, most ETL tools provide only limited true data cleansing functions.

A quick example can show the difference between the two types of tools. Suppose you have an input data file containing the full name of a customer on an account. Particularly in financial services, this data may contain any manner of formats; but for this example, let's use "John S. and Mary Smith." A true data-cleansing tool will possess integrated functions to parse this phrase into the two named account holders, "John S. Smith" and "Mary Smith." This type of complex string parsing is not generally a strong function of an ETL tool. On the other hand, the ETL tool will generally be better at efficiently looking up the name "Mary Smith" in a relational database customer table and returning the integer numeric key value related to the "Mary Smith" customer name.
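To make that contrast concrete, the following toy sketch pairs a naive cleansing-style parse of the joint name with an ETL-style surrogate-key lookup; the parsing rule and the customer table are simplified assumptions for illustration only.

```python
import sqlite3

def split_joint_account(full_name):
    # Naive cleansing-style parse of a joint account holder string; real
    # data cleansing tools apply far richer name-parsing rules than this.
    rest, last = full_name.rsplit(" ", 1)             # "John S. and Mary", "Smith"
    holders = [part.strip() for part in rest.split(" and ")]
    return [f"{holder} {last}" for holder in holders]

def customer_key(con, name):
    # ETL-style surrogate-key lookup against a customer dimension table.
    row = con.execute(
        "SELECT customer_key FROM customer WHERE customer_name = ?", (name,)
    ).fetchone()
    return row[0] if row else None

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (customer_key INTEGER, customer_name TEXT)")
con.execute("INSERT INTO customer VALUES (42, 'Mary Smith')")

print(split_joint_account("John S. and Mary Smith"))   # ['John S. Smith', 'Mary Smith']
print(customer_key(con, "Mary Smith"))                 # 42
```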
There are more than 75 tools on the market that purport to have some ETL function. These may be classified according to function, engine type and development environment.
Function
The tools fall into several general categories:
EtL Tools – These "small t" tools emphasize the extraction and loading processes and should actually be referred to as data migration tools instead of full-function ETL processors.
eTL or ETl Tools – These "small e" or "small l" tools typically accept only a specific input or output type, such as a flat file source or a specific database format, but offer fairly robust transformation functions within the processing engine.
eTl Tools – These "capital T" tools perform the data transformation step relatively well, but may lack efficient connectivity to many of the common data formats that may be encountered.
ETL Tools – These "complete" ETL tools provide a rich mix of functionality and connectivity, but may be significantly more expensive than tools found in the other categories. For extremely complex projects or those attempting to process massive amounts of data, these tools may present the only true option for ensuring success in the ETL phase of the project. In other cases, this class of tool may offer features that simply are not required in the existing environment.
Engine Type
The engine type classification segments the tools by how the developed ETL processes are executed. Typically, the tools encountered fall into one of two categories: server engine or client engine. The server engine allows execution of multiple concurrent jobs from more than one developer. Server engines generally take advantage of multiple CPUs and are designed to coordinate and manage the execution of multiple simultaneous routines.
The client engine is simpler and assumes that the ETL routines are executed on the same machine as they are developed. Concurrency for multiple jobs is limited, if allowed at all, and client engine tools often do not support scaling across multiple CPU machines. Client engines are significantly cheaper to purchase due to their more limited functionality and scalability.
Development Environment
Development environments are typically split two ways: GUI-based or code-based tools. Code-based tools are the most familiar and may not be considered "tools" independent of the language they represent. For example, Perl can be used as a code-based ETL tool, but it is also a more generalized programming language. The embedded transactional code languages within common database platforms (e.g., PL/SQL with Oracle, Transact-SQL with Microsoft SQL Server, etc.) may also provide ETL functionality, but are not limited to this capability. Aside from general programming languages, several tools on the market utilize a custom scripting language developed explicitly for the optimization of ETL routines.
GUI-based ETL tools have been on the market for at least six years. The purpose of the GUI is to remove the coding layer for the developer and allow generation of ETL routines without requiring the mastery of any particular coding language. GUI tools are also useful in that they provide some self-documentation about the job flow just from the layout and positioning of the graphical elements.
Key Questions To Ask
Now that you have some background on the types of tools available, what criteria should you use to help decide which tool is best for you? We will focus the discussion around the five C's: complexity, concurrency, continuity, cost and conformity.
Complexity
Complexity is generally evaluated through the following series of questions:
How many distinct sources of information will need to be processed with this tool?
If using files, what is the expected size of the input files? Are they ASCII or EBCDIC formatted? For mainframe file sources, do they contain OCCURS, REDEFINES or packed fields? For other file sources, are the files delimited or fixed length? What character is used as a delimiter?
If using relational database sources, how many rows will need to be queried each time the job is run? Do primary and foreign key relationships exist within and/or between the tables of interest? Can the tool support the types of joins needed to execute against the tables of interest? Is it possible to see and tune/edit the SQL the tool is generating against the database?
What "magic values" exist within given data fields? Is it necessary to translate computer codes into more understandable terms?
Is there a need to join information together from dissimilar sources? Is there a requirement to match flat-file data with fields in a relational database table? Is there a requirement to join information from two different types of databases?
Does the data warehouse use the "slowly changing dimensions" concept? If so, how does the tool support appropriately updating dimensional attributes (a minimal example follows this list)?
How well does the tool support testing and debugging while in development?
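To make the slowly-changing-dimensions question concrete, here is a minimal Type 2 update expressed as plain SQL driven from Python; the dimension table, column names, and current-row convention are assumptions chosen purely for illustration.

```python
import sqlite3
from datetime import date

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE customer_dim (
    customer_key INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id  INTEGER,          -- natural (business) key
    state        TEXT,             -- tracked attribute
    valid_from   TEXT,
    valid_to     TEXT,
    is_current   INTEGER)""")

def scd_type2_update(con, customer_id, new_state):
    # Type 2: expire the current dimension row and insert a new version.
    today = date.today().isoformat()
    current = con.execute(
        "SELECT customer_key, state FROM customer_dim "
        "WHERE customer_id = ? AND is_current = 1", (customer_id,)).fetchone()
    if current and current[1] == new_state:
        return                                  # attribute unchanged; nothing to do
    if current:
        con.execute("UPDATE customer_dim SET valid_to = ?, is_current = 0 "
                    "WHERE customer_key = ?", (today, current[0]))
    con.execute("INSERT INTO customer_dim "
                "(customer_id, state, valid_from, valid_to, is_current) "
                "VALUES (?, ?, ?, NULL, 1)", (customer_id, new_state, today))
    con.commit()

scd_type2_update(con, 1001, "TX")   # first version of customer 1001
scd_type2_update(con, 1001, "CA")   # expires the TX row and adds a CA row
```

A full-function ETL tool typically generates this expire-and-insert pattern for you; the question is how much of it you must hand-build instead.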
Concurrency
Concurrency evaluates both the number of developers and number of simultaneous processes that the tool must support:
How many developers will need access to the tool?
Is there a need to implement security to limit access for individuals or groups? How does the tool support this? Is it possible to leverage LDAP or other in-place security mechanisms for authentication purposes?
How are change management functions or "versioning" handled within the tool? What happens if two developers attempt to make changes to the same routine or transformation object at the same time? Can older incarnations of various routines be retained for reference? What complexities are introduced through the specific versioning or change management functions used by the tool?
What is the maximum number of simultaneous processes that are expected to run? What size server would be required to support this number of concurrent processes with this tool? Can the tool support running twice that many simultaneous processes?
How flexible is the tool when executing dependent or sequential job streams (see the sketch after this list)? Can the tool recover from a failed job without manual intervention? How flexible is the error condition testing? How are errors logged?
Is the product multithreaded? If so, is the threading unlimited or set within the product? Are multiple jobs required to achieve high degrees of parallelization?
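The job-stream, recovery, and logging questions can be made tangible with a small sketch of a sequential, dependent job stream with logging and a simple retry policy; the step functions, retry count, and delay are purely illustrative assumptions.

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

def run_step(name, func, retries=2, delay_seconds=30):
    # Run one step of a dependent stream; log errors and retry before
    # halting the whole stream so that later steps never run on bad input.
    for attempt in range(1, retries + 2):
        try:
            func()
            logging.info("step %s succeeded (attempt %d)", name, attempt)
            return
        except Exception:
            logging.exception("step %s failed (attempt %d)", name, attempt)
            if attempt <= retries:
                time.sleep(delay_seconds)
    raise RuntimeError(f"job stream halted: step {name} exhausted its retries")

# Hypothetical dependent stream: each step runs only if the previous one succeeded.
def extract(): ...
def transform(): ...
def load(): ...

for step_name, step in [("extract", extract), ("transform", transform), ("load", load)]:
    run_step(step_name, step)
```

Whatever a tool provides out of the box for this kind of dependency and error handling is effort you do not have to script yourself.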
Continuity
Continuity involves two distinct subjects: it establishes the scalability of the tool to meet future demands from system growth, and it evaluates how reliable the product is under unusual circumstances:
If data sources double in size, can this tool support that level of processing?
How easy is it to make changes to developed jobs or routines?
Can more than the "minimum" hardware and operating system requirements be deployed to support the product?
How well does the product respond to spurious failures such as network problems, server crashes or running out of disk space?
How often does the vendor upgrade the product? How painful are the upgrade installations?
Cost
Cost represents the real purchase cost and perhaps the hidden support costs associated with each product:
What is the quoted price for the product options needed to be successful with this product? Where will additional software licensing costs (adding data sources, developer seats, server CPUs, etc.) be incurred in the future?
What future software costs will be incurred to receive software support and upgrades from the vendor? Is the first year of support and maintenance included with product purchase?
Is it necessary to purchase additional hardware or software components in order to effectively utilize this tool?
Who will administer the tool? Will it be necessary to hire a dedicated person or can someone already on the team fulfill this duty?
How much training is required to learn the tool? How much additional training is required to administer it? What advanced topic courses are offered? What do the courses cost? How often and where are they offered?
How available are support resources outside of those provided by the vendor? What resellers or systems integrators support the product? Are there local users groups? Are there technical forums available on the Internet?
How long should it take to implement our first production processes? How long is the typical learning curve on a tool like this (measured to the point of effective use of the tool, not just through the end of training!)?
Conformity
Conformity outlines how the tool behaves against existing architectural limitations or requirements:
Does this tool support the hardware platform and operating system that will be used? Are there any incompatibilities with any other software typically run on the servers (backup software, virus scanning software, etc.)?
Does this tool integrate with the network and user security protocols used? What, if any, work-arounds will be required?
Does this tool allow access to the type and location of data sources that will be used? Should mainframe files be accessed directly (and can the tool support this?) or is it necessary to extract mainframe data to files first?
Can the tool import and/or export metadata or other key information with the specific tools (data modeling tools, business intelligence tools, etc.) that are deployed in the data warehouse?
What form of documentation does the tool generate? Does this format and content meet internal documentation standards? What additional documentation will need to be developed and maintained outside of the tool?
Is this purchase a departmental tool or will the tool be used across the enterprise? How would the tool architecture scale to grow from a departmental solution into one suitable for enterprise deployment?
Evaluating the ETL Purchase Decision
You will find that ETL tools can range in cost from free (freeware) to more than several hundred thousand dollars. As such, your decision delivers a widely ranging and potentially lasting impact on your systems architecture. If you are not already using a tool, you obviously first must compare the decision to purchase a tool against the "keep doing it like we're doing it today" option. Very few IT departments are able to meet the current and future development needs of the business users without substantial growth. One way to offset the need for additional headcount is to take advantage of the efficiency that modern ETL tools can bring if chosen and employed effectively.
To be blunt, you may encounter the feeling of buyer's remorse with almost any software purchase you make. Unfortunately, there is no single right answer in any tool choice today (despite what vendors tell you). As such, your goal should be to understand in advance where the expected weaknesses will occur with your chosen solution and be prepared to mitigate those consequences.
First Steps – "Deal Breakers"
Your first order of business in outlining your ETL selection is to determine if any critical business factors exist that might help limit the number of tools you will consider. Examples of these factors may include:
We must have a tool that runs on {specific operating system}.
We must be able to process {amount} GB of data nightly in a {number} hour batch window.
We must have connectivity to {type of data source}.
We must be able to run our ETL jobs on the mainframe.
We must have a tool that supports concurrent development by multiple sites.
We absolutely cannot spend more than ${x}.
The specific objective is to identify conditions that weed out solutions that simply will not be acceptable within your given architecture. Ideally, this first step should identify a simple list of the top six or seven tools that might meet the majority of your requirements. An exhaustive review of the tools is not required at this point.
Making the Final Decision
After winnowing out the clearly unacceptable choices, your next step in the process is to outline what criteria you will use in the evaluation. You may use the questions in the earlier sections of this article as a guideline; but you should be fairly specific and outline specific data sources with which you require connectivity, hardware expectations, operating system options, etc.
After building the list, generate a matrix with your decision criteria as rows and the tool choices as the columns of the matrix. Then rank each of the tools using publicly available information (Web sites, publications, etc.) as much as possible. If you do not feel you have enough information available to adequately answer your questions, contact a vendor representative to obtain the necessary information to complete the matrix. Note that managing more than two to three vendor presentations and handling appropriate follow-up can be an extremely time-consuming task. For this reason, you may wish to hold off on formal onsite vendor presentations until you've created your "short list" of the two to three finalists for the selection. This approach will allow you to truly focus your energies on understanding why one tool is specifically better than just one or two others. With a wider field, you will generally find many "triangular" relationships (tool A is better than tool B at criterion 1, tool B is better than tool C at criterion 2, tool C is better than tool A at criterion 3, etc.) that confuse the overall picture.
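A weighted scoring matrix of this kind can be kept as a short script or a spreadsheet; the criteria, weights, and scores below are hypothetical placeholders, not a recommendation of any tool.

```python
# Criteria (rows) versus candidate tools (columns), scored 1-5; the weights
# reflect how important each criterion is to this particular project.
weights = {"mainframe connectivity": 3, "transformation depth": 2,
           "metadata exchange": 2, "total cost of ownership": 3}

scores = {
    "Tool A": {"mainframe connectivity": 5, "transformation depth": 3,
               "metadata exchange": 4, "total cost of ownership": 2},
    "Tool B": {"mainframe connectivity": 2, "transformation depth": 5,
               "metadata exchange": 3, "total cost of ownership": 4},
}

for tool, tool_scores in scores.items():
    total = sum(weights[c] * tool_scores[c] for c in weights)
    print(f"{tool}: weighted score {total}")
```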
Eventually, you should arrive at a reasonable conclusion about which tool will best meet your needs. You should definitely try to obtain a customer reference from each vendor, preferably another company doing approximately the same function as your organization (though for competitive reasons this may be hard to obtain). Focus on your key technical requirements to ensure that the customer reference you obtain is relevant and useful to you. A customer reference accessing similar systems and operating on similar hardware and operating systems is probably more applicable than another similar industry reference on a completely different platform.
In the end, of course, it all comes down to putting ink to paper and taking on the responsibility for this new project. Much like having a baby, you won't know for some time how decisions you make today turn out; and you'll inevitably have second thoughts along the way (Was I ready for this?). However, with a solid decision-making process behind you, you can expect to grow and succeed with your selection. At the final bell, it is the results that count!

Source: http://www.information-management.com/issues/20010301/3084-1.html

