IBM Netezza Best Practices & Guidelines

I am sharing the some of the best practices we followed on designing the netezza table and database

In order to leverage the Netezza box to its fullest potential and achieve optimum performance, we recommend the best practices and guidelines listed below:

Distribution :-

§ The distribution of the data across the various disks is the single most important factor that can impact performance. Consider below points when choosing the distribution key:

§ Column with High Cardinality.

§ Column(s) that will frequently be used in join conditions.

§ Avoid using a boolean column, which causes Data Skew.

§ Avoid distributing the tables on columns that are often used in the where clause as it will cause processing skew. Date columns would be an example of where not to use this, especially in DW environments.

§ Good distribution is a fundamental element of performance!

§ If all data slices have the same amount of work to do, a query will be 94 times quicker than if one data slice was asked to do the same work

§ Bad distribution is called data skew

§ Skew to one data slice is the worst case scenario

§ Skew affects the query in hand and others as the data slice has more to do

§ Skew also means that the machine will fill up much quicker

§ Simple rule. Good distribution – Good performance

To create an explicit distribution key, the Netezza SQL syntax is:

usage: CREATE TABLE <tablename> [ ( <column> [, … ] ) ]

DISTRIBUTE ON [HASH] ( <column> [ ,… ] ) ;

The phrase distribute on specifies the distribution key, hash is optional.

You cannot update columns specified as the DISTRIBUTE ON key.

To create a random distribution, the Netezza SQL syntax is:

usage: CREATE TABLE <tablename> [ ( <column> [, … ] ) ]

DISTRIBUTE ON RANDOM;

The phrase DISTRIBUTE ON RANDOM specifies round-robin

§ Select the common key between the Dim and Fact tables if possible; if not select the key to ensure that the larger table (Fact) is not redistributed.

§ Choose Random Distribution only as the last resort as it will more often lead to a table being redistributed or broadcasted. This is okay for a small table but will impact performance if done to large tables.

§ Use Char(x) instead of Varchar(x) when you expect the data to be a fixed length as this not only helps to save disk space but also helps performance due to reduced I/O.

§ Where possible, use the NOT NULL constraint for columns in the tables, especially for columns that are used in where clauses (Restriction) or join conditions.

§ Use Same Data Type and Length for columns that are often used for joining tables.

§ Use joins over sub queries.

Create materialized views for vertically partitioning small sets of columns that are often used in queries. The Optimizer automatically decides when to use the MV or the underlying table.

Big Data - DW & BI

Translate to your Language

Labels

Friday, November 1, 2013

IBM Netezza Best Practices & Guidelines

0 comments:

Popular Posts

Pin It