I am sharing the some
of the best practices we followed on designing the netezza table and database
In order to leverage
the Netezza box to its fullest potential and achieve optimum performance, we
recommend the best practices and guidelines listed below:
Distribution :-
§ The distribution of the data across the
various disks is the single most important factor that can impact performance.
Consider below points when choosing the distribution key:
§ Column with High Cardinality.
§ Column(s) that will frequently be used in join
conditions.
§ Avoid using a boolean column, which causes
Data Skew.
§ Avoid distributing the tables on columns that
are often used in the where clause as it will cause processing skew. Date
columns would be an example of where not to use this, especially in DW
environments.
§ Good distribution is a fundamental element of
performance!
§ If all data slices have the same amount of
work to do, a query will be 94 times quicker than if one data slice was asked
to do the same work
§ Bad distribution is called data skew
§ Skew to one data slice is the worst case
scenario
§ Skew affects the query in hand and others as
the data slice has more to do
§ Skew also means that the machine will fill up
much quicker
§ Simple rule. Good distribution – Good
performance
To create an explicit distribution key, the
Netezza SQL syntax is:
usage:
CREATE TABLE <tablename> [ ( <column> [, … ] ) ]
DISTRIBUTE
ON [HASH] ( <column> [ ,… ] ) ;
The phrase distribute on specifies the
distribution key, hash is optional.
You cannot update columns specified as the
DISTRIBUTE ON key.
To create a random distribution, the Netezza
SQL syntax is:
usage:
CREATE TABLE <tablename> [ ( <column> [, … ] ) ]
DISTRIBUTE
ON RANDOM;
The phrase DISTRIBUTE ON RANDOM specifies
round-robin
§ Select the common key between the Dim and Fact
tables if possible; if not select the key to ensure that the larger table
(Fact) is not redistributed.
§ Choose Random Distribution only as the last
resort as it will more often lead to a table being redistributed or
broadcasted. This is okay for a small table but will impact performance if done
to large tables.
§ Use Char(x) instead of Varchar(x) when you
expect the data to be a fixed length as this not only helps to save disk space
but also helps performance due to reduced I/O.
§ Where possible, use the NOT NULL constraint
for columns in the tables, especially for columns that are used in where
clauses (Restriction) or join conditions.
§ Use Same Data Type and Length for columns that
are often used for joining tables.
§ Use joins over sub queries.
Create materialized views for vertically
partitioning small sets of columns that are often used in queries. The
Optimizer automatically decides when to use the MV or the underlying table.
0 comments: