Azure SQL DW data structure

In my previous post, I outlined my choice in using ASDW. Due to the storage architecture of the data being distributed across 60 nodes, there arises some very different table structure and query optimisation techniques that need to be implemented. This article will outline the data structure requirements of ASDW.

Data Distribution

For each table you have to specify how  to distribute the data across the storage nodes. There are three options for distribution, round robin, hash and replicated. For maximum query performance you need to minimise data movement between nodes when joining tables. Now, for your dimensions of less than about 5GB you should choose replicated. This means that when joining to the large fact tables that are distributed using either of the other methods that all data values in the dimensions are available on all nodes and no data movement needs to take place.

Now, if all your dimensions stay under around 5GB, all your fact tables can be distributed using the default distribution of round robin until you can build on your query requirements at which point you may wish to using the hash distribution method.

Hash distribution should be used when you either have a large dimension that is outside the recommended 5GB in size, or when you frequently join to large fact tables using a common dimension key. When defining a table to be distributed using the hash method, you identify the column on which you wish to apply a deterministic hashing algorithm to distribute the data. This then means that all rows that have the same key are stored on the same distribution. When this method has been applied, you should also define the same hash distribution on the dimension on which key you have distributed the fact table.

One consideration you need to take into account when choosing a column to hash distribute on is the skewness of the data. You need to choose a column on which the data is evenly distributed to optimise query performance. If the data is skewed, this can cause too much processing to be occurring on a single node which could produce some severe performance degradation in the queries.

Data Storage

For all dimensions that have been marked for replication, you can use either a heap or columnstore compression depending on the size of the tables. Columnstore works best when compressing 1M rows at a time, so use that as your general rule of thumb.

Now for fact tables, or dimensions distributed using a hash distribution, you have to take into account the fact that the table is split into 60 segments. This means not using columnstore on anything below around 60M rows as the columnstore compression will not be working at its optimum.