Building the warehouse

Now we have the data in a central location in blob storage, we now needed to select the right database technology to build the warehouse. Normally on an in premise setup I would use SSIS to orchestrate the ETL, but running a serverless architecture in Azure opened up some new options.

The main orchestration engine for ETL in Azure is Azure Data Factory, or ADF for short. Compared to the options available in SSIS, ADF seems fairly lacking at first sight, but when you start to look at how to ingest data into Azure SQL Data Warehouse, you realise that these limitations enable an entirely new viewpoint and technique for data loading.

If we take a step back and look at the underlying architecture of ASDW, we start to realise that data ingestion is a high point. The data in an ASDW db is stored on 60 storage nodes stored on blob storage and each node is able to ingest data from blob storage independently of each other. This enables the DB to ingest billions of rows of data at phenomenal speeds compared to normal etl methods that ingest data serially.

Now we have selected the database architecture using the Massively Parallel Processing (MPP) architecture, we now need to work on query performance.