As development continues on ADLS gen2, Power BI now has a beta connector available. This can be found in the Azure folder of the Get Data experience.
To connect to your storage account, you will need to provide the DFS endpoint URL for your account (in the form https://<accountname>.dfs.core.windows.net).
Depending on whether or not the right permissions exist, you may get some odd errors once you have signed in with your organisational account. For example, I saw this error.
To rectify this, I went to the Access Control (IAM) tab in the Azure portal for my storage account and assigned one of the three blob data roles (Storage Blob Data Reader, Storage Blob Data Contributor or Storage Blob Data Owner) to the account I used to authenticate.
Once this has been done (it can take up to five minutes for the roles to propagate in larger organisations), you will be able to see the contents of your storage account and create reports based on data stored in ADLS gen2.
Thanks to Adam Pearce, a super helpful Senior Consultant here at Altius, who found the fix for the connection error.
Back in April (I know, this post is a bit late!), a new opportunity presented itself to return to the fast-paced world of consulting with the highly respected Microsoft Business Intelligence consultancy Altius. With a focus on the Azure data platform, and the advances in communication technologies since I was last in the consulting world in 2011, this felt like the right time to make the move back.
I have now been working as an Azure Data Platform consultant for four months and have really enjoyed working alongside such a talented team. I have learned a lot in this short space of time, still have a lot to learn, and have already been able to put my performance optimisation skills to the test.
I look forward to sharing some of the insights I have gained, and I hope you enjoy taking this new journey into the ever-expanding Azure data landscape with me.
As you may remember from previous posts, I was deep into the SQL DW architecture and query optimisation work that was a fundamental part of my job bringing analytics to Snow Software.
Since that inaugural year at Snow, things have changed: Microsoft have moved so quickly it is hard to keep up with everything. But the one game changer is how Mark Kromer and the Azure Data Factory team are transforming ETL in the cloud.
They have a new paradigm in play for ETL to replace on-premises SSIS: ADF Data Flows. This is a drag-and-drop interface that hides the complexity of one of the biggest names in the industry, Databricks.
Using Data Flows rather than SQL DW to do the MPP transformation work means you can load an Azure SQL DB instead (especially Hyperscale), which opens up options for massive data volumes and real-time reporting that SQL DW cannot handle with its query concurrency limits.
The ADF team have been absolutely amazing in supporting requests for new features, and I look forward to them implementing some of the awesome stuff they have in their pipeline (to which I may have contributed some ideas…)
In my previous post, I outlined my choice of ASDW. Because its storage architecture spreads the data across 60 distributions, some very different table structure and query optimisation techniques need to be applied. This article will outline the data structure requirements of ASDW.
For each table, you have to specify how to distribute the data across those 60 distributions. There are three options: round robin, hash and replicated. For maximum query performance, you need to minimise data movement between compute nodes when joining tables. For dimensions of less than about 5GB, you should choose replicated. This means that when joining to large fact tables distributed using either of the other methods, all of the dimension's values are available on every node and no data movement needs to take place.
Now, if all your dimensions stay under around 5GB, all your fact tables can be distributed using the default of round robin until your query requirements become clearer, at which point you may wish to use the hash distribution method.
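As a minimal sketch of how those two choices look in T-SQL (the table and column names here are purely illustrative):

```sql
-- Small dimension (< ~5GB): replicate it so joins need no data movement
CREATE TABLE dbo.DimCustomer
(
    CustomerKey  INT           NOT NULL,
    CustomerName NVARCHAR(100) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);

-- Fact table with no clear join pattern yet: round robin spreads rows evenly across the 60 distributions
CREATE TABLE dbo.FactSales
(
    CustomerKey INT           NOT NULL,
    SaleDate    DATE          NOT NULL,
    Amount      DECIMAL(18,2) NOT NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX);
```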
Hash distribution should be used when you either have a large dimension that exceeds the recommended 5GB, or when you frequently join to large fact tables on a common dimension key. When defining a hash-distributed table, you identify the column to which a deterministic hashing algorithm is applied to distribute the data, which means that all rows with the same key are stored on the same distribution. When this method has been applied, you should also apply the same hash distribution to the dimension whose key you used to distribute the fact table.
One consideration when choosing a column to hash distribute on is the skewness of the data. You need to choose a column on which the data is evenly distributed to optimise query performance. If the data is skewed, too much of the processing ends up on a single distribution, which can severely degrade query performance.
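Continuing the sketch, a hash-distributed fact table and the large dimension it joins to might be defined like this (again, the names are illustrative):

```sql
-- Large dimension (> ~5GB): hash distribute on its key instead of replicating it
CREATE TABLE dbo.DimProduct
(
    ProductKey  INT           NOT NULL,
    ProductName NVARCHAR(200) NOT NULL
)
WITH (DISTRIBUTION = HASH(ProductKey), CLUSTERED COLUMNSTORE INDEX);

-- Fact distributed on the same key, so joins on ProductKey stay local to each distribution
CREATE TABLE dbo.FactOrderLines
(
    ProductKey INT           NOT NULL,
    OrderDate  DATE          NOT NULL,
    Amount     DECIMAL(18,2) NOT NULL
)
WITH (DISTRIBUTION = HASH(ProductKey), CLUSTERED COLUMNSTORE INDEX);
```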
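A quick way to check for skew after loading a hash-distributed table is to look at the rows stored per distribution; using the illustrative table from the sketch above:

```sql
-- Shows rows and space used per distribution; a large imbalance indicates skew on the hash column
DBCC PDW_SHOWSPACEUSED("dbo.FactOrderLines");
```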
For all dimensions that have been marked for replication, you can use either a heap or a clustered columnstore index, depending on the size of the table. Columnstore works best when it can compress around 1 million rows per rowgroup, so use that as your general rule of thumb.
For fact tables, or dimensions distributed using hash distribution, you have to take into account that the table is split across 60 distributions. This means not using columnstore on anything below around 60 million rows, as the columnstore compression will not be working at its optimum.
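So for a smaller hash-distributed table, a heap can be the better choice until it grows past that threshold. A sketch under the same illustrative naming:

```sql
-- Small hash-distributed table: a heap avoids under-filled columnstore rowgroups
CREATE TABLE dbo.FactReturns
(
    ProductKey INT  NOT NULL,
    ReturnDate DATE NOT NULL,
    Quantity   INT  NOT NULL
)
WITH (DISTRIBUTION = HASH(ProductKey), HEAP);
```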
Now that we had the data in a central location in blob storage, we needed to select the right database technology to build the warehouse. Normally, on an on-premises setup, I would use SSIS to orchestrate the ETL, but running a serverless architecture in Azure opened up some new options.
The main orchestration engine for ETL in Azure is Azure Data Factory, or ADF for short. Compared to the options available in SSIS, ADF seems fairly lacking at first sight, but when you start to look at how to ingest data into Azure SQL Data Warehouse, you realise that these limitations enable an entirely new viewpoint and technique for data loading.
If we take a step back and look at the underlying architecture of ASDW, we start to realise that data ingestion is a high point. The data in an ASDW database is spread across 60 distributions backed by blob storage, and each distribution can ingest data from blob storage independently of the others. This enables the database to ingest billions of rows at phenomenal speed compared to normal ETL methods that load data serially.
Now that we have selected a database built on the Massively Parallel Processing (MPP) architecture, we need to work on query performance.
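The way to exploit that parallel ingestion from T-SQL is PolyBase: define an external table over the files in blob storage and load it with a CTAS. This is only a sketch; the data source, location, file layout and table names are all assumptions, and the database scoped credential setup is omitted:

```sql
-- External data source pointing at the staging container (credential setup omitted)
CREATE EXTERNAL DATA SOURCE BlobStage
WITH (TYPE = HADOOP, LOCATION = 'wasbs://staging@<account>.blob.core.windows.net');

-- File format and external table over the CSV files
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

CREATE EXTERNAL TABLE dbo.ExtSales
(
    ProductKey INT,
    SaleDate   DATE,
    Amount     DECIMAL(18,2)
)
WITH (LOCATION = '/sales/', DATA_SOURCE = BlobStage, FILE_FORMAT = CsvFormat);

-- CTAS pulls the files in parallel across all distributions
CREATE TABLE dbo.FactSalesStaging
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
AS SELECT * FROM dbo.ExtSales;
```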
In my previous post I signed off by mentioning something I have come to call distributed ETL. This was a solution to a problem I was presented with many years ago when I was working as a consultant. A client of mine was building a central data warehouse to host client data. The problem arose because the clients of my client were all on their own networks, which meant that the normal pull architecture of ETL, with a central ETL server orchestrating the data extract, was not going to work. The data had to be pushed from each client's installation over the internet to a central repository where the transform and load portions could be carried out.
Back in my previous role, I created some SSIS packages that could be installed locally to extract the relevant data and then FTP it to the central server to be processed. This worked very well, but it required someone to visit each site and set up the packages and schedules. Snow has in the region of 5,000 clients, and for the new product to be successful we needed a less hands-on approach.
Enter the Snow Update Service, or SUS for short. What is SUS? SUS is Snow's own version of a system such as Windows Update: a central server that client installations can connect to in order to request and download the latest updates to all of the software we sell. Sounds familiar, right? It is the distributed ETL paradigm in reverse. It just needed a bit of work to enable it to send new data to a central data storage area.
Now that we had a system that, once a client had bought an analytics pack, would enable data to be uploaded, we could start to build out the central warehouse and everything around it to enable clients to visualise their data.