It’s an early start today here in Belgium as I prepare for my first public sessions as a speaker.
As I have mentioned previously, I will be presenting a session on Saturday at Power User Days in Belgium on optimising large data sets in Power BI with aggregations (https://sessionize.com/app/speaker/event/details/1667), but the crazy news is that I will now also be presenting a second session, focused on handling the most awkward of files: the multiformat flat file in Mapping Data Flows! Join me on Saturday to find out how to handle this really awkward ETL pattern. Sorry that you have to hear me droning on across multiple sessions, but I promise it will be worth it!
I am really happy to announce that I will be presenting a session at Power User Days in Belgium next month, on Saturday September 14th. This is the first conference session I will ever have presented, and I am diving in at the deep end by delivering a hands-on session on how to optimise reporting in Power BI against large data sets using aggregations. I will be demonstrating this using a few billion rows in DirectQuery to highlight how aggregations are a real game changer.
For more details and to register for free for this event please visit https://poweruserdays.com/?fbclid=IwAR2ORE5uwXUYu31dXZ-N4i7e_BnlPWqBu5dUCwXfAGdeTL5vKIj587Tf4pA. Don’t forget to bring your laptop, as I will be getting you to build along with me! There might even be a competition for who can build the best dashboard using the data set provided!
As development continues on ADLS gen2, Power BI now has a beta connector available. It can be found in the Azure folder of the Get Data experience.
To connect to your storage account, you will need to provide the DFS URL for your account (of the form https://<accountname>.dfs.core.windows.net).
Depending on whether the right permissions exist, you may get some weird errors once you have signed in with your organisational account; for example, I saw this error:
To rectify this I went to the Access control (IAM) blade in the Azure portal for my storage account and assigned one of the three blob data roles (Storage Blob Data Reader, Storage Blob Data Contributor or Storage Blob Data Owner) to the account I used to authenticate.
Once this has been done (it can take up to 5 minutes for the roles to propagate in larger organisations), you will be able to see the contents of your storage account and create reports based on data stored in ADLS gen2.
Thanks to Adam Pearce, a super helpful Senior Consultant here at Altius who found the fix for the connection error.
Although Mapping Data Flows in Azure Data Factory is still in its infancy, there are already a lot of good resources out there. I will try to keep this post updated as more blogs appear.
Mark Kromer: https://kromerbigdata.com/tag/mapping-data-flows/
Cathrine Wilhelmsen: https://www.cathrinewilhelmsen.net/tag/data-flows/
Azure Documentation: https://docs.microsoft.com/en-us/azure/data-factory/concepts-data-flow-overview
SQL Player: https://sqlplayer.net/tag/adfdf/
List of videos: https://github.com/kromerm/adfdataflowdocs/tree/master/videos
Hopefully this is a useful reference; let me know of any more in the comments.
Back in April (I know, this post is a bit late!) a new opportunity presented itself: a return to the fast-paced world of consulting with the highly respected Microsoft Business Intelligence consultancy Altius. With the focus on the Azure data platform, and the advances in communication technologies since I was last in the consulting world in 2011, this felt like the right time to make the move back.
I have now been working as an Azure Data Platform consultant for four months and have really enjoyed working alongside such a talented team. I have learned a lot in this short space of time, still have plenty more to learn, and have already been able to put my performance optimisation skills to the test.
I look forward to being able to share some of the insights that I have gained and hope you enjoy taking this new journey into the ever expanding Azure data landscape with me.
Stay tuned for more.
As you may remember from previous posts, I was deep into the SQL DW architecture and query optimisation that was a fundamental part of my job bringing analytics to Snow Software as a company.
After that inaugural year at Snow, things have changed: Microsoft have moved so quickly it is hard to keep up with everything. But the one game changer is how Mark Kromer and his Azure Data Factory team are transforming ETL in the cloud.
They have a new paradigm in play for ETL to replace on-premises SSIS, and that is ADF data flows. This is a drag-and-drop interface that hides the complexity of one of the biggest names in the industry, Databricks, which runs the transformations under the hood.
Using data flows to perform the MPP transforms, instead of pushing them into SQL DW, means you can load an Azure SQL DB (especially in Hyperscale) instead. This opens up options for massive data volumes and real-time reporting that SQL DW cannot handle because of its query concurrency limits.
The ADF team have been absolutely amazing in supporting requests for new features, and I look forward to them implementing some of the awesome stuff they have in their pipeline (to which I may have contributed some ideas…).
In my previous post, I outlined my choice of ASDW. Because its storage architecture distributes data across 60 distributions, some very different table structure and query optimisation techniques need to be applied. This article will outline the data structure requirements of ASDW.
For each table you have to specify how to distribute the data across the storage distributions. There are three options: round robin, hash and replicated. For maximum query performance you need to minimise data movement between nodes when joining tables. For dimensions of less than about 5GB you should choose replicated: when joining to large fact tables distributed using either of the other methods, all of the dimension’s values are available on every node and no data movement needs to take place.
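As a sketch, a small dimension can be declared as replicated in its table definition (the table and column names here are illustrative, not from a real schema):

```sql
-- Illustrative only: a small dimension replicated to every compute node,
-- so joins against distributed fact tables need no data movement.
CREATE TABLE dbo.DimProduct
(
    ProductKey   INT           NOT NULL,
    ProductName  NVARCHAR(100) NOT NULL,
    Category     NVARCHAR(50)  NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX  -- or HEAP for very small dimensions
);
```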
Now, if all your dimensions stay under around 5GB, all your fact tables can be distributed using the default round robin distribution until you understand your query requirements, at which point you may wish to use the hash distribution method.
Hash distribution should be used when you either have a large dimension that is outside the recommended 5GB in size, or when you frequently join to large fact tables using a common dimension key. When defining a table to be distributed using the hash method, you identify the column to which a deterministic hashing algorithm is applied to distribute the data. This means that all rows with the same key are stored on the same distribution. When this method has been applied, you should also define the same hash distribution on the dimension whose key you used to distribute the fact table.
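To sketch what that looks like, here is a fact table and a large dimension hash distributed on the same key, so matching rows land on the same distribution and the join happens locally with no data movement (again, all names are illustrative):

```sql
-- Illustrative only: fact and large dimension hash distributed on the same
-- key (ProductKey), so the join between them requires no shuffling of data
-- between distributions.
CREATE TABLE dbo.FactSales
(
    SaleKey     BIGINT        NOT NULL,
    ProductKey  INT           NOT NULL,
    SaleDate    DATE          NOT NULL,
    SaleAmount  DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(ProductKey),
    CLUSTERED COLUMNSTORE INDEX
);

CREATE TABLE dbo.DimProductLarge
(
    ProductKey   INT           NOT NULL,
    ProductName  NVARCHAR(100) NOT NULL,
    Attributes   NVARCHAR(400) NULL
)
WITH
(
    DISTRIBUTION = HASH(ProductKey),
    CLUSTERED COLUMNSTORE INDEX
);
```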
One consideration you need to take into account when choosing a column to hash distribute on is the skewness of the data. You need to choose a column on which the data is evenly distributed to optimise query performance. If the data is skewed, too much processing ends up occurring on a single node, which can produce some severe performance degradation in your queries.
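You can check for skew after loading by looking at how the rows actually landed. One quick way is the DBCC command below, which reports row counts and space used per distribution; large differences between distributions suggest a skewed key (the table name is the illustrative one used above):

```sql
-- Illustrative only: report rows and space per distribution for a
-- hash-distributed table. Distributions with far more rows than the
-- others indicate skew on the chosen distribution column.
DBCC PDW_SHOWSPACEUSED('dbo.FactSales');
```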
For all dimensions that have been marked for replication, you can use either a heap or columnstore compression, depending on the size of the table. Columnstore works best when compressing around 1 million rows at a time, so use that as your general rule of thumb.
Now for fact tables, or dimensions distributed using a hash distribution, you have to take into account that the table is split into 60 distributions. This means not using columnstore on anything below around 60 million rows, as each distribution would hold far fewer than the rows a columnstore rowgroup needs and the compression will not be working at its optimum.
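Putting that rule of thumb into DDL, a smaller distributed table can simply be left as a heap until it grows past the threshold (illustrative names again):

```sql
-- Illustrative only: a hash-distributed table well under ~60M rows kept
-- as a heap. Split 60 ways, each distribution would hold far fewer than
-- the ~1M rows a columnstore rowgroup compresses best at.
CREATE TABLE dbo.FactSmallEvents
(
    EventKey     BIGINT NOT NULL,
    CustomerKey  INT    NOT NULL,
    EventDate    DATE   NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    HEAP
);
```

Once the table grows well beyond 60 million rows, it can be rebuilt with a clustered columnstore index to get the compression and scan benefits.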