Now that we had the data in a central location in blob storage, we needed to select the right database technology to build the warehouse. Normally, in an on-premises setup, I would use SSIS to orchestrate the ETL, but running a serverless architecture in Azure opened up some new options.
The main orchestration engine for ETL in Azure is Azure Data Factory, or ADF for short. Compared to the options available in SSIS, ADF seems fairly limited at first sight, but when you start to look at how to ingest data into Azure SQL Data Warehouse, you realise that these limitations enable an entirely new viewpoint and technique for data loading.
If we take a step back and look at the underlying architecture of ASDW, we start to realise that data ingestion is one of its strengths. The data in an ASDW database is spread across 60 distributions backed by blob storage, and each distribution can ingest data from blob storage independently of the others. This enables the database to ingest billions of rows of data at phenomenal speeds compared to traditional ETL methods that load data serially.
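To make that parallelism concrete, here is a minimal Python sketch of the idea. This is purely illustrative (the names `assign_distribution` and `load_slice` are my own, not anything from the ASDW engine): rows are hash-distributed across 60 buckets, and each bucket can then be loaded independently and in parallel.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_DISTRIBUTIONS = 60  # ASDW spreads table data across 60 distributions

def assign_distribution(key: int) -> int:
    """Hash the distribution key to pick one of the 60 distributions."""
    return hash(key) % NUM_DISTRIBUTIONS

def load_slice(dist_id: int, rows: list) -> tuple:
    """Stand-in for one distribution ingesting its own slice of the data."""
    return dist_id, len(rows)

# Bucket incoming rows by distribution key, then "load" each bucket in parallel.
rows = [{"id": i, "value": i * 10} for i in range(10_000)]
buckets = defaultdict(list)
for row in rows:
    buckets[assign_distribution(row["id"])].append(row)

with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(pool.map(lambda item: load_slice(*item), buckets.items()))
```

Because each slice is independent, no distribution ever waits on another, which is why the load rate scales with the number of distributions rather than being bottlenecked through a single serial pipeline.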
Now that we had selected a database built on the Massively Parallel Processing (MPP) architecture, we needed to work on query performance.
In my previous post I signed off by mentioning something I have come to call distributed ETL. This was a solution to a problem I was presented with many years ago when I was working as a consultant. A client of mine was building a central data warehouse to host client data. The problem arose because the clients of my client were all on their own networks, which meant that the normal pull architecture of ETL, a central ETL server orchestrating the data extract, was not going to work. The data had to be pushed from each of the client installations over the internet to a central repository, where the transform and load (T&L) portions could be carried out.
Back in that previous role, I created some SSIS packages that could be installed locally to extract the relevant data and then FTP it to the central server to be processed. This worked very well, but it required someone to visit each site to set up the packages and schedules. Snow has in the region of 5,000 clients, and for the new product to be successful we needed a less hands-on approach.
Enter the Snow Update Service, or SUS for short. What is SUS? SUS is Snow's own version of a system such as Windows Update: a central server that client installations can connect to in order to request and download the latest updates to all of the software we sell. Sounds familiar, right? It is the distributed ETL paradigm in reverse. It just needed a bit of work to enable it to send new data to a central data storage area.
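The push model above can be sketched in a few lines of Python. This is a toy illustration of the paradigm, not the SUS implementation: the `extract` and `push` functions are hypothetical names, and a local temporary directory stands in for the central landing area that SUS would upload to.

```python
import gzip
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def extract(client_id: str) -> list:
    """Stand-in for the local extract step on a client installation."""
    return [{"client": client_id, "metric": n} for n in range(3)]

def push(payload: list, landing_dir: Path, client_id: str) -> Path:
    """Compress the extract and push it to the central landing area
    (a local directory here, standing in for the real upload target)."""
    target = landing_dir / f"{client_id}.json.gz"
    target.write_bytes(gzip.compress(json.dumps(payload).encode()))
    return target

with TemporaryDirectory() as tmp:
    landing = Path(tmp)
    # Each client pushes its own data; no central server pulls anything.
    for client_id in ("client-a", "client-b", "client-c"):
        push(extract(client_id), landing, client_id)
    uploaded = sorted(p.name for p in landing.glob("*.json.gz"))
```

The key design point is that all scheduling and network access lives on the client side, so the central repository never needs a route into thousands of private networks; it only has to accept uploads.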
With a system in place that, once a client had bought an analytics pack, would enable their data to be uploaded, we could start to build out the central warehouse and everything around it to enable clients to visualise their data.
In my previous post I covered the journey into Snow Software and SAM. It was at times a tortuous journey learning the intricacies of software licensing, mainly from reverse engineering SQL stored procedures and views. Times were moving quickly and work needed to be done.
My first job was to produce an on-premises Power BI template that could be applied to a local install of our application server to extract pertinent information and present useful BI insights into the client organisation. This was my first foray into understanding the intricacies of SAM, and it shaped my thinking about the challenges of an ever-changing dataset.
The lesson I learned from this exercise was that it was pretty much commercial suicide: the services side of the business would be flooded with support requests to support multiple versions of our installed server applications. This is where my experience of distributed ETL rang true.
Wow, 2018 has been a massive year for me. This time last year I was in the interview process for my current job. Little did I know how much I would learn in the job that I would accept and start on the 1st of February 2018.
Coming into the job, all I knew was that I would be the first member of a brand-new team and function within my new company, Snow Software, and that they were industry leaders in a practice called Software Asset Management, or SAM for short.
Now for those who have not encountered SAM, it is the practice of monitoring the usage and compliance of software licensing. Yawn, you may say, but having spent some time dealing with the data and the intricacies of software licensing, I can tell you that non-compliance, or even over-compliance (having too many licenses), has a massive financial impact on businesses in this day and age. Being able to monitor and fix the gaps in compliance can really reap benefits, especially when one of the big four software companies can come in and audit your software usage whenever they want…
Now back to the question at hand: how does one come into a business that performs a function like this for its clients, and then re-present that data to them in a more enriched environment? That was the reason I was employed, and the process of enabling clients to access and act on their data will follow in the next few blogs.