The distributed ETL paradigm

In my previous post I signed off the post by mentioning something I have come to call distributed ETL. This was a solution to a problem I was presented with many years ago when I  was working as a consultant. A client of mine was building a central data warehouse to host client data. The problem arose because the clients of my client were all on their own networks which meant that the normal pull architecture of ETL, a central ETL server orchestrating the data extract, was not going to work. The data had to be pushed from each of the clients installations over the internet to a central repository where the T&L portions could be carried out.

Now back in my previous role, I created some SSIS packages that could be installed locally to extract the relevant data and then FTP it to the central server to be processed. This worked very well, but it required someone to visit each site and set up the packages and schedules. Now Snow has in the region of 5000 clients, and for the new product to be successful we needed a less hands on approach.

Enter the Snow Update Service or SUS for short. What is SUS? SUS is Snow’s own version of a system such as Windows Update. It is a system that has a central server that clients installations can connect to, to request and download the latest updates to all of our software that we sell. Sounds familiar right? The distributed ETL paradigm in reverse. It just needed a bit of work to enable it to send new data to a central data storage area.

Now we had a system that, once a client had bought an analytics pack, would enable data to be uploaded, we could start to build out the central warehouse and all that lay around it to enable the clients to be able to visualise their data.