This post is the short version of a presentation I gave on batch execution for Azure ML through SSIS. The material is available here: http://www.sqlsaturday.com/433/Sessions/Details.aspx?sid=38704
Predictive experiments in Azure ML act as a wrapper around your trained model, exposing it through web service inputs and outputs. As a developer consuming these services, that is about as much as you need to know: we are going to sample some data, send it to the model, get scores back, and then load the new information into a data warehouse (DW).
This is what the design of a typical predictive experiment looks like. On the left is the finished and polished machine learning model, at the top is the web service input that defines the metadata of the incoming data, and in the middle is the scoring activity.
If you want to follow along at home, you will need the following:
- SQL Server 2014 Developer Edition
- Visual Studio 2013 with SSDT-BI
- Azure Feature Pack
- Custom component for the batch execution
The finished product will be a package as simple as the one in the introduction: load some data to Azure Blob Storage, create and execute a batch job on that data, retrieve the output URI, and load the results back into a database table.
This package can act as a template for future use. The components can be parameterized, and with some effort you can get it to work with BIML.
Azure Machine Learning has two modes of running predictive experiments through the API:
- Request-Response Service (RRS), which takes the inputs for a single case and returns the predictions. It is useful for streaming data; a sketch of a single call follows this list.
- Batch Execution Service (BES), which we will explore here, since SQL Server data and SSIS are inherently batch oriented.
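As a taste of the RRS mode, here is a minimal Python sketch of a single-row call. The endpoint URL, API key, and column names are placeholders, and the request shape is paraphrased from the sample code Azure ML generates for a published web service, so check your own service's API help page for the exact format.

```python
import requests

# Placeholders -- copy the real values from your web service's API help page.
url = "https://<region>.services.azureml.net/workspaces/<workspace>/services/<service>/execute?api-version=2.0&details=true"
api_key = "<your API key>"

# One case of web service input; the column names must match the web service input schema.
payload = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["Feature1", "Feature2"],
            "Values": [["42", "3.14"]],
        }
    },
    "GlobalParameters": {},
}

response = requests.post(url, json=payload, headers={"Authorization": "Bearer " + api_key})
print(response.json())  # the scored label and probability come back per row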
When running a batch execution for an Azure ML predictive experiment, it is important to remember that the only source you can use is a file in Azure Blob Storage, typically a csv file. Therefore we must first make sure our data is in the correct format before running the batch, and we need to know where to look for the results, which, to no surprise, is also a csv file in Azure Blob Storage.
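To make the batch flow concrete before we build it in SSIS, here is a rough Python sketch of the three REST calls involved: create a job that points at the input csv in blob storage, start it, and poll until it finishes to get the output location. Again, the URLs and field names are placeholders paraphrased from the sample code Azure ML generates; the custom component introduced later wraps exactly this kind of flow.

```python
import time
import requests

# Placeholders -- the batch endpoint and API key for your predictive experiment.
base_url = "https://<region>.services.azureml.net/workspaces/<workspace>/services/<service>"
api_key = "<your API key>"
headers = {"Authorization": "Bearer " + api_key}

# 1. Create a job that points at the csv we uploaded to blob storage.
job_request = {
    "Input": {
        "ConnectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>",
        "RelativeLocation": "batch-input/customers.csv",  # container/filename
    },
    "GlobalParameters": {},
}
job_id = requests.post(base_url + "/jobs?api-version=2.0",
                       json=job_request, headers=headers).json()

# 2. Start the job.
requests.post(base_url + "/jobs/" + job_id + "/start?api-version=2.0", headers=headers)

# 3. Poll until it is done, then read where the output csv landed.
while True:
    status = requests.get(base_url + "/jobs/" + job_id + "?api-version=2.0",
                          headers=headers).json()
    if status["StatusCode"] in ("Finished", "Failed", "Cancelled"):
        break
    time.sleep(15)

output_location = status["Results"]["output1"]["RelativeLocation"]  # another csv in blob storage
```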
First things first.
Our control flow kicks off with a data flow that takes our data, preferably from a view with the correct formatting and filtering applied, and puts it into a csv file in Azure Blob Storage.
The connection manager for our DW is pretty straightforward, but for the blob storage we are going to need that Azure Feature Pack I was telling you about. Make sure that you have installed the pack properly by looking for the Azure Blob Destination among the destinations in the SSIS Toolbox in Visual Studio.
Make sure you have your connection information available. You can use the same blob storage account that Azure ML uses; just make sure you have the storage account name and an access key, which you can get by drilling down in the Azure portal. You probably want to create a container for the csv files you are going to upload as input for the batch executions.
With the connection managers in place we can add the source and destination to our data flow, hook each up to its respective connection, point the source at the table or view holding the data, and set the container and file name in the destination. These settings are great candidates for parameterization!
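For reference, this is roughly what the Azure Blob Destination does for us, sketched in Python with the azure-storage-blob package. The account name, access key, container, and file name are placeholders standing in for the values you just parameterized.

```python
from azure.storage.blob import BlobServiceClient

# Placeholder connection details -- the same storage account name and access key
# you entered in the SSIS connection manager.
conn_str = (
    "DefaultEndpointsProtocol=https;"
    "AccountName=<storage account>;"
    "AccountKey=<access key>;"
    "EndpointSuffix=core.windows.net"
)

service = BlobServiceClient.from_connection_string(conn_str)
blob = service.get_blob_client(container="batch-input", blob="customers.csv")

# Upload the csv produced from the view (read from a local file here for simplicity).
with open("customers.csv", "rb") as data:
    blob.upload_blob(data, overwrite=True)
```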
If you are unsure where to find your blob storage information, head over to manage.windowsazure.com and make sure you have a storage account set up. Then find the link to the service endpoint and one of your access keys.
The second part is to run the batch job. To get the batch job up and running, all we need to do is create an execution, point it to where the source input is located, and then start the job.
Well, it turns out that is easier said than done. The inspiration for this blog post came from another post by Chris Price over here: http://blogs.msdn.com/b/bluewatersql/archive/2015/04/04/building-an-azure-ml-ssis-task.aspx
What you need to do is clone his project, build it, and deploy the assembly both for the IDE and for execution (including on any other servers where you want the packages to run). The component will show up in the control flow section of your SSIS Toolbox when properly installed.
I augmented some of his code to add another output method: just the URI of the experiment output.
I also changed the component to use a storage connection manager from the Azure Feature Pack instead of the connection manager from the other project in Chris's solution.
You want to keep track of at least the URL and API key for the predictive experiment. You might also want to parameterize the source blob location.
With this in place we can tell the component where to get the web service input for our experiment, and which variable to update with the location of the web service output from the experiment.
So we add a variable for the output location and split it into the two variables that the Azure Blob Source in our last data flow needs. This is because that component expects a container and a file name, not one long path.
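In SSIS the split is just two expression-based variables; the logic amounts to something like this (Python sketch, assuming the output location comes back in container/path form):

```python
# Example value -- the relative location of the experiment output in blob storage.
output_location = "experiment-output/results/scored-dataset.csv"

# Everything before the first slash is the container; the rest is the file (blob) name.
container, _, file_name = output_location.partition("/")

print(container)  # experiment-output
print(file_name)  # results/scored-dataset.csv
```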
This leaves us with the task of getting the data back.
Our second and last data flow task reads the experiment output from the Azure Storage location where it landed and makes sure its format matches the table in our DW. I also put in a derived column that turns the output into a boolean, since the experiment returns the label as a string and the score as a numeric value.
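The derived column is nothing fancy; the logic is roughly the following (Python sketch, with hypothetical column names standing in for the scored label and scored probability columns the experiment returns):

```python
def to_boolean(scored_label: str, scored_probability: float) -> bool:
    # The label comes back as a string and the score as a numeric value;
    # collapse them into the single boolean column in the DW table.
    return scored_label == "1" and scored_probability >= 0.5
```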
In summary, we have looked at how the Azure ML API works and how we can use it from SSIS. The big challenge lies in controlling and keeping track of the batch execution job while it is running.
If you are more serious about using batch execution, I recommend looking at Azure Data Factory.