In this last article of the series about Python for SQL Server specialists, you are going to learn how to use the Microsoft Python libraries inside SQL Server. You can use two scalable libraries, revoscalepy and microsoftml, which correspond to the equivalent R libraries.
SQL Server and ML Integration
With SQL Server 2016 and 2017, you get a highly scalable Machine Learning (ML) engine. Not every function and algorithm is rewritten as a scalable one. Nevertheless, you will probably find the one you need for your analysis of a big dataset. You can store a Python or R data mining or machine learning model in a SQL Server table and use it for predictions on new data. You can even store graphs in a binary column and use them in SQL Server Reporting Services (SSRS) reports. Finally, ML support is not limited to SQL Server only. You can also use R code in Power BI Desktop and the Power BI Service, where at this time (April 2018) we are still waiting for Python support. However, you can already use both languages, Python and R, in Azure Machine Learning (Azure ML) experiments.
Microsoft provides the highly scalable ML engine in two flavors:
• ML Services (In-Database): This is the installation that integrates ML into SQL Server. It includes a database service that runs outside the SQL Server Database Engine and provides a communication channel between the Database Engine and the R or Python runtime. You install it with SQL Server setup. The ML engine includes the open source R and Python components and, in addition, a set of scalable R and Python packages.
• Microsoft ML Server: This is a standalone ML server with the same open and scalable packages that run on multiple platforms.
Some of the scalable packages shipped with SQL Server ML Services are:
• RevoScaleR (for R) and revoscalepy (for Python): This is a set of parallelized scalable functions for processing data, data overview and preliminary analysis, and machine learning models. The functions in these packages can work with chunks of data at a time, so they do not need to load all of the data into memory at once.
• MicrosoftML (for R) and microsoftml (for Python): These packages, first released in December 2016, implement many additional scalable machine learning algorithms. A minimal example follows right after this list.
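This article focuses on revoscalepy, so here is just a quick taste of the microsoftml package. The following is a minimal sketch, not part of the original demo: it trains a binary classifier with the rx_logistic_regression() function on a tiny made-up data frame, assuming the function's default binary classification settings. You need a machine with ML Services or ML Server installed to run it.

import pandas as pd
from microsoftml import rx_logistic_regression, rx_predict

# A tiny made-up data frame, for illustration only
df = pd.DataFrame({
    "YearlyIncome": [30000.0, 60000.0, 90000.0, 120000.0, 40000.0, 80000.0],
    "Age": [25.0, 35.0, 45.0, 55.0, 30.0, 50.0],
    "BikeBuyer": [False, True, True, False, False, True]})
# Train a binary logistic regression model with an R-style formula
logit = rx_logistic_regression(formula = "BikeBuyer ~ YearlyIncome + Age",
                               data = df)
# Score the same rows and show the predictions
print(rx_predict(logit, data = df))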
The following figure shows how the communication process between SQL Server and the ML engine works:
The components involved and their communications are as follows:
• In the SQL Server Database Engine, you run an R or Python script with the sys.sp_execute_external_script system stored procedure. SQL Server sends the request to the Launchpad service, a new service that supports the execution of external scripts.
• The Launchpad service starts the launcher appropriate for the language of your script, either the RLauncher.dll or the PyLauncher.dll, and therefore you can launch an external script from SQL Server using the R or Python language. You can see that the infrastructure is prepared to enable the execution of scripts in additional programming languages.
• The RLauncher or the PyLauncher starts RTerm.exe, the R terminal application for executing R scripts, or Python.exe, the Python terminal application.
• The terminal application in either of the two languages sends the script to BxlServer. This is a new executable used for communication between SQL Server and the ML engine. The scalable ML functions are implemented in this executable as well.
• BxlServer uses SQL Satellite, a new extensibility API that provides fast data transfer between SQL Server and the external runtime. Again, currently only the R and Python runtimes are supported.
Using the Scalable Functions
Time to start using the scalable functions. I will show you how to use some of the functions from the revoscalepy package, which is part of the imports at the beginning of the following code. The code defines the connection string to my local SQL Server and the AdventureWorksDW2016 database. Remember, the dbo.vTargetMail view comes from this database. Also note that the RUser login used to connect to SQL Server needs permission to execute the sys.sp_execute_external_script procedure. The code also defines a chunk size of 1,000 rows.
import numpy as np
import pandas as pd
import pyodbc
import revoscalepy as rp

# Connection string and chunk size
sqlConnStr = """Driver=SQL Server;Server=SQL2017EIM;
Database=AdventureWorksDW2016;Uid=RUser;Pwd=Pa$$w0rd"""
chunkSize = 1000
The next code defines the query that reads the data from SQL Server. Note that you do not need to open an ODBC connection yourself. The RxSqlServerData() function generates a SQL Server data source object. You can think of it as a proxy object for the SQL Server rowset that is the result of the query. The data itself is not loaded into memory.
# Define the query
TMquery = """SELECT EnglishOccupation AS Occupation,
    YearlyIncome, Age, TotalChildren,
    NumberCarsOwned, BikeBuyer
FROM dbo.vTargetMail"""
# Only creates the data source object and does not populate it
sqlTM = rp.RxSqlServerData(sql_query = TMquery,
                           connection_string = sqlConnStr,
                           string_as_factors = True,
                           rows_per_read = chunkSize)
Of course, you can also load the data in an in-memory data frame, like the following code does.
# Import the data in memory
TMSQL = rp.rx_import(input_data = sqlTM, report_progress = 3)
With the rx_get_info() function, you can get some basic information about both the proxy object and the in-memory data frame.
# Get info about the data source and the memory data
rp.rx_get_info(sqlTM)
rp.rx_get_info(TMSQL)
Here are the results. You can see that the first object is just a proxy, while the second has rows and columns.
>>> rp.rx_get_info(sqlTM)
Connection string: Driver=SQL Server;Server=SQL2017EIM;
Database=AdventureWorksDW2016;Uid=RUser;Pwd=Pa$$w0rd
Data Source: SqlServer
>>> rp.rx_get_info(TMSQL)
Number of observations: 18484.0
Number of variables: 6.0
You can use the rx_summary() revoscalepy scalable function to quickly get some descriptive statistics for the data. The function accepts either the proxy object or the in-memory data frame as the source.
# rx_summary() can use the data source or the memory data
sumOut = rp.rx_summary(formula = "~ NumberCarsOwned + Occupation + F(BikeBuyer)",
                       data = sqlTM)
print(sumOut)
sumOut = rp.rx_summary(formula = "~ NumberCarsOwned + Occupation + F(BikeBuyer)",
                       data = TMSQL)
print(sumOut)
For the sake of brevity, I am showing only one result here.
Summary Statistics Results for: ~ NumberCarsOwned + Occupation + F(BikeBuyer)
Number of valid observations: 18484.0

              Name      Mean    StdDev  Min  Max  ValidObs  MissingObs
0  NumberCarsOwned  1.502705  1.138394  0.0  4.0   18484.0         0.0

Category Counts for Occupation
Number of categories: 5
                Counts
Occupation
Professional    5520.0
Management      3075.0
Skilled Manual  4577.0
Clerical        2928.0
Manual          2384.0

Category Counts for F(BikeBuyer)
Number of categories: 2
             Counts
F(BikeBuyer)
1            9352.0
2            9132.0
The functions that do not come from the revoscalepy package cannot use the proxy object. For example, the following code uses the pandas crosstab() function to do a crosstabulation on the data from the in-memory data frame and from the proxy object. Note that the code that tries to use the proxy object produces an error.
# Pandas crosstab can use the data, but can't use the data source object
pd.crosstab(TMSQL.NumberCarsOwned, TMSQL.TotalChildren)
# The next row produces an error
pd.crosstab(sqlTM.NumberCarsOwned, sqlTM.TotalChildren)
# AttributeError: 'RxSqlServerData' object has no attribute 'NumberCarsOwned'
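If you need a crosstabulation computed by the scalable engine itself, the revoscalepy package also includes a rx_crosstabs() function that can work with the data source object. The following sketch is not part of the original demo; I am assuming the formula follows the same conventions as RevoScaleR's rxCrossTabs(), with the F() notation marking integer columns as categorical.

# A scalable crosstab that can read directly from the data source object
cTab = rp.rx_crosstabs(formula = "~ F(NumberCarsOwned) : F(TotalChildren)",
                       data = sqlTM)
print(cTab)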
The next step is initializing and training a linear regression model, using the number of cars owned as the target variable, and income, age, and number of children as input variables. Please note the syntax of the rx_lin_mod() function: it uses an R-style formula to define the model. This syntax might be simpler for you if you already use R; however, it might look a bit odd to pure Python developers.
# Create a linear regression model
linmod = rp.rx_lin_mod(
    "NumberCarsOwned ~ YearlyIncome + Age + TotalChildren",
    data = sqlTM)
Finally, the following code makes the predictions on the in-memory data frame and shows the first ten rows of those predictions.
# Predictions on the memory data frame
predmod = rp.rx_predict(linmod, data = TMSQL, output_data = TMSQL)
predmod.head(10)
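I mentioned at the beginning that you can store a trained model in a SQL Server table. Here is a minimal sketch of one way to do it from the client session: it pickles the linmod object and writes it into a VARBINARY(MAX) column through pyodbc. The dbo.PythonModels table is my own assumption for this example and does not exist in the demo database, so you would have to create it first; revoscalepy also offers its own model serialization for native scoring, which I am not covering here.

import pickle
import pyodbc

# Assumed table, created beforehand:
#   CREATE TABLE dbo.PythonModels (ModelName NVARCHAR(50), Model VARBINARY(MAX));
conn = pyodbc.connect("Driver=SQL Server;Server=SQL2017EIM;"
                      "Database=AdventureWorksDW2016;Uid=RUser;Pwd=Pa$$w0rd")
# Serialize the model trained above and store it as binary data
conn.execute("INSERT INTO dbo.PythonModels (ModelName, Model) VALUES (?, ?)",
             "TMLinMod", pickle.dumps(linmod))
conn.commit()
conn.close()

Later, you can read the binary column back, restore the model with pickle.loads(), and feed it to rx_predict() for scoring new data.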
Note that the RevoScaleR package for R (https://docs.microsoft.com/en-us/machine-learning-server/r-reference/revoscaler/revoscaler) includes many more functions than the revoscalepy package (https://docs.microsoft.com/en-us/machine-learning-server/python-reference/revoscalepy/revoscalepy-package) currently provides for Python.
Executing Python Code in SQL Server
Now you need to switch to SQL Server Management Studio (SSMS). You will use Python inside T-SQL code. If you did not configure your SQL Server to allow external scripts, you have to do it now.
-- Configure SQL Server to allow external scripts
USE master;
EXEC sys.sp_configure 'show advanced options', 1;
RECONFIGURE WITH OVERRIDE;
EXEC sys.sp_configure 'external scripts enabled', 1;
RECONFIGURE WITH OVERRIDE;
GO
-- Restart SQL Server
-- Check the configuration
EXEC sys.sp_configure;
GO
You can immediately check whether you can run Python code with the sys.sp_execute_external_script procedure. The following code returns a 1×1 table, with value 1 in the single cell.
-- Check whether Python code can run
EXECUTE sys.sp_execute_external_script
@language = N'Python',
@script = N'
OutputDataSet = InputDataSet
print("Input data is: \n", InputDataSet)
',
@input_data_1 = N'SELECT 1 AS col';
GO
And finally, here is the bigger code block that runs Python to create the same linear regression model as before, this time inside SQL Server. In the result, you get the actual data together with the predicted number of cars.
-- Create a model and use it inside SQL Server
USE AdventureWorksDW2016;
EXECUTE sys.sp_execute_external_script
@language = N'Python',
@script = N'
from revoscalepy import rx_lin_mod, rx_predict
import pandas as pd
linmod = rx_lin_mod(
    "NumberCarsOwned ~ YearlyIncome + Age + TotalChildren",
    data = InputDataSet)
predmod = rx_predict(linmod, data = InputDataSet, output_data = InputDataSet)
print(linmod)
OutputDataSet = predmod
',
@input_data_1 = N'
SELECT CustomerKey,
    CAST(Age AS INT) AS Age,
    CAST(YearlyIncome AS INT) AS YearlyIncome,
    TotalChildren, NumberCarsOwned
FROM dbo.vTargetMail;'
WITH RESULT SETS ((
    "CustomerKey" INT NOT NULL,
    "Age" INT NOT NULL,
    "YearlyIncome" INT NOT NULL,
    "TotalChildren" INT NOT NULL,
    "NumberCarsOwned" INT NOT NULL,
    "NumberCarsOwned_Pred" FLOAT NULL));
GO
Before finishing this article, let me point out the casts in the input SELECT statement. Compared to SQL Server, Python supports a limited number of data types. Some SQL Server data types are converted to Python data types implicitly; others must be converted manually. You can read the details about the possible implicit conversions at https://docs.microsoft.com/en-us/sql/advanced-analytics/python/python-libraries-and-data-types.
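As a quick check of how the types arrive on the Python side, you can inspect the dtypes of the data frame from the earlier client-side session. This assumes the TMSQL frame created by the rx_import() call above is still available.

# Check how the SQL Server columns were mapped to pandas dtypes
print(TMSQL.dtypes)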
Conclusion
This article concludes the series about Python for SQL Server specialists. I hope that you enjoyed it, and especially that you gained some new knowledge that you will use soon.
Read the whole series:
Python for SQL Server Specialists Part 1: Introducing Python
Python for SQL Server Specialists Part 2: Working with Data
Python for SQL Server Specialists Part 3: Graphs and Machine Learning
This is just an introduction. For more on starting with data science in SQL Server, please refer to the book “Data Science with SQL Server Quick Start Guide” (https://www.packtpub.com/big-data-and-business-intelligence/data-science-sql-server-quick-start-guide), where you will learn about tools and methods not just in Python, but also in R and T-SQL.