In one of my previous blogs I explained basic descriptive statistics procedures. When you prepare the data for Data Mining, there is never enough of over viewing and checking the data with statistical methods. Therefore, it would be nice to have more of them available in SQL Server. Very useful measures are the 3rd and the 4th population moments- Skewness and Kurtosis.
Skewness is a parameter that describes asymmetry in a random variable’s probability distribution. Skewness characterizes the degree of asymmetry of a distribution around its mean. Positive skewness indicates a distribution with an asymmetric tail extending toward more positive values. Negative skewness indicates a distribution with an asymmetric tail extending toward more negative values.
The formula for the skewness consists of sum of the deviations from the mean divided by standard deviation on the third degree. Additionally, the formula can include a corrective factor to shift the values of the skewness around 0, so 0 means no skewness at all. I am using the same formulas as MS Excel Skew function. So the formula for the skewness is:
Skew = n / (n-1) / (n-2) * SUM(((x-µ)/s)3)
Kurtosis characterizes the relative peakedness or flatness of a distribution compared with the normal distribution. Positive kurtosis indicates a relatively peaked distribution. Negative kurtosis indicates a relatively flat distribution. The formula for the kurtosis consists of sum of the deviations from the mean divided by standard deviation on the fourth degree. Again, the formula can have some corrective factor to centralize the kurtosis around 0, and again I am using the same formula as MS Excel. The formula:
Kurt = n * (n+1) / (n-1) / (n-2) / (n-3) * SUM(((x-µ)/s)4) – 3 * (n-1)2 / (n-2) / (n-3)
In this first part I am adding the Transact-SQL solutions. Note how I add the Mean and StDev values to each row of the outer table – using a cross join with a derived table that calculates them. I could calculate them with a CTE, but this solution with a cross join works on SQL 2000 as well. Still, I have two passes through the data. In next blog I will show you how you can calculate skewness and kurtosis with CLR UDAs in a single pass through the data.
USE AdventureWorksDW;
Go
— Skewness
SELECT Skew = SUM(((Age*1.0–m.mean)/m.[StDev])*((Age*1.0–m.mean)/m.[StDev])*((Age*1.0–m.mean)/m.[StDev]))
* MIN(m.corrfact1)
FROM vTargetMail v CROSS JOIN
(SELECT mean = AVG(Age*1.0), [StDev] = STDEV(Age),
corrfact1 = COUNT(*)*1.0 / (COUNT(*)-1) / (COUNT(*)-2)
FROM vTargetMail v) AS m;
— Kurtosis
SELECT Kurt = SUM(SQUARE(SQUARE(((Age*1.0–m.mean)/m.[StDev])))) * MIN(m.corrfact2) – MIN(m.subfact)
FROM vTargetMail v CROSS JOIN
(SELECT mean = AVG(Age*1.0), [StDev] = STDEV(Age),
corrfact2 = COUNT(*)*1.0 * (COUNT(*)+1) / (COUNT(*)-1) / (COUNT(*)-2) / (COUNT(*)-3),
subfact = 3.0 * SQUARE((COUNT(*)-1)) / (COUNT(*)-2) / (COUNT(*)-3)
FROM vTargetMail v) AS m;
GO