One of the tricks i used was to figure out how to only apply changes and not do a full load each night. The source data was not setup for auto change detection and did not have a modified date i could use. This was a head turner.
The solution was to move lots of the code from SQL into SSIS. The following discussion breaks down various features that were used.
The first thing i tackled was a massive SQL statement that had a ton of joins (lets say close to 20). I broke this up by taking almost all of the joins out and moving those to SSIS lookups. This allowed the original retrieval of data to become quite fast.
There were a couple of areas where i could operate on the smaller data and make decisions at that point. This would limit the amount of rows from further down the pipe.
Next, i went through the current SQL code that worked on the tables i was working on. I found a ton of Update statements that would process the entire table and some Updates that would use a Where clause. I moved all of this code in either Script code or Derived Column. Since the data was in memory at this point, the new code works very fast and eliminated many expensive SQL operations.
Now, how could we detect the changes to rows? The solution was to calculate an SHA hash value for each row and store it in a new table. The field name is ChangeHashSHA and the row is keyed with an ID to the source table row. The length of the ChangeHashSHA field is 88 and its a string. This field was added to the incoming data using a Derived Column Transformation.
First i created a script method to calculate the hash value for a field:
Public Function CreateSHAHash(ByVal Input As String) As String Dim HashTool As New System.Security.Cryptography.SHA512Managed() Dim inputBytes As Byte() = System.Text.Encoding.ASCII.GetBytes(Input) Dim hashBytes As Byte() = HashTool.ComputeHash(inputBytes) ' Convert the byte array to hexadecimal string
HashTool.Clear()
Dim myMetadata As IDTSComponentMetaData100 myMetadata = Me.ComponentMetaData Return Convert.ToBase64String(hashBytes) End Function
Next i created a method to add the value to the main value:
Private Sub AddItemToUpdate(ByVal ColumnValue As String) UpdateTestValue &= ColumnValue.Trim().Replace(" ", "").Replace(":", "").Replace("-", "") End Sub
The next code comes from the main script method (Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)). This
If Row.ClaimAcceptedDateConverted_IsNull = False Then AddItemToUpdate(Row.ClaimAcceptedDateConverted.ToString()) End If
'Now add the new hash to ChangeHash Row.ChangeHashSHA = CreateSHAHash(UpdateTestValue) Row.ChangeDate = Me.Variables.BuildDateTime
There is an If block for each field in the source table. You can see how the call to CreateSHAHash will pass back the new SHA has for the row.
Now that we have a hash for the row, we can lookup the stored hash (if there is one) for that ID and test them. If they match, we drop this row as it has not changed. If they don’t match we do an update of the source and the new hash value and if the lookup fails, we do an insert and also insert the new hash.
One other feature we used here was the MultiCast transform. The original load would do an insert into the History table and then after all the processing was finished a Select Into another table. Instead of doing this, i used the MultiCast to split the incoming stream and make sure each record was going to the 2nd table and then do an insert. This totally eliminated an expensive SQL process (Select Into).
The next post will dig into another are of this application.