Skip to content

Instantly share code, notes, and snippets.

@trumpet7347
Forked from danieljarolim/SSIS_SHA1.cs
Last active February 9, 2018 18:11
Show Gist options
  • Save trumpet7347/6ce9200bd749932220fb42e98155c4a6 to your computer and use it in GitHub Desktop.
Save trumpet7347/6ce9200bd749932220fb42e98155c4a6 to your computer and use it in GitHub Desktop.
SHA1 SSIS Script Component
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using Microsoft.SqlServer.Dts.Pipeline;
using System.Security.Cryptography;
using System.Text;
[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
private int _columnCount;
private ColumnClass[] _columnNameArray;
private PipelineBuffer _inputBuffer;
private SHA1CryptoServiceProvider _sha1Hash;
private class ColumnClass
{
public int Index;
public string Name;
}
public override void PreExecute()
{
base.PreExecute();
// Initialise the SHA1 Crypto Provider
_sha1Hash = new SHA1CryptoServiceProvider();
_columnCount = ComponentMetaData.InputCollection[0].InputColumnCollection.Count;
_columnNameArray = new ColumnClass[_columnCount];
int[] ColumnIndexes = GetColumnIndexes(ComponentMetaData.InputCollection[0].ID); // Same as InputID in ProcessInput
int columnIndex = 0;
foreach (IDTSInputColumn100 item in ComponentMetaData.InputCollection[0].InputColumnCollection)
{
_columnNameArray[columnIndex] = new ColumnClass();
_columnNameArray[columnIndex].Name = Convert.ToString(item.Name);
_columnNameArray[columnIndex].Index = ColumnIndexes[ComponentMetaData.InputCollection[0].InputColumnCollection.GetObjectIndexByID(item.ID)];
columnIndex++;
}
Array.Sort(
_columnNameArray,
delegate (ColumnClass column1, ColumnClass column2)
{ return column1.Name.CompareTo(column2.Name); }
);
}
public override void PostExecute()
{
base.PostExecute();
}
public override void ProcessInput(int InputID, Microsoft.SqlServer.Dts.Pipeline.PipelineBuffer Buffer)
{
// We need access to the PipelineBuffer which isn't exposed in ProcessInputRow
_inputBuffer = Buffer;
base.ProcessInput(InputID, Buffer);
}
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
var hashString = new StringBuilder(_columnCount);
for (int columnIndex = 0; columnIndex < _columnCount; columnIndex++)
{
var indexValue = _inputBuffer[_columnNameArray[columnIndex].Index];
// Check if column is blob, if so, get the string of the blob
if (indexValue is BlobColumn)
{
hashString.Append(BlobColumnToString(indexValue as BlobColumn));
hashString.Append("|");
}
// Check if column is null, and appends string "NULL" if so
else if (indexValue == null)
{
hashString.Append("NULL");
hashString.Append("|");
}
// Else, just add the value of the column to the string to be hashed
else
{
hashString.Append(Convert.ToString(indexValue));
hashString.Append("|");
}
}
// Generate the hash and save it to the output
Row.SHA1 = _sha1Hash.ComputeHash(Encoding.Unicode.GetBytes(hashString.ToString()));
}
// Converts a blob column to a redable string
private string BlobColumnToString(BlobColumn blobColumn)
{
if (blobColumn.IsNull)
{
return "NULL";
}
var blobLength = Convert.ToInt32(blobColumn.Length);
var blobData = blobColumn.GetBlobData(0, blobLength);
var stringData = Encoding.Default.GetString(blobData);
return stringData;
}
}

SHA1 SSIS Script Component

When comparing entire rows of a wide table with many columns, using the derived column component or conditional split component becomes complicated because of the 4000 character limit. Changes to metadata in the data flow mean lots of work updating complex comparison expressions. It’s easier to calculate a hash value of the fields and compare the hash value in the conditional split.

There are a few ways to get the hash.

  • Using the HASH() function within an SQL query but that’s not possible for columns changed by the data flow.
  • The SSIS Multiple Hash component from http://ssismhash.codeplex.com/ which I’ve found a little slow to setup because of the way input fields are selected.
  • This script. I’ve found it’s just as fast as the SSIS Multiple Hash custom component.

To use the script: Drag in a script component (default C#) and add a single (DT_BYTES,20) output called SHA1, edit the script and paste in the script.

It’s fast to use when creating data flows if you copy and paste the component to create new instances rather than creating from scratch each time. Then select the input rows to hash. Hit edit to open the script and close it again to re-save.

There are caveats to using the script.

  • The fields are sorted in alpha order based on the data flow field names and concatenated before SHA1 is calculated on the whole. This means changes to field names could change sort order and change the hash value. It also means fields on both sides of the indented compare should be named the same or named in a way that results in the same sort order.
  • Because of the method used to sort the fields, any time columns within the data flow change, the script needs to be opened and closed so SSIS can update the field names and field indexes used for sorting by the script.

Fork Edits

In this fork, the udpated code will now handle if a column is NULL, and treat it differently than if the column is just an empty string by simply passing "NULL" as the string value isntead of an empty string. I also updated it to handle BlobColumns correctly, so that is can also hash varchar(MAX) columns

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment