When comparing entire rows of a wide table with many columns, using the derived column component or conditional split component becomes complicated because of the 4000 character limit. Changes to metadata in the data flow mean lots of work updating complex comparison expressions. It’s easier to calculate a hash value of the fields and compare the hash value in the conditional split.
There are a few ways to get the hash.
- Using the HASH() function within an SQL query but that’s not possible for columns changed by the data flow.
- The SSIS Multiple Hash component from http://ssismhash.codeplex.com/ which I’ve found a little slow to setup because of the way input fields are selected.
- This script. I’ve found it’s just as fast as the SSIS Multiple Hash custom component.
To use the script: Drag in a script component (default C#) and add a single (DT_BYTES,20) output called SHA1, edit the script and paste in the script.
It’s fast to use when creating data flows if you copy and paste the component to create new instances rather than creating from scratch each time. Then select the input rows to hash. Hit edit to open the script and close it again to re-save.
There are caveats to using the script.
- The fields are sorted in alpha order based on the data flow field names and concatenated before SHA1 is calculated on the whole. This means changes to field names could change sort order and change the hash value. It also means fields on both sides of the indented compare should be named the same or named in a way that results in the same sort order.
- Because of the method used to sort the fields, any time columns within the data flow change, the script needs to be opened and closed so SSIS can update the field names and field indexes used for sorting by the script.
In this fork, the udpated code will now handle if a column is NULL, and treat it differently than if the column is just an empty string by simply passing "NULL" as the string value isntead of an empty string. I also updated it to handle BlobColumns correctly, so that is can also hash varchar(MAX) columns