@Hashbrown777
Created November 23, 2022 17:17
Get multiple checksum hashes on a per-chunk basis from a single file.
<# Similar to a checksum, but you receive one hash for every chunk of a given size in the file, as if hashing several files each of that size.
You could use this to assess where a file differs from another (say, over a network, where comparing digests is infinitely better than comparing the real byte streams).
-File
Can be a string or file reference (it's passed to Get-Item internally anyway)
-ChunkSize
The size, in bytes, to 'break' the file into; one hash is reported per chunk.
A chunk size greater than or equal to the file size is equivalent to a normal Get-FileHash call.
This figure does not have to be a multiple of BufferSize (nor vice versa); the streaming is robust.
-BufferSize
The amount, in bytes, to read at a time before passing it on to the hashing algorithm.
NB that double this amount is reserved in memory; one buffer of this size is filled from disk and passed on to be hashed,
and whilst the hashing is taking place on that buffer, the second buffer is filled from disk simultaneously
(that one is then hashed once the first is completed, while the first is refilled from disk, and so on).
-Algorithm
The hashing algorithm to use, accepting the same string names as Get-FileHash.
Alternatively you can pass a HashAlgorithm type (anything exposing a static '::Create()' method) or an already-constructed HashAlgorithm object.
E.g. `-Algorithm 'sha512'` or `-Algorithm ([System.Security.Cryptography.SHA512])`
-Accumulate
Instead of each chunk receiving its own hash, the file is hashed cumulatively from the start up to each chunk boundary.
This has the benefit that the final hash generated is the same as the hash of the whole file.
Unfortunately .NET implementations of HashAlgorithm do not allow you to calculate the hash AND then continue to digest data
(like the algorithms themselves would actually allow), nor can you clone the objects prior to, or "reset" them after, calling TransformFinalBlock.
So as a hack the Accumulate flag actually runs (FileSize / ChunkSize) hashes in tandem, throwing one away each time a hash is returned.
This is incredibly inefficient. Do not use Accumulate.
ChunkSum, whether the -ChunkSize exceeds the file size or is many multiples smaller, performs about on par with Get-FileHash, regardless of -BufferSize.
In testing on a 10GB file, ChunkSum actually vastly outperforms Get-FileHash for simple algorithms like MD5, but is considerably slower than it for longer checksum generators like SHA512 (see the usage example after the function).
#>
Function ChunkSum { Param(
    $File,
    [uint64]$ChunkSize = 1GB,
    [uint32]$BufferSize = 1MB,
    $Algorithm = 'MD5',
    [switch]$Accumulate
)
    # Resolve a string like 'sha512' to the matching type in System.Security.Cryptography
    if ($Algorithm -is [string]) {
        [AppDomain]::CurrentDomain.GetAssemblies() `
        | %{
            if ($Algorithm -is [string]) {
                $_.GetTypes()
            }
        } `
        | % {
            if (
                $Algorithm -is [string] -and
                $_.NameSpace -eq 'System.Security.Cryptography' -and
                $_.Name -eq $Algorithm
            ) {
                $Algorithm = $_
            }
        }
    }
    # If a HashAlgorithm type was supplied (or resolved above), instantiate it
    if ($Algorithm.BaseType.FullName -eq 'System.Security.Cryptography.HashAlgorithm') {
        $Algorithm = $Algorithm::Create()
    }
    # By this point we must hold a usable HashAlgorithm instance
    if ($Algorithm -isnot [System.Security.Cryptography.HashAlgorithm]) {
        throw $Algorithm.GetType().FullName
    }
    $File = $File | Get-Item
    if ($Accumulate) {
        # Run one HashAlgorithm per chunk boundary in tandem, queued so the earliest one can be
        # dequeued, finalised and discarded each time a boundary is reached (later method calls
        # like TransformBlock/Dispose on the queue fan out to every member via member enumeration)
        $toSum = [System.Math]::Ceiling($File.Length / $ChunkSize)
        ($tmp = [System.Collections.Queue]::new($toSum)).Enqueue($Algorithm)
        for ($Algorithm = $tmp; $toSum -gt 1; --$toSum) {
            $Algorithm.Enqueue($Algorithm.Peek()::Create())
        }
    }
    $File = [System.IO.File]::OpenRead($File.FullName)
    try {
        $toSum = $ChunkSize
        # Two read buffers: one is hashed while the other is being filled from disk
        $buffer = [byte[]]::new($BufferSize),[byte[]]::new($BufferSize)
        $parallel = 0
        for (
            $next = $File.ReadAsync($buffer[$parallel], 0, $BufferSize);
            $next -and ($next.Wait() -or $True);
            $parallel = $parallel -bxor 1
        ) {
            # Kick off the next read into the spare buffer before hashing the current one
            if ($bytes = $next.Result) {
                $next = $File.ReadAsync($buffer[$parallel -bxor 1], 0, $BufferSize)
            }
            else {
                $next = $NULL
            }
            $index = 0
            do {
                if ($bytes) {
                    # Starting a fresh chunk; reset the hasher unless accumulating
                    if (!$toSum) {
                        if (!$Accumulate) {
                            $Algorithm.Initialize()
                        }
                        $toSum = $ChunkSize
                    }
                    # Digest up to the end of the buffer or of the chunk, whichever comes first
                    $use = if ($bytes -lt $toSum) { $bytes } else { $toSum }
                    [void]$Algorithm.TransformBlock($buffer[$parallel], $index, $use, $buffer[$parallel], $index)
                    $index += $use
                    $bytes -= $use
                    $toSum -= $use
                }
                elseif ($toSum) {
                    # End of file with a partial chunk outstanding; force it to finalise below
                    $toSum = 0
                }
                else {
                    # End of file landed exactly on a chunk boundary; nothing left to emit
                    $toSum = -1
                }
                # If a chunk boundary has just been reached, emit that chunk's hash as uppercase hex
                if (!$toSum) {
                    if ($Accumulate) {
                        $tmp = $Algorithm.Dequeue()
                        [void]$tmp.TransformFinalBlock($buffer[$parallel], $index, 0)
                        ($tmp.Hash | %{ '{0:X2}' -f $_ }) -join ''
                        $tmp.Dispose()
                    }
                    else {
                        [void]$Algorithm.TransformFinalBlock($buffer[$parallel], $index, 0)
                        ($Algorithm.Hash | %{ '{0:X2}' -f $_ }) -join ''
                    }
                }
            } while ($bytes)
        }
    }
    finally {
        $Algorithm.Dispose()
        $File.Dispose()
    }
}
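
<# Example usage, as referenced above. This is only a sketch: the paths, chunk size and buffer size are
hypothetical and not part of the gist. Each call emits one hex digest string per chunk, in file order.

# One hash per 256MB of the file
ChunkSum -File 'D:\backups\image.vhdx' -ChunkSize 256MB -BufferSize 4MB -Algorithm 'SHA256'

# Sanity check: a chunk size at least as large as the file reproduces Get-FileHash's digest
$whole = (Get-FileHash 'D:\backups\small.iso' -Algorithm MD5).Hash
$whole -eq (ChunkSum -File 'D:\backups\small.iso' -ChunkSize 1TB)    # True
#>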
<# TODO(?) Implement skip and count flags that allow people to take a certain number of chunk hashes from arbitrary positions in the file.
This would allow someone to make a function *automatically* showing, to any given granularity, where two files diverge, using a 'binary search'.
Id est, take a hash of the whole files (up to the smallest filesize); if they don't match, hash the first half; if those match, do the step again on the latter half, and if not, hash the first half of that... et cetera.
Although you can manually do this now by just picking your smallest granularity right at the start, like a 10KB chunksize, and running this once on each file and comparing the two outputs (see the sketch after this comment).
It is not computationally more expensive to run over data with a tiny chunksize vs a large one; you'll just get more data back to compare.
Unless you have, say, 100GB files and/or "know" that the first xGB are identical, I don't see the point; your disk AND cpu speeds will always make multiple passes prohibitive.
Maybe instead what would be needed is a pausable version of the function, where calls for the two files are run in lockstep and bail out at the first difference,
so you don't process the whole file if it's excessively large, but I'd imagine anyone with this actual usecase is using a filesystem that either records
this checksum data at the block level for use, or is a distributed filesystem to begin with.
#>
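
<# A rough sketch of the manual approach described in the TODO above: hash both files once at a fine
granularity and report the first chunk whose digests differ. The paths and the 10KB chunk size are
hypothetical; in the network scenario each side would run its own ChunkSum and only the digest lists
would be exchanged.

$chunk  = 10KB
$local  = @(ChunkSum -File 'C:\data\copy-a.bin' -ChunkSize $chunk -Algorithm 'SHA256')
$remote = @(ChunkSum -File '\\server\data\copy-b.bin' -ChunkSize $chunk -Algorithm 'SHA256')
$firstDiff = 0..([System.Math]::Max($local.Count, $remote.Count) - 1) `
    | ?{ $local[$_] -ne $remote[$_] } `
    | Select-Object -First 1
if ($NULL -ne $firstDiff) {
    "Files first diverge in chunk $firstDiff (byte offset $([uint64]$firstDiff * $chunk))"
}
else {
    'No differing chunks found'
}
#>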