@KEINOS
Last active March 20, 2024 04:05
Google Apps Script (GAS) custom function that returns an MD5 hash, or a 4-digit shortened hash, with support for multibyte (UTF-8) input.
/**
* ------------------------------------------
* MD5 function for GAS(GoogleAppsScript)
*
* Returns the MD5 hash of a string or, optionally, a 4-digit shortened hash.
* ------------------------------------------
* Usage1:
* `=MD5("YourStringToHash")`
* or
* `=MD5( A1 )`
* to use the A1 cell value as the argument of MD5.
*
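* result:
* `FCE7453B7462D9DE0C56AFCCFB756193`
* (Shown in upper case; the function returns lower case unless you change the last line.)
*
* To double-check, you can verify the hash locally in your terminal (macOS/BSD `md5` command):
* `$ md5 -s "YourStringToHash"`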
*
* Usage2:
* `=MD5("YourStringToHash", true)` for short Hash
*
* result:
* `6MQH`
* Note that the short hash has a much higher collision probability.
*
* How to install:
* Copy the script and paste it at [Extensions]-[Apps Script]-[Editor]-[<YourProject>.gs]
* or go to https://script.google.com and paste it.
* For more details go:
* https://developers.google.com/apps-script/articles/
*
* License: WTFPL (But mentioning the URL to the latest version is recommended)
*
* Version: 1.1.0.2022-11-24
* Latest version:
* https://gist.github.com/KEINOS/78cc23f37e55e848905fc4224483763d
*
* Author/Collaborator/Contributor:
* KEINOS @ https://github.com/keinos
* Alex Ivanov @ https://github.com/contributorpw
* Curtis Doty @ https://github.com/dotysan
* Haruo Nakayama @ https://github.com/harupong
*
* References and thanks to:
* https://stackoverflow.com/questions/7994410/hash-of-a-cell-text-in-google-spreadsheet
* https://gist.github.com/KEINOS/78cc23f37e55e848905fc4224483763d#gistcomment-3129967
* https://gist.github.com/dotysan/36b99217fdc958465b62f84f66903f07
* https://developers.google.com/apps-script/reference/utilities/utilities#computedigestalgorithm-value
* https://cloud.google.com/dataprep/docs/html/Logical-Operators_57344671
* https://gist.github.com/KEINOS/78cc23f37e55e848905fc4224483763d#gistcomment-3441818
* ------------------------------------------
*
* @param {(string|Byte[])} input The value to hash.
* @param {boolean} isShortMode Set true for a 4-digit shortened hash; otherwise the usual MD5 hash is returned.
* @return {string} The hashed input value.
* @customfunction
*/
function MD5( input, isShortMode )
{
    isShortMode = !!isShortMode; // Coerce to boolean (undefined becomes false)
    var txtHash = '';
    var hashVal;
    var rawHash = Utilities.computeDigest(
        Utilities.DigestAlgorithm.MD5,
        input,
        Utilities.Charset.UTF_8 // Multibyte encoding env compatibility
    );
    if ( ! isShortMode ) {
        // Full mode: each signed byte (-128..127) becomes two hex digits.
        for ( var i = 0; i < rawHash.length; i++ ) {
            hashVal = rawHash[i];
            if ( hashVal < 0 ) {
                hashVal += 256;
            }
            if ( hashVal.toString( 16 ).length == 1 ) {
                txtHash += '0';
            }
            txtHash += hashVal.toString( 16 );
        }
    } else {
        // Short mode: fold each 8-byte half of the digest into two base-36 digits.
        for ( var j = 0; j < 16; j += 8 ) {
            hashVal = ( rawHash[j] + rawHash[j+1] + rawHash[j+2] + rawHash[j+3] )
                ^ ( rawHash[j+4] + rawHash[j+5] + rawHash[j+6] + rawHash[j+7] );
            if ( hashVal < 0 ) {
                hashVal += 1024;
            }
            if ( hashVal.toString( 36 ).length == 1 ) {
                txtHash += '0';
            }
            txtHash += hashVal.toString( 36 );
        }
    }
    // change below to "txtHash.toUpperCase()" if needed
    return txtHash;
}
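
The two encoding steps inside MD5() can be looked at in isolation. The sketch below is not part of the gist: `bytesToHex` and `bytesToShortHash` are hypothetical helper names, and both assume a 16-element array of signed bytes (-128..127), which is what the code above handles when it adds 256 to negative values.

```javascript
// Hypothetical helpers (not part of the gist) mirroring the two branches of MD5().

// Full mode: signed bytes (-128..127) to a lower-case hex string.
function bytesToHex(bytes) {
  var hex = '';
  for (var i = 0; i < bytes.length; i++) {
    // Masking with 0xff maps negative values to 0..255 (same effect as adding 256).
    var value = bytes[i] & 0xff;
    // Pad to two digits so that 10 becomes "0a" rather than "a".
    hex += (value < 16 ? '0' : '') + value.toString(16);
  }
  return hex;
}

// Short mode: each 8-byte half is folded into one number in 0..1023
// (sum of four bytes XOR sum of the next four) and written as two base-36 digits.
function bytesToShortHash(digest) {
  var out = '';
  for (var j = 0; j < 16; j += 8) {
    var folded = (digest[j] + digest[j + 1] + digest[j + 2] + digest[j + 3])
               ^ (digest[j + 4] + digest[j + 5] + digest[j + 6] + digest[j + 7]);
    if (folded < 0) {
      folded += 1024;
    }
    var digits = folded.toString(36);
    out += (digits.length === 1 ? '0' : '') + digits;
  }
  return out;
}
```

Because each half can take at most 1024 values, the short hash has at most 1024 × 1024 distinct outputs, which is why collisions are far more likely than with the full 128-bit hash.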
@harupong

@KEINOS
Thank you so much for sparing your time debugging and updating the script!! My apology for not giving an example. Glad you could test it w/ Japanese characters.

One small nitpick: you might wanna update the URL on L36 to https://gist.github.com/KEINOS/78cc23f37e55e848905fc4224483763d, as the current one is directing to an old fork.

@KEINOS

KEINOS commented Nov 25, 2022

@harupong
> current one is directing to an old fork.

Oops! Thank you! Updated! 👍

@knwpsk

knwpsk commented Aug 8, 2023

Trying to use this to get a unique signature for each file in my Google Drive. It's returning the same signature every time, for different files.

fileList = DriveApp.getFiles();
while (fileList.hasNext()) {
  nextFile = fileList.next();
  myBlob = nextFile.getBlob;
  mySignature = MD5(myBlob,false);
  console.log(nextFile.getId(), mySignature);
}

Result:
File ID 1saEqsCkZV9xFEQu36ppw7E-cvK27OKKwtFp58Jxo957AvaNr4ljn hash: 664bf1381332339527e743b02104f0e0
File ID 1t7snIRMWuZYflbYiVCPAPK1jwbLc_kvTJLIEUX0 hash: 664bf1381332339527e743b02104f0e0
File ID 1beIlkskAUNfH2Ol4QJA9ACIdnDHH_WrSIEAvzc hash: 664bf1381332339527e743b02104f0e0
etc

any help?

@oshliaer

oshliaer commented Aug 9, 2023

@knwpsk says:

> Trying to use this to get a unique signature for each file in my Google Drive. It's returning the same signature every time, for different files.
>
> any help?

You have to pass the blob's bytes instead:

const myBlob = nextFile.getBlob().getBytes()

@knwpsk

knwpsk commented Aug 9, 2023

@contributorpw thanks for helping. This isn't working for me yet.
When I try it your way I get an error on that statement:
Exception: Converting from application/vnd.google-apps.script to application/pdf is not supported.

(edit) On further investigation, this error seems to happen only on some files but not others. In fact it happens when the script encounters a Google Apps Script file in my Gdrive. If I skip over that file programmatically, then the script continues ok.
Can you help me understand why?

Finally -- is getBytes an expensive operation? It seems to take a long time for some (large) files. For example, my script tried to getBytes for a large Google Sheets file and that took 16 seconds, and it produced an array of something like 50,000 rows. Then the MD5 function timed out on it.

(edit) As a workaround for the GAS files, I found I can get the file's MIME type and if it's "application/vnd.google-apps.script" then I just skip processing it for now.
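
The skip-by-MIME-type workaround described above could look roughly like the sketch below. `collectSignatures` and `SKIPPED_MIME_TYPES` are hypothetical names, not part of the gist. The file iterator and the hash function are passed in as parameters, on the assumption that in Apps Script they would be `DriveApp.getFiles()` and the `MD5` function from this gist.

```javascript
// Sketch only; collectSignatures and SKIPPED_MIME_TYPES are hypothetical names.
// MIME types for which getBlob() was reported to throw in this thread.
var SKIPPED_MIME_TYPES = ['application/vnd.google-apps.script'];

// fileIterator: object with hasNext()/next(), e.g. DriveApp.getFiles() (assumption).
// hashFn: function taking a byte array and returning a string, e.g. MD5 (assumption).
function collectSignatures(fileIterator, hashFn) {
  var results = [];
  while (fileIterator.hasNext()) {
    var file = fileIterator.next();
    // Check the MIME type first so getBlob() is never called on a skipped file.
    if (SKIPPED_MIME_TYPES.indexOf(file.getMimeType()) !== -1) {
      continue;
    }
    var bytes = file.getBlob().getBytes();
    results.push({ id: file.getId(), signature: hashFn(bytes) });
  }
  return results;
}
```

In Apps Script the call might be `collectSignatures(DriveApp.getFiles(), function (bytes) { return MD5(bytes, false); })`. This does nothing about the slow getBytes() calls on large files; it only avoids the conversion exception.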

@KEINOS

KEINOS commented Aug 11, 2023

@knwpsk

> Trying to use this to get a unique signature for each file in my Google Drive.

Since the returned signatures (hash values) are all the same (664bf1381332339527e743b02104f0e0), I assume that myBlob = nextFile.getBlob; is not returning a string or a byte array. (A pointer to an object, maybe?)

How about trying the below?

  fileList = DriveApp.getFiles();
  while (fileList.hasNext()) {
    nextFile = fileList.next();
-   myBlob = nextFile.getBlob;
-   mySignature = MD5(myBlob,false);
+   mySignature = MD5(nextFile.getId(),false);

    console.log(nextFile.getId(), mySignature);
  }

@knwpsk

knwpsk commented Aug 11, 2023

@KEINOS thank you.

Your method would produce an MD5 of the Google Drive ID for the file, but not for the file content. (If I had two identical copies of a file in google drive, they would have different IDs, and get different MD5 values. This is the opposite of what I'm trying to achieve.)

I didn't really spell out my end goal (now I realize). I want to get a "signature" of the file contents, so that I can compare and see if I have duplicate files in my Drive. Presumably if I get the MD5 (or another hash or signature of a file), I can compare and see if any other file has the same content, and identify it as a duplicate.

(Why not use the filename? Because there are many apps/devices that use the same naming conventions, such as IMG0001.jpg on my camera.... and those files aren't dupes.)

I'm now tinkering with using the file's size in bytes, along with the file name, as a way to get a multi-factor unique/dupe indicator. It's less than perfect, but better than just filename alone.
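
A minimal sketch of that name-plus-size idea, assuming the file metadata has already been collected into plain records. `findDuplicateCandidates` and the `{ id, name, size }` record shape are hypothetical; in Apps Script the values could presumably come from `file.getId()`, `file.getName()` and `file.getSize()`.

```javascript
// Hypothetical helper: groups records by a "name|size" key and returns only
// the groups with more than one member. These are duplicate candidates, not
// proven duplicates: equal name and size does not guarantee equal content.
// Each record is assumed to look like { id: string, name: string, size: number }.
function findDuplicateCandidates(records) {
  var groups = {};
  records.forEach(function (record) {
    var key = record.name + '|' + record.size;
    if (!groups[key]) {
      groups[key] = [];
    }
    groups[key].push(record.id);
  });
  return Object.keys(groups)
    .filter(function (key) { return groups[key].length > 1; })
    .map(function (key) { return { key: key, ids: groups[key] }; });
}
```

The same grouping would work with a content hash as the key once a fast enough way to compute one is available.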

If anyone knows a reasonably fast/efficient way to get a hash/signature of a file's contents, please lemme know.
Thanks!

@KEINOS

KEINOS commented Aug 28, 2023

@knwpsk

> I want to get a "signature" of the file contents, so that I can compare and see if I have duplicate files in my Drive.

Got it! Somewhat like CAS (content-addressable storage), isn't it?

How about calling the getBytes() method on the blob object returned by getBlob()? (Not tested, though.)

var fileList = DriveApp.getFiles();
while (fileList.hasNext()) {
  var nextFile = fileList.next();
  var myBlob = nextFile.getBlob();
  var myBytes = myBlob.getBytes();
  // Or in one step: var myBytes = nextFile.getBlob().getBytes();
  var mySignature = MD5(myBytes, false);
  console.log(nextFile.getId(), mySignature);
}

@KEINOS

KEINOS commented Aug 28, 2023

> Finally -- is getBytes an expensive operation? It seems to take a long time for some (large) files. For example, my script tried to getBytes for a large Google Sheets file and that took 16 seconds, and it produced an array of something like 50,000 rows. Then the MD5 function timed out on it.
>
> (edit) As a workaround for the GAS files, I found I can get the file's MIME type and if it's "application/vnd.google-apps.script" then I just skip processing it for now.

Oops, you've already tried it. Sorry.

As you mentioned, I assume getBytes is expensive or has some kind of limitation that stretches the response time, hence the timeout.

I wonder why they don't implement a hash function in the Blob class, since it would be useful for de-duplicating files for machine learning.
