Skip to content

Instantly share code, notes, and snippets.

@KEINOS
Last active March 20, 2024 04:05
Show Gist options
  • Star 15 You must be signed in to star a gist
  • Fork 6 You must be signed in to fork a gist
  • Save KEINOS/78cc23f37e55e848905fc4224483763d to your computer and use it in GitHub Desktop.
Save KEINOS/78cc23f37e55e848905fc4224483763d to your computer and use it in GitHub Desktop.
GAS(Google Apps Script) user function to get MD5 hash or 4digit shortened hash for Multibyte(UTF-8, 2bytes character) environment.
/**
* ------------------------------------------
* MD5 function for GAS(GoogleAppsScript)
*
* You can get a MD5 hash value and even a 4digit short Hash value of a string.
* ------------------------------------------
* Usage1:
* `=MD5("YourStringToHash")`
* or
* `=MD5( A1 )`
* to use the A1 cell value as the argument of MD5.
*
* result:
* `FCE7453B7462D9DE0C56AFCCFB756193`
*
* For your sure-ness you can verify it locally in your terminal as below.
* `$ md5 -s "YourStringToHash"`
*
* Usage2:
* `=MD5("YourStringToHash", true)` for short Hash
*
* result:
* `6MQH`
* Note that it has more conflict probability.
*
* How to install:
* Copy the scipt, pase it at [Extensions]-[Apps Script]-[Editor]-[<YourProject>.gs]
* or go to https://script.google.com and paste it.
* For more details go:
* https://developers.google.com/apps-script/articles/
*
* License: WTFPL (But mentioning the URL to the latest version is recommended)
*
* Version: 1.1.0.2022-11-24
* Latest version:
* https://gist.github.com/KEINOS/78cc23f37e55e848905fc4224483763d
*
* Author/Collaborator/Contributor:
* KEINOS @ https://github.com/keinos
* Alex Ivanov @ https://github.com/contributorpw
* Curtis Doty @ https://github.com/dotysan
* Haruo Nakayama @ https://github.com/harupong
*
* References and thanks to:
* https://stackoverflow.com/questions/7994410/hash-of-a-cell-text-in-google-spreadsheet
* https://gist.github.com/KEINOS/78cc23f37e55e848905fc4224483763d#gistcomment-3129967
* https://gist.github.com/dotysan/36b99217fdc958465b62f84f66903f07
* https://developers.google.com/apps-script/reference/utilities/utilities#computedigestalgorithm-value
* https://cloud.google.com/dataprep/docs/html/Logical-Operators_57344671
* https://gist.github.com/KEINOS/78cc23f37e55e848905fc4224483763d#gistcomment-3441818
* ------------------------------------------
*
* @param {(string|Bytes[])} input The value to hash.
* @param {boolean} isShortMode Set true for 4 digit shortend hash, else returns usual MD5 hash.
* @return {string} The hashed input value.
* @customfunction
*/
function MD5( input, isShortMode )
{
var isShortMode = !!isShortMode; // Ensure to be bool for undefined type
var txtHash = '';
var rawHash = Utilities.computeDigest(
Utilities.DigestAlgorithm.MD5,
input,
Utilities.Charset.UTF_8 // Multibyte encoding env compatibility
);
if ( ! isShortMode ) {
for ( i = 0; i < rawHash.length; i++ ) {
var hashVal = rawHash[i];
if ( hashVal < 0 ) {
hashVal += 256;
};
if ( hashVal.toString( 16 ).length == 1 ) {
txtHash += '0';
};
txtHash += hashVal.toString( 16 );
};
} else {
for ( j = 0; j < 16; j += 8 ) {
hashVal = ( rawHash[j] + rawHash[j+1] + rawHash[j+2] + rawHash[j+3] )
^ ( rawHash[j+4] + rawHash[j+5] + rawHash[j+6] + rawHash[j+7] );
if ( hashVal < 0 ) {
hashVal += 1024;
};
if ( hashVal.toString( 36 ).length == 1 ) {
txtHash += "0";
};
txtHash += hashVal.toString( 36 );
};
};
// change below to "txtHash.toUpperCase()" if needed
return txtHash;
}
@oshliaer
Copy link

oshliaer commented Jan 7, 2020

Does var isShortMode = !!isShortMode; work different?

@KEINOS
Copy link
Author

KEINOS commented Jan 8, 2020

@schoraria911

Thanks! Glad that it helped you!

@contributorpw

work different?

You're right. I guess it works the same. 😳 I think I just wanted to be sure at that time ...

@oshliaer
Copy link

As for me it's a cool function!

@KEINOS
Copy link
Author

KEINOS commented Jan 26, 2020

@contributorpw

thanks! ☺️

@dotysan
Copy link

dotysan commented Apr 18, 2020

It fails to calculate md5sum for binary data. Since Utilities.computeDigest() already can handle both String and Bytes[] input, don't hard-code String!

@dotysan
Copy link

dotysan commented Apr 18, 2020

@KEINOS
Copy link
Author

KEINOS commented Apr 19, 2020

@dotysan

Utilities.computeDigest() already can handle both String and Bytes[] input

Indeed.

It seems that I was too afraid of the other encodings in Japan, such as JIS, SJIS, SJIS-WIN, EUC, ASCII.
But I realized that GAS works with UTF-8. Converting the input string encoding to UTF-8 should be the user side matter. Thanks!

I updated the script a little after all these years.

@rohanchutke
Copy link

Can you give me the algorithm to decrypt it back?

@KEINOS
Copy link
Author

KEINOS commented Apr 24, 2020

@rohanchutke

Can you give me the algorithm to decrypt it back?

I would like to, but unfortunately, I can't ... sorry.

If I could, I might be nominated somehow in the encryption world or be in the breaking world news!!

Note that "hash" is not encryption nor compression. So there's no decryption or decompression as well. You may have a value that collides by calcutlation though.

One of the characteristics of the hash function is that it can not be back-calculated the original value. And another characteristic is that if you provide the same value you get the same result.

This wiki page might help you some.

@dotysan
Copy link

dotysan commented Aug 13, 2020

@KEINOS
Copy link
Author

KEINOS commented Aug 17, 2020

@Hash7ag asks about license

@dotysan @contributorpw

MIT? Maybe? Is that OK for you two?
Personally, I don't care if it's WTFPL but I would like to respect the contribution.

@harupong
Copy link

harupong commented Sep 4, 2020

Does Utilities.computeDigest require another argument Utilities.Charset.UTF_8, in order to properly handle multibyte(UTF-8, 2bytes character) environment?
I couldn't get Japanese multibyte letters processed properly with your function.

@choraria
Copy link

@KEINOS — I'm using this function as part of the Custom Functions add-on for Google Sheets which is purely based on open-source content and attributed it like so (source file) —

Screenshot 2022-11-24 at 00 49 18

Wasn't sure about how to handle licensing side of things but let me know in case this works or if you'd want me to change something.

@KEINOS
Copy link
Author

KEINOS commented Nov 24, 2022

@harupong
I couldn't get Japanese multibyte letters processed properly with your function.

Sorry for the delay. Indeed. I confirm that it doesn't work with multibyte chars. Thank you for the report.

2022-11-24 9 05 25

After some research, I found that it was a bug of Utilities.computeDigest at that time. And now computeDigest method has an additional parameter to set the character encoding.

function PureMD5(input) {
  var txtHash = ''
  rawHash = Utilities.computeDigest(
                      Utilities.DigestAlgorithm.MD5,
                      input,
+                     Utilities.Charset.UTF_8
            )
  for ( i = 0; i < rawHash.length; i++ ) {
            var hashVal = rawHash[i];

            if ( hashVal < 0 ) {
                hashVal += 256;
            };
            if ( hashVal.toString( 16 ).length == 1 ) {
                txtHash += '0';
            };
            txtHash += hashVal.toString( 16 )};
  return txtHash
}

@KEINOS
Copy link
Author

KEINOS commented Nov 24, 2022

@choraria
Wasn't sure about how to handle licensing side of things but let me know in case this works or if you'd want me to change something.

No problem at all! 👍 Thank you for your respect! ❤️

@KEINOS
Copy link
Author

KEINOS commented Nov 24, 2022

Fixed

スクリーンショット 2022-11-24 9 43 45

@harupong
Copy link

@KEINOS
Thank you so much for sparing your time debugging and updating the script!! My apology for not giving an example. Glad you could test it w/ Japanese characters.

One small nitpick: you might wanna update the URL on L36 to https://gist.github.com/KEINOS/78cc23f37e55e848905fc4224483763d, as the current one is directing to an old fork.

@KEINOS
Copy link
Author

KEINOS commented Nov 25, 2022

@harupong
current one is directing to an old fork.

Oops! Thank you! Updated! 👍

@knwpsk
Copy link

knwpsk commented Aug 8, 2023

Trying to use this to get a unique signature for each file in my Google Drive. It's returning the same signature every time, for different files.

fileList = DriveApp.getFiles();
while fileList.hasNext() {
nextFile = fileList.next();
myBlob = nextFile.getBlob;
mySignature = MD5(myBlob,false);
console.log(myFile.getId(); mySignature);
}

Result:
File ID 1saEqsCkZV9xFEQu36ppw7E-cvK27OKKwtFp58Jxo957AvaNr4ljn hash: 664bf1381332339527e743b02104f0e0
File ID 1t7snIRMWuZYflbYiVCPAPK1jwbLc_kvTJLIEUX0 hash: 664bf1381332339527e743b02104f0e0
File ID 1beIlkskAUNfH2Ol4QJA9ACIdnDHH_WrSIEAvzc hash: 664bf1381332339527e743b02104f0e0
etc

any help?

@oshliaer
Copy link

oshliaer commented Aug 9, 2023

@knwpsk says:

Trying to use this to get a unique signature for each file in my Google Drive. It's returning the same signature every time, for different files.

any help?

You have to get bytes instead

const myBlob = nextFile.getBlob().getBytes()

@knwpsk
Copy link

knwpsk commented Aug 9, 2023

@contributorpw thanks for helping. This isn't working for me yet.
When I try it your way I get an error on that statement:
Exception: Converting from application/vnd.google-apps.script to application/pdf is not supported.

(edit) On further investigation, this error seems to happen only on some files but not others. In fact it happens when the script encounters a Google Apps Script file in my Gdrive. If I skip over that file programmatically, then the script continues ok.
Can you help me understand why?

Finally -- is getBytes an expensive operation? It seems to take a long time for some (large) files. For example, my script tried to getBytes for a large Google Sheets file and that took 16 seconds, and it produced an array of something like 50,000 rows. Then the MD5 function timed out on it.

(edit) As a workaround for the GAS files, I found I can get the file's MIME type and if it's "application/vnd.google-apps.script" then I just skip processing it for now.

@KEINOS
Copy link
Author

KEINOS commented Aug 11, 2023

@knwpsk

Trying to use this to get a unique signature for each file in my Google Drive.

Since the returned signature (hash value) are all the same (664bf1381332339527e743b02104f0e0), I assume that myBlob = nextFile.getBlob; is not returning a string or a byte array. (Pointer to an object, may be?)

How about trying the below?

  fileList = DriveApp.getFiles();
  while fileList.hasNext() {
    nextFile = fileList.next();
-   myBlob = nextFile.getBlob;
-   mySignature = MD5(myBlob,false);
+   mySignature = MD5(nextFile.getId(),false);

    console.log(nextFile.getId(); mySignature);
  }

@knwpsk
Copy link

knwpsk commented Aug 11, 2023

@KEINOS thank you.

Your method would produce an MD5 of the Google Drive ID for the file, but not for the file content. (If I had two identical copies of a file in google drive, they would have different IDs, and get different MD5 values. This is the opposite of what I'm trying to achieve.)

I didn't really spell out my end goal (now I realize). I want to get a "signature" of the file contents, so that I can compare and see if I have duplicate files in my Drive. Presumably if I get the MD5 (or another hash or signature of a file), I can compare and see if any other file has the same content, and identify it as a duplicate.

(Why not use the filename? Because there are many apps/devices that use the same naming conventions, such as IMG0001.jpg on my camera.... and those files aren't dupes.)

I'm now tinkering with using the file's size in bytes, along with the file name, as a way to get a multi-factor unique/dupe indicator. It's less than perfect, but better than just filename alone.

If anyone knows a reasonably fast/efficient way to get a hash/signature of a file's contents, please lemme know.
Thanks!

@KEINOS
Copy link
Author

KEINOS commented Aug 28, 2023

@knwpsk

I want to get a "signature" of the file contents, so that I can compare and see if I have duplicate files in my Drive.

Got it! Some what like CAS, isn't it?

How about using getBytes() method from the getBlob()'s blob object? (not tested though)

fileList = DriveApp.getFiles();
while fileList.hasNext() {
  nextFile = fileList.next();
  myBlob = nextFile.getBlob();
  myBytes = myBlob.getBytes();
  // myBytes = fileList.next().getBlob().getBytes()
  mySignature = MD5(myBytes,false);
  console.log(myFile.getId(); mySignature);
}

@KEINOS
Copy link
Author

KEINOS commented Aug 28, 2023

Finally -- is getBytes an expensive operation? It seems to take a long time for some (large) files. For example, my script tried to getBytes for a large Google Sheets file and that took 16 seconds, and it produced an array of something like 50,000 rows. Then the MD5 function timed out on it.

(edit) As a workaround for the GAS files, I found I can get the file's MIME type and if it's "application/vnd.google-apps.script" then I just skip processing it for now.

Oops you've already tried it sorry.

As you mentioned, I assume getBytes is expensive or has some kind of limitation and stretches the response time, thus time out.

I wonder why they don't implement a hash function to the blob class since it is useful to de-dup files for machine learning.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment