Skip to content

Instantly share code, notes, and snippets.

@northwestcoder
Last active September 23, 2020 15:00
Show Gist options
  • Save northwestcoder/935003fa48706a5c7149ee36ceb1a41b to your computer and use it in GitHub Desktop.
Save northwestcoder/935003fa48706a5c7149ee36ceb1a41b to your computer and use it in GitHub Desktop.
Apps Script: Find duplicate images using generated MD5 Digest/Checksum
/*
OVERVIEW:
Apps Script which, when triggered on a Google Sheet row,
finds the image content referenced and
creates a checksum/digest string for that content and
inserts that string back into the Google Sheet
BACKGROUND:
https://stackoverflow.com/questions/16216868/get-back-a-string-representation-from-computedigestalgorithm-value-byte
https://developers.google.com/apps-script/articles/removing_duplicates
REQUIREMENTS:
1. The Appsheet app which is a companion to this script, "copy and customize" this app:
https://www.appsheet.com/samples/Uses-apps-script-to-identity-duplicate-images-uploaded-via-Appsheet?appGuidString=1aa9850d-7e49-41ca-ba4b-851bdd6a4747
2. This script and a valid G Suite account in which to run it
3. Once you copy the app above, go find the Google Sheet used by the app, then go to TOOLS menu and choose script editor
4. paste in this script.
5. change the "folder" ID shown below to your Appsheet images folder, if you copied the sample app the folder name is "Google Doc_Images"
6. Enable a trigger for this script, set to trigger off of the Google Sheet for any "onChange" event.
7. Test.
We currently only tested this on image uploads, and have not tested on file uploads (pdf's etc).
USAGE:
a) when you upload an image using the Appsheet app, this script will insert both the Google ID as well as a Checksum of the image file back into the Google Sheet.
b) The premise is that you would then have a second script which runs periodically and looks for
duplicate images (using the new checksum column and info) and then "do something" with this info.
E.g. here is a tutorial on removing "duplicates" from a Google Sheet:
https://developers.google.com/apps-script/articles/removing_duplicates
*/
function main(e) {
// we hardwire the image uploads directory because:
// A) appsheet could have lots of duplicate apps and we don't want to search by name across all of those
// B) helps performance for folder.getFilesByName() call
// CHANGE the drive ID in this next line:
var folder = DriveApp.getFolderById("CHANGE_THIS_TO_YOUR_IMAGE_FOLDER_DRIVE_ID");
// If you copied the sample Appsheet app, nothing below needs to change
// If you are hacking around, you may need to change the Google Sheet column numbers referenced below.
var theSource = e.source;
var theSheet = theSource.getActiveSheet();
var theActiveRange = theSheet.getActiveRange();
var theActiveRow = theActiveRange.getRow();
// the number "2" in the next line is column B in our spreadsheet and represents the image that was uploaded by appsheet
var theCellvalue = theSheet.getRange(theActiveRow,2).getValue();
var localFileName = folder.getFilesByName(theCellvalue.split("/")[1]).next();
var theFileID = getFileByName(localFileName);
var fileDigestMD5 = getFileChecksum(theFileID.id);
theSheet.getRange(theActiveRow,6).setValue(fileDigestMD5);
if(theFileID.id === false){ // If file cannot be accurately found.
Logger.log(theFileID.error); // Alert or log error. Give option to try another FileName
} else {
// If the file ID exists then proceed
// the number "5" in the next line is column "E" where we will store the Google Doc ID
theSheet.getRange(theActiveRow,5).setValue(theFileID.id);
}
}
// helper function to create a Checksum or Digest of an image file
// inspiration from https://stackoverflow.com/questions/16216868/get-back-a-string-representation-from-computedigestalgorithm-value-byte
function getFileChecksum(FileID){
var tempFile = DriveApp.getFileById(FileID);
var tempBlob = tempFile.getBlob();
var tempText = tempBlob.getBytes();
var signature = Utilities.computeDigest(
Utilities.DigestAlgorithm.MD5,
tempText.join(),
Utilities.Charset.UTF_8);
var signatureStr = '';
for (i = 0; i < signature.length; i++) {
var byte = signature[i];
if (byte < 0)
byte += 256;
var byteStr = byte.toString(16);
// Ensure we have 2 chars in our byte, pad with 0
if (byteStr.length == 1) byteStr = '0'+byteStr;
signatureStr += byteStr;
}
return signatureStr;
}
// helper function 'getFileByName'
// example lifted from the google app script community and cleaned up here.
// finding ID's by filename can result in more than one file found, this function tries to deal with that.
// but in the AppSheet world it's nearly impossible for an
// uploaded image or file to end up with multiple parent folders,
// so a lot of this code is probably not really needed... until it is.
function getFileByName(fileName, fileInFolder){
var filecount = 0;
var dupFileArray = [];
var folderID = "";
var files = DriveApp.getFilesByName(fileName);
while(files.hasNext()){
var file = files.next();
dupFileArray.push(file.getId());
filecount++;
};
if(filecount > 1){
if(typeof fileInFolder === 'undefined'){
folderID = {"id":false,"error":"More than one file with name: "+fileName+". \nTry adding the file's folder name as a reference in Argument 2 of this function."}
}else{
//iterate through list of files with the same name
for(fl = 0; fl < dupFileArray.length; fl++){
var activeFile = DriveApp.getFileById(dupFileArray[fl]);
var folders = activeFile.getParents();
var folder = ""
var foldercount = 0;
//Get the folder name for each file
while(folders.hasNext()){
folder = folders.next().getName();
foldercount++;
};
if(folder === fileInFolder && foldercount > 1){
folderID = {"id":false,"error":"There is more than one parent folder: "+fileInFolder+" for file "+fileName}
};
if(folder === fileInFolder){
folderID = {"id":dupFileArray[fl],"error":false};
}else{
folderID = {"id":false,"error":"There are multiple files named: "+fileName+". \nBut none of them are in folder, "+fileInFolder}
};
};
};
}else if(filecount === 0){
folderID = {"id":false,"error":"No file in your drive exists with name: "+fileName};
}else{ //IF there is only 1 file with fileName
folderID = {"id":dupFileArray[0],"error":false};
};
return folderID;
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment