@sneakers-the-rat
Last active February 18, 2022 15:54
Elsevier PDF "hashes"
[
"FCi27mtaKod38ztmGndn-y8NNz.r.lt6SndqGztz_ztr-ngqQm9aMo9eOnMeJntuNntu",
"D2ei2mgqJz9b-m.mGmPqRyLNNnwmOlt7.ywiGmt-Kndr9otqRywv8o9ePmtiNmd2Sn92Tma",
"6U7vcmPuOn9uLnMaGyM7-nLNNntv9lt6RmtaGmweOyMmJnMmSmgmOo9eOnM6LnMaRmM-Tma",
"lXLf8owyQztiMzwqGnMz7zcNNotb7lwf.m9qGzt6Km.qMngqLndqLo9eOotaNm96Mmt6Tma",
"FCi27y9qOnd-Ny96GmPmOmcNNzwf-lwj-m9mGztz7ytaMnM78n9v-o9ePmM6Rm9-Qn9eTma",
"XlEDumMz7nM7-m9iGogmRmLNNyt_8lwiKz9eGm9-Pm.v7ztiLztz_o9eOnMeQnd-Sodm",
"lXLf8yt-JywmNmPeGm9n9n8NNzgn.lt_8zwqGogz7zgn7zt6SyPr-o9eOnM6Pot2Mn9qTma",
"FCi27zgf8mdqMmMeGnMmMy8NNz9eQlweNy.eGmMiMm96Qmgr9nMb-o9ePmtuRmt6JotmTma",
"FCi27nwmKnMeSodeGm.z.y8NNntz.lt-PywmGy9__ngqQmtiPmtb7o9ePmteJotyJoduTma",
"HIoniz.qOnd-Nmt-GmteNn8NNot7.lt-QndaGnPv.mdaMmt6RnMqMo9ePmdmOmdiKod-Tma",
"ZtV1wntuPyPn9z.qGyPv7msNNytz7lwiKyM6GntmJnt_-nteRm.mRo9eOnM6Pot2MnMyTma",
"d2UUdywiJmtz7zt-Gm9eQmcNNzt2Qlwf7m9uGzd_7zdf7owr9yMqOo9ePmtaKnM2NmduTma",
"tprDsnMeJn9iOnweGnPuQnsNNz.eMlt-Qm.mGotz.ytiNz.yRmd-Mo9eOnM6Pot2OmM6Tma",
"tprDsyPiNn9iQn9-GmMiSy8NNn96Llwf9owiGowqQyMiRzwv_ngqPo9eOnM6Pot2OndyTma",
"ZIFNOztmRotn9owiGzduNmsNNnd-Rlt_8otiGot-Oy92QnMeSyMqKo9eOnM6Pot2OntaTma",
"D2ei2nMb_zwmSowyGzwv8mLNNotj8lt-My9yGmtaModaNm92RytySo9ePmtaKn92Qmt2Tma",
"d2UUdot__owr-y9mGodqLocNNn.eOlwmPmtaGmgj7ndn_nMiMndiNo9ePmdiLnMmPotmTmq",
"6U7vcmtuSndmSntqGmdiMy8NNnPz7lt_7ndeGmtv7n9eLndj_zduJo9ePmtiOntmNntmTma",
"ZtV1wn9mMnd2MzwiGz9eRysNNmgySlt7_ot-Gy97.mgiKotqKnt_.o9eOnM6Pot2Mn96Tma",
"XlEDuyweNmtz9ntqGm9aMocNNodr9lt__z9iGmdj_n9yNnt6Sm9-Lo9ePmd6KotmRnM2Tma",
"HIonintn-z9uPogmGnMeSzsNNogf-lwj.z.qGmgqSn9yPndf7mdmLo9eOotuLm9aNodqTma",
"ZlkjsyMj7mPr.ndiGowuMmcNNy.mNlwj9m.yGmtb7z.qRz.iKyt38o9eOnM6Pot2MnMeTma",
"Dpairmdj9mPr8nwmGn.r7z8NNnMb7lwj8otiGyt-MzwuKzd__nt39o9ePmtaPotaJm9-Tma",
"6mIUqngiNzduNn9iGmgeJnsNNot2Rlt-SzguGzt2Oodf_n.eNodz.o9eOn9mQnMqOm9e",
"FCi27mwr_mPn-m.mGmPuKncNNmduOlweOytuGogj.yMv-z92Pyt6Mo9eOnM6Pot2Mn9yTma",
"6U7vcngj-zt2Ln.uGodr8mcNNmdeSlweKmd2Gzdz9nM3_mgf7yt2Ro9ePmt6Sn9qLntyTma",
"zjJBNmPn.mdiRntiGzgmPnLNNmM2Klt6JmMqGy9aNz9aMmdv_mwuNo9ePm96Qm9iRndiTma",
"FCi27mPmRnPiKngeGngqJzcNNogj8lwj-zwiGnPiLmtb7y9qKzgeMo9eOnMeLn9aNm9m"
]
import exiftool
from pathlib import Path
import json
import re

# Directory that is searched recursively for PDFs
paper_root = Path.home() / 'location/of/papers'

hashes = []
get_n = 100  # not used below
processed = 0
# An XMP tag whose name is 40+ characters from this alphabet, e.g. <FCi27.../>
rehash = re.compile(r'<([0-9A-Za-z_.-]{40,})/>')

try:
    with exiftool.ExifTool() as et:
        for path in paper_root.glob('**/*.pdf'):
            # Dump the raw XMP packet; execute() is used here with bytes in, bytes out
            md = et.execute(b'-b', b'-xmp', str(path).encode('utf-8'))
            try:
                md = md.decode('utf-8')
            except UnicodeDecodeError:
                print(f"Couldn't decode {path}")
                continue
            ahash = rehash.findall(md)
            hashes.extend(ahash)
            if len(ahash) > 0:
                processed += 1
finally:
    # Write whatever was collected, even if the scan is interrupted
    with open('elsev_hashes.json', 'w') as hashfile:
        json.dump(hashes, hashfile, indent=2)
    print(f'processed {processed} files')
@sneakers-the-rat
Author

Updated after https://twitter.com/horsemankukka/status/1486268962119761924?s=20 let me know that the tags were being parsed incorrectly. Rescanned and found a few more. Also attaching the v simple code so you can check my work.

@cbandy

cbandy commented Jan 27, 2022

The few I downloaded from open access were visible to grep; usually toward the end of the file in an XML stream:

grep -Ena '<[^/]{50,}/>' *.pdf

A variation on https://twitter.com/Jofkos/status/1486244612960366593.
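
For anyone without grep to hand, here is a rough Python sketch of the same raw-byte scan. It assumes, as in the open access files above, that the XML stream is stored uncompressed in the PDF. The function names `find_tags` and `scan_directory` are made up for illustration, and excluding `<`, `>` and whitespace inside the tag is my own tightening of the pattern, not part of the grep above.

```python
from __future__ import annotations

import re
from pathlib import Path

# Same idea as the grep pattern: '<', then 50+ bytes that are not '/', then '/>'.
# Also excluding '<', '>' and whitespace is an added assumption to cut noise.
TAG = re.compile(rb'<([^/<>\s]{50,})/>')


def find_tags(data: bytes) -> list[str]:
    """Return candidate tag names found in raw PDF bytes."""
    return [m.decode('ascii', errors='replace') for m in TAG.findall(data)]


def scan_directory(root: Path) -> dict[str, list[str]]:
    """Map each PDF under root (searched recursively) to its candidate tags."""
    results = {}
    for path in sorted(root.glob('**/*.pdf')):
        tags = find_tags(path.read_bytes())
        if tags:
            results[str(path)] = tags
    return results
```

Calling `scan_directory(Path('.'))` in a folder of downloaded PDFs should give roughly what the grep prints, minus the line numbers.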

@Aariq
Copy link

Aariq commented Jan 28, 2022

Some more examples here with associated DOIs: https://gist.github.com/Aariq/a23958e168e347f1bacf9dfa777b911f

@rgrunbla

rgrunbla commented Jan 30, 2022

I managed to get very similar hashes for the same paper (https://doi.org/10.1016/j.ijhydene.2021.11.149):

lXLf8 ndj8y.uMn9q G yPn8m8 NN ogiM l t-SyPu G y.z8zwf8zgiNmMqM o9e PndmNn9iMoti Tma
FCi27 ndj8y.uMn9q G yPn8m8 NN ogiM l t-SyPu G y.z8zwf8zgiNmMqM o9e PndmNn9iMot2 Tma
LMfns mgeLmPf7zgm G y.iJns NN mPuQ l wf.ogm G nduLot2Mz9v9otr7 o9e PndmNn9iNmdq Tma
w8arl mgeLmPf7zgm G y.iJns NN mPuQ l wf.ogm G nduLot2Mz9v9otr7 o9e PndmNn9iNmd- Tma

I put some spaces in the hashes, because I think there are some patterns at such positions.
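
To make the comparison less eyeball-dependent, here is a small sketch that splits the four spaced hashes above on whitespace and reports which fields differ. The field boundaries are just the spaces I inserted by hand, so they are a guess about the structure, and the helper names (`fields`, `differing_fields`, `constant_fields`) are made up for this example.

```python
# The four hashes from above, with the hand-placed spaces kept as field separators.
SPACED = [
    "lXLf8 ndj8y.uMn9q G yPn8m8 NN ogiM l t-SyPu G y.z8zwf8zgiNmMqM o9e PndmNn9iMoti Tma",
    "FCi27 ndj8y.uMn9q G yPn8m8 NN ogiM l t-SyPu G y.z8zwf8zgiNmMqM o9e PndmNn9iMot2 Tma",
    "LMfns mgeLmPf7zgm G y.iJns NN mPuQ l wf.ogm G nduLot2Mz9v9otr7 o9e PndmNn9iNmdq Tma",
    "w8arl mgeLmPf7zgm G y.iJns NN mPuQ l wf.ogm G nduLot2Mz9v9otr7 o9e PndmNn9iNmd- Tma",
]


def fields(spaced):
    """Split a hand-spaced hash into its guessed fields."""
    return spaced.split()


def differing_fields(a, b):
    """Return the indices of the fields that differ between two spaced hashes."""
    fa, fb = fields(a), fields(b)
    if len(fa) != len(fb):
        raise ValueError("hashes were not split into the same number of fields")
    return [i for i, (x, y) in enumerate(zip(fa, fb)) if x != y]


def constant_fields(hashes):
    """Return {index: value} for the fields that are identical in every hash."""
    columns = zip(*(fields(h) for h in hashes))
    return {i: col[0] for i, col in enumerate(columns) if len(set(col)) == 1}


if __name__ == "__main__":
    print(differing_fields(SPACED[0], SPACED[1]))
    print(differing_fields(SPACED[2], SPACED[3]))
    print(constant_fields(SPACED))
```

With this split, each close pair differs only in the first field and in the field just before the trailing `Tma`, while the short separators (`G`, `NN`, `l`, `o9e`, `Tma`) are the same in all four.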

Hashes obtained later seem very different, though.

Here is some information about the files, in the same order as the hashes:

  File: 1-s2.0-S0360319921045377-main.pdf
  Size: 5225391   	Blocks: 7833       IO Block: 131072 regular file
Device: 0,37	Inode: 1067528     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/    remy)   Gid: (  100/   users)
Access: 2022-01-29 14:19:53.072211357 +0100
Modify: 2022-01-29 14:19:53.185217711 +0100
Change: 2022-01-29 14:19:53.325225583 +0100
 Birth: 2022-01-29 14:19:53.072211357 +0100
  File: 1-s2.0-S0360319921045377-main(1).pdf
  Size: 5225391   	Blocks: 7833       IO Block: 131072 regular file
Device: 0,37	Inode: 1067359     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/    remy)   Gid: (  100/   users)
Access: 2022-01-29 14:19:57.310442520 +0100
Modify: 2022-01-29 14:19:57.493452096 +0100
Change: 2022-01-29 14:19:57.539454503 +0100
 Birth: 2022-01-29 14:19:57.310442520 +0100
  File: 1-s2.0-S0360319921045377-main(2).pdf
  Size: 5225391   	Blocks: 7833       IO Block: 131072 regular file
Device: 0,37	Inode: 1067360     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/    remy)   Gid: (  100/   users)
Access: 2022-01-29 14:20:04.484795768 +0100
Modify: 2022-01-29 14:20:04.608801481 +0100
Change: 2022-01-29 14:20:04.663804016 +0100
 Birth: 2022-01-29 14:20:04.484795768 +0100
  File: 1-s2.0-S0360319921045377-main(3).pdf
  Size: 5225391   	Blocks: 7833       IO Block: 131072 regular file
Device: 0,37	Inode: 1067005     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/    remy)   Gid: (  100/   users)
Access: 2022-01-29 14:20:09.293007869 +0100
Modify: 2022-01-29 14:20:09.448014381 +0100
Change: 2022-01-29 14:20:09.492016229 +0100
 Birth: 2022-01-29 14:20:09.293007869 +0100

@sneakers-the-rat
Author

WOW that looks like they might just be timestamps, that is LAZY on their part. I'll try and systematically sample across time and see if i can get repeating patterns/match subsections with times. I think you're right, those do seem to be independent and repeatable sections.
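
As a first pass at looking for repeats, something like the sketch below could tally how often each leading chunk shows up in the collected hashes. The five-character width is only an assumption taken from where the spaces were placed in the comment above, and `prefix_counts` and `load_hashes` are hypothetical helper names, not part of the script in this gist.

```python
import json
from collections import Counter
from pathlib import Path


def load_hashes(path='elsev_hashes.json'):
    """Load the list of hashes written by the collection script."""
    return json.loads(Path(path).read_text())


def prefix_counts(hashes, width=5):
    """Count how often each leading chunk of `width` characters occurs.

    width=5 is a guess based on the hand-spaced hashes, not a known field size.
    """
    return Counter(h[:width] for h in hashes)
```

Running `prefix_counts(load_hashes()).most_common()` on the list at the top would show prefixes such as `FCi27` recurring across different papers. That says nothing yet about what the prefix encodes, only that it is shared.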
