@JeffreyMFarley
Last active August 25, 2020 14:17
Find duplicated JavaScript code
#!/bin/sh
files=frontend/src/static/js
pmd cpd --language ecmascript --minimum-tokens 25 --files "$files" --format csv \
| sed "1s/.*/lines,tokens,occurrences,L1,F1,L2,F2,L3,F3,L4,F4/" \
| sed "s|$(pwd)/$files|.|g" \
> dups.csv

JeffreyMFarley commented Aug 25, 2020

Some annotations

pmd cpd (source), PMD's Copy/Paste Detector, is an open-source tool that detects duplicated code

Line 1: Read all the JavaScript files under $files, look for runs of at least 25 identical tokens, and output the matches as CSV
Line 2: Replace the header row with column names that are easier to use in Excel (the new header covers up to four occurrences per block)
Line 3: Replace the absolute path prefix of each file name with . so the paths are shorter and relative
Line 4: Write the result to dups.csv
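
To see what lines 2 and 3 do without running pmd, here is a minimal sketch that applies the same two sed rewrites to a made-up CSV row. The function name clean_cpd_csv, the /repo prefix, and the sample row are all assumptions for illustration, not real CPD output.

```shell
#!/bin/sh
# Hypothetical helper: applies the gist's two sed rewrites to CSV on stdin.
# $1 is the absolute path prefix to shorten (an assumption for this demo).
clean_cpd_csv() {
  prefix=$1
  sed "1s/.*/lines,tokens,occurrences,L1,F1,L2,F2,L3,F3,L4,F4/" \
  | sed "s|$prefix|.|g"
}

# Made-up sample in the general shape of CPD's CSV output; not real results.
printf '%s\n' \
  'lines,tokens,occurrences' \
  '12,90,2,10,/repo/frontend/src/static/js/a.js,40,/repo/frontend/src/static/js/b.js' \
  | clean_cpd_csv /repo/frontend/src/static/js
```

The header row gains the L1..F4 column names and each file path collapses to ./a.js and ./b.js.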

dups.csv will contain one row per block of identical tokens, with each Lx and Fx column pair giving the line and file of one occurrence
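
If you would rather skim the results in a terminal than in Excel, something like the following sketch lists the largest blocks first. The helper name top_dups is hypothetical, and it assumes no file path contains a comma, since it splits on commas naively.

```shell
#!/bin/sh
# Hypothetical helper: print the header plus the N largest duplicate blocks,
# ordered by the tokens column (column 2), largest first.
# Assumes no file path contains a comma.
top_dups() {
  csv=$1
  n=${2:-10}
  head -n 1 "$csv"
  tail -n +2 "$csv" | sort -t, -k2,2nr | head -n "$n"
}

# Only runs if the gist's script has already produced dups.csv.
if [ -f dups.csv ]; then
  top_dups dups.csv 5
fi
```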


JeffreyMFarley commented Aug 25, 2020

But what if the identical blocks of code are rearranged?

This will appear as, say, 10 separate blocks of 90+ tokens each, even though you know that most of the file was copied

[Screenshot: Screen Shot 2020-08-25 at 10 01 28 AM]

The cycle is:

  1. Run dups
  2. Move some of the identical lines to where they appear in the original
  3. Run dups again
  4. You should see longer token runs and fewer of them, e.g. 90 tokens => 170 tokens

[Screenshot: Screen Shot 2020-08-25 at 10 01 37 AM]
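
One way to check progress between passes of the cycle above is to reduce dups.csv to a few numbers and compare them from run to run. This is only a sketch: summarize_dups is a hypothetical helper, and it assumes the tokens count is the second column, as in the header the script writes.

```shell
#!/bin/sh
# Hypothetical helper: count duplicate blocks and report the longest run.
# Skips the header row; column 2 is assumed to be the token count.
summarize_dups() {
  awk -F, 'NR > 1 { n++; total += $2; if ($2 + 0 > max + 0) max = $2 }
           END { printf "blocks=%d max_tokens=%d total_tokens=%d\n", n, max, total }' "$1"
}

# Only runs if the gist's script has already produced dups.csv.
if [ -f dups.csv ]; then
  summarize_dups dups.csv
fi
```

After moving lines and re-running dups, blocks should go down and max_tokens should go up if the rearrangement helped.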
