@atomotic
Created June 8, 2018 14:01
warcdedup
# apt install rustc cargo
# git clone https://github.com/tari/warcdedupe
# cd warcdedupe
# cargo install
# ...
# ./target/debug/warcdedupe -h
WARC deduplicator.

Usage:
  warcdedupe [options] [<infile>] [<outfile>]

If infile or outfile is not specified or is '-', read from standard input or
write to standard output.

Options:
  -h --help            Show this help.
  --compressed-input   Assume records in non-file input are compressed.
  --compress-output    Write compressed records to non-file output.

When input or output is a file, the --compressed-input and --compress-output
options are ignored each is assumed to be compressed if the file name ends in
'.gz'.
# ./target/debug/warcdedupe /tmp/archive-with-dup.warc.gz
45A2F157D26A2A1F369BCD3D019A7507C1827BC2 12191 <https://literarymachin.es/>
8AE39EC448002DF045E72EDBB729FBDAE70CD343 310 <https://literarymachin.es/robots.txt>
1BFB5974A123509DEB12E5899F80FB90053DF898 7964 <https://literarymachin.es/assets/css/normalize.css>
6D508F99E2C266ECC8034005E8ECBEF6A55EF05A 18972 <https://literarymachin.es/assets/css/screen.css>
EB7EAC5C485C5E6E6CD4B9FE18998E9F517484EE 4162 <https://literarymachin.es/assets/css/syntax.css>
CE80B095F190902A6C7E50486E9824709ABF3142 1485 <https://literarymachin.es/assets/css/main.css>
F00C98D606DC09DEA9864C5CFCE62F802C20E4FA 329352 <https://literarymachin.es/assets/images/tumblr.jpg>
CA0B047CFFB453E035349042914CA6BEF4879B1C 4965 <https://literarymachin.es/about.html>
AC7D9B46B1EDE42763C3603B3EE18B4DBF35BA76 9714 <https://literarymachin.es/pywb-2/>
E583C4EED081A6379E02497622F35D707FDCE555 15317 <https://literarymachin.es/pywb-wayback-machine/>
5ABEAA3F5435F50FE48BB684EE5154490C4A60C1 11607 <https://literarymachin.es/anonymous-webarchiving/>
0A70C2833F05D55228A2A6B57BC8A620E0B0BEC0 13787 <https://literarymachin.es/open-bni/>
0695569E0ECACBC3128BCFB78D4C593DD7DA3D2A 9590 <https://literarymachin.es/epub-linkrot/>
106BA68EE379438CBEACA2D28FF9CE4629CE0AAC 18254 <https://literarymachin.es/skos-autocomplete/>
8A79EEB61E4BF5AAEC795B97646EA06776DCA8C3 10090 <https://literarymachin.es/deepzoom-osd-server/>
A711B93CBAE26D4D71CA202E618C2186C1F12154 12006 <https://literarymachin.es/opendata-anagrafe-biblioteche/>
FCEE371F3CA42CC44AD11A3F4C3830FCFFF03D89 24678 <https://literarymachin.es/sbn-json-api/>
75BF65B428C490F29C488E79EF2EF5C925CC409D 9600 <https://literarymachin.es/rss.xml>
7637959A0884E17496CA4BD35569E2B638EC246A 93548 <https://literarymachin.es/assets/js/jquery-1.10.2.min.js>
0EB02B84F8C641AAC0356465908FD4F52516845E 3219 <https://literarymachin.es/assets/js/jquery.fitvids.js>
BD32DBEBCF22A9CC89BA6EA091D15BD53F897389 642 <https://literarymachin.es/assets/js/index.js>
121EB88FBC8FFA14142F64EFEB7D931396218AD0 6633 <https://literarymachin.es/assets/js/cookiechoices.js>
6848C5FDC5C97E87F349CB208F139F88A7B0A2B0 2697 <https://literarymachin.es/assets/fonts/icons.eot>
6848C5FDC5C97E87F349CB208F139F88A7B0A2B0 2697 <https://literarymachin.es/assets/fonts/icons.eot?>
1AA4E9E050B995C186CF553470CC61A091860E35 3093 <https://literarymachin.es/assets/fonts/icons.woff>
DCC1E2E710650C5A088FEA8AD30515827AF9C6AB 2536 <https://literarymachin.es/assets/fonts/icons.ttf>
E42D349329D5E9BCBD91B6A0F75736CF2441906B 4827 <https://literarymachin.es/assets/fonts/icons.svg>
BD47D0EBCD9D41B8F3CBC54AC3A7625A30A1CF23 586 <https://literarymachin.es/pywb-wayback-machine>
F6756E3A136517406725EF9003D053A962138DA7 29518 <https://literarymachin.es/assets/images/profile.png>
9E0DF3F57E6666D4ECEA69C0CB57E6BDA45CF6EB 132883 <https://literarymachin.es/assets/images/pywb-tor.png>
EC7DB75FF488D87869E27F5905D99770A5CA262F 17721 <https://literarymachin.es/assets/images/nuovosoggettario.jpg>
D38F4E86AE6ED0CE993FD23C638CF56DAB960414 999365 <https://literarymachin.es/assets/images/deepzoom-osd-server-screenshot.png>
955C530A9FA9B20CBABA34CCBE5F8B09C7396055 124344 <https://literarymachin.es/assets/images/sbn-mobile-screenshot-2.png>
7534F51D4D79C525257EA575E52986E9C6810B27 130956 <https://literarymachin.es/assets/images/mitmproxy.png>
45A2F157D26A2A1F369BCD3D019A7507C1827BC2 12191 <https://literarymachin.es/>
8AE39EC448002DF045E72EDBB729FBDAE70CD343 310 <https://literarymachin.es/robots.txt>
1BFB5974A123509DEB12E5899F80FB90053DF898 7964 <https://literarymachin.es/assets/css/normalize.css>
6D508F99E2C266ECC8034005E8ECBEF6A55EF05A 18972 <https://literarymachin.es/assets/css/screen.css>
EB7EAC5C485C5E6E6CD4B9FE18998E9F517484EE 4162 <https://literarymachin.es/assets/css/syntax.css>
CE80B095F190902A6C7E50486E9824709ABF3142 1485 <https://literarymachin.es/assets/css/main.css>
F00C98D606DC09DEA9864C5CFCE62F802C20E4FA 329352 <https://literarymachin.es/assets/images/tumblr.jpg>
CA0B047CFFB453E035349042914CA6BEF4879B1C 4965 <https://literarymachin.es/about.html>
AC7D9B46B1EDE42763C3603B3EE18B4DBF35BA76 9714 <https://literarymachin.es/pywb-2/>
E583C4EED081A6379E02497622F35D707FDCE555 15317 <https://literarymachin.es/pywb-wayback-machine/>
5ABEAA3F5435F50FE48BB684EE5154490C4A60C1 11607 <https://literarymachin.es/anonymous-webarchiving/>
0A70C2833F05D55228A2A6B57BC8A620E0B0BEC0 13787 <https://literarymachin.es/open-bni/>
0695569E0ECACBC3128BCFB78D4C593DD7DA3D2A 9590 <https://literarymachin.es/epub-linkrot/>
106BA68EE379438CBEACA2D28FF9CE4629CE0AAC 18254 <https://literarymachin.es/skos-autocomplete/>
8A79EEB61E4BF5AAEC795B97646EA06776DCA8C3 10090 <https://literarymachin.es/deepzoom-osd-server/>
A711B93CBAE26D4D71CA202E618C2186C1F12154 12006 <https://literarymachin.es/opendata-anagrafe-biblioteche/>
FCEE371F3CA42CC44AD11A3F4C3830FCFFF03D89 24678 <https://literarymachin.es/sbn-json-api/>
75BF65B428C490F29C488E79EF2EF5C925CC409D 9600 <https://literarymachin.es/rss.xml>
7637959A0884E17496CA4BD35569E2B638EC246A 93548 <https://literarymachin.es/assets/js/jquery-1.10.2.min.js>
0EB02B84F8C641AAC0356465908FD4F52516845E 3219 <https://literarymachin.es/assets/js/jquery.fitvids.js>
BD32DBEBCF22A9CC89BA6EA091D15BD53F897389 642 <https://literarymachin.es/assets/js/index.js>
121EB88FBC8FFA14142F64EFEB7D931396218AD0 6633 <https://literarymachin.es/assets/js/cookiechoices.js>
6848C5FDC5C97E87F349CB208F139F88A7B0A2B0 2697 <https://literarymachin.es/assets/fonts/icons.eot>
6848C5FDC5C97E87F349CB208F139F88A7B0A2B0 2697 <https://literarymachin.es/assets/fonts/icons.eot?>
1AA4E9E050B995C186CF553470CC61A091860E35 3093 <https://literarymachin.es/assets/fonts/icons.woff>
DCC1E2E710650C5A088FEA8AD30515827AF9C6AB 2536 <https://literarymachin.es/assets/fonts/icons.ttf>
E42D349329D5E9BCBD91B6A0F75736CF2441906B 4827 <https://literarymachin.es/assets/fonts/icons.svg>
BD47D0EBCD9D41B8F3CBC54AC3A7625A30A1CF23 586 <https://literarymachin.es/pywb-wayback-machine>
F6756E3A136517406725EF9003D053A962138DA7 29518 <https://literarymachin.es/assets/images/profile.png>
9E0DF3F57E6666D4ECEA69C0CB57E6BDA45CF6EB 132883 <https://literarymachin.es/assets/images/pywb-tor.png>
EC7DB75FF488D87869E27F5905D99770A5CA262F 17721 <https://literarymachin.es/assets/images/nuovosoggettario.jpg>
D38F4E86AE6ED0CE993FD23C638CF56DAB960414 999365 <https://literarymachin.es/assets/images/deepzoom-osd-server-screenshot.png>
955C530A9FA9B20CBABA34CCBE5F8B09C7396055 124344 <https://literarymachin.es/assets/images/sbn-mobile-screenshot-2.png>
7534F51D4D79C525257EA575E52986E9C6810B27 130956 <https://literarymachin.es/assets/images/mitmproxy.png>
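
Each line of the listing above appears to be a digest, a length, and a URL (an interpretation of the output, not something the help text states). Assuming that layout, and assuming the listing has been saved to a file, a small helper can show which digests occur more than once. For instance, `icons.eot` and `icons.eot?` share one digest. The function name `dup_digests` and the file name `listing.txt` are hypothetical and not part of warcdedupe.

```shell
# Hypothetical helper, not part of warcdedupe.
# Assumes input lines of the form "<digest> <length> <url>".
# Prints "<count> <digest>" for every digest seen more than once.
dup_digests() {
  awk '{ seen[$1]++ }
       END { for (d in seen) if (seen[d] > 1) print seen[d], d }' "$@" |
    sort -k2
}
```

With the listing saved as `listing.txt`, `dup_digests listing.txt` would print one line per repeated digest, preceded by the number of times it occurs.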