Eric Lease Morgan (ericleasemorgan)

ericleasemorgan / natural language processing with shell
Last active Mar 15, 2018
some one-liners to extract URLs, email addresses, and a dictionary from a text file
# extract all urls from a text file
cat file.txt | egrep -o 'https?://[^ ]+' | sed -e 's/https/http/g' | sed -e 's/\W\+$//g' | sort | uniq -c | sort -bnr
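A quick worked example of the URL one-liner, feeding sample text through printf instead of a file; the input line is made up for illustration:

```shell
# demonstrate the URL extractor on sample input; https is normalized
# to http so the two mentions collapse into a single count
printf 'see http://example.org/a and https://example.org/a today\n' \
  | egrep -o 'https?://[^ ]+' \
  | sed -e 's/https/http/g' \
  | sed -e 's/\W\+$//g' \
  | sort | uniq -c | sort -bnr
```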
# extract domains from URLs found in text files
cat file.txt | egrep -o 'https?://[^ ]+' | sed -e 's/https/http/g' | sed -e 's/\W\+$//g' | sed -e 's/http:\/\///g' | sed -e 's/\/.*$//g' | sort | uniq -c | sort -bnr
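The domain variant in action on two made-up URLs from the same host; both reduce to one domain:

```shell
# strip the scheme and the path, leaving only the host name
printf 'visit http://example.org/page or https://example.org/other\n' \
  | egrep -o 'https?://[^ ]+' \
  | sed -e 's/https/http/g' \
  | sed -e 's/\W\+$//g' \
  | sed -e 's/http:\/\///g' \
  | sed -e 's/\/.*$//g' \
  | sort | uniq -c | sort -bnr
```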
# extract email addresses
cat file.txt | grep -i -o '[A-Z0-9._%+-]\+@[A-Z0-9.-]\+\.[A-Z]\{2,4\}' | sort | uniq -c | sort -bnr
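The email one-liner against sample input; both addresses are fabricated for illustration:

```shell
# extract and count email addresses from sample text
printf 'write to emorgan@nd.edu or info@example.org\n' \
  | grep -i -o '[A-Z0-9._%+-]\+@[A-Z0-9.-]\+\.[A-Z]\{2,4\}' \
  | sort | uniq -c | sort -bnr
```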
# list all words in a text file
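The one-liner for that last comment is missing from this preview. A minimal sketch in the same style, assuming lowercased alphabetic tokens are wanted (sample input shown inline; not the original code):

```shell
# list all words in a text file with their frequencies:
# lowercase, split on whitespace, strip non-letters, count
printf 'The cat saw the other cat\n' \
  | tr '[:upper:]' '[:lower:]' \
  | tr -s '[:space:]' '\n' \
  | sed -e 's/[^a-z]//g' \
  | sort | uniq -c | sort -bnr
```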
ericleasemorgan / tika2text.sh
Last active Mar 27, 2017
(brain-dead) shell script using TIKA in server mode to convert a batch of files to plain text
#!/bin/bash
# tika2text.sh - given a directory, recursively extract text from files
# Eric Lease Morgan <emorgan@nd.edu>
# (c) University of Notre Dame, distributed under a GNU Public License
# March 27, 2017 - a second cut; works with a directory
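The preview stops at the header. The core of such a script can be sketched as below; the server URL and directory names are assumptions, and a Tika server must already be listening:

```shell
# tika2text-style sketch: assumes Apache Tika is running in server
# mode, e.g.  java -jar tika-server.jar  (port 9998 by default);
# INPUT and OUTPUT are illustrative directory names
TIKA='http://localhost:9998/tika'
INPUT='./documents'
OUTPUT='./text'
mkdir -p "$OUTPUT"

# recursively find files and ask the server for a plain-text rendition
find "$INPUT" -type f | while read -r FILE; do
	curl -s -T "$FILE" -H 'Accept: text/plain' "$TIKA" > "$OUTPUT/$(basename "$FILE").txt"
done
```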
ericleasemorgan / gist:8984187
Created Feb 13, 2014
given a (CrossRef) DOI, parse link header of HTTP request to get fulltext URLs
sub extracter {
# given a (CrossRef) DOI, parse link header of HTTP request to get fulltext URLs
# see also: https://prospect.crossref.org/splash/
# Eric Lease Morgan <emorgan@nd.edu>
# February 12, 2014 - first cut
# require
use HTTP::Request;
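The Perl above is only a preview; the Link-header parsing it describes can be sketched in shell. The header value below is a made-up example of what such a response might carry, not real CrossRef output, and the DOI in the comment is the standard test prefix:

```shell
# a hypothetical Link header of the kind extracter() parses; in
# practice it would come from an HTTP HEAD request against a DOI, e.g.
#   curl -sIL "https://doi.org/10.5555/12345678"
LINK='<https://example.org/fulltext.pdf>; rel="item", <https://example.org/fulltext.xml>; rel="item"'

# pull the fulltext URLs out of the angle brackets
printf '%s\n' "$LINK" | grep -o '<[^>]*>' | tr -d '<>'
```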
ericleasemorgan / gist:8438082
Created Jan 15, 2014
Perl subroutine to slurp up the contents of a text file
sub slurp {
	my $f = shift;
	open( my $fh, '<', $f ) or die "Can't open $f: $!\n";
	my $r = do { local $/; <$fh> };
	close $fh;
	return $r;
}