Ian Milligan ianmilligan1

ianmilligan1 / csvfilter.sh
Created February 11, 2016 21:35
CSV Filtering
pip install csvfilter
## grabs the year and language fields
csvfilter -f 2,5 derivative-data.csv > year_language.csv
## counts how often each year/language pair occurs
sort year_language.csv | uniq -c > sorted_year_language.csv
## pulls out the rows containing "en" (note: this also matches any other value that contains "en")
grep "en" sorted_year_language.csv
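To see what the steps above produce, here is a small self-contained sketch on made-up data. It swaps in `cut` for `csvfilter`, so it only suits CSVs without quoted commas. The sample file, its column layout, and its values are assumptions for illustration and do not reflect the real derivative-data.csv.

```shell
# Hypothetical sample: year in column 1, language in column 2.
tmp=$(mktemp -d)
printf '%s\n' '2009,en' '2009,fr' '2009,en' '2010,en' > "$tmp/sample.csv"

# Count identical year,language pairs; sort first because uniq -c only merges adjacent lines.
# The awk step turns uniq's "count value" output back into comma-separated fields.
counts=$(cut -d, -f1,2 "$tmp/sample.csv" | sort | uniq -c | awk '{print $1 "," $2}')
echo "$counts"

# Keep only rows whose language field is exactly "en" (stricter than a plain grep "en").
english=$(echo "$counts" | awk -F, '$3 == "en"')
echo "$english"
rm -rf "$tmp"
```

Matching on the exact field value avoids picking up years or other language codes that merely contain the letters "en".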
ianmilligan1 / errortrace.log
Last active February 15, 2016 19:19
Error for Warcbase on GeoCities WARCs
[Stage 0:==> (483 + 12) / 8897]
ERROR Executor - Exception in task 494.0 in stage 0.0 (TID 494)
java.io.EOFException
    at org.archive.util.zip.OpenJDK7GZIPInputStream.readUByte(OpenJDK7GZIPInputStream.java:270)
    at org.archive.util.zip.OpenJDK7GZIPInputStream.readUShort(OpenJDK7GZIPInputStream.java:260)
    at org.archive.util.zip.OpenJDK7GZIPInputStream.readHeader(OpenJDK7GZIPInputStream.java:169)
    at org.archive.util.zip.OpenJDK7GZIPInputStream.<init>(OpenJDK7GZIPInputStream.java:84)
    at org.archive.util.zip.GZIPMembersInputStream.<init>(GZIPMembersInputStream.java:79)
    at org.archive.util.zip.GZIPMembersInputStream.<init>(GZIPMembersInputStream.java:74)
    at org.archive.io.warc.WARCReaderFactory$CompressedWARCReader.<init>(WARCReaderFactory.java:252)
    at org.archive.io.warc.WARCReaderFactory.getArchiveReader(WARCReaderFactory.java:113)
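An `EOFException` raised while a gzip header is being read often means that one of the input files is empty or truncated, though the trace alone does not prove it. As a rough diagnostic sketch, `gzip -t` can be run over the files to list the ones that fail. The function name `find_bad_gz` and the demo files are hypothetical; this is not part of Warcbase.

```shell
# List .gz files under a directory that fail gzip's integrity test (truncated or corrupt).
find_bad_gz() {
  find "$1" -type f -name '*.gz' | sort | while IFS= read -r f; do
    gzip -t "$f" 2>/dev/null || echo "$f"
  done
}

# Demo on throwaway files: one valid archive and one cut off inside its header.
demo=$(mktemp -d)
echo "hello" | gzip > "$demo/good.warc.gz"
head -c 5 "$demo/good.warc.gz" > "$demo/truncated.warc.gz"
bad=$(find_bad_gz "$demo")
echo "$bad"
rm -rf "$demo"
```

On a real collection the function would be pointed at the directory holding the WARC files; any path it prints is a candidate for re-downloading or exclusion.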
alias mvn='/Users/ianmilligan1/maven/apache-maven-3.2.2/bin/mvn'
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_74.jdk/Contents/Home
##
# Your previous /Users/ianmilligan1/.bash_profile file was backed up as /Users/ianmilligan1/.bash_profile.macports-saved_2014-08-06_at_11:17:27
##
# MacPorts Installer addition on 2014-08-06_at_11:17:27: adding an appropriate PATH variable for use with MacPorts.
export PATH="/opt/local/bin:/opt/local/sbin:$PATH"
# Finished adapting your PATH environment variable for use with MacPorts.
i2millig@camalon00:~/spark-1.5.1-bin-hadoop2.6/bin$ spark-shell --jars ~/warcbase/target/warcbase-0.1.0-SNAPSHOT-fatjar.jar --num-executors 75 --executor-cores 5 --executor-memory 15G --driver-memory 30G
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.3.0
      /_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_71)
Type in expressions to have them evaluated.
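// Assumes `arc` (the path to the ARC files) and `createClickableLink` were defined earlier in the session.
val r =
  RecordLoader.loadArc(arc, sc)
    .keepValidPages()
    .map(r => {
      val t = ExtractRawText(r.getBodyContent)
      val len = 100
      // Keep the crawl date, a clickable link, and at most the first `len` characters of text.
      (r.getCrawldate, createClickableLink(r.getUrl, r.getCrawldate),
        if (t.length > len) t.substring(0, len) else t)
    })
    .collect()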
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._
val r = RecordLoader.loadArchives("/collections/webarchives/geocities/warcs/", sc)
  .keepValidPages()
  .keepContent(Set("auschwitz".r, "auschwitz-birkenau".r, "dachau".r, "neuengamme".r, "sachsenhausen".r))
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("holocaust-text-geocities/")
i2millig@rho:/mnt/vol1/derivative_data/walk$ spark
WARN NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_45)
import org.warcbase.spark.rdd.RecordRDD._
import org.warcbase.spark.matchbox.{RecordLoader, ExtractClusters}
val recs = RecordLoader.loadArchives("/collections/webarchives/geocities/warcs/", sc)
  .keepUrlPatterns(Set("http://geocities.com/EnchantedForest/.*".r))
val clusters = ExtractClusters(recs, sc)
  .topNWords("GEO_ENCHANTED_FOREST_TOP_N", sc)
  .computeLDA("GEO_ENCHANTED_FOREST_LDA", sc)
  .saveSampleDocs("GEO_ENCHANTED_FOREST_LDA", sc)
i2millig@camalon01:~$ spark-shell --jars ~/git/warcbase/target/warcbase-0.1.0-SNAPSHOT-fatjar.jar --num-executors 75 --executor-cores 5 --executor-memory 20G --driver-memory 10G
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.3.0
      /_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.