"""
We have a bunch of files in the form of social.*.xz.gpg.save
These need to be converted to gzip format.
"""
import argparse
import os
def decrypt(filename):
<?xml version="1.0" encoding="UTF-8"?>
<!--
clueweb12++ Crawl job configuration file
========================================
This file is the template for the job configurations.
It is based on the sample Heritrix 3 job configuration file.
#!/usr/bin/env python
'''
Script to download the 2013 corpus
'''
import requests
import urlparse
from BeautifulSoup import BeautifulSoup, SoupStrainer
#!/usr/bin/env python
'''
Our nabble crawl contains a ton of 503s. Poll Heritrix and set up new jobs.
'''
import argparse
import daemon
import os
import sys
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>lemurproject</groupId>
<artifactId>sutime-clojure</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<name>sutime-clojure</name>
import edu.stanford.nlp.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.time.TimeAnnotations;
import edu.stanford.nlp.util.CoreMap;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.NormalizedNamedEntityTagAnnotation;
import edu.stanford.nlp.ling.tokensregex.types.Expressions.VarAssignmentExpression;
import java.util.*;
'''Resolve a URL'''
import argparse
import requests
def resolve_url(url):
    # requests follows redirects by default, so .url is the final URL
    return requests.get(url).url
if __name__ == '__main__':
'''
Heritrix control from the command line
All control can be done by using job directories on the command line
'''
import argparse
import os
def stop_job(job_dir):
'''
The purpose of this script is to keep pausing / unpausing
the ygroups download
'''
import argparse
import os
import sys
import time
'''
Index pages scraper that goes through and finds the most recent pages
'''
import argparse
import os
import sys
import warc
from BeautifulSoup import BeautifulSoup, SoupStrainer