
Shriphani Palakodety shriphani

shriphani / decompress_and_recompress.py
Created March 6, 2013 08:18
Repackage KBA data (each file is decrypted, decompressed, and recompressed with gzip to make it easier to handle)
"""
We have a bunch of files of the form social.*.xz.gpg.save
These need to be converted to gzip format.
"""
import argparse
import os
def decrypt(filename):
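The preview above cuts off at `decrypt`. A minimal sketch of the pipeline the description outlines — gpg-decrypt each `*.xz.gpg.save` file, then rewrap the xz payload as gzip — assuming a `gpg` binary on `PATH` with the key already imported (the helper names and the streaming chunk size are illustrative, not the gist's actual code):

```python
import gzip
import lzma
import subprocess


def decrypt(filename):
    # Assumption: gpg is installed and the decryption key is imported.
    out = filename.replace('.gpg.save', '')
    subprocess.check_call(['gpg', '--output', out, '--decrypt', filename])
    return out


def xz_to_gzip(xz_path, gz_path):
    # Stream-decompress the xz file and recompress with gzip,
    # one megabyte at a time, so large corpus files fit in memory.
    with lzma.open(xz_path, 'rb') as src, gzip.open(gz_path, 'wb') as dst:
        for chunk in iter(lambda: src.read(1 << 20), b''):
            dst.write(chunk)
```

Streaming through `lzma.open`/`gzip.open` avoids shelling out to `xz` and `gzip` the way a 2013-era script likely did.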
shriphani / scrape-crawler-beans.cxml
Created March 26, 2013 20:16
Heritrix config file to scrape pages with a particular URL format (specified using a regex).
<?xml version="1.0" encoding="UTF-8"?>
<!--
clueweb12++ Crawl job configuration file
========================================
This file is the template for the job configurations.
It is based on the sample Heritrix 3 job configuration file.
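The config above admits pages by regex (in Heritrix 3 this is typically a `MatchesRegexDecideRule` bean in the scope's decide-rule chain). The accept/reject decision itself can be sketched outside Heritrix in a few lines of Python; the pattern below is an illustrative assumption, not the gist's actual regex:

```python
import re

# Hypothetical scope: accept only numbered thread pages on one host.
ACCEPT = re.compile(r'^https?://example\.com/forum/thread/\d+$')


def decide(url):
    """Mimic a regex decide rule: ACCEPT on match, REJECT otherwise."""
    return 'ACCEPT' if ACCEPT.match(url) else 'REJECT'
```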
shriphani / kba_download_2013_stream.py
Created April 11, 2013 20:50
Gets a list of files from the new KBA corpus. Can be used to filter out files from a wget dump.
#!/usr/bin/env python
'''
Script to download the 2013 corpus
'''
import requests
import urlparse
from BeautifulSoup import BeautifulSoup, SoupStrainer
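The script scrapes the corpus index page for file links (the preview shows `requests` plus the old `BeautifulSoup`/`SoupStrainer` API). The same link-harvesting step can be sketched with only the standard library; the base URL and the `.xz.gpg` suffix filter are assumptions about what the listing looks like:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute hrefs, akin to BeautifulSoup + SoupStrainer('a')."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(urljoin(self.base_url, value))


def list_corpus_files(html, base_url, suffix='.xz.gpg'):
    # Keep only links that look like corpus files (suffix is assumed).
    parser = LinkCollector(base_url)
    parser.feed(html)
    return [u for u in parser.links if u.endswith(suffix)]
```

In practice the `html` argument would come from `requests.get(index_url).text`.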
shriphani / nabble_heritrix_setup.py
Last active December 16, 2015 03:09
Performs a full crawl of Nabble, iteratively re-crawling the sites that previously timed out
#!/usr/bin/env python
'''
Our Nabble crawl contains a large number of 503s. Poll Heritrix and set up new jobs.
'''
import argparse
import daemon
import os
import sys
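The poll-and-resubmit loop the docstring describes can be sketched generically. Here `get_status` and `relaunch` are hypothetical stand-ins for the Heritrix REST calls the real script would make; statuses and the retry policy are assumptions:

```python
import time


def poll_until_done(job_dirs, get_status, relaunch, interval=0.0):
    """Poll each job; relaunch any that timed out, until all finish.

    get_status(job) -> 'RUNNING' | 'FINISHED' | 'TIMED_OUT' (assumed API)
    relaunch(job)   -> start a fresh crawl of that job (assumed API)
    """
    pending = list(job_dirs)
    while pending:
        still_pending = []
        for job in pending:
            status = get_status(job)
            if status == 'FINISHED':
                continue
            if status == 'TIMED_OUT':
                relaunch(job)
            still_pending.append(job)
        pending = still_pending
        if pending:
            time.sleep(interval)  # back off between polling rounds
```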
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>lemurproject</groupId>
<artifactId>sutime-clojure</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<name>sutime-clojure</name>
import edu.stanford.nlp.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.time.TimeAnnotations;
import edu.stanford.nlp.util.CoreMap;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.NormalizedNamedEntityTagAnnotation;
import edu.stanford.nlp.ling.tokensregex.types.Expressions.VarAssignmentExpression;
import java.util.*;
shriphani / redirect_resolve.py
Created April 19, 2013 22:54
Using the requests module to resolve URLs (follow redirects to the final URL)
'''Resolve a url'''
import argparse
import requests
def resolve_url(url):
return requests.get(url).url
if __name__ == '__main__':
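`requests.get(url).url` follows the whole redirect chain and reports the final URL. The mechanism can be made explicit with a small hand-rolled follower; `get_location` is an injected fetcher (in practice it would wrap something like `requests.head(url).headers.get('Location')`), and the hop limit is an assumption:

```python
def follow_redirects(url, get_location, max_hops=10):
    """Resolve a redirect chain by hand.

    get_location(url) returns the redirect target for a URL, or None
    when the URL is a final (non-redirecting) page.
    """
    seen = set()
    for _ in range(max_hops):
        if url in seen:
            raise ValueError('redirect loop at %s' % url)
        seen.add(url)
        nxt = get_location(url)
        if nxt is None:
            return url
        url = nxt
    raise ValueError('too many redirects')
```

The loop and hop-count guards mirror what requests does internally before raising `TooManyRedirects`.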
shriphani / cmd_heritrix.py
Created April 21, 2013 19:19
Operate a Heritrix instance hosted at the default address from the command line
'''
Heritrix control from the command line
All control can be done by using job directories on the command line
'''
import argparse
import os
def stop_job(job_dir):
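The preview cuts off at `stop_job`. Heritrix 3 is controlled over a REST API: actions such as `pause`, `unpause`, and `terminate` are POSTed to the job's URL on the engine (normally `https://localhost:8443/engine`, with digest auth). A sketch of just the request construction, with the job-naming convention an assumption:

```python
import os

ENGINE = 'https://localhost:8443/engine'  # Heritrix's default address


def job_action_request(job_dir, action):
    """Build the (url, form-data) pair for a Heritrix 3 job action POST.

    Assumes the job's name is the last component of its job directory;
    actually sending it would need requests plus HTTP digest auth.
    """
    job_name = os.path.basename(os.path.normpath(job_dir))
    return ('%s/job/%s' % (ENGINE, job_name), {'action': action})
```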
shriphani / ygroups_pauser.py
Created April 24, 2013 18:05
ygroups Heritrix crawl setup (periodically pauses and unpauses the download)
'''
The purpose of this script is to keep pausing / unpausing
the ygroups download
'''
import argparse
import os
import sys
import time
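The pause/unpause schedule can be reduced to a pure duty-cycle decision, which a `while` loop with `time.sleep` would then act on by POSTing pause or unpause to Heritrix. The cycle lengths here are assumptions, since the gist's actual schedule is not shown:

```python
def duty_cycle(elapsed, run_secs, pause_secs):
    """Decide whether the crawl should be running at a given moment.

    Alternates run_secs of crawling with pause_secs of rest, starting
    with a running phase at elapsed == 0.
    """
    return elapsed % (run_secs + pause_secs) < run_secs
```

Keeping the decision pure makes the scheduler trivial to test without a live crawler.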
shriphani / nabble_scrape_index_pages.py
Created April 25, 2013 10:23
Scrapes the downloaded index pages and sets up the next stage of the crawl
'''
Index pages scraper that goes through and finds the most recent pages
'''
import argparse
import os
import sys
import warc
from BeautifulSoup import BeautifulSoup, SoupStrainer
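The docstring says the scraper "finds the most recent pages" among the downloaded index pages. One way to sketch that step: group paginated URLs by topic and keep the highest page number per topic. The `<topic>/<page>` URL shape is an assumption about Nabble's layout, not taken from the gist:

```python
import re

# Assumed layout: index URLs end in '<topic>/<page-number>'.
PAGE_RE = re.compile(r'^(?P<topic>.+?)/(?P<page>\d+)$')


def most_recent_pages(urls):
    """For each topic, keep only the URL with the highest page number."""
    best = {}
    for url in urls:
        m = PAGE_RE.match(url)
        if not m:
            continue
        topic, page = m.group('topic'), int(m.group('page'))
        if page > best.get(topic, (None, -1))[1]:
            best[topic] = (url, page)
    return sorted(url for url, _ in best.values())
```

In the real script the candidate URLs would be extracted from each WARC record's HTML with BeautifulSoup before this filtering step.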