Sean MacAvaney seanmacavaney

import hashlib
import ir_datasets
logger = ir_datasets.log.easy()
dochash2ids = {}
for doc in logger.pbar(ir_datasets.load('msmarco-passage-v2').docs):
    h = hashlib.md5(doc.text.encode()).digest()[:8]  # pretty low chance (~0.05%) of any collision for 64-bit hash and 138M passages
    if h not in dochash2ids:
        dochash2ids[h] = []
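
The collision estimate in the comment above can be sanity-checked with the standard birthday-bound approximation. This is a minimal sketch, assuming the truncated MD5 values behave like uniformly random 64-bit hashes and taking the passage count as the rounded 138 million quoted in the comment; the function name `collision_probability` is illustrative and not part of the original snippet.

```python
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of n_items share a hash value.

    Assumes hash values are uniformly distributed over 2**hash_bits outcomes.
    """
    space = 2 ** hash_bits
    # Birthday approximation: p ~= 1 - exp(-n * (n - 1) / (2 * space))
    exponent = n_items * (n_items - 1) / (2 * space)
    # expm1 keeps precision when the exponent is very small
    return -math.expm1(-exponent)


# 138 million is the rounded figure from the comment, not an exact count
p = collision_probability(138_000_000, 64)
print(f"Estimated probability of any collision: {p:.4%}")
```

Under these assumptions the result comes out at roughly 0.05%, in line with the figure in the comment.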
seanmacavaney / instructions.md
Created January 23, 2019 14:00 — forked from armancohan/instructions.md
Instructions for submitting project 1

Instructions for preparing the code submission

Please read the following instructions carefully as you prepare your code submission for the project.

Your program must run from a shell interface (e.g. terminal, command prompt) and must accept arguments. The arguments are the inputs and the configuration for your code. Please follow the instructions below exactly, including the exact name of the module and the arguments. Note that the order of the arguments is important. Your code needs to run regardless of the underlying platform (if you have trouble achieving this, let us know). Please also include in your submission all the resources that your code uses; that is, you also need to submit the data file.
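
As a rough illustration of a program that takes its inputs as ordered command-line arguments, here is a minimal sketch using Python's standard `argparse` module. The argument names `train_path` and `test_path` are hypothetical placeholders chosen for this example; the actual arguments and their order are whatever the assignment specifies.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with positional arguments; their order is significant."""
    parser = argparse.ArgumentParser(
        description="Example command-line entry point (argument names are hypothetical)."
    )
    # Positional arguments are matched in the order they are declared here.
    parser.add_argument("train_path", help="hypothetical: path to the training data file")
    parser.add_argument("test_path", help="hypothetical: path to the test data file")
    return parser


def main(argv: list[str]) -> int:
    """Parse the given argument list and report what was received."""
    args = build_parser().parse_args(argv)
    print(f"training data: {args.train_path}")
    print(f"test data: {args.test_path}")
    return 0


# Example call with an explicit argument list (file names are made up)
exit_code = main(["train.txt", "test.txt"])
```

In a real script, the last line would typically be replaced by a call such as `main(sys.argv[1:])` under an `if __name__ == "__main__":` guard, so that the arguments come from the shell. Because `argparse` and plain file paths behave the same across operating systems, this style also helps with the platform-independence requirement.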

Preprocessing and classification

You need to submit an executable file. Your module should follow this naming convention: nb (short for Naive Bayes). For example, if you are using Python, your module should be named nb.py; in the case of Java, nb.jar (a runnable jar),

Keybase proof

I hereby claim:

  • I am seanmacavaney on github.
  • I am macavaney (https://keybase.io/macavaney) on keybase.
  • I have a public key ASATHrSdYaeui-Nb3niv2fDJrWDjUoc9TBE3W1lekbPPMwo

To claim this, I am signing this object: