Matt Post mjpost

@mjpost
mjpost / doi-2020.emnlp-main.xml
Last active March 4, 2022 15:27
XML file submitted to DOI for EMNLP 2020 main conference papers
<?xml version='1.0' encoding='UTF-8'?>
<doi_batch xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.crossref.org/schema/4.4.1" xsi:schemaLocation="http://www.crossref.org/schema/4.4.1 http://www.crossref.org/schema/deposit/crossref4.4.1.xsd" version="4.4.1">
<head>
<doi_batch_id>1646395517</doi_batch_id>
<timestamp>1646395517</timestamp>
<depositor>
<depositor_name>Matt Post</depositor_name>
<email_address>anthology@aclweb.org</email_address>
</depositor>
<registrant>Association for Computational Linguistics</registrant>
mjpost / get_citation_counts.py
Last active February 14, 2024 12:58
Uses the Semantic Scholar API (with Anthology support!) to get paper citation counts for an Anthology volume
#!/usr/bin/env python3
"""Uses the Semantic Scholar API to get citation counts for all papers in
an ACL volume. Assumes old-style IDs (e.g., P96-1).
Mad props to Semantic Scholar for making this so easy.
"""
import json
import os
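The preview above is cut off after its imports, so the following is only a minimal sketch of how a per-paper lookup could work, not the gist's actual code. It assumes the Semantic Scholar Graph API endpoint `https://api.semanticscholar.org/graph/v1/paper/`, its `ACL:` identifier prefix, and the `citationCount` field; the helper names `citation_url` and `parse_count` are hypothetical.

```python
import json
from urllib.parse import quote

# Assumed endpoint: the Semantic Scholar Graph API paper lookup.
API = "https://api.semanticscholar.org/graph/v1/paper/"


def citation_url(anthology_id):
    """Build the lookup URL for one Anthology paper ID such as P96-1001."""
    return f"{API}ACL:{quote(anthology_id)}?fields=title,citationCount"


def parse_count(body):
    """Pull (title, citation count) out of a JSON response body."""
    record = json.loads(body)
    return record.get("title"), record.get("citationCount")
```

To use it against the live service, the body would come from something like `urllib.request.urlopen(citation_url("P96-1001")).read()`, repeated for each paper ID in the volume.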
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# Copyright 2019--2021 Matt Post <post@cs.jhu.edu>
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#!/usr/bin/env python3
import sys
from sacremoses.normalize import MosesPunctNormalizer
def main(args):
    normalizer = MosesPunctNormalizer(lang=args.lang, penn=args.penn)
    for line in sys.stdin:
#!/usr/bin/env python3
import sys
import sacremoses
def main(args):
    """Tokenizes, preserving tabs"""
    mt = sacremoses.MosesTokenizer(lang=args.lang)
    def tok(s):
#!/usr/bin/env python3
"""
Takes a list of collection IDs as arguments, and outputs a TSV
(name, Anthology ID, paper title) containing every person who
is the first author of a paper and has no other papers in the
Anthology.
Place in acl-anthology/bin and run
mjpost / trim_fairseq_model.py
Created May 15, 2020 14:37
Removes Adam optimizer state from fairseq models, greatly reducing their size
#!/usr/bin/env python3
"""
This is code to take a trained Fairseq model and discard the Adam optimizer state,
which is not needed at test time. It can reduce a model's size by ~70%.
Original author: Brian Thompson
"""
from fairseq import checkpoint_utils
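The ~70% figure is plausible because Adam keeps two moment estimates per parameter, so the optimizer state is roughly twice the size of the weights and dropping it removes about two thirds of the file (assuming everything is stored at the same precision). The preview is truncated, so the following is only a sketch of the idea on a plain dictionary, not the gist's code. The key name `last_optimizer_state` is an assumption about the checkpoint layout, and `strip_optimizer_state` is a hypothetical helper.

```python
def strip_optimizer_state(state):
    """Return a copy of a checkpoint dict without its optimizer state.

    Assumes the optimizer state lives under the key "last_optimizer_state".
    The key is set to None rather than deleted, so code that indexes it
    directly still finds it.
    """
    trimmed = dict(state)  # shallow copy; the model tensors are not duplicated
    trimmed["last_optimizer_state"] = None
    return trimmed
```

With a real checkpoint, the dictionary would typically be read with `torch.load(path, map_location="cpu")` and the trimmed copy written back with `torch.save`.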
mjpost / parallel.sh
Last active October 27, 2022 13:19
Command line use of GNU parallel
# I can never remember syntax for GNU parallel
## Treat each line of STDIN as a command to run, with at most j (here 10) running in parallel
cat commands.txt | parallel -j 10
## Download a long list of files in parallel
cat files.txt | parallel -j 10 wget -q {}
## Start 10 parallel instances of COMMAND with FLAGS. Feed STDIN to them in blocks of roughly 10 MB (--pipe --block-size 10m). Assemble the outputs in input order (-k).
cat large_input.txt | parallel -j 10 --pipe -k --block-size 10m COMMAND FLAGS > output.txt
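For readers who want to see what the `--pipe -k` pattern amounts to, here is a rough Python analogue. It is a sketch under simplifying assumptions: blocks are cut by line count rather than by bytes, and workers are threads, which only helps when `worker` waits on I/O or a subprocess. The names `blocks` and `process_in_blocks` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def blocks(lines, block_lines):
    """Yield consecutive lists of at most block_lines lines."""
    it = iter(lines)
    while True:
        block = list(islice(it, block_lines))
        if not block:
            return
        yield block


def process_in_blocks(lines, worker, jobs=10, block_lines=1000):
    """Run worker on each block with up to `jobs` threads.

    pool.map returns results in submission order, which plays the
    role of -k: output order matches input order.
    """
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        for out_block in pool.map(worker, blocks(lines, block_lines)):
            yield from out_block
```

For example, `process_in_blocks(open("large_input.txt"), worker)` would stream the file through `worker` one block at a time, where `worker` takes a list of lines and returns a list of lines.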
mjpost / regenerate_ics.py
Created August 20, 2015 13:23
Rebuilds Apple Calendar *.ics files so they can be safely reimported
#!/usr/bin/env python
"""
Looks at all the *.ics files in the current directory, removes the X- keys,
and generates a new UUID. This is used for restoring an accidentally-deleted
calendar in Apple's Calendar program; it is a rewrite of the node.js version
that is linked to from here:
http://fokkezb.nl/2015/01/13/how-to-restore-a-deleted-icloud-calendar/
"""
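The two steps the docstring describes (drop the X- keys, assign a fresh UUID) can be sketched as below. This is an illustration, not the gist's code: the function name `rebuild_ics` is hypothetical, it assumes the identifier sits on a line starting with `UID:`, and it treats lines beginning with a space or tab as folded continuations of the previous property.

```python
import uuid


def rebuild_ics(text):
    """Drop X- properties and replace each UID with a new UUID."""
    out = []
    dropping = False  # True while skipping continuations of a dropped line
    for line in text.splitlines():
        if line.startswith((" ", "\t")):
            # Folded continuation: keep it only if its parent line was kept.
            if not dropping:
                out.append(line)
            continue
        upper = line.upper()
        if upper.startswith("X-"):
            dropping = True
            continue
        if upper.startswith("UID:"):
            out.append("UID:" + str(uuid.uuid4()).upper())
            dropping = True  # discard any continuation of the old UID
            continue
        dropping = False
        out.append(line)
    return "\r\n".join(out) + "\r\n"
```

Applied to every `*.ics` file in a directory, each file would be read as text, passed through `rebuild_ics`, and written back out before reimporting.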
mjpost / unicode_header.py
Last active September 10, 2015 13:33
Standard Python header
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python *sucks* at UTF-8 (don't tell me "It's fixed in Python 3"; I don't care, plus no one uses Python 3)
# If you put this at the top of every Python script, however, it gets rid of most of the headaches dealing with STDIN
# and STDOUT (basically, akin to "perl -C31"). I don't know if it's all necessary; I just know that if I put it at
# the top of my scripts, most of the problems go away, and I can stop thinking about it.
import sys
import codecs
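The preview stops at the imports, so the rest of the header is not shown. A header like this usually wraps the standard streams in UTF-8 readers and writers from `codecs`; the sketch below shows that idea on explicit byte streams. It is an assumption about what the header does, written in Python 3 syntax, and the helper name `utf8_wrap` is hypothetical.

```python
import codecs


def utf8_wrap(raw_in, raw_out):
    """Wrap a binary input and output stream so they read and write UTF-8 text."""
    reader = codecs.getreader("utf-8")(raw_in)
    writer = codecs.getwriter("utf-8")(raw_out)
    return reader, writer


# Possible use at the top of a script (assumes the standard streams
# expose a binary .buffer, as they do in a normal Python 3 process):
#   stdin, stdout = utf8_wrap(sys.stdin.buffer, sys.stdout.buffer)
```

Lines read from the wrapped reader are decoded text, and text written to the wrapped writer is encoded as UTF-8 regardless of the locale settings.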