Nick Doiron mapmeld

  • Chicago, IL

Releasing Hindi ELECTRA model

This is a first attempt at a Hindi language model trained with Google Research's ELECTRA. I don't modify ELECTRA until the finetuning step, and only then because the train and test file names are hardcoded.

CoLab: https://colab.research.google.com/drive/1R8TciRSM7BONJRBc9CBZbzOmz39FTLl_

Additional background: https://medium.com/@mapmeld/teaching-hindi-to-electra-b11084baab81

It's available on HuggingFace: https://huggingface.co/monsoon-nlp/hindi-bert - sample usage: https://colab.research.google.com/drive/1mSeeSfVSOT7e-dVhPlmSsQRvpn6xC05w

@mapmeld
mapmeld / InstaScan.py
Created November 28, 2012 15:30
Scrape Instagram geo-photos with a certain tag and date range
# InstaScan.py
# prints a CSV of all geolocated Instagram photos with a certain tag between dates
import json
import urllib
import datetime
createdate = datetime.datetime.now()
latestprint = datetime.datetime(2012, 11, 11) # Nov 11, 2012
earliestprint = datetime.datetime(2012, 10, 28) # Oct 28, 2012
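The rest of the script is not shown in this preview. As a minimal sketch of how the two date bounds could be applied, assuming each photo record carries a Unix timestamp string in a `created_time` field (an assumption about the API response), a filter might look like the following. The `in_range` and `to_unix` helpers and the sample records are hypothetical.

```python
import calendar
import datetime

earliestprint = datetime.datetime(2012, 10, 28)  # Oct 28, 2012
latestprint = datetime.datetime(2012, 11, 11)    # Nov 11, 2012

def to_unix(dt):
    """Convert a naive datetime, interpreted as UTC, to a Unix timestamp."""
    return calendar.timegm(dt.timetuple())

def in_range(photo, earliest=earliestprint, latest=latestprint):
    """Return True if the photo's timestamp lies within [earliest, latest]."""
    # created_time is assumed to be a Unix timestamp stored as a string
    created = datetime.datetime.fromtimestamp(
        int(photo["created_time"]), datetime.timezone.utc
    ).replace(tzinfo=None)
    return earliest <= created <= latest

# Hypothetical records for illustration only
sample_inside = {"created_time": str(to_unix(datetime.datetime(2012, 11, 1)))}
sample_outside = {"created_time": str(to_unix(datetime.datetime(2012, 12, 1)))}
kept = [p for p in (sample_inside, sample_outside) if in_range(p)]
```

Comparing in UTC keeps the result independent of the machine's local time zone.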
mapmeld / download_glue_data.py
Last active March 25, 2020 01:59 — forked from W4ngatang/download_glue_data.py
Script for downloading data of the GLUE benchmark (gluebenchmark.com)
''' Script for downloading all GLUE data.
Note: for legal reasons, we are unable to host MRPC.
You can either use the version hosted by the SentEval team, which is already tokenized,
or you can download the original data from (https://download.microsoft.com/download/D/4/6/D46FF87A-F6B9-4252-AA8B-3604ED519838/MSRParaphraseCorpus.msi) and extract the data from it manually.
For Windows users, you can run the .msi file. For Mac and Linux users, consider an external library such as 'cabextract' (see below for an example).
You should then rename and place specific files in a folder (see below for an example).
mkdir MRPC
cabextract MSRParaphraseCorpus.msi -d MRPC
mapmeld / issues.md
Last active February 28, 2020 22:17
Nevada delegate issues

Assuming the final delegate counts and viability number are correct

Unusual

  • Carson City 107: extra delegate, Biden's 2nd
  • Carson City 407: delegate should have been added to Biden, not Klobuchar
  • Clark 1621: needs to add 1 leftover delegate each to Buttigieg and Sanders
  • Clark 1642: unclear, assigned too many delegates instead of a +1 to Sanders
  • Clark 1643: removed Klobuchar's 1 delegate to match expected delegates, even though viable; all had 1 delegate
  • Clark 1645: removed Warren's 1 delegate though viable
mapmeld / calc_districtr_plans.py
Created January 8, 2020 18:00
Count number of saved plans
# calculate number of plans, by state
import json
plans = open('districtr_full_export.json', 'r').read().strip().split("\n")
places = {}
for raw in plans:
    plan = json.loads(raw)
    if ("plan" in plan) and ("placeId" in plan["plan"]):
        place = plan["plan"]["placeId"]
        if place in places:
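The preview cuts off before the counting step. One way to tally plans per `placeId`, sketched here with `collections.Counter` rather than the dictionary the script starts to build, is shown below. The `count_plans_by_place` helper and the sample records are hypothetical. Only the `plan` and `placeId` keys come from the script.

```python
import json
from collections import Counter

def count_plans_by_place(lines):
    """Tally plans per placeId from an iterable of JSON strings, one plan per line."""
    counts = Counter()
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        record = json.loads(raw)
        plan = record.get("plan")
        # Only count records that have a nested plan with a placeId
        if isinstance(plan, dict) and "placeId" in plan:
            counts[plan["placeId"]] += 1
    return counts

# Hypothetical export lines for illustration only
sample = [
    '{"plan": {"placeId": "iowa"}}',
    '{"plan": {"placeId": "iowa"}}',
    '{"plan": {"placeId": "texas"}}',
    '{"other": 1}',
]
counts = count_plans_by_place(sample)
```

`counts.most_common()` then lists the places ordered by number of saved plans.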
mapmeld / 1draft.py
Last active January 5, 2020 21:19
first-draft qa
from allennlp.predictors import Predictor
from transformers.tokenization_gpt2 import GPT2Tokenizer
from transformers import pipeline
class HuggingFacePredictor(Predictor):
    def __init__(self) -> None:
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
        self.model = pipeline('question-answering')

    def predict(self, passage='', question=''):
mapmeld / qa.py
Created January 5, 2020 16:53
Q&A Testing
import json  # needed for json.loads below
from allennlp.predictors.predictor import Predictor
predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/bidaf-elmo-model-2018.11.30-charpad.tar.gz")
qas = open("simplified-nq-test.jsonl").read().split("\n")
for qa in qas:
    rep = json.loads(qa)
    best = rep['long_answer_candidates'][0]
    print(rep['question_text'])
    print('AllenNLP: ')
    print(predictor.predict(
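One detail worth noting: splitting a JSONL file on `"\n"` usually leaves an empty final string, which `json.loads` rejects. A minimal sketch of a reader that skips blank lines follows. The `read_jsonl` helper and the sample record are hypothetical. The `question_text` and `long_answer_candidates` keys are the ones the script reads.

```python
import json

def read_jsonl(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    records = []
    for line in text.split("\n"):
        if line.strip():
            records.append(json.loads(line))
    return records

# Hypothetical one-record file ending in a trailing newline
sample = (
    '{"question_text": "who wrote hamlet", '
    '"long_answer_candidates": [{"start_token": 0, "end_token": 5}]}\n'
)
records = read_jsonl(sample)
for rep in records:
    best = rep["long_answer_candidates"][0]
    print(rep["question_text"], best)
```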
mapmeld / state_specific.py
Created January 2, 2020 15:52
State-specific maps of Native American Communities
from sys import argv
import json
# pip install fiona shapely shapely-geojson
import fiona
from shapely.geometry import shape
from shapely_geojson import dumps
if len(argv) < 2:
    print('usage: gen_map.py "New Mexico" > output.geojson')
mapmeld / 2020_ml.md
Last active December 30, 2019 16:36
2020_ml_problems.md

The number of awesome ML projects is limitless, but this list covers the project ideas that I grouped together as both awesome and seemingly achievable:

Open-ended Datasets

mapmeld / mentionsum.py
Last active December 29, 2019 03:39
mentionsum
import pandas as pd
for lang in ['ar', 'en', 'ru', 'ja', 'tr', 'fa']:
    mentionsum = {}
    for doc in range(1, 10): # ends at 9
        print(doc)
        df = pd.read_csv("saudi_arabia_112019_tweets_csv_hashed_" + str(doc) + ".csv")
        rows = df[df['tweet_language'] == lang][['user_mentions']].values.tolist()
        df = None # clear memory
        for row in rows:
            mentions = row[0].replace('[','').replace(']','').replace('\'','').split(', ')
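The preview ends before the tallying step. As a sketch of one way to finish the idea, assuming each `user_mentions` cell is a Python-style list literal such as `"['123', '456']"` or `"[]"` (an assumption about the CSV format), the cells can be parsed with `ast.literal_eval` instead of string replacement and summed with a `Counter`. The `parse_mentions` and `tally_mentions` helpers and the sample cells are hypothetical.

```python
import ast
from collections import Counter

def parse_mentions(cell):
    """Return the mentioned IDs in one cell as strings, or [] if it cannot be parsed."""
    if not isinstance(cell, str):
        return []  # e.g. NaN for a missing value
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []
    if not isinstance(parsed, list):
        return []
    return [str(m) for m in parsed]

def tally_mentions(cells):
    """Count how often each ID is mentioned across all cells."""
    totals = Counter()
    for cell in cells:
        totals.update(parse_mentions(cell))
    return totals

# Hypothetical cell values for illustration only
sample = ["['123', '456']", "[]", "['123']", float("nan")]
totals = tally_mentions(sample)
```

Unlike the replace-and-split approach, an empty list here contributes no mentions rather than a single empty string.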