Nick Doiron mapmeld

## Baby-Hindi-Model.md

      
Last active: April 26, 2020 20:44

Releasing Hindi ELECTRA model

This is a first attempt at a Hindi language model trained with Google Research's ELECTRA. I don't modify ELECTRA until we get into finetuning, and only then because it has hardcoded train and test files.

CoLab: https://colab.research.google.com/drive/1R8TciRSM7BONJRBc9CBZbzOmz39FTLl_

Additional background: https://medium.com/@mapmeld/teaching-hindi-to-electra-b11084baab81

It's available on HuggingFace: https://huggingface.co/monsoon-nlp/hindi-bert (sample usage: https://colab.research.google.com/drive/1mSeeSfVSOT7e-dVhPlmSsQRvpn6xC05w)

  
## InstaScan.py
# InstaScan.py
# prints a CSV of all geolocated Instagram photos with a certain tag between dates

import json
import urllib
import datetime

createdate = datetime.datetime.now()
latestprint = datetime.datetime(2012, 11, 11) # Nov 11, 2012
earliestprint = datetime.datetime(2012, 10, 28) # Oct 28, 2012

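The two bounds above define the date window the script reports on. A minimal sketch of how a photo's timestamp could be tested against that window follows. It assumes each photo carries a Unix timestamp in seconds (UTC) and that the window is half-open; the helper name `in_window` is hypothetical and not part of the original script.

```python
import datetime

# Bounds copied from the script above (naive datetimes, treated here as UTC).
latestprint = datetime.datetime(2012, 11, 11)    # Nov 11, 2012
earliestprint = datetime.datetime(2012, 10, 28)  # Oct 28, 2012


def in_window(created_time, earliest=earliestprint, latest=latestprint):
    """Return True if a Unix timestamp falls in [earliest, latest).

    Assumption: created_time is seconds since the epoch, as an int or a
    numeric string. The half-open interval is also an assumption.
    """
    taken = datetime.datetime.fromtimestamp(
        int(created_time), tz=datetime.timezone.utc
    ).replace(tzinfo=None)  # drop tzinfo so it compares with the naive bounds
    return earliest <= taken < latest


# Build a sample timestamp for Nov 1, 2012 (UTC) and check it.
sample = int(
    datetime.datetime(2012, 11, 1, tzinfo=datetime.timezone.utc).timestamp()
)
print(in_window(sample))
```

Converting to UTC before comparing keeps the result independent of the machine's local time zone.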
## download_glue_data.py
''' Script for downloading all GLUE data.

Note: for legal reasons, we are unable to host MRPC.
You can either use the version hosted by the SentEval team, which is already tokenized,
or you can download the original data from (https://download.microsoft.com/download/D/4/6/D46FF87A-F6B9-4252-AA8B-3604ED519838/MSRParaphraseCorpus.msi) and extract the data from it manually.
For Windows users, you can run the .msi file. For Mac and Linux users, consider an external library such as 'cabextract' (see below for an example).
You should then rename and place specific files in a folder (see below for an example).

mkdir MRPC
cabextract MSRParaphraseCorpus.msi -d MRPC

## issues.md

      
Last active: February 28, 2020 22:17

Nevada delegate issues

Assuming the final delegate counts and viability number are correct
Unusual


Carson City 107: extra delegate, Biden's 2nd
Carson City 407: delegate should have been added to Biden, not Klobuchar
Clark 1621: needs to add 1 leftover delegate each to Buttigieg and Sanders
Clark 1642: unclear, assigned too many delegates instead of a +1 to Sanders
Clark 1643: removed Klobuchar's 1 delegate to match expected delegates, even though viable; all had 1 delegate
Clark 1645: removed Warren's 1 delegate though viable
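The entries above turn on how leftover or surplus delegates are assigned after rounding. The sketch below illustrates that kind of arithmetic for a single precinct. Everything in it is an assumption for illustration: the round-half-up rule, the tie-break rules, the "never remove a group's only delegate" rule, and the function name `allocate` are not taken from this note or from the official caucus rules.

```python
import math
from fractions import Fraction


def allocate(counts, precinct_delegates):
    """Split precinct_delegates among viable groups (illustrative rules only).

    counts maps each viable group to its number of supporters.
    """
    total = sum(counts.values())
    # Exact share of delegates each group would get before rounding.
    raw = {g: Fraction(n * precinct_delegates, total) for g, n in counts.items()}
    # Assumed rule: round each share half up.
    alloc = {g: math.floor(q + Fraction(1, 2)) for g, q in raw.items()}

    # Too few assigned: give a leftover delegate to the group that was
    # rounded down by the largest amount.
    while sum(alloc.values()) < precinct_delegates:
        g = max(raw, key=lambda k: raw[k] - alloc[k])
        alloc[g] += 1

    # Too many assigned: take one from the group that was rounded up by the
    # largest amount, but (assumed rule) never a group's only delegate.
    while sum(alloc.values()) > precinct_delegates:
        candidates = [g for g in alloc if alloc[g] > 1]
        if not candidates:
            break
        g = min(candidates, key=lambda k: raw[k] - alloc[k])
        alloc[g] -= 1
    return alloc


# Shares 2.4 / 1.3 / 1.3 round to 2 + 1 + 1 = 4, so one delegate is left over.
print(allocate({"A": 48, "B": 26, "C": 26}, 5))
# Shares 2.7 / 1.8 / 1.5 round to 3 + 2 + 2 = 7, so one delegate is removed.
print(allocate({"A": 45, "B": 30, "C": 25}, 6))
```

Using `Fraction` avoids floating-point surprises when a share lands exactly on a half.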


## calc_districtr_plans.py
# calculate number of plans, by state
import json

plans = open('districtr_full_export.json', 'r').read().strip().split("\n")
places = {}
for raw in plans:
    plan = json.loads(raw)
    if ("plan" in plan) and ("placeId" in plan["plan"]):
        place = plan["plan"]["placeId"]
        if place in places:

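The preview above is cut off inside the loop. As a rough sketch of the same idea, the tally can be written with `collections.Counter`, assuming one JSON record per line with the `plan` and `placeId` keys shown above. The function name `count_plans_by_place` and the sample records are hypothetical; this is not the remainder of the original script.

```python
import json
from collections import Counter


def count_plans_by_place(lines):
    """Count records per plan["placeId"], skipping blanks and records without one."""
    counts = Counter()
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines in the export
        record = json.loads(raw)
        plan = record.get("plan")
        if isinstance(plan, dict) and "placeId" in plan:
            counts[plan["placeId"]] += 1
    return counts


# Made-up records standing in for lines of the export file.
sample = [
    '{"plan": {"placeId": "new_mexico"}}',
    '{"plan": {"placeId": "iowa"}}',
    '{"plan": {"placeId": "new_mexico"}}',
    '{"other": true}',
]
for place, n in count_plans_by_place(sample).most_common():
    print(place, n)
```

Passing an open file object instead of a list would stream the export line by line rather than reading it all into memory.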
## 1draft.py
from allennlp.predictors import Predictor
# import from the package top level; the tokenization_gpt2 module path has moved between transformers releases
from transformers import GPT2Tokenizer, pipeline

class HuggingFacePredictor(Predictor):
    def __init__(self) -> None:
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
        self.model = pipeline('question-answering')

    def predict(self, passage='', question=''):

## qa.py
import json

from allennlp.predictors.predictor import Predictor
predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/bidaf-elmo-model-2018.11.30-charpad.tar.gz")

# one JSON record per line; strip() drops a trailing blank line that json.loads would reject
qas = open("simplified-nq-test.jsonl").read().strip().split("\n")
for qa in qas:
  rep = json.loads(qa)
  best = rep['long_answer_candidates'][0]
  print(rep['question_text'])
  print('AllenNLP: ')
  print(predictor.predict(

## state_specific.py
from sys import argv
import json

# pip install fiona shapely shapely-geojson
import fiona
from shapely.geometry import shape
from shapely_geojson import dumps

if len(argv) < 2:
    print('usage: gen_map.py "New Mexico" > output.geojson')

## 2020_ml.md

      
Last active: December 30, 2019 16:36

2020_ml_problems.md

The number of awesome ML projects is limitless, but this list collects project ideas that I grouped together as both awesome and seemingly achievable:
Open-ended Datasets


Twitter disinformation datasets https://about.twitter.com/en_us/values/elections-integrity.html#data
DuoLingo language development - https://research.duolingo.com/
YouTube recommendations https://github.com/markledwich2/YouTubeNetworks
fake news dataset https://github.com/jgolbeck/fakenews
https://factordaily.com/indigenous-datasets-from-india/ - do MNIST in different languages


## mentionsum.py
import pandas as pd
for lang in ['ar', 'en', 'ru', 'ja', 'tr', 'fa']:
    mentionsum = {}
    for doc in range(1, 10): # ends at 9
        print(doc)
        df = pd.read_csv("saudi_arabia_112019_tweets_csv_hashed_" + str(doc) + ".csv")
        rows = df[df['tweet_language'] == lang][['user_mentions']].values.tolist()
        df = None # clear memory
        for row in rows:
            mentions = row[0].replace('[','').replace(']','').replace('\'','').split(', ')
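The last line above parses a stringified list such as `['123', '456']` by stripping brackets and quotes. One detail to watch: an empty list `[]` turns into a single empty-string mention. The sketch below shows the same parsing with that case filtered out, plus a tally. The helper names `parse_mentions` and `tally` are hypothetical, and the sample cells are made up.

```python
from collections import Counter


def parse_mentions(cell):
    """Parse a cell like "['123', '456']" into a list of mention IDs."""
    if not isinstance(cell, str):
        return []  # e.g. NaN from pandas for a missing value
    cleaned = cell.replace("[", "").replace("]", "").replace("'", "")
    # "[]" cleans to "", and "".split(", ") gives [""], so drop empty entries.
    return [m for m in cleaned.split(", ") if m]


def tally(cells):
    """Count how often each mention appears across all cells."""
    counts = Counter()
    for cell in cells:
        counts.update(parse_mentions(cell))
    return counts


# Made-up cells standing in for the user_mentions column.
cells = ["['123', '456']", "['123']", "[]", float("nan")]
for mention, n in tally(cells).most_common():
    print(mention, n)
```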