Nick Doiron mapmeld

View face_classifier.py
"""
# BASH dependencies
apt-get install python-opencv ffmpeg
pip install keras numpy shap matplotlib pillow
rm ./drive/My\ Drive/mlin/training/*/*.jpg
rm ./drive/My\ Drive/mlin/validation/*/*.jpg
"""
# native imports
@mapmeld
mapmeld / example.py
Created Dec 16, 2021
How to write an ML example
# All I'm looking for on an ML example:
# ! pip install name_of_library
from name_of_library import model, other_stuff
tdata = load_data_from_file() # not a built-in datasets source where I'd need to write python to add data
tdata.apply(changes) # whose dataset is so perfect we don't edit it
model.train(tdata, **explained_params)
mapmeld / patching_models_bigsci_proposal.md
Last active Dec 14, 2021
Patching Models BigSci Proposal

Patching Models with New Words, People, and Events

May 6 - June 15, 2021

Scope

Once a large pre-trained language model is published, it is a snapshot of language as it stood when its corpus was collected. How can models be updated to support new or newly frequent terms (BIPOC), phrasing (social distancing), or people and events (Fyre Festival)? What are reliable, low-cost ways to test and benchmark these update methods?

Current status

mapmeld / Vanguard-Sortfix.js
Last active Dec 1, 2021
Sort stocks by percent change or my holdings change
/*
Generally, don't run random JS in your browser console, especially on financial sites, but here we are
By default this sorts by Percent Change. If you uncomment the next line it sorts by myDelta (price x your shares)
Caveats:
- I'm not affiliated with Vanguard or any licensed financial advisor or tax preparer. I don't have a clue what's going on with your finances.
- The script assumes you did NOT trade today; it uses today's change and current shares
- Delta-sort does not handle penny stocks as well because the UI says 0.01 and we reverse-engineer from current balance
*/
let sortRule = 'pct';
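The two sort rules described in the comment above can be sketched outside the browser. This is a minimal Python illustration, not the gist's JavaScript: the field names (`pct_change`, `price_change`, `shares`), the descending order, and the sample holdings are all hypothetical, and it assumes "myDelta" means today's per-share price change multiplied by current shares.

```python
def sort_holdings(holdings, sort_rule="pct"):
    """Sort holdings by today's percent change or by dollar change in the position.

    holdings: list of dicts with hypothetical keys
              'symbol', 'pct_change', 'price_change', 'shares'.
    sort_rule: 'pct' (percent change) or 'myDelta' (price change x shares).
    """
    if sort_rule == "pct":
        key = lambda h: h["pct_change"]
    elif sort_rule == "myDelta":
        # Assumes no trades today, so current shares apply to the whole day's change.
        key = lambda h: h["price_change"] * h["shares"]
    else:
        raise ValueError(f"unknown sort rule: {sort_rule!r}")
    # Largest gain first (an assumption; the original ordering is not shown).
    return sorted(holdings, key=key, reverse=True)


# Made-up sample data for illustration only.
sample = [
    {"symbol": "AAA", "pct_change": 2.0, "price_change": 0.10, "shares": 10},
    {"symbol": "BBB", "pct_change": 1.0, "price_change": 1.00, "shares": 5},
]
by_pct = [h["symbol"] for h in sort_holdings(sample, "pct")]
by_delta = [h["symbol"] for h in sort_holdings(sample, "myDelta")]
```

With this sample, the percent-change rule ranks AAA first, while the holdings-change rule ranks BBB first because its position moved by more dollars.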
mapmeld / split-multi.py
Created Dec 29, 2015
Split a GeoJSON MultiPolygon FeatureCollection into GeoJSON Polygons
# split-multi.py
# open source, MIT license
import json
# read and parse the input file; the with-block closes the file handle
with open('multipolygon.geojson', 'r') as f:
    js = f.read()
gj = json.loads(js)
output = { "type": "FeatureCollection", "features": [] }
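The preview above stops before the splitting logic. The following is a minimal sketch of one way to do the split, not the gist's actual code: the function name `split_multipolygons` is hypothetical, and it assumes each MultiPolygon's properties should be copied onto every resulting Polygon feature while all other features pass through unchanged.

```python
import json


def split_multipolygons(collection):
    """Return a new FeatureCollection in which each MultiPolygon feature
    is replaced by one Polygon feature per member polygon."""
    output = {"type": "FeatureCollection", "features": []}
    for feature in collection.get("features", []):
        geometry = feature.get("geometry") or {}
        if geometry.get("type") == "MultiPolygon":
            # A MultiPolygon's coordinates are a list of Polygon coordinate arrays.
            for polygon_coords in geometry["coordinates"]:
                output["features"].append({
                    "type": "Feature",
                    # Copy properties so the new features do not share one dict.
                    "properties": dict(feature.get("properties") or {}),
                    "geometry": {"type": "Polygon", "coordinates": polygon_coords},
                })
        else:
            # Non-MultiPolygon features are kept as they are.
            output["features"].append(feature)
    return output


# Small in-memory example (made-up coordinates) instead of reading a file.
example = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"name": "two islands"},
            "geometry": {
                "type": "MultiPolygon",
                "coordinates": [
                    [[[0, 0], [1, 0], [1, 1], [0, 0]]],
                    [[[2, 2], [3, 2], [3, 3], [2, 2]]],
                ],
            },
        }
    ],
}
result = split_multipolygons(example)
result_json = json.dumps(result)  # ready to write to an output .geojson file
```

Copying the properties onto every part is a design choice; if the parts need distinct identifiers, an index could be added to each copy.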
mapmeld / add_data_task.py
Created Jul 9, 2021
Add text file task to T5
import functools

import t5

# register a text-line task that reads its train/validation splits from a bucket
t5.data.TaskRegistry.add(
    "byt5_ex",
    t5.data.TextLineTask,
    split_to_filepattern={
        "train": "gs://BUCKET/train_lines.txt",
        "validation": "gs://BUCKET/validation_lines.txt",
    },
    text_preprocessor=[
        functools.partial(
            t5.data.preprocessors.parse_tsv,
mapmeld / CensusAPI.txt
Created Aug 2, 2012
Using the Census API
NOTE: This how-to was written for the Census API at http://thedataweb.rm.census.gov/ -- it has since been moved to http://api.census.gov/
Mike Stucka, our contact at the Macon Telegraph, sent us a link to the Census's official API which is launching next month. You can skip ahead to the site - http://www.census.gov/developers/ - and get an API key, but also read my notes after using this yesterday:
1) The datasets
--- The 2010 Census Summary comes from everyone filling out census forms, and you can get stats at state level down to a super-detailed block level. Info from this includes population, age, gender, race, home ownership, members of a household, and various combinations of that. Full list: http://www.census.gov/developers/data/sf1.xml
--- The 2006-2010 American Community Survey is a longer form given to fewer households over 5 years (so its numbers are incompatible with the 2010 Census). You can get stats down only to the block group level. In addition to the standard census stats, you get: educa
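A rough Python sketch of querying the 2010 Census Summary dataset described above. Several things here are assumptions rather than details from this how-to: the endpoint path `https://api.census.gov/data/2010/dec/sf1`, the variable code `P001001` for total population, and the helper names `build_query_url` and `rows_to_dicts`. The sketch only builds the URL and parses a response; it assumes the API returns a JSON array of rows whose first row is the header, and the sample payload is hand-written for illustration.

```python
from urllib.parse import urlencode

# Assumed endpoint for the 2010 Census Summary File 1; check the developer site.
BASE_URL = "https://api.census.gov/data/2010/dec/sf1"


def build_query_url(variables, geography, api_key=None):
    """Build a query URL, e.g. variables=['P001001', 'NAME'], geography='state:*'."""
    params = {"get": ",".join(variables), "for": geography}
    if api_key:
        params["key"] = api_key
    # Keep commas, colons, and asterisks readable instead of percent-encoding them.
    return BASE_URL + "?" + urlencode(params, safe=",:*")


def rows_to_dicts(payload):
    """Convert a header-row-plus-data-rows response into a list of dicts."""
    header, *rows = payload
    return [dict(zip(header, row)) for row in rows]


url = build_query_url(["P001001", "NAME"], "state:*")

# Hand-written sample in the assumed response shape (values arrive as strings).
sample_payload = [
    ["P001001", "NAME", "state"],
    ["4779736", "Alabama", "01"],
]
records = rows_to_dicts(sample_payload)
```

To fetch live data, the URL could be passed to `urllib.request.urlopen` and the body decoded with `json.load`, with the API key from the developer site supplied as `api_key`.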
mapmeld / bb.md
Last active Jan 4, 2021
Bangla Benchmark runs

Code: https://colab.research.google.com/drive/1vltPI81atzRvlALv4eCvEB0KdFoEaCOb?usp=sharing

Can these scores be improved? YES!

Options include rerunning with more training data or more training epochs, or using other libraries to set a learning rate and other hyperparameters before training.

  • Experimenting with epochs: when I doubled the number of epochs, MuRIL improved only slightly (69.5 -> 69.7 on one task)

The point of a benchmark is to run these models through a reasonable and identical process; you can tweak hyperparameters on any model to improve results.

mapmeld / twiml-lightning-share.md
Last active Oct 22, 2020
twiml-lightning-share
mapmeld / OverEncrypt.md
Last active Sep 27, 2020
OverEncrypt - paranoid HTTPS

OverEncrypt

This is a guide that I wrote to improve the default security of my website https://fortran.io , which has a certificate from LetsEncrypt. I'm choosing to improve HTTPS security and transparency without consideration for legacy browser support.

WARNING: if you mess up settings, lose your certificates, or decide to no longer maintain HTTPS certs, these steps can and will make your domain inaccessible.

I would recommend these steps only if you have a specific need for information security, privacy, and trust with your users, and/or maintain a separate secure.example.com domain which won't mess up your main site. If you've been thinking about hosting a site on Tor, then this might be a good option, too.

The best resources that I've found for explaining these steps are https://https.cio.gov , https://certificate-transparency.org , and https://twitter.com/konklone