Nick Doiron mapmeld

## scratch_census_code.py
import json

# Load the reviewed census data, closing the file handle once parsing is done
with open('./census-reviewed.json', 'r') as f:
    j = json.load(f)
headers = None
total_vars = {
    'P1_047N': 0,
    'P1_063N': 0,
    'P1_070N': 0,
    'P2_072N': 0,
    'Hisp': 0,

## llama2-langchain.py
# This should run on a GPU Colab notebook
# pip install langchain xformers transformers datasets bitsandbytes accelerate --quiet
# Request access to the meta-llama models, accept the license, and create a read token

hf_auth = '######'

from langchain.chains import ConversationChain
from langchain.llms import HuggingFacePipeline
from langchain.memory import ConversationSummaryBufferMemory
from langchain.prompts.prompt import PromptTemplate

## kanji.md

      
mapmeld / kanji.md
Created July 25, 2023 20:41
Chinese characters example

累令直漢刃

  
## chatgpt-on-gptnyc.md

      
mapmeld / chatgpt-on-gptnyc.md
Created March 12, 2023 02:21
ChatGPT on GPT-NYC type questions

Date: February 25, 2023

Questions in quotes

My comments in bold italics

Hi, I'm going to ask some questions about New York City as a new visitor, and you should respond as an expert resident.

Sure, I'm happy to help! What would you like to know about New York City?

  
## example.py
# All I'm looking for on an ML example:
# ! pip install name_of_library

from name_of_library import model, other_stuff

tdata = load_data_from_file() # not a built-in datasets source where I'd need to write python to add data
tdata.apply(changes) # whose dataset is so perfect we don't edit it

model.train(tdata, **explained_params)

## patching_models_bigsci_proposal.md

      
mapmeld / patching_models_bigsci_proposal.md
Last active December 14, 2021 03:11
Patching Models BigSci Proposal

Patching Models with New Words, People, and Events

May 6 - June 15, 2021

Scope

Once a large pre-trained language model is published, it remains a snapshot of language as it was when its corpus was collected. What are ways to update models to support new or newly frequent terms (BIPOC), phrasing (social distancing), or people and events (Fyre Festival)? What are reliable, low-cost ways to test and benchmark these update methods?
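
One low-cost starting point for the "newly frequent terms" question might be to compare word rates between an older and a newer text sample. The sketch below is only an illustration under that assumption, not the method the proposal settled on; `newly_frequent_terms`, `min_count`, and `ratio` are hypothetical names, and it splits on whitespace, so multi-word phrases such as "social distancing" surface only as their individual words.

```python
from collections import Counter


def newly_frequent_terms(old_texts, new_texts, min_count=2, ratio=2.0):
    """Return words whose relative frequency grew by at least `ratio`.

    Hypothetical helper for illustration; thresholds are arbitrary.
    """
    old_counts = Counter(w for t in old_texts for w in t.lower().split())
    new_counts = Counter(w for t in new_texts for w in t.lower().split())
    old_total = sum(old_counts.values()) or 1
    new_total = sum(new_counts.values()) or 1

    flagged = []
    for term, count in new_counts.items():
        if count < min_count:
            continue  # ignore words that are rare in the new sample too
        # Add-one smoothing so words unseen in the old sample
        # do not cause a division by zero.
        old_rate = (old_counts[term] + 1) / old_total
        new_rate = (count + 1) / new_total
        if new_rate / old_rate >= ratio:
            flagged.append(term)
    return sorted(flagged)


old_sample = ["we kept our distance at the party", "the festival was fun"]
new_sample = [
    "social distancing rules apply",
    "keep social distancing in mind",
    "distancing is required",
]
print(newly_frequent_terms(old_sample, new_sample))
```

Words flagged this way could then be used as probes when checking whether an updated model handles them better than the original snapshot.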
Current status


## Vanguard-Sortfix.js
/*
  Generally, don't run random JS in your browser console, especially on financial sites, but here we are
  By default this sorts by Percent Change. If you uncomment the next line it sorts by myDelta (price x your shares)
  Caveats:
  - I'm not affiliated with Vanguard or any licensed financial advisor or tax preparer. I don't have a clue what's going on with your finances.
  - The script assumes you did NOT trade today; it uses today's change and current shares
  - Delta-sort does not handle penny stocks as well because the UI says 0.01 and we reverse-engineer from current balance
*/

let sortRule = 'pct';

## add_data_task.py
import functools

import t5

t5.data.TaskRegistry.add(
      "byt5_ex",
      t5.data.TextLineTask,
      split_to_filepattern={
            "train": "gs://BUCKET/train_lines.txt",
            "validation": "gs://BUCKET/validation_lines.txt",
        },
      text_preprocessor=[
        functools.partial(
          t5.data.preprocessors.parse_tsv,

## bb.md

      
mapmeld / bb.md
Last active January 4, 2021 16:01
Bangla Benchmark runs

Code: https://colab.research.google.com/drive/1vltPI81atzRvlALv4eCvEB0KdFoEaCOb?usp=sharing
Can these scores be improved? YES!
Options include rerunning with more training data, training for more epochs, or using other libraries to set a learning rate and other hyperparameters before training.

Experimenting with epochs: when I doubled the number of epochs, MuRIL improved only slightly (69.5->69.7 on one task).

The point of a benchmark is to run these models through a reasonable and identical process;
you can tweak hyperparameters on any model to improve results.
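
The "reasonable and identical process" idea can be sketched as a loop that hands every model the same hyperparameters. This is a minimal illustration, not the code in the linked notebook; `run_benchmark`, `majority_baseline`, and the config keys are hypothetical stand-ins for real training functions and settings.

```python
from collections import Counter


def run_benchmark(models, tasks, config):
    """Score every model on every task using one shared config."""
    scores = {}
    for model_name, train_and_eval in models.items():
        for task_name, dataset in tasks.items():
            # The same config is passed to every model, so no model
            # gets privately tuned hyperparameters.
            scores[(model_name, task_name)] = train_and_eval(dataset, **config)
    return scores


def majority_baseline(dataset, epochs, learning_rate):
    """Stand-in 'model': accuracy of always guessing the most common label.

    It ignores the hyperparameters; a real entry would train with them.
    """
    labels = [label for _, label in dataset]
    return Counter(labels).most_common(1)[0][1] / len(labels)


tasks = {"sentiment": [("text a", 1), ("text b", 1), ("text c", 0), ("text d", 1)]}
models = {"baseline": majority_baseline}
shared_config = {"epochs": 3, "learning_rate": 2e-5}

print(run_benchmark(models, tasks, shared_config))
```

Changing `shared_config` changes the settings for every model at once, which keeps the comparison fair even if the absolute scores move.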

  
## twiml-lightning-share.md

      
              1 file
            
          
              0 forks
            
          
              0 comments
            
          
              1 star
            
          
                mapmeld
                / twiml-lightning-share.md
            
            
              Last active
              October 22, 2020 15:38
            
              
                twiml-lightning-share
              
          
    Measuring Gender Bias in Spanish Language Models

Presenter

Nick Doiron, Tufts University / Independent Research
GitHub: https://github.com/mapmeld ; LinkedIn: https://www.linkedin.com/in/nickdoiron/
Context