Skip to content

Instantly share code, notes, and snippets.

What the BookCorpus?

So in the midst of all these Sesame Streets characters and robots transforming automobile era of "contextualize" language models, there is this "Toronto Book Corpus" that points to this kinda recently influential paper:

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. "Aligning books and movies: Towards story-like visual explanations by watching movies and reading books." In Proceedings of the IEEE international conference on computer vision, pp. 19-27.

Why do I even care, there's no translations there?

Some might know my personal pet peeve on collecting translation datasets but this BookCorpus has no translations, so why do I even care about it?

@alvations
alvations / voices.txt
Created May 20, 2023 05:08 — forked from mculp/voices.txt
List of voices available by the `say` command on OS X
Agnes en_US # Isn't it nice to have a computer that will talk to you?
Albert en_US # I have a frog in my throat. No, I mean a real frog!
Alex en_US # Most people recognize me by my voice.
Alice it_IT # Salve, mi chiamo Alice e sono una voce italiana.
Alva sv_SE # Hej, jag heter Alva. Jag är en svensk röst.
Amelie fr_CA # Bonjour, je m’appelle Amelie. Je suis une voix canadienne.
Anna de_DE # Hallo, ich heiße Anna und ich bin eine deutsche Stimme.
Bad News en_US # The light you see at the end of the tunnel is the headlamp of a fast approaching train.
Bahh en_US # Do not pull the wool over my eyes.
Bells en_US # Time flies when you are having fun.

Getting Stanford NLP and MaltParser to work in NLTK for Windows Users

Firstly, I strongly think that if you're working with NLP/ML/AI related tools, getting things to work on Linux and Mac OS is much easier and save you quite a lot of time.

Disclaimer: I am not affiliated with Continuum (conda), Git, Java, Windows OS or Stanford NLP or MaltParser group. And the steps presented below is how I, IMHO, would setup a Windows computer if I own one.

Please please please understand the solution don't just copy and paste!!! We're not monkeys typing Shakespeare ;P


@alvations
alvations / dispatch_openai_requests.py
Created April 21, 2023 18:19 — forked from neubig/dispatch_openai_requests.py
A simple script to get results from the OpenAI Asynchronous API
import openai
import asyncio
from typing import Any
async def dispatch_openai_requests(
messages_list: list[list[dict[str,Any]]],
model: str,
temperature: float,
max_tokens: int,
top_p: float,

Our dearest script that everyone uses in the modern #neuralempty revolution in machine translation is multi-bleu.perl

But consider this:

alvas@ubi:~/git/mosesdecoder/scripts/generic$ perl multi-bleu.perl 
Use of uninitialized value $ARGV[0] in string eq at multi-bleu.perl line 11.
usage: multi-bleu.pl [-lc] reference < hypothesis
Reads the references from reference or reference0, reference1, ...
alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('hyp.txt', 'w').write('foo bar\n')"

Deep Learning InfraOps without Tears

This is an introduction to managing infrastructure (Infra) and system administration operations (Ops) particularly for deep learning applications.

Why do I have to learn this?

Q: I have my Jupyter notebooks and virtual machines that comes with all batteries included. Why do I have to handle bare-metal machines?

A: You don't (if you are not interested). In most cases, the normal virtual machines with everything included, e.g. AWS Sagemaker or Azures Machine Learning Studio. But these are technically "Managed" Virtual Machines and often fixing something that goes wrong after the VM has been spun up for a couple of months will lead to the same issues that requires the knowledge of infra-ops to handle/fix the issue.

We can't make this file beautiful and searchable because it's too large.
This file has been truncated, but you can view the full file.
https://www.coop.ch/de/navigation/meganav/get?categoryCode=m_0475&_=1629798238733
https://www.amazon.com/dp/B0009R5B3U?th=1
https://play.google.com/_/PlayStoreUi/data/batchexecute?rpcids=qnKhOb&bl=boq_playuiserver_20191117.08_p1&gl=my&hl=ms&authuser&soc-app=121&soc-platform=1&soc-device=1&rt=c
https://www.coop.ch/de/lebensmittel/vorraete/pastasaucen-warme-saucen/saucen-gemischt/c/m_0166
https://www.amazon.com.au/gp/aod/ajax/ref=aod_f_freeShipping?asin=B08CXNTJ89&pageno=1&pc=dp
https://losangeles.craigslist.org/lac/ofc/d/san-diego-market-research-project/7369959933.html
https://www.bestbuy.ca/api/offers/v1/products/10434538/offers
https://www.zillow.com/homes/5266-S-Umatilla-Ave-Boise-ID-83709_rb
https://www.yelp.com/not_recommended_reviews/dannys-rv-repair-williams?not_recommended_start=40

NLTK API to Stanford NLP Tools compiled on 2015-12-09

Stanford NER

With NLTK version 3.1 and Stanford NER tool 2015-12-09, it is possible to hack the StanfordNERTagger._stanford_jar to include other .jar files that are necessary for the new tagger.

First set up the environment variables as per instructed at https://github.com/nltk/nltk/wiki/Installing-Third-Party-Software