@mattb
mattb / summarize_youtube.sh
Created January 1, 2024 23:43
Download the automatic subtitle track from Youtube and summarize it with a local LLM
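# Usage: summarize_youtube.sh <video-url>
yt-dlp --skip-download --write-auto-sub --sub-format ttml --sub-lang en -o /tmp/out "$1" &&
(
# printf interprets \n portably; plain echo with single quotes may print it literally
printf 'Summarize the following YouTube transcript:\n\n' > /tmp/out1
# Keep only the text between the first '>' and the next '<' on each <p> line
grep '^<p' /tmp/out.en.ttml | cut -d'>' -f 2 | cut -d'<' -f 1 >> /tmp/out1
ollama run mistral < /tmp/out1
)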
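
The caption-extraction step above can also be sketched in Node. This is a minimal illustration, not part of the gist: `extractCaptions` is a hypothetical helper name, and it assumes each cue in the TTML file is a `<p ...>text</p>` element. It uses a regular expression for brevity rather than a real XML parser, and it does not decode XML entities.

```javascript
// Extract caption text from TTML, assuming each cue is a <p ...>text</p> element.
// extractCaptions is a hypothetical helper, not something the gist defines.
function extractCaptions(ttml) {
  const cues = [];
  const re = /<p\b[^>]*>([\s\S]*?)<\/p>/g;
  let m;
  while ((m = re.exec(ttml)) !== null) {
    const text = m[1]
      .replace(/<br\s*\/?>/g, ' ') // line breaks inside a cue become spaces
      .replace(/<[^>]+>/g, '')     // drop nested markup such as <span>
      .replace(/\s+/g, ' ')
      .trim();
    if (text) cues.push(text);
  }
  return cues;
}

// Made-up sample input for illustration only.
const sample = `<tt><body><div>
<p begin="00:00:00.000" end="00:00:02.000">hello and welcome</p>
<p begin="00:00:02.000" end="00:00:04.000">to the <span>show</span></p>
</div></body></tt>`;

const promptText =
  'Summarize the following YouTube transcript:\n\n' +
  extractCaptions(sample).join('\n');
console.log(promptText);
```

Unlike the `cut` pipeline, this version keeps text that follows a nested tag inside a cue.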
mattb / gist:3888345
Created October 14, 2012 11:53
Some pointers for Natural Language Processing / Machine Learning

Here are the areas I've been researching, some things I've read and some open source packages...

Nearly all text processing starts by transforming text into vectors: http://en.wikipedia.org/wiki/Vector_space_model

The vectors are then often reweighted with a transform such as TF-IDF, which normalises the data and controls for outliers (words that are too frequent or too rare confuse the algorithms): http://en.wikipedia.org/wiki/Tf%E2%80%93idf
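
As a rough illustration of the two steps above, here is a minimal sketch that turns documents into term vectors and weights them with TF-IDF. The tokeniser, the function names and the particular idf formula (`log(N / df)`) are assumptions chosen for brevity; real packages use several variants.

```javascript
// Minimal TF-IDF sketch. The tokeniser and the idf formula log(N / df)
// are illustrative assumptions, not taken from any particular package.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9']+/g) || [];
}

function tfidf(docs) {
  const tokenized = docs.map(tokenize);
  // Document frequency: the number of documents each term appears in.
  const df = new Map();
  for (const tokens of tokenized) {
    for (const term of new Set(tokens)) {
      df.set(term, (df.get(term) || 0) + 1);
    }
  }
  const vocab = [...df.keys()].sort();
  const n = docs.length;
  // One vector per document, one dimension per vocabulary term.
  const vectors = tokenized.map(tokens => {
    const counts = new Map();
    for (const t of tokens) counts.set(t, (counts.get(t) || 0) + 1);
    return vocab.map(term => {
      const tf = (counts.get(term) || 0) / (tokens.length || 1);
      return tf * Math.log(n / df.get(term));
    });
  });
  return { vocab, vectors };
}

const { vocab, vectors } = tfidf([
  'the cat sat on the mat',
  'the dog sat on the log',
  'the cat chased the dog',
]);
vocab.forEach((term, i) =>
  console.log(term, vectors.map(v => v[i].toFixed(3)).join(' ')));
```

A word such as "the" that occurs in every document gets a weight of zero, while a word found in only one document keeps the highest weight per occurrence.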

Collocation detection finds two or more words that occur together more often than their separate frequencies would predict (e.g. "wishy-washy" in English). I use it to group words into n-gram tokens, because many NLP techniques treat each word as independent of all the others in a document and ignore order: http://matpalm.com/blog/2011/10/22/collocations_1/
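
One common way to score candidate collocations is pointwise mutual information (PMI), which compares how often a pair occurs with how often it would occur if the two words were independent. The sketch below is an assumption-laden illustration: PMI is only one of several possible scores (and not necessarily the one used in the linked post), and the `minCount` cutoff and the sample text are made up.

```javascript
// Bigram collocation sketch using pointwise mutual information (PMI).
// PMI and the minCount cutoff are illustrative assumptions.
function collocations(tokens, minCount = 2) {
  const unigrams = new Map();
  const bigrams = new Map();
  tokens.forEach((tok, i) => {
    unigrams.set(tok, (unigrams.get(tok) || 0) + 1);
    if (i + 1 < tokens.length) {
      const key = tok + ' ' + tokens[i + 1];
      bigrams.set(key, (bigrams.get(key) || 0) + 1);
    }
  });
  const n = tokens.length;
  const nBigrams = Math.max(n - 1, 1);
  const scored = [];
  for (const [key, count] of bigrams) {
    if (count < minCount) continue; // rare pairs give unreliable scores
    const [a, b] = key.split(' ');
    const pPair = count / nBigrams;
    const pA = unigrams.get(a) / n;
    const pB = unigrams.get(b) / n;
    scored.push({ bigram: key, count, pmi: Math.log2(pPair / (pA * pB)) });
  }
  // Highest PMI first: pairs that co-occur far more than chance predicts.
  return scored.sort((x, y) => y.pmi - x.pmi);
}

// Made-up sample text for illustration only.
const text =
  'new york is big and new york is busy and the city is new to me and it is old';
const ranked = collocations(text.split(' '));
console.log(ranked);

// The best-scoring pair can then be merged into a single n-gram token.
const merged = text.split(ranked[0].bigram).join(ranked[0].bigram.replace(' ', '_'));
console.log(merged);
```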

mattb / wordle-word-score.js
Created January 24, 2022 07:21
Calculating scores of first word choice in Wordle
// node wordle-word-score.js | sort -n
const fs = require('fs');
const words = [];
const dict = [];
const idx = {};
fs.readFileSync('wordledict.txt', 'utf-8').split(/\r?\n/).forEach(line => dict.push(line));
fs.readFileSync('wordlewords.txt', 'utf-8').split(/\r?\n/).forEach(line => {
  words.push(line);
  line.split("").forEach((letter, i) => {
{
"parserOptions": {
"ecmaVersion": 8
},
"env": {
"jest": true,
"browser": true,
"es6": true,
"node": true
},
description "autossh tunnel"
author "Joni Kähärä"
start on (local-filesystems and net-device-up IFACE=eth0 and net-device-up IFACE=eth1) # assuming we have multiple interfaces
stop on runlevel [016]
respawn
respawn limit 5 60
exec autossh -M 0 -N -R 10000:192.168.1.1:22 -o "ServerAliveInterval 60" -o "ServerAliveCountMax 3" -o "StrictHostKeyChecking=no" -o "BatchMode=yes" -i /home/user/.ssh/id_rsa username@hostname
[error] found : anorm.TupleFlattener[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[Nothing,Nothing],Nothing],Nothing],Nothing],Nothing],Nothing],Nothing],Nothing],Nothing],Nothing] => (Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing)]
[error] required: anorm.TupleFlattener[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[Nothing,Nothing],Nothing],Nothing],Nothing],Nothing],Nothing],Nothing],Nothing],Nothing],T2] => (Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing)]
[error] Note: anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[Nothing,Nothing],Nothing],Nothing],Nothing],Nothing],Nothing],Nothing],Nothing],Nothing],Nothing] => (Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing) >: anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[anorm.~[Nothing
mattb / simplenlg ruby DSL test
Created April 3, 2013 22:24
Notice how it gets all the fiddly grammar, such as tenses and plural agreement, correct.
# a jruby DSL for the Simple Natural Language Generator library - https://code.google.com/p/simplenlg/
realise clause {
subject 'two cats', :plural => true
verb 'live with'
object 'Matt', 'Ariel'
complement preposition_phrase {
complement 'San Francisco'
preposition 'in'
}
mattb / gist:4588259
Created January 21, 2013 18:44
Source code used to produce https://gist.github.com/4588234
ids = [293401191054979074, 293401663698513921, 293401787464040448, 293401934436659200, 293402177140056064, 293402218432983040, 293402263777587202, 293402332853583872, 293402393025077250, 293402466354077698, 293402940637597696, 293403039295995904, 293403262642700288, 293403304124366848, 293403623084404737, 293403854266048512, 293403891633094656, 293404203659964416, 293404415535239168, 293404447453872128, 293404518945783808, 293404594669760512, 293404694401929217, 293405000087003137, 293405024229404673, 293405072929484800, 293405372021108736, 293405406661857281, 293405547682734080, 293405590468845570]
count = ids.size
tweets = ids.map { |i|
  puts count
  count -= 1
  tweet = Twitter.status(i).to_hash
  activity = Twitter.status_activity(i).to_hash
  tweet.merge(activity)
}
open("tweets.json","w" ) { |f|
mattb / gist:4588234
Last active December 11, 2015 10:38
Count of retweets plus favourites of @BarackObama's key phrases from the 2013 US Inauguration (one asterisk = 500 retweets+favourites) as of 10:42am PST on January 21st 2013. Source code: https://gist.github.com/4588259
************************************************* Our journey is not complete until our gay brothers and sisters are treated like anyone else under the law. (6699fav/17845rt)
************************** Our journey is not complete until our wives, our mothers, and daughters can earn a living equal to their efforts. (3059fav/9974rt)
**************** Thank you, God Bless you, and may He forever bless these United States of America. (2191fav/5866rt)
*************** Our journey is not complete until all our children know that they are cared for, and cherished, and always safe from harm. (1743fav/5970rt)
************* Our journey is not complete until we find a better way to welcome the immigrants who still see America as a land of opportunity. (1774fav/5016rt)
************ We have always understood that when times change, so must we. —President Obama (1422fav/5049rt)
************ We, the people, declare today that the most evident of truths—that all of us are created equal—is the star that guides us still. (1418fa
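
The bars above follow the stated rule of one asterisk per 500 combined retweets and favourites, rounded down (for example, 2191 + 5866 = 8057 gives 16 asterisks). A minimal sketch of that rendering step follows. The field names `text`, `fav` and `rt` are hypothetical, and this is not the linked source code, which is written in Ruby.

```javascript
// One asterisk per 500 combined favourites and retweets, rounded down.
// The field names (text, fav, rt) are hypothetical, chosen for this sketch.
function bar(fav, rt, perStar = 500) {
  return '*'.repeat(Math.floor((fav + rt) / perStar));
}

// Two rows copied from the listing above.
const rows = [
  { text: 'Thank you, God Bless you, and may He forever bless these United States of America.', fav: 2191, rt: 5866 },
  { text: 'Our journey is not complete until our wives, our mothers, and daughters can earn a living equal to their efforts.', fav: 3059, rt: 9974 },
];

// Sort by combined count, largest first, then render one line per phrase.
const lines = rows
  .slice()
  .sort((a, b) => (b.fav + b.rt) - (a.fav + a.rt))
  .map(r => `${bar(r.fav, r.rt)} ${r.text} (${r.fav}fav/${r.rt}rt)`);
console.log(lines.join('\n'));
```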
/usr/lib/jvm/java-7-openjdk-i386/bin/java -Xms384m -Xmx384m -Xss256k -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError org.elasticsearch.bootstrap.ElasticSearch