Vicki Boykis veekaybee

## simpleScaldingJob.sc
import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {

val lines = TypedPipe.from(TextLine("posts.txt"))

lines.flatMap { line => tokenize(line) }
    .groupBy { word => word }
    .size
    .groupAll

## readme.md

      
              2 files
            
          
              0 forks
            
          
              0 comments
            
          
              0 stars
            
          
                veekaybee
                / readme.md
            
            
              Created
              October 30, 2022 15:20
            
          
    To run:
dot -Tpng trie.dot -o trie.png

  
## pyscript.html
<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>Some plotting</title>
    <link rel="stylesheet" href="https://pyscript.net/alpha/pyscript.css" />
    <script defer src="https://pyscript.net/alpha/pyscript.js"></script>
    <py-env>

## chatgpt.md

      
              1 file
            
          
              38 forks
            
          
              4 comments
            
          
              338 stars
            
          
                veekaybee
                / chatgpt.md
            
            
              Last active
              May 18, 2024 11:41
            
              
                Everything I understand about chatgpt
              
          
    ChatGPT Resources

Context

ChatGPT appeared like an explosion on all my social media timelines in early December 2022. While I keep up with machine learning as an industry, I wasn't focused so much on this particular corner, and all the screenshots seemed like they came out of nowhere. What was this model? How did the chat prompting work? What was the context of OpenAI doing this work and collecting my prompts for training data?
I decided to do a quick investigation. Here's all the information I've found so far. I'm aggregating and synthesizing it as I go, so it's currently changing pretty frequently.
Model Architecture


## searchrecs.md

      
              1 file
            
          
              1 fork
            
          
              0 comments
            
          
              21 stars
            
          
                veekaybee
                / searchrecs.md
            
            
              Last active
              January 22, 2024 13:53
            
              
                Understanding search and recommendations
              
          
    How are search and recommendations the same, and how are they different?


Question on Mastodon
Question on Twitter

TL;DR:

The design of both search and recommendations is to find and filter information
Search is a "recommendation with a null query"
Search is "I want this", recommendations is "you might like this"


## README.md

      
              3 files
            
          
              3 forks
            
          
              0 comments
            
          
              5 stars
            
          
                veekaybee
                / README.md
            
            
              Last active
              January 7, 2024 18:58
            
              
                whisper.ipynb
              
          
    Using Whisper to transcribe audio

This episode of Recsperts was transcribed with Whisper from OpenAI, an open-source neural net trained on almost 700 hours of audio. The model includes an encoder-decoder architecture by tokenizing audio into 30-second chunks, normalizing audio samples to the log-Mel scale, and passing the data into an encoder. A decoder is trained to predict the captioned text matching the encoder, and the model includes transcription, as well as timestamp-aligned transcription, and multilingual translation.

The transcription process outputs a single string file, so it's up to the end-user to parse out individual speakers, or run the model [through a sec

  
## largestreams.md

      
              1 file
            
          
              0 forks
            
          
              0 comments
            
          
              1 star
            
          
                veekaybee
                / largestreams.md
            
            
              Last active
              August 9, 2023 01:34
            
              
                Counting cumulative elements in large streams
              
          
    Counting cumulative elements in large streams

An interview problem that I've gotten fairly often is, "Given a stream of elements, how do you get the median, or average, or sum of the elements in the stream?"
I've thought about this problem a lot and my naive implementation was to put the elements in a hashmap (dictionary) and then pass over the hashmap with whatever other function you need.
For example,
import typing

  
## learning_from_data.md

      
              1 file
            
          
              0 forks
            
          
              0 comments
            
          
              1 star
            
          
                veekaybee
                / learning_from_data.md
            
            
              Created
              February 1, 2023 17:16
            
          
    Notes on Learning from Data


Algorithms find the best ways to do things, but they don't explain "how" they came to those conclusions.

This is a common way to formulate ML problems, using target functions that we don't know but we want to approximate and learn.

  
## machine_learning_design_patterns.md

      
              1 file
            
          
              2 forks
            
          
              0 comments
            
          
              54 stars
            
          
                veekaybee
                / machine_learning_design_patterns.md
            
            
              Last active
              May 8, 2024 19:32
            
          
    Machine Learning Design Patterns

This book is all about patterns for doing ML. It's broken up into several key parts, building and serving. Both of these are intertwined so it makes sense to read through the whole thing, there are very many good pieces of advice from seasoned professionals. The parts you can safely ignore relate to anything where they specifically use GCP. The other issue with the book it it's very heavily focused on deep learning cases. Not all modeling problems require these. Regardless, let's dive in. I've included the stuff that was relevant to me in the notes.
Most Interesting Bullets:


Machine learning models are not deterministic, so there are a number of ways we deal with them when building software, including setting random seeds in models during training and allowing for stateless functions, freezing layers, checkpointing, and generally making sure that flows are as reproducible as possib


## isolation_forest.md

      
              1 file
            
          
              1 fork
            
          
              0 comments
            
          
              1 star
            
          
                veekaybee
                / isolation_forest.md
            
            
              Created
              February 2, 2023 02:49
            
          
    Isolation forests versus decision trees

Isolation forest paper


Isolated points should be lower and closer to the root of the tree
	import com.twitter.scalding._

	class WordCountJob(args: Args) extends Job(args) {

	val lines = TypedPipe.from(TextLine("posts.txt"))

	lines.flatMap { line => tokenize(line) }
	.groupBy { word => word }
	.size
	.groupAll
	<!DOCTYPE html>
	<html lang="en">

	<head>
	<meta charset="utf-8">
	<meta name="viewport" content="width=device-width, initial-scale=1">
	<title>Some plotting</title>
	<link rel="stylesheet" href="https://pyscript.net/alpha/pyscript.css" />
	<script defer src="https://pyscript.net/alpha/pyscript.js"></script>
	<py-env>