Ted Dunning tdunning

The idea of content-based recommendation is that instead of looking purely at a history of
how users interact with items, where both users and items are treated as opaque objects we
know nothing about (other than their interactions), we can consider the features of the
items themselves. By content, we might mean actual textual descriptions, but also more
structured information about the objects, such as their color or whether they are shoes,
books, or music.
If we look at the content associated with items, we can restate the user x item history as
a user x content-feature history. That is to say, we can look at which content features our
users interacted with rather than which items. Essentially, we are recommending features
rather than items.
PRAGMA version;
CREATE TABLE distributors (
    did integer CHECK (did > 100),
    name varchar(40)
);
INSERT INTO distributors VALUES (200, 'a');
INSERT INTO distributors VALUES (201, 'b');
SELECT min(COLUMNS('d.*')) FROM distributors;
tdunning / csv-sample.csv
Created January 23, 2022 21:11
random data in CSV form
x1,x2,x3
0.7231422916301575,0.819657781416707,0.6567508886461839
0.4020425739176958,0.1549076251851813,0.4282647678658029
0.4629109586444531,0.9094363294197141,0.1236688659876839
0.747467460858015,0.2428975528400832,0.6360313817514556
julia> A = 3 # this is Latin capital A
3
julia> Α = 4 # this is Greek capital Alpha, typed \Alpha<tab>
4
julia> Α == A # they look alike but are different variables
false
julia> x′ = rand(2,2) # this is x\prime
library(dplyr)
data = read.csv('median-error.csv')
png("max-error-uniform.png", width=1200, height=1000, pointsize=25)
i = -3.8
boxplot(abs(error) ~ delta, (data %>% filter(n0==20)), ylim=c(0, 0.05), xlim=c(0.6,4.4), boxwex=0.1, at=(1:4)+i/11, xaxt='n', xlab=expression(delta), cex.lab=1.4)
axis(side=1, at=1:4, labels=c(50,100,200,500))
for (nx in c(20, 50, 100, 1000, 10000, 100000)) {
tdunning / figure.r
Created June 18, 2021 07:30
Snippet of R to recreate an analysis of t-digest interpolation on real data
# Analysis of how two t-digests see some sample data
png("figure.png", width=1200, height=1000, pointsize=30)
# the first few actual data points with filler for the remainder
d = c(241, 543, 575, 702, 890, 1530, 1940, 2166, 2168, rep(3000,33))
# the cumulative distribution function
f = ecdf(d)
# plot the actual CDF
plot(x=d, y=f(d), xlim = c(700, 2300), ylim = c(0.08, 0.25), type='s',
xlab="Sample value", ylab="Cumulative Distribution Function",
cex.lab=1.3)
tdunning / lorenz-animator.jl
Created May 4, 2021 00:49
Animates the evolution of an initially tight group of points ... my intro to Julia
using DifferentialEquations
using Plots
using Statistics
using LinearAlgebra
function lorenz!(du, u, p, t)
    x, y, z = u
    σ, ρ, β = p
    # the standard Lorenz equations
    du[1] = σ * (y - x)
    du[2] = x * (ρ - z) - y
    du[3] = x * y - β * z
end
tdunning / shift-detection.r
Last active December 6, 2020 02:15
Sample code that shows how distributional changes in a single tail can be detected accurately using counts targeted at particular parts of a reference dataset
### Draws a figure illustrating change detection in the distribution of synthetic data.
### Each dot represents a single time period with 1000 samples. Before the change,
### the data is sampled from a unit normal distribution. After the change, 20 samples
### in each time period are taken from N(3,1). Comparing counts with a chi^2-style test that
### is robust to small expected counts detects this shift reliably.
### log-likelihood ratio test for multinomial data
H = function(k) {
  # sum of p * log(p) over non-zero counts (zero counts contribute nothing)
  p = k / sum(k)
  sum(ifelse(k == 0, 0, p * log(p)))
}
llr = function(k) {
  2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k)))
}
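The statistic can be checked numerically. Here is a NumPy transcription (a sketch, assuming the conventional H(k) = Σ p·log p entropy term, which makes llr the G-test of independence) applied to small 2x2 count tables:

```python
import numpy as np

def H(k):
    # sum of p * log(p) over non-zero cells (zero cells contribute nothing)
    p = np.asarray(k, dtype=float).ravel() / np.sum(k)
    p = p[p > 0]
    return np.sum(p * np.log(p))

def llr(k):
    # 2 * N * (H(k) - H(row sums) - H(column sums)), the G-test statistic
    k = np.asarray(k, dtype=float)
    return 2 * k.sum() * (H(k) - H(k.sum(axis=1)) - H(k.sum(axis=0)))

# Proportional rows carry no evidence of a distributional shift ...
print(llr(np.array([[10, 20], [20, 40]])))  # essentially zero
# ... while disproportionate counts score higher
print(llr(np.array([[10, 20], [30, 40]])))
```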
tdunning / mcem.r
Last active December 7, 2020 22:59
Implementation of Monte Carlo EM algorithm for reconstructing a standard distribution from censored observations
### This is a demonstration of a Monte Carlo Expectation Maximization
### algorithm that can recover the mean and standard deviation of
### truncated normally distributed data. We get 10,000 samples from
### a unit normal distribution, but every sample below 0.5 is truncated
### to that value. Every sample above 2.5 is truncated to that value.
### These choices were made to get quick and visually appealing convergence,
### but the algorithm still converges for any choice. The convergence
### could be very, very slow if there is little information in the samples
### and the final answer could have substantial uncertainty. For instance,
### if we truncated at 4 and 6, almost all samples would be piled up at
### the lower truncation point.
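A compact Python sketch of the same idea (pure standard library; the thresholds 0.5 and 2.5 match the description above, everything else is illustrative): impute each censored observation with a draw from the current Gaussian restricted to the censored region, then re-estimate the mean and standard deviation from the completed data.

```python
import random
from statistics import NormalDist

rng = random.Random(42)
std = NormalDist()
lo, hi = 0.5, 2.5

# 10,000 unit-normal samples, censored at the two thresholds
data = [min(max(rng.gauss(0, 1), lo), hi) for _ in range(10_000)]

# start from the naive estimates on the censored data
mu = sum(data) / len(data)
sigma = (sum((x - mu) ** 2 for x in data) / len(data)) ** 0.5

for _ in range(100):
    # E-step (Monte Carlo): replace each censored value with a draw from
    # the current N(mu, sigma) restricted to the censored region
    filled = []
    for x in data:
        if x == lo:    # censored below: draw from the left tail
            u = rng.uniform(1e-12, std.cdf((lo - mu) / sigma))
            filled.append(mu + sigma * std.inv_cdf(u))
        elif x == hi:  # censored above: draw from the right tail
            u = rng.uniform(std.cdf((hi - mu) / sigma), 1 - 1e-12)
            filled.append(mu + sigma * std.inv_cdf(u))
        else:
            filled.append(x)
    # M-step: ordinary Gaussian ML estimates on the completed data
    mu = sum(filled) / len(filled)
    sigma = (sum((x - mu) ** 2 for x in filled) / len(filled)) ** 0.5

print(mu, sigma)  # recovered estimates, close to the true values 0 and 1
```

Because a single draw per censored point carries the full conditional variance, the averaged sufficient statistics are unbiased and the iteration settles into a small neighborhood of the maximum-likelihood estimate.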
### This code builds a simple physical model of the range of an 85kWh Tesla Model S and
### compares it to real data. The data here is digitized from
### https://www.tesla.com/blog/model-s-efficiency-and-range
### The model here accounts for aerodynamic drag, viscous drag, constant
### friction and constant power drain
### First the digitized data
x = read.csv(text="v,range
10.22976354700292, 393.9005561997566
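The physical model described in the comments can be sketched as follows. All coefficients below are made-up placeholders for illustration, not the fitted values from this gist: energy use per unit distance has a constant friction term, a viscous term linear in speed, an aerodynamic term quadratic in speed, and a constant-power drain that contributes inversely with speed.

```python
# Sketch of the range model: Wh per mile as a function of speed (mph).
# All coefficients are illustrative placeholders, not fitted values.
BATTERY_WH = 85_000   # 85 kWh pack

C_FRICTION = 120.0    # constant rolling friction, Wh/mile
C_VISCOUS = 2.0       # viscous drag, Wh/mile per mph
C_AERO = 0.03         # aerodynamic drag, Wh/mile per mph^2
P_CONSTANT = 700.0    # constant power drain in watts (electronics, etc.)

def wh_per_mile(v):
    # constant + viscous + aerodynamic + constant-power terms
    return C_FRICTION + C_VISCOUS * v + C_AERO * v ** 2 + P_CONSTANT / v

def range_miles(v):
    return BATTERY_WH / wh_per_mile(v)

for v in (5, 10, 30, 65, 120):
    print(f"{v:4d} mph -> {range_miles(v):6.1f} miles")
```

The shape matters more than the numbers: range peaks at a moderate speed because the constant-power term dominates at a crawl while aerodynamic drag dominates at highway speed.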