Ted Dunning tdunning

The idea of content-based recommendation is that instead of looking purely at a history of
how users interact with items, where both users and items are treated as opaque objects we
know nothing about (other than their interactions), we can consider the features of the
items themselves. By content, we might mean actual textual descriptions, but also more
structured information about the objects, such as their color or whether they are shoes,
books, or music.
If we look at the content associated with items, we can restate the user x item history as
a user x content-feature history. That is to say, we can look at which content features our
users interacted with rather than which items. Essentially, we are recommending features
rather than items.
PRAGMA version;
CREATE TABLE distributors (
    did integer CHECK (did > 100),
    name varchar(40)
);
INSERT INTO distributors VALUES (200, 'a');
INSERT INTO distributors VALUES (201, 'b');
SELECT min(COLUMNS('d.*')) FROM distributors;
tdunning / csv-sample.csv
Created January 23, 2022 21:11
random data in CSV form
x1,x2,x3
0.7231422916301575,0.819657781416707,0.6567508886461839
0.4020425739176958,0.1549076251851813,0.4282647678658029
0.4629109586444531,0.9094363294197141,0.1236688659876839
0.747467460858015,0.2428975528400832,0.6360313817514556
julia> A = 3 # this is Latin capital A
3
julia> Α = 4 # this is Greek capital Alpha, typed \Alpha<tab>
4
julia> Α == A # they look alike but are different variables
false
julia> x′ = rand(2,2) # this is x\prime
library(dplyr)
data = read.csv('median-error.csv')
png("max-error-uniform.png", width=1200, height=1000, pointsize=25)
i = -3.8
boxplot(abs(error) ~ delta, (data %>% filter(n0==20)), ylim=c(0, 0.05), xlim=c(0.6,4.4), boxwex=0.1, at=(1:4)+i/11, xaxt='n', xlab=expression(delta), cex.lab=1.4)
axis(side=1, at=1:4, labels=c(50,100,200,500))
for (nx in c(20, 50, 100, 1000, 10000, 100000)) {
tdunning / figure.r
Created June 18, 2021 07:30
Snippet of R to recreate an analysis of t-digest interpolation on real data
# Analysis of how two t-digests see some sample data
png("figure.png", width=1200, height=1000, pointsize=30)
# the first few actual data points with filler for the remainder
d = c(241, 543, 575, 702, 890, 1530, 1940, 2166, 2168, rep(3000,33))
# the cumulative distribution function
f = ecdf(d)
# plot the actual CDF
plot(x=d, y=f(d), xlim = c(700, 2300), ylim = c(0.08, 0.25), type='s',
xlab="Sample value", ylab="Cumulative Distribution Function",
cex.lab=1.3)
tdunning / lorenz-animator.jl
Created May 4, 2021 00:49
Animates the evolution of an initially tight group of points ... my intro to Julia
using DifferentialEquations
using Plots
using Statistics
using LinearAlgebra
function lorenz!(du, u, p, t)
    x, y, z = u
    σ, ρ, β = p
    # the standard Lorenz equations
    du[1] = σ * (y - x)
    du[2] = x * (ρ - z) - y
    du[3] = x * y - β * z
end
tdunning / shift-detection.r
Last active December 6, 2020 02:15
Sample code that shows how distributional changes in a single tail can be detected accurately using counts targeted at particular parts of a reference dataset
### Draws a figure illustrating change detection in the distribution of synthetic data.
### Each dot represents a single time period with 1000 samples. Before the change,
### the data is sampled from a unit normal distribution. After the change, 20 samples
### in each time period are taken from N(3,1). Comparing counts with a chi^2-style test that
### is robust to small expected counts detects this shift reliably.
### log-likelihood ratio test for multinomial data
H = function(k) {
  # sum of p * log(p) over non-zero counts (zero counts contribute nothing)
  p = k / sum(k)
  sum(ifelse(k == 0, 0, p * log(p)))
}
llr = function(k) {
  2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k)))
}
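The statistic can be checked numerically. Here is a NumPy transcription (a sketch, assuming the conventional H(k) = Σ p·log p entropy term, which makes llr the G-test of independence) applied to small 2x2 count tables:

```python
import numpy as np

def H(k):
    # sum of p * log(p) over non-zero cells (zero cells contribute nothing)
    p = np.asarray(k, dtype=float).ravel() / np.sum(k)
    p = p[p > 0]
    return np.sum(p * np.log(p))

def llr(k):
    # 2 * N * (H(k) - H(row sums) - H(column sums)), the G-test statistic
    k = np.asarray(k, dtype=float)
    return 2 * k.sum() * (H(k) - H(k.sum(axis=1)) - H(k.sum(axis=0)))

# Proportional rows carry no evidence of a distributional shift ...
print(llr(np.array([[10, 20], [20, 40]])))  # essentially zero
# ... while disproportionate counts score higher
print(llr(np.array([[10, 20], [30, 40]])))
```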
tdunning / mcem.r
Last active December 7, 2020 22:59
Implementation of Monte Carlo EM algorithm for reconstructing a standard distribution from censored observations
### This is a demonstration of a Monte Carlo Expectation Maximization
### algorithm that can recover the mean and standard deviation of
### truncated normally distributed data. We get 10,000 samples from
### a unit normal distribution, but every sample below 0.5 is truncated
### to that value. Every sample above 2.5 is truncated to that value.
### These choices were made to get quick and visually appealing convergence,
### but the algorithm still converges for any choice. The convergence
### could be very, very slow if there is little information in the samples
### and the final answer could have substantial uncertainty. For instance,
### if we truncated at 4 and 6, almost all samples would be piled up at
### the lower truncation point.
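A compact Python sketch of the same idea (pure standard library; the thresholds 0.5 and 2.5 match the description above, everything else is illustrative): impute each censored observation with a draw from the current Gaussian restricted to the censored region, then re-estimate the mean and standard deviation from the completed data.

```python
import random
from statistics import NormalDist

rng = random.Random(42)
std = NormalDist()
lo, hi = 0.5, 2.5

# 10,000 unit-normal samples, censored at the two thresholds
data = [min(max(rng.gauss(0, 1), lo), hi) for _ in range(10_000)]

# start from the naive estimates on the censored data
mu = sum(data) / len(data)
sigma = (sum((x - mu) ** 2 for x in data) / len(data)) ** 0.5

for _ in range(100):
    # E-step (Monte Carlo): replace each censored value with a draw from
    # the current N(mu, sigma) restricted to the censored region
    filled = []
    for x in data:
        if x == lo:    # censored below: draw from the left tail
            u = rng.uniform(1e-12, std.cdf((lo - mu) / sigma))
            filled.append(mu + sigma * std.inv_cdf(u))
        elif x == hi:  # censored above: draw from the right tail
            u = rng.uniform(std.cdf((hi - mu) / sigma), 1 - 1e-12)
            filled.append(mu + sigma * std.inv_cdf(u))
        else:
            filled.append(x)
    # M-step: ordinary Gaussian ML estimates on the completed data
    mu = sum(filled) / len(filled)
    sigma = (sum((x - mu) ** 2 for x in filled) / len(filled)) ** 0.5

print(mu, sigma)  # recovered estimates, close to the true values 0 and 1
```

Because a single draw per censored point carries the full conditional variance, the averaged sufficient statistics are unbiased and the iteration settles into a small neighborhood of the maximum-likelihood estimate.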
### This code builds a simple physical model of the range of an 85kWh Tesla Model S and
### compares it to real data. The data here is digitized from
### https://www.tesla.com/blog/model-s-efficiency-and-range
### The model here accounts for aerodynamic drag, viscous drag, constant
### friction and constant power drain
### First the digitized data
x = read.csv(text="v,range
10.22976354700292, 393.9005561997566
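The physical model described in the comments can be sketched as follows. All coefficients below are made-up placeholders for illustration, not the fitted values from this gist: energy use per unit distance has a constant friction term, a viscous term linear in speed, an aerodynamic term quadratic in speed, and a constant-power drain that contributes inversely with speed.

```python
# Sketch of the range model: Wh per mile as a function of speed (mph).
# All coefficients are illustrative placeholders, not fitted values.
BATTERY_WH = 85_000   # 85 kWh pack

C_FRICTION = 120.0    # constant rolling friction, Wh/mile
C_VISCOUS = 2.0       # viscous drag, Wh/mile per mph
C_AERO = 0.03         # aerodynamic drag, Wh/mile per mph^2
P_CONSTANT = 700.0    # constant power drain in watts (electronics, etc.)

def wh_per_mile(v):
    # constant + viscous + aerodynamic + constant-power terms
    return C_FRICTION + C_VISCOUS * v + C_AERO * v ** 2 + P_CONSTANT / v

def range_miles(v):
    return BATTERY_WH / wh_per_mile(v)

for v in (5, 10, 30, 65, 120):
    print(f"{v:4d} mph -> {range_miles(v):6.1f} miles")
```

The shape matters more than the numbers: range peaks at a moderate speed because the constant-power term dominates at a crawl while aerodynamic drag dominates at highway speed.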