  • Math and Stats Department
  • CSU Sacramento
title={SQL-on-Hadoop: full circle back to shared-nothing database architectures},
author={Floratou, Avrilia and Minhas, Umar Farooq and {\"O}zcan, Fatma},
journal={Proceedings of the VLDB Endowment},
publisher={VLDB Endowment}
Created Oct 15, 2018
Using the W3C CSV standard to see how we would like to extend it to work with statistical data.
date,category,task,complete,notes
2018-10-15,support,type and share meeting notes from Friday with Duncan,0,
2018-10-15,revise,rewrite software alchemy example for clarity following Duncan's feedback,0,
Created Sep 8, 2018
Fun with installing Rtesseract
clark@campus-108-089 ~/dev/Rtesseract (master)
$ ./configure
checking for pkg-config... /usr/local/bin/pkg-config
Package tesseract was not found in the pkg-config search path.
Perhaps you should add the directory containing `tesseract.pc'
to the PKG_CONFIG_PATH environment variable
No package 'tesseract' found
Package tesseract was not found in the pkg-config search path.
Perhaps you should add the directory containing `tesseract.pc'
to the PKG_CONFIG_PATH environment variable
#!/usr/bin/env Rscript
# 2018-06-04 11:26:12
# Automatically generated from R by autoparallel version 0.0.1
nworkers = 2
timeout = 600
Last active Nov 27, 2017
R for loops with possibly difficult vectorization
# Given observations of linear functions f and g at points a and b this
# calculates the integral of f * g from a to b.
# Looks like it will already work as a vectorized function. Sweet!
inner_one_piece = function(a, b, fa, fb, ga, gb)
# Roughly following my notes
fslope = (fb - fa) / (b - a)
gslope = (gb - ga) / (b - a)
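The preview above stops after the slopes are computed. As a cross-check on the same quantity, here is a minimal sketch of a closed form for the integral of the product of two linear functions, written from the endpoint values only. The function name `integrate_linear_product` and the test functions are hypothetical and are not taken from the original gist.

```r
# Hypothetical helper, not the author's implementation.
# On [a, b] write f = fa*(1 - t) + fb*t and g = ga*(1 - t) + gb*t with
# t in [0, 1]. Integrating the product term by term gives
#   (b - a) / 6 * (2*fa*ga + fa*gb + fb*ga + 2*fb*gb)
# Only elementwise arithmetic is used, so vector arguments work too.
integrate_linear_product = function(a, b, fa, fb, ga, gb)
{
    (b - a) / 6 * (2 * fa * ga + fa * gb + fb * ga + 2 * fb * gb)
}

# Compare with numerical integration for two arbitrary linear functions
f = function(x) 1 + 2 * x
g = function(x) 3 - x
closed = integrate_linear_product(0, 2, f(0), f(2), g(0), g(2))
numeric = integrate(function(x) f(x) * g(x), 0, 2)$value
```

Because the formula needs no branching, it can be applied to whole vectors of interval endpoints at once, which matches the remark that the function already works vectorized.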
# Following:
test_func <- function(par_mu, par_sd) {
samp <- rnorm(10^6, par_mu, par_sd)
c(s_mu = mean(samp), s_sd = sd(samp))
#' Split And Append Results To CSV File
#' x will be split by f and each group will be appended to a directory of
#' csv files named according to f
#' @param x data frame to split
#' @param f factor defining splits
#' @param ... further arguments to split
#' @param dirname character directory, will be created if it doesn't exist
#' @return NULL
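The documentation above is cut off before the function body. One way the described behavior could look is sketched below, assuming one file per group named `<group>.csv` and a header written only when a file is first created. The name `split_append_sketch`, the default `dirname`, and the file naming scheme are assumptions made for illustration.

```r
# Hypothetical sketch of the documented behavior, not the original function.
split_append_sketch = function(x, f, ..., dirname = "split_output")
{
    # Create the output directory if it does not exist yet
    dir.create(dirname, showWarnings = FALSE, recursive = TRUE)
    groups = split(x, f, ...)
    for (nm in names(groups)) {
        path = file.path(dirname, paste0(nm, ".csv"))
        new_file = !file.exists(path)
        # Write the header only for a new file; otherwise append rows
        write.table(groups[[nm]], path, append = !new_file, sep = ",",
                    row.names = FALSE, col.names = new_file,
                    qmethod = "double")
    }
    invisible(NULL)
}

# Usage: calling twice appends the same rows a second time
out_dir = file.path(tempdir(), "split_append_demo")
unlink(out_dir, recursive = TRUE)
d = data.frame(g = c("a", "b", "a"), v = 1:3)
split_append_sketch(d, d$g, dirname = out_dir)
split_append_sketch(d, d$g, dirname = out_dir)
```

Appending keeps memory use bounded when `x` arrives in pieces, at the cost of relying on every piece having the same columns in the same order.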
# Mon Aug 28 16:33:46 PDT 2017
# sweep() used to implement scale() is inefficient. Profiling shows that
# only 2% of the time is spent in colMeans. The only other thing to do is
# subtract the mean, which should be fast, but isn't because memory
# layout requires a transpose to use recycling (broadcasting).
# But I don't know how to do any better short of writing in C
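For reference, here is a minimal sketch of three ways to subtract column means in plain R that give the same centered matrix. It makes no claim about which is fastest; that would need profiling on matrices of realistic size. The toy matrix and variable names are illustrative assumptions.

```r
set.seed(1)
x = matrix(rnorm(20), nrow = 5)
mu = colMeans(x)

# 1. sweep(), which scale() uses internally according to the note above
center_sweep = sweep(x, 2L, mu)

# 2. Transpose so that recycling runs along the original columns
center_transpose = t(t(x) - mu)

# 3. Expand the means to match the column-major layout of x
center_rep = x - rep(mu, each = nrow(x))
```

Each variant can be wrapped in `system.time()` on a large matrix to see where the time goes.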
Created Aug 10, 2017
Chunked version of covariance
cov_chunked = function(x, nchunks = 2L)
p = ncol(x)
indices = parallel::splitIndices(p, nchunks)
diagonal_blocks = lapply(indices, function(idx) cov(x[, idx, drop = FALSE]))
upper_right_indices = combn(indices, 2, simplify = FALSE)
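The preview ends before the blocks are assembled. A minimal sketch of how the pieces might be combined follows, assuming each off-diagonal block is computed with `cov(x[, i], x[, j])` and mirrored by transposition. The name `cov_chunked_sketch` and the assembly loop are assumptions, not the original code, and the sketch assumes `nchunks` does not exceed `ncol(x)`.

```r
# Hypothetical completion, not the author's implementation.
cov_chunked_sketch = function(x, nchunks = 2L)
{
    p = ncol(x)
    indices = parallel::splitIndices(p, nchunks)
    out = matrix(NA_real_, p, p)

    # Diagonal blocks: covariance within each chunk of columns
    for (idx in indices) {
        out[idx, idx] = cov(x[, idx, drop = FALSE])
    }

    # Off-diagonal blocks: covariance between each pair of chunks
    if (length(indices) > 1L) {
        for (pair in combn(indices, 2, simplify = FALSE)) {
            i = pair[[1]]
            j = pair[[2]]
            block = cov(x[, i, drop = FALSE], x[, j, drop = FALSE])
            out[i, j] = block
            out[j, i] = t(block)
        }
    }
    out
}

set.seed(2)
x_demo = matrix(rnorm(50), nrow = 10)
chunked = cov_chunked_sketch(x_demo, nchunks = 3L)
```

Each block depends only on the columns it touches, so the two loops are natural candidates for `lapply` or a parallel equivalent.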
# From wlandau
#' Recursively Find Global Variables
#' TODO: Modify this to work without requiring that the code be evaluated
#' Probably means we can't use codetools::findGlobals