Greg Rahn gregrahn

## dist.xml
<!--
  ~ Copyright (c) 2014 Scaling Data. All rights reserved. This gist is licensed under the Apache Software License v2.0.
  -->

<assembly
  xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2 http://maven.apache.org/xsd/assembly-1.1.2.xsd">

  <id>dist</id>

## strip_aac.pl
#!/usr/bin/env perl
use strict;
use File::Find::Rule;
use Capture::Tiny qw(capture);

sub atomic {
    my($file, @cmd) = @_;
    capture {
        system "atomicparsley", $file, @cmd;
    };

## list.md

      
pbailis / list.md
Last active April 15, 2018 08:54
1 file, 10 forks, 6 comments, 49 stars

Quick and dirty (incomplete) list of interesting, mostly recent data warehousing/"big data" papers

A friend asked me for a few pointers to interesting, mostly recent papers on data warehousing and "big data" database systems, with an eye towards real-world deployments. I figured I'd share the list. It's biased and rather incomplete but maybe of interest to someone. While many are obvious choices (I've omitted several, like MapReduce), I think there are a few underappreciated gems.

### Dataflow Engines:
Dryad--general-purpose distributed parallel dataflow engine

http://research.microsoft.com/en-us/projects/dryad/eurosys07.pdf
Spark--in memory dataflow

http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

  
## cornellcs.txt
From bmc Mon Oct  2 15:12:34 2000
Subject: Undergrad systems curriculum
To: faculty@cs.cornell.edu
Date: Mon, 2 Oct 2000 15:12:34 -0700 (PDT)
X-Mailer: ELM [version 2.4ME+ PL31H (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Content-Length:  4065
Status: RO

## atepassar_recommender.py
#-*-coding: utf-8 -*-

'''

    This module represents the FriendsRecommender system for recommending
    new friends based on friendship similarity and state similarity.

'''
__author__ = 'Marcel Caraciolo <caraciol@gmail.com>'

## impala-hll.md

      
avibryant / impala-hll.md
Last active June 22, 2020 06:39
1 file, 1 fork, 2 comments, 14 stars

Recent versions of Cloudera's Impala added NDV, a "number of distinct values" aggregate function that uses the HyperLogLog algorithm to estimate this number, in parallel, in a fixed amount of space.
This can make a really, really big difference: in a large table I tested this on, which had roughly 100M unique values of mycolumn, using NDV(mycolumn) got me an approximate answer in 27 seconds, whereas the exact answer using count(distinct mycolumn) took ... well, I don't know how long, because I got tired of waiting for it after 45 minutes.
It's fun to note, though, that because of another recent addition to Impala's dialect of SQL, the fnv_hash function, you don't actually need to use NDV; instead, you can build HyperLogLog yourself from mathematical primitives.
HyperLogLog hashes each value it sees, and then assigns it to a bucket based on the low-order bits of the hash. It's common to use 1024 buckets, so we can get the bucket by using a bitwise & with 1023:
select
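The bucket assignment described above can also be sketched outside of SQL. The Python below is only an illustration: `fnv1a_64` and `bucket_of` are hypothetical helper names, and the 64-bit FNV-1a hash is an assumed stand-in that is not guaranteed to produce the same values as Impala's `fnv_hash`. It shows that masking with 1023 keeps the low-order 10 bits, which is the same as taking the hash modulo 1024.

```python
# Assumption: 64-bit FNV-1a is used as a stand-in hash; Impala's fnv_hash
# may compute different values.
FNV_OFFSET_BASIS_64 = 0xCBF29CE484222325
FNV_PRIME_64 = 0x100000001B3
NUM_BUCKETS = 1024  # must be a power of two for the mask trick to work


def fnv1a_64(data: bytes) -> int:
    """Hash bytes with 64-bit FNV-1a (hypothetical helper for this sketch)."""
    h = FNV_OFFSET_BASIS_64
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_64) & 0xFFFFFFFFFFFFFFFF  # keep the hash at 64 bits
    return h


def bucket_of(value: str) -> int:
    """Pick a bucket from the low-order 10 bits of the hash."""
    return fnv1a_64(value.encode("utf-8")) & (NUM_BUCKETS - 1)


if __name__ == "__main__":
    for v in ["alice", "bob", "carol"]:
        print(v, bucket_of(v))
```

Because 1024 is a power of two, `h & 1023` and `h % 1024` select the same bucket; the mask is just the cheaper way to write it.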

  
## coronavirus_daily_by_type.R
#---------------- Plotting Daily Cumulative Cases of the Coronavirus----------------
# Install the most up-to-date version of the coronavirus package
# install.packages("devtools")
devtools::install_github("RamiKrispin/coronavirus")
data("coronavirus")


# Reformat and aggregate the data to daily by country and type of case
df_daily <- coronavirus %>%
  dplyr::group_by(date, type) %>%

## curriculum.md

      
hadley / curriculum.md
Created September 27, 2013 20:24
1 file, 6 forks, 11 comments, 33 stars

My first stab at a basic R programming curriculum. I think teaching just these topics without overall motivating examples would be extremely boring, but if you're a self-taught R user, this might be useful to help spot your gaps.

Notes:


I've tried to break this up into separate pieces, but it's not always possible: e.g. knowledge of data structures and subsetting are tightly intertwined.


Level of Bloom's taxonomy listed in square brackets, e.g. http://bit.ly/15gqPEx. Few categories currently assess components higher in the taxonomy.


Programming R curriculum

Data structures


## connecting_to_a_ubiquiti_unifi_vpn_with_a_linux_machine.txt
This guide assumes that you have already set up a Ubiquiti Unifi VPN following the guide:
https://help.ubnt.com/hc/en-us/articles/115005445768-UniFi-L2TP-Remote-Access-VPN-with-USG-as-RADIUS-Server

To configure a Linux machine to be able to connect remotely I followed these steps. This guide was written for Debian 8.

- In Debian install the "xl2tpd" and "strongswan" packages.

- Edit /etc/ipsec.conf to add the connection:

    conn YOURVPNCONNECTIONNAME

## postsql.sql
 -- PostgreSQL 9.2 beta (for the new JSON datatype)
 --   You can actually use an earlier version and a TEXT type too
 -- PL/V8 http://code.google.com/p/plv8js/wiki/PLV8

 -- Inspired by
 -- http://people.planetpostgresql.org/andrew/index.php?/archives/249-Using-PLV8-to-index-JSON.html
 -- http://ssql-pgaustin.herokuapp.com/#1

  -- JSON Types need to be mapped into corresponding PG types
  --