Dan Osipov danosipov

## TypedDataCube.md

      
johnynek / TypedDataCube.md (1 file, 0 forks, 2 comments, 4 stars)
Last active August 29, 2015 14:04

How to do data cubing in typed scalding?

Suppose you have a key like (page, geo, day) and you want to build rollups (a data cube) so that you can query across all pages, all geos, or all days.
Here is how you do it:
def opts[T](t: T): Seq[Option[T]] = Seq(Some(t), None)

val p: TypedPipe[(String, String, Int)] = ...

p.sumByLocalKeys
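
The preview above is cut off, so the following is only a minimal sketch of the cubing idea, not the gist's actual code. It assumes plain Scala collections in place of `TypedPipe` so that it runs without Scalding on the classpath. The names `DataCubeSketch`, `CubeKey`, and `cube` are hypothetical. Each key component is expanded with `opts` into "this value" and "all values" (`None`), and the counts are then summed per expanded key.

```scala
object DataCubeSketch {
  // Each key component is either kept (Some) or rolled up to "all" (None).
  def opts[T](t: T): Seq[Option[T]] = Seq(Some(t), None)

  // Hypothetical alias for a rolled-up (page, geo, day) key.
  type CubeKey = (Option[String], Option[String], Option[Int])

  // Expand every (page, geo, day) key into its 2^3 = 8 rollup keys,
  // then sum the counts that land on the same rollup key.
  def cube(rows: Seq[((String, String, Int), Long)]): Map[CubeKey, Long] = {
    val expanded = for {
      ((page, geo, day), count) <- rows
      p <- opts(page)
      g <- opts(geo)
      d <- opts(day)
    } yield ((p, g, d), count)
    expanded.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
  }
}
```

Looking up `(None, Some("US"), None)` in the result gives the total for the US across all pages and days. On a real `TypedPipe` the same expansion would presumably be a `flatMap` followed by a per-key sum, but that is an assumption about the truncated part of the gist.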

  
## gist:213b837c6e02c4982a9a

      
velvia / gist:213b837c6e02c4982a9a (1 file, 0 forks, 0 comments, 2 stars)
Last active September 21, 2015 09:28

Notes for velvia/filo 50x performance improvement

...to be turned into a blog post later.  These are notes with references to commits; the blog post will have snippets of code so folks don't have to look things up.
How I tuned Filo for 50x speedup in 24 hours

Filo is an extreme serialization library for vector data.  Think of it as the good parts of Parquet without the HDFS and file format garbage -- just the serdes and fast columnar storage.
I recently added a JMH benchmark for reading a Filo binary buffer containing 10,000 Ints using the simplest apply() method to sum up all the Ints.
Oh, and before we get started - avoid throwing exceptions in inner loops, especially Try(....).getOrElse(...) patterns.  Even if they occur only occasionally they can be extremely expensive.
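
To illustrate the point about exceptions in inner loops, here is a small sketch that is not taken from Filo. `InnerLoopSketch`, `sumWithTry`, and `sumWithCheck` are hypothetical names. Both methods return the same sum, but the first pays for a thrown and caught exception on every out-of-range index, while the second replaces that with an explicit bounds check.

```scala
import scala.util.Try

object InnerLoopSketch {
  // Pattern to avoid: every out-of-range index throws and catches an
  // ArrayIndexOutOfBoundsException inside the loop.
  def sumWithTry(data: Array[Int], indices: Array[Int]): Long = {
    var total = 0L
    var i = 0
    while (i < indices.length) {
      total += Try(data(indices(i))).getOrElse(0)
      i += 1
    }
    total
  }

  // Same result with an explicit bounds check, so no exception is thrown.
  def sumWithCheck(data: Array[Int], indices: Array[Int]): Long = {
    var total = 0L
    var i = 0
    while (i < indices.length) {
      val idx = indices(i)
      if (idx >= 0 && idx < data.length) total += data(idx)
      i += 1
    }
    total
  }
}
```

How much this matters depends on how often the failing case occurs, so it is worth measuring with a benchmark such as the JMH one mentioned above.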

  
## hs_err_pid21513.log
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x000000010ff99024, pid=21513, tid=20739
#
# JRE version: Java(TM) SE Runtime Environment (8.0-b123) (build 1.8.0-ea-b123)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b65 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.dylib+0x399024]
#

## FutureGoodies.scala
/*  We've run into a few common pitfalls when dealing with Futures in Scala, so I wrote these three helpful
 *  classes to give some baked-in functionality.
 *
 *  I'd love to hear about other helpers you're using like these, or if you have improvement suggestions.
 *  github@andrewconner.org / @connerdelights
 */

import scala.concurrent.{ExecutionContext, CanAwait, Awaitable, Future, Promise}
import scala.concurrent.duration.Duration
import scala.util.Try

## KMeansJob.scala
import com.twitter.algebird.{Aggregator, Semigroup}
import com.twitter.scalding._

import scala.util.Random

/**
 * This job is a tutorial of sorts for scalding's Execution[T] abstraction.
 * It is a simple implementation of Lloyd's algorithm for k-means on 2D data.
 *
 * http://en.wikipedia.org/wiki/K-means_clustering

## android-19-circle.yml
#
# Build configuration for Circle CI
#

general:
    artifacts:
        - /home/ubuntu/your-app-name/app/build/outputs/apk/

machine:
    environment:

## influxdb-grafana-howto.sh
#!/bin/bash

# Check out the blog post at:
#
#    http://www.philipotoole.com/influxdb-and-grafana-howto
#
# for full details on how to use this script.

AWS_EC2_HOSTNAME_URL=http://169.254.169.254/latest/meta-data/public-hostname
INFLUXDB_DATABASE=test1

## Retry.scala
import scala.concurrent.Await
import scala.concurrent.ExecutionContext
import scala.concurrent.Future
import scala.concurrent.blocking
import scala.concurrent.duration.Deadline
import scala.concurrent.duration.Duration
import scala.concurrent.duration.DurationInt
import scala.concurrent.duration.DurationLong
import scala.concurrent.future
import scala.concurrent.promise

## spark_flame_graphs.md

      
kayousterhout / spark_flame_graphs.md (1 file, 19 forks, 2 comments, 65 stars)
Last active August 22, 2022 13:39

Generating Flame Graphs for Apache Spark

Flame graphs are a nifty debugging tool for determining where CPU time is being spent.  Using the Java Flight Recorder, you can do this for Java processes without adding significant runtime overhead.
When are flame graphs useful?

Shivaram Venkataraman and I have found these flame recordings to be useful for diagnosing coarse-grained performance problems. We started using them at the suggestion of Josh Rosen, who quickly made one for the Spark scheduler when we were talking to him about why the scheduler caps out at a throughput of a few thousand tasks per second. Josh generated a graph similar to the one below, which illustrates that a significant amount of time is spent in serialization (if you click in the top right hand corner and search for "serialize", you can see that 78.6% of the sampled CPU time was spent in serialization). We used this insight to spee

  
## jargon.md

      
cb372 / jargon.md (1 file, 28 forks, 9 comments, 179 stars)
Last active May 8, 2023 16:03

Category theory jargon cheat sheet

A primer/refresher on the category theory concepts that most commonly crop up in conversations about Scala or FP. (Because it's embarrassing when I forget this stuff!)
I'll be assuming Scalaz imports in code samples, and some of the code may be pseudo-Scala.
Functor

A functor is something that supports map.
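
As a minimal sketch of "something that supports map", the type class below is written out by hand rather than imported from Scalaz, so `FunctorSketch` and its members are assumptions for illustration and not the Scalaz API. It declares `map` once for any type constructor `F[_]` and provides instances for `Option` and `List`.

```scala
object FunctorSketch {
  // Hand-written stand-in for a Functor type class (not scalaz.Functor).
  trait Functor[F[_]] {
    def map[A, B](fa: F[A])(f: A => B): F[B]
  }

  // Instance for Option: apply f inside Some, leave None untouched.
  val optionFunctor: Functor[Option] = new Functor[Option] {
    def map[A, B](fa: Option[A])(f: A => B): Option[B] = fa.map(f)
  }

  // Instance for List: apply f to every element.
  val listFunctor: Functor[List] = new Functor[List] {
    def map[A, B](fa: List[A])(f: A => B): List[B] = fa.map(f)
  }
}
```

For example, `FunctorSketch.optionFunctor.map(Some(2))(_ + 1)` yields `Some(3)`, and mapping `identity` over any value returns it unchanged, which is one of the functor laws.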