Igor Berman IgorBerman

## gist:8172796

      
debasishg / gist:8172796
Last active May 7, 2024 22:18

A collection of links for streaming algorithms and data structures

General Background and Overview


Probabilistic Data Structures for Web Analytics and Data Mining : A great overview of the space of probabilistic data structures and how they are used in approximation algorithm implementation.
Models and Issues in Data Stream Systems
Philippe Flajolet’s contribution to streaming algorithms : A presentation by Jérémie Lumbroso that revisits some of the historical perspectives and how it all began with Flajolet.
Approximate Frequency Counts over Data Streams by Gurmeet Singh Manku & Rajeev Motwani : One of the early papers on the subject.
[Methods for Finding Frequent Items in Data Streams](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.187.9800&rep=rep1&t


## s3.sh
# You don't need Fog in Ruby or some other library to upload to S3 -- shell works perfectly fine
# This is how I upload my new Sol Trader builds (http://soltrader.net)
# Based on a modified script from here: http://tmont.com/blargh/2014/1/uploading-to-s3-in-bash

S3KEY="my aws key"
S3SECRET="my aws secret" # pass these in

function putS3
{
  path=$1

## service-checklist.md

      
acolyer / service-checklist.md
Last active January 30, 2024 17:39

Internet Scale Services Checklist

A checklist for designing and developing internet scale services, inspired by James Hamilton's 2007 paper "On Designing and Deploying Internet-Scale Services."

http://mvdirona.com/jrh/talksandpapers/jamesrh_lisa.pdf

Basic tenets


 Does the design expect failures to happen regularly and handle them gracefully?
 Have we kept things as simple as possible?


## StreamingHLL.scala
import spark.streaming.StreamingContext._
import spark.streaming.{Seconds, StreamingContext}
import spark.SparkContext._
import spark.storage.StorageLevel
import spark.streaming.examples.twitter.TwitterInputDStream
import com.twitter.algebird.HyperLogLog._
import com.twitter.algebird._

/**
 * Example of using HyperLogLog monoid from Twitter's Algebird together with Spark Streaming's

## StreamingCMS.scala
import spark.streaming.{Seconds, StreamingContext}
import spark.storage.StorageLevel
import spark.streaming.examples.twitter.TwitterInputDStream
import com.twitter.algebird._
import spark.streaming.StreamingContext._
import spark.SparkContext._

/**
 * Example of using CountMinSketch monoid from Twitter's Algebird together with Spark Streaming's
 * TwitterInputDStream
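
The preview above is cut off before any code appears, so as a rough illustration of the idea it names, here is a minimal Count-Min Sketch in Python. This is a toy sketch, not Algebird's implementation or API: the class name, the `width`/`depth` defaults, and the SHA-256 based row hashing are all assumptions made for the example. The `merge` method shows the property that makes the structure usable as a monoid: two sketches of the same shape combine by cell-wise addition.

```python
import hashlib


class CountMinSketch:
    """Toy Count-Min Sketch: `depth` rows of `width` counters each (hypothetical defaults)."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # Derive one hash function per row by prefixing the row number (an assumption,
        # chosen for simplicity rather than speed).
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum over rows is the best estimate.
        return min(self.table[row][self._index(row, item)] for row in range(self.depth))

    def merge(self, other):
        # Cell-wise addition; both sketches must share the same dimensions.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketch dimensions differ")
        merged = CountMinSketch(self.width, self.depth)
        merged.table = [
            [a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(self.table, other.table)
        ]
        return merged


# Two small "batches" sketched separately and then combined.
batch_one = CountMinSketch()
batch_two = CountMinSketch()
for tag in ["spark", "kafka", "spark"]:
    batch_one.add(tag)
for tag in ["spark", "scala"]:
    batch_two.add(tag)

combined = batch_one.merge(batch_two)
print(combined.estimate("spark"))  # never below the true count of 3
```

Because estimates can only err upward, the sketch trades exactness for fixed memory, which is the trade-off a streaming job relies on when it merges per-batch sketches.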

## benchmark-commands.txt
Producer

Setup
bin/kafka-topics.sh --zookeeper esv4-hcl197.grid.linkedin.com:2181 --create --topic test-rep-one --partitions 6 --replication-factor 1
bin/kafka-topics.sh --zookeeper esv4-hcl197.grid.linkedin.com:2181 --create --topic test --partitions 6 --replication-factor 3

Single thread, no replication

bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test7 50000000 100 -1 acks=1 bootstrap.servers=esv4-hcl198.grid.linkedin.com:9092 buffer.memory=67108864 batch.size=8196

## gist:739a864b3275e901d317

      
              1 file
            
          
              14 forks
            
          
              40 comments
            
          
              203 stars
            
          
                drkarl
                / gist:739a864b3275e901d317
            
            
              Last active
              October 17, 2023 10:43
            
              
                Ask HN: Best Linux server backup system?
              
          
    Linux Backup Solutions

I've been looking for the best Linux backup system, and also reading lots of HN comments.
Instead of putting pros and cons of every backup system I'll just list some deal-breakers which would disqualify them.
I would also like you, the HN community, to add more deal-breakers for these or other backup systems if you know of any. Likewise, if you have data that disproves any of the deal-breakers listed here (benchmarks, or evidence that something was true of older releases but has been fixed in newer ones), please share it so that I can edit this list accordingly.
Amanda (comments by sammcj)


It has a lot of management overhead and that's a problem if you don't have time for a full time backup administrator.


## Maven multi-module build options
# Inspired by http://blog.akquinet.de/2010/05/26/mastering-the-maven-command-line-%E2%80%93-reactor-options/

# Build only specific modules:
mvn clean install -pl sub-module-name2
mvn clean install -pl sub-module-name2,sub-module-name3

# Build only starting from specific sub-module (resume from)
mvn clean install -rf sub-module-name2

# Build dependencies (also make)

## jndi-response.md

      
shipilev / jndi-response.md
Last active February 6, 2022 17:12

Generate the file:

$ awk 'BEGIN { for(c=0;c<10000000;c++) printf "<p>LOL</p>" }' > 100M.html
$ (for I in `seq 1 100`; do cat 100M.html; done) | pv | gzip -9 > 10G.boomgz
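
The file names in the two commands above encode their expected uncompressed sizes, and the arithmetic can be checked with a short Python sketch. The constants are read directly from the commands; the scaled-down 1 MB sample and the variable names are assumptions added for illustration, so that the compression step runs in a moment instead of processing 10 GB.

```python
import gzip

UNIT = b"<p>LOL</p>"            # the fragment printed by the awk loop (10 bytes)
REPEATS_PER_FILE = 10_000_000   # loop bound in the awk command
COPIES = 100                    # `seq 1 100` in the second command

html_bytes = len(UNIT) * REPEATS_PER_FILE
total_bytes = html_bytes * COPIES
print(html_bytes)    # 100000000   -> the "100M" in 100M.html
print(total_bytes)   # 10000000000 -> the "10G" in 10G.boomgz

# Scaled-down sample (assumption: 1 MB instead of 10 GB) to show how well
# highly repetitive input compresses at gzip level 9.
sample = UNIT * 100_000
compressed = gzip.compress(sample, compresslevel=9)
print(len(sample), len(compressed))
```

The sizes are decimal (powers of ten), which is why the names say 100M and 10G rather than matching binary mebibytes or gibibytes.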


Check it is indeed good:


## DirectOutputCommitter.scala
/*
 * Copyright 2015 Databricks, Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.  You may obtain
 * a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software