Keybase proof

I hereby claim:

  • I am shagunsodhani on github.
  • I am shagun (https://keybase.io/shagun) on keybase.
  • I have a public key whose fingerprint is 052A 444C 26A2 DFAA 1F4E 9A79 48D9 E5AB AD77 C174

To claim this, I am signing this object:

@shagunsodhani
shagunsodhani / cogroup.md
Created November 28, 2015 11:34
Example of cogroup operation

result dataset

query url
query1 url1
query2 url2
query1 url3
query2 url4

revenue dataset
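
The revenue rows are truncated in this preview. Below is a minimal PySpark sketch of the cogroup operation on these datasets; the revenue values are hypothetical, not from the original gist.

```python
# Minimal PySpark sketch of cogroup; the revenue rows below are
# assumed (query, revenue) pairs, since they are truncated above.
from pyspark import SparkContext

sc = SparkContext("local", "cogroup-example")

result = sc.parallelize([
    ("query1", "url1"), ("query2", "url2"),
    ("query1", "url3"), ("query2", "url4"),
])
revenue = sc.parallelize([("query1", 10.0), ("query2", 20.0)])  # hypothetical rows

# cogroup groups the values of both RDDs by key in a single pass:
# each element is (query, (urls for that query, revenues for that query)).
for query, (urls, revs) in result.cogroup(revenue).collect():
    print(query, sorted(urls), sorted(revs))
```

Unlike a join, cogroup yields each key once with the full collection of values from both sides, avoiding the row duplication a join would produce for repeated keys.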

@shagunsodhani
shagunsodhani / GraphX.md
Created December 10, 2015 15:00
Notes about GraphX Paper

This week I read up on GraphX, a distributed graph computation framework that unifies graph-parallel and data-parallel computation. Graph-parallel systems express iterative algorithms efficiently (by exploiting the static graph structure) but perform poorly on operations that require a more general view of the graph, such as operations that move data out of the graph. Data-parallel systems handle such tasks well, but implementing graph algorithms directly on them is inefficient due to complex joins and excessive data movement. This is the gap GraphX fills, by allowing the same data to be viewed and operated upon both as a graph and as a table.

Preliminaries

Let G = (V, E) be a graph where V = {1, ..., n} is the set of vertices and E is the set of m directed edges. Each directed edge is a tuple of the form (i, j) ∈ E where i ∈ V is the source vertex and j ∈ V is the target vertex. The vertex p
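
As a toy illustration of the dual view described above (my own sketch, not code from the paper), the same property graph can be stored as plain vertex and edge collections and still be traversed as a graph:

```python
# Toy property graph stored as two "tables" (collections of records).
vertices = {1: "user:a", 2: "user:b", 3: "user:c"}                # id -> vertex property
edges = [(1, 2, "follows"), (2, 3, "follows"), (1, 3, "likes")]   # (src, dst, edge property)

# Table view: filter/aggregate edges like any dataset.
follow_edges = [e for e in edges if e[2] == "follows"]

# Graph view: index edges by source vertex for iterative traversal.
out_edges = {}
for src, dst, prop in edges:
    out_edges.setdefault(src, []).append((dst, prop))
print(out_edges[1])  # [(2, 'follows'), (3, 'likes')]
```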

@shagunsodhani
shagunsodhani / Pregel.md
Created December 20, 2015 11:55
Notes on Pregel Paper

The Pregel paper introduces a vertex-centric, large-scale graph computational model. Interestingly, the name Pregel comes from the name of the river which the Seven Bridges of Königsberg spanned.

Computational Model

The system takes as input a directed graph with properties assigned to both vertices and edges. The computation consists of a sequence of iterations, called supersteps. In each superstep, a user-defined function is invoked on each vertex in parallel. This function essentially implements the algorithm by specifying the behaviour of a single vertex V during a single superstep S. The function can read messages sent to the vertex V during the previous superstep (S-1), change the state of the vertex or of its outgoing edges, mutate the graph topology by adding/removing vertices or edges, and send messages to other vertices that will be received in the next superstep (S+1). Since all computation during a superstep is performed locally, th
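
As a single-machine sketch (not the paper's API), the superstep loop for a toy "propagate the maximum value" algorithm might look like this, with vertices voting to halt by sending no messages:

```python
# Minimal serial sketch of Pregel's superstep loop for a toy
# "propagate the maximum value" algorithm; Pregel itself invokes the
# per-vertex logic in parallel across workers.
def pregel_max(values, out_edges, max_supersteps=30):
    """values: {vertex: int}, out_edges: {vertex: [neighbour vertices]}"""
    inbox = {v: [] for v in values}        # messages from superstep S-1
    for step in range(max_supersteps):
        outbox = {v: [] for v in values}   # messages for superstep S+1
        sent = False
        for v in values:                   # conceptually parallel
            changed = (step == 0)          # every vertex broadcasts once
            if inbox[v] and max(inbox[v]) > values[v]:
                values[v] = max(inbox[v])  # update local vertex state
                changed = True
            if changed:
                for n in out_edges.get(v, []):
                    outbox[n].append(values[v])
                    sent = True
        if not sent:                       # all vertices voted to halt
            break
        inbox = outbox
    return values

# Example: after convergence every vertex holds the global maximum.
print(pregel_max({1: 3, 2: 6, 3: 1}, {1: [2], 2: [3], 3: [1]}))  # {1: 6, 2: 6, 3: 6}
```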

@shagunsodhani
shagunsodhani / Hive.md
Created January 18, 2016 06:42
Notes on Hive Paper

Hive is an open-source data warehousing solution built on top of Hadoop. It supports an SQL-like query language called HiveQL. These queries are compiled into MapReduce jobs that are executed on Hadoop. While Hive uses Hadoop for execution of queries, it reduces the effort that goes into writing and maintaining MapReduce jobs.

Data Model

Hive supports database concepts like tables, columns, rows and partitions. Both primitive (integer, float, string) and complex data types (map, list, struct) are supported. Moreover, these types can be composed to support structures of arbitrary complexity, e.g. list<map<string, struct<p1:int, p2:int>>>. Tables are serialized/deserialized using default serializers/deserializers. Any new data format and type can be supported by implementing the SerDe and ObjectInspector Java interfaces.

Query Language

The Hive query language (HiveQL) is a subset of SQL with some extensions. It supports features like subqueries, joins, car

A Few Useful Things to Know about Machine Learning

The paper presents some key lessons and "folk wisdom" that machine learning researchers and practitioners have learnt from experience and which are hard to find in textbooks.

1. Learning = Representation + Evaluation + Optimization

All machine learning algorithms have three components:

  • Representation for a learner is the set of classifiers/functions that can possibly be learnt. This set is called the hypothesis space. If a function is not in the hypothesis space, it cannot be learnt.
  • Evaluation function tells how good a given model is.
  • Optimisation is the method used to search the hypothesis space for the highest-scoring model (see the sketch after this list).
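
A minimal sketch of this decomposition on assumed toy data: the representation is single-feature thresholds, the evaluation is accuracy, and the optimisation is exhaustive search over the hypothesis space.

```python
# Toy labelled data (assumed for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0, 0, 1, 1]

# Representation: hypothesis space = {x > t : t in candidate thresholds}
hypotheses = [lambda x, t=t: int(x > t) for t in (0.5, 1.5, 2.5, 3.5)]

# Evaluation: fraction of points classified correctly.
def accuracy(h):
    return sum(h(x) == y for x, y in zip(xs, ys)) / len(xs)

# Optimisation: search for the best hypothesis under the evaluation.
best = max(hypotheses, key=accuracy)
print(accuracy(best))  # -> 1.0, achieved by the threshold t = 2.5
```
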
@shagunsodhani
shagunsodhani / The Unified Logging Infrastructure for Data Analytics at Twitter.md
Created February 1, 2016 12:41
Summary of "The Unified Logging Infrastructure for Data Analytics at Twitter" paper

The paper presents Twitter's logging infrastructure: how it evolved from application-specific logging to a unified logging infrastructure, and how session sequences are used as a common-case optimization for a large class of queries.

Messaging Infrastructure

Twitter uses Scribe as its messaging infrastructure. A Scribe daemon runs on every production server and sends log data to a cluster of dedicated aggregators in the same data center. Each aggregator registers itself with Zookeeper, and the Scribe daemon consults Zookeeper to discover a live aggregator to which it can send the data. Colocated with the aggregators is the staging Hadoop cluster, which merges the per-category streams from all the server daemons and writes the compressed results to HDFS. These logs are then moved into the main Hadoop data warehouse and deposited in per-category, per-hour directories (e.g. /logs/cate
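
As an illustration of this discovery pattern (a sketch using the kazoo ZooKeeper client; the znode path and host names are assumptions, not Twitter's configuration):

```python
# Sketch of aggregator discovery via ZooKeeper using kazoo; the path
# "/aggregators" and hosts are illustrative assumptions.
import random
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181")
zk.start()

# On startup, each aggregator would register an ephemeral znode, e.g.
#   zk.create("/aggregators/agg-host:1463", ephemeral=True, makepath=True)
# Ephemeral nodes disappear when an aggregator dies, so the children
# of /aggregators always list the live aggregators.

aggregators = zk.get_children("/aggregators")
target = random.choice(aggregators)   # the Scribe daemon picks a live aggregator
print("sending per-category log data to", target)
```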

@shagunsodhani
shagunsodhani / Batch Normalization.md
Last active July 25, 2023 18:07
Notes for "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" paper

The Batch Normalization paper describes a method to address issues that arise when training deep neural networks, chiefly internal covariate shift. It makes normalization a part of the architecture itself and reports significant improvements in the number of iterations required to train the network.

Issues With Training Deep Neural Networks

Internal Covariate shift

Covariate shift refers to a change in the input distribution of a learning system. In the case of deep networks, the input to each layer is affected by the parameters of all the preceding layers, so even small changes to the parameters get amplified as they propagate through the network. The resulting change in the input distribution of the internal layers of a deep network is known as internal covariate shift.

It is well established that networks converge faster if their inputs are whitened (i.e. zero mean, unit variance, and decorrelated); internal covariate shift leads to just the opposite.
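
A minimal NumPy sketch of the batch-norm forward transform for one layer (a sketch of the idea, assuming activations of shape (batch, features)):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features); gamma, beta: learned per-feature parameters."""
    mean = x.mean(axis=0)                    # mini-batch mean per feature
    var = x.var(axis=0)                      # mini-batch variance per feature
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalize to zero mean, unit variance
    return gamma * x_hat + beta              # learned scale/shift restore expressiveness

x = np.random.randn(32, 4) * 5 + 3           # skewed, shifted inputs
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 and ~1 per feature
```

Note the paper's simplification of whitening: each feature is normalized independently rather than jointly decorrelated, which keeps the operation cheap and differentiable.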

@shagunsodhani
shagunsodhani / TAO.md
Created February 28, 2016 19:33
Notes on TAO: Facebook’s Distributed Data Store for the Social Graph

TAO

  • Geographically distributed, read-optimized, graph data store.
  • Favors availability and efficiency over consistency.
  • Developed by and used within Facebook (social graph).
  • Link to paper.

Before TAO

  • Facebook's servers directly accessed MySQL to read/write the social graph.
@shagunsodhani
shagunsodhani / FMP.md
Created March 6, 2016 17:58
Notes on Fractional Max-Pooling

Fractional Max-Pooling (FMP)

Introduction

  • Link to Paper
  • Spatial pooling layers are building blocks for Convolutional Neural Networks (CNNs).
  • Input to the pooling operation is an N_in × N_in matrix and the output is a smaller N_out × N_out matrix.
  • The pooling operation divides the N_in × N_in square into N_out² pooling regions P_(i,j).
  • P_(i,j) ⊂ {1, 2, ..., N_in}² ∀ (i, j) ∈ {1, ..., N_out}²
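
A minimal NumPy sketch of generating pseudorandom pooling-region boundaries in one dimension (my sketch of the paper's pseudorandom sequence, assuming N_in = 25 and N_out = 18, i.e. a pooling ratio between 1 and 2):

```python
import numpy as np

def fmp_boundaries(n_in, n_out, rng=np.random.default_rng()):
    """One random 1-D partition of n_in columns into n_out regions."""
    alpha = n_in / n_out                  # fractional pooling ratio, 1 < alpha < 2
    u = rng.uniform(0, 1)                 # single random shift per pooling pass
    # Pseudorandom sequence: left edge of region i is ceil(alpha * (i + u)).
    edges = np.ceil(alpha * (np.arange(n_out + 1) + u)).astype(int)
    return edges - edges[0]               # shift so regions start at column 0

edges = fmp_boundaries(25, 18)
sizes = np.diff(edges)                    # each region spans 1 or 2 columns
print(sizes, sizes.sum())                 # mixed 1s and 2s summing to 25
```

Because the boundaries are resampled on each pass, the pooling regions change between forward passes, which is what gives FMP its regularizing, data-augmentation-like effect.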