Keybase proof

I hereby claim:

  • I am shagunsodhani on github.
  • I am shagun (https://keybase.io/shagun) on keybase.
  • I have a public key whose fingerprint is 052A 444C 26A2 DFAA 1F4E 9A79 48D9 E5AB AD77 C174

To claim this, I am signing this object:

@shagunsodhani
shagunsodhani / cogroup.md
Created November 28, 2015 11:34
Example of cogroup operation

result dataset

query url
query1 url1
query2 url2
query1 url3
query2 url4

revenue dataset
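
The revenue rows are truncated in this preview. Below is a minimal PySpark sketch of the cogroup operation on these datasets; the revenue values are hypothetical, not from the original gist.

```python
# Minimal PySpark sketch of cogroup; the revenue rows below are
# assumed (query, revenue) pairs, since they are truncated above.
from pyspark import SparkContext

sc = SparkContext("local", "cogroup-example")

result = sc.parallelize([
    ("query1", "url1"), ("query2", "url2"),
    ("query1", "url3"), ("query2", "url4"),
])
revenue = sc.parallelize([("query1", 10.0), ("query2", 20.0)])  # hypothetical rows

# cogroup groups the values of both RDDs by key in a single pass:
# each element is (query, (urls for that query, revenues for that query)).
for query, (urls, revs) in result.cogroup(revenue).collect():
    print(query, sorted(urls), sorted(revs))
```

Unlike a join, cogroup yields each key once with the full collection of values from both sides, avoiding the row duplication a join would produce for repeated keys.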

@shagunsodhani
shagunsodhani / GraphX.md
Created December 10, 2015 15:00
Notes about GraphX Paper

This week I read up on GraphX, a distributed graph computation framework that unifies graph-parallel and data-parallel computation. Graph-parallel systems express iterative algorithms efficiently (by exploiting the static graph structure) but perform poorly on operations that require a more general view of the graph, such as operations that move data out of the graph. Data-parallel systems handle such tasks well, but implementing graph algorithms directly on them is inefficient due to complex joins and excessive data movement. This is the gap GraphX fills, by allowing the same data to be viewed and operated upon both as a graph and as a table.

Preliminaries

Let G = (V, E) be a graph where V = {1, ..., n} is the set of vertices and E is the set of m directed edges. Each directed edge is a tuple of the form (i, j) ∈ E where i ∈ V is the source vertex and j ∈ V is the target vertex. The vertex p
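
As a toy illustration of the dual view described above (my own sketch, not code from the paper), the same property graph can be stored as plain vertex and edge collections and still be traversed as a graph:

```python
# Toy property graph stored as two "tables" (collections of records).
vertices = {1: "user:a", 2: "user:b", 3: "user:c"}                # id -> vertex property
edges = [(1, 2, "follows"), (2, 3, "follows"), (1, 3, "likes")]   # (src, dst, edge property)

# Table view: filter/aggregate edges like any dataset.
follow_edges = [e for e in edges if e[2] == "follows"]

# Graph view: index edges by source vertex for iterative traversal.
out_edges = {}
for src, dst, prop in edges:
    out_edges.setdefault(src, []).append((dst, prop))
print(out_edges[1])  # [(2, 'follows'), (3, 'likes')]
```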

@shagunsodhani
shagunsodhani / Pregel.md
Created December 20, 2015 11:55
Notes on Pregel Paper

The Pregel paper introduces a vertex-centric, large-scale graph computational model. Interestingly, the name Pregel comes from the name of the river which the Seven Bridges of Königsberg spanned.

Computational Model

The system takes as input a directed graph with properties assigned to both vertices and edges. The computation consists of a sequence of iterations, called supersteps. In each superstep, a user-defined function is invoked on each vertex in parallel. This function essentially implements the algorithm by specifying the behaviour of a single vertex V during a single superstep S. The function can read messages sent to the vertex V during the previous superstep (S-1), change the state of the vertex or of its outgoing edges, mutate the graph topology by adding/removing vertices or edges, and send messages to other vertices that will be received in the next superstep (S+1). Since all computation during a superstep is performed locally, th
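
As a single-machine sketch (not the paper's API), the superstep loop for a toy "propagate the maximum value" algorithm might look like this, with vertices voting to halt by sending no messages:

```python
# Minimal serial sketch of Pregel's superstep loop for a toy
# "propagate the maximum value" algorithm; Pregel itself invokes the
# per-vertex logic in parallel across workers.
def pregel_max(values, out_edges, max_supersteps=30):
    """values: {vertex: int}, out_edges: {vertex: [neighbour vertices]}"""
    inbox = {v: [] for v in values}        # messages from superstep S-1
    for step in range(max_supersteps):
        outbox = {v: [] for v in values}   # messages for superstep S+1
        sent = False
        for v in values:                   # conceptually parallel
            changed = (step == 0)          # every vertex broadcasts once
            if inbox[v] and max(inbox[v]) > values[v]:
                values[v] = max(inbox[v])  # update local vertex state
                changed = True
            if changed:
                for n in out_edges.get(v, []):
                    outbox[n].append(values[v])
                    sent = True
        if not sent:                       # all vertices voted to halt
            break
        inbox = outbox
    return values

# Example: after convergence every vertex holds the global maximum.
print(pregel_max({1: 3, 2: 6, 3: 1}, {1: [2], 2: [3], 3: [1]}))  # {1: 6, 2: 6, 3: 6}
```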

@shagunsodhani
shagunsodhani / Hive.md
Created January 18, 2016 06:42
Notes on Hive Paper

Hive is an open-source data warehousing solution built on top of Hadoop. It supports an SQL-like query language called HiveQL. These queries are compiled into MapReduce jobs that are executed on Hadoop. While Hive uses Hadoop for execution of queries, it reduces the effort that goes into writing and maintaining MapReduce jobs.

Data Model

Hive supports database concepts like tables, columns, rows and partitions. Both primitive (integer, float, string) and complex data types (map, list, struct) are supported. Moreover, these types can be composed to support structures of arbitrary complexity, e.g. list<map<string, struct<p1:int, p2:int>>>. Tables are serialized/deserialized using default serializers/deserializers. Any new data format and type can be supported by implementing the SerDe and ObjectInspector Java interfaces.

Query Language

The Hive query language (HiveQL) is a subset of SQL with some extensions. It supports features like subqueries, joins, car

A Few Useful Things to Know about Machine Learning

The paper presents some key lessons and "folk wisdom" that machine learning researchers and practitioners have learnt from experience and which are hard to find in textbooks.

1. Learning = Representation + Evaluation + Optimization

All machine learning algorithms have three components:

  • Representation for a learner is the set of classifiers/functions that can possibly be learnt. This set is called the hypothesis space. If a function is not in the hypothesis space, it cannot be learnt.
  • Evaluation function tells how good a given model is.
  • Optimisation is the method used to search the hypothesis space for the highest-scoring model (see the sketch after this list).
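
A minimal sketch of this decomposition on assumed toy data: the representation is single-feature thresholds, the evaluation is accuracy, and the optimisation is exhaustive search over the hypothesis space.

```python
# Toy labelled data (assumed for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0, 0, 1, 1]

# Representation: hypothesis space = {x > t : t in candidate thresholds}
hypotheses = [lambda x, t=t: int(x > t) for t in (0.5, 1.5, 2.5, 3.5)]

# Evaluation: fraction of points classified correctly.
def accuracy(h):
    return sum(h(x) == y for x, y in zip(xs, ys)) / len(xs)

# Optimisation: search for the best hypothesis under the evaluation.
best = max(hypotheses, key=accuracy)
print(accuracy(best))  # -> 1.0, achieved by the threshold t = 2.5
```
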
@shagunsodhani
shagunsodhani / The Unified Logging Infrastructure for Data Analytics at Twitter.md
Created February 1, 2016 12:41
Summary of "The Unified Logging Infrastructure for Data Analytics at Twitter" paper

The paper presents Twitter's logging infrastructure: how it evolved from application-specific logging to a unified logging infrastructure, and how session sequences are used as a common-case optimization for a large class of queries.

Messaging Infrastructure

Twitter uses Scribe as its messaging infrastructure. A Scribe daemon runs on every production server and sends log data to a cluster of dedicated aggregators in the same data center. Each aggregator registers itself with Zookeeper, and the Scribe daemon consults Zookeeper to discover a live aggregator to which it can send the data. Colocated with the aggregators is the staging Hadoop cluster, which merges the per-category streams from all the server daemons and writes the compressed results to HDFS. These logs are then moved into the main Hadoop data warehouse and deposited in per-category, per-hour directories (e.g. /logs/cate
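
As an illustration of this discovery pattern (a sketch using the kazoo ZooKeeper client; the znode path and host names are assumptions, not Twitter's configuration):

```python
# Sketch of aggregator discovery via ZooKeeper using kazoo; the path
# "/aggregators" and hosts are illustrative assumptions.
import random
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181")
zk.start()

# On startup, each aggregator would register an ephemeral znode, e.g.
#   zk.create("/aggregators/agg-host:1463", ephemeral=True, makepath=True)
# Ephemeral nodes disappear when an aggregator dies, so the children
# of /aggregators always list the live aggregators.

aggregators = zk.get_children("/aggregators")
target = random.choice(aggregators)   # the Scribe daemon picks a live aggregator
print("sending per-category log data to", target)
```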

@shagunsodhani
shagunsodhani / Batch Normalization.md
Last active July 25, 2023 18:07
Notes for "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" paper

The Batch Normalization paper describes a method to address issues that arise when training deep neural networks, chiefly internal covariate shift. It makes normalization a part of the architecture itself and reports significant improvements in the number of iterations required to train the network.

Issues With Training Deep Neural Networks

Internal Covariate shift

Covariate shift refers to a change in the input distribution of a learning system. In the case of deep networks, the input to each layer is affected by the parameters of all the preceding layers, so even small changes to the parameters get amplified as they propagate through the network. The resulting change in the input distribution of the internal layers of a deep network is known as internal covariate shift.

It is well established that networks converge faster if their inputs are whitened (i.e. zero mean, unit variance, and decorrelated); internal covariate shift leads to just the opposite.
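
A minimal NumPy sketch of the batch-norm forward transform for one layer (a sketch of the idea, assuming activations of shape (batch, features)):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features); gamma, beta: learned per-feature parameters."""
    mean = x.mean(axis=0)                    # mini-batch mean per feature
    var = x.var(axis=0)                      # mini-batch variance per feature
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalize to zero mean, unit variance
    return gamma * x_hat + beta              # learned scale/shift restore expressiveness

x = np.random.randn(32, 4) * 5 + 3           # skewed, shifted inputs
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 and ~1 per feature
```

Note the paper's simplification of whitening: each feature is normalized independently rather than jointly decorrelated, which keeps the operation cheap and differentiable.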

@shagunsodhani
shagunsodhani / TAO.md
Created February 28, 2016 19:33
Notes on TAO: Facebook’s Distributed Data Store for the Social Graph

TAO

  • Geographically distributed, read-optimized, graph data store.
  • Favors availability and efficiency over consistency.
  • Developed by and used within Facebook (social graph).
  • Link to paper.

Before TAO

  • Facebook's servers directly accessed MySQL to read/write the social graph.
@shagunsodhani
shagunsodhani / FMP.md
Created March 6, 2016 17:58
Notes on Fractional Max-Pooling

Fractional Max-Pooling (FMP)

Introduction

  • Link to Paper
  • Spatial pooling layers are building blocks for Convolutional Neural Networks (CNNs).
  • Input to the pooling operation is an N_in × N_in matrix and the output is a smaller N_out × N_out matrix.
  • The pooling operation divides the N_in × N_in square into N_out² pooling regions P_(i,j).
  • P_(i,j) ⊂ {1, 2, ..., N_in}² ∀ (i, j) ∈ {1, ..., N_out}²
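
A minimal NumPy sketch of generating pseudorandom pooling-region boundaries in one dimension (my sketch of the paper's pseudorandom sequence, assuming N_in = 25 and N_out = 18, i.e. a pooling ratio between 1 and 2):

```python
import numpy as np

def fmp_boundaries(n_in, n_out, rng=np.random.default_rng()):
    """One random 1-D partition of n_in columns into n_out regions."""
    alpha = n_in / n_out                  # fractional pooling ratio, 1 < alpha < 2
    u = rng.uniform(0, 1)                 # single random shift per pooling pass
    # Pseudorandom sequence: left edge of region i is ceil(alpha * (i + u)).
    edges = np.ceil(alpha * (np.arange(n_out + 1) + u)).astype(int)
    return edges - edges[0]               # shift so regions start at column 0

edges = fmp_boundaries(25, 18)
sizes = np.diff(edges)                    # each region spans 1 or 2 columns
print(sizes, sizes.sum())                 # mixed 1s and 2s summing to 25
```

Because the boundaries are resampled on each pass, the pooling regions change between forward passes, which is what gives FMP its regularizing, data-augmentation-like effect.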