I hereby claim:
- I am shagunsodhani on github.
- I am shagun (https://keybase.io/shagun) on keybase.
- I have a public key whose fingerprint is 052A 444C 26A2 DFAA 1F4E 9A79 48D9 E5AB AD77 C174
To claim this, I am signing this object:
result dataset

query | url |
---|---|
query1 | url1 |
query2 | url2 |
query1 | url3 |
query2 | url4 |

revenue dataset
This week I read up on GraphX, a distributed graph computation framework that unifies graph-parallel and data-parallel computation. Graph-parallel systems efficiently express iterative algorithms (by exploiting the static graph structure) but perform poorly on operations that require a more general view of the graph, such as operations that move data out of the graph. Data-parallel systems perform well on such tasks, but implementing graph algorithms directly on them is inefficient due to complex joins and excessive data movement. GraphX fills this gap by allowing the same data to be viewed and operated upon both as a graph and as a table.
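The dual view can be sketched in a few lines of plain Python (a toy illustration, not GraphX's actual API): the same vertex and edge tables back both a relational "triplets" join and an adjacency-list graph view.

```python
# Toy sketch of the table/graph dual view (not GraphX's API):
# the same underlying data kept as a vertex table and an edge table.

vertices = {1: "A", 2: "B", 3: "C"}              # vertex id -> vertex property
edges = [(1, 2, 0.5), (2, 3, 0.8), (1, 3, 0.2)]  # (src, dst, edge property)

# Table view: a relational join producing triplets (src prop, edge prop, dst prop).
triplets = [(vertices[s], w, vertices[d]) for s, d, w in edges]

# Graph view: adjacency lists built from the same edge table.
adjacency = {}
for s, d, _ in edges:
    adjacency.setdefault(s, []).append(d)

print(triplets)   # [('A', 0.5, 'B'), ('B', 0.8, 'C'), ('A', 0.2, 'C')]
print(adjacency)  # {1: [2, 3], 2: [3]}
```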
Let G = (V, E) be a graph where V = {1, ..., n} is the set of vertices and E is the set of m directed edges. Each directed edge is a tuple of the form (i, j) ∈ E where i ∈ V is the source vertex and j ∈ V is the target vertex. The vertex p
The Pregel paper introduces a vertex-centric, large-scale graph computational model. Interestingly, the name Pregel comes from the name of the river which the Seven Bridges of Königsberg spanned.
The system takes as input a directed graph with properties assigned to both vertices and edges. The computation consists of a sequence of iterations, called supersteps. In each superstep, a user-defined function is invoked on each vertex in parallel. This function implements the algorithm by specifying the behaviour of a single vertex V during a single superstep S: it can read messages sent to V during the previous superstep (S-1), change the state of V or its outgoing edges, mutate the graph topology by adding or removing vertices and edges, and send messages to other vertices that will be received in the next superstep (S+1). Since all computation during a superstep is performed locally, the model is well suited to distributed execution.
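The superstep loop can be sketched on a single machine (an illustrative toy, not Pregel's distributed implementation). The program below is the classic maximum-value example: each vertex adopts the largest value it has seen, forwards it to its neighbours, and votes to halt once its value stops changing.

```python
# Single-machine sketch of the Pregel superstep loop (illustrative only).
# Program: every vertex converges to the maximum value in the graph.

graph = {1: [2], 2: [1, 3], 3: [2]}   # vertex -> out-neighbours (hypothetical graph)
value = {1: 3, 2: 6, 3: 2}            # per-vertex state
messages = {v: [] for v in graph}     # inboxes for the *next* superstep
active = set(graph)

superstep = 0
while active or any(messages.values()):
    inbox, messages = messages, {v: [] for v in graph}
    next_active = set()
    for v in graph:
        if v not in active and not inbox[v]:
            continue                          # halted vertices wake only on a message
        new_value = max([value[v]] + inbox[v])
        if new_value > value[v] or superstep == 0:
            value[v] = new_value
            for neighbour in graph[v]:        # delivered in superstep S+1
                messages[neighbour].append(new_value)
            next_active.add(v)                # value changed, so stay active
        # otherwise the vertex votes to halt
    active = next_active
    superstep += 1

print(value)  # {1: 6, 2: 6, 3: 6}
```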
Hive is an open-source data warehousing solution built on top of Hadoop. It supports an SQL-like query language called HiveQL. These queries are compiled into MapReduce jobs that are executed on Hadoop. While Hive uses Hadoop for execution of queries, it reduces the effort that goes into writing and maintaining MapReduce jobs.
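The compilation idea can be illustrated with a toy sketch (this is not Hive's actual planner): a declarative GROUP BY aggregation maps onto a pair of map and reduce functions plus a shuffle between them.

```python
# Toy illustration of the idea behind Hive: a declarative query such as
#   SELECT status, COUNT(*) FROM logs GROUP BY status
# compiled down to map and reduce functions (a sketch, not Hive's compiler).

from collections import defaultdict

rows = [{"status": 200}, {"status": 404}, {"status": 200}, {"status": 500}]

def map_phase(row):
    yield (row["status"], 1)           # emit (group-by key, partial count)

def reduce_phase(key, values):
    return (key, sum(values))          # aggregate partial counts per key

shuffled = defaultdict(list)           # the "shuffle": group values by key
for row in rows:
    for key, val in map_phase(row):
        shuffled[key].append(val)

result = dict(reduce_phase(k, vs) for k, vs in shuffled.items())
print(result)  # {200: 2, 404: 1, 500: 1}
```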
Hive supports database concepts like tables, columns, rows and partitions. Both primitive (integer, float, string) and complex data types (map, list, struct) are supported, and these types can be composed to build structures of arbitrary complexity. Tables are serialized and deserialized using a default serializer/deserializer, and any new data format or type can be supported by implementing the SerDe and ObjectInspector Java interfaces.
The Hive query language (HiveQL) consists of a subset of SQL along with some extensions, and supports features like subqueries, joins, cartesian products
The paper presents some key lessons and "folk wisdom" that machine learning researchers and practitioners have learnt from experience and which are hard to find in textbooks.
All machine learning algorithms have three components: representation, evaluation, and optimization.
The paper presents Twitter's logging infrastructure, how it evolved from application-specific logging to a unified logging infrastructure, and how session-sequences are used as a common-case optimization for a large class of queries.
Twitter uses Scribe as its messaging infrastructure. A Scribe daemon runs on every production server and sends log data to a cluster of dedicated aggregators in the same data center. Each aggregator registers itself with Zookeeper, and the Scribe daemon consults Zookeeper to discover a live aggregator to which it can send the data. Colocated with the aggregators is a staging Hadoop cluster, which merges the per-category streams from all the server daemons and writes the compressed results to HDFS. These logs are then moved into the main Hadoop data warehouse and deposited in per-category, per-hour directories (e.g., /logs/cate
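The discovery flow can be sketched with an in-memory registry standing in for Zookeeper (all names and paths below are hypothetical, not Twitter's actual configuration):

```python
# Toy stand-in for the discovery flow described above: aggregators register
# under a Zookeeper-like path, and each Scribe daemon consults the registry
# to pick a live aggregator. Names and paths are hypothetical.

import random

registry = {}  # path -> list of registered hosts (stands in for Zookeeper)

def register(path, host):
    registry.setdefault(path, []).append(host)

def pick_live_aggregator(path):
    live = registry.get(path, [])
    if not live:
        raise RuntimeError("no live aggregator registered")
    return random.choice(live)     # daemon picks any live aggregator

register("/aggregators", "agg-host-1")
register("/aggregators", "agg-host-2")
print(pick_live_aggregator("/aggregators"))  # one of the registered hosts
```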
The Batch Normalization paper describes a method to address internal covariate shift when training deep neural networks. It makes normalization a part of the architecture itself and reports significant improvements in the number of iterations required to train the network.
Covariate shift refers to a change in the input distribution of a learning system. In a deep network, the input to each layer is affected by the parameters of all preceding layers, so even small parameter changes get amplified as they propagate through the network. This changes the input distribution of the network's internal layers, a phenomenon known as internal covariate shift.
It is well established that networks converge faster if their inputs are whitened (i.e., zero mean, unit variance) and uncorrelated; internal covariate shift leads to just the opposite.
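The transform itself is simple. A minimal single-feature sketch (the learnable parameters gamma and beta are scalars here for clarity): normalize activations over the mini-batch to zero mean and unit variance, then scale and shift.

```python
# Minimal sketch of the batch-normalization transform for one feature:
# normalize over the mini-batch, then scale and shift with learnable
# parameters gamma and beta (scalars here for simplicity).

from math import sqrt

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n      # biased mini-batch variance
    return [gamma * (x - mean) / sqrt(var + eps) + beta for x in batch]

activations = [2.0, 4.0, 6.0, 8.0]        # one feature over a mini-batch of 4
normalized = batch_norm(activations)
print([round(v, 3) for v in normalized])  # [-1.342, -0.447, 0.447, 1.342]
```

Gamma and beta are there so the network can recover the original activations if that is what training prefers, preserving representational power.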