v5tech

  • Xi'an China

Install pandoc on Mac OS X 10.8


Install haskell-platform

$ brew install haskell-platform
server {
listen 80 default; ## listen for ipv4; this line is default and implied
listen [::]:80 default ipv6only=on; ## listen for ipv6
# Make site accessible from http://localhost/
server_name localhost;
server_name_in_redirect off;
charset utf-8;
user www-data;
# As a rule of thumb: one per CPU. If you are serving a large number
# of static files, which requires blocking disk reads, you may want
# to increase this beyond the number of CPU cores available on your
# system.
# The maximum number of connections for Nginx is calculated by:
# max_clients = worker_processes * worker_connections
worker_processes 1;
This gist covers a simple Hive eval UDF in Java that mimics Oracle's NVL2 function.
NVL2 handles nulls by conditional substitution: NVL2(expr1, expr2, expr3) returns expr2 when expr1 is not null, and expr3 otherwise.
1. Input data
2. Expected results
3. UDF code in Java
4. Hive query to demo the UDF
5. Output
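The NVL2 rule itself is small enough to sketch in plain Java. This is not the gist's UDF and it does not use the Hive API. The class name Nvl2Sketch, the method nvl2, and the sample values are all hypothetical. It only illustrates the null-handling logic a UDF of this kind would wrap.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of NVL2 semantics; not a Hive UDF.
class Nvl2Sketch {

    // Returns ifNotNull when expr is non-null, otherwise ifNull.
    static <T> T nvl2(Object expr, T ifNotNull, T ifNull) {
        return expr != null ? ifNotNull : ifNull;
    }

    public static void main(String[] args) {
        // Made-up sample column containing one null value.
        List<String> managers = Arrays.asList("alice", null);
        for (String manager : managers) {
            System.out.println(manager + " -> " + nvl2(manager, "has manager", "no manager"));
        }
    }
}
```

In an actual Hive UDF the same comparison would sit inside the function's evaluation method, with Hive's own types in place of Object and T.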
This gist covers a simple Pig eval UDF in Java that mimics Oracle's NVL2 function.
1. Input data
2. UDF code in Java
3. Pig script to demo the UDF
4. Expected result
5. Command to execute script
6. Output
This gist covers the Oozie SSH action.
It includes components of a sample Oozie workflow application: scripts/code,
sample data and commands; Oozie actions covered: secure shell action, email
My blog has documentation, and highlights of a very basic sample program.
This gist includes:
My blog has an introduction to reduce-side joins in Java MapReduce-
This gist details how to inner join two large datasets on the map side, leveraging the join capability
in MapReduce. Such a join makes sense if both input datasets are too large to qualify for distribution
through DistributedCache, and can be implemented if both input datasets can be joined by the join key
and both input datasets are sorted in the same order by the join key.
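To see why the shared sort order matters, here is a minimal plain-Java sketch of a merge-style inner join over two key-sorted inputs. It is an illustration only and does not use Hadoop's join classes. The class name SortedMergeJoinSketch, the method innerJoin, and the {key, value} row layout are assumptions made for the example. Because both sides are sorted by key, one forward pass over each input is enough and neither side has to be held in memory as a lookup table.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a merge join; not Hadoop's map-side join API.
class SortedMergeJoinSketch {

    // Inner-joins two lists of {key, value} rows, both sorted ascending by key.
    // Emits "key,leftValue,rightValue" for every matching pair.
    static List<String> innerJoin(List<String[]> left, List<String[]> right) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            if (cmp < 0) {
                i++; // left key is smaller: no partner on the right, skip it
            } else if (cmp > 0) {
                j++; // right key is smaller: no partner on the left, skip it
            } else {
                String key = left.get(i)[0];
                // Find the end of the run of right rows sharing this key.
                int runEnd = j;
                while (runEnd < right.size() && right.get(runEnd)[0].equals(key)) {
                    runEnd++;
                }
                // Pair every left row with this key against the whole right run.
                while (i < left.size() && left.get(i)[0].equals(key)) {
                    for (int k = j; k < runEnd; k++) {
                        out.add(key + "," + left.get(i)[1] + "," + right.get(k)[1]);
                    }
                    i++;
                }
                j = runEnd;
            }
        }
        return out;
    }
}
```

If either input were unsorted, the pointer advances above would skip matching rows, which is why the sort-order precondition is essential.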
There are two critical pieces to engaging the join behavior:
One more gist related to controlling the number of mappers in a mapreduce task.
Background on InputSplits
An InputSplit is a chunk of the input data allocated to a map task for processing. FileInputFormat
generates InputSplits (and divides them into records): one InputSplit for each file, unless the
A common interview question for a Hadoop developer position is whether we can control the number of
mappers for a job. We can - there are a few ways of controlling the number of mappers, as needed.
Using NLineInputFormat is one way.
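NLineInputFormat gives each mapper a fixed number of input lines, so the mapper count follows from simple arithmetic. The sketch below is plain Java rather than Hadoop code. The class name NLineSplitSketch and the method mapperCount are hypothetical, and it assumes a single input file whose line count is known.

```java
// Hypothetical illustration of the split arithmetic; not Hadoop code.
class NLineSplitSketch {

    // Number of splits (and therefore map tasks) when each split holds
    // at most linesPerSplit lines: the ceiling of totalLines / linesPerSplit.
    static long mapperCount(long totalLines, int linesPerSplit) {
        if (totalLines < 0 || linesPerSplit <= 0) {
            throw new IllegalArgumentException("totalLines must be >= 0 and linesPerSplit > 0");
        }
        return (totalLines + linesPerSplit - 1) / linesPerSplit;
    }

    public static void main(String[] args) {
        // Made-up figures: a 10-line file with 3 lines per split.
        System.out.println(mapperCount(10, 3));
    }
}
```

Lowering the lines-per-split setting raises the number of mappers, and raising it lowers the count.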
About NLineInputFormat