@ceteri
Created June 11, 2012 18:09
Cascading for the Impatient, Part 1
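package impatient;

import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class Main
  {
  public static void main( String[] args )
    {
    // input and output paths come from the command line
    String inPath = args[ 0 ];
    String outPath = args[ 1 ];

    // tell Cascading which jar holds the application, then build a connector
    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create the source tap (tab-delimited text with a header row)
    Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

    // create the sink tap (same format, header row written out)
    Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

    // specify a pipe to connect the taps
    Pipe copyPipe = new Pipe( "copy" );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef()
      .addSource( copyPipe, inTap )
      .addTailSink( copyPipe, outTap );

    // run the flow
    flowConnector.connect( flowDef ).complete();
    }
  }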
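For comparison, here is a rough plain-Java sketch of what the copy flow above amounts to on a single local file: read a tab-delimited file with a header row, then write the same fields back out with the header. It is not how Cascading implements TextDelimited or Hfs. The class name TsvCopySketch and the method copy are hypothetical names made up for this illustration, and rejecting rows whose field count differs from the header is a choice of the sketch. It also writes one output file, whereas the Hadoop run writes a directory containing part files.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical stand-alone illustration; uses only the Java standard library.
public class TsvCopySketch {

    /** Copies a tab-delimited file with a header row; returns the number of data rows written. */
    public static int copy(Path in, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {

            // The first line names the fields, e.g. doc_id and text.
            String header = reader.readLine();
            if (header == null) {
                return 0; // empty input: nothing to copy
            }
            String[] fieldNames = header.split("\t", -1);
            writer.write(String.join("\t", fieldNames));
            writer.newLine();

            int rows = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                // Split into fields; the -1 limit keeps trailing empty fields.
                String[] values = line.split("\t", -1);
                if (values.length != fieldNames.length) {
                    // Assumption of this sketch: mismatched rows are treated as errors.
                    throw new IOException("expected " + fieldNames.length
                        + " fields but found " + values.length + ": " + line);
                }
                writer.write(String.join("\t", values));
                writer.newLine();
                rows++;
            }
            return rows;
        }
    }

    public static void main(String[] args) throws IOException {
        int rows = copy(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("copied " + rows + " data rows");
    }
}
```

The point of the comparison: the Cascading version describes the same copy as taps and a pipe, so the planner can run it as a Hadoop job, while this sketch only works on one local file.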
copyPipe = LOAD '$inPath' USING PigStorage('\t', 'tagsource');
STORE copyPipe INTO '$outPath' USING PigStorage('\t', 'tagsource');
doc_id text
doc01 A rain shadow is a dry area on the lee back side of a mountainous area.
doc02 This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03 A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04 This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05 Two Women. Secrets. A Broken Land. [DVD Australia]
bash-3.2$ ls -lth
total 32
-rw-r--r-- 1 paco staff 1.7K Jun 28 15:14 build.gradle
-rw-r--r-- 1 paco staff 819B Jun 28 15:14 LICENSE.txt
-rw-r--r-- 1 paco staff 5.2K Jun 27 15:54 README.md
drwxr-xr-x 3 paco staff 102B Jun 26 14:46 src
drwxr-xr-x 3 paco staff 102B Jun 11 10:18 data
bash-3.2$ gradle -version
------------------------------------------------------------
Gradle 1.0
------------------------------------------------------------
Gradle build time: Tuesday, June 12, 2012 12:56:21 AM UTC
Groovy: 1.8.6
Ant: Apache Ant(TM) version 1.8.2 compiled on December 20 2010
Ivy: 2.2.0
JVM: 1.6.0_33 (Apple Inc. 20.8-b03-424)
OS: Mac OS X 10.6.8 x86_64
bash-3.2$ hadoop version
Warning: $HADOOP_HOME is deprecated.
Hadoop 1.0.3
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1335192
Compiled by hortonfo on Tue May 8 20:31:25 UTC 2012
From source with checksum e6b0c1e23dcf76907c5fecb4b832f3be
bash-3.2$ gradle clean jar
:clean UP-TO-DATE
:compileJava
:processResources UP-TO-DATE
:classes
:jar
BUILD SUCCESSFUL
Total time: 16.061 secs
bash-3.2$ hadoop jar ./build/libs/impatient.jar data/rain.txt output/rain
Warning: $HADOOP_HOME is deprecated.
12/06/29 09:01:55 INFO util.HadoopUtil: resolving application jar from found main method on: impatient.Main
12/06/29 09:01:55 INFO planner.HadoopPlanner: using application jar: /Users/paco/src/concur/impatient/part1/./build/libs/impatient.jar
12/06/29 09:01:55 INFO property.AppProps: using app.id: FEE428FA32D899D051AA404BA448DE3A
12/06/29 09:01:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/06/29 09:01:55 WARN snappy.LoadSnappy: Snappy native library not loaded
12/06/29 09:01:55 INFO mapred.FileInputFormat: Total input paths to process : 1
12/06/29 09:01:56 INFO util.Version: Concurrent, Inc - Cascading 2.0.1
12/06/29 09:01:56 INFO flow.Flow: [] starting
12/06/29 09:01:56 INFO flow.Flow: [] source: Hfs["TextDelimited[['doc_id', 'text']->[ALL]]"]["data/rain.txt"]"]
12/06/29 09:01:56 INFO flow.Flow: [] sink: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'text']]"]["output/rain"]"]
12/06/29 09:01:56 INFO flow.Flow: [] parallel execution is enabled: false
12/06/29 09:01:56 INFO flow.Flow: [] starting jobs: 1
12/06/29 09:01:56 INFO flow.Flow: [] allocating threads: 1
12/06/29 09:01:56 INFO flow.FlowStep: [] starting step: (1/1) output/rain
12/06/29 09:01:56 INFO mapred.FileInputFormat: Total input paths to process : 1
12/06/29 09:01:56 INFO flow.FlowStep: [] submitted hadoop job: job_local_0001
12/06/29 09:01:56 INFO mapred.Task: Using ResourceCalculatorPlugin : null
12/06/29 09:01:56 INFO io.MultiInputSplit: current split input path: file:/Users/paco/src/concur/impatient/part1/data/rain.txt
12/06/29 09:01:56 INFO mapred.MapTask: numReduceTasks: 0
12/06/29 09:01:56 INFO hadoop.FlowMapper: sourcing from: Hfs["TextDelimited[['doc_id', 'text']->[ALL]]"]["data/rain.txt"]"]
12/06/29 09:01:56 INFO hadoop.FlowMapper: sinking to: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'text']]"]["output/rain"]"]
12/06/29 09:01:56 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/06/29 09:01:56 INFO mapred.LocalJobRunner:
12/06/29 09:01:56 INFO mapred.Task: Task attempt_local_0001_m_000000_0 is allowed to commit now
12/06/29 09:01:56 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0001_m_000000_0' to file:/Users/paco/src/concur/impatient/part1/output/rain
12/06/29 09:01:59 INFO mapred.LocalJobRunner: file:/Users/paco/src/concur/impatient/part1/data/rain.txt:0+510
12/06/29 09:01:59 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
12/06/29 09:02:01 INFO util.Hadoop18TapUtil: deleting temp path output/rain/_temporary
bash-3.2$
bash-3.2$ head output/rain/part-00000
doc_id text
doc01 A rain shadow is a dry area on the lee back side of a mountainous area.
doc02 This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03 A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04 This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05 Two Women. Secrets. A Broken Land. [DVD Australia]
bash-3.2$
bash-3.2$ pig -version
Warning: $HADOOP_HOME is deprecated.
Apache Pig version 0.10.0 (r1328203)
compiled Apr 19 2012, 22:54:12
bash-3.2$ pig -p inPath=./data/rain.txt -p outPath=./output/rain ./src/scripts/copy.pig
Warning: $HADOOP_HOME is deprecated.
2012-08-27 13:24:21,632 [main] INFO org.apache.pig.Main - Apache Pig version 0.10.0 (r1328203) compiled Apr 19 2012, 22:54:12
2012-08-27 13:24:21,633 [main] INFO org.apache.pig.Main - Logging error messages to: /Users/ceteri/src/concur/Impatient/part1/pig_1346099061629.log
2012-08-27 13:24:21.724 java[69946:1903] Unable to load realm info from SCDynamicStore
2012-08-27 13:24:21,931 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2012-08-27 13:24:22,261 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2012-08-27 13:24:22,340 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2012-08-27 13:24:22,355 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2012-08-27 13:24:22,355 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2012-08-27 13:24:22,373 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-08-27 13:24:22,384 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-08-27 13:24:22,386 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job8693841438339640396.jar
2012-08-27 13:24:26,339 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job8693841438339640396.jar created
2012-08-27 13:24:26,350 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-08-27 13:24:26,369 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-08-27 13:24:26,377 [Thread-5] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2012-08-27 13:24:26,481 [Thread-5] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-08-27 13:24:26,481 [Thread-5] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2012-08-27 13:24:26,489 [Thread-5] WARN org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library not loaded
2012-08-27 13:24:26,492 [Thread-5] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2012-08-27 13:24:26,674 [Thread-6] INFO org.apache.hadoop.mapred.Task - Using ResourceCalculatorPlugin : null
2012-08-27 13:24:26,686 [Thread-6] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader - Current split being processed file:/Users/ceteri/src/concur/Impatient/part1/data/rain.txt:0+510
2012-08-27 13:24:26,717 [Thread-6] INFO org.apache.hadoop.mapred.Task - Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
2012-08-27 13:24:26,720 [Thread-6] INFO org.apache.hadoop.mapred.LocalJobRunner -
2012-08-27 13:24:26,720 [Thread-6] INFO org.apache.hadoop.mapred.Task - Task attempt_local_0001_m_000000_0 is allowed to commit now
2012-08-27 13:24:26,722 [Thread-6] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - Saved output of task 'attempt_local_0001_m_000000_0' to file:/Users/ceteri/src/concur/Impatient/part1/output/rain
2012-08-27 13:24:26,871 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local_0001
2012-08-27 13:24:26,871 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2012-08-27 13:24:29,657 [Thread-6] INFO org.apache.hadoop.mapred.LocalJobRunner -
2012-08-27 13:24:29,657 [Thread-6] INFO org.apache.hadoop.mapred.Task - Task 'attempt_local_0001_m_000000_0' done.
2012-08-27 13:24:29,658 [Thread-6] WARN org.apache.hadoop.mapred.FileOutputCommitter - Output path is null in cleanup
2012-08-27 13:24:31,882 [main] WARN org.apache.pig.tools.pigstats.PigStatsUtil - Failed to get RunningJob for job job_local_0001
2012-08-27 13:24:31,885 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2012-08-27 13:24:31,886 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.0.3 0.10.0 ceteri 2012-08-27 13:24:22 2012-08-27 13:24:31 UNKNOWN
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs
job_local_0001 1 0 n/a n/a n/a 0 0 0 copyPipe MAP_ONLY file:///Users/ceteri/src/concur/Impatient/part1/output/rain,
Input(s):
Successfully read 0 records from: "file:///Users/ceteri/src/concur/Impatient/part1/data/rain.txt"
Output(s):
Successfully stored 0 records in: "file:///Users/ceteri/src/concur/Impatient/part1/output/rain"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_local_0001
2012-08-27 13:24:31,887 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
bash-3.2$ cat output/rain/part-m-00000
doc_id text
doc01 A rain shadow is a dry area on the lee back side of a mountainous area.
doc02 This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03 A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04 This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05 Two Women. Secrets. A Broken Land. [DVD Australia]
bash-3.2$