@ceteri
Created September 25, 2012 20:20
Cascading user list questions
bash-3.2$ gradle clean jar
:clean
:compileJava
:processResources UP-TO-DATE
:classes
:jar
BUILD SUCCESSFUL
Total time: 4.316 secs
bash-3.2$ more README.md
Cascading for the Impatient, Part 1
===================================
The goal is to create the simplest [Cascading 2.0](http://www.cascading.org/) app possible, while following best practices.
Here's a brief Java program which copies lines of text from file "A" to file "B". We'll keep building on this example until we have a MapReduce implementation of [TF-IDF](http://en.wikipedia.org/wiki/Tf*idf).
More detailed background information and step-by-step documentation is provided at https://github.com/ConcurrentCore/impatient/wiki
Build Instructions
==================
To generate an IntelliJ project use:
gradle ideaModule
To build the sample app from the command line use:
gradle clean jar
Before running this sample app, be sure to set your `HADOOP_HOME` environment variable. Then clear the `output` directory and run on a desktop/laptop with Apache Hadoop in standalone mode:
rm -rf output
hadoop jar ./build/libs/impatient.jar data/rain.txt output/rain
To view the results:
cat output/rain/*
To run the Pig version of the script, make sure `PIG_HOME` is set, then run:
rm -rf output
pig -p inPath=data/rain.txt -p outPath=output/rain ./src/scripts/copy.pig
An example log captured from a successful build+run is at https://gist.github.com/2911686
For more discussion, see the [cascading-user](https://groups.google.com/forum/?fromgroups#!forum/cascading-user) email forum.
Stay tuned for the next installments of our [Cascading for the Impatient](http://www.cascading.org/category/impatient/) series.
bash-3.2$ hadoop jar ./build/libs/impatient.jar
Warning: $HADOOP_HOME is deprecated.
12/09/25 13:15:59 INFO util.HadoopUtil: resolving application jar from found main method on: impatient.Main
12/09/25 13:15:59 INFO planner.HadoopPlanner: using application jar: /Users/ceteri/src/concur/users/./build/libs/impatient.jar
12/09/25 13:15:59 INFO property.AppProps: using app.id: 309C6CC74EEA75F3A042DB2EA4835D06
2012-09-25 13:15:59.361 java[30572:1903] Unable to load realm info from SCDynamicStore
12/09/25 13:15:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/09/25 13:15:59 WARN snappy.LoadSnappy: Snappy native library not loaded
12/09/25 13:15:59 INFO mapred.FileInputFormat: Total input paths to process : 1
12/09/25 13:15:59 INFO util.Version: Concurrent, Inc - Cascading 2.0.1
12/09/25 13:15:59 INFO flow.Flow: [] starting
12/09/25 13:15:59 INFO flow.Flow: [] source: Hfs["TextDelimited[['doc_id', 'text']->[ALL]]"]["data/rain.txt"]"]
12/09/25 13:15:59 INFO flow.Flow: [] sink: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'text']]"]["data/out.txt"]"]
12/09/25 13:15:59 INFO flow.Flow: [] parallel execution is enabled: false
12/09/25 13:15:59 INFO flow.Flow: [] starting jobs: 1
12/09/25 13:15:59 INFO flow.Flow: [] allocating threads: 1
12/09/25 13:15:59 INFO flow.FlowStep: [] starting step: (1/1) data/out.txt
12/09/25 13:15:59 INFO mapred.FileInputFormat: Total input paths to process : 1
12/09/25 13:15:59 INFO flow.FlowStep: [] submitted hadoop job: job_local_0001
12/09/25 13:15:59 INFO mapred.Task: Using ResourceCalculatorPlugin : null
12/09/25 13:15:59 INFO io.MultiInputSplit: current split input path: file:/Users/ceteri/src/concur/users/data/rain.txt
12/09/25 13:15:59 INFO mapred.MapTask: numReduceTasks: 0
12/09/25 13:15:59 INFO hadoop.FlowMapper: sourcing from: Hfs["TextDelimited[['doc_id', 'text']->[ALL]]"]["data/rain.txt"]"]
12/09/25 13:15:59 INFO hadoop.FlowMapper: sinking to: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'text']]"]["data/out.txt"]"]
12/09/25 13:15:59 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/09/25 13:15:59 INFO mapred.LocalJobRunner:
12/09/25 13:15:59 INFO mapred.Task: Task attempt_local_0001_m_000000_0 is allowed to commit now
12/09/25 13:15:59 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0001_m_000000_0' to file:/Users/ceteri/src/concur/users/data/out.txt
12/09/25 13:16:02 INFO mapred.LocalJobRunner: file:/Users/ceteri/src/concur/users/data/rain.txt:0+510
12/09/25 13:16:02 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
12/09/25 13:16:04 INFO util.Hadoop18TapUtil: deleting temp path data/out.txt/_temporary
bash-3.2$ ls
LICENSE.txt README.md build build.gradle data docs dot src
bash-3.2$ more data/
out.txt/ rain.txt
bash-3.2$ more data/
out.txt/ rain.txt
bash-3.2$ more data/out.txt/
._SUCCESS.crc .part-00000.crc _SUCCESS part-00000
bash-3.2$ more data/out.txt/part-00000
doc_id text
doc01 A rain shadow is a dry area on the lee back side of a mountainous area.
doc02 This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03 A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04 This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05 Two Women. Secrets. A Broken Land. [DVD Australia]
bash-3.2$ fg
emacs src/main/java/impatient/Main.java
[1]+ Stopped emacs src/main/java/impatient/Main.java
bash-3.2$ cat src/main/java/impatient/Main.java
/*
* Copyright (c) 2007-2012 Concurrent, Inc. All Rights Reserved.
*
* Project and contact information: http://www.cascading.org/
*
* This file is part of the Cascading project.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package impatient;

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class
  Main
  {
  public static void
  main( String[] args )
    {
    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector();

    // create the source tap
    Tap inTap = new Hfs( new TextDelimited( true, "\t" ), "./data/rain.txt" );

    // create the sink tap
    Tap outTap = new Hfs( new TextDelimited( true, "\t" ), "./data/out.txt" );

    // specify a pipe to connect the taps
    Pipe copyPipe = new Pipe( "copy" );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef()
      .addSource( copyPipe, inTap )
      .addTailSink( copyPipe, outTap );

    // run the flow
    Flow flow = flowConnector.connect( flowDef );
    flow.complete();
    }
  }
bash-3.2$
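For comparison, here is a rough plain-Java sketch of what the copy flow above does to the data: it reads a tab-delimited file with a header row and writes the same rows into an output directory as `part-00000`, matching the layout seen in `data/out.txt/`. This is only an analogue for illustration, not how Cascading or Hadoop implement it. The class name `PlainCopy` and the method `copy` are hypothetical, and unlike a Hadoop job this sketch overwrites an existing part file instead of failing.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Plain-Java analogue of the copy flow (illustration only; no Cascading or Hadoop involved).
public class PlainCopy
  {
  // Hypothetical helper: copies a tab-delimited file with a header row into
  // outDir/part-00000 and returns the number of data rows copied (header excluded).
  public static int copy( Path inFile, Path outDir ) throws IOException
    {
    Files.createDirectories( outDir );
    Path partFile = outDir.resolve( "part-00000" );
    int rows = 0;

    try( BufferedReader reader = Files.newBufferedReader( inFile, StandardCharsets.UTF_8 );
         BufferedWriter writer = Files.newBufferedWriter( partFile, StandardCharsets.UTF_8 ) )
      {
      // the first line is the header, e.g. "doc_id<TAB>text"
      String header = reader.readLine();

      if( header == null )
        return 0; // empty input: nothing to copy

      writer.write( header );
      writer.newLine();

      String line;

      while( ( line = reader.readLine() ) != null )
        {
        // split into fields and re-join, mirroring a parse-then-emit round trip
        String[] fields = line.split( "\t", -1 );
        writer.write( String.join( "\t", fields ) );
        writer.newLine();
        rows++;
        }
      }

    return rows;
    }

  public static void main( String[] args ) throws IOException
    {
    if( args.length != 2 )
      {
      System.err.println( "usage: PlainCopy <input file> <output dir>" );
      System.exit( 1 );
      }

    int rows = copy( Paths.get( args[ 0 ] ), Paths.get( args[ 1 ] ) );
    System.out.println( "copied " + rows + " rows" );
    }
  }
```

Assuming the class is compiled on its own, it could be run as `java PlainCopy data/rain.txt output/rain`, which takes the input and output paths from the command line the way the README's `hadoop jar` example does, whereas the `Main.java` shown above hardcodes both paths.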