Created
September 25, 2012 20:20
Cascading user list questions
bash-3.2$ gradle clean jar
:clean
:compileJava
:processResources UP-TO-DATE
:classes
:jar
BUILD SUCCESSFUL
Total time: 4.316 secs
bash-3.2$ more README.md
Cascading for the Impatient, Part 1
===================================
The goal is to create the simplest [Cascading 2.0](http://www.cascading.org/) app possible, while following best practices.

Here's a brief Java program which copies lines of text from file "A" to file "B". We'll keep building on this example until we have a MapReduce implementation of [TF-IDF](http://en.wikipedia.org/wiki/Tf*idf).

More detailed background information and step-by-step documentation are provided at https://github.com/ConcurrentCore/impatient/wiki

Build Instructions
==================
To generate an IntelliJ project use:

    gradle ideaModule

To build the sample app from the command line use:

    gradle clean jar

Before running this sample app, be sure to set your `HADOOP_HOME` environment variable. Then clear the `output` directory and run the app on a desktop/laptop with Apache Hadoop in standalone mode:

    rm -rf output
    hadoop jar ./build/libs/impatient.jar data/rain.txt output/rain

To view the results:

    cat output/rain/*

To run the Pig version of the script, make sure `PIG_HOME` is set and run:

    rm -rf output
    pig -p inPath=data/rain.txt -p outPath=output/rain ./src/scripts/copy.pig

An example log captured from a successful build+run is at https://gist.github.com/2911686

For more discussion, see the [cascading-user](https://groups.google.com/forum/?fromgroups#!forum/cascading-user) email forum.

Stay tuned for the next installments of our [Cascading for the Impatient](http://www.cascading.org/category/impatient/) series.
bash-3.2$ hadoop jar ./build/libs/impatient.jar
Warning: $HADOOP_HOME is deprecated.
12/09/25 13:15:59 INFO util.HadoopUtil: resolving application jar from found main method on: impatient.Main
12/09/25 13:15:59 INFO planner.HadoopPlanner: using application jar: /Users/ceteri/src/concur/users/./build/libs/impatient.jar
12/09/25 13:15:59 INFO property.AppProps: using app.id: 309C6CC74EEA75F3A042DB2EA4835D06
2012-09-25 13:15:59.361 java[30572:1903] Unable to load realm info from SCDynamicStore
12/09/25 13:15:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/09/25 13:15:59 WARN snappy.LoadSnappy: Snappy native library not loaded
12/09/25 13:15:59 INFO mapred.FileInputFormat: Total input paths to process : 1
12/09/25 13:15:59 INFO util.Version: Concurrent, Inc - Cascading 2.0.1
12/09/25 13:15:59 INFO flow.Flow: [] starting
12/09/25 13:15:59 INFO flow.Flow: [] source: Hfs["TextDelimited[['doc_id', 'text']->[ALL]]"]["data/rain.txt"]"]
12/09/25 13:15:59 INFO flow.Flow: [] sink: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'text']]"]["data/out.txt"]"]
12/09/25 13:15:59 INFO flow.Flow: [] parallel execution is enabled: false
12/09/25 13:15:59 INFO flow.Flow: [] starting jobs: 1
12/09/25 13:15:59 INFO flow.Flow: [] allocating threads: 1
12/09/25 13:15:59 INFO flow.FlowStep: [] starting step: (1/1) data/out.txt
12/09/25 13:15:59 INFO mapred.FileInputFormat: Total input paths to process : 1
12/09/25 13:15:59 INFO flow.FlowStep: [] submitted hadoop job: job_local_0001
12/09/25 13:15:59 INFO mapred.Task: Using ResourceCalculatorPlugin : null
12/09/25 13:15:59 INFO io.MultiInputSplit: current split input path: file:/Users/ceteri/src/concur/users/data/rain.txt
12/09/25 13:15:59 INFO mapred.MapTask: numReduceTasks: 0
12/09/25 13:15:59 INFO hadoop.FlowMapper: sourcing from: Hfs["TextDelimited[['doc_id', 'text']->[ALL]]"]["data/rain.txt"]"]
12/09/25 13:15:59 INFO hadoop.FlowMapper: sinking to: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'text']]"]["data/out.txt"]"]
12/09/25 13:15:59 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/09/25 13:15:59 INFO mapred.LocalJobRunner:
12/09/25 13:15:59 INFO mapred.Task: Task attempt_local_0001_m_000000_0 is allowed to commit now
12/09/25 13:15:59 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0001_m_000000_0' to file:/Users/ceteri/src/concur/users/data/out.txt
12/09/25 13:16:02 INFO mapred.LocalJobRunner: file:/Users/ceteri/src/concur/users/data/rain.txt:0+510
12/09/25 13:16:02 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
12/09/25 13:16:04 INFO util.Hadoop18TapUtil: deleting temp path data/out.txt/_temporary
bash-3.2$ ls
LICENSE.txt README.md build build.gradle data docs dot src
bash-3.2$ more data/
out.txt/ rain.txt
bash-3.2$ more data/
out.txt/ rain.txt
bash-3.2$ more data/out.txt/
._SUCCESS.crc .part-00000.crc _SUCCESS part-00000
bash-3.2$ more data/out.txt/part-00000
doc_id text
doc01 A rain shadow is a dry area on the lee back side of a mountainous area.
doc02 This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03 A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04 This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05 Two Women. Secrets. A Broken Land. [DVD Australia]
bash-3.2$ fg
emacs src/main/java/impatient/Main.java
[1]+ Stopped emacs src/main/java/impatient/Main.java
bash-3.2$ cat src/main/java/impatient/Main.java
/*
 * Copyright (c) 2007-2012 Concurrent, Inc. All Rights Reserved.
 *
 * Project and contact information: http://www.cascading.org/
 *
 * This file is part of the Cascading project.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package impatient;

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class
  Main
  {
  public static void
  main( String[] args )
    {
    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector();

    // create the source tap
    Tap inTap = new Hfs( new TextDelimited( true, "\t" ), "./data/rain.txt" );

    // create the sink tap
    Tap outTap = new Hfs( new TextDelimited( true, "\t" ), "./data/out.txt" );

    // specify a pipe to connect the taps
    Pipe copyPipe = new Pipe( "copy" );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef()
      .addSource( copyPipe, inTap )
      .addTailSink( copyPipe, outTap );

    // run the flow
    Flow flow = flowConnector.connect( flowDef );
    flow.complete();
    }
  }
bash-3.2$
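
The transcript shows that the sink path `data/out.txt` ends up as a directory holding `part-00000` alongside `_SUCCESS` and `.crc` bookkeeping files, which is why the README views results with `cat output/rain/*`. As a rough illustration only, the same read-back could be sketched in plain Java as below. The class name `PartFiles` and the method `readLines` are hypothetical names chosen for this example and are not part of Cascading or Hadoop; the default path `data/out.txt` is taken from the run above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not part of Cascading or Hadoop.
class PartFiles
  {
  // Returns the lines of every "part-*" file directly inside outputDir,
  // visiting the files in name order and ignoring marker files such as
  // _SUCCESS and the hidden .crc checksum files.
  static List<String> readLines( Path outputDir ) throws IOException
    {
    List<Path> parts;

    try( Stream<Path> entries = Files.list( outputDir ) )
      {
      parts = entries
        .filter( p -> p.getFileName().toString().startsWith( "part-" ) )
        .sorted()
        .collect( Collectors.toList() );
      }

    List<String> lines = new ArrayList<>();

    for( Path part : parts )
      lines.addAll( Files.readAllLines( part, StandardCharsets.UTF_8 ) );

    return lines;
    }

  public static void main( String[] args ) throws IOException
    {
    // Assumption: default to the sink path used in the run above.
    Path outputDir = Paths.get( args.length > 0 ? args[ 0 ] : "data/out.txt" );

    for( String line : readLines( outputDir ) )
      System.out.println( line );
    }
  }
```

Run against the output of the job above, this should print the same lines that `more data/out.txt/part-00000` displays. Filtering on the `part-` prefix is a deliberate simplification: it relies on the file naming seen in the directory listing rather than on any Hadoop API.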