Running Kudu with the MapReduce framework (lightning talk at Cloudera World Tokyo)

Kudu

What's Kudu?

  • From http://getkudu.io/
    • Kudu completes Hadoop's storage layer to enable fast analytics on fast data.
    • A distributed, insertable/updatable columnar store.
    • Schema on write.
    • Complements Hadoop/HDFS and HBase.

Community

Build from source

$ sudo apt-get -y install git autoconf automake libboost-thread-dev curl gcc g++ \
  libssl-dev libsasl2-dev libtool ntp
$ sudo apt-get -y install asciidoctor xsltproc   # only needed for `make docs`
$ git clone https://github.com/cloudera/kudu
$ cd kudu
$ thirdparty/build-if-necessary.sh               # build the bundled third-party dependencies
$ thirdparty/installed/bin/cmake . -DCMAKE_BUILD_TYPE=release -DCMAKE_INSTALL_PREFIX=/hadoop1/build/opt/kudu
$ make -j4
$ make DESTDIR=/hadoop1/build/opt/kudu install
$ make docs

Installing Kudu from deb packages

$ sudo wget http://archive.cloudera.com/beta/kudu/ubuntu/trusty/amd64/kudu/cloudera.list -O /etc/apt/sources.list.d/cloudera.list
$ sudo apt-get update
$ sudo apt-get install kudu                     # Base Kudu files
$ sudo apt-get install kudu-master              # Service scripts for managing kudu-master
$ sudo apt-get install kudu-tserver             # Service scripts for managing kudu-tserver
$ sudo apt-get install libkuduclient0           # Kudu C++ client shared library
$ sudo apt-get install libkuduclient-dev        # Kudu C++ client SDK

Running Kudu daemons

$ sudo service kudu-master start
$ sudo service kudu-tserver start
$ sudo ps aux | grep kudu
kudu     11348  0.1  0.1 455092 15744 ?        Sl   Nov09   0:22 /usr/lib/kudu/sbin/kudu-master --flagfile=/etc/kudu/conf/master.gflagfile
kudu     11424  0.1  0.0 1016388 6828 ?        Sl   Nov09   0:22 /usr/lib/kudu/sbin/kudu-tserver --flagfile=/etc/kudu/conf/tserver.gflagfile

Writing client applications

The Java client and its MapReduce integration live under the java/ directory of the Kudu source tree:

.
|-- kudu-client
|-- kudu-client-tools
|-- kudu-csd
|-- kudu-mapreduce
$ cd java
$ mvn package -DskipTests
$ cp kudu-client-tools/target/kudu-client-tools-0.6.0-SNAPSHOT-jar-with-dependencies.jar $UNDER_HADOOP_CLASSPATH   # any directory on the Hadoop classpath
$ hadoop jar share/hadoop/mapreduce/kudu-client-tools-0.6.0-SNAPSHOT-jar-with-dependencies.jar org.kududb.mapreduce.tools.ImportCsv
ERROR: Wrong number of arguments: 0
Usage: importcsv <colAa,colB,colC> <table.name> <input.dir>

Imports the given input directory of CSV data into the specified table.

The column names of the CSV data must be specified in the form of comma-separated column names.
Other options that may be specified with -D include:
  -Dimportcsv.skip.bad.lines=false - fail if encountering an invalid line
  '-Dimportcsv.separator=|' - eg separate on pipes instead of tabs
  -Dimportcsv.job.name=jobName - use the specified mapreduce job name for the import.

Additionally, the following options are available:
  -Dkudu.operation.timeout.ms=TIME - timeout for read and write operations, defaults to 10000
  -Dkudu.admin.operation.timeout.ms=TIME - timeout for admin operations, defaults to 10000
  -Dkudu.socket.read.timeout.ms=TIME - timeout for socket reads, defaults to 5000
  -Dkudu.master.addresses=ADDRESSES - addresses to reach the Masters, defaults to 127.0.0.1 which is usually wrong.
  -Dkudu.num.replicas=NUM - number of replicas to use when configuring a new table, defaults to 3
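
ImportCsv drives Kudu through the kudu-mapreduce output format, but the same write path can be exercised directly from kudu-client. Below is a minimal standalone writer; this is only a sketch, assuming the 0.6.0-era org.kududb.client API and a hypothetical key/value schema for test1 (the table must already exist, which is exactly the problem hit below; see the CREATE TABLE notes further down).

import org.kududb.client.*;

public class InsertSketch {
  public static void main(String[] args) throws Exception {
    // Connect to the Kudu master (7051 is the default master RPC port).
    KuduClient client = new KuduClient.KuduClientBuilder("127.0.0.1:7051").build();
    try {
      KuduTable table = client.openTable("test1");   // fails if the table does not exist
      KuduSession session = client.newSession();
      Insert insert = table.newInsert();
      PartialRow row = insert.getRow();
      row.addInt("key", 1);                          // hypothetical column names
      row.addString("value", "hello kudu");
      session.apply(insert);                         // flushed immediately under the default AUTO_FLUSH_SYNC mode
      session.close();
    } finally {
      client.shutdown();
    }
  }
}

Running ImportCsv with real arguments hits that "table does not exist" problem at job-submission time: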
$ hadoop jar share/hadoop/mapreduce/kudu-client-tools-0.6.0-SNAPSHOT-jar-with-dependencies.jar org.kududb.mapreduce.tools.ImportCsv "1,2,3" test1 hdfs://127.0.0.1:50070//user/ubuntu/csvdata/MonthlyPassengerData_200507_to_201506.csv
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/ubuntu/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/ubuntu/hadoop/share/hadoop/mapreduce/kudu-client-tools-0.6.0-SNAPSHOT-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/ubuntu/tez/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/11/10 03:22:41 INFO impl.TimelineClientImpl: Timeline service address: http://0.0.0.0:8188/ws/v1/timeline/
15/11/10 03:22:41 INFO client.RMProxy: Connecting to ResourceManager at /172.31.15.42:8081
15/11/10 03:22:41 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200
15/11/10 03:22:41 INFO client.AsyncKuduClient: Discovered tablet Kudu Master for table Kudu Master with partition ["", "")
Exception in thread "main" java.lang.RuntimeException: Could not obtain the table from the master, is the master running and is this table created? tablename=test1 and master address= 127.0.0.1
	at org.kududb.mapreduce.KuduTableOutputFormat.setConf(KuduTableOutputFormat.java:114)
	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:76)
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:136)
	at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:559)
	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:432)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1306)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1303)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:1303)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1324)
	at org.kududb.mapreduce.tools.ImportCsv.run(ImportCsv.java:110)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
	at org.kududb.mapreduce.tools.ImportCsv.main(ImportCsv.java:114)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
  • How can we create the table?

  • Currently, CREATE TABLE can be done via impala-shell (a Java-client alternative is sketched after this list).

  • impala-kudu is the Kudu-enabled build of Impala that provides this.

  • Even the "Running from Spark" example still does its CREATE TABLE through Impala commands.

  • The documentation shipped with the examples is a good reference.

  • We can add it ourselves, since it's open source!
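
The notes above say CREATE TABLE currently goes through impala-shell; the Java client also exposes table creation directly, which is enough to unblock ImportCsv. Again a sketch, assuming the 0.6.0-era org.kududb API (later releases require an extra CreateTableOptions argument) and the same hypothetical key/value schema as the insert sketch:

import java.util.ArrayList;
import java.util.List;

import org.kududb.ColumnSchema;
import org.kududb.Schema;
import org.kududb.Type;
import org.kududb.client.KuduClient;

public class CreateTableSketch {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder("127.0.0.1:7051").build();
    try {
      // Hypothetical schema matching the insert sketch above.
      List<ColumnSchema> columns = new ArrayList<ColumnSchema>();
      columns.add(new ColumnSchema.ColumnSchemaBuilder("key", Type.INT32).key(true).build());
      columns.add(new ColumnSchema.ColumnSchemaBuilder("value", Type.STRING).build());
      client.createTable("test1", new Schema(columns));
    } finally {
      client.shutdown();
    }
  }
}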

Links
