fs111/gist:7013230

## gistfile1.md

      
    Raw
  

              gistfile1.md
            
          
    Lingual - ANSI SQL for apache hadoop

Big Data Beers Berlin, October 16th 2013

Speaker


André Kelpe | @fs111 | andre[at]concurrentinc[dot]com
works for concurrent inc (http://concurrentinc.com)

company behind Cacading and Lingual


Cascading


Apache Licensed processing framework Big Data applications
higer level abstraction on top of map/reduce for Apache Hadoop
in production for years (twitter, ebay, etsy...)

Lingual


http://docs.cascading.org/lingual/1.0/

Open Source, Apache licensed SQL on top of Cascading


Goals:

Immediate Data Access

SQL access via Shell or JDBC driver


Simplify SQL Migration

move SQL workflows on your Hadoop Cluster via Cascading flows or JDBC
driver


Simplify System and Data Integration

Read and write from hdfs, jdbc, memcached


Getting lingual


CLI install:
curl http://files.concurrentinc.com/lingual/1.0/lingual-client/install-lingual-client.sh | bash
Cascading SDK: http://www.cascading.org/sdk/
Vagrant Cluster: https://github.com/Cascading/vagrant-cascading-hadoop-cluster

4 node hadoop cluster with SDK pre-installed
we use this during demo
git clone https://github.com/Cascading/vagrant-cascading-hadoop-cluster.git
cd vagrant-cascading-hadoop-cluster
vagrant up
HDFS: http://master.local:50070/dfshealth.jsp
Jobtracker: http://master.local:50030/jobtracker.jsp


Catalog


manages schemas, tables and stereotypes
we see that later

Shell


SQLine based SQL shell to run queries

Demo

> vagrant ssh master
> sudo hadoop namenode -format
> sudo start-all.sh
--> show cluster

> cd /vagrant
> export LINGUAL_PLATFORM=hadoop
> lingual shell 
> !tables
> wget wget http://data.cascading.org/employees.tgz
> tar xf employees.tgz
> hadoop fs -copyFromLocal employees employees
--> show HDFS again

> wget http://data.cascading.org/create-employees.sh
> bash create-employees.sh
--> show catalog on HDFS

> lingual shell
> !tables
> !columns "EMPLOYEES"."TITLES"
> select * from "EMPLOYEES"."TITLES" where TITLE = 'Engineer';
--> show job running on cluster 
--> show /results
> select count(*) from "EMPLOYEES"."TITLES" where TITLE = 'Engineer';
> select distinct TITLE from "EMPLOYEES"."TITLES";

> SELECT TITLE, avg(SALARY) AVERAGE_SALARY from "EMPLOYEES"."EMPLOYEES" as e
   JOIN "EMPLOYEES"."SALARIES" as s ON e."EMP_NO" = s."EMP_NO"
   JOIN EMPLOYEES."TITLES" as t on e."EMP_NO" = t."EMP_NO"
   GROUP BY TITLE
   ORDER BY 2,1
--> multiple jobs

> lingual catalog --provider -add cascading:cascading-jdbc-derby:2.2.0-+:provider
> lingual catalog --schema working --add
> lingual catalog --schema working --stereotype titles -add --columns TITLE,CNT --types string,int
> lingual catalog --schema working --format derby --add --properties columnnames=title:cnt --provider derby
> lingual catalog --schema working --protocol jdbc --add \
      "--properties=tabledesc.tablename=title_counts,tabledesc.columnnames=title:cnt,tabledesc.columndefs=varchar(100) not null:int not null" \
      --provider=derby
> lingual catalog --schema working --table title_counts --stereotype titles -add "jdbc:derby://chronos.local:1527/lingualdemo;create=true" --protocol jdbc  --format derby
> lingual shell
> !tables
> !columns "working"."title_counts"
> insert into "working"."title_counts" select title, count( title ) as cnt from employees.titles group by title;
> select * from "working"."title_counts";
-> show standard JDBC tool here

Links

Concurrent


http://concurrentinc.com
@concurrentinc

Lingual


http://cascading.org/linugal
http://docs.cascading.org/lingual/1.0/
https://github.com/Cascading/vagrant-cascading-hadoop-cluster/
https://github.com/Cascading/cascading-jdbc
https://groups.google.com/forum/#!forum/lingual-user

Cascading


@Cascading
http://cascading.org
http://docs.cascading.org/impatient/
https://groups.google.com/forum/#!forum/cascading-user