- André Kelpe | @fs111 | andre[at]concurrentinc[dot]com
- works for concurrent inc (http://concurrentinc.com)
- company behind Cacading and Lingual
- Apache Licensed processing framework Big Data applications
- higer level abstraction on top of map/reduce for Apache Hadoop
- in production for years (twitter, ebay, etsy...)
- http://docs.cascading.org/lingual/1.0/
- Open Source, Apache licensed SQL on top of Cascading
- Goals:
- Immediate Data Access
- SQL access via Shell or JDBC driver
- Simplify SQL Migration
- move SQL workflows on your Hadoop Cluster via Cascading flows or JDBC driver
- Simplify System and Data Integration
- Read and write from hdfs, jdbc, memcached
- Immediate Data Access
- CLI install: curl http://files.concurrentinc.com/lingual/1.0/lingual-client/install-lingual-client.sh | bash
- Cascading SDK: http://www.cascading.org/sdk/
- Vagrant Cluster: https://github.com/Cascading/vagrant-cascading-hadoop-cluster
- 4 node hadoop cluster with SDK pre-installed
- we use this during demo
- git clone https://github.com/Cascading/vagrant-cascading-hadoop-cluster.git
- cd vagrant-cascading-hadoop-cluster
- vagrant up
- HDFS: http://master.local:50070/dfshealth.jsp
- Jobtracker: http://master.local:50030/jobtracker.jsp
- manages schemas, tables and stereotypes
- we see that later
- SQLine based SQL shell to run queries
> vagrant ssh master
> sudo hadoop namenode -format
> sudo start-all.sh
--> show cluster
> cd /vagrant
> export LINGUAL_PLATFORM=hadoop
> lingual shell
> !tables
> wget wget http://data.cascading.org/employees.tgz
> tar xf employees.tgz
> hadoop fs -copyFromLocal employees employees
--> show HDFS again
> wget http://data.cascading.org/create-employees.sh
> bash create-employees.sh
--> show catalog on HDFS
> lingual shell
> !tables
> !columns "EMPLOYEES"."TITLES"
> select * from "EMPLOYEES"."TITLES" where TITLE = 'Engineer';
--> show job running on cluster
--> show /results
> select count(*) from "EMPLOYEES"."TITLES" where TITLE = 'Engineer';
> select distinct TITLE from "EMPLOYEES"."TITLES";
> SELECT TITLE, avg(SALARY) AVERAGE_SALARY from "EMPLOYEES"."EMPLOYEES" as e
JOIN "EMPLOYEES"."SALARIES" as s ON e."EMP_NO" = s."EMP_NO"
JOIN EMPLOYEES."TITLES" as t on e."EMP_NO" = t."EMP_NO"
GROUP BY TITLE
ORDER BY 2,1
--> multiple jobs
> lingual catalog --provider -add cascading:cascading-jdbc-derby:2.2.0-+:provider
> lingual catalog --schema working --add
> lingual catalog --schema working --stereotype titles -add --columns TITLE,CNT --types string,int
> lingual catalog --schema working --format derby --add --properties columnnames=title:cnt --provider derby
> lingual catalog --schema working --protocol jdbc --add \
"--properties=tabledesc.tablename=title_counts,tabledesc.columnnames=title:cnt,tabledesc.columndefs=varchar(100) not null:int not null" \
--provider=derby
> lingual catalog --schema working --table title_counts --stereotype titles -add "jdbc:derby://chronos.local:1527/lingualdemo;create=true" --protocol jdbc --format derby
> lingual shell
> !tables
> !columns "working"."title_counts"
> insert into "working"."title_counts" select title, count( title ) as cnt from employees.titles group by title;
> select * from "working"."title_counts";
-> show standard JDBC tool here
- http://concurrentinc.com
- @concurrentinc