Skip to content

Instantly share code, notes, and snippets.

@fs111
Created October 16, 2013 19:18
Show Gist options
  • Save fs111/7013230 to your computer and use it in GitHub Desktop.
Save fs111/7013230 to your computer and use it in GitHub Desktop.
These are the notes from my talk about lingual during the big data beer meetup in Berlin: http://www.meetup.com/Big-Data-Beers/events/143392512/

Lingual - ANSI SQL for apache hadoop

Big Data Beers Berlin, October 16th 2013

Speaker

  • André Kelpe | @fs111 | andre[at]concurrentinc[dot]com
  • works for concurrent inc (http://concurrentinc.com)
    • company behind Cacading and Lingual

Cascading

  • Apache Licensed processing framework Big Data applications
  • higer level abstraction on top of map/reduce for Apache Hadoop
  • in production for years (twitter, ebay, etsy...)

Lingual

  • http://docs.cascading.org/lingual/1.0/
    • Open Source, Apache licensed SQL on top of Cascading
  • Goals:
    • Immediate Data Access
      • SQL access via Shell or JDBC driver
    • Simplify SQL Migration
      • move SQL workflows on your Hadoop Cluster via Cascading flows or JDBC driver
    • Simplify System and Data Integration
      • Read and write from hdfs, jdbc, memcached

Getting lingual

Catalog

  • manages schemas, tables and stereotypes
  • we see that later

Shell

  • SQLine based SQL shell to run queries

Demo

> vagrant ssh master
> sudo hadoop namenode -format
> sudo start-all.sh
--> show cluster

> cd /vagrant
> export LINGUAL_PLATFORM=hadoop
> lingual shell 
> !tables
> wget wget http://data.cascading.org/employees.tgz
> tar xf employees.tgz
> hadoop fs -copyFromLocal employees employees
--> show HDFS again

> wget http://data.cascading.org/create-employees.sh
> bash create-employees.sh
--> show catalog on HDFS

> lingual shell
> !tables
> !columns "EMPLOYEES"."TITLES"
> select * from "EMPLOYEES"."TITLES" where TITLE = 'Engineer';
--> show job running on cluster 
--> show /results
> select count(*) from "EMPLOYEES"."TITLES" where TITLE = 'Engineer';
> select distinct TITLE from "EMPLOYEES"."TITLES";

> SELECT TITLE, avg(SALARY) AVERAGE_SALARY from "EMPLOYEES"."EMPLOYEES" as e
   JOIN "EMPLOYEES"."SALARIES" as s ON e."EMP_NO" = s."EMP_NO"
   JOIN EMPLOYEES."TITLES" as t on e."EMP_NO" = t."EMP_NO"
   GROUP BY TITLE
   ORDER BY 2,1
--> multiple jobs

> lingual catalog --provider -add cascading:cascading-jdbc-derby:2.2.0-+:provider
> lingual catalog --schema working --add
> lingual catalog --schema working --stereotype titles -add --columns TITLE,CNT --types string,int
> lingual catalog --schema working --format derby --add --properties columnnames=title:cnt --provider derby
> lingual catalog --schema working --protocol jdbc --add \
      "--properties=tabledesc.tablename=title_counts,tabledesc.columnnames=title:cnt,tabledesc.columndefs=varchar(100) not null:int not null" \
      --provider=derby
> lingual catalog --schema working --table title_counts --stereotype titles -add "jdbc:derby://chronos.local:1527/lingualdemo;create=true" --protocol jdbc  --format derby
> lingual shell
> !tables
> !columns "working"."title_counts"
> insert into "working"."title_counts" select title, count( title ) as cnt from employees.titles group by title;
> select * from "working"."title_counts";
-> show standard JDBC tool here

Links

Concurrent

Lingual

Cascading

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment