Skip to content

Instantly share code, notes, and snippets.

@andrearota
Last active October 11, 2016 07:41
Show Gist options
  • Save andrearota/2b0d89148b8594458cb7730047552323 to your computer and use it in GitHub Desktop.
Save andrearota/2b0d89148b8594458cb7730047552323 to your computer and use it in GitHub Desktop.
Examples for CCA175

Creates external table from text files tab separated

CREATE EXTERNAL TABLE webpages (
  PAGE_ID INTEGER,
  NAME STRING,
  ASSOC_FILE STRING
)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'
LOCATION 'hdfs:/loudacre/webpage'

Creates external table from AVRO files, having the .avsc schema

CREATE EXTERNAL TABLE accounts_avro
STORED AS AVRO 
LOCATION '/loudacre/accounts_avro'
TBLPROPERTIES('avro.schema.url'='hdfs:/loudacre/accounts.avsc')

Creates external table from PARQUET files

CREATE EXTERNAL TABLE accounts_parquet2
LIKE PARQUET 'hdfs:/loudacre/accounts_parquet/fb35d46e-d3eb-4fe2-8e45-fc2d935c6a75.parquet'
LOCATION 'hdfs:/loudacre/accounts_parquet'

Creates an empty external table structure to be fill with INSERT INTO x SELECT

CREATE EXTERNAL TABLE accounts_by_areacode2 (
  ACCT_ID INTEGER,
  FIRST_NAME STRING,
  LAST_NAME STRING,
  PHONE_NUMBER STRING
)
PARTITIONED BY (
  AREA_CODE INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/loudacre/accounts_by_areacode2'

Fill the partitioned table with

INSERT INTO TABLE accounts_by_areacode2 PARTITION(area_code)
SELECT acct_num as acct_id, first_name, last_name,
phone_number, SUBSTR(phone_number,1,3) AS area_code
FROM accounts_avro

Add partitions to an already exisiting table

Repeat for any folder containing partitioned data.

ALTER TABLE accounts_by_areacode2 
ADD PARTITION(area_code='A')
LOCATION 'hdfs:/loudacre/accounts_by_areacode2/area_code=A';

ALTER TABLE accounts_by_areacode2 
ADD PARTITION(area_code='B')
LOCATION 'hdfs:/loudacre/accounts_by_areacode2/area_code=B';
  • .groupByKey and .countByKey
  • .mapValues: apply function the value part, leaving the key part untouched
  • .sort and .sortByKey
  • .toDebugString

Handling null values

For string columns:

--null-string <null-string>	The string to be written for a null value for string columns

For non-string columns:

--null-non-string <null-string>	The string to be written for a null value for non-string columns

If not declared, null is written as null string.

Handling free form queries

sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
-m 1 --target-dir /user/foo/joinresults

Split string into array by a delimiter

"hello world".split(' ')

Join strings

','.join(["hello", "world", "foo"]) //> 'hello,world,foo'

Return true if the string contains the substring

"hello" in "hello world"

Local map, filter, flatMap

Python has some functional operators like map and filter.

map(lambda x: x**2, [1, 2, 3])
filter(lambda x: x%2 == 0, [1, 2, 3])

Strip spaces

" hello world ".strip()
" hello world ".lstrip()
" hello world ".rstrip()

Regular expressions

import re
match = re.search("(\d+)", "foo123")

for m in match.groups():
    print m

if m is not None:
    print m.group(0)

# String interpolation
"%d and %d" % (1, 3) //> '1 and 3'

General string manipulation

# Find a substring
"Hello World".find("Wo") //> 6

# Case manipulation
"Hello World".upper() //> HELLO WORLD
"Hello World".lower() //> hello world

# Substitution
"Hello world".replace('world', 'Andrea') //> 'Hello Andrea'

File operations

# Write line by line
f = open("/tmp/hello.txt", "w")
f.write("bar")
f.close()

# Read line by line
f = open("/tmp/hello.txt", "r")
for line in iter(f):
    print line
f.close()

# Read whole file
f = open("/tmp/hello.txt", "r")
contents = f.read()
f.close()

Split string into array by a delimiter

"hello world".split(' ')

Join strings

Array("Hello", "World").mkString(",")

Return true if the string contains the substring

"hello world".contains("world")

Local map, filter, flatMap

Scala has some functional operators like map and filter.

Array(1, 2, 3).map(x => x*x)
Array(1, 2, 3).filter(x => x % 2 == 0)

Strip spaces

" hello world ".trim()
" hello world ".stripPrefix(" ")
" hello world ".stripSuffix(" ")

Regular expressions

import scala.util.matching.Regex
val pattern = new Regex("(\\d+)")
val text = "Foo123"

patter findAllIn text mkString(",")

# String interpolation
val a = 1
val b = 2
print(s"${a},${b}")

General string manipulation

# Find a substring
"hello world".find("llo")

# Case manipulation
"Hello World".toUpperCase() //> HELLO WORLD
"Hello World".toLowerCase() //> hello world

# Substitution
"Hello world".replace('world', 'Andrea') //> 'Hello Andrea'

File operations

Write text file
import java.io._
val file = new File("/tmp/temp.txt")
val bw = new BufferedWriter(new FileWriter(file))
bw.write("Hello")
bw.newLine()
bw.write("World")
bw.close()
Write binary file
import java.io._
val data = Array[Byte](0x00, 0x55, 0x00)
val file = new File("/tmp/temp.bin")
val os = new FileOutputStream(file)
os.write(data)
os.close()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment