
@mwinkle
mwinkle / gist:9808807
Last active August 29, 2015 13:57
This is the result of work done with https://issues.apache.org/jira/browse/HIVE-3554 and https://issues.apache.org/jira/browse/HIVE-5795 that enable easy traversal of subdirectories, and exclusion of header rows. This makes Hive a turnkey way to query any type of log data partitioned by time series, without having to explicitly program the parti…
CREATE EXTERNAL TABLE websites_logs_raw (
  datereq STRING,
  timereq STRING,
  s_sitename STRING,
  cs_method STRING,
  cs_uri_stem STRING,
  cs_uri_query STRING,
  s_port STRING,
  cs_username STRING,
  c_ip STRING,
  cs_User_Agent STRING,
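The preview above is cut off before the table's location and properties. A minimal sketch of the settings that HIVE-3554 and HIVE-5795 enable (table name, location, and column are illustrative, not the original's):

```sql
-- Illustrative sketch: recursive directory traversal (HIVE-3554)
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;

-- skip.header.line.count (HIVE-5795) drops the first N lines of each
-- file, e.g. a W3C log header row.
CREATE EXTERNAL TABLE example_logs (line STRING)
LOCATION 'wasb://container@account/logs/'
TBLPROPERTIES ('skip.header.line.count'='1');
```

With these in place, pointing LOCATION at the root of a time-partitioned log directory picks up every file in every subdirectory without explicit partition DDL.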
mwinkle / AzureDB-HDInsight-Integration.txt
Created April 6, 2014 18:01
Interacting with Azure SQL DB From HDInsight
Sqoop using Command line (SQL to Hive)
bin\sqoop import --connect "jdbc:sqlserver://[yourserver].database.windows.net:1433;database=AdventureWorks2012;user=[username];password=[password]" --table SalesOrderDetail --hive-import -m 10 -- --schema Sales
Sqoop using Remote PowerShell
New-AzureHDInsightSqoopJobDefinition –Command 'import --connect "jdbc:sqlserver://[yourserver].database.windows.net:1433;database=AdventureWorks2012;user=[username];password=[password]" --table SalesOrderDetail --hive-import -m 10 -- --schema Sales'
Storing from Pig to Mongo
Obtain the jars from https://github.com/mongodb/mongo-hadoop
REGISTER wasb://container@account/mongo-hadoop/mongo-java-driver-2.11.4.jar;
REGISTER wasb://container@account/mongo-hadoop/mongo-hadoop-core-1.2.1-SNAPSHOT-hadoop_2.2.jar;
REGISTER wasb://container@account/mongo-hadoop/mongo-hadoop-pig-1.2.1-SNAPSHOT-hadoop_2.2.jar;
-- pig code
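The Pig code itself is elided above. A minimal sketch of storing a relation into MongoDB with the registered connector (the input path, schema, and connection string are placeholders):

```pig
-- Hypothetical relation; load whatever data you want to store.
data = LOAD 'wasb://container@account/input' AS (name:chararray, value:int);

-- MongoStorage comes from the mongo-hadoop-pig jar registered above.
STORE data INTO 'mongodb://localhost:27017/mydb.mycollection'
    USING com.mongodb.hadoop.pig.MongoStorage();
```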
mwinkle / HDInsight-Transient-Clusters.ps1
Created April 6, 2014 18:05
This gist shows how to pass in arbitrary storage accounts, as well as HCatalog metadata stores, on a per-job basis via PowerShell.
$subscriptionName = "<subscriptionname>"
$clusterName = "<cluster name>"
Select-AzureSubscription -SubscriptionName $subscriptionName
Use-AzureHDInsightCluster $clusterName
$querystring = "show tables; select name, salary from emp;"
[HashTable] $defines = @{}
$defines.Add("hive.metastore.uris", "")
$defines.Add("javax.jdo.option.ConnectionURL", "jdbc:sqlserver://<name>.database.windows.net;database=<dbname>;encrypt=true;trustServerCertificate=true;create=false")
$defines.Add("fs.azure.account.key.<name1>.blob.core.windows.net", "<clear text key1>")
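The script is truncated after building $defines. A sketch of submitting the query with those per-job defines, using the cmdlet names from the classic Azure PowerShell module this gist targets:

```powershell
# Sketch: pass the defines when creating the Hive job definition,
# then submit, wait, and fetch the output.
$jobDef = New-AzureHDInsightHiveJobDefinition -Query $querystring -Defines $defines
$job = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $job.JobId -StandardOutput
```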
mwinkle / gist:ac9dbb152a1e10e06c16
Created July 2, 2014 06:00
HDInsight Hive with Groovy inline UDFs
set hive.aux.jars.path=file:///c:/apps/dist/hive-0.13.0.2.1.3.0-1887/lib/groovy-all-2.1.6.jar;
COMPILE `import org.apache.hadoop.hive.ql.exec.UDF \;
public class Pyth extends UDF {
  public double evaluate(double a, double b) {
    return Math.sqrt((a*a) + (b*b)) \;
  }
}` AS GROOVY NAMED Pyth.groovy;
CREATE TEMPORARY FUNCTION Pyth AS 'Pyth';
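A quick way to check the function once it is registered (the source table is a placeholder; any single-row source works):

```sql
-- 3-4-5 triangle: Pyth(3.0, 4.0) evaluates to 5.0
SELECT Pyth(3.0, 4.0) FROM some_table LIMIT 1;
```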
mwinkle / HDI_SAS_Cluster.ps1
Last active August 29, 2015 14:03
Creating an HDInsight Cluster using a SAS protected storage container
# This script lets you access data in a second storage account, where you are only granted access
# via a Shared Access Signature to a given container.
# For more details on SAS, see http://azure.microsoft.com/en-us/documentation/articles/storage-dotnet-shared-access-signature-part-1/
# Configure the SAS key
$sasStorageAccountName = "YOUR_STORAGE_ACCOUNT_NAME"
$sasContainerName = "YOUR_SAS_CONTAINER_NAME"
$context = New-AzureStorageContext -StorageAccountName $sasStorageAccountName -StorageAccountKey (Get-AzureStorageKey -StorageAccountName $sasStorageAccountName).Primary
$sas = New-AzureStorageContainerSASToken -Name $sasContainerName -Permission rl -Context $context -ExpiryTime (get-date).AddDays(3)
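The script stops after generating the token. One way to hand the SAS to the cluster is as a core-site value (the fs.azure.sas.<container>.<account> key pattern used by WASB); a sketch with the classic module's config cmdlets, cluster sizing illustrative:

```powershell
# Sketch: attach the SAS token as a core-site setting so the cluster
# can read the protected container at creation time.
$config = New-AzureHDInsightClusterConfig -ClusterSizeInNodes 4
$config = Add-AzureHDInsightConfigValues $config -Core @{
    "fs.azure.sas.$sasContainerName.$sasStorageAccountName.blob.core.windows.net" = $sas
}
```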
mwinkle / gist:8fb31f6c7c096a60dc44
Created August 15, 2014 01:03
Uploading data to Azure Blob Store
$storageAccountName = "STORAGE_ACCOUNT_NAME"
$destinationContainer = "DESTINATION_CONTAINER_NAME"
$context = New-AzureStorageContext $storageAccountName -StorageAccountKey (Get-AzureStorageKey $storageAccountName).Primary
Set-AzureStorageBlobContent -Blob "path/to/your/blob" -Container $destinationContainer -File "c:\your\local\path" -Context $context -Force
mwinkle / hivehbasetable.hql
Created November 14, 2014 17:38
Hive Table in Hbase
CREATE EXTERNAL TABLE reverse_ip_hbase (
  rowkey STRING,
  IpAddress STRING,
  Continent STRING,
  Country STRING,
  CountryIso2 STRING,
  CountryConfidence INT,
  Region STRING,
  State STRING,
  StateConfidence INT,
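The DDL preview is cut off before the storage clause. A Hive-over-HBase table typically finishes with the HBaseStorageHandler and a column mapping; a sketch (the mapping and HBase table name shown are illustrative, not the original's):

```sql
-- Illustrative: ':key' binds rowkey; the rest map Hive columns to
-- HBase column family/qualifier pairs.
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:IpAddress,cf:Continent,cf:Country')
TBLPROPERTIES ('hbase.table.name' = 'reverse_ip');
```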
mwinkle / transact-hive.hql
Last active August 29, 2015 14:14
Transactional Hive
set hive.support.concurrency=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on=true;
set hive.compactor.worker.threads=2;
CREATE TABLE AcidTest (name string, num int) clustered by (num) into 2 buckets STORED AS orc TBLPROPERTIES('transactional'='true');
INSERT INTO TABLE AcidTest VALUES ('one',1), ('two',2),('three',3),('four',4);
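With the transactional table in place, row-level updates and deletes become available; a small sketch of exercising them:

```sql
-- Row-level operations require the ACID settings and the
-- transactional, bucketed ORC table defined above.
UPDATE AcidTest SET name = 'uno' WHERE num = 1;
DELETE FROM AcidTest WHERE num = 4;
SELECT name, num FROM AcidTest;
```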
mwinkle / Giraph on HDinsight on Linux
Created February 25, 2015 05:44
Deploying Giraph on an HDInsight Linux Cluster
Special thanks to http://giraph.apache.org/quick_start.html, and to http://stackoverflow.com/a/27003213/500945 for the last tip.
sudo apt-get install openjdk-7-jdk
sudo apt-get install git
sudo apt-get install maven
git clone https://github.com/apache/giraph.git
mvn -Phadoop_2 -fae -DskipTests -Dhadoop=non_secure clean package
# need to put the sample file in storage
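Following the quick-start guide linked above, a sketch of uploading the sample graph and running the shortest-paths example (the HDFS paths and the examples-jar filename are assumptions that depend on your build):

```shell
# Upload the sample graph from the Giraph quick-start guide, then run
# the SimpleShortestPathsComputation example with one worker.
hdfs dfs -mkdir -p /user/$USER/input
hdfs dfs -put tiny_graph.txt /user/$USER/input
hadoop jar giraph-examples/target/giraph-examples-*-jar-with-dependencies.jar \
  org.apache.giraph.GiraphRunner \
  org.apache.giraph.examples.SimpleShortestPathsComputation \
  -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
  -vip /user/$USER/input/tiny_graph.txt \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op /user/$USER/output/shortestpaths \
  -w 1
```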