

mwinkle /
Last active Aug 18, 2016
PySpark UDF for calling Text Analytics, a Microsoft Cognitive Service
from pyspark.sql.functions import udf
import httplib, urllib, base64, json

def whichLanguage(text):
    headers = {
        # Request headers
        'Content-Type': 'application/json',
        'Ocp-Apim-Subscription-Key': '{your subscription key here}',
    }
    params = urllib.urlencode({
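The snippet above is truncated before it builds the request body. For reference, the Text Analytics languages endpoint accepts a JSON body with a `documents` array of `id`/`text` pairs; a minimal sketch of building that body (`build_language_request` is a helper name introduced here, not part of the gist):

```python
import json

def build_language_request(texts):
    # The languages endpoint expects a "documents" array, each entry
    # carrying an "id" and the "text" to analyze.
    documents = [{'id': str(i), 'text': t} for i, t in enumerate(texts, 1)]
    return json.dumps({'documents': documents})
```

The resulting string would be POSTed with the headers shown in the gist.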
su hdfs
hadoop fs -mkdir /user/root
hadoop fs -chmod 777 /user/root
hadoop fs -chmod 777 /user/guest
text_file = sc.textFile("hdfs://")
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
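The same flatMap/map/reduceByKey pipeline can be sketched in plain Python, which makes the word-count logic easy to check without a cluster (`word_count` is a helper name introduced here):

```python
from collections import defaultdict

def word_count(lines):
    # flatMap: split each line into words; map: pair each word with 1;
    # reduceByKey: sum the counts per word.
    counts = defaultdict(int)
    for line in lines:
        for word in line.split(" "):
            counts[word] += 1
    return dict(counts)
```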
vbProcessor.vb
' sample U-SQL UDO (Processor) written in VB.NET
Imports Microsoft.Analytics.Interfaces

Public Class vbProcessor
    Inherits IProcessor

    Private CountryTranslation As New Dictionary(Of String, String) From
        {{"Deutschland", "Germany"},
         {"Schwiiz", "Switzerland"}}
End Class
mwinkle / fSharpProcessor.fs
Last active Nov 12, 2015
An example of using F# to implement a U-SQL UDO (in this case, a processor).
// sample U-SQL UDO (Processor) written in F#
// Note: currently (11/2015) requires deployment of FSharp.Core
namespace fSharpProcessor

open Microsoft.Analytics.Interfaces

type myProcessor() =
    inherit IProcessor()
python_processing.sql
-- assumes table is my_json, with one column containing all of the json body
add file wasb:///example/apps/;
SELECT transform(json_body)
USING 'd:\python27\python.exe'
AS id, lessonbranch, elapsedseconds, activity, the_date
FROM my_json;
mwinkle /
Last active Aug 29, 2015
Python script for processing JSON docs
# this is a Python streaming program designed to be called from a Hive query
# it processes a complex JSON document and returns the right set of columns and rows
# a second gist contains the Hive query that can be used to invoke this script
import sys
import json

# this returns five columns:
# id, lessonbranch, elapsedseconds, activity, datetime
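The gist preview cuts off before the processing loop. A sketch of how such a streaming script might emit the five columns as a tab-separated row — the field names follow the comment above, but the exact document layout and the helper name `emit_rows` are assumptions introduced here:

```python
import sys

def emit_rows(doc, out=sys.stdout):
    # Emit one tab-separated row with the five columns named above.
    # Hive's TRANSFORM reads these rows back from stdout.
    row = [str(doc.get(k, '')) for k in
           ('id', 'lessonbranch', 'elapsedseconds', 'activity', 'datetime')]
    out.write('\t'.join(row) + '\n')
```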
mwinkle /
Created Mar 14, 2015
Python file used to consolidate JSON files to a single line with no CRs
import sys

lines = []
for line in sys.stdin:
    lines.append(line)

if len(lines) > 0:
    cleaned_lines = [line.strip() for line in lines]
    single_line = ' '.join(cleaned_lines)
    sys.stdout.write(single_line + '\n')
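The consolidation step can be factored into a function for checking in isolation (a sketch; `consolidate` is a name introduced here):

```python
def consolidate(lines):
    # Strip trailing newlines/CRs and surrounding whitespace from each
    # line, then join everything into a single space-separated line.
    cleaned = [line.strip() for line in lines]
    return ' '.join(cleaned)
```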
mwinkle / Giraph on HDInsight on Linux
Created Feb 25, 2015
Deploying Giraph on an HDInsight Linux Cluster
sudo apt-get install openjdk-7-jdk
sudo apt-get install git
sudo apt-get install maven
git clone
mvn -Phadoop_2 -fae -DskipTests -Dhadoop=non_secure clean package
# need to put the sample file in storage
mwinkle / transact-hive.hql
Last active Aug 29, 2015
Transactional Hive
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on=true;
set hive.compactor.worker.threads=2;
CREATE TABLE AcidTest (name string, num int) clustered by (num) into 2 buckets STORED AS orc TBLPROPERTIES('transactional'='true');
INSERT INTO TABLE AcidTest VALUES ('one',1), ('two',2),('three',3),('four',4);
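With the transactional table in place, Hive's ACID support also permits row-level DML. A hedged extension of the gist (these statements follow Hive's ACID syntax but are not part of the original; note that the bucketing column `num` itself cannot be updated):

```sql
-- row-level DML enabled by TBLPROPERTIES('transactional'='true')
UPDATE AcidTest SET name = 'dos' WHERE num = 2;
DELETE FROM AcidTest WHERE num = 4;
```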