- Breadth of experience at Cloudera
- Prior to Cloudera
- Apache Bigtop, Apache Hive (decimal), Apache Spark, Apache Sentry
- OSS contributions
- Apache Spot
- Experience writing the book
- Radar post
- Table of contents
- Experience dealing with very complex client deployments
- Fraud Detection, Event analytics, Scaling
@markgrover
markgrover / spot.xml
Last active October 19, 2016 15:01
Goes under content/projects/spot.xml in incubator.svn
<?xml version="1.0" encoding="UTF-8"?>
<document>
  <properties>
    <title>Spot Incubation Status</title>
    <link href="http://purl.org/DC/elements/1.0/" rel="schema.DC"/>
  </properties>
  <body>
    <section id="Spot+Project+Incubation+Status">
      <title>Spot Project Incubation Status</title>
      <p>This page tracks the project status, incubator-wise. For more general
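An aside, not part of the gist: getting the file into incubator.svn is a routine Subversion add and commit. The path comes from the description above; the checkout location and commit message are made up.
# Sketch only: assumes a checkout of incubator.svn in ./incubator
cd incubator
svn add content/projects/spot.xml
svn commit -m "Add Spot incubation status page"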
@markgrover
markgrover / gist:85313f901209720c1e27
Created February 9, 2016 19:26
Custom patch for https://github.com/markgrover/spark/tree/kafka09-integration to run a full test build against a snapshot version of kafka 0.9.0.1
diff --git a/external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala b/external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala
index 1e5a45c64545d5491dde491c8d74871951833fa0..142578cb081c2f976482c84204b3fce12b3caf2a 100644
--- a/external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala
+++ b/external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala
@@ -94,7 +94,7 @@ class DirectKafkaStreamSuite
   // TODO: Renable when we move to Kafka 0.9.0.1. This test wouldn't pass until KAFKA-3029
   // (Make class org.apache.kafka.common.TopicPartition Serializable) is resolved.
-  test("basic stream receiving with multiple topics and smallest starting offset, using new Kafka" +
+  ignore("basic stream receiving with multiple topics and smallest starting offset, using new Kafka" +
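For context, not part of the gist: with the patch applied, the affected suite can be run on its own through Spark's sbt wrapper. The sbt project id for external/kafka ("streaming-kafka") is an assumption here.
# Hypothetical invocation against the patched tree; old sbt "test-only" syntax.
build/sbt "streaming-kafka/test-only org.apache.spark.streaming.kafka.DirectKafkaStreamSuite"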
@markgrover
markgrover / gist:350991da6edc00789e46
Created September 2, 2015 23:03
Contents of the executor's environ
LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.1007/lib/hadoop/../../../CDH-5.5.0-1.cdh5.5.0.p0.1007/lib/hadoop/lib/native:/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.1007/lib/hadoop/../../../GPLEXTRAS-5.5.0-1.cdh5.5.0.p0.300/lib/hadoop/lib/native:/opt/cloudera/parcels/GPLEXTRAS-5.5.0-1.cdh5.5.0.p0.300/lib/impala/lib:/opt/cloudera/parcels/GPLEXTRAS-5.5.0-1.cdh5.5.0.p0.300/lib/hadoop/lib/native:/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.1007/lib/hadoop/lib/native
CDH_HCAT_HOME=/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.1007/lib/hive-hcatalog
TOMCAT_HOME=/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.1007/lib/bigtop-tomcat
YARN_RESOURCEMANAGER_OPTS=
CDH_SOLR_HOME=/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.1007/lib/solr
CDH_PIG_HOME=/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.1007/lib/pig
PARCELS_ROOT=/opt/cloudera/parcels
FLUME_CLASSPATH=/opt/cloudera/parcels/GPLEXTRAS-5.5.0-1.cdh5.5.0.p0.300/lib/hadoop/lib/*
TERM=vt100
SHELL=/bin/bash
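These variables come NUL-separated from the proc filesystem; a minimal sketch for dumping them readably, assuming you have the executor's process id:
# Replace <executor_pid> with the Spark executor's pid (e.g. from jps or ps).
tr '\0' '\n' < /proc/<executor_pid>/environ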
(file truncated)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/spark-1.5.0-SNAPSHOT-bin-nm/lib/spark-assembly-1.5.0-SNAPSHOT-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.4.6-1.cdh5.4.6.p0.155/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.4.6-1.cdh5.4.6.p0.155/jars/avro-tools-1.7.6-cdh5.4.6-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/08/14 15:28:42 INFO SparkContext: Running Spark version 1.5.0-SNAPSHOT
15/08/14 15:28:42 DEBUG MutableMetricsFactory: field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with annotation @org.apache.hadoop.metrics2.annotation.Metric(valueName=Time, value=[Rate of su
(file truncated)
Container: container_1439589080879_0002_01_000001 on mgrover-haa3-2.vpc.cloudera.com_8041
===========================================================================================
LogType:stderr
Log Upload Time:Fri Aug 14 15:38:48 -0700 2015
LogLength:4917798
Log Contents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/yarn/nm/usercache/root/filecache/71/spark-assembly-1.5.0-SNAPSHOT-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
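Aggregated container logs like this one can be pulled for the whole application with the YARN CLI; the application id is readable off the container id above:
# container_1439589080879_0002_01_000001 belongs to this application:
yarn logs -applicationId application_1439589080879_0002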
@markgrover
markgrover / gist:4576265b25b303174f37
Created May 13, 2015 14:38
job.properties for the workflow.xml
nameNode=hdfs://quickstart.cloudera:8020
jobTracker=quickstart.cloudera:8032
workflowRoot=${nameNode}/user/${user.name}/oozie-workflows
# jobStart and jobEnd must be in UTC, because Oozie does not yet support
# custom timezones
# jobStart=2014-10-13T20:30Z
# jobEnd=2014-10-17T20:30Z
# This should be set to an hour boundary. In this case, it is set to 8 hours
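A side note, not from the gist: the UTC timestamps Oozie expects for jobStart/jobEnd can be produced directly from the shell:
# Prints the current time in the format Oozie wants, e.g. 2014-10-13T20:30Z
date -u +%Y-%m-%dT%H:%MZ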
@markgrover
markgrover / gist:cf62e475f15e3d6612ce
Last active August 29, 2015 14:21
Workflow.xml with CONDITIONS variable
<workflow-app xmlns="uri:oozie:workflow:0.4" name="process-clickstream-data-wf">
  <global>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
  </global>
  <start to="import_facts"/>
  <action name="import_facts">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
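The preview cuts off before the CONDITIONS variable shows up. For context, $CONDITIONS is the token Sqoop substitutes with per-mapper split predicates in free-form query imports; a standalone sketch of the same idea, with the table and paths borrowed loosely from the other gists on this page:
# Hypothetical free-form import; Sqoop rewrites $CONDITIONS into per-mapper
# WHERE-clause boundaries derived from --split-by.
sqoop import --connect jdbc:mysql://quickstart.cloudera:3306/movie_dwh \
  --username root \
  --query 'SELECT * FROM user_rating_fact WHERE $CONDITIONS' \
  --split-by user_id \
  --target-dir /etl/movielens/user_rating_fact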
@markgrover
markgrover / gist:113196fecd1ec5bd0b38
Last active August 29, 2015 14:16
Error log from a sqoop export on Avro data
[root@mgrover-haa2-4 ~]# sqoop export --connect jdbc:mysql://$MYSQL_SERVER:3306/movie_dwh --username root --table avg_movie_rating --export-dir /data/movielens/aggregated_ratings -m 16 --update-key movie_id --update-mode allowinsert
Warning: /opt/cloudera/parcels/CDH-5.3.1-1.cdh5.3.1.p0.5/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
15/03/01 16:15:00 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.3.1
15/03/01 16:15:01 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
15/03/01 16:15:01 INFO tool.CodeGenTool: Beginning code generation
15/03/01 16:15:01 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `avg_movie_rating` AS t LIMIT 1
15/03/01 16:15:01 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `avg_movie_rating` AS t LIMIT 1
15/03/01 16:15:01 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
@markgrover
markgrover / gist:86f54663ece0943bc8ed
Created March 1, 2015 06:17
Code for sqooping into Hive Parquet tables
#!/bin/bash
# This code is taken from github.com/hadooparchitecturebook/hadoop-arch-book.
SQOOP_METASTORE_HOST=localhost
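# Stage the ETL landing directory as the HDFS superuser, then hand ownership
# to the user running the import.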
sudo -u hdfs hadoop fs -mkdir -p /etl/movielens/user_rating_fact
sudo -u hdfs hadoop fs -chown -R $USER: /etl/movielens/user_rating_fact
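# Remove any previously saved job of the same name; "|| :" ignores the failure
# when the job doesn't exist yet, keeping the script re-runnable.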
sqoop job --delete user_rating_import --meta-connect jdbc:hsqldb:hsql://${SQOOP_METASTORE_HOST}:16000/sqoop || :