Sujith Jay Nair (sujithjay)
@sujithjay
sujithjay / Google Protobuf MacOS Installation for Building the Hadoop Project
Created April 2, 2020 09:13
Google Protobuf MacOS Installation for Building the Hadoop Project
# Download and unpack Protobuf 2.5.0 (the release the Hadoop build expects)
wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.bz2
tar xvf protobuf-2.5.0.tar.bz2
cd protobuf-2.5.0

# Configure with clang and libc++ so the build works with the macOS toolchain
./configure CC=clang CXX=clang++ CXXFLAGS='-std=c++11 -stdlib=libc++ -O3 -g' LDFLAGS='-stdlib=libc++' LIBS="-lc++ -lc++abi"
make -j 4
sudo make install

# Verify the installation; this should print "libprotoc 2.5.0"
protoc --version
@sujithjay
sujithjay / 2.4.3.scala
Last active March 25, 2020 23:31
DataFrameNaFunctions.fill in Spark 2.4.5 & Spark 2.4.3
/**
* Returns a new `DataFrame` that replaces null or NaN values in specified
* numeric, string columns. If a specified column is not a numeric, string
* or boolean column it is ignored.
*/
private def fillValue[T](value: T, cols: Seq[String]): DataFrame = {
// the fill[T] which T is Long/Double,
// should apply on all the NumericType Column, for example:
// val input = Seq[(java.lang.Integer, java.lang.Double)]((null, 164.3)).toDF("a","b")
// input.na.fill(3.1)
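The comment above references a small reproduction. A runnable version of that snippet, assuming a spark-shell session (so `spark.implicits._` can be imported), is sketched below; how the non-Double column "a" is handled is presumably what the gist compares across the 2.4.3 and 2.4.5 sources.

```scala
import spark.implicits._

// The example cited in the source comment: an Integer column holding a null,
// next to a Double column.
val input = Seq[(java.lang.Integer, java.lang.Double)]((null, 164.3)).toDF("a", "b")

// Numeric fill with a Double value; it replaces the null, and its treatment of
// the integer column "a" is the behaviour under discussion.
input.na.fill(3.1).show()
```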
@sujithjay
sujithjay / UnionOverOr.md
Created October 11, 2019 16:58
Union over Or
/* Part 0: Initialise */
def randomInt = scala.util.Random.nextInt(10000)

val dataframe = sc.parallelize(
  Seq.fill(100000){(randomInt,randomInt,randomInt)}
  ).toDF("cID", "c2", "c3")

val anotherDataframe = sc.parallelize(
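The second DataFrame is cut off above. A hedged sketch of the "union over or" rewrite follows, assuming (hypothetically) that `anotherDataframe` is built the same way with columns `cID`, `c2`, `c4`:

```scala
// Join with an OR of equality predicates: Spark cannot extract equi-join keys
// from this condition, so it falls back to a nested-loop / cartesian strategy.
val orJoin = dataframe.join(anotherDataframe,
  dataframe("cID") === anotherDataframe("cID") ||
  dataframe("c2")  === anotherDataframe("c2"))

// "Union over Or": split the OR into two equi-joins and union the results.
// Each branch can now use a hash or sort-merge join; distinct() removes the
// rows that matched both predicates (and any other duplicates).
val unionJoin =
  dataframe.join(anotherDataframe, dataframe("cID") === anotherDataframe("cID"))
    .union(dataframe.join(anotherDataframe, dataframe("c2") === anotherDataframe("c2")))
    .distinct()
```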
@sujithjay
sujithjay / ExplainOutput.md
Last active June 28, 2018 08:43
Spark SQL Joins: Code Snippets
scala> widetable.explain()
== Physical Plan ==
*SortMergeJoin [customer_id#2], [customer_id#17], Inner
:- *Sort [customer_id#2 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(customer_id#2, 200)
:     +- *Scan JDBCRelation(foodmart.sales_fact_1998) [numPartitions=1] [product_id#0,time_id#1,customer_id#2,promotion_id#3,store_id#4,store_sales#5,store_cost#6,unit_sales#7] ReadSchema: struct<product_id:int,time_id:int,customer_id:int,promotion_id:int,store_id:int,store_sales:decim...
+- *Sort [customer_id#17 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(customer_id#17, 200)
      +- *Scan JDBCRelation(foodmart.customer) [numPartitions=1] [customer_id#17,fullname#45] ReadSchema: struct
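A sketch of the kind of query that yields a SortMergeJoin plan like the one above, assuming the foodmart tables are read over JDBC (the connection details are placeholders, and the exact projections will differ):

```scala
// Placeholder JDBC connection settings; only the table names come from the plan above.
def jdbcTable(table: String) = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/foodmart")
  .option("user", "user")
  .option("password", "password")
  .option("dbtable", table)
  .load()

val salesFact = jdbcTable("foodmart.sales_fact_1998")
val customer  = jdbcTable("foodmart.customer")

// An equi-join on customer_id; with neither JDBC side known to be small enough
// to broadcast, Spark plans a SortMergeJoin with an Exchange under each side.
val widetable = salesFact.join(customer, "customer_id")
widetable.explain()
```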
/**
* A tiny class that extends a list with four combinatorial operations:
* ''combinations'', ''subsets'', ''permutations'', ''variations''.
*
 * You can find all the ideas behind this code in the blog post:
*
* http://vkostyukov.ru/posts/combinatorial-algorithms-in-scala/
*
* How to use this class.
*
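The class itself is not included in this excerpt. As a rough illustration of what the four operations mean (a sketch using the standard library, not the gist's implementation), three of them have close built-in analogues:

```scala
val xs = List(1, 2, 3)

// combinations: order-insensitive selections of a given size
xs.combinations(2).toList            // List(List(1, 2), List(1, 3), List(2, 3))

// subsets: every subset, from the empty set up to the full set
xs.toSet.subsets().toList            // 8 subsets in total

// permutations: every ordering of the whole list
xs.permutations.toList               // 6 orderings

// variations (k-permutations) have no direct built-in; one way to obtain them:
xs.combinations(2).flatMap(_.permutations).toList   // 6 ordered pairs
```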
@sujithjay
sujithjay / ASeq.scala
Last active March 13, 2018 11:52 — forked from Mortimerp9/gist:5649109
"Cheap" implementation of an immutable.IndexedSeq backed by an Array. The output of ASeq looks like an Array according to the types, but is not mutable nor cast back to a mutable Array.
import scala.reflect.ClassTag
import scala.collection.mutable.WrappedArray
import scala.collection.mutable.ArrayLike

def ASeq[T](elt: T*)(implicit ct: ClassTag[T]): IndexedSeq[T] = {
  val a = elt.toArray.clone
  a.deep.asInstanceOf[IndexedSeq[T]]
}

val a = Array(1,2,3) //> a : Array[Int] = Array(1, 2, 3)
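The comparison with a plain Array is cut off above. A small usage sketch, assuming Scala 2.12 or earlier (where Array#deep is still available):

```scala
val s = ASeq(1, 2, 3)   // s: IndexedSeq[Int], but it prints like Array(1, 2, 3)
s(0)                    // 1: positional access works as with an Array
// s(0) = 10            // does not compile: IndexedSeq has no update method,
//                      // so the result cannot be mutated in place
```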
@sujithjay
sujithjay / BroadcastHashJoin.md
Last active February 18, 2018 07:21
Spark SQL Joins: Code Snippets
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
/* 004 */
/* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 006 */   private Object[] references;
/* 007 */   private scala.collection.Iterator[] inputs;
/* 008 */   private scala.collection.Iterator inputadapter_input;
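Generated code like the above can be dumped from a spark-shell; a sketch, assuming Spark 2.x:

```scala
import org.apache.spark.sql.execution.debug._     // adds debugCodegen() to Datasets
import org.apache.spark.sql.functions.broadcast

// Two toy DataFrames; broadcast() nudges the planner towards a BroadcastHashJoin.
val large = spark.range(0, 100000).toDF("customer_id")
val small = spark.range(0, 100).toDF("customer_id")

// Prints the whole-stage-generated Java source, with the "/* NNN */" markers seen above.
large.join(broadcast(small), "customer_id").debugCodegen()
```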
@sujithjay
sujithjay / CoalesceVsCase.md
Last active December 13, 2017 11:43
SQL Queries on Subtyped Data Models using Case Expressions and Coalesce

Disjoint subtyping is a scenario that is often encountered in data modeling. In one frequently used modeling approach, an entity of a certain type is represented by a database table, and each subtype of this entity is represented by another table. Subtyping is disjoint if an instance of a type corresponds to at most one instance of a subtype.

When querying a data model with subtypes, a common (verbose?!) approach is to use case expressions. The idea of this post is to introduce an alternative to this approach using coalesce. I will illustrate it across multiple examples.

Example #1

In the toy schema depicted below, list all orders with their corresponding vendors and customers.

Order Table

| Id | Status | Quantity | Type |
|----|--------|----------|------|
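A minimal runnable sketch of the two query shapes in a spark-shell, using hypothetical tables that follow the disjoint-subtype layout described above (every name beyond the Order columns is an assumption):

```scala
import spark.implicits._

// Toy data: order 1 is a vendor order ('V'), order 2 a customer order ('C').
Seq((1, "OPEN", 10, "V"), (2, "OPEN", 5, "C"))
  .toDF("id", "status", "quantity", "type").createOrReplaceTempView("orders")
Seq((1, "Acme Corp")).toDF("order_id", "vendor_name").createOrReplaceTempView("vendor_orders")
Seq((2, "Jane Doe")).toDF("order_id", "customer_name").createOrReplaceTempView("customer_orders")

// Case-expression version: pick the party name based on the subtype discriminator.
spark.sql("""
  SELECT o.id,
         CASE o.`type` WHEN 'V' THEN v.vendor_name
                       WHEN 'C' THEN c.customer_name END AS party
  FROM orders o
  LEFT JOIN vendor_orders v   ON o.id = v.order_id
  LEFT JOIN customer_orders c ON o.id = c.order_id
""").show()

// Coalesce version: the subtypes are disjoint, so at most one of the outer-joined
// names is non-null and COALESCE picks it without mentioning the type at all.
spark.sql("""
  SELECT o.id, COALESCE(v.vendor_name, c.customer_name) AS party
  FROM orders o
  LEFT JOIN vendor_orders v   ON o.id = v.order_id
  LEFT JOIN customer_orders c ON o.id = c.order_id
""").show()
```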

@sujithjay
sujithjay / MariaDBvsMySQL.md
Last active November 30, 2017 08:49
A Point of Distinction between MariaDB and MySQL

A Point of Distinction between MariaDB and MySQL

TL;DR

In MariaDB, a query with ORDER BY in a FROM subquery produces an unordered result; in effect, the ORDER BY is ignored in FROM subqueries. MySQL, by contrast, does not ignore ORDER BY in FROM subqueries.

Longer Version

Older versions of MariaDB (< 10.2.0) did not have window functions such as rank(), dense_rank(), and row_number(). To understand where you would use such a function, dense_rank() for instance, consider the following example: