@squito
squito / _summary.md
Last active February 5, 2020 14:15
Spark SQL timestamp semantics, and how they changed from 2.0.0 to 2.0.1 (see query_output_2_0_0.txt vs query_output_2_0_1.txt); the change was made by SPARK-16216

Spark "Timestamp" Behavior

Reading data in different timezones

Note that the ANSI SQL standard defines "timestamp" as equivalent to "timestamp without time zone". However, Spark's behavior depends on both the version of Spark and the file format.

| format \ spark version | <= 2.0.0 | >= 2.0.1 |
| --- | --- | --- |
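The distinction driving the table above can be sketched with plain Python datetimes (not Spark itself): a naive datetime models "timestamp without time zone" semantics, where every reader sees the same wall-clock fields, while an aware datetime models instant semantics, where the displayed time shifts with the reader's zone.

```python
from datetime import datetime, timezone, timedelta

# "timestamp without time zone" semantics: the wall-clock fields are kept
# as-is regardless of the reader's zone (the ANSI SQL behavior).
naive = datetime(2020, 1, 1, 12, 0, 0)  # no tzinfo attached

# Instant semantics: the value is a point on the UTC timeline, so the
# displayed wall-clock time shifts with the reader's zone.
instant = datetime(2020, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
in_utc_minus_8 = instant.astimezone(timezone(timedelta(hours=-8)))

print(naive.hour)           # 12 -- the same for every reader
print(in_utc_minus_8.hour)  # 4  -- shifted for a UTC-8 reader
```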
@squito
squito / can_build_from_puzzler.scala
Last active October 6, 2016 19:42
can_build_from_puzzler
val dictionary = Map(
"a" -> Set("apple", "ant"),
"b" -> Set("banana", "barn")
)
// let's count how many times each letter occurs in all words in our dictionary
val letters = dictionary.values.flatMap {x => x.flatMap {_.toCharArray} }
val letterCounts = letters.groupBy(identity).mapValues(_.size)
letterCounts.toArray.sorted.foreach{println}
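For reference, the same letter count can be rendered in plain Python with `collections.Counter` (this sidesteps whatever CanBuildFrom subtlety the Scala puzzler is demonstrating; it only reproduces the intended counts):

```python
from collections import Counter

dictionary = {
    "a": {"apple", "ant"},
    "b": {"banana", "barn"},
}

# count how many times each letter occurs across all words
letters = (ch for words in dictionary.values() for w in words for ch in w)
letter_counts = Counter(letters)

for letter, count in sorted(letter_counts.items()):
    print(letter, count)
# a 6, b 2, e 1, l 1, n 4, p 2, r 1, t 1
```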
@squito
squito / bash_trim_array.sh
Last active March 15, 2024 13:19
remove an arg/value pair from a command line in bash
#!/bin/bash
# this is a demo of how to remove an argument given with the [-arg value] notation for a specific
# [arg] (-T in this case, but easy to modify)
echo "$@"
echo $#
i=0
ORIGINAL_ARGS=("$@")
TRIMMED_ARGS=()
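The preview cuts off before the trimming loop itself; the same idea can be sketched in Python (the `trim_flag` helper and its names are illustrative, not the gist's code):

```python
def trim_flag(args, flag="-T"):
    """Return args with every `flag value` pair removed."""
    out = []
    i = 0
    while i < len(args):
        if args[i] == flag:
            i += 2  # skip the flag and its value
        else:
            out.append(args[i])
            i += 1
    return out

print(trim_flag(["spark-submit", "-T", "30", "--verbose"]))
# ['spark-submit', '--verbose']
```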
@squito
squito / on_master.txt
Last active June 28, 2016 20:18
scheduler performance results
# really, plus https://github.com/squito/spark/commit/8ce85969b680424ebda51ff9fe8f6e9ab9a9c4a9, because otherwise
# it's getting unfairly penalized for my stupid framework
# but it really should also have 8b41649 (offers.toIndexedSeq) to be a fair comparison
[info] SchedulerPerformanceSuite:
Iteration 0 finished in 470 ms
Iteration 1 finished in 150 ms
Iteration 2 finished in 122 ms
Iteration 3 finished in 122 ms
Iteration 4 finished in 101 ms
@squito
squito / bash_vars.sh
Last active September 12, 2021 00:34
working with bash variables
#!/bin/bash
my_func() {(
# this takes a big shortcut around doing testing & unsetting -- because this entire function
# is wrapped in "()", it executes in a subshell, so we can unconditionally unset, without
# affecting vars outside
unset MASTER
echo "do something with MASTER=${MASTER-unset}"
)}
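The same isolation trick can be sketched in Python, using a child process with a trimmed copy of the environment instead of a subshell (variable names here are just for illustration):

```python
import os
import subprocess
import sys

os.environ["MASTER"] = "local[4]"

# Copy the environment and unconditionally drop MASTER from the copy --
# analogous to `unset MASTER` inside the subshell: the parent is untouched.
child_env = dict(os.environ)
child_env.pop("MASTER", None)

out = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ.get('MASTER', 'unset'))"],
    env=child_env, capture_output=True, text=True,
)
print(out.stdout.strip())    # unset       -- the child never saw MASTER
print(os.environ["MASTER"])  # local[4]    -- parent unaffected
```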
@squito
squito / getPaths.scala
Last active March 15, 2019 06:31
get paths of a spark-sql query
import java.lang.reflect.Method
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.sources.{HadoopFsRelation, BaseRelation}
import org.apache.spark.sql.DataFrame
def getPaths(relation: BaseRelation): Iterator[String] = {
relation match {
case hr: HadoopFsRelation =>
hr.paths.toIterator
@squito
squito / chained.py
Last active December 11, 2015 23:35
"chaining" python exception
import traceback
import sys
def a(x): b(x)
def b(x): c(x)
def c(x): d(x)
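On Python 3 the chaining this gist works toward is built into the language via `raise ... from`; a minimal sketch (the `parse` function is illustrative, not from the gist):

```python
import traceback

def parse(text):
    try:
        return int(text)
    except ValueError as e:
        # `raise ... from e` records the original exception as __cause__,
        # so both tracebacks appear in the printed report.
        raise RuntimeError(f"could not parse {text!r}") from e

try:
    parse("not-a-number")
except RuntimeError as e:
    assert isinstance(e.__cause__, ValueError)
    traceback.print_exc()  # shows both exceptions, linked by a "cause" line
```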
@squito
squito / reflector.scala
Last active September 24, 2020 01:09
utils for accessing fields & methods that are private in the Scala REPL via reflection
/* For example, I want to do this:
*
* sqlContext.catalog.client.getTable("default", "blah").properties
*
* but none of that is public to me in the shell. Using this, I can now do:
*
* sqlContext.reflectField("catalog").reflectField("client").reflectMethod("getTable", Seq("default", "blah")).reflectField("properties")
*
* not perfect, but usable.
*/
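For comparison, the Python analog is much lighter: `getattr` reaches "private" attributes directly, and even name-mangled `__fields` are accessible once you know the mangled name (the `Catalog` class below is a made-up stand-in, not Spark's):

```python
class Catalog:
    def __init__(self):
        self.__client = "hive-client"  # name-mangled to _Catalog__client

catalog = Catalog()

# Reflection in Python is just attribute lookup by runtime name; double
# underscores only prepend the class name, they don't enforce privacy.
client = getattr(catalog, "_Catalog__client")
print(client)  # hive-client
```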
@squito
squito / CanIReadOpenDeletedFiles.scala
Created October 26, 2015 15:53
CanIReadOpenDeletedFiles.scala
import java.io._
object CanIReadOpenDeletedFile {
def main(args: Array[String]): Unit = {
try {
val f = new File("deleteme")
val out = new FileOutputStream(f)
out.write(1)
out.close()
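The answer the gist is probing can be sketched in Python as well; on POSIX systems the read succeeds, because unlinking a file does not invalidate handles that are already open (Windows behaves differently):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "deleteme")
with open(path, "w+b") as f:
    f.write(b"\x01")
    f.flush()
    os.remove(path)   # file gone from the directory...
    f.seek(0)
    data = f.read()   # ...but still readable via the open handle

print(data)                  # b'\x01'
print(os.path.exists(path))  # False
```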
@squito
squito / GroupedRDD.scala
Last active August 29, 2015 14:17
GroupedRDD
import java.io.{IOException, ObjectOutputStream}
import scala.language.existentials
import scala.reflect.ClassTag
import org.apache.spark._
import org.apache.spark.rdd.RDD
case class GroupedRDDPartition(