Marek Wiewiórka (mwiewior)
@mwiewior
mwiewior / cache-oblivious.md
Created February 17, 2024 15:03 — forked from debasishg/cache-oblivious.md
Papers related to cache oblivious data structures

Cache-Oblivious and Cache-Aware Data Structures and Algorithms

  1. Cache-Oblivious Algorithms and Data Structures - Erik Demaine (One of the earliest papers on cache-oblivious data structures and algorithms; it introduces the cache-oblivious model in detail and examines the static and dynamic cache-oblivious data structures built between 2000 and 2003; see the layout sketch after this list)

  2. Cache Oblivious B-Trees - Bender, Demaine, Farach-Colton (This paper presents two dynamic search trees attaining near-optimal performance on any hierarchical memory. One of the fundamental papers in the field; both search trees discussed match the optimal search bound of Θ(1 + log_{B+1} N) memory transfers)

  3. Cache Oblivious Search Trees via Binary Trees of Small Height - Brodal, Fagerberg, Jacob (The data structure discussed in this paper works on the version of [2] but avoids the use o…
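
To make the static-layout idea above concrete, here is a small Scala sketch of my own (not code from any of the papers listed) of the van Emde Boas layout that static cache-oblivious search trees are built on: a complete binary search tree of height h is cut at roughly half its height, the top half is stored first, then each bottom subtree, recursively, so that a search touches O(log_{B+1} N) memory blocks for every block size B simultaneously.

// Illustrative sketch only: van Emde Boas (recursive) layout of a complete binary search tree.
// Nodes are identified by their implicit heap index (root = 1, children 2i and 2i+1);
// layout returns those indices in the order they would be stored in memory.
object VanEmdeBoasLayout {
  def layout(root: Int, height: Int): Vector[Int] =
    if (height == 1) Vector(root)
    else {
      val topH = height / 2            // height of the top recursive subtree
      val botH = height - topH         // height of each bottom recursive subtree
      val top  = layout(root, topH)    // lay out the top subtree first
      // the bottom subtrees are rooted at the children of the top subtree's lowest level
      val bottomRoots = (root << topH) until ((root << topH) + (1 << topH))
      top ++ bottomRoots.flatMap(layout(_, botH))
    }
}

// Example: a tree of height 4 (15 nodes) is laid out as
// 1, 2, 3, 4, 8, 9, 5, 10, 11, 6, 12, 13, 7, 14, 15
// println(VanEmdeBoasLayout.layout(1, 4).mkString(", "))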

resource "google_project_iam_member" "tbd-editor-member" {
#checkov:skip=CKV_GCP_49: "Ensure no roles that enable to impersonate and manage all service accounts are used at a project level"
#checkov:skip=CKV_GCP_117: "Ensure basic roles are not used at project level."
# This is only used for workshops!!!
project = google_project.tbd_project.project_id
role = "roles/owner"
member = "serviceAccount:${google_service_account.tbd-terraform.email}"
}
@mwiewior
mwiewior / spark-amm.sh
Created July 9, 2020 15:45 — forked from ottomata/spark-amm.sh
spark + ammonite
#!/usr/bin/env bash
export SPARK_HOME="${SPARK_HOME:-/usr/lib/spark2}"
export SPARK_CONF_DIR="${SPARK_CONF_DIR:-"${SPARK_HOME}"/conf}"
source ${SPARK_HOME}/bin/load-spark-env.sh
export HIVE_CONF_DIR=${SPARK_CONF_DIR}
export HADOOP_CONF_DIR=/etc/hadoop/conf
AMMONITE=~/bin/amm # This is amm binary release 2.11-1.6.7
@mwiewior
mwiewior / README.md
Created July 1, 2020 12:23 — forked from bradfordcp/README.md
Setting up Apache Spark to use Apache Shiro for authentication of the Spark Master dashboard.

Securing Apache Spark with Apache Shiro

  1. Download shiro-core-1.2.5.jar from the Apache Shiro Downloads page
  2. Download shiro-web-1.2.5.jar from the Apache Shiro Downloads page
  3. Note the location of the JAR files and shiro.ini. I placed them in the root of my Spark download
  4. Update the spark-env.sh file with the Shiro JARs and add an entry for the path where shiro.ini resides
  5. Start the Spark master: sbin/start-master.sh
  6. Navigate to the Spark master dashboard
  7. Authenticate with the credentials in shiro.ini

Note: this was developed and tested with Apache Spark 1.4.1, but it should work with newer versions as well.

@mwiewior
mwiewior / carbon.scala
Created July 31, 2019 17:12 — forked from agaszmurlo/carbon.scala
Carbon data varia
// ./spark-shell -v --master yarn-client --driver-memory 1G --executor-memory 2G --executor-cores 2 \
// --jars /tmp/apache-carbondata-1.6.0-SNAPSHOT-bin-spark2.3.2-hadoop2.7.2.jar \
// --conf spark.hadoop.hive.metastore.uris=thrift://cdh01.cl.ii.pw.edu.pl:9083 \
// --conf spark.hadoop.yarn.timeline-service.enabled=false \
// --conf spark.driver.extraJavaOptions=-Dhdp.version=3.1.0.0-78 \
// --conf spark.yarn.am.extraJavaOptions=-Dhdp.version=3.1.0.0-78 \
// --conf spark.hadoop.metastore.catalog.default=hive
import org.apache.spark.sql.SparkSession
// spark-shell -v --master=local[$cores] --driver-memory=12g \
//   --conf "spark.sql.catalogImplementation=in-memory" \
//   --packages org.biodatageeks:bdg-sequila_2.11:0.5.3-spark-2.4.0-SNAPSHOT \
//   --repositories http://repo.hortonworks.com/content/repositories/releases/,http://zsibio.ii.pw.edu.pl/nexus/repository/maven-snapshots/
import org.apache.spark.sql.SequilaSession
import org.biodatageeks.utils.{SequilaRegister, UDFRegister,BDGInternalParams}
val ss = SequilaSession(spark)
SequilaRegister.register(ss)
ss.sqlContext.setConf("spark.biodatageeks.bam.useGKLInflate","true")
ss.sqlContext.setConf("spark.biodatageeks.bam.useSparkBAM","false")
@mwiewior
mwiewior / scala-sbt-project-structure.sh
Created June 1, 2018 16:15 — forked from WarFox/scala-sbt-project-structure.sh
Script to create Scala SBT project directory structure
#!/usr/bin/env bash
touch build.sbt README.md
mkdir -p project && touch project/plugins.sbt
mkdir -p src/{main,test}/{scala,resources,java}
@mwiewior
mwiewior / map-pushdow.sc
Created April 20, 2018 19:15 — forked from joao-parana/map-pushdow.sc
Using CatalystExtension Points in Spark
// This script is meant to be run in Ammonite.
// Create the file catalyst_04.sc with this content.
// Inside the Ammonite REPL shell, invoke it like this:
// import $file.catalyst_04, catalyst_04._
//
// But first run the three commands below:
// import coursier.MavenRepository
// interp.repositories() ++= Seq(MavenRepository("file:/Users/admin/.m2/repository"))
// import $ivy.`org.apache.spark::spark-sql:2.3.0`
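
For context, the extension point such a script exercises is Spark's experimental hook. The rule below is a minimal sketch of my own (not the gist's code), assuming the spark-sql 2.3.0 dependency imported above; it shows the general shape of injecting a custom rule rather than the gist's actual map-pushdown logic.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.{Literal, Multiply}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Illustrative optimizer rule: rewrite `x * 1.0` into `x`.
object SimplifyMultiplyByOne extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Multiply(expr, Literal(1.0, _)) => expr
  }
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
// Register the rule through the experimental extension point.
spark.experimental.extraOptimizations = Seq(SimplifyMultiplyByOne)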
@mwiewior
mwiewior / slack.sh
Created March 4, 2018 18:14 — forked from andkirby/slack.sh
Shell/Bash script for sending Slack messages.
#!/usr/bin/env bash
####################################################################################
# Slack Bash console script for sending messages.
####################################################################################
# Installation
# $ curl -s https://gist.githubusercontent.com/andkirby/67a774513215d7ba06384186dd441d9e/raw --output /usr/bin/slack
# $ chmod +x /usr/bin/slack
####################################################################################
# USAGE
# Send message to slack channel/user
@mwiewior
mwiewior / extraStrategies.md
Created October 13, 2017 08:54 — forked from marmbrus/extraStrategies.md
Example of injecting custom planning strategies into Spark SQL.

First, a disclaimer: this is an experimental API that exposes internals that are likely to change between Spark releases. As a result, most data sources should be written against the stable public API in org.apache.spark.sql.sources. We expose this mostly to get feedback on what optimizations we should add to the stable API in order to get the best performance out of data sources.

We'll start with a simple artificial data source that just returns ranges of consecutive integers.

/** A data source that returns ranges of consecutive integers in a column named `a`. */
case class SimpleRelation(
    start: Int, 
    end: Int)(
    @transient val sqlContext: SQLContext)
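
The preview cuts off mid-definition. As a hedged sketch of where it is heading (assumed from the surrounding text, not the write-up's exact code), such a relation typically extends BaseRelation with TableScan, and a custom planning strategy for it is then injected through the experimental hook the introduction refers to.

import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Assumed completion: a relation whose single column `a` holds the integers start..end.
case class SimpleRelation(
    start: Int,
    end: Int)(
    @transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan {
  override def schema: StructType = StructType(StructField("a", IntegerType) :: Nil)
  override def buildScan() = sqlContext.sparkContext.parallelize(start to end).map(Row(_))
}

// A custom strategy for planning scans over SimpleRelation would then be registered via:
//   sqlContext.experimental.extraStrategies = MyRangeScanStrategy :: Nil   // MyRangeScanStrategy is hypothetical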