Jacek Laskowski jaceklaskowski

:octocat:
Enjoying developer life...
View GitHub Profile
@jaceklaskowski
jaceklaskowski / tensorflow.md
Last active August 7, 2017 06:29
Notes about TensorFlow (before settling on Apache BEAM and Databricks' TensorFrames)

TensorFlow

What is TensorFlow?

  • Google's TensorFlow is an open source machine learning library for deep learning (neural networks)
    • Grew out of DistBelief, the deep learning system built by the Google Brain project (TensorFlow is its second generation)
  • It aims to simplify deploying large-scale machine learning models to a variety of hardware (thousands of servers in datacenters, smartphones, GPUs).
  • Much like Theano, a popular deep learning framework.
  • A Data Flow Graph (aka Computational Graph or TensorFlow Graph of Computation) has nodes for data or operations and edges for the flow of data between nodes.
  • A tensor is a multi-dimensional array that flows along the edges between nodes.
jaceklaskowski / spark-exercise-custom-defaultsource.md
Last active January 9, 2018 19:19
Exercise: Creating Custom Format for DataFrameReader in Apache Spark
  1. Create a Scala/sbt project
     • Use IntelliJ IDEA
  2. Add libraryDependencies for Spark 2.0.0 (RC2)
  3. Create class mf.DefaultSource (or similar)
  4. publishLocal (or similar)
  5. ./bin/spark-shell --packages organization:spark-mf-format_2.11:1.0.0
  6. spark.read.format("mf").load("mojFormat.mf")
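A minimal sketch of step 3, assuming `libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0"` from step 2 is on the classpath. The class and schema here are illustrative (a single `line` column backed by `textFile`); a real `.mf` format would parse the file properly. Spark resolves `format("mf")` by looking up a class named `DefaultSource` in the `mf` package:

```scala
package mf

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// spark.read.format("mf") resolves to mf.DefaultSource
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new MfRelation(sqlContext, parameters("path"))
}

// Toy relation: one string column, every line of the file becomes a Row
class MfRelation(val sqlContext: SQLContext, path: String)
    extends BaseRelation with TableScan {

  override def schema: StructType =
    StructType(StructField("line", StringType, nullable = true) :: Nil)

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.textFile(path).map(Row(_))
}
```

After publishLocal (step 4), `spark.read.format("mf").load("mojFormat.mf")` returns a DataFrame with the single `line` column.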

For the bravest:

jaceklaskowski / spark-summit-sf-2016-talks.md
Last active November 2, 2016 12:48
Reviews of Spark Summit 2016 Talks -- Must-watches

Awesome Talks -- Watch them!

  1. Deep Dive: Apache Spark Memory Management - An excellent talk about Spark's memory management in past releases and the upcoming 2.0. No code. The slides were awesome, with a superb presentation style. Very informative.
  2. A Deep Dive Into Structured Streaming -- a superb talk about the upcoming Structured Streaming in Spark 2.0.
  3. Structuring Spark: Dataframes, Datasets And Streaming -- another superb talk about the reasons for structuring Spark using Datasets by the one and only Michael Armbrust.
  4. Large-Scale Deep Learning with TensorFlow by Jeff Dean (Google) -- just yesterday I was thinking about feature vectors and how closely they map to the real objects they are supposed to represent, and that gave me the aha moment that the more features the better, but you need to be careful with over-featuring the m
jaceklaskowski / sparksummit-west-2016.md
Last active March 24, 2017 11:43
Spark Summit West 2016 Sparked My Interest -- Spark Summit West 2016 in San Francisco (to review at the earliest convenience)
jaceklaskowski / spark-hackathon.md
Created May 14, 2016 14:52
Apache Spark Hackathon
jaceklaskowski / spark-jobserver-docker-macos.md
Last active August 1, 2018 11:28
How to run spark-jobserver on Docker and Mac OS (using docker-machine)
jaceklaskowski / apache-spark-meetup.md
Last active October 15, 2015 06:50
What people asked to cover at Apache Spark meetups

The Warsaw Scala Enthusiasts meetup about Apache Spark was themed Let's Scala few Apache Spark apps together!, with the follow-up Let's Scala few Apache Spark apps together - part 2!.

Many, many people answered the question:

EN: What and how would you like to learn at the meetup (about Apache Spark)?

The answers are as follows (and are going to be the foundation for the agenda):

  1. Set up a cluster using many laptops and see how much it could handle.
  2. MLlib with a simple classification like logistic regression.
jaceklaskowski / jvm-tools.md
Created September 4, 2015 09:55
I should have known these tools earlier - a story about jps, jstat and jmap

From http://stackoverflow.com/a/32393044/1305344:

object size extends App {
  // allocate a million tuples to make the heap interesting to inspect
  (1 to 1000000).map(i => ("foo" + i, ()))
  // block on stdin so the JVM stays alive while jps/jstat/jmap run
  val input = scala.io.StdIn.readLine("prompt> ")
}

Run it with sbt 'runMain size' and then use jps (to find the pid), jstat -gc pid (to query GC statistics) and jmap (to inspect the heap, similar to jstat) to analyse resource allocation.
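A sketch of that workflow with the standard JDK tools (the pid is a placeholder you read off the jps output; the 1000 ms sampling interval is arbitrary):

```shell
# list JVM pids with their main classes and arguments
jps -lm

# sample GC heap sizes and collection counts every 1000 ms
jstat -gc <pid> 1000

# print a histogram of live heap objects by class
jmap -histo <pid> | head -20
```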

jaceklaskowski / spark-intro.md
Last active February 29, 2020 19:38
Introduction to Apache Spark

Introducing Apache Spark

  • What use cases are a good fit for Apache Spark? How to work with Spark?
    • create RDDs, transform them, and execute actions to get result of a computation
    • All computations in memory = "memory is cheap" (though we do need enough memory to fit all the data in)
      • the fewer disk operations, the faster (you do know it, don't you?)
    • You develop such computation flows or pipelines using a programming language - Scala, Python or Java <-- that's where the ability to write code is paramount
    • Data usually lives on a distributed file system like Hadoop HDFS or in NoSQL databases like Cassandra
    • Data mining = analysis / insights / analytics
  • log mining
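The create-transform-action flow from the bullets above can be sketched in Scala as a tiny log-mining job (the file path and the ERROR filter are illustrative; assumes spark-core on the classpath):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LogMining extends App {
  val conf = new SparkConf().setAppName("log-mining").setMaster("local[*]")
  val sc = new SparkContext(conf)

  val lines  = sc.textFile("logs/app.log")        // create an RDD (lazy)
  val errors = lines.filter(_.contains("ERROR"))  // transformation (still lazy)
  val count  = errors.count()                     // action: triggers the computation

  println(s"ERROR lines: $count")
  sc.stop()
}
```

Nothing is read or computed until `count()` runs; the transformations only build up the lineage of the computation.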
jaceklaskowski / sphinx-dockerd.md
Last active April 23, 2021 06:33
Writing docs using Sphinx (inside Docker)