VS Code Plugins and configuration tips
https://code.visualstudio.com/docs/setup/setup-overview
/*
Usage: you'll want to search for the strings <bucket> and <prefix>, and insert the S3 bucket where your access
logs are being delivered. Use (or delete) <prefix> to filter to a subset of your logs.
*/
/*
You can either run these commented-out configuration settings yourself in the REPL and then source this file
using `.read parse_s3_access_logs.sql`, or you can uncomment them here and supply the values yourself.
*/
// need to add the Apache WS XMLSchema library to spark/jars (does not have dependencies)
// https://repo1.maven.org/maven2/org/apache/ws/xmlschema/xmlschema-core/2.2.5/xmlschema-core-2.2.5.jar
import org.apache.ws.commons.schema.XmlSchemaCollection
import java.io.StringReader
import scala.collection.JavaConverters._
import org.apache.ws.commons.schema._
import org.apache.ws.commons.schema.constants.Constants
import org.apache.spark.sql.types._
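The imports above suggest walking an XSD with XmlSchemaCollection and mapping its element types onto Spark SQL types. Building on those imports, here is a minimal sketch of that idea; the object name, the type mapping, and the fallback to StringType are assumptions for illustration, not the original code.

object XsdToSparkSchemaSketch {
  // Map a handful of common XSD simple types onto Spark SQL types;
  // anything unrecognized falls back to StringType.
  private val xsdToSpark: Map[javax.xml.namespace.QName, DataType] = Map(
    Constants.XSD_STRING  -> StringType,
    Constants.XSD_INT     -> IntegerType,
    Constants.XSD_LONG    -> LongType,
    Constants.XSD_DOUBLE  -> DoubleType,
    Constants.XSD_BOOLEAN -> BooleanType
  )

  // Parse an XSD document (given as a string) and turn its top-level elements into a StructType.
  def schemaFor(xsdText: String): StructType = {
    val collection = new XmlSchemaCollection()
    val xmlSchema  = collection.read(new StringReader(xsdText))
    val fields = xmlSchema.getElements.asScala.values.map { element =>
      StructField(element.getName, xsdToSpark.getOrElse(element.getSchemaTypeName, StringType), nullable = true)
    }
    StructType(fields.toSeq)
  }
}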
import pandas as pd
from tqdm import tqdm
import csv
import random
import string
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
random.seed(1999)
package fpmax

import scala.util.Try
import scala.io.StdIn.readLine

object App0 {
  def main: Unit = {
    println("What is your name?")
    val name = readLine()
    println("Hello, " + name + ", welcome to the game!")
    // ... the rest of the guessing game continues from here
  }
}
package com.databricks.spark.jira

import scala.io.Source
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.sources.{TableScan, BaseRelation, RelationProvider}
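These imports are the standard ingredients of a Spark Data Source (V1) connector. As a rough sketch of how they fit together (this is not the actual spark-jira implementation; the JiraRelation name, the "url" option, and the one-column schema are invented for illustration):

import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Entry point Spark looks up by package name: given the user's options, build a BaseRelation.
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    JiraRelation(parameters("url"))(sqlContext)
}

// A relation that fetches text from a URL and exposes each line as a single-column row.
case class JiraRelation(url: String)(@transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  override def schema: StructType = StructType(Seq(StructField("line", StringType)))

  // TableScan: materialize the relation as an RDD[Row].
  override def buildScan(): RDD[Row] = {
    val lines = Source.fromURL(url).getLines().toSeq
    sqlContext.sparkContext.parallelize(lines).map(Row(_))
  }
}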
Consider repartitioning after flatMap, especially if the following operation will result in high memory usage. The flatMap op usually results in a DataFrame with a [much] larger number of rows, yet the number of partitions remains the same. Thus, if a subsequent op causes a large expansion of memory usage (e.g. converting a DataFrame of indices to a DataFrame of large Vectors), the memory usage per partition may become too high. In this case, it is beneficial to repartition the output of flatMap to a number of partitions that will safely allow for appropriate partition memory sizes, based upon the expected expansion.
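As a rough illustration of the pattern (the use of explode as the flatMap-style expansion, the column names, and the target of 400 partitions are all made-up values for this example):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, explode, lit}

val spark = SparkSession.builder().appName("repartition-after-flatmap-sketch").getOrCreate()

// Hypothetical narrow input: one small row per document id, in relatively few partitions.
val ids = spark.range(0, 1000000).toDF("doc_id")

// The flatMap-style expansion: each row fans out into 100 rows, but the partition count
// is unchanged, so each partition now holds roughly 100x as much data.
val exploded = ids.withColumn("token", explode(array((0 until 100).map(lit): _*)))

// Repartition before the memory-hungry downstream step so per-partition memory stays reasonable.
// 400 is an assumed target; choose it from the expected row count and per-row memory footprint.
val repartitioned = exploded.repartition(400)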
This is a guide for Scala and Java development on Windows, using Windows Subsystem for Linux, although much of it also applies to a VirtualBox / Vagrant / Docker environment. It is not complete, but is intended to be as step-by-step as possible.
Read the entire Decent Security guide, and follow the instructions, especially:
Flame graphs are a nifty debugging tool to determine where CPU time is being spent. Using the Java Flight Recorder, you can do this for Java processes without adding significant runtime overhead.
Shivaram Venkataraman and I have found these flame recordings to be useful for diagnosing coarse-grained performance problems. We started using them at the suggestion of Josh Rosen, who quickly made one for the Spark scheduler when we were talking to him about why the scheduler caps out at a throughput of a few thousand tasks per second. Josh generated a graph similar to the one below, which illustrates that a significant amount of time is spent in serialization (if you click in the top right-hand corner and search for "serialize", you can see that 78.6% of the sampled CPU time was spent in serialization). We used this insight to speed up the scheduler.
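If you want to reproduce this on a Spark job, one way to start a recording is to pass the Flight Recorder flags through the executor JVM options. A minimal sketch, with placeholder duration, output path, and app name (on JDK 11+ -XX:StartFlightRecording is enough; older Oracle JDK 8 builds also require -XX:+UnlockCommercialFeatures -XX:+FlightRecorder):

import org.apache.spark.sql.SparkSession

// Start a 2-minute flight recording on each executor and write it to a local .jfr file,
// which can then be converted into a flame graph offline.
val spark = SparkSession.builder()
  .appName("jfr-flamegraph-example")
  .config("spark.executor.extraJavaOptions",
    "-XX:StartFlightRecording=duration=120s,filename=/tmp/executor.jfr")
  .getOrCreate()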