Ricardo Gaspar (ricardogaspar2)

@yoyama
yoyama / Schema2CaseClass.scala
Created January 20, 2017 07:36
Generate a case class from a Spark DataFrame/Dataset schema.
/**
 * Generate a case class from DataFrame.schema
 *
 *   val df: DataFrame = ...
 *
 *   val s2cc = new Schema2CaseClass
 *   import s2cc.implicits._
 *
 *   println(s2cc.schemaToCaseClass(df.schema, "MyClass"))
 */
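The preview above stops inside the doc comment; the full class lives in the gist. As a rough, hedged sketch of the idea (not yoyama's actual implementation, and assuming a flat schema of primitive columns), the core of such a generator just walks the StructType and emits one field per column:

import org.apache.spark.sql.types._

// Minimal sketch: map each Spark column type to a Scala type name and
// assemble a one-line case class definition.
def schemaToCaseClass(schema: StructType, className: String): String = {
  def scalaType(dt: DataType): String = dt match {
    case StringType  => "String"
    case IntegerType => "Int"
    case LongType    => "Long"
    case DoubleType  => "Double"
    case BooleanType => "Boolean"
    case other       => other.simpleString   // crude fallback for nested/complex types
  }
  val fields = schema.fields.map(f => s"${f.name}: ${scalaType(f.dataType)}")
  s"case class $className(${fields.mkString(", ")})"
}

For a DataFrame with columns id: long and name: string, schemaToCaseClass(df.schema, "MyClass") would return case class MyClass(id: Long, name: String); a full implementation would also need to handle nullable and nested columns.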
@cameck
cameck / start.sh
Created January 15, 2017 23:00
Getting AWS Credentials into a Docker Container without Hardcoding Them
# Read the credentials from the local AWS CLI profile instead of hardcoding them.
AWS_ACCESS_KEY_ID=$(aws --profile default configure get aws_access_key_id)
AWS_SECRET_ACCESS_KEY=$(aws --profile default configure get aws_secret_access_key)

docker build -t my_app .

# Pass the credentials into the container as environment variables.
docker run -it --rm \
  -e AWS_ACCESS_KEY_ID="$AWS_ACCESS_KEY_ID" \
  -e AWS_SECRET_ACCESS_KEY="$AWS_SECRET_ACCESS_KEY" \
  my_app
@dusenberrymw
dusenberrymw / spark_tips_and_tricks.md
Last active February 8, 2023 05:11
Tips and tricks for Apache Spark.

Spark Tips & Tricks

Misc. Tips & Tricks

  • If values are integers in [0, 255], Parquet will automatically compress them to 1-byte unsigned integers, decreasing the size of the saved DataFrame by a factor of 8.
  • Partition DataFrames to have evenly-distributed, ~128MB partition sizes (empirical finding). Always err on the higher side w.r.t. number of partitions.
  • Pay particular attention to the number of partitions when using flatMap, especially if the following operation will result in high memory usage. The flatMap op usually produces a DataFrame with a [much] larger number of rows, yet the number of partitions stays the same. Thus, if a subsequent op causes a large expansion of memory usage (e.g. converting a DataFrame of indices to a DataFrame of large Vectors), the memory usage per partition may become too high. In this case, it is beneficial to repartition the output of flatMap to a number of partitions that will safely allow for appropriate partition memory sizes, based upon the expected expansion (see the sketch after this list).
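A hedged sketch of that last tip; the 100x expansion factor and the target of 200 partitions are assumed values for illustration, not numbers from the gist:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("flatmap-repartition-sketch").getOrCreate()
import spark.implicits._

// Each input row expands into 100 rows, so memory per partition grows ~100x
// while the number of partitions stays the same.
val indices  = spark.range(0, 1000000L).as[Long]
val expanded = indices.flatMap(i => Seq.fill(100)(i))

// Repartition the flatMap output before the memory-heavy downstream step,
// sizing toward the ~128MB-per-partition guideline above.
val resized = expanded.repartition(200)
println(s"partitions before: ${expanded.rdd.getNumPartitions}, after: ${resized.rdd.getNumPartitions}")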
@mrlesmithjr
mrlesmithjr / ansible-macos-homebrew-packages.yml
Last active April 29, 2024 10:40
Install MacOS Homebrew Packages With Ansible
---
- name: Install MacOS Packages
  hosts: localhost
  become: false
  vars:
    brew_cask_packages:
      - atom
      - docker
      - dropbox
      - firefox
@aplocher
aplocher / FixStoreApps.ps1
Last active May 28, 2023 00:32
Fix for Remove-AppxPackage error "HRESULT: 0x80073CFA, Removal failed The system cannot find the file specified.". Requires psexec to be installed
param (
    [switch]$Relaunched = $false
)

$ScriptPath = (Get-Variable MyInvocation).Value.MyCommand.Path

function StartOperation {
    Write-Host
    Write-Host "Now attempting to regenerate missing manifest files..."
    Write-Host
@mgraciano
mgraciano / HelloAvro.java
Last active June 3, 2022 15:14
Concise example of how to write an Avro record out as a file and parse it with embedded and custom schemas in Java
package mgraciano;
import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
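The preview shows only the imports of HelloAvro.java. For a flavor of the same write-then-read round trip, here is a hedged Scala sketch against the generic Avro API; the record shape ("Person" with name and age) and the file name are assumptions, not taken from the gist:

import java.io.File
import org.apache.avro.SchemaBuilder
import org.apache.avro.file.{DataFileReader, DataFileWriter}
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}

object HelloAvroSketch extends App {
  // Build a schema programmatically; it gets embedded in the container file on write.
  val schema = SchemaBuilder.record("Person").fields()
    .requiredString("name")
    .requiredInt("age")
    .endRecord()

  val record = new GenericData.Record(schema)
  record.put("name", "Ada")
  record.put("age", Integer.valueOf(36))   // box explicitly: put() takes java.lang.Object

  // Write: the schema travels inside the .avro file alongside the data.
  val file = new File("person.avro")
  val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
  writer.create(schema, file)
  writer.append(record)
  writer.close()

  // Read: the reader recovers the embedded schema from the file itself.
  val reader = new DataFileReader[GenericRecord](file, new GenericDatumReader[GenericRecord]())
  while (reader.hasNext) println(reader.next())
  reader.close()
}

Reading with a custom (reader) schema would pass an explicit Schema to GenericDatumReader instead of relying only on the one embedded in the file.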
@lossyrob
lossyrob / mixing-overloads-default-params.md
Last active April 14, 2021 23:26
Mixing overloads and default parameters in Scala

Mixing overloads and default parameters in Scala

The problem

There are situations where you want default arguments in an object's apply methods, but you also want to overload apply. For instance, in GeoTrellis, we have an S3LayerWriter which allows you to write an RDD of rasters out to Amazon's S3 storage backend. In order to operate, it needs an AttributeStore, which is the type responsible for reading and writing metadata. A simplified (not real) signature of the attribute store looks like:

case class AttributeStore(bucket: String, prefix: String)
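Building on that simplified AttributeStore, here is a hedged sketch of the clash itself; the S3LayerWriter body and apply signatures below are invented for illustration, not GeoTrellis's real API:

class S3LayerWriter(attributeStore: AttributeStore, clobber: Boolean)

object S3LayerWriter {
  // Each overload is fine on its own, but together they are rejected by scalac with
  // an error along the lines of "multiple overloaded alternatives of method apply
  // define default arguments".
  def apply(attributeStore: AttributeStore, clobber: Boolean = true): S3LayerWriter =
    new S3LayerWriter(attributeStore, clobber)

  def apply(bucket: String, prefix: String, clobber: Boolean = true): S3LayerWriter =
    apply(AttributeStore(bucket, prefix), clobber)
}

One common workaround is to keep default arguments on only one of the overloads and have the others forward to it.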
package net.atos.sparti.pub
import java.io.PrintStream
import java.net.Socket
import org.apache.commons.pool2.impl.{DefaultPooledObject, GenericObjectPool}
import org.apache.commons.pool2.{ObjectPool, PooledObject, BasePooledObjectFactory}
import org.apache.spark.streaming.dstream.DStream
class PooledSocketStreamPublisher[T](host: String, port: Int)
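The preview above ends at the class declaration. Below is a hedged, self-contained sketch (not the gist's actual code) of the pattern such a publisher typically implements with commons-pool2: write each partition of every micro-batch through a pooled socket rather than opening one connection per record. The pool here is built inside the partition purely to keep the sketch compact; a real implementation would keep it as a per-executor singleton.

import java.io.PrintStream
import java.net.Socket
import org.apache.commons.pool2.{BasePooledObjectFactory, PooledObject}
import org.apache.commons.pool2.impl.{DefaultPooledObject, GenericObjectPool}
import org.apache.spark.streaming.dstream.DStream

// Tells commons-pool2 how to create, wrap and destroy sockets.
class SocketFactory(host: String, port: Int) extends BasePooledObjectFactory[Socket] {
  override def create(): Socket = new Socket(host, port)
  override def wrap(socket: Socket): PooledObject[Socket] = new DefaultPooledObject(socket)
  override def destroyObject(p: PooledObject[Socket]): Unit = p.getObject.close()
}

// Assumed helper name; the real gist exposes a richer publish API on the class above.
def publishStrings(stream: DStream[String], host: String, port: Int): Unit =
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // Everything below runs on the executor, so no socket is ever serialized.
      val pool = new GenericObjectPool[Socket](new SocketFactory(host, port))
      val socket = pool.borrowObject()
      val out = new PrintStream(socket.getOutputStream)
      try records.foreach(line => out.println(line))
      finally {
        out.flush()
        pool.returnObject(socket)
        pool.close()
      }
    }
  }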
@kmader
kmader / README.md
Last active October 31, 2023 14:21
Beating Serialization in Spark

Serialization

As all objects must be Serializable to be used as part of RDD operations in Spark, it can be difficult to work with libraries which do not implement this interface.

Java Solutions

Simple Classes

For simple classes, it is easiest to make a wrapper interface that extends Serializable. This means that even though UnserializableObject cannot be serialized, we can pass around the following object without any issue:

public interface UnserializableWrapper extends Serializable {
    public UnserializableObject create(String parm1, String parm2);
}
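The preview cuts off at the interface. As a hedged Scala rendition of how such a wrapper is typically put to work (the stand-in class body, the concrete wrapper, and the process call are assumptions for illustration, not part of the gist):

import org.apache.spark.sql.SparkSession

// Stand-in for the third-party class that cannot be made Serializable.
class UnserializableObject(parm1: String, parm2: String) {
  def process(s: String): String = s"$parm1/$parm2/$s"
}

// The wrapper itself is Serializable, so Spark can ship it inside closures.
trait UnserializableWrapper extends Serializable {
  def create(parm1: String, parm2: String): UnserializableObject
}

object WrapperUsageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("wrapper-sketch").master("local[*]").getOrCreate()

    val wrapper: UnserializableWrapper = new UnserializableWrapper {
      def create(parm1: String, parm2: String) = new UnserializableObject(parm1, parm2)
    }

    // Only the small wrapper crosses the wire; the heavyweight object is
    // constructed on the executor, inside each partition.
    val result = spark.sparkContext
      .parallelize(Seq("a", "b", "c"))
      .mapPartitions { it =>
        val obj = wrapper.create("parm1", "parm2")
        it.map(obj.process)
      }
      .collect()

    result.foreach(println)
    spark.stop()
  }
}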
@VladUreche
VladUreche / gist:8396624
Created January 13, 2014 08:39
Scaladoc tutorial for docs.scala-lang.org, in a pitiful state
# Scaladoc Developer Guide
## Introduction
Scaladoc is the tool that enables developers to automatically generate documentation for their Scala (and Java) projects. It is Scala's equivalent of the widely-used Javadoc tool. This means that Javadoc (and even doxygen) users will be familiar with Scaladoc from day 1: for them, it is most beneficial to check out the Scaladoc/Javadoc comparison tables and if necessary, skim through this document to understand specific features.
The rest of this tutorial is aimed at developers new to Scaladoc and other similar tools. It assumes a basic understanding of the Scala language, which is necessary to follow the examples given throughout the tutorial. For the user perspective on the Scaladoc-generated documentation, such as finding a class, understanding the page layout, navigating through diagrams, please refer to the Scaladoc User Guide.
The tutorial starts with a short motivation and then explains the main concept in Scaladoc: the doc comment.
### Why document?