import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast
import shapeless.ops.hlist.Prepend
import shapeless.{::, HList, HNil}

object flow {

  // Type-level list recording which joins have already been applied.
  type JoinList = HList

  // A DataFrame tagged with its data type D and its accumulated joins J.
  case class AnnotatedDataFrame[D, J <: JoinList](toDF: DataFrame) extends Serializable
}
evergreen documentation
Hi again Jens.
I have studied it some but have not yet used it for an application. An application idea I'd like to pursue is some form of dynamic resume.
https://planet42.github.io/Laika/03-preparing-content/03-theme-settings.html#the-helium-theme
Here you mention the possibility of using Bootstrap-based themes: do you have an example of this kind, please?
I tried to write my thoughts down in a "one page proposal" style, but have succeeded only moderately.
Please advise whether this makes sense.
In my work designing big data management and analytics products I often make the case that "knowledge science" has to come before "data science".
Unless the meaning of the data is under governance, the numbers produced by data/ML analyses will not be as useful.
Instead, semantic data governance enables:
* better use of the raw data, from both a business and an engineering point of view
iptables -L -nv --line-numbers
```
Chain INPUT (policy DROP 0 packets, 0 bytes)
num pkts bytes target         prot opt in out source     destination
1     12   792 ICMP-flood     icmp --  *  *  0.0.0.0/0   0.0.0.0/0
2     10   400 DROP           all  --  *  *  0.0.0.0/0   0.0.0.0/0   ctstate INVALID
3    953  519K ACCEPT         all  --  *  *  0.0.0.0/0   0.0.0.0/0   ctstate RELATED,ESTABLISHED
4    204  9472 AUTO_WHITELIST tcp  --  *  *  0.0.0.0/0   0.0.0.0/0   tcp flags:0x17/0x02
5     13  1322 AUTO_WHITELIST udp  --  *  *  0.0.0.0/0   0.0.0.0/0
```
"Interpretable Machine Learning with XGBoost" https://towardsdatascience.com/interpretable-machine-learning-with-xgboost-9ec80d148d27
"Interpreting complex models with SHAP values" https://medium.com/@gabrieltseng/interpreting-complex-models-with-shap-values-1c187db6ec83
"Interpreting your deep learning model by SHAP" https://towardsdatascience.com/interpreting-your-deep-learning-model-by-shap-e69be2b47893
"SHAP for explainable machine learning" https://meichenlu.com/2018-11-10-SHAP-explainable-machine-learning/
"Detecting Bias with SHAP - What do Developer Salaries Tell us about the Gender Pay Gap?" https://databricks.com/blog/2019/06/17/detecting-bias-with-shap.html
https://github.com/slundberg/shap
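The articles above all use the same small API. A minimal sketch, assuming the `shap` and `xgboost` packages; the toy data and model here are illustrative only:

```python
import numpy as np
import xgboost
import shap

# Toy regression data; any tabular X, y works.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=500)

model = xgboost.XGBRegressor(n_estimators=50).fit(X, y)

# TreeExplainer is the fast, exact path for tree ensembles like XGBoost.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global view: feature importance with direction of effect.
shap.summary_plot(shap_values, X)
```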
# Cross language/framework/platform data fabric
## Requirements / Goals
1. #DataSchema abstracts over data types, from simple tabular ("data frame") to multi-dimensional tensors/arrays, graphs, etc. (see HDF5)
2. #DataSchema specifiable through a functional / declarative language (like Kotlingrad + Petastorm/Unischema; see the sketch after this list)
3. #DataSchema with bindings to languages (Scala, Python) and frameworks (Parquet, ApacheHudi, Tensorflow, ApacheSpark, PyTorch)
4. #DataSchema defines both the in-memory #DataFabric and the schema for data at rest (Parquet, ApacheHudi, Petastorm, etc.)
5. Runtime derived from the "shared runtime" paradigm of #ApacheArrow (no conversions, zero-copy, JVM off-heap)
6. Runtime treats IO/persistence as a separate effect (abstracted away from algorithm/application logic)
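A minimal sketch of requirement 2, assuming Petastorm's Unischema API as documented in its README; the schema and field names are illustrative only:

```python
import numpy as np
from pyspark.sql.types import IntegerType
from petastorm.codecs import ScalarCodec, NdarrayCodec
from petastorm.unischema import Unischema, UnischemaField

# One declarative schema covering a scalar column and a tensor column (goal 1),
# reusable both for Parquet at rest and for feeding TF/PyTorch (goals 3-4).
SensorSchema = Unischema('SensorSchema', [
    UnischemaField('id', np.int32, (), ScalarCodec(IntegerType()), False),
    UnischemaField('reading', np.float32, (64, 64), NdarrayCodec(), False),
])

# Binding to one framework: derive the Spark StructType from the same definition.
spark_schema = SensorSchema.as_spark_schema()
```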
package io.yields.common.meta
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType
import scala.annotation._
import scala.meta._
/**
SemanticBeeng / arrow panda marshalling (last active May 31, 2018)
# https://arrow.apache.org/docs/python/memory.html
# https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html
# https://arrow.apache.org/docs/python/ipc.html
# https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_io.py
# https://github.com/apache/arrow/blob/master/python/pyarrow/serialization.py
# https://jakevdp.github.io/PythonDataScienceHandbook/02.09-structured-data-numpy.html
# https://stackoverflow.com/questions/46837472/converting-pandas-dataframe-to-structured-arrays
import pyarrow as pa
import pandas as pd
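A minimal round-trip sketch of the marshalling the links above describe, assuming a recent pyarrow; the toy frame is illustrative only:

```python
import pyarrow as pa
import pandas as pd

# pandas -> Arrow RecordBatch -> IPC stream bytes -> back to pandas.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
batch = pa.RecordBatch.from_pandas(df)

sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)

with pa.ipc.open_stream(sink.getvalue()) as reader:
    roundtrip = reader.read_all().to_pandas()
```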
SemanticBeeng / structured numpy arrays (last active May 12, 2018)
# #resource
# https://docs.scipy.org/doc/numpy-1.14.0/user/basics.rec.html
conda install -c conda-forge traits=4.6.0
traits: 4.6.0-py36_1 conda-forge
import numpy as np
from traits.api import Array, Tuple, List, String
from traitschema import Schema
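A short sketch of what the imports above set up; the structured-array part follows the numpy basics.rec docs linked above, while the `Demo` class is hypothetical and assumes traitschema's Schema usage as shown in its README:

```python
import numpy as np
from traits.api import Array, String
from traitschema import Schema

# Plain structured array: named, typed fields in one contiguous ndarray.
people = np.zeros(2, dtype=[("name", "U10"), ("age", "i4"), ("weight", "f8")])
people["name"] = ["Alice", "Bob"]
people["age"] = [30, 25]

# traitschema: declare typed attributes as traits on a Schema subclass.
# `Demo` and its fields are hypothetical examples, not from the gist.
class Demo(Schema):
    name = String()
    scores = Array(dtype=np.float64)

d = Demo(name="trial", scores=np.random.random(8))
```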
#! /bin/bash
# Root backup directories (sources, locals, destinations and mount points) for backups executed on this machine
# Root of backups executed on this machine (local copies for $BCKP_DIRs of all the backups)
export BCKP_DIRS=/data/bckp_dirs
# Root of backup source directories for data from other machines (see $BCKP_SRC)
export BCKP_SRCS=/mnt/backups/bckp_srcs
# Root of backup remote destination directories (remote copies for $BCKP_DIRs of all the backups)