liopic/notes.md

## notes.md

      
    PyData Berlin
FRIDAY


Text analysis

libraries

nltk - not as well maintained, old academic code
spacy - has language models
gensim - includes a corpus (Lee Background)
NLP
stopwords, adding my own to spacy
POS, NER, displaying


gensim. Dictionary
topic modeling (unsuperv) vs text classification (superv)
text classification (julios way)


Storytelling and visualization

sell your ideas

you want to point for some action, what's the goal?


no powerpoint
not defined what a data scientist is yet
example bad graph
Context: Audience, goals, next action

polymorphic messages by politicians (everybody understands something extra)
Stories, graphics and animations
business-money, science-truth, politics-power


5 seconds test

even if there is a lot of text, there was an action


What would you like to show? -> famous decision tree
storytelling with data - book
best visualization -> just a number
seaborn example gallery
don't

use secondary y-axis
pie charts (and the audience fight!)
stacked bars (only compares the first/lower part and total)


tools

github.com/rougier/python-visualization-landscape
others: pygal, plotly


uberwach / levering-up-viz-story
Avoid clutter

Gestalt principles: proximity, similarity, enclosure, closure, connection


Attention: preattentive attributes


Hands-on Introduction to 1st data science project (intermediate)


Demand forecasting


Favorita

2 week prediction


Explain pipeline


EDA


Feature Engineering


All together


SATURDAY


Keynote: Hacking the Iron Curtain

Romanian hackers and machines
Intimate relation between hardware and software


ML and populism

Tony Blair Institute
Not clear what populism is
Frequency mentions, NER
tf-idf
removing nationality mentions (stop-words)
pyLDAvis to visualize
gensim
t-SNE - visualization of high-dimensional data in 2D
For foreign languages, used google translate
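A minimal tf-idf sketch with stop-word removal, standard library only. The stop-word list and the sample documents are made up for illustration; the talk's own pipeline (gensim) is not reproduced here.

```python
import math
from collections import Counter

# Hypothetical stop-word list; nationality mentions are dropped the same way.
STOP_WORDS = {"the", "a", "of", "and", "british", "french"}

def tokenize(text):
    """Lowercase, split on whitespace, drop stop-words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tf_idf(docs):
    """Return one {term: weight} dict per document (tf * log(N / df))."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Made-up documents, for illustration only.
docs = [
    "the people against the elite",
    "the british people and the french people",
    "taxes and budget of the government",
]
scores = tf_idf(docs)
```

Terms that appear in fewer documents ("elite") get a higher weight than terms shared across documents ("people").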


Simple diagrams of CNN

Example convolutions

setosa.io/ev/image-kernels/
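A small sketch of what an image kernel does, in plain Python. The toy image and the Sobel-style kernel are illustrative choices, not taken from the talk.

```python
def convolve2d(image, kernel):
    """Valid-mode 2D filtering (cross-correlation, as CNN layers compute it)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Toy image with a vertical edge: dark left half (0), bright right half (1).
image = [[0, 0, 1, 1]] * 4
# Sobel-style kernel that responds to left-to-right brightness increases.
sobel_x = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
edges = convolve2d(image, sobel_x)
```

On a flat image the same kernel returns zeros, since the negative and positive columns cancel.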


different graphics, different style
data art

graphcore.ai: what does machine learning look like
chumo.github.io/Sinapsis


hand-made diagrams
deepsense.ai
Tensorboard is not useful
http://ethereon.github.io/netscope/quickstart
keras2ascii


Launch Jupyter in the Cloud

hotelbeds
Docker + Terraform

terraform script: install Terraform, account key Google, SSH key setup
Pull and run docker


https://github.com/Cheukting/jupyter-cloud-demo

terraform init
terraform plan - check the plan
terraform apply - upload


https://github.com/Cheukting/GCP-GPU-Jupyter


SQL like it's 1992 - James Powell

SQL is like the original big data tech
a game in SQL: player (civilization), units, spaces
queries

player
engine: compute fire orders, compute manoeuvre orders, compute build orders, ...

transactions


data modeling

players, units, system, wormholes (syst -> syst), ships (civ x unit)
"asof" property programmatically added in all tables, PostgreSQL

history.table gives state at a point in time
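A sketch of the "asof" idea: every row carries the turn at which it became valid, and a query picks the latest row per entity up to a given turn. The talk used PostgreSQL; this uses SQLite only so that it is self-contained. The table layout and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical layout: one row per ship position change, stamped with "asof".
conn.execute("CREATE TABLE ships (ship_id INTEGER, system TEXT, asof INTEGER)")
conn.executemany(
    "INSERT INTO ships VALUES (?, ?, ?)",
    [(1, "Sol", 1), (2, "Vega", 1), (1, "Vega", 3), (2, "Sirius", 5)],
)

def state_at(conn, turn):
    """Latest row per ship with asof <= turn, i.e. the state at that turn."""
    query = """
        SELECT s.ship_id, s.system
        FROM ships AS s
        WHERE s.asof = (
            SELECT MAX(h.asof) FROM ships AS h
            WHERE h.ship_id = s.ship_id AND h.asof <= ?
        )
        ORDER BY s.ship_id
    """
    return conn.execute(query, (turn,)).fetchall()
```

Since rows are only ever appended, the full history stays queryable: `state_at(conn, 1)` and `state_at(conn, 5)` read the same table.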


A/B testing at Zalando

spurious correlations

http://www.tylervigen.com/spurious-correlations


A/B tests

controlled, randomized, with correctly chosen sample size

can the interesting relation happen to be just random?


In Zalando

50+ live tests with internal tool
open-sourcing ExpAn - statistical analysis of randomised controlled trials


Example of fair coin

shows the quality of the randomization
noisy conclusions
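A fair-coin sketch of how noisy small samples are. The sample sizes and the seed are arbitrary; the z-score is the usual normal approximation for a proportion.

```python
import math
import random

def heads_share(n_flips, rng):
    """Share of heads in n_flips tosses of a fair coin."""
    return sum(rng.random() < 0.5 for _ in range(n_flips)) / n_flips

def z_score(share, n_flips, p=0.5):
    """Standardized deviation of the observed share from the expected p."""
    return (share - p) / math.sqrt(p * (1 - p) / n_flips)

rng = random.Random(42)  # arbitrary seed, only for reproducibility
for n in (10, 100, 10_000):
    share = heads_share(n, rng)
    print(f"n={n:>6}  heads={share:.3f}  z={z_score(share, n):+.2f}")
```

The same check can be applied to the split of users between test groups: a large |z| hints that the randomization itself is off.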


Are we finished? Early stopping is not an easy affair
not only statistics but also $ cost (example of 1-5 give 1$, 6 give back 6$)

vouchers, discounts, etc
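Why early stopping is hard, as a simulation sketch: A/A tests (no real difference) are checked repeatedly, and any look that crosses the significance threshold counts as a false positive. All parameters are arbitrary assumptions, and this covers only the statistics side, not the $ cost.

```python
import math
import random

def z_stat(successes_a, successes_b, n):
    """Two-proportion z statistic for equal group sizes n."""
    pooled = (successes_a + successes_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return 0.0
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    return (successes_a / n - successes_b / n) / se

def false_positive_rates(n_tests=500, n_users=1000, check_every=100,
                         rate=0.1, z_crit=1.96, seed=0):
    """Simulate A/A tests; return (rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    peeking_hits = final_hits = 0
    for _ in range(n_tests):
        a = b = 0
        rejected_at_any_look = False
        for i in range(1, n_users + 1):
            a += rng.random() < rate  # conversion in group A
            b += rng.random() < rate  # conversion in group B
            if i % check_every == 0 and abs(z_stat(a, b, i)) > z_crit:
                rejected_at_any_look = True
        peeking_hits += rejected_at_any_look
        final_hits += abs(z_stat(a, b, n_users)) > z_crit
    return peeking_hits / n_tests, final_hits / n_tests
```

With the defaults the last look coincides with the final analysis, so the peeking rate can never be lower than the single-look rate.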


Understanding Self-Attention in NLP

drop RNN and LSTM, you only need attention
NN for NLP

RNN
LSTM, also used in encoder-decoder
GRU
all for avoiding losing info, gating mechanism


Also CNN!
DeepMind scans only the interesting parts of an image
when reading, we tend to focus on specific words
attention, for simple uses too

summary wikipedia
generation of new wiki pages
QA, textual entailment, reasoning

openai language-unsupervised blog


it also helps see which words are more relevant, how the system learns
self-attention (without RNN nor CNN)

transformer novel neural network ai.googleblog
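A bare-bones sketch of scaled dot-product self-attention in plain Python. As a simplifying assumption, queries, keys and values are all the input vectors themselves; a real Transformer adds learned projections and multiple heads. The toy vectors are made up.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned weights)."""
    d = len(X[0])
    outputs, weights = [], []
    for q in X:
        # Similarity of this position to every position, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, X)) for c in range(d)])
    return outputs, weights

# Three toy 2-d "word vectors"; the first two are similar, the third is not.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(X)
```

The rows of `weights` are what gets plotted to see which words the model considers relevant to each other.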


Position
https://ricardokleinklein.github.io/2017/11/16/Attention-is-all-you-need.html
also DISAN
self-attention for relation extraction

2 entities in a sentence, find the relation reason
TACRED Dataset
self-attention + position-aware
pytorch


Keynote: Building in Privacy and Data Protection - GDPR

GDPR is not about data, it is about human beings and their rights

effects on individuals, effects on society


usually Alice->bob and Eve is the adversary: here the adversary is Bob!
imbalance in power, data protection is necessary
70 opening clauses "variables" for member states

despite the objective being real harmonization
some points abstract on purpose


security: risk of harm reputation, etc
identifiability: >=1 factor specific for physical, genetic, mental, economic, ...
more important withdrawal of consent than signup
right of information

the data you collect, but also if you get data from other sources


right of access of the data
right of rectification
keep the invoices (in DE for 10 years) but not use the data
right of data portability
the "controller": the one that determines how personal data is processed

the "processor": the one that processes the data
"the processing of personal data should be designed to serve mankind"


it's not a matrix of checks, difficult to implement
https://blog.xot.nl/2012/09/10/eight-privacy-design-strategies/
privacypatterns.org
gdpr for web developers smashing magazine
Internet PRivacy E network ... future pieces

building in security? We are not there!
demanded by controllers (?)


What about small companies or NGO forums? "There should be OS libraries... we want innovation!"


Lightning talks

5 tribes of ML explainers

featurists
speculators

check how your model changes with a variable
Interpretable ML by Molnar (free book)


localizers
convoluters
trainalyzers

with training examples


python lang weirdness, utf8 full use, annotations
conda in 5 slides
use python 3
skin penetration, bad data, don't trust it


SUNDAY


Keynote - Fairness and Diversity

women received fewer ads for high-paid jobs on Google
polarize opinions, disrupt democracy
ML learns a preexisting bias -> and as feedback emergent bias
individual and group fairness

different groups should receive similar/proportional treatment


personalization

use constrained personalization (don't allow the algorithm to go to one side end)
multi-armed bandits: exploration vs exploitation tradeoff
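The exploration vs exploitation tradeoff, sketched as an epsilon-greedy bandit. This is one simple strategy chosen for illustration, not necessarily the one from the keynote; the reward rates and parameters are made up.

```python
import random

def epsilon_greedy(true_rates, n_rounds=5000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit; return (pull counts, estimated rates)."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    pulls = [0] * n_arms
    estimates = [0.0] * n_arms
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick any arm at random
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        pulls[arm] += 1
        # Incremental mean update of the estimated reward rate.
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
    return pulls, estimates
```

Keeping `epsilon` above zero means every option keeps getting some traffic, which is one way to stop the system from collapsing onto a single side.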


Ranking

choosing best 3 vs choosing average


Industrial ML

creating a Crypto-ML startup in November
sequential models

regression
moved to NN and DNN
RNN to predict prices


in production
distributed

ML is compute heavy
horizontally
celery: producer-consumer architecture

via rabbit
http://www.celeryproject.org/
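The producer-consumer architecture in miniature, using only the standard library (`queue` + `threading`). This is not Celery code: Celery would put a broker such as RabbitMQ between producer and workers. The squaring task is a placeholder for the real computation.

```python
import queue
import threading

def run_pipeline(jobs, n_workers=3):
    """Producer puts jobs on a queue; worker threads consume and record results."""
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            job = tasks.get()
            if job is None:  # sentinel: no more work for this worker
                tasks.task_done()
                break
            with lock:
                results[job] = job * job  # stand-in for a heavy ML computation
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for job in jobs:  # producer side
        tasks.put(job)
    for _ in threads:  # one sentinel per worker
        tasks.put(None)
    for t in threads:
        t.join()
    return results
```

Scaling horizontally then means adding workers without touching the producer.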


smart data-pipelines

more need to pull data, pre and post processing, tasks coupling...
from cronjobs to dependency tasks
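What "dependency tasks" means, as a sketch: tasks form a DAG and run in an order that respects it. This uses `graphlib` from the standard library (Python 3.9+), not Airflow; the task names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dag = {
    "pull_data": set(),
    "preprocess": {"pull_data"},
    "train": {"preprocess"},
    "predict": {"train"},
    "postprocess": {"predict"},
    "report": {"postprocess", "train"},
}

def execution_order(dag):
    """Return one valid order in which every task runs after its dependencies."""
    return list(TopologicalSorter(dag).static_order())

order = execution_order(dag)
```

A cron setup encodes this order only implicitly through start times; a scheduler that knows the graph can retry or skip downstream tasks when an upstream one fails.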

airflow

better alternative than luigi (needs extra scheduler) or crons
not fully stable, making now 2.0 (there is nothing like this, angular v1 feel)
not a streaming solution
visualization

real-time visualization of jobs
scheduler view
dependency view


leverages celery
polling options


elastic devops infrastructure

docker

docker-compose
kubernetes

minikube
google cloud best for kubernetes


CloudFormation, Terraform


seldon, to control data science system

https://docs.google.com/presentation/d/1CUrELIqLqnfiA54kRqyhv0fD3qt8hsjriH9Dm0ttE0A/edit#slide=id.g3d0e1c759c_0_154


https://github.com/axsauze/crypto-ml
https://axsauze.github.io/industrial-machine-learning/#/


mobile.de production personalized web

personalization

inspiration/discovery
memory of past interactions


track event in hadoop

create user car preferences
user interactions -> for segmentation


different user intents

novice, just web browser, expert...


car buying journey
user events behaviour

duplicated views


predict how close to buy today

given your last 30 days
event counts, %views events, active days, etc...
Automatic feature selection
making windows
72% accuracy
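A sketch of the kind of windowed features listed above (event counts, share of view events, active days), computed over the last 30 days. The event format and the event type names are assumptions, not mobile.de's actual schema.

```python
from collections import Counter
from datetime import date, timedelta

def window_features(events, today, days=30):
    """Aggregate (day, event_type) pairs from the last `days` days into features."""
    start = today - timedelta(days=days)
    recent = [(d, kind) for d, kind in events if start < d <= today]
    counts = Counter(kind for _, kind in recent)
    total = len(recent)
    return {
        "n_events": total,
        "active_days": len({d for d, _ in recent}),
        "view_share": counts["view"] / total if total else 0.0,
    }

# Hypothetical user history; the last event falls outside the 30-day window.
events = [
    (date(2018, 7, 7), "view"),
    (date(2018, 7, 7), "contact"),
    (date(2018, 7, 1), "view"),
    (date(2018, 5, 1), "view"),
]
features = window_features(events, today=date(2018, 7, 8))
```

Calling it with several values of `days` gives the "making windows" variant: the same aggregates over 7, 14 and 30 days as separate features.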


predict tomorrow, in a week, etc. far lower accuracy
python & big data

started with Hive
transform, aggregate, apply
moved to Spark

pySpark
from 5-10h to 1-2h
less code lines
easier queries and logic


use of apache arrow


pyGAM: balancing interpretability

generalized additive models
pyGAM

follows the scikit-learn way


clients are dubious

how certain are you?
what happens in the worst scenario?


powerful models are black boxes
shows effects of features
shows prediction intervals
GAMs

Linear models
GAM is similar, but with sum of functions: splines + smoothing

y = f1(x1) + f2(x2, x3) + ... + c
splines

|-| degree 0
_/_  degree 1
...
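The degree 0 and degree 1 shapes above are B-spline basis functions. A sketch of the standard Cox-de Boor recursion that generates them for any degree; the knot vector is an arbitrary example, and this is not pyGAM's implementation.

```python
def bspline_basis(i, degree, knots, x):
    """Value at x of the i-th B-spline basis function (Cox-de Boor recursion)."""
    if degree == 0:
        # Degree 0: a box that is 1 on [knots[i], knots[i + 1]) and 0 elsewhere.
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left = right = 0.0
    span_left = knots[i + degree] - knots[i]
    if span_left > 0:
        left = (x - knots[i]) / span_left * bspline_basis(i, degree - 1, knots, x)
    span_right = knots[i + degree + 1] - knots[i + 1]
    if span_right > 0:
        right = ((knots[i + degree + 1] - x) / span_right
                 * bspline_basis(i + 1, degree - 1, knots, x))
    return left + right

knots = [0, 1, 2, 3, 4]  # arbitrary, evenly spaced example knots
```

Each term f(x) of a GAM can then be written as a weighted sum of such basis functions, with the weights being what gets fitted.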


smoothing

penalizes excessive wiggliness
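One common way to penalize wiggliness, sketched under the assumption of a P-spline style penalty on neighbouring spline coefficients (squared second differences). The function names and `lam` are illustrative, not pyGAM's API.

```python
def wiggliness(coefs):
    """Sum of squared second differences of neighbouring spline coefficients."""
    return sum((coefs[k] - 2 * coefs[k + 1] + coefs[k + 2]) ** 2
               for k in range(len(coefs) - 2))

def penalized_loss(residuals, coefs, lam):
    """Squared error plus lam times the wiggliness penalty."""
    return sum(r * r for r in residuals) + lam * wiggliness(coefs)
```

Coefficients on a straight line cost nothing, while a zig-zag is expensive; a larger `lam` pushes the fit towards smoother functions.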


vs. prophet?

prophet more specific to timeseries


Going full stack, Technical Readiness Level

where do we put data science?

front-end: analytics, smart ux, tracking
back-end: data driven tools
infrastructure: streaming transformations


data consumption cycle: algorithm, EDA, model build, product devel, analytics and again
full-stack data science may contain lots of things

a lot


Technology Readiness Levels (European Commission model)
Product Readiness Levels
Discovery vs delivery
Start asking: Can we solve the problem (thinking thru the whole process)?

What's the MVP?
How do we build the MVP?
How do we ship the MVP?
How do we cycle back, improving the MVP?


Failure is normal


Pandas + Arrow + Numba

pandas only supports np types

np focuses on numerical types


objects are bad (memory distribution)
ExtensionDtype
ExtensionArray
Apache Arrow

Exploit SIMD, cache...
strings, nullable int, list of X

everything nullable


still young


numba

@jit decorator
from numba import jit
@numba.jitclass{}


fletcher


Meaningful histogram with Physt

histograms

precise and compact


physt

histograms as objects

h1


import, export to json
allows spaces, masks
binary operations like +
unary operations normalization, etc
multiple dimensions
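What "histograms as objects" can look like, as a minimal sketch in plain Python. This is explicitly not the physt API; the class and method names are made up to illustrate filling, `+` and normalization.

```python
import bisect

class Histogram:
    """Minimal 1D histogram object: bin edges plus counts (not the physt API)."""

    def __init__(self, edges, counts=None):
        self.edges = list(edges)
        if counts is None:
            counts = [0] * (len(self.edges) - 1)
        self.counts = list(counts)

    def fill(self, value):
        """Add one value; values outside the edges are ignored."""
        if self.edges[0] <= value < self.edges[-1]:
            self.counts[bisect.bisect_right(self.edges, value) - 1] += 1

    def __add__(self, other):
        """Binary operation: bin-wise sum of two histograms with equal binning."""
        if self.edges != other.edges:
            raise ValueError("histograms must share the same binning")
        summed = [a + b for a, b in zip(self.counts, other.counts)]
        return Histogram(self.edges, summed)

    def normalized(self):
        """Unary operation: counts rescaled so that they sum to 1."""
        total = sum(self.counts)
        return [c / total for c in self.counts] if total else list(self.counts)
```

Because the object stores only edges and counts, it stays compact however many values were filled in, and two partial histograms can be merged with `+`.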

h2

2d heatmap


h3(df)
h()


indexing as numpy
NICE bins

"human" approximation
"integers"
"exponential"


Read only?

add more values


Easy plotting

line, scatter
show_values=True
errors=True
show_stats=True
color map cmap
lw=0 to avoid zeros


pixi.js + jupyter widgets

python visualization landscape 2017
jupyter lab
ipywidgets
ipy maps
ipyutils import SimpleShape
front app in typescript