PyData Berlin
FRIDAY
-
Text analysis
- libraries
- nltk - not as well maintained, old academic code
- spaCy - has language models
- gensim - includes a sample corpus (Lee Background)
- NLP
- stopwords, adding my own to spacy
- POS, NER, displaying
- gensim Dictionary
- topic modeling (unsupervised) vs text classification (supervised)
- text classification (Julio's way)
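A minimal sketch of the stop-word and dictionary steps above, in plain Python rather than spaCy or gensim. The stop-word list, the sample texts and the helper names (`tokenize`, `build_dictionary`, `doc2bow`) are made up for illustration; they only mimic the idea of a token-to-id dictionary and a bag-of-words representation.

```python
from collections import Counter

# Small illustrative stop-word list; real libraries ship much longer ones.
STOP_WORDS = {"the", "a", "is", "of", "and", "in"}
# Custom additions, analogous to extending a library's default list.
STOP_WORDS |= {"berlin"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (word.strip(".,;:!?") for word in text.lower().split())
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]


def build_dictionary(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id


def doc2bow(doc, token2id):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())


texts = [
    "The talk in Berlin is a tour of text analysis.",
    "Topic modeling and text classification.",
]
docs = [tokenize(t) for t in texts]
token2id = build_dictionary(docs)
bows = [doc2bow(d, token2id) for d in docs]
```

The bag-of-words lists are the kind of input a topic model or a text classifier would then consume.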
-
Storytelling and visualization
- sell your ideas
- you want to point toward some action: what's the goal?
- no powerpoint
- not defined what a data scientist is yet
- example bad graph
- Context: Audience, goals, next action
- polymorphic messages by politicians (everybody understands something extra)
- Stories, graphics and animations
- business-money, science-truth, politics-power
- 5 seconds test
- even if there is a lot of text, there was an action
- What would you like to show? -> famous decision tree
- storytelling with data - book
- best visualization -> just a number
- seaborn example gallery
- don't
- use secondary y-axis
- pie charts (and the audience fight!)
- stacked bars (only compares the first/lower part and total)
- tools
- github.com/rougier/python-visualization-landscape
- others: pygal, plotly
- uberwach / levering-up-viz-story
- Avoid clutter
- Gestalt principles: proximity, similarity, enclosure, closure, connection
- Attention: preattentive attributes
-
Hands-on Introduction to 1st data science project (intermediate)
-
- Demand forecasting
- Favorita
- 2 week prediction
-
- Explain pipeline
-
- EDA
-
- Feature Engineering
-
- All together
-
SATURDAY
-
Keynote: Hacking the Iron Curtain
- Romanian hackers and machines
- Intimate relation between hardware and software
-
ML and populism
- Tony Blair Institute
- Not clear what populism is
- Frequency mentions, NER
- tf-idf
- removing nationality mentions (stop-words)
- pyLDAvis to visualize
- gensim
- t-SNE - visualization of high-dimensional data in 2D
- For foreign languages, used Google Translate
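A rough sketch of tf-idf with nationality words treated as extra stop words. The mini corpus and the function name `tf_idf` are invented for illustration, and the formula is the plain textbook variant (term frequency times log of inverse document frequency), not necessarily the one used in the talk.

```python
import math
from collections import Counter


def tf_idf(docs, stop_words=frozenset()):
    """Return one {term: weight} dict per tokenized document.

    Uses relative term frequency times log(N / document frequency);
    libraries often apply smoothed variants of this formula.
    """
    filtered = [[t for t in doc if t not in stop_words] for doc in docs]
    n_docs = len(filtered)
    doc_freq = Counter(term for doc in filtered for term in set(doc))
    weights = []
    for doc in filtered:
        counts = Counter(doc)
        total = len(doc) or 1
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up mini corpus; the nationality words act as extra stop words.
speeches = [
    "the french people against the elite".split(),
    "the italian people want jobs".split(),
    "the elite ignore the people".split(),
]
scores = tf_idf(speeches, stop_words={"the", "french", "italian"})
```

A term that appears in every document (here "people") gets weight 0, while terms specific to one speech get the highest weights.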
-
Simple diagrams of CNN
- Example convolutions
- setosa.io/ev/image-kernels/
- different graphics, different style
- data art
- graphcore.ai - what does machine learning look like
- chumo.github.io/Sinapsis
- hand-made diagrams
- deepsense.ai
- Tensorboard is not useful
- http://ethereon.github.io/netscope/quickstart
- keras2ascii
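To make the "example convolutions" concrete, here is a small pure-Python sketch of sliding an image kernel over a tiny image, in the spirit of the setosa.io demo. The image values and the function name `convolve2d` are assumptions for illustration.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` without padding ('valid' mode).

    Like most deep-learning libraries, this computes a cross-correlation:
    the kernel is not flipped before being applied.
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            row.append(sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            ))
        output.append(row)
    return output


# 4x4 image with a vertical edge between the second and third columns.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# Sobel-style kernel that responds to left-to-right brightness changes.
sobel_x = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]
edges = convolve2d(image, sobel_x)  # [[36, 36], [36, 36]]
```

On a flat image (all pixels equal) the same kernel returns zeros everywhere, which is why it works as an edge detector.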
-
Launch Jupyter in the Cloud
- hotelbeds
- Docker + Terraform
- terraform script: install Terraform, account key Google, SSH key setup
- Pull and run docker
- https://github.com/Cheukting/jupyter-cloud-demo
- terraform init
- terraform plan - check the plan
- terraform apply - upload
- https://github.com/Cheukting/GCP-GPU-Jupyter
-
SQL like it's 1992 - James Powell
- SQL was, in a way, the original big data tech
- a game in SQL: player (civilization), units, spaces
- queries
- player
- engine: compute fire orders, compute manoeuvre orders, compute build orders, ...
- transactions
- data modeling
- players, units, systems, wormholes (system -> system), ships (civ x unit)
- "asof" property programmatically added to all tables, PostgreSQL
- history.table gives the state at a point in time
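A sketch of the "asof" idea: every row carries the time at which it became valid, and a query reconstructs the state at any point in time. The talk used PostgreSQL; this uses Python's built-in `sqlite3` as a stand-in, and the table name, columns and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ships_history (
        ship_id INTEGER NOT NULL,
        system  TEXT    NOT NULL,
        asof    INTEGER NOT NULL   -- game turn at which this row became valid
    );
    INSERT INTO ships_history VALUES
        (1, 'Sol',   1),
        (1, 'Alpha', 3),
        (2, 'Sol',   2),
        (1, 'Vega',  5);
""")


def ships_as_of(turn):
    """Return {ship_id: system} as it stood at the given turn."""
    rows = conn.execute(
        """
        SELECT h.ship_id, h.system
        FROM ships_history AS h
        WHERE h.asof = (
            SELECT MAX(h2.asof)
            FROM ships_history AS h2
            WHERE h2.ship_id = h.ship_id AND h2.asof <= ?
        )
        ORDER BY h.ship_id
        """,
        (turn,),
    ).fetchall()
    return dict(rows)


state_turn_4 = ships_as_of(4)  # {1: 'Alpha', 2: 'Sol'}
```

Rows are only ever appended, so earlier states stay queryable: `ships_as_of(1)` still shows ship 1 in 'Sol' and no ship 2 yet.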
-
A/B testing at Zalando
- spurious correlations
- A/B tests
- controlled, randomized, with correctly chosen sample size
- can the interesting relation happen to be just random?
- At Zalando
- 50+ live tests with internal tool
- open-sourcing ExpAn - statistical analysis of randomised controlled trials
- Example of fair coin
- shows the quality of the randomization
- noisy conclusions
- Are we finished? Early stopping is not an easy affair
- not only statistics but also $ cost (dice example: rolling 1-5 gives you $1, rolling a 6 means giving back $6)
- vouchers, discounts, etc
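Why early stopping is "not an easy affair" can be sketched with the fair-coin example: the coin really is fair, so every "significant" result is a false positive. The simulation below compares testing once at the end against peeking at regular intervals and stopping at the first significant look. All parameters (number of experiments, flips, look interval, the 1.96 threshold) are illustrative assumptions, not Zalando's setup.

```python
import math
import random


def z_score(heads, flips):
    """z statistic for the hypothesis that the coin is fair (p = 0.5)."""
    return (heads - 0.5 * flips) / math.sqrt(0.25 * flips)


def false_positive_rates(n_experiments=1000, n_flips=500, look_every=50,
                         z_crit=1.96, seed=0):
    """Return (rate when testing once at the end, rate when peeking)."""
    rng = random.Random(seed)
    final_only = peeking = 0
    for _ in range(n_experiments):
        heads = 0
        flagged_at_some_look = False
        for flip in range(1, n_flips + 1):
            heads += rng.random() < 0.5
            # Peek every `look_every` flips, including the very last one.
            if flip % look_every == 0 and abs(z_score(heads, flip)) > z_crit:
                flagged_at_some_look = True
        final_only += abs(z_score(heads, n_flips)) > z_crit
        peeking += flagged_at_some_look
    return final_only / n_experiments, peeking / n_experiments


final_rate, peeking_rate = false_positive_rates()
```

Because the last look coincides with the final test, the peeking rate can never be lower than the final-only rate; in practice it is typically several times higher than the nominal 5%.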
-
Understanding Self-Attention in NLP
- drop RNN and LSTM, you only need attention
- NN for NLP
- RNN
- LSTM, also used in encoder-decoder
- GRU
- all for avoiding losing info, gating mechanisms
- Also CNN!
- DeepMind scans only the interesting parts of an image
- when reading, we tend to focus on specific words
- attention, for simple uses too
- summarizing Wikipedia
- generation of new wiki pages
- QA, textual entailment, reasoning
- OpenAI language-unsupervised blog
- it also helps see which words are more relevant, how the system learns
- self-attention (without RNN nor CNN)
- Transformer: "A Novel Neural Network Architecture for Language Understanding" (ai.googleblog)
- Position
- https://ricardokleinklein.github.io/2017/11/16/Attention-is-all-you-need.html
- also DiSAN
- self-attention for relation extraction
- 2 entities in a sentence, find the relation between them
- TACRED Dataset
- self-attention + position-aware
- pytorch
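A minimal pure-Python sketch of scaled dot-product self-attention, the core operation of the Transformer. The toy token vectors are made up, and the learned query/key/value projection matrices are left out to keep it short, so this is an illustration rather than the model from the talk.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[dim] for w, v in zip(weights, values))
            for dim in range(len(values[0]))
        ])
        all_weights.append(weights)
    return outputs, all_weights


# Three toy 2-d token vectors; in self-attention Q, K and V all come from
# the same sequence (here without the learned projection matrices).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Each row of `weights` sums to 1 and shows how much one token attends to every other token, which is what makes the relevant words inspectable. Nothing in the computation depends on token order, which is why position information has to be added separately.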
-
Keynote: Building in Privacy and Data Protection - GDPR
- GDPR is not about data, it is about human beings and their rights
- effects on individuals, effects on society
- usually Alice -> Bob and Eve is the adversary: here the adversary is Bob!
- imbalance in power, data protection is necessary
- 70 opening clauses ("variables") for member states
- despite the objective being real harmonization
- some points are abstract on purpose
- security: risk of harm to reputation, etc.
- identifiability: >=1 factor specific to the physical, genetic, mental, economic, ...
- withdrawal of consent is more important than signup
- right of information
- the data you collect, but also data you get from other sources
- right of access of the data
- right of rectification
- keep the invoices (in DE for 10 years) but do not use the data
- right of data portability
- the "controller": the dev that determines how personal data is processed
- the "processor": the one that processes the data
- "the processing of personal data should be designed to serve mankind"
- it's not a matrix of checks, difficult to implement
- https://blog.xot.nl/2012/09/10/eight-privacy-design-strategies/
- privacypatterns.org
- GDPR for web developers (Smashing Magazine)
- Internet Privacy Engineering Network ... future pieces
- building in security? We are not there!
- demanded by controllers (?)
- What about small companies or NGO forums? "There should be OS libraries... we want innovation!"
-
Lightning talks
- 5 tribes of ML explainers
- featurists
- speculators
- check how your model changes with a variable
- Interpretable ML by Molnar (free book)
- localizers
- convoluters
- trainalyzers
- with training examples
- python lang weirdness, utf8 full use, annotations
- conda in 5 slides
- use python 3
- skin penetration, bad data, don't trust it
SUNDAY
-
Keynote - Fairness and Diversity
- women were shown fewer ads for high-paying jobs on Google
- polarize opinions, disrupt democracy
- ML learns a preexisting bias -> and as feedback emergent bias
- individual and group fairness
- different groups should receive similar/proportional treatment
- personalization
- use constrained personalization (don't allow the algorithm to drift to one extreme)
- multi-armed bandits: exploration vs exploitation tradeoff
- Ranking
- choosing best 3 vs choosing average
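The exploration vs exploitation tradeoff can be sketched with an epsilon-greedy bandit, one of the simplest multi-armed bandit strategies (the keynote did not necessarily use this one). The click rates, `epsilon` and function name are illustrative assumptions.

```python
import random


def epsilon_greedy(true_rates, n_rounds=5000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit over Bernoulli arms.

    With probability `epsilon` a random arm is explored; otherwise the arm
    with the best observed mean reward so far is exploited.
    """
    rng = random.Random(seed)
    pulls = [0] * len(true_rates)
    rewards = [0.0] * len(true_rates)
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(len(true_rates))
        else:
            # Untried arms get an infinite mean so each is tried at least once.
            means = [r / p if p else float("inf") for r, p in zip(rewards, pulls)]
            arm = means.index(max(means))
        pulls[arm] += 1
        rewards[arm] += rng.random() < true_rates[arm]
    return pulls


# Hypothetical click-through rates for three content variants.
pulls = epsilon_greedy([0.02, 0.05, 0.10])
```

Keeping `epsilon` above zero is one simple way to stop the algorithm from committing entirely to a single option, which loosely mirrors the constrained-personalization point above.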
-
Industrial ML
- creating a Crypto-ML startup in November
- sequential models
- regression
- moved to NN and DNN
- RNN to predict prices
- in production
- distributed
- ML is compute heavy
- horizontally
- celery: producer-consumer architecture
- via RabbitMQ
- http://www.celeryproject.org/
- smart data-pipelines
- more need to pull data, pre and post processing, tasks coupling...
- from cronjobs to dependency tasks
- airflow
- better alternative than Luigi (needs an extra scheduler) or crons
- not fully stable, 2.0 in the making (there is nothing like this, Angular v1 feel)
- not a streaming solution
- visualization
- real-time visualization of jobs
- scheduler view
- dependency view
- leverages celery
- polling options
- elastic devops infrastructure
- docker
- docker-compose
- kubernetes
- minikube
- google cloud best for kubernetes
- CloudFormation, Terraform
- seldon, to control data science system
- https://github.com/axsauze/crypto-ml
- https://axsauze.github.io/industrial-machine-learning/#/
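The step "from cronjobs to dependency tasks" can be illustrated with the standard library's `graphlib`: tasks declare what they depend on, and a scheduler derives a valid execution order. This is not Airflow's API, only a sketch of the underlying idea; the task names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "pull_prices": set(),
    "preprocess": {"pull_prices"},
    "train_model": {"preprocess"},
    "predict": {"train_model", "preprocess"},
    "postprocess": {"predict"},
}


def run_pipeline(deps, run_task):
    """Run every task once, only after all of its dependencies finished."""
    order = []
    for task in TopologicalSorter(deps).static_order():
        run_task(task)
        order.append(task)
    return order


# A no-op stands in for the real work of each task.
executed = run_pipeline(dependencies, run_task=lambda name: None)
```

With cron, the same ordering would have to be encoded by hand through start times; with declared dependencies, a cycle is detected up front (`graphlib` raises `CycleError`).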
-
mobile.de production personalized web
- personalization
- inspiration/discovery
- memory of past interactions
- track events in Hadoop
- create user car preferences
- user interactions -> for segmentation
- different user intents
- novice, just browsing the web, expert...
- car buying journey
- user events behaviour
- duplicated views
- predict how close the user is to buying today
- given your last 30 days
- event counts, %views events, active days, etc...
- Automatic feature selection
- making windows
- 72% accuracy
- predict tomorrow, in a week, etc. far lower accuracy
- python & big data
- started with Hive
- transform, aggregate, apply
- moved to Spark
- pySpark
- from 5-10h to 1-2h
- less code lines
- easier queries and logic
- use of Apache Arrow
-
pyGAM: balancing interpretability
- generalized additive models
- pyGAM
- follows the scikit-learn way
- clients are dubious
- how certain are you?
- what happens in the worst scenario?
- powerful models are black boxes
- shows effects of features
- shows prediction intervals
- GAMs
- Linear models
- GAM is similar, but with a sum of functions: splines + smoothing
- y = f1(x1) + f2(x2, x3) + ... + c
- splines
- |-| degree 0
- _/_ degree 1
- ...
- smoothing
- penalizes excessive wiggliness
- vs. Prophet?
- Prophet is more specific to time series
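The degree-0 and degree-1 splines drawn above (box and hat shapes) can be sketched with the Cox-de Boor recursion for B-spline basis functions. The knot positions and the function name are assumptions for illustration; this is not pyGAM's code.

```python
def bspline_basis(i, degree, knots, x):
    """Value at x of the i-th B-spline basis function (Cox-de Boor recursion)."""
    if degree == 0:
        # Degree 0: a box that is 1 on [knots[i], knots[i + 1]) and 0 elsewhere.
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left_span = knots[i + degree] - knots[i]
    right_span = knots[i + degree + 1] - knots[i + 1]
    left = right = 0.0
    if left_span > 0:
        left = (x - knots[i]) / left_span * bspline_basis(i, degree - 1, knots, x)
    if right_span > 0:
        right = ((knots[i + degree + 1] - x) / right_span
                 * bspline_basis(i + 1, degree - 1, knots, x))
    return left + right


knots = [0.0, 1.0, 2.0, 3.0]
box_inside = bspline_basis(0, 0, knots, 0.5)   # 1.0: inside the first box
hat_peak = bspline_basis(0, 1, knots, 1.0)     # 1.0: peak of the first hat
hat_halfway = bspline_basis(0, 1, knots, 0.5)  # 0.5: halfway up the rising edge
```

Each feature function in a GAM is then a weighted sum of such basis functions, and the smoothing penalty discourages weights that make the resulting curve too wiggly.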
-
Going full stack, Technical Readiness Level
- where do we put data science in?
- front-end: analytics, smart ux, tracking
- back-end: data driven tools
- infrastructure: streaming transformations
- data consumption cycle: algorithm, EDA, model build, product devel, analytics and again
- full-stack data science may contain lots of things
- a lot
- Technology Readiness Levels (European Commission model)
- Product Readiness Levels
- Discovery vs delivery
- Start asking: Can we solve the problem (thinking through the whole process)?
- What's the MVP?
- How do we build the MVP?
- How do we ship the MVP?
- How do we cycle back, improving the MVP?
- Failure is normal
-
Pandas + Arrow + Numba
- pandas only supports NumPy types
- NumPy focuses on numerical types
- objects are bad (memory layout)
- ExtensionDtype
- ExtensionArray
- Apache Arrow
- Exploit SIMD, cache...
- strings, nullable int, list of X
- everything nullable
- still young
- numba
- @jit decorator
- from numba import jit
- @numba.jitclass{}
- fletcher
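A small sketch of the `@jit` decorator mentioned above. The function is an invented example; the `try`/`except` fallback is an addition so the snippet still runs where numba is not installed (in that case the loop simply runs as ordinary Python).

```python
try:
    from numba import jit
except ImportError:
    # Fallback so the sketch still runs without numba: a no-op decorator
    # that accepts the same call style as `jit(nopython=True)`.
    def jit(*args, **kwargs):
        def decorator(func):
            return func
        return decorator


@jit(nopython=True)
def sum_of_squares(n):
    """Plain Python loop; numba compiles it to machine code on first call."""
    total = 0
    for i in range(n):
        total += i * i
    return total


result = sum_of_squares(1000)  # 332833500
```

The first call pays the compilation cost; later calls with the same argument types reuse the compiled version.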
-
Meaningful histograms with physt
- histograms
- precise and compact
- physt
- histograms as objects
- h1
- import, export to json
- allows spaces, masks
- binary operations like +
- unary operations normalization, etc
- multiple dimensions
- h2
- 2d heatmap
- h3(df)
- h()
- indexing as numpy
- NICE bins
- "human" approximation
- "integers"
- "exponential"
- Read only?
- add more values
- Easy plotting
- line, scatter
- show_values=True
- errors=True
- show_stats=True
- color map cmap
- lw=0 to avoid zeros
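The "histograms as objects" idea (filling, binary operations like `+`, normalization) can be sketched with a tiny class. `Histogram1D` and its methods are hypothetical and only illustrate the concept; they are not physt's API.

```python
from bisect import bisect_right


class Histogram1D:
    """Minimal histogram-as-object sketch (hypothetical, not physt's API)."""

    def __init__(self, edges, counts=None):
        self.edges = list(edges)
        if counts is None:
            counts = [0] * (len(self.edges) - 1)
        self.counts = list(counts)

    def fill(self, value):
        """Add one value; values outside the bin range are ignored."""
        if self.edges[0] <= value < self.edges[-1]:
            self.counts[bisect_right(self.edges, value) - 1] += 1

    def __add__(self, other):
        """Bin-wise sum; only defined for histograms with identical bins."""
        if self.edges != other.edges:
            raise ValueError("histograms must share the same bin edges")
        summed = [a + b for a, b in zip(self.counts, other.counts)]
        return Histogram1D(self.edges, summed)

    def normalized(self):
        """Return a copy whose counts sum to 1 (unchanged if empty)."""
        total = sum(self.counts)
        scaled = [c / total for c in self.counts] if total else self.counts
        return Histogram1D(self.edges, scaled)


first = Histogram1D([0, 10, 20, 30])
second = Histogram1D([0, 10, 20, 30])
for value in (1, 5, 12, 29):
    first.fill(value)
for value in (15, 18, 42):  # 42 falls outside the bins and is dropped
    second.fill(value)
combined = first + second   # counts: [2, 3, 1]
```

Because the bins travel with the counts, two histograms can be added or normalized without going back to the raw data.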
-
pixi.js + jupyter widgets
- python visualization landscape 2017
- jupyter lab
- ipywidgets
- ipy maps
- ipyutils import SimpleShape
- front app in typescript
Sorry for the confusion, the plotting part about physt went by rather fast...