martin-kokos/pydata_berlin_2018.md

## pydata_berlin_2018.md

      
Populism & ML (political sciences)


scholars do not agree on a definition of populism
scrape manifestos and speeches, have experts label them
correlate populism with time in office and exit strategies
don't evaluate countries below the democratic threshold (outliers)
"Populists in Europe"
used: TF-IDF, scattertext, gensim, pyLDA (probably pyLDAvis)
cat-and-mouse problem: populists can avoid looking like populists
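
To make the TF-IDF step concrete, here is a minimal pure-Python sketch. It is not the speakers' pipeline: the function name `tf_idf`, the whitespace tokenizer and the three toy "speeches" are all assumptions made for illustration. It shows how words shared by every document get weight zero while distinctive words score higher.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # term frequency * inverse document frequency
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus, invented for this example.
speeches = [
    "the people against the elite",
    "the budget for the next year",
    "the elite ignores the people",
]
scores = tf_idf(speeches)
```

Here "the" occurs in every toy speech, so its weight is zero, while "budget" occurs in only one and gets a positive weight. A real project would more likely use a library vectorizer with proper tokenization and smoothing.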

Smart contracts (Ethereum)


Populus: Python library for unit testing smart contracts written in Solidity
remix.ethereum.org - Solidity IDE
metamask.io - dApps in browser

Visualising CNN


https://github.com/chumo

NLP and psychology


bunch.ai - Culture analytics

ctparse


natural time representation parser

spaCy & Prodigy (NLP)

Maximizing failure probability


underthink
overexpect
outsource
wire all together

spaCy features


generic entity types
Prodigy annotation (and modeling) tool
"time to first evidence" concept

mobile.de (eBay) personalized recommendations


Bayes approach
user segmentation

buying journey (market)


new contacts -> watchdogs
watchdogs -> buyers
related talk from last year (PyData Berlin 2017): https://www.youtube.com/watch?v=v7MBunqwBSY, slides: https://www.slideshare.net/FlorianWilhelm2/which-car-fits-my-life-pydata-berlin-2017

Apache Arrow (data pipes)


Spark > Apache Hive
Arrow eliminates data conversion by providing a shared data structure with language bindings
claimed 200x speed-up when piping data
data pipelines blog https://www.inovex.de/blog/

Spark, Beam, TF (data pipes)


feature preparation pipeline: TFX, Kubeflow, TF.Transform
Dask is the better choice if one uses only Python

Going Full-stack (product management)


no such thing as full-stack
only marginalizing/compromising technologies (DB design, security, etc)
SRE (Site Reliability Engineer)
Product Readiness Level (from NASA's TRL)
https://pydata.org/berlin2018/schedule/presentation/16/

Data systems performance (technical progress)


delta encoding, etc.
be clear when communicating, no buzzwords
define what is desired, e.g. clustering by attributes
Daimond.ai
prof. Jens Dittrich on YouTube
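
As a reminder of what delta encoding does, here is a minimal sketch. It is not taken from the talk: the function names `delta_encode` and `delta_decode` and the sample timestamps are assumptions. The idea is to store the first value and then only the differences between neighbours, which are usually small for sorted or slowly changing columns.

```python
from itertools import accumulate


def delta_encode(values):
    """Keep the first value, then store each difference to the previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values with a running sum over the deltas."""
    return list(accumulate(deltas))


# Invented sample: monotonically increasing timestamps.
timestamps = [1000, 1003, 1004, 1010]
encoded = delta_encode(timestamps)
decoded = delta_decode(encoded)
```

The small deltas can then be packed into fewer bits than the original values, which is where the space saving comes from.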

Extending pandas (data handling)


df.info()

0.23+


ExtensionDtype
ExtensionArray

Apache Arrow


user defined functions
more native types
efficient memory use and I/O

Numba


accelerates for-loops with just a decorator
jitclass for data store
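
A minimal sketch of the decorator approach, assuming Numba is installed. The function `sum_of_squares` is a made-up example, and the `ImportError` fallback is only there so the snippet still runs (uncompiled) without Numba.

```python
try:
    from numba import njit
except ImportError:
    # Assumption for illustration: without Numba, fall back to a no-op decorator.
    def njit(func):
        return func


@njit
def sum_of_squares(n):
    """Plain Python loop; Numba compiles it to machine code on first call."""
    total = 0.0
    for i in range(n):
        total += i * i
    return total


result = sum_of_squares(4)  # 0 + 1 + 4 + 9
```

The first call includes compilation time, so any speed-up only shows on later calls or on large `n`.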

Archer


custom data types to avoid py objects
similar: cyberpandas, geopandas

Rasa workshop (chatbot)


Open Source AI conversational framework
very nice
just see the presentation

Multi-armed Bandits workshop


https://github.com/kraktos/MAB/blob/master/ma_bandit.ipynb
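
For orientation, here is a minimal epsilon-greedy sketch, one common bandit strategy. It is not copied from the linked notebook and may differ from what the workshop covered. The function name `epsilon_greedy`, the Bernoulli reward model and all parameter values are assumptions.

```python
import random


def epsilon_greedy(true_means, n_rounds=5000, epsilon=0.1, seed=0):
    """Simulate Bernoulli arms; return pull counts and estimated mean rewards."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    pulls = [0] * n_arms
    estimates = [0.0] * n_arms
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            # Explore: pick a random arm.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the best estimate so far.
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        pulls[arm] += 1
        # Incremental update of the running mean reward for this arm.
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
    return pulls, estimates


# Hypothetical payout probabilities for three arms.
pulls, estimates = epsilon_greedy([0.2, 0.5, 0.8])
```

With these settings most pulls should go to the best arm, while the `epsilon` share of random pulls keeps the estimates of the other arms from going stale.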