@liopic
Last active July 14, 2018 06:31
PyDataBerlin 2018

PyData Berlin

FRIDAY

  • Text analysis

    • libraries
      • nltk - not as well maintained, old academic code
      • spacy - has language models
      • gensim included corpus - Lee Background
      • NLP
      • stopwords, adding my own to spacy
      • POS, NER, displaying
    • gensim. Dictionary
    • topic modeling (unsuperv) vs text classification (superv)
    • text classification (julios way)
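The role of a dictionary in this workflow can be sketched in plain Python: it maps each token to an integer id and turns a tokenized document into (id, count) pairs. This is only an illustration of the idea, not gensim's code. The helper names `build_vocab` and `doc2bow` and the toy documents are assumptions, and ids are assigned in first-seen order for simplicity. In gensim itself the corresponding pieces are `gensim.corpora.Dictionary` and its `doc2bow` method.

```python
def build_vocab(docs):
    """Map every distinct token to an integer id (first-seen order)."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id


def doc2bow(token2id, doc):
    """Bag-of-words: sorted (token_id, count) pairs; unknown tokens are skipped."""
    counts = {}
    for tok in doc:
        if tok in token2id:
            tid = token2id[tok]
            counts[tid] = counts.get(tid, 0) + 1
    return sorted(counts.items())


# Hypothetical toy corpus, already tokenized.
docs = [["data", "science", "data"], ["science", "talk"]]
vocab = build_vocab(docs)
bow = doc2bow(vocab, docs[0])
```

A bag-of-words corpus like this is the usual input for both topic modeling and text classification.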
  • Storytelling and visualization

    • sell your ideas
      • you want to push for some action: what's the goal?
    • no powerpoint
    • not defined what a data scientist is yet
    • example bad graph
    • Context: Audience, goals, next action
      • polymorphic messages by politicians (everybody understands something extra)
      • Stories, graphics and animations
      • business-money, science-truth, politics-power
    • 5 seconds test
      • even if there is a lot of text, there was an action
    • What would you like to show? -> famous decision tree
    • storytelling with data - book
    • best visualization -> just a number
    • seaborn example gallery
    • don't
      • use secondary y-axis
      • pie charts (and the audience fight!)
      • stacked bars (only compares the first/lower part and total)
    • tools
      • github.com/rougier/python-visualization-landscape
      • others: pygal, plotly
    • uberwach / levering-up-viz-story
    • Avoid clutter
      • Gestalt principles: proximity, similarity, enclosure, closure, connection
    • Attention: preattentive attributes
  • Hands-on Introduction to 1st data science project (intermediate)

      1. Demand forecasting
        • Favorita
          • 2 week prediction
      2. Explain pipeline
      3. EDA
      4. Feature Engineering
      5. All together

SATURDAY

  • Keynote: Hacking the Iron Curtain

    • Romanian hackers and machines
    • Intimate relation between hardware and software
  • ML and populism

    • Tony Blair Institute
    • Not clear what populism is
    • Frequency mentions, NER
    • tf-idf
    • removing nationality mentions (stop-words)
    • pyLDAvis to visualize
    • gensim
    • t-SNE - visualization of high-dimensional data in 2D
    • For foreign languages, used google translate
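As a rough illustration of the tf-idf and stop-word steps noted above, here is a minimal sketch in plain Python. The corpus, the stop-word list and the function name `tf_idf` are made up for the example, and the weighting (term frequency times log inverse document frequency) is one common variant, not necessarily the one used in the talk.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus; a custom stop-word list is removed first.
stop_words = {"the", "of"}
raw = [
    "the people against the elite",
    "the budget of the ministry",
    "the people of the nation",
]
docs = [[w for w in text.split() if w not in stop_words] for text in raw]
scores = tf_idf(docs)
```

Terms that occur in fewer documents ("elite") get a higher weight than terms shared across documents ("people"). This is also why very frequent words, such as nationality mentions, are worth filtering out beforehand.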
  • Simple diagrams of CNN

    • Example convolutions
      • setosa.io/ev/image-kernels/
    • different graphics, different style
    • data art
      • graphcore.ai - what does machine learning look like
      • chumo.github.io/Sinapsis
    • hand-made diagrams
    • deepsense.ai
    • Tensorboard is not useful
    • http://ethereon.github.io/netscope/quickstart
    • keras2ascii
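The kind of image-kernel example shown on setosa.io can be sketched in a few lines of plain Python. The toy image and the Sobel-like kernel below are assumptions chosen for illustration. The function slides the kernel over the image without flipping it, which is the cross-correlation that CNN layers usually call "convolution".

```python
def convolve2d(image, kernel):
    """Valid-mode 2D sliding-window filter (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the kh x kw window whose top-left corner is (i, j).
            row.append(sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw)
            ))
        out.append(row)
    return out


# Toy 4x4 image: dark left half (0), bright right half (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Sobel-like kernel that responds to vertical edges.
kernel = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]
edges = convolve2d(image, kernel)
```

Every 3x3 window of this image straddles the dark-to-bright edge, so every output value is positive. On a uniformly colored image the same kernel returns zeros.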
  • Launch Jupyter in the Cloud

  • SQL like it's 1992 - James Powell

    • SQL is like the original big data tech
    • a game in SQL: player (civilization), units, spaces
    • queries
      • player
      • engine: compute fire orders, compute manoeuvre orders, compute build orders, ...
        • transactions
    • data modeling
      • players, units, system, wormholes (syst -> syst), ships (civ x unit)
      • "asof" property programmatically added in all tables, PostgreSQL
        • history.table gives state at a point of time
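The "asof" idea can be sketched as follows: every row carries the point in time at which it became valid, and the state at turn T is the latest row per entity with asof <= T. The talk used PostgreSQL. This sketch uses Python's built-in sqlite3 as a stand-in, and the table name, columns and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ships_history (
        ship_id INTEGER,
        system  TEXT,
        asof    INTEGER   -- game turn at which this row became valid
    );
    INSERT INTO ships_history VALUES
        (1, 'Sol',    1),
        (1, 'Vega',   3),
        (2, 'Sol',    2),
        (1, 'Sirius', 5);
""")


def state_asof(conn, turn):
    """Latest known row per ship at or before the given turn."""
    return conn.execute("""
        SELECT h.ship_id, h.system
        FROM ships_history AS h
        WHERE h.asof = (
            SELECT MAX(i.asof)
            FROM ships_history AS i
            WHERE i.ship_id = h.ship_id AND i.asof <= ?
        )
        ORDER BY h.ship_id
    """, (turn,)).fetchall()


snapshot = state_asof(conn, 4)
```

Because rows are only ever appended, the same query with a different turn number replays the game state at any point in time.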
  • A/B testing at Zalando

    • spurious correlations
    • A/B tests
      • controlled, randomized, with correctly chosen sample size
        • can the interesting relation happen to be just random?
    • In Zalando
      • 50+ live tests with internal tool
      • opensourcing expan - stats analysis of randomised control trials
    • Example of fair coin
      • shows the quality of the randomization
      • noisy conclusions
    • Are we finished? Early stopping is not an easy affair
    • not only statistics but also $ cost (example: rolling 1-5 gives $1, rolling 6 gives back $6)
      • vouchers, discounts, etc
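Why early stopping is "not an easy affair" can be illustrated with a small simulation on a fair coin, where any "significant" result is a false positive by construction. This is a sketch under assumed settings (1000 flips per experiment, a look every 50 flips, a two-sided z-test at about the 5% level). It is not the procedure used at Zalando or in expan.

```python
import math
import random


def significant(heads, n, z_crit=1.96):
    """Two-sided z-test of 'the coin is fair' at roughly the 5% level."""
    z = (heads - n / 2) / math.sqrt(n / 4)
    return abs(z) > z_crit


def simulate(n_experiments=1000, n_flips=1000, look_every=50, seed=0):
    rng = random.Random(seed)
    flagged_any_look = 0  # "significant" at some look, including the last
    flagged_at_end = 0    # "significant" at the single planned test
    for _ in range(n_experiments):
        heads = 0
        any_look = False
        for n in range(1, n_flips + 1):
            heads += rng.random() < 0.5
            if n % look_every == 0 and significant(heads, n):
                any_look = True
        flagged_any_look += any_look
        flagged_at_end += significant(heads, n_flips)
    return flagged_any_look / n_experiments, flagged_at_end / n_experiments


peeking_rate, fixed_rate = simulate()
```

Stopping at the first significant look flags far more fair coins than the single test at the planned sample size, which is why the sample size should be chosen up front.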
  • Understanding Self-Attention in NLP

    • drop RNN and LSTM, you only need attention
    • NN for NLP
      • RNN
      • LSTM, also used in encoder-decoder
      • GRU
      • all for avoiding losing info, gating mechanisms
    • Also CNN!
    • DeepMind scans only the interesting parts of an image
    • when reading, we tend to focus on specific words
    • attention, for simple uses too
      • summary wikipedia
      • generation of new wiki pages
      • QA, textual entailment, reasoning
        • openai language-unsupervised blog
    • it also helps see which words are more relevant, how the system learns
    • self-attention (without RNN nor CNN)
      • Transformer: a novel neural network architecture (ai.googleblog)
    • Position
    • https://ricardokleinklein.github.io/2017/11/16/Attention-is-all-you-need.html
    • also DiSAN
    • self-attention for relation extraction
      • 2 entities in a sentence, find the relation reason
      • TACRED Dataset
      • self-attention + position-aware
      • pytorch
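The core of self-attention can be sketched without any framework. Each token scores every token in the sentence, including itself, with a scaled dot product. The scores are normalized with a softmax, and the output is the weighted average of the token vectors. For brevity this sketch assumes Q = K = V = x, meaning no learned projections, a single head and no positional encoding. The toy vectors are made up.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x."""
    d = len(x[0])
    outputs, weights = [], []
    for q in x:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        weights.append(w)
        # Output: attention-weighted average of all token vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, x)) for c in range(d)])
    return outputs, weights


# Three hypothetical 2-dimensional token embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = self_attention(tokens)
```

The rows of `attn` are the weights one can inspect to see which words the model considers relevant for each position. Because nothing in this computation depends on token order, position information has to be added separately.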
  • Keynote: Building in Privacy and Data Protection - GDPR

    • GDPR is not about data, it is about human beings and their rights
      • effects on individuals, effects on society
    • usually Alice->Bob and Eve is the adversary: here the adversary is Bob!
    • imbalance in power, data protection is necessary
    • 70 opening clauses "variables" for member states
      • even though the objective is real harmonization
      • some points abstract on purpose
    • security: risk of harm to reputation, etc
    • identifiability: >=1 factor specific for physical, genetic, mental, economic, ...
    • withdrawal of consent is more important than signup
    • right of information
      • the data you collect, but also if you get data from other sources
    • right of access of the data
    • right of rectification
    • keep the invoices (in DE for 10 years) but do not use the data
    • right of data portability
    • the "controller": the one that determines how personal data is processed
      • the "processor": the one that processes the data
      • "the processing of personal data should be designed to serve mankind"
    • it's not a matrix of checks, difficult to implement
    • https://blog.xot.nl/2012/09/10/eight-privacy-design-strategies/
    • privacypatterns.org
    • gdpr for web developers smashing magazine
    • Internet Privacy E network ... future pieces
      • building in security? We are not there!
      • demanded by controllers (?)
    • What about small companies or NGO forums? "There should be OS libraries... we want innovation!"
  • Lightning talks

    • 5 tribes of ML explainers
      • featurists
      • speculators
        • check how your model changes with a variable
        • Interpretable ML by Molnar (free book)
      • localizers
      • convoluters
      • trainalyzers
        • with training examples
    • python lang weirdness, utf8 full use, annotations
    • conda in 5 slides
    • use python 3
    • skin penetration, bad data, don't trust it

SUNDAY

  • Keynote - Fairness and Diversity

    • women received fewer high-paying job ads on google
    • polarize opinions, disrupt democracy
    • ML learns a preexisting bias -> and as feedback emergent bias
    • individual and group fairness
      • different groups should receive similar/proportional treatment
    • personalization
      • use constrained personalization (don't allow the algorithm to go to one extreme)
      • multi-armed bandits: exploration vs exploitation tradeoff
    • Ranking
      • choosing best 3 vs choosing average
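The exploration vs exploitation tradeoff can be sketched with an epsilon-greedy bandit, one of the simplest strategies. With probability epsilon it shows a random option (exploration), otherwise the option with the best observed rate so far (exploitation). The click rates, the epsilon value and the function name are assumptions for illustration. The keynote did not prescribe this particular algorithm.

```python
import random


def epsilon_greedy(true_rates, n_rounds=5000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit; return how often each arm was pulled."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    pulls = [0] * n_arms
    rewards = [0.0] * n_arms
    for _ in range(n_rounds):
        if rng.random() < epsilon or 0 in pulls:
            # Explore: random arm (also used until every arm has been tried once).
            arm = rng.randrange(n_arms)
        else:
            # Exploit: arm with the best observed average reward so far.
            arm = max(range(n_arms), key=lambda a: rewards[a] / pulls[a])
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        pulls[arm] += 1
        rewards[arm] += reward
    return pulls


# Hypothetical click-through rates of three content options.
pulls = epsilon_greedy([0.2, 0.5, 0.8])
```

The fixed exploration share acts as a simple constraint: no option is ever starved completely, which keeps the personalization from collapsing onto a single kind of content.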
  • Industrial ML

  • mobile.de production personalized web

    • personalization
      • inspiration/discovery
      • memory of past interactions
    • track event in hadoop
      • create user car preferences
      • user interactions -> for segmentation
    • different user intents
      • novice, just web browser, expert...
    • car buying journey
    • user events behaviour
      • duplicated views
    • predict how close the user is to buying today
      • given your last 30 days
      • event counts, %views events, active days, etc...
      • Automatic feature selection
      • making windows
      • 72% accuracy
    • predict tomorrow, in a week, etc. far lower accuracy
    • python & big data
      • started with Hive
      • transform, aggregate, apply
      • moved to Spark
        • pySpark
        • from 5-10h to 1-2h
        • less code lines
        • easier queries and logic
      • use of apache arrow
  • pyGAM: balancing interpretability

    • generalized additive models
    • pyGAM
      • follows scikitlearn way
    • clients are dubious
      • how certain are you?
      • what happens in the worst scenario?
    • powerful models are black boxes
    • shows effects of features
    • shows prediction intervals
    • GAMs
      • Linear models
      • GAM is similar, but with sum of functions: splines + smoothing
        • y= f1(x1) + f2(x2,x3) + ... + c
        • splines
          • |-| degree 0
          • _/_ degree 1
          • ...
        • smoothing
          • penalizes excessive wiggliness
      • vs. Prophet?
        • prophet more specific to timeseries
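The building blocks listed above (spline basis functions of degree 0 and 1, a sum of per-feature functions, and a wiggliness penalty) can be sketched in plain Python. The basis is computed with the standard Cox-de Boor recursion. The knots, the coefficients and the squared-second-difference penalty are assumptions for illustration and are not taken from pyGAM's internals.

```python
def bspline_basis(i, degree, knots, x):
    """Cox-de Boor recursion: value of the i-th B-spline basis function at x."""
    if degree == 0:
        # Degree 0: a box that is 1 on [knots[i], knots[i+1]) and 0 elsewhere.
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left_den = knots[i + degree] - knots[i]
    right_den = knots[i + degree + 1] - knots[i + 1]
    left = 0.0 if left_den == 0 else (
        (x - knots[i]) / left_den * bspline_basis(i, degree - 1, knots, x))
    right = 0.0 if right_den == 0 else (
        (knots[i + degree + 1] - x) / right_den
        * bspline_basis(i + 1, degree - 1, knots, x))
    return left + right


def spline_term(coefs, degree, knots, x):
    """One smooth term f(x): a weighted sum of basis functions."""
    return sum(c * bspline_basis(i, degree, knots, x) for i, c in enumerate(coefs))


def wiggliness_penalty(coefs):
    """Sum of squared second differences of neighbouring coefficients."""
    return sum((coefs[k] - 2 * coefs[k + 1] + coefs[k + 2]) ** 2
               for k in range(len(coefs) - 2))


knots = [0.0, 1.0, 2.0, 3.0]
# Additive model y = f1(x1) + f2(x2) + c with made-up coefficients.
f1 = spline_term([2.0, 4.0], 1, knots, 1.5)       # degree 1 "hat" functions
f2 = spline_term([1.0, 1.0, 1.0], 0, knots, 0.5)  # degree 0 boxes
y = f1 + f2 + 0.5
```

Each term can be plotted on its own, which is what makes the effect of a single feature easy to inspect. Adding the penalty to the fitting loss discourages coefficient patterns that zig-zag between neighbouring basis functions.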
  • Going full stack, Technical Readiness Level

    • where do we put data science?
      • front-end: analytics, smart ux, tracking
      • back-end: data driven tools
      • infrastructure: streaming transformations
    • data consumption cycle: algorithm, EDA, model build, product devel, analytics and again
    • full-stack data science may contain lots of things
      • a lot
    • Technology Readiness Levels (European Commission model)
    • Product Readiness Levels
    • Discovery vs delivery
    • Start asking: Can we solve the problem (thinking thru the whole process)?
      • What's the MVP?
      • How do we build the MVP?
      • How do we ship the MVP?
      • How do we cycle back, improving the MVP?
    • Failure is normal
  • Pandas + Arrow + Numba

    • pandas only supports np types
      • np focuses on numerical types
    • objects are bad (memory distribution)
    • ExtensionDtype
    • ExtensionArray
    • Apache Arrow
      • Exploit SIMD, cache...
      • strings, nullable int, list of X
        • everything nullable
      • still young
    • numba
      • @jit decorator
      • from numba import jit
      • @numba.jitclass(spec)
    • fletcher
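A minimal sketch of the `@jit` decorator mentioned above: numba compiles the decorated function to machine code on first call, which pays off mostly for explicit numeric loops. The function `sum_of_squares` is a made-up example. The `try`/`except` fallback is an addition so that the snippet still runs, uncompiled, where numba is not installed.

```python
try:
    from numba import jit
except ImportError:
    # Fallback (assumption for this sketch): a no-op decorator with the same call shape.
    def jit(*args, **kwargs):
        return lambda func: func


@jit(nopython=True)
def sum_of_squares(n):
    """Sum of i*i for i in range(n), written as an explicit loop."""
    total = 0.0
    for i in range(n):
        total += i * i
    return total


result = sum_of_squares(4)
```

The first call includes compilation time. Later calls with the same argument types reuse the compiled version.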
  • Meaningful histogram with Physt

    • histograms
      • precise and compact
    • physt
      • histograms as objects
        • h1
      • import, export to json
      • allows spaces, masks
      • binary operations like +
      • unary operations normalization, etc
      • multiple dimensions
        • h2
          • 2d heatmap
        • h3(df)
        • h()
      • indexing as numpy
      • NICE bins
        • "human" approximation
        • "integers"
        • "exponential"
      • Read only?
        • add more values
      • Easy plotting
        • line, scatter
        • show_values=True
        • errors=True
        • show_stats=True
        • color map cmap
        • lw=0 to hide lines (zero width); show_zero=False to avoid zeros
  • pixi.js + jupyter widgets

    • python visualization landscape 2017
    • jupyter lab
    • ipywidgets
    • ipy maps
    • ipyutils import SimpleShape
    • front app in typescript
@janpipek

janpipek commented Jul 9, 2018

Sorry for confusion, the plotting part about physt was rather fast...

  • lw=0 is just a parameter (passed to matplotlib) to hide lines (make them of zero width)
  • show_zero=False is the thing you are aiming at :-)
