@sdoering
Last active January 17, 2018 16:20
My notes on the PyData Berlin 2017 conference

PyData Berlin 2017

Automotive (Recommendation, Search, Active Learning)

On Bandits, Bayes, and swipes: gamification of search

  • Speaker: Stefan Otte
  • Video: YouTube

Quote

"With the example of a Tinder-like app in the automotive industry we see how to use active learning to work with small data. Active learning is an underappreciated subfield of ML where the algorithm actively gathers labeled data, e.g. it can query the user for the most informative data."

Notes

Very interesting talk showing how to build an alternative approach to search as well as a "need analyser". Also imaginable for fashion and other consumer goods where the search space might be unreasonably big, and a possible way to battle the "cold start problem" of recommendation algorithms.
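The core loop of the talk, querying the most informative item and updating a preference model from each swipe, can be sketched in a few lines. Everything below (the 3-dimensional features, the perceptron-style update, the simulated swipes) is an illustrative assumption, not the speaker's implementation:

```python
import random

random.seed(0)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical catalogue: each car is a small feature vector; the user's
# taste is a hidden weight vector we only observe through swipes.
cars = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(50)]
true_w = [0.8, -0.5, 0.3]   # hidden preference, used only to simulate swipes
w = [0.0, 0.0, 0.0]         # the model's estimate, updated after each swipe
pool = list(range(len(cars)))

for _ in range(20):
    # Active learning step: query the most informative item, i.e. the one
    # whose predicted score is closest to the decision boundary.
    i = min(pool, key=lambda j: abs(dot(w, cars[j])))
    pool.remove(i)
    x = cars[i]
    swipe = 1 if dot(true_w, x) > 0 else -1   # simulated swipe right/left
    if swipe * dot(w, x) <= 0:                # perceptron-style update
        w = [wi + swipe * xi for wi, xi in zip(w, x)]

agreement = sum((dot(w, x) > 0) == (dot(true_w, x) > 0) for x in cars) / len(cars)
print(f"agreement with hidden preferences: {agreement:.0%}")
```

The point is the query strategy: after 20 targeted swipes the model has seen only the most ambiguous items, which is exactly the "Small Data" setting the abstract mentions.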

“Which car fits my life?” - mobile.de’s approach to recommendations

  • Speakers: Florian Wilhelm, Arnab Dutta
  • Video: YouTube

Quote

As Germany’s largest online vehicle marketplace mobile.de uses recommendations at scale to help users find the perfect car. We elaborate on collaborative & content-based filtering as well as a hybrid approach addressing the problem of a fast-changing inventory. We then dive into the technical implementation of the recommendation engine, outlining the various challenges faced and experiences made.

Notes

Really interesting talk on how to implement a hybrid approach (collaborative and content-based filtering) to recommendation with a fast-changing inventory, as well as how to battle the cold-start problem for new inventory items.

Really nice to see the full technology stack used in production to recommend 1.6 million cars to 12.5 million users per month, done without "fancy" neural nets but with proven technological choices.
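The hybrid idea can be sketched roughly like this; the blending weight, the toy feature vectors, and the cold-start fallback rule below are illustrative assumptions, not mobile.de's actual model:

```python
# Blend a collaborative-filtering score with a content-based similarity,
# falling back to pure content similarity for cold-start items that have
# no interaction history yet.

def content_score(item_features, user_profile):
    # Cosine similarity between item attributes and the user's taste profile.
    dot = sum(a * b for a, b in zip(item_features, user_profile))
    norm = (sum(a * a for a in item_features) ** 0.5 *
            sum(b * b for b in user_profile) ** 0.5)
    return dot / norm if norm else 0.0

def hybrid_score(item, user_profile, cf_scores, alpha=0.7):
    cb = content_score(item["features"], user_profile)
    cf = cf_scores.get(item["id"])
    if cf is None:              # cold-start item: no collaborative signal yet
        return cb
    return alpha * cf + (1 - alpha) * cb

items = [
    {"id": "bmw-320d", "features": [1.0, 0.2, 0.0]},
    {"id": "vw-golf",  "features": [0.3, 0.9, 0.1]},
    {"id": "new-listing", "features": [0.9, 0.3, 0.0]},  # just added
]
user = [0.8, 0.1, 0.1]
cf = {"bmw-320d": 0.9, "vw-golf": 0.4}   # learned from past interactions

ranked = sorted(items, key=lambda it: hybrid_score(it, user, cf), reverse=True)
print([it["id"] for it in ranked])
```

Note how the just-added listing still gets ranked, purely on its content similarity, which is exactly how a hybrid setup sidesteps the cold-start problem on new inventory.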

Also noteworthy: the initiative to identify "surfing users" without real buying intent and exclude them from the ML models to optimise recommendation results.

Florian Wilhelm (from inovex) might be an interesting speaker for a S2-Campus.


Blockchain (more than "just" Bitcoin)

Blockchains for Artificial Intelligence

  • Speaker: Trent McConaghy
  • Video: YouTube

Further Links

Quote

"In recent years, big data has transformed AI, to an almost unreasonable level. Now, blockchain technology could transform AI too, in its own particular ways. Some applications of blockchains to AI are mundane yet useful, like audit trails on AI models. Some appear almost unreasonable, like AI that can own itself — AI DAOs (decentralized autonomous organizations) leading to the first AI millionaires."

Notes

Lots of interesting aspects one would not immediately associate with blockchains when thinking of Bitcoin and the like. This talk and the accompanying links are recommended for people interested in the topic of blockchains.

Trent McConaghy might be an interesting speaker for a S2-Campus.


Chatbots (ML, Open Source Framework)

Conversational AI: Building clever chatbots

  • Speaker: Tom Bocklisch
  • Video: YouTube

Further Links

Quote

"Most chatbots and voice skills are based on a state machine and too many if/else statements. Tom showed how to move past that and build flexible, robust experiences using machine learning throughout the stack."

Notes

Showing the structure of a chatbot and how to train one was interesting, as were the possibilities arising from the framework/library developed by Rasa (even if it was a plug for their system). As they are in the process of open sourcing the other components of their framework as well, this might enable everyone to build interesting bots that react in a way humans are used to.
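The step from if/else rules to a trained intent classifier, the talk's central move, can be caricatured in a few lines. The intents, training phrases, and nearest-centroid trick below are made up for illustration; Rasa's real pipeline is far more sophisticated:

```python
# Instead of hard-coded if/else rules, the bot learns intents from example
# utterances: bag-of-words vectors plus nearest-centroid classification.
from collections import Counter

TRAIN = {
    "greet":   ["hello there", "hi bot", "good morning"],
    "goodbye": ["bye bye", "see you later", "goodbye"],
    "search":  ["show me red cars", "find a cheap car", "search for diesel cars"],
}

def bow(text):
    return Counter(text.lower().split())

def similarity(a, b):
    # Multiset intersection: counts words the two bags have in common.
    return sum((a & b).values())

centroids = {
    intent: sum((bow(t) for t in texts), Counter())
    for intent, texts in TRAIN.items()
}

def classify(text):
    v = bow(text)
    return max(centroids, key=lambda intent: similarity(v, centroids[intent]))

print(classify("hi there"))            # expect "greet"
print(classify("find me diesel cars")) # expect "search"
```

Even this toy version generalises to unseen phrasings, which is the whole argument for learning intents instead of enumerating them by hand.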


Natural Language Processing

A word is worth a thousand pictures: Convolutional methods for text

  • Speaker: Tal Perry
  • Video: YouTube

Further Links

Quote

"Over the last three years, the field of NLP has gone through a huge revolution thanks to deep learning. The leader of this revolution has been the recurrent neural network and particularly its manifestation as an LSTM. Concurrently the field of computer vision has been reshaped by convolutional neural networks. This post explores what we “text people” can learn from our friends who are doing vision."

Notes

RNNs work great for text, but convolutions can do it faster. Any part of a sentence can influence the semantics of a word, so we want our network to see the entire input at once. Getting that big a receptive field can make gradients vanish and our networks fail, so we need to account for that; we can solve the vanishing gradient problem with DenseNets or dilated convolutions. Sometimes we need to generate text, and we can use "deconvolutions" to generate arbitrarily long outputs.
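The "see the entire input at once" point can be made concrete with a quick receptive-field calculation: with kernel size 3 and the dilation doubling per layer, coverage grows exponentially with depth, so a short stack covers a long sentence:

```python
# Receptive field of a stack of dilated 1-D convolutions:
# rf = 1 + sum((kernel_size - 1) * dilation) over the layers.

def receptive_field(kernel_size, dilations):
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

for depth in range(1, 7):
    dilations = [2 ** i for i in range(depth)]   # 1, 2, 4, 8, ...
    print(depth, receptive_field(3, dilations))
```

Six layers already see 127 tokens, while the same depth without dilation would see only 13; that exponential growth is why dilated convolutions can stand in for an RNN's unbounded context.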

Is That a Duplicate Quora Question?

  • Speaker: Abhishek Thakur
  • Video: YouTube

Further Links

Quote

"Quora released its first ever dataset publicly on 24th Jan, 2017. This dataset consists of question pairs which are either duplicate or not. Duplicate questions mean the same thing. In this talk, we discuss methods which can be used to detect duplicate questions using Quora dataset. Of course, these methods can be used for other similar datasets."

Notes

The most important part, especially when using logistic regression and xgboost, is the feature engineering. Doing this work up front massively improves results, and it is paramount to try different feature sets as well as to generate interesting "new" features from the data that might help with modelling.
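A hedged sketch of the kind of hand-crafted pair features the talk emphasised: simple lexical overlap and length statistics that a logistic regression or xgboost model can consume. The feature names below are illustrative, not the speaker's exact set:

```python
# Toy feature extractor for a question pair.

def pair_features(q1: str, q2: str) -> dict:
    w1, w2 = set(q1.lower().split()), set(q2.lower().split())
    shared = w1 & w2
    return {
        "len_diff": abs(len(q1) - len(q2)),
        "word_overlap": len(shared) / max(len(w1 | w2), 1),  # Jaccard index
        "same_first_word": q1.split()[0].lower() == q2.split()[0].lower(),
    }

f = pair_features("How do I learn Python?", "How can I learn Python fast?")
print(f)
```

In practice one would stack dozens of such features (fuzzy string ratios, TF-IDF distances, shared named entities) before handing the matrix to the classifier.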

The speaker then briefly showed different deep learning approaches, from a simple network with dense layers only to LSTM, GRU and 1D CNN models. These models gave an accuracy of around 0.80.

The speaker was able to reach an accuracy of 0.85 with a deep neural network comprising two translation layers (one for each question) initialized with GloVe embeddings, two LSTMs without GloVe embeddings, and two 1D convolutional layers also initialized with GloVe embeddings. This was followed by a series of dense layers with dropout and batch normalization.

Analysing user comments on news articles with Doc2Vec and Machine Learning classification

  • Speaker: Robert Meyer
  • Video: YouTube

Further Links

Quote

"Using the Doc2Vec framework to analyze user comments on German online news articles and uncovered some interesting relations among the data. Furthermore feeding the resulting Doc2Vec document embeddings as inputs to a supervised machine learning classifier.

Can we determine for a particular user comment from which news site it originated?"

Notes

Quite an interesting talk on the problems of doing NLP on very short documents (comments) and trying to classify them. It showed that the supervised ML classifier didn't do that great a job in terms of accuracy, as it isn't able to sort very short sentence- or sub-sentence comments into the correct bucket. On longer comments, though, it worked quite well and prototypical comments could be identified.

For us, doing NLP as a first step towards content-based recommendations would therefore mean using longer texts as input for such a (potential) system.

Find the text similarity you need with the next generation of word embeddings in Gensim

  • Speaker: Lev Konstantinovskiy
  • Video: YouTube

Further Links

Quote

"What is the closest word to "king"? Is it "Canute" or is it "crowned"? There are many ways to define "similar words" and "similar texts". Depending on your definition you should choose a word embedding to use. There is a new generation of word embeddings added to Gensim open source NLP package using morphological information and learning-to-rank: Facebook's FastText, VarEmbed and WordRank."

Notes

Really interesting comparison of the different algorithms and their best use cases. Helps a lot if you need to find similarity in texts and between words.
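The "closest word" question from the abstract boils down to cosine similarity over embedding vectors, whichever model produced them. A minimal illustration; the 4-dimensional vectors below are made up, while real FastText/VarEmbed/WordRank embeddings have hundreds of dimensions:

```python
import math

# Toy embedding table (illustrative values only).
vectors = {
    "king":    [0.9, 0.8, 0.1, 0.0],
    "crowned": [0.8, 0.7, 0.2, 0.1],
    "canute":  [0.7, 0.3, 0.6, 0.2],
    "banana":  [0.0, 0.1, 0.1, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def most_similar(word):
    # Nearest neighbour by cosine similarity, excluding the word itself.
    return max(
        (w for w in vectors if w != word),
        key=lambda w: cosine(vectors[word], vectors[w]),
    )

print(most_similar("king"))
```

Which neighbour wins ("crowned" vs "Canute") depends entirely on how the embedding was trained, which is exactly the talk's point about choosing the right model for your definition of "similar".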


Python (more in general)

Towards Pythonic Innovation in Recommender Systems

  • Speaker: Carlotta Schatten
  • Video: YouTube

Quote

"Recommender Systems are nowadays ubiquitous in our lives. Although many implementations of basic algorithms are well known, recent advances in the field are often less documented. This talks aims at reviewing available Recommender Systems libraries in Python, including cutting edge Time- and Context-aware state of the art models."

Notes

Interesting in that she had pre-built evaluation metrics for recommendation libraries (especially coming from a more Java-centric background) and then compared different Python libraries against these evaluations.
She was able to show that pythonic recommendation libraries are not fit for industrial usage. Personal point of view: recommendation at scale is a solved problem. We have the tools and techniques, and at scale they should not be built with Python. The evaluation might therefore be valid, but the presenter was asking the wrong question to begin with.
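The evaluation side of such comparisons usually rests on standard offline metrics; a sketch of one of them, precision@k (the share of the top-k recommended items the user actually interacted with), with made-up data:

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

recommended = ["a", "b", "c", "d", "e"]   # ranked output of some recommender
relevant = {"b", "d", "x"}                # items the user actually engaged with
print(precision_at_k(recommended, relevant, 3))  # 1 hit ("b") in the top 3
```

Recall@k and NDCG follow the same pattern; comparing libraries then means running each one's ranked output through the same metric functions.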


Tutorials (stats, performance for pandas & NLP)

Introductory tutorial on data exploration and statistical models

  • Trainer: Alexandru Agachi
  • Video: YouTube

Further Links

Quote

  • Descriptive statistics. Describe each variable depending on its type, as well as the dataset overall.
  • Visualization for categorical and quantitative variables. Learn effective visualization techniques for each type of variable in the dataset.
  • Statistical modeling for quantitative and categorical, explanatory and response variables: chi-square tests of independence, linear regression and logistic regression. Learn to test hypotheses, and to interpret our models, their strengths, and their limitations.
  • Then expand to the application of machine learning techniques, including decision trees, random forests, lasso regression, and clustering. Here you will explore the advantages and disadvantages of each of these techniques, as well as apply them to the dataset.

Notes

I was there to refresh some of my statistics knowledge and received a really comprehensive introduction, topped off with a Jupyter Notebook containing more than enough material to be a great resource.
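The regression portion of the tutorial in miniature: an ordinary least-squares fit of y on x from the closed-form formulas. The data points are made up for illustration:

```python
# Closed-form simple linear regression: slope = Sxy / Sxx,
# intercept = mean(y) - slope * mean(x).

def ols(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # roughly y = 2x plus noise
slope, intercept = ols(x, y)
print(f"y ~ {slope:.2f}x + {intercept:.2f}")
```

A statsmodels or scikit-learn fit would return the same coefficients plus the standard errors and p-values the tutorial uses for hypothesis testing.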

Pandas from the Inside / "Big Pandas"

  • Trainer: Stephen Simmons
  • Video: YouTube

Further Links

Quote

Pandas is great for data analysis in Python. It promises intuitive DataFrames from R; speed like numpy; groupby like SQL. But there are plenty of pitfalls. This tutorial looks inside pandas to see how DataFrames actually work when building, indexing and grouping tables. You will learn how to write fast, efficient code, and how to scale up to bigger problems with libraries like Dask.

Notes

Very interesting to see speed/performance comparisons for well-known pandas operations as well as speedy alternatives. Also great to get to know Dask as a replacement for pandas (for some use cases) on "big-ish" data.
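Dask's core trick, aggregating over partitions that never sit in memory at once, can be shown in miniature without Dask itself. A stdlib-only sketch, with a chunking generator standing in for out-of-core partitions:

```python
# Per-chunk partial results combined at the end (the map + reduce pattern
# behind dask.dataframe aggregations).

def chunked(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

data = list(range(1, 1001))     # pretend this is too big for RAM

# Each chunk contributes only a (sum, count) pair; the full data is never
# needed in one piece.
partials = [(sum(c), len(c)) for c in chunked(data, 100)]
total = sum(s for s, _ in partials)
count = sum(n for _, n in partials)
print(total / count)            # same as sum(data) / len(data)
```

The same decomposition works for min, max, and (with a bit more bookkeeping) variance, which is why these are the cheap operations on a Dask DataFrame while e.g. a global sort is expensive.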

Topic Modelling (and more) with NLP framework Gensim

  • Trainer: Bhargav Srinivasa Desikan
  • Video: YouTube

Further Links

Quote

Topic Modelling is a great way to analyse completely unstructured textual data - and with the python NLP framework Gensim, it's very, very easy to do this. The purpose of this tutorial is to guide one through the whole process of topic modelling - right from pre-processing your raw textual data, creating your topic models, evaluating the topic models, to visualising them. Advanced topic modelling techniques will also be covered in this tutorial, such as Dynamic Topic Modelling, Topic Coherence, Document Word Coloring, and LSI/HDP.

Notes

I hadn't done topic modelling for quite some time. Especially getting to know spaCy as an alternative to NLTK in the Python ecosystem was great. Also a really good refresher on the basics of topic modelling.
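The pre-processing step the tutorial starts from, in miniature: tokenise, drop stopwords, and build the id-to-word mapping plus bag-of-words corpus that gensim's topic models consume. This is a pure-Python stand-in (gensim's `Dictionary`/`doc2bow` do this for real), and the documents and stopword list are made up:

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "is", "in"}

docs = [
    "the cat sat in the sun",
    "the dog and the cat played in the garden",
    "stock prices of the company fell",
]

# Tokenise and filter stopwords per document.
tokenised = [
    [w for w in d.lower().split() if w not in STOPWORDS] for d in docs
]

# Build an id-to-word mapping over the corpus, then represent each document
# as sorted (word_id, count) pairs -- the format LDA/LSI models expect.
vocab = sorted({w for doc in tokenised for w in doc})
word_id = {w: i for i, w in enumerate(vocab)}
bow_corpus = [
    sorted(Counter(word_id[w] for w in doc).items()) for doc in tokenised
]
print(bow_corpus[0])
```

From here a real pipeline would hand `bow_corpus` and the id mapping to a topic model and then evaluate it with a coherence measure, as covered in the tutorial.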


Other interesting (and fun) things

TNaaS - Tech Names as a Service

  • Speaker: Vincent D. Warmerdam
  • Video: YouTube

Further Links

Quote

There is a striking phonetic similarity between big data technology and pokemon names. Can you create a service that generates strings that sound like potential pokemon names? And what might be the simplest possible way to make that into a service? Also, would it be possible to generate pokemon names that start with three random characters and end with 'base' (KREBASE, MONBASE would be appropriate but IEYBASE would not be).

Turns out that this is an interesting problem from a ML standpoint and that it is ridiculously easy to build in the cloud. In my talk I will explain the ML behind it;

markov chains
probabilistic graphs
rnn/lstm
bidirectional lstm

Notes

Great talk and also interesting from an implementation standpoint (AWS lambda).
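The simplest variant from the talk's list, a character-level markov chain, can be sketched quickly. The seed names and parameters below are made up for illustration:

```python
# Learn character-bigram transitions from a handful of seed names, then
# sample new pokemon-ish strings. "^" and "$" mark start and end of a name.
import random

SEEDS = ["pikachu", "charmander", "bulbasaur", "squirtle", "snorlax"]

transitions = {}
for name in SEEDS:
    chars = ["^"] + list(name) + ["$"]
    for a, b in zip(chars, chars[1:]):
        transitions.setdefault(a, []).append(b)

def generate(rng, max_len=10):
    out, c = [], "^"
    while len(out) < max_len:
        c = rng.choice(transitions[c])
        if c == "$":
            break
        out.append(c)
    return "".join(out)

rng = random.Random(42)
names = [generate(rng) for _ in range(5)]
print(names)
```

Higher-order chains (trigrams and up) or the RNN/LSTM variants from the talk produce more pronounceable results; the suffix constraint ("must end in 'base'") is what pushes the problem towards the bidirectional models.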


Keynotes

Keynote : Are all your worries about Artificial Intelligence wrong?

  • Speaker: Toby Walsh
  • Video: YouTube

Further Links

Quote

"Toby Walsh is one of the world's leading experts in artificial intelligence (AI). He was named by the Australian newspaper as a "rock star" of the digital revolution. He is also a passionate advocate for limits to ensure AI is used to improve, not hurt, our lives. In 2015, Professor Walsh was behind an open letter calling for a ban on autonomous weapons or 'killer robots' that was signed by more than 20,000 AI researchers and high profile scientists, entrepreneurs and intellectuals, including Stephen Hawking, Noam Chomsky, Apple co-founder Steve Wozniak, and Tesla founder Elon Musk."

Notes

The best keynote at this conference, imho. Mr. Walsh did everything he could to counter unreasonable fears without being a pacifier, while also showing what should be feared and regulated.
The problem, as he put it, isn't "smart AI" that may be invented some decades into the future, but autonomous machines as well as "stupid AI". AI and autonomous machines have the ability to save lives and make all our lives better. But stupid AI, autonomous (killer) machines, and machines being developed and tested without proper regulation or even public debate might lead to more problems down the road.

Toby Walsh would be a highly interesting speaker for a S2-Campus. Or the Next.

Keynote : The Future of Cybersecurity Needs You, Here is Why

  • Speaker: Verónica Valeros
  • Video: YouTube

Further Links

Quote

"In the last decade we have observed a shift in cybersecurity. Cyber threats started to impact more and more our daily lives, even to the point of threatening our physical safety. We learnt that attackers are well aware of our weaknesses and limitations, that they take advantage of this knowledge and that for being successful they need to be just a little better than us. As defendants, we struggle. We perfected existing solutions to protect our environments with some degree of success but still today we fall behind adversaries more often than not. We got really good at collecting data until the point of not being able to use it in its full extent. This lead us to ask ourselves, Is this it? Is this all we can do? The future of cybersecurity needs you, join me on this talk to find out why."

Notes

Very good keynote. Mrs. Valeros (working for Cisco) presented the challenges and threats quite well.

Verónica Valeros could be of interest as a speaker for the Next, as DDoS and all that stuff makes Digital really suck.

Keynote : Natural Language Processing: Challenges and Next Frontiers

  • Speaker: Barbara Plank
  • Video: YouTube

Further Links

Quote

"Despite many advances of Natural Language Processing (NLP) in recent years, largely due to the advent of deep learning approaches, there are still many challenges ahead to build successful NLP models.

In NLP, we typically deal with data from a variety of sources, like data from different domains, languages and media, while assuming that our models work well on a range of tasks, from classification to structured prediction. Data variability is an issue that affects all NLP models. "

Notes

Quite interesting from a personal standpoint, as I am interested in NLP. Also quite interesting regarding the problem of the training corpora that might be (or have been) used to train NLP models: do the later "real" data and the training data occur within the same linguistic contexts (e.g. training on Wikipedia and then classifying chat logs)?


Data Privacy

Data Analytics and the new European Privacy Legislation

  • Speaker: Amit Steinberg

Sadly got cancelled at the last minute.

