@grudelsud
Created March 18, 2019 12:41

Democratising data journalism: building a collaborative and investigative network across the UK

Charles Boutaud, Bureau of Investigative Journalism. https://www.thebureauinvestigates.com/profile/charlesboutaud https://pydata.org/london2018/schedule/presentation/49/ Data-driven journalism, with an overview of the work they've done. They use a Slack channel to kick off conversations, discuss ideas and ask questions, which then get translated into potential projects.

CatBoost - the new generation of gradient boosting

https://catboost.yandex/ https://github.com/catboost/catboost Anna Veronika Dorogush https://pydata.org/london2018/schedule/presentation/34/ CatBoost is essentially gradient boosting with native handling of categorical features, not just numerical ones. It supports GPUs, with a several-fold speed-up compared to XGBoost. Parameters are very important, particularly the learning rate and the number of iterations, which need to be balanced for the error to converge.
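A minimal sketch of the idea, assuming the catboost package is installed; the toy data, column index and parameter values are illustrative, not from the talk:

```python
from catboost import CatBoostClassifier

# Hypothetical toy data: column 0 is categorical, column 1 is numerical
X = [["london", 3.2], ["paris", 1.5], ["london", 0.7], ["berlin", 2.9]]
y = [1, 0, 0, 1]

model = CatBoostClassifier(
    iterations=200,      # number of boosting rounds; balance against the learning rate
    learning_rate=0.1,   # smaller values need more iterations to converge
    verbose=False,
)
# cat_features tells CatBoost which columns to treat as categorical,
# so no manual one-hot or label encoding is needed
model.fit(X, y, cat_features=[0])
print(model.predict([["paris", 2.0]]))
```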

Python Doesn’t Have to Be Slow: Speeding Up a Large-Scale Optimization Algorithm

Dat Nguyen https://pydata.org/london2018/schedule/presentation/21/ Performance work at Zopa, a lending platform for micro loans, with NP-hard problems around allocating money pots so that they meet investors' requirements. The presentation was mostly unreadable due to the tiny font, unfortunately. In any case, they use https://numba.pydata.org/, a JIT-compilation package backed by Anaconda that seems to give good results.
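A minimal sketch of what Numba does, assuming the numba package is installed; this toy function is illustrative and has nothing to do with Zopa's actual allocation algorithm:

```python
import numpy as np
from numba import njit

@njit  # compiles the function to machine code on first call
def weighted_sum(values, weights):
    total = 0.0
    for i in range(values.shape[0]):  # explicit loops are fine once JIT-compiled
        total += values[i] * weights[i]
    return total

values = np.random.rand(1_000_000)
weights = np.random.rand(1_000_000)
print(weighted_sum(values, weights))  # after the first (compiling) call, runs at near-C speed
```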

Creating correct and capable classifiers

Ian Ozsvald (PyData organiser) https://pydata.org/london2018/schedule/presentation/32/ Have a look at pandas_profiling: https://github.com/pandas-profiling/pandas-profiling. He tested the scikit-learn dummy classifier on the Titanic dataset from Kaggle, then ran a random forest and showed it performs far better than the dummy baseline. He plotted a Yellowbrick confusion matrix; check that library out, quite a cool visualisation layer for scikit-learn: https://pythonhosted.org/yellowbrick/introduction.html. Also the eli5 library, "explain like I'm 5": http://eli5.readthedocs.io/en/latest/overview.html
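A rough sketch of the dummy-baseline comparison, assuming scikit-learn; a synthetic dataset stands in for the Titanic data mentioned in the talk:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic features/labels
X, y = make_classification(n_samples=800, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# A useful model should clearly beat the dummy baseline
print("dummy accuracy: ", baseline.score(X_test, y_test))
print("forest accuracy:", forest.score(X_test, y_test))
```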

Building out data science at QBE

Liam P. Kirwin https://pydata.org/london2018/schedule/presentation/37/ QBE is a big commercial insurance company; they're old school because it works well, so data science helps where traditional models can't reach. Interestingly enough, they introduced data science libraries to improve communication across levels: eli5 and lime for explaining classifier results (https://github.com/marcotcr/lime), plus shap (Shapley values; probably what I heard as "sharply"), used as libraries for storytelling.
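A hedged sketch of the classifier-explanation idea using LIME, assuming the lime and scikit-learn packages are installed; the dataset and model are placeholders, not QBE's:

```python
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    data.data,
    feature_names=data.feature_names,
    class_names=list(data.target_names),
    mode="classification",
)
# Explain a single prediction as per-feature contributions
explanation = explainer.explain_instance(data.data[0], model.predict_proba, num_features=4)
print(explanation.as_list())
```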

James Powell, more generators

This is the same speaker whose live-coding presentation I saw at PyData 17... I'm sorry, but I find incomprehensible live-coding sessions a pointless show-off; I'll avoid them in the future.

Python at massive scale, Stephen Simmons at JP Morgan

They started in 2006 from a monolithic repo used across the world, with no dev/prod separation. It has since scaled enormously to 35M lines of code, using Hydra, a proprietary object-oriented database with distributed optimistic writes. Source code is stored in Hydra itself, so running a command fetches the latest version directly from the database. Their viz tool, Perspective, was recently open-sourced on GitHub: https://jpmorganchase.github.io/perspective/

Sunday keynote: Learning programming and science with Scientific Python

Emmanuelle Gouillart https://pydata.org/london2018/schedule/presentation/51/ Core dev of scikit-image; a really interesting talk about empowering learning in her team: encouraging people to write documentation, adding an API example gallery, and doing everything possible to onboard newcomers as quickly as possible. Check out sphinx-gallery: https://github.com/sphinx-gallery/sphinx-gallery. Also check out Binder, which runs notebooks on Kubernetes and is featured in sphinx-gallery: https://mybinder.org/
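A minimal sphinx-gallery configuration sketch for a project's Sphinx conf.py; the directory names are assumptions for illustration, not taken from the talk:

```python
# In the project's Sphinx conf.py
extensions = [
    "sphinx_gallery.gen_gallery",
]

sphinx_gallery_conf = {
    "examples_dirs": "examples",      # where the example .py scripts live
    "gallery_dirs": "auto_examples",  # where the generated gallery pages go
}
```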

Auto-encoders in the wild... of telco land.

Guillermo Christen https://pydata.org/london2018/schedule/presentation/38/ A talk on using neural networks to build encoders/decoders that compress feature spaces. Meh, not convinced at all.
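For reference, a tiny sketch of the encode-then-decode idea, assuming TensorFlow/Keras; purely illustrative, not the telco model from the talk:

```python
import numpy as np
from tensorflow import keras

input_dim, code_dim = 32, 4  # compress 32 features down to a 4-dimensional code

inputs = keras.Input(shape=(input_dim,))
encoded = keras.layers.Dense(code_dim, activation="relu")(inputs)      # encoder
decoded = keras.layers.Dense(input_dim, activation="linear")(encoded)  # decoder

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.rand(1000, input_dim)
autoencoder.fit(X, X, epochs=5, batch_size=64, verbose=0)  # learn to reconstruct the input

encoder = keras.Model(inputs, encoded)
print(encoder.predict(X[:3]))  # the compressed 4-dimensional representation
```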

Searching for Shady Patterns: Shining a light on UK corporate ownership

Adam Hill https://pydata.org/london2018/schedule/presentation/17/ This guy... interesting, seems a good guy, albeit on the pompous side. He explored the Companies House dataset with Neo4j, and a lot of things look really odd (e.g. 4,000 children under 2 years of age are registered as company owners...). Check out datakind.org and his presentation: http://bit.ly/pyDataLDN2018-Corporate-Ownership

Data Deduplication using Locality Sensitive Hashing

Matti Lyra https://pydata.org/london2018/schedule/presentation/30/ A MinHash library for comparing document similarity: https://github.com/mattilyra/LSH. Kind of meh; it focuses on near-duplicate text only, e.g. multiple versions of the same article where a few things have changed: how do you quickly tell whether they're the same? The proposed solution only applies to text.
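A self-contained sketch of the MinHash idea (plain Python, not the mattilyra/LSH API): hash character shingles many times, keep the minimums, and the fraction of matching minimums estimates the Jaccard similarity between two documents:

```python
import hashlib

def shingles(text, k=4):
    """Character k-grams of the text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(tokens, num_hashes=64):
    """For each of num_hashes salted hash functions, keep the minimum hash value."""
    return [
        min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16) for t in tokens)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching minimums approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
b = minhash_signature(shingles("the quick brown fox jumped over the lazy dog"))
print(estimated_jaccard(a, b))  # near-duplicate articles score close to 1.0
```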
