liopic/notes.md

## notes.md

      
Keynote: Travis Oliphant


NumFOCUS (NGO)
ML -> Python
Anaconda makes ML magic available to mortals

modeling, prediction, classification, visualization
feature labeling, data cleaning, data extraction, scaling, deployment


Spyder IDE
recommends: scikit-learn, TensorFlow, Keras, XGBoost
intros to NumPy, SciPy (stats helpers), matplotlib
Numba (bigger nodes, scale up) vs Dask (more nodes, scale out), Blaze (best of both: GPU cluster)
JupyterLab

Keynote: Holden Karau


PySpark
RDDs / DataFrames
FP (functional programming)
DAG (and the query plan)
Py4J (lets Python access Java objects in the JVM)
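
To make the FP and DAG points concrete, here is a minimal single-process sketch of lazy evaluation: transformations only record a plan, and nothing runs until an action asks for results. `LazyDataset` and its methods are hypothetical names invented for this illustration; this is not the PySpark API and it involves no JVM or Py4J.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on demand."""

    def __init__(self, source, plan=()):
        self._source = source
        self._plan = plan  # tuple of (operation name, function) pairs

    def map(self, fn):
        # Transformation: returns a new dataset with a longer plan, runs nothing.
        return LazyDataset(self._source, self._plan + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also lazy.
        return LazyDataset(self._source, self._plan + (("filter", predicate),))

    def explain(self):
        # Human-readable view of the recorded plan (a linear DAG here).
        return " -> ".join(["source"] + [name for name, _ in self._plan])

    def collect(self):
        # Action: only now is the plan executed over the source data.
        items = iter(self._source)
        for name, fn in self._plan:
            items = map(fn, items) if name == "map" else filter(fn, items)
        return list(items)


calls = []  # records when square() actually runs


def square(x):
    calls.append(x)
    return x * x


numbers = LazyDataset(range(1, 6))
pipeline = numbers.map(square).filter(lambda x: x % 2 == 1)

plan_text = pipeline.explain()
calls_before_action = len(calls)  # still 0: nothing has been computed yet
result = pipeline.collect()
```

Because each transformation returns a new object instead of mutating the old one, the same `numbers` dataset can feed several independent pipelines.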

climate data


console.ng.bluemix.net weather API
datascience.ibm.com, github/ibm-cds-labs/python-notebooks
PixieDust (charts that can be edited through a menu)

mapbox


climexp.knmi.nl and ecmwf.int/en/forecast/datasets
scipy.interpolate to make a map
use weather forecasts to adjust retail offers
pd.merge_
medium.com/ibm-watson-data-lab
seti.org/ML4SETI
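
The `scipy.interpolate` note above can be sketched as follows: scattered point readings are interpolated onto a regular grid that a plotting library could then draw as a map. The station coordinates and the linear `field` function are made up for the example, and `griddata` with `method="linear"` is only one plausible choice; the notes do not say which routine the speaker used.

```python
import numpy as np
from scipy.interpolate import griddata

rng = np.random.default_rng(0)

# Made-up "station" locations: four corners plus random interior points.
corners = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
stations = np.vstack([corners, rng.random((20, 2))])


def field(x, y):
    # Synthetic readings from a known linear field, so the result can be checked.
    return 10.0 + 2.0 * x + 3.0 * y


readings = field(stations[:, 0], stations[:, 1])

# Regular grid strictly inside the area covered by the stations.
grid_x, grid_y = np.meshgrid(np.linspace(0.1, 0.9, 5), np.linspace(0.1, 0.9, 5))

# Piecewise-linear interpolation of the scattered readings onto the grid.
grid_values = griddata(stations, readings, (grid_x, grid_y), method="linear")
```

Grid points outside the convex hull of the stations would come back as NaN with linear interpolation, which is why the grid here stays inside the corner points.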

Happiness inside the job


Tuesday is the saddest day
Exploring data

Choose day to post job offer


graphs employeeA - employeeB

intracompany interactions


ML churn prediction input

Employee individual features
Company wide features
Employee-company features
Social features


Analyzing code contributions with networkX and matplotlib


cohesion of a group

robustness
overlap


connectivity: remove actors until the group is disconnected, or count the different paths between them

k-components
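
The "remove actors until the group is disconnected" idea can be sketched with a brute-force search in plain Python. The contributor names are invented, and this is a deliberately naive illustration that does not use networkX's own connectivity routines; it is only practical for very small graphs.

```python
from itertools import combinations


def build_graph(edges):
    """Undirected graph as a dict mapping each node to its set of neighbours."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def is_connected(adjacency, removed=frozenset()):
    """True if the graph stays in one piece after dropping the removed nodes."""
    nodes = [n for n in adjacency if n not in removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        current = stack.pop()
        for neighbour in adjacency[current]:
            if neighbour not in removed and neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return len(seen) == len(nodes)


def node_connectivity(adjacency):
    """Smallest number of nodes whose removal disconnects the graph."""
    nodes = list(adjacency)
    for k in range(len(nodes) - 1):
        for removed in combinations(nodes, k):
            if not is_connected(adjacency, frozenset(removed)):
                return k
    # A complete graph cannot be disconnected; by convention return n - 1.
    return len(nodes) - 1


# Hypothetical collaboration graphs between contributors.
chain = build_graph([("ana", "bo"), ("bo", "cy"), ("cy", "di")])
ring = build_graph([("ana", "bo"), ("bo", "cy"), ("cy", "di"), ("di", "ana")])

chain_k = node_connectivity(chain)  # one departure splits the chain
ring_k = node_connectivity(ring)    # the ring survives any single departure
```

A higher value means a more robust group: the ring tolerates the loss of any one contributor, while the chain does not.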


Rolling Pandas

∘ in series or dataframes
∘ inclusion-exclusion, summed area tables
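
A minimal NumPy sketch of the summed area table idea mentioned above: after one pass of cumulative sums, the sum of any rectangular window comes from four table lookups combined by inclusion-exclusion. The function names are invented for this example, and how the talk tied this to pandas rolling windows is not recorded in these notes.

```python
import numpy as np


def summed_area_table(values):
    """Table where entry [i, j] holds the sum of values[:i, :j]."""
    table = np.zeros((values.shape[0] + 1, values.shape[1] + 1), dtype=values.dtype)
    table[1:, 1:] = values.cumsum(axis=0).cumsum(axis=1)
    return table


def window_sum(table, top, left, height, width):
    """Sum of values[top:top+height, left:left+width] via inclusion-exclusion."""
    bottom, right = top + height, left + width
    return (
        table[bottom, right]
        - table[top, right]    # exclude the rows above the window
        - table[bottom, left]  # exclude the columns left of the window
        + table[top, left]     # add back the corner that was excluded twice
    )


grid = np.arange(1, 17).reshape(4, 4)
table = summed_area_table(grid)
centre_sum = window_sum(table, top=1, left=1, height=2, width=2)
```

The table costs one pass over the data to build; after that every window sum takes constant time regardless of the window size.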

Asteroid impact prediction


TensorFlow

tensor = n-dimensional array
flow = graph that shows how the data flows
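
As a rough illustration of both words, the sketch below builds a tiny computation graph whose nodes pass NumPy arrays (the tensors) along its edges (the flow). The `Node` class and helper functions are hypothetical and written for this note; this is not the TensorFlow API.

```python
import numpy as np


class Node:
    """One vertex of the graph: either a constant or an operation on inputs."""

    def __init__(self, op=None, inputs=(), value=None):
        self.op = op
        self.inputs = inputs
        self.value = value

    def evaluate(self):
        # Constants return their tensor; operations pull data from their inputs.
        if self.op is None:
            return self.value
        return self.op(*(node.evaluate() for node in self.inputs))


def constant(value):
    return Node(value=np.asarray(value, dtype=float))


def matmul(a, b):
    return Node(np.matmul, (a, b))


def add(a, b):
    return Node(np.add, (a, b))


def relu(a):
    return Node(lambda x: np.maximum(x, 0.0), (a,))


# Tensors of different rank: scalar (0), vector (1), matrix (2).
ranks = [np.asarray(3.0).ndim, np.asarray([1.0, 2.0]).ndim, np.eye(2).ndim]

# Building the graph computes nothing yet.
x = constant([[1.0, -2.0]])
w = constant([[1.0, 0.0], [0.0, 1.0]])
b = constant([0.5, 0.5])
layer = relu(add(matmul(x, w), b))

# Evaluating the last node makes the data flow through the whole graph.
output = layer.evaluate()
```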


Google's TensorFlow neural network visualization
Google released a library of labeled images
TensorFlow codelab in her GitHub
steps: explore dataset, recognition protocol, 1st layer, evaluation

Squeeze your big data


old-time tale: the faster the transmission line, the less compression is needed

modern CPUs are so fast that the memory bus is the bottleneck


Blosc -> compressor that uses multiple cores
data containers, chunked containers

On disk: HDF5 format, NetCDF4
In memory: bcolz, zarr
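
A minimal sketch of a chunked, compressed in-memory container, using the standard library's `zlib` as a stand-in for Blosc. `ChunkedContainer` is a hypothetical class written for this note, not the bcolz or zarr API, and it is single-threaded, unlike Blosc.

```python
import zlib
from array import array


class ChunkedContainer:
    """Keeps 64-bit integers in independently compressed, fixed-size chunks."""

    def __init__(self, values, chunk_len=1024):
        self.chunk_len = chunk_len
        self.length = len(values)
        # Each chunk is compressed on its own, so it can be read on its own.
        self.chunks = [
            zlib.compress(array("q", values[start:start + chunk_len]).tobytes())
            for start in range(0, len(values), chunk_len)
        ]

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        chunk_no, offset = divmod(index, self.chunk_len)
        # Only the chunk holding the requested item is decompressed.
        chunk = array("q")
        chunk.frombytes(zlib.decompress(self.chunks[chunk_no]))
        return chunk[offset]

    @property
    def compressed_bytes(self):
        return sum(len(chunk) for chunk in self.chunks)


values = list(range(100_000))
container = ChunkedContainer(values)
raw_bytes = len(values) * array("q").itemsize
```

Chunking is what keeps random access affordable: reading one item costs one chunk decompression rather than decompressing the whole dataset.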


compression in ML

Tuple Oriented Coding


Bandwidth to the GPU is limited, so send the data compressed from CPU to GPU.
Only in recent CPUs
Use compressed data chunks

Neuroscience


control all the hardware easily
Python used in all steps!

Jupyter as interactive dashboard


relies on a web front end, which is not ready until later stages of the project
Early stage -> prototype to validate and understand the data
JS UI is difficult -> use Jupyter
ipywidgets

add components
layout widgets (boxes, tabs, accordion)
jupyter in dashboard mode
far from ideal for production


good for prototype

for production: kibana, grafana


Blockchain


Distributed ledger (you need the C, consistency, in CAP)
pyledger
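
To show what makes a ledger tamper-evident, here is a minimal hash-chain sketch using only the standard library. It does not use pyledger, and it leaves out the distributed part entirely (replication and agreement between nodes, which is where the consistency requirement comes in). All function names and payloads are invented for the example.

```python
import copy
import hashlib
import json

GENESIS_HASH = "0" * 64


def block_hash(index, payload, previous_hash):
    """SHA-256 over a canonical JSON encoding of the block contents."""
    body = json.dumps(
        {"index": index, "payload": payload, "previous_hash": previous_hash},
        sort_keys=True,
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_block(chain, payload):
    """Adds a block that commits to the hash of the block before it."""
    previous_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    index = len(chain)
    chain.append(
        {
            "index": index,
            "payload": payload,
            "previous_hash": previous_hash,
            "hash": block_hash(index, payload, previous_hash),
        }
    )


def is_valid(chain):
    """Recomputes every hash and checks that each block links to the previous one."""
    previous_hash = GENESIS_HASH
    for index, block in enumerate(chain):
        if block["index"] != index or block["previous_hash"] != previous_hash:
            return False
        if block["hash"] != block_hash(index, block["payload"], previous_hash):
            return False
        previous_hash = block["hash"]
    return True


ledger = []
append_block(ledger, {"from": "ana", "to": "bo", "amount": 5})
append_block(ledger, {"from": "bo", "to": "cy", "amount": 2})

# Editing an old entry breaks the chain of hashes.
tampered = copy.deepcopy(ledger)
tampered[0]["payload"]["amount"] = 500
```

Changing any earlier payload invalidates that block's hash and every link after it, so `is_valid(ledger)` holds while `is_valid(tampered)` does not.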

Marketing data science


27 features to represent customers
personality segments & groups