Notes from The Science of Deep Learning Colloquia
@ekinakyurek · March 15, 2019

DAY 1

1. The State of Deep Learning: Overview Talk (I) - Amnon Shashua

  • Deep Learning with overparametrized networks enables both training error ↓ and test error ↓

  • Quantum entanglement related to deep nets: https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.122.065301

  • Self-driving cars, driving policy, ethics in AI

    • Failures are allowed only due to perception errors, and those should be rarer than human error
    • There should be no failures due to wrong decisions
    • The world is rich in detail; the challenge for pattern recognition is accuracy
    • For humans, perceptual accidents are very rare: roughly 10^7 hours of driving per accident
    • Combine two different (independent) subsystems for the same task to reduce risk
  • Natural Language Understanding is a very good small test environment for General AI. Understanding a book has enough complexity.

2. The State of Deep Learning : Overview Talk (II) - Jitendra Malik, UC Berkeley

Phylogeny of intelligence.

Intelligence is really about perception and action.

The evolutionary progression

  • Vision and Locomotion
  • Manipulation - science falling behind
  • Language - same

Major Success of DL

CV, Speech understanding, machine translation, game playing

Behind the success: data + computing + annotation + simulation

1 neuron = 1000 instructions/sec

DL in the context of early history

Turing: rather than simulating an adult brain, try to simulate a child's brain and then educate it.

1980: Neocognitron: A Self-organizing Neural Network Model

Lecun 1989 - Convolutional Neural Networks

R-CNN

Mask R-CNN

Biological inspiration is overrated - deeper is better: from the retina to the back of the brain there are only about seven synapses.

  • LeNet - AlexNet - VGG - GoogLeNet - ResNet - ResNeXt - Mask R-CNN

The future: Seeing 3D

Understand the geometry

Learned multiview scenario

Visual Navigation in Novel Environments

Challenges:

  • few-shot learning
  • learning with little supervision
  • unifying learning with geometric reasoning
  • perception and control

What about unsupervised learning? DL is a function-approximation technique: there is an input and an output.

3. The State of Deep Learning : Overview Talk (III) - Chris Manning, Stanford

He provided an overview of DL in speech recognition and speech synthesis, and of how this technology evolved into innovations such as Alexa and Siri. A challenge in the past: similar words had dissimilar representations, e.g. hotel and motel. RNNs generate words by repeated sampling. The idea of by-pass connections is very effective! LSTMs: Hochreiter 1998.

Recurrent neural encoder-decoder networks, e.g. for translation tasks. Refs for deep representations of this kind: Sutskever et al. 2014; Luong et al. 2015.

Bottleneck: all the information in the sentence has to pass through one pipeline.

Solution: seq to seq with attention.
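A minimal numpy sketch of the dot-product attention idea behind this (function and variable names here are illustrative, not from the talk):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    """Dot-product attention: score every encoder state against the current
    decoder state, normalize the scores, and return the weighted sum."""
    scores = encoder_states @ decoder_state        # one score per source position
    weights = softmax(scores)                      # attention distribution over the source
    context = weights @ encoder_states             # weighted sum of encoder states
    return context, weights

# toy example: 5 source positions, hidden size 4
enc = np.random.randn(5, 4)
dec = np.random.randn(4)
context, w = attention_context(dec, enc)
print(w.sum())   # ~1.0: a distribution over the source, not a single bottleneck vector
```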

One evaluation: a 26% performance improvement on English-to-German translation from 2014 to 2015, in a single year.

Open-domain question answering: DrQA (Ref. Chen et al., ACL 2017), the Stanford Attentive Reader. Contextual word representations from LMs: think of the hidden states as representations.

ELMo (Peters et al. 2018) in an NER tagger; representations of this kind have grown over the years, e.g. BERT, GPT, etc.

4. The State of Deep Reinforcement Learning - Oriol Vinyals, Google DeepMind

Summary: Atari - AlphaGo - AlphaStar

Reinforcement learning scheme: agent, observations, actions, environment

Mentioned Deepmind's 2015 Atari Paper

Behind every great agent there's a great environment (the importance of datasets)

Policy and Value Learning

Atari vs. Go: the action space is larger in Go; the information type is near-perfect vs. perfect.

  1. AlphaGo: Policy Network and Value Network (predicts the winner). Take a search + imitate the search (a minimal sketch of this loop appears after this list): related summary blog post that I found

    Policy Improvement Theorem

    AlphaGo Zero: 4 TPUs; AlphaGo (vs. Lee Sedol): 48 TPUs

  2. StarCraft: the action space is ~10^6; exploration is really hard; imitation learning; a real-time strategy game, so fast prediction is needed while playing; grandmaster level.
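A toy sketch of the "take a search + imitate the search" idea from the AlphaGo item above. The visit counts, move set, and plain softmax/cross-entropy step are illustrative assumptions, not the actual AlphaGo training setup:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def imitate_search(logits, visit_counts, lr=0.1):
    """One policy-improvement step: the search (e.g. MCTS) yields visit counts
    that define a stronger target policy; the policy is trained toward it
    with a cross-entropy loss."""
    target = visit_counts / visit_counts.sum()   # improved policy implied by the search
    probs = softmax(logits)
    loss = -np.sum(target * np.log(probs + 1e-12))
    grad = probs - target                        # gradient of cross-entropy w.r.t. logits
    return logits - lr * grad, loss

logits = np.zeros(5)                              # uniform prior over 5 hypothetical moves
counts = np.array([50.0, 5.0, 30.0, 10.0, 5.0])   # hypothetical search visit counts
for _ in range(200):
    logits, loss = imitate_search(logits, counts)
print(softmax(logits).round(2))                   # approaches the visit-count distribution
```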

Patterns for Success and Challenges Ahead

  1. Success

    1. An environment full of rewards (Atari)

    2. Available human demonstrations (AlphaGo, AlphaStar)

    3. Algorithmic ways to improve the policy (AlphaGo, AlphaZero)

    4. ...

  2. Challenges

    1. The real world is not a simulation

    2. Transfer / general AI: AlphaGo should learn chess quickly, AlphaStar should learn Atari easily, an ImageNet classifier should trivially transfer to MNIST

    3. Understanding the theory of non-convex optimization

5. Panel: Tomaso Poggio, Regina Barzilay, Terrence Sejnowski, Rodney Brooks

Regina: Interpretability is needed to understand models better. Supervision and the quality of supervision. Datasets are biased (e.g. a fake-news dataset: even with the evidence removed, we can still predict the label :)).

Tomaso: We need more science and math, but we have many GPU hours instead :). Big data is not realistic: we humans learn from very little data. Current architectures are not the way humans learn.

Terrence: New scientific studies will be enabled by the deep learning hype; we experienced similar things when the heat equation was solved. A deep learning revolution.

Rodney: we can't trust systems trained on deep nets

6. Deep Learning in Science(I): Regina Barzilay, MIT

MIT Machine Learning for Pharmaceutical Discovery Consortium

Challenge in drug discovery: a huge combinatorial space.

How to deploy ML to address this challenge. Predicting chemical reactions.

  1. Property prediction: take a molecule, extract a molecular fingerprint, apply graph convolutions. One reason the initial model failed was domain transfer.
  2. Better Molecule Generation

What are the open questions? Molecular representations beyond graphs; modeling the underlying physics; how to improve a molecule to have better properties.

String-to-string generation: linearize the molecules (SMILES) - this did poorly.

Graph-to-graph generation: the invalidity of intermediate molecules is a challenge. The model should produce many diverse outputs.

Tree decomposition: Molecule to tree.

7. Deep Learning in Science(II): Deep Learning and Particle Physics - Kyle Cranmer, NYU

  • higgs boson discovery

  • particle collision = complex probabilistic model

  • created particles create other particles

  • likelihood calculation is very hard

  • Bayesian inference under intractable likelihoods: likelihood-free inference

    • Approximate Bayesian Computation (can also be intractable)
    • You just need to be able to do forward simulation
    • sufficient statistics sometimes cannot be determined

New approaches

  1. Use the simulator
    1. Hijack the inside of the simulator
  2. Learn the simulator
    1. Generative Adversarial Networks
    2. Learning the likelihood ratio (supervised)
    3. Likelihood ratio trick (a binary classifier ≈ the likelihood ratio; see the sketch below)
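A toy illustration of the likelihood ratio trick: train a binary classifier to separate samples drawn at two parameter settings and read the ratio off its output. The Gaussians and the sklearn logistic regression here are stand-ins for the intractable simulators, not the methods used in the talk:

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Samples from two "simulators" (here plain Gaussians standing in for
# intractable forward simulations at two parameter points theta_0, theta_1).
x0 = rng.normal(0.0, 1.0, size=(5000, 1))   # class 0: x ~ p(x | theta_0)
x1 = rng.normal(0.5, 1.0, size=(5000, 1))   # class 1: x ~ p(x | theta_1)

X = np.vstack([x0, x1])
y = np.concatenate([np.zeros(len(x0)), np.ones(len(x1))])
clf = LogisticRegression().fit(X, y)

# The classifier's output s(x) estimates p1/(p0+p1), so s/(1-s) estimates
# the likelihood ratio p(x | theta_1) / p(x | theta_0).
x_test = np.array([[0.25]])
s = clf.predict_proba(x_test)[0, 1]
print("estimated ratio:", s / (1 - s))
print("true ratio     :", norm.pdf(0.25, 0.5, 1) / norm.pdf(0.25, 0.0, 1))
```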

Takeaways

ML has the potential to effectively bridge the microscopic-macroscopic divide.

Physics Aware Machine Learning

Intersection of deep learning (successful) and Bayesian methods (interpretable)

  1. Physics aware Gaussian Processes

  2. QCD-Aware Recursive Neural Networks

  3. QCD-Aware Graph Convolutional Networks

  4. JUNIPR: a generative model for jets; it can train on real data and is interpretable

8. Deep Learning in Science(III): DL in Genomic Research - Olga Troyanskaya, Princeton

How does a single mutation in genome affect gene regulation?

Which SNPs are functional and lead to human disease?

Understanding disease causing mutations using DL.

Two types of mutations:

  1. Coding variant
  2. Noncoding regulatory variant

Model should be

Genomic sequence -> sequence model -> chromatin organization

Model is trained on a single genome.

Deep convolutional network-based sequence model

Why relevant to genome data?

  • Many examples of the same sequences along the whole DNA
  • Capture context information
  • Interaction of seq features
  • Multi task prediction
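A minimal sketch of this kind of convolutional sequence model: one-hot DNA in, multi-task chromatin predictions out, and the effect of a mutation read off as the difference between the two predictions. The layer sizes and the 919-task output head are illustrative assumptions, not the published DeepSEA architecture:

```python
import torch
import torch.nn as nn

class TinySeqModel(nn.Module):
    """Toy convolutional sequence model: one-hot DNA (A,C,G,T) -> stacked 1D
    convolutions -> multi-task sigmoid outputs (e.g. chromatin marks, DNase
    accessibility, TF binding). Sizes are illustrative only."""
    def __init__(self, seq_len=1000, n_tasks=919):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(4, 64, kernel_size=8), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(64, 128, kernel_size=8), nn.ReLU(), nn.MaxPool1d(4),
        )
        with torch.no_grad():
            flat = self.conv(torch.zeros(1, 4, seq_len)).numel()
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(flat, n_tasks))

    def forward(self, x):                   # x: (batch, 4, seq_len) one-hot
        return torch.sigmoid(self.head(self.conv(x)))

model = TinySeqModel()
ref = torch.zeros(1, 4, 1000); ref[0, 0, :] = 1            # toy all-A reference sequence
alt = ref.clone(); alt[0, 0, 500] = 0; alt[0, 2, 500] = 1  # single A->G change at position 500
effect = model(alt) - model(ref)            # predicted per-task effect of the mutation
print(effect.shape)                         # torch.Size([1, 919])
```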

The proposed model is able to predict histone marks, DNase accessibility, transcription factors given a single code change/mutation.

DeepSEA identifies a significant noncoding regulatory mutation burden in ASD. The studied ASD cohort is composed of families where autism is observed in only one of the children and not in the rest of the family.

One question: could autism be the result of stronger mutations rather than just the presence of a mutation, i.e. could a sibling also carry the mutation but not the disease? Yes. BioRxiv. Nature Genetics.

ExPecto - ab initio prediction of tissue-specific gene expression from sequence.

A pipeline of methods (deep learning, spatial feature transformation, regularized linear models) to obtain the expression level and the associated impact of a mutation.

Summary

  • A DL based algorithmic framework for predicting the effect of any non-coding mutation in genome.
  • A computational framework for accurate prediction of tissue-specific expression, including de novo prediction of expression variation
  • Functional networks produced by semi-supervised data integration enable insight into mechanisms of human disease, including Alzheimer's, Parkinson's and cardiovascular diseases.

9. Deep Learning in Science(IV), Eero Simoncelli, NYU

  • Deep Convolutional Neural Nets
  • Largely inspired by neurobiology
  • Astonishing (but often brittle) results
  • Model for sensory neurobiology
  • Basic neural selectivity
  • MRI - stimulus similarity
  • Missing:
    - largely unsupervised learning
    - non-classification objectives
    - local learning
    - adaptation
    - gain control
    - homeostasis
    - recurrence/state/context (memory, reward, attention)
    - myriad biophysical details

Example 1: Difference between two images (Berardino et al., NIPS 2017)

  • MSE is not a good measure for human eyes
  • L(X, Xhat) = ||f(X) - f(Xhat)||
  • TID2018 Dataset
  • All models (whether deep or not) perform the same on the test data
  • Which one generalizes? Compare the least-noticeable and most-visible eigen-distortions (see the sketch after this list)
  • Local gain control
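A toy sketch of the perceptual-distance and eigen-distortion idea above: use L(X, Xhat) = ||f(X) - f(Xhat)|| as the distance, and take the top/bottom eigenvectors of JᵀJ (J the Jacobian of f at the image) as the most/least noticeable distortions. The tiny gain-control model f below is an illustrative stand-in, not a model from the paper:

```python
import numpy as np

def f(x):
    """Toy response model with local gain control (divisive normalization)."""
    return x / (1.0 + np.abs(x))

def perceptual_distance(x, x_hat):
    # L(X, Xhat) = ||f(X) - f(Xhat)||, used in place of pixel-domain MSE
    return np.linalg.norm(f(x) - f(x_hat))

def eigen_distortions(x, eps=1e-4):
    """Most/least noticeable distortions: top and bottom eigenvectors of J^T J,
    where J is the Jacobian of f at the image x (finite differences)."""
    n = x.size
    J = np.zeros((n, n))
    for i in range(n):
        d = np.zeros(n); d[i] = eps
        J[:, i] = (f(x + d) - f(x - d)) / (2 * eps)
    vals, vecs = np.linalg.eigh(J.T @ J)
    return vecs[:, -1], vecs[:, 0]          # most visible, least noticeable

x = np.random.rand(16)                      # tiny flattened "image"
most, least = eigen_distortions(x)
print(perceptual_distance(x, x + 0.1 * most) > perceptual_distance(x, x + 0.1 * least))
```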

Example 2: Perceptual Straightening of Videos (Hénaff, Goris, Simoncelli, Nature Neuroscience 2019)

  • Curvature in Pixel Domain
  • Perceptual experiments on Humans
  • Humans reduce the curvature in their neural representation
  • CNNs do not behave like the human brain here

10. Panel: Scientific Funding for Deep Learning

Super Turing Computation

Can handle situations it hasn't encountered before by utilizing previous learning.

Lifelong Learning Machines (L2M) - concerned with learning while executing and improving over a lifetime

11. Can Deep Learning provide Deep Insights for Neuroscience, Bruno Olshausen, Berkeley

Embrace complexity of biology

  • Neuroscience has moved a lot; a biological neuron is not the neuron of ML
  • Cortical Circuits:
  1. Highly Organized by layer
  2. Layers are interconnected in a canonical microcircuit
  3. Feed-back connections

What problems should we be solving?

He showed pictures of animals with impressive capabilities.

Animal vision systems are very robust and low-power.

  • Nakayama et al.(1995)

  • O'regan & Noe(2001)

  • Mumford (2010) Pattern Theory:
    - sparse discreteness
    - transformations
    - hierarchy

The Sparse Manifold Transform (Yubei Chen, NeurIPS 2018)

12. Dissecting Neural Networks, Antonio Torralba, MIT

Very fun talk :))

Ten billion dollars were spent on CERN data; we spend very little on learning datasets.

Cycle of Deep Learning: we realize datasets are biased, then google (?), then new datasets

Understanding Deep Representations: Network Dissection (~visualization)

Test units for semantic segmentation with top-activated images, scored by IoU. GANs: just as we can identify which neuron is responsible for detecting which object in a CNN, we can also identify which neuron draws which part of the image in a GAN. (A minimal IoU sketch is below.)
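A minimal sketch of the Network-Dissection-style IoU score between one unit's activation map and one concept's segmentation mask; the threshold quantile and the toy data are illustrative assumptions:

```python
import numpy as np

def dissection_iou(activation, seg_mask, quantile=0.995):
    """Network-Dissection-style score for one unit and one concept: threshold
    the unit's activation map at a high quantile and compute IoU with the
    concept's segmentation mask (both HxW, mask boolean)."""
    thresh = np.quantile(activation, quantile)
    unit_mask = activation > thresh
    inter = np.logical_and(unit_mask, seg_mask).sum()
    union = np.logical_or(unit_mask, seg_mask).sum()
    return inter / union if union > 0 else 0.0

# toy example: a unit that fires on the top-left quadrant, and a concept mask
# covering roughly the same region
act = np.zeros((64, 64)); act[:32, :32] = np.random.rand(32, 32)
mask = np.zeros((64, 64), dtype=bool); mask[:32, :32] = True
print(dissection_iou(act, mask))   # small here; real dissection aggregates over a dataset
```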

13. Super Intelligence, Rodney Brooks

History of AI

Turing papers mentioned

Approaches to AI

a) Symbolic Approach

Logic, statements about symbols, inference and reasoning
Compositionality in symbolic systems
Symbols are not grounded

b) Neural Networks

c) Traditional Robotics

Finding corners and feature points in a picture

d) Behavior Robotics

Behavior trees

What are we doing wrong currently? There were some fun comics. A better Turing test? Get machines to do real tasks in the world.

Hard Things to Do

Real Perception

Ex: a chessboard image where the gray and white squares have the same pixel intensity

Ex: a blue-filtered strawberry image whose RGB values are not red but rather blue, yet we still see red

Our perception adjusts according to the context.

The audience learned a category from 3 images :)

Real Manipulation:

Read a book

Common Sense Reasoning

What should work on

  • The object recognition capabilities of a 2-year-old
  • The language capabilities of a 4-year-old
  • The manual dexterity of a 6-year-old
  • The social understanding of an 8-year-old

Day 2

1. Networks of neurons for learning and representing symbols in the brain, Tomaso Poggio, MIT

missed!

2. Inductive Bias and Optimization in DL - Nati Srebro, TTIC

Goals:

  1. Capacity of the learning system - how many samples do we need to generalize?
  2. Expressiveness - can we capture reality?
    • One opinion: NNs can approximate any function. The objective could be expressiveness with small samples. In some cases, even small networks can capture everything.
  3. Computation/Optimization: it is NP-hard to find the weights even with 2 hidden units. Even for the simplest NN with O(log d) units and no noise, no poly-time algorithm always works.
    • Thus there might be some magic property of reality that makes local search work.

Experiment

As the number of hidden units increases, the training error decreases. In one trial it turns out that the test error also keeps decreasing as the number of parameters increases; in repeated trials, most of the test errors are large. Maybe in the cases where the test fails despite zero training error, it is the norm of the parameters that is high, not their number. So what is the relevant "complexity measure" (a norm, etc.), and how is it minimized by the optimization algorithm? (A toy version of this experiment is sketched below.)

Ref. Neyshabur, Tomioka, Srebro, ICLR '15
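A toy version of the width experiment described above, assuming a small PyTorch MLP on a made-up 1-D regression task (hyperparameters, dataset, and the choice of Adam are illustrative, not the talk's setup). It reports the training error alongside the squared weight norm, the candidate "complexity measure":

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(-1, 1, 30).unsqueeze(1)          # small fixed dataset
y = torch.sin(3 * x) + 0.1 * torch.randn_like(x)    # noisy 1-D regression targets

def train_mlp(width, steps=2000, lr=0.01):
    """Fit a one-hidden-layer network of the given width; report its training
    error and the squared L2 norm of its weights."""
    net = nn.Sequential(nn.Linear(1, width), nn.ReLU(), nn.Linear(width, 1))
    opt = torch.optim.Adam(net.parameters(), lr=lr)  # Adam keeps training stable across widths
    for _ in range(steps):
        opt.zero_grad()
        loss = ((net(x) - y) ** 2).mean()
        loss.backward()
        opt.step()
    norm_sq = sum((p ** 2).sum().item() for p in net.parameters())
    return loss.item(), norm_sq

for width in [2, 8, 32, 128, 512]:
    err, norm_sq = train_mlp(width)
    print(f"width={width:4d}  train_mse={err:.4f}  weight_norm_sq={norm_sq:.1f}")
```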

SGD vs ADAM

Optimization

Different optimization algorithm -> Different bias in optimum reached -> Different inductive bias -> Different generalization properties

Need to understand the optimization algorithm not just as reaching some (global) optimum, but as reaching a specific optimum. The choice of optimization algorithm matters! The solution space is like an ocean.

Example 1: Unconstrained Matrix Completion

Ref. Gunasekar Woodworth Bhojanapalli Neyshabur 2017

Gradient descent (with a small step size, etc.) finds not just any global minimum but the minimum-nuclear-norm solution, which brings generalization. (A toy sketch is below.)
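A toy sketch of this implicit-bias effect, in the spirit of (but much simpler than) the referenced setup: gradient descent on an unconstrained factorization X = U Vᵀ from a small initialization, observing only a random half of a low-rank matrix. The sizes, step size, and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 20, 2
M = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))   # rank-2 ground-truth matrix
mask = rng.random((n, n)) < 0.5                          # observe roughly half the entries

# Gradient descent on an unconstrained factorization X = U V^T, from a small
# initialization and with a small step size -- the regime in which it is
# observed to converge to a low-nuclear-norm solution.
U = 1e-3 * rng.normal(size=(n, n))
V = 1e-3 * rng.normal(size=(n, n))
lr = 0.01
for _ in range(50000):
    R = mask * (U @ V.T - M)                 # residual on the observed entries only
    U, V = U - lr * (R @ V), V - lr * (R.T @ U)

X = U @ V.T
# should be small if the implicit bias recovered the low-rank ground truth
print("RMSE on unobserved entries:", np.sqrt(np.mean((X - M)[~mask] ** 2)))
print("nuclear norm of the solution:", np.linalg.svd(X, compute_uv=False).sum())
```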

Example 2: A Single Overparametrized Linear Unit

Example 3: Over-parametrization in Linear Convolutional Nets

Optimization geometry, and hence the inductive bias, is affected by the geometry of local search in parameter space and by the characterization of the parametrization.

3. Peter Bartlett

Generalization: prediction accuracy on the test set

Typical Theorem: pred_err <= trn_err+complexity_penalty [1]

Agenda

  1. Empirical process theory for classification

  2. Margins analysis: relating classification to regression

  3. Interpolation: there is no apparent tradeoff between fit and complexity

  4. Interpolation in Linear Regression

VC Theory

P(f(x) ≠ y) <= (1/n) * (# training classification errors) + sqrt( (c/n) * (VCdim(F) + log(1/δ)) )

The VC-dimension of neural networks grows with p (# of parameters) and L (# of layers):

  • roughly p if the nonlinearity is piecewise constant
  • roughly pL if the nonlinearity is piecewise linear

A classification problem becomes a regression problem if we use a loss function that doesn't vary too quickly.

For regression, the complexity of a NN is controlled by the size of parameters.

Interpolation in DL - a new challenge for statistical learning theory

Deep networks can be trained to zero training error (for a regression loss) with near state-of-the-art performance, even for noisy problems. Thus there is no sign of the tradeoff between fit to training data and complexity suggested by [1]. Ref. Zhang, Bengio, Hardt et al. 2017 and Belkin et al. 2018.

Interpolation in Linear Regression

Classical linear regression setting, with n samples. f(x) = x'\theta, squared error as loss, risk = E[loss].

Choose \theta^ to minimize the training error average.

Excess expected loss: Empirical Risk - True Risk

^Q is corrupted because our view of the covariance of x is distorted by x1, x2, ..., xn, and also by the noise.

Accurate interpolating prediction as dimension p_n grows.

Consider covariance of x in two pieces

  • a fixed piece due to dimension k
  • a tail which flattens with n
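A toy sketch of accurate interpolating prediction as the dimension grows, using the two-piece covariance just described (a few strong directions plus a long flat tail); the minimum-norm interpolant is computed with a pseudo-inverse, and all sizes and scales are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def experiment(n=100, k=5, p=2000, tail_scale=0.05, noise=0.5):
    """Minimum-norm interpolation with a spiked covariance: k strong directions
    carry the signal, a long flat tail of p-k weak directions can absorb
    ('hide') the noise."""
    scales = np.concatenate([np.ones(k), tail_scale * np.ones(p - k)])
    theta_star = np.concatenate([np.ones(k), np.zeros(p - k)])   # signal lives in the spike
    X = rng.normal(size=(n, p)) * scales
    y = X @ theta_star + noise * rng.normal(size=n)

    theta_hat = np.linalg.pinv(X) @ y        # minimum-norm solution; zero training error
    X_test = rng.normal(size=(2000, p)) * scales
    y_test = X_test @ theta_star
    train_mse = np.mean((X @ theta_hat - y) ** 2)
    test_mse = np.mean((X_test @ theta_hat - y_test) ** 2)
    return train_mse, test_mse

for p in [200, 1000, 5000]:
    tr, te = experiment(p=p)
    print(f"p={p:5d}  train_mse={tr:.2e}  test_mse={te:.3f}")   # fits the noise, yet predicts
```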

Summary

Interpolation: far from the regime of a trade off between fit to training data and complexity.

In high-dimensional linear regression, if the covariance has a long, flat tail, the minimum-norm interpolant can hide the noise in these many unimportant directions.

  • Relies on overparametrization
  • and lots of unimportant parameters

Can we extend these results to interpolating deep networks?

Empirical process theory for classification: need n>>p

Margins analysis: with a Lipschitz loss, the complexity can depend on the size of the parameters.

Interpolation: a new challenge. Where is the tradeoff between fit and complexity?

  • Interpolation in linear regression can exploit overparametrization to hide the noise.

4. Why neuroscience needs science of DL , Konrad Kording, UPenn

Goals in computational systems neuroscience

  1. Understandable.
  2. Should work.

5. Does AI come at a cost? Instabilities in DL - Anders Hansen

DeepFool was developed at EPFL to test the instability of NNs.

Theorem: There are uncountably many classification problems.

Key point: there is always a NN that achieves zero training error but still generalizes.

Question: Can stable neural networks be produced using recursion?

Example: Ref. "On instabilities of deep learning in image reconstruction", Antun, Renna, Poon et al.

Image reconstruction with NNs is completely unstable.

If you overperform in two images, things go wrong (instability).

The instability problem is a nontrivial one, but we can test networks against instabilities. The cure is DL theory.

6. Challenge and scope for Empirical Modeling for ML - Ronald Coifman, Yale

At this point ML provides encoders/tabulation and regression. The real quest should be to find intrinsic variables which enable direct consistency and a performance match between algorithmic learners.

7. Panel - Julia Kempe & Eero Simoncelli

Expressive theory vs General Theory

What do our students care about? Computation, data size (n=1), instabilities.

8. Dataset for Analyzing Face Recognition - Jonathon Phillips, NIST

Datasets

  • FERET, Dept of State, Mugshots (2010) - 1.6 million images

Two questions: Verification and Rank 1 recognition (who is this person?)

Ref. Lessons from collecting a million biometric samples - Philips, Flynn

Face recognition accuracy of forensic examiners, superrecognizers and algorithms

An experiment included human recognizers in four groups with different expertise levels.

The best recognizer is created by combining one facial examiner with the A2017b algorithm.

9. Neural Solvers for Power Transmission Problems, Isabelle Guyon, Paris-Sud University

AI & Electricity

Thesis works

  • Deep learning methods for predicting flows in the power grid, by Benjamin Donnot
  • RL for controlling power grids

The load flow: Input: production, topology etc. --- numeric solver ---> output: power flows

One example of a numeric solver is Hades 2; the challenge is speed: ~100 ms should be made faster by 2 orders of magnitude.

LEAPNet - Latent Encoding of Atypical Perturbations

  • Generalizes to combinatorial topology changes

LEAPNet is able to predict around operating conditions. Ref. LEAPNets for power grid perturbations, Donnot et. al. 2019.

GNS (Graph Neural Solver) for power systems - iteratively propagates messages through the edges (a minimal sketch below)
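A minimal sketch of the message-passing idea behind a graph neural solver; the update rule, weights, and toy ring "grid" below are illustrative assumptions, not the actual GNS model:

```python
import numpy as np

def propagate(node_state, edges, edge_weight, steps=10):
    """Minimal message passing: at every step each node sums weighted messages
    from its neighbours and updates its own state."""
    for _ in range(steps):
        messages = np.zeros_like(node_state)
        for (i, j), w in zip(edges, edge_weight):
            messages[j] += w * node_state[i]    # message along edge i -> j
            messages[i] += w * node_state[j]    # and back (undirected line)
        node_state = np.tanh(node_state + 0.5 * messages)
    return node_state

# toy 4-bus "grid": a ring of buses with unit line weights
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
weights = [1.0, 1.0, 1.0, 1.0]
state = np.array([1.0, 0.0, 0.0, 0.0])          # e.g. an injection at bus 0
print(propagate(state, edges, weights))
```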

Conclusion: Augmented intelligence = operators + hades2 + NNs

10. From Deep Reinforcement Learning to AI, Doina Precup, McGill-MILA

The standard RL scheme is inspired by animals; in AlphaGo the environment is very clear and the reward function is well known.

Golden Goal: Efficient, continual learning and reasoning

Knowledge Representation of AlphaGo (policy and value)

Procedural knowledge and predictive/empirical knowledge

Knowledge must be: Expressive, Learnable, Composable

Procedural Knowledge: Options

  • Option: (initiation set, policy, termination condition)

Options as behavioral programs
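A minimal sketch of an option as the triple above, with states and actions left as plain integers; the "go to state 5" option is purely illustrative:

```python
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    """An option as the triple from the talk: where it can start, how it
    behaves while running, and when it stops."""
    initiation_set: Set[int]                  # states where the option may be invoked
    policy: Callable[[int], int]              # state -> action while the option runs
    termination: Callable[[int], float]       # state -> probability of terminating

# a toy "go to state 5" option on a small chain MDP
goto5 = Option(
    initiation_set={0, 1, 2, 3, 4},
    policy=lambda s: +1,                      # always move right
    termination=lambda s: 1.0 if s == 5 else 0.0,
)
print(goto5.termination(3), goto5.termination(5))
```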

Where do options come from: Domain knowledge, Option-Critic Models

Back to value function

Knowledge Representation: Generalized Value Functions (cumulant function, continuation function), coming from the Horde architecture

Option Models

Lifelong Learning Agent

11. Theory-based measures of object representations in deep artificial and biological networks, Haim Sompolinsky, Hebrew University of Jerusalem

! Not familiar with the subject, so I couldn't write much !

Untangling Object Manifolds

Object classification capacity

12. NNs in Speech Recognition - Tara Sainath , Google AI

Conventional ASR pipeline:

Input speech -> Feature extraction -> DNN/RNN Acoustic Models -> Decoder -> Second Pass -> Rescoring Output

NNs helped combine the feature extraction and classification steps into one.

Depth in speech: in lower layers, similar phones from different people are grouped together, whereas in higher layers better discrimination is achieved.

Ref. B. Li et al Interspeech 2017

Multi-channel neural networks for Google HOME

What does the network learn? Filters are doing spatial and spectral filtering.

Model: an end-to-end trained seq2seq recognizer combining the whole pipeline, for the sake of simplicity, a smaller model size, and joint optimization.

Ref. C. Chiu et al., ICASSP 2018 - the conventional baseline model was outperformed by the E2E model, which was launched in Gboard.

Tail cases in speech recognition: numerics, injecting context, injecting domain knowledge.

13. Panel: What's missing in today's experimental analysis of DL? P. Jonathon Phillips, Jitendra Malik, Peter Bartlett, Antonio Torralba, Isabelle Guyon

Question: If a breakthrough happens, do we have the capacity to realize/test it?

Reproducibility facilitated the DL revolution.

Datasets are biased such that the creator's algorithm shines.

14. Right Ways Forward(I): Terrence Sejnowski, Salk Institute for Biological Studies

! The organizers requested short talks starting with this one, so there was not much to note !

High-dimensional geometry and subspaces will become important. Adversarial examples: perturbations could help build NNs that are robust against adversarial attacks. We're looking for general architectural principles.

15. Right Ways Forward(II): Jon Kleinberg, Cornell

Social policy and algorithmic decisions. Screening as a prediction problem: tabular structures turned into predictions, e.g. a CV. There is an interpretability problem for human decisions too. Two categories of discrimination:

  • Disparate treatment: deliberately favoring individuals based on race, gender, etc.
  • Disparate impact: regardless of intent, the output is disproportionate.

The challenge in correcting for human bias

Key argument: well regulated algorithms can make discrimination easier to detect.

Decomposing a gap in outcomes: Disparity = structural disparity + bias from the choice of outcome + ...

16. From machine learning to Artificial Intelligence: Leon Bottou, Facebook AI Research

Caveat 1: the statistical problem is only a proxy for the real problem.

ML algorithms recklessly take advantage of spurious correlations.

Caveat 2: Causality

Viewpoints on causality: manipulative causation; causal invariance; causal reasoning; dispositional causation (where do causations come from?); causal intuition (correlation is not causation, but the data contains hints, for instance asymmetric relations). The scientific method is a good model of a learning process: hypothesis generation precedes empirical validation.
