@WhoSayIn
Created May 10, 2018 07:43
DeepMind: From Generative Models to Generative Agents - Transcript
00:00
Good morning. Hi, my name is Amelie, I'm going to be the session chair for the morning, so it's my great pleasure to introduce Koray Kavukcuoglu, who is going to give an invited talk. Koray is a director of research at DeepMind and he's one of the star researchers in our community. He has contributed to many highly influential projects at DeepMind, such as spatial transformer networks, autoregressive generative models such as pixel recurrent networks and WaveNet, and deep reinforcement learning for playing Atari games and AlphaGo. Today he will talk about going from generative models to generative agents, so let's welcome Koray.
00:47
[Applause]
00:48
[Music]
00:53
Thank you very much for the very nice introduction, and thanks everyone for being here, it's an absolute pleasure. So as was mentioned, I'm going to try to talk about unsupervised learning in general, starting from generative models, maybe the classical way, and then I'll try to give another view that I think is quite interesting, that we have been working on recently. When I think about what the important things for us to do as a community are, I think everyone here sort of agrees that in the end what is important is to be doing unsupervised learning.
01:31
we sort of realize that supervised
01:33
learning has all sorts of successes but
01:37
in the end unsupervised learning is kind
01:39
of like the next frontier and when I
01:42
think about unsupervised learning there
01:46
are there are sort of like different
01:48
explanations that come to my mind and
01:50
when talking to people I think we all
01:52
have sort of different opinions on this
01:54
one of the things that I think is a
01:57
common explanation is we have an
01:58
unsupervised learning algorithm we run
02:00
it on our data what we expect is the
02:02
algorithm to understand our data and to
02:04
explain our data or or or our
02:07
environment right and and what we expect
02:10
from this is that the algorithm is going
02:12
to learn the intrinsic properties of our
02:14
data of our environment and then it's
02:16
going to be able to explain that through
02:18
those properties but most of the time
02:20
what happens is because of the kinds of
02:24
models that we use we resort to and at
02:26
the end
02:26
looking at samples and what we look at
02:28
the samples we try to see that did our
02:31
model really understand the environment
02:32
and if it understood the environment
02:34
then then the sample should be
02:36
meaningful. Of course we look at all sorts of objective measures that we use during training, like Inception scores, likelihoods and such, but in the end we always resort to samples to understand whether our model really can explain what's going on in the environment. The other kind of general explanation that we all use is that the goal of unsupervised learning is to learn rich representations, right; it's already embedded in the name of this conference: the main goal of deep learning, of unsupervised learning, is learning
03:05
those representations. But then when we think about those representations, again this explanation doesn't give us an objective measure. What we think about is how we are going to judge those representations in terms of being good and useful, and to me the most important bit is that if we have good and rich representations, then they are useful for generalization, for transfer, right. If you have a good unsupervised learning model and it can give us good representations, then we can get
03:37
generalization. So what I'm going to do today is also tie it together with something else that I think is very important. As I've mentioned, a big chunk of the work that we have been doing at DeepMind, that I've been doing, is about agents and reinforcement learning, and in this talk I'm going to take a look at unsupervised learning both in the classical sense of learning a generative model and also in the sense of learning an agent that can do unsupervised learning.
04:03
So I'm going to start from the WaveNet model. Hopefully, as many of you know, it is a generative model of audio; it's a pure deep learning model, and it turns out you can model any audio signal, like speech and music, and then you can get really realistic samples out of that.
04:25
The next thing I'm going to do is explain this other sort of new approach to unsupervised learning that I find really interesting, which is based on deep reinforcement learning: learning an agent that actually does unsupervised learning. This model, called SPIRAL, is based on a new agent architecture that we have been working on and have published recently, called IMPALA; it's a very large, highly scalable, efficient off-policy learning agent architecture that we use in SPIRAL to do unsupervised learning. The interesting bit about the SPIRAL work is that it does generalization through using some sort of tool space, tools that we as people have created so that we can solve not one specific problem but many different problems, and using the interface of a tool and having an agent, you can actually now learn a generative model of your environment.
05:19
All right, so without more delay, the first thing that I'm going to quickly introduce is the WaveNet model. WaveNet is a generative model of audio; as I said, it models the raw audio signal, it doesn't use any sort of interface to model the audio. Audio in general is very high dimensional, so the standard audio signal that we started with at the beginning was 16,000 samples per second. If you compare that to our usual language modelling and machine translation kinds of tasks, it is several orders of magnitude more data, so the kinds of dependencies that one needs to model to be able to produce good audio signals are very long. What this model does is it models one sample at a time, and it uses a softmax distribution to model each sample, conditioned on all the previous samples of the signal. When you look at it more closely, though, it is an architecture that has quite a bit of resemblance to the PixelCNN model, which maybe some of you are also familiar with: in the end it is a stack of multiple convolutional layers. To be a little bit more specific, it has these residual blocks, you use multiples of those residual blocks, and in each residual block there are these dilated convolutional layers that go on top of each other, and through those dilated convolutional layers, which are causal convolutions, we can model very long dependencies in time.
07:06
One of the biggest design considerations about WaveNet is that it is designed to be very efficient during training, because during training all the targets are known: when you generate the signal, you generate the whole signal at once, just run it like a convolutional net, you get your signal, and because you have the targets you get your error signal from that and propagate it back, so training is very efficient. But of course when it comes to sampling time, in the end this is an autoregressive model, and through those causal convolutions you need to run one sample at a time. So if you are sampling, let's say, 24 kilohertz, 24,000 samples per second, you need to generate one sample at a time, just like you see in this animation, and of course this is painful, but in the end it works quite well and we can generate very high quality audio with this.
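To make those two halves concrete, here is a minimal sketch, not the released WaveNet, of a stack of dilated causal 1-D convolutions together with the one-sample-at-a-time generation loop described above. The layer sizes, the simplified residual block, and the 256-way output are illustrative assumptions.

```python
# Minimal sketch of a dilated causal convolution stack and autoregressive sampling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Conv1d):
    """1-D convolution that only looks at past samples (left padding)."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation):
        super().__init__(in_ch, out_ch, kernel_size, dilation=dilation)
        self.left_pad = dilation * (kernel_size - 1)

    def forward(self, x):
        return super().forward(F.pad(x, (self.left_pad, 0)))

class TinyWaveNet(nn.Module):
    def __init__(self, channels=32, n_layers=8, n_classes=256):
        super().__init__()
        self.input = CausalConv1d(1, channels, kernel_size=2, dilation=1)
        # Dilations double each layer (1, 2, 4, ...), so the receptive field
        # grows exponentially with depth: few layers, long dependencies.
        self.layers = nn.ModuleList(
            [CausalConv1d(channels, channels, kernel_size=2, dilation=2 ** i)
             for i in range(n_layers)])
        self.output = nn.Conv1d(channels, n_classes, kernel_size=1)

    def forward(self, x):                      # x: (batch, 1, time)
        h = self.input(x)
        for layer in self.layers:
            h = h + torch.relu(layer(h))       # simplified residual block
        return self.output(h)                  # logits (softmax over 256 sample values)

def sample(model, n_samples=1600):
    """Autoregressive generation: one sample at a time, as in the animation."""
    audio = torch.zeros(1, 1, 1)
    for _ in range(n_samples):
        logits = model(audio)[:, :, -1]                      # predict next sample
        nxt = torch.multinomial(F.softmax(logits, -1), 1)
        nxt = nxt.float().view(1, 1, 1) / 255.0 * 2 - 1      # back to [-1, 1]
        audio = torch.cat([audio, nxt], dim=2)
    return audio
```

With kernel size 2 and dilations 1, 2, 4, and so on, the receptive field doubles with every layer, which is why a fairly shallow stack can cover the several-thousand-sample dependencies mentioned above, while sampling still has to proceed one step at a time.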
07:53
So what I want to do is actually make you listen to unconditional samples from this model. We model the speech signal without any conditioning on text or anything, we just take the audio signal and model it with WaveNet, and then when you sample, this is the kind of thing you get. So as you can see, or hear, hopefully, the quality is very high, and this is modeling really the raw audio signal, and this is
08:40
completely unconditional so what you
08:42
hear is sometimes you even hear short
08:44
words, like "okay" or "from", and then if you try to listen, all the intonation and
08:49
everything sounds quite natural and
08:52
sometimes it feels like you are
08:53
listening to someone speaking in a
08:55
language that you don't know so the the
08:57
main characteristics of the of the
09:00
signal is all captured there so in terms
09:02
of dependencies we are looking into like
09:04
something like several thousand samples
09:06
of dependencies are actually properly
09:09
and correctly modelled there and then of
09:12
course sorry and then of course what you
09:16
can do is you can you can augment this
09:18
model by conditioning on a text signal
09:22
that is associated with the signal that
09:24
you want to generate and by conditioning
09:26
on the text signal now you have a
09:28
generative model a conditional
09:30
generative model that actually solves a
09:32
real-world problem just by itself and
09:34
turn deep learning right so
09:37
the text you create the linguistic
09:38
embeddings from that using those
09:40
linguistic embeddings you can generate
09:42
the signal and then and then it starts
09:46
it's not talking right so it's a it's a
09:48
solution to the whole text to speech
09:51
synthesis problem that as you know is
09:53
very very common used in in in real
09:57
world sorry alright so when we did this
10:03
the the bayonet model and this was
10:07
around like almost two years ago now we
10:10
looked at the we looked at equality when
10:12
we use it as a TTS model and in green
10:15
what you see is the quality of the human
10:17
speech I can obtain through this mean
10:19
opinion scores and in blue you see the
10:21
wavenet and the other colors are the
10:23
other models that were the best models
10:25
around and at the time and you can see
10:27
that they met close the gap between the
10:30
human called speech and other models by
10:33
by a big margin so at the time this this
10:37
really got us excited because now we
10:39
actually had a model a deep learning
10:41
model that comes with all the
10:42
flexibilities and advantages of doing
10:44
deep learning and at the same time it's
10:46
modeling raw audio and it is it is it is
10:49
very very high quality
10:50
I could play text-to-speech samples generated by this model, but actually, and this is what I'm going to go into next, if you are using Google Assistant right now you are already hearing WaveNet there, because this is already in production. So anyone who's using Google Assistant and querying Wikipedia and things like that, the speech that is generated there is actually coming from the WaveNet model, and what I want to do is explain how we did that, and that brings me to our next project that we did in the WaveNet domain: this is the parallel WaveNet project.
11:27
Of course, when you have a research project and at some point you realize that it actually lends itself to the solution of a real-world problem, and you want to put it into production in a very challenging environment, then of course it requires much more than our little research group, so this was a big cooperation between the DeepMind research and applied teams and the Google speech team. So in this slide what I show is basically the basic ingredients of how we turned the WaveNet architecture into a feed-forward, parallel architecture,
12:03
because what we realized pretty soon when we started attempting to put a system like this into production was that, of course, speed is very important and quality is very important, but the importance of speed is such that it is not enough to run in real time: the kind of constraints that we were chasing were orders of magnitude faster than real time, even being able to run in constant time. And when the constraint becomes being able to run in constant time, the only thing you can do is create a feed-forward network and then parallelize the signal generation, right. So that is what we did.
12:43
So in this slide, at the top, what you see is the usual WaveNet model; we call it the teacher. In this setting, this WaveNet model is pre-trained, it is fixed, and it is used as a scoring function. At the bottom, what you see is the generator, which we call the student, and this student model is again an architecture that is very close to WaveNet, but it is run as a feed-forward convolutional network. The way it is run and trained is that it has two components: one component is coming from WaveNet, which we know is very efficient in training, as I said, but slow in sampling; the other is based on the inverse autoregressive flow work that was done by Kingma and colleagues at OpenAI last year. This structure gives us the capability to take an input noise signal and slowly transform that noise signal into a proper distribution that is going to be the speech signal. So the way we train this is: random noise goes in, together with the linguistic features, through layers and layers of these flows, that random noise gets transformed into a speech signal, that speech signal goes into WaveNet, and WaveNet is already about the best kind of scoring function that we can use, because it's a density model. WaveNet scores that, and from that score we get the gradients back into the generator, and then we update the generator. We call this process probability density distillation. But of course, when you are
14:15
trying to do real-world things, and things are very challenging, like speech signals, that by itself is not enough, so I have highlighted two components here. One of them, as I said, is the WaveNet scoring function; the other thing that we use is a power loss, because what happens is that when we train the model in this manner, the signal tends to be very low energy, sort of like whispering: someone speaks, but they are whispering. So during training we added this extra loss that tries to conserve the energy of the generated speech, and with these two, the WaveNet scoring and the power loss, we were already getting very high quality speech, but of
14:54
course like the constraints are very
14:55
very tough, and what we did was train another WaveNet model. So we sort of used WaveNet everywhere, right: we are generating through a WaveNet-like convolutional network, we are using WaveNet as a scoring function, and we again trained another WaveNet model, this time used as a speech recognition system, and that is the perceptual loss that you see there. So we train WaveNet again as a speech recognition system, and what we do during training is: of course you have the text and the corresponding speech signal, we generate the corresponding speech through our generator, we give the text to the speech recognition system, and the speech recognition system now needs to decode the generated signal into that text, right. We get the error from there and propagate it back into our generator, so that's another sort of quality improvement that we get by using speech recognition as a perceptual loss in our generation system.
15:47
And the last thing that we did was use a contrastive term, which basically says: OK, we generate a signal conditioned on some text; you can create a contrastive loss saying that the signal that is generated with the corresponding text should score differently than the same signal if it was conditioned on a separate text, right. That's the contrastive loss. So more specifically, in the end we end up with these four terms: at the top we see the original one, using WaveNet as a scoring function, which is the probability density distillation idea; then we have the power loss, which uses Fourier transforms internally to conserve the energy; the contrastive term; and finally the perceptual loss that does the speech recognition.
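As a rough sketch of how those four terms could fit together in code (the student, teacher, ASR model, feature inputs and loss weights below are hypothetical stand-ins, not the production system), the combined objective might look something like this:

```python
# Sketch of combining the four training terms described above (all stand-ins).
import torch
import torch.nn.functional as F

def parallel_wavenet_style_loss(student, teacher, asr_model,
                                noise, ling_feats, wrong_feats,
                                target_text, real_audio,
                                w_power=1.0, w_perc=1.0, w_contrast=0.3):
    # 1) Distillation: the student transforms noise into audio and reports its
    #    own log-density; the frozen teacher scores the same audio. Minimising
    #    (log q - log p) is a Monte Carlo estimate of KL(student || teacher).
    audio, student_logprob = student(noise, ling_feats)      # audio: (batch, time)
    teacher_logprob = teacher.log_prob(audio, ling_feats)    # teacher weights frozen
    distill = (student_logprob - teacher_logprob).mean()

    # 2) Power loss: match the average spectral energy of real speech so the
    #    generator does not collapse into low-energy "whispering".
    def avg_power(x):
        spec = torch.stft(x, n_fft=512, hop_length=128, return_complex=True)
        return spec.abs().pow(2).mean(dim=-1)
    power_loss = F.mse_loss(avg_power(audio), avg_power(real_audio))

    # 3) Perceptual loss: a model trained as a speech recogniser must still
    #    decode the generated audio back into the conditioning text.
    perceptual = asr_model.recognition_loss(audio, target_text)

    # 4) Contrastive term: the generated audio should score worse under the
    #    teacher when paired with mismatched linguistic features.
    contrast = (teacher.log_prob(audio, wrong_feats) - teacher_logprob).mean()

    return distill + w_power * power_loss + w_perc * perceptual + w_contrast * contrast
```

In this sketch the teacher's parameters stay frozen and only the student receives gradients; the exact form of each term in the published system differs, but the division of labour between the four losses is the one described in the talk.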
16:42
When we had all of these, then of course what we did was look at the quality. What I'm showing here is the quality with respect to, again, the best non-WaveNet model, so this is about a year after the original research, pretty much exactly a year, and during that time of course the best speech synthesis models also improved, but WaveNet was still better than anything else, and the new model, parallel WaveNet, is exactly matching the quality of the original WaveNet.
17:18
What I'm showing here is three different US English voices and also Japanese, and this is the kind of thing that we always want from deep learning, right: the ability to generalize to new datasets, to new domains. We developed all of this with practically one single US English voice, and it was just a matter of collecting another dataset from another speaker or another language, like some speaker speaking Japanese; you just get that, run it, and there you go, you have a production-quality speech synthesis system just by doing that. This is the kind of thing that we really like from deep learning, and if you are thinking about deep learning and about unsupervised learning, I think this is a very good demonstration of that.
17:59
so before switching to the next one I
18:02
also want to mention that we have also
18:04
done some further work on this called
18:06
WaveRNN, and that is recently published,
18:08
and
18:09
I encourage you to look into that one
18:11
too that's a very interesting piece of
18:12
work also for generating speech at very
18:15
very high speed. The next thing I want to talk about is the IMPALA architecture, the new agent architecture that I mentioned, because, as I said, WaveNet is an unsupervised model in the classical sense that can actually solve a real-world problem; the next thing I want to start talking about is this new, different way of doing unsupervised learning, but for that, another exciting bit is to be able to do deep reinforcement learning at scale.
18:47
sorry all right so I want to sort of
18:54
motivate why do we want to actually push
18:56
our deep reinforcement learning models
18:57
further and further because most of the
18:59
time, because this is a new area, what we do is take sort of very simple tasks in some simple environments, and we try to train an agent that solves a single task in that environment well. What we want to do is go further than that, right, again going back to the point of generalization and being able to solve multiple tasks. We have created a new task set; this is an open-source task set: we have an open-source environment called DeepMind Lab, and as part of that we have created this new task set, DMLab-30. It is 30 environments covering tasks around language, memory, navigation and those kinds of things, and the goal is not to solve each one of them individually; the goal is to have one single agent, one single network, that is solving all those tasks at the same time.
19:50
there is nothing custom in that agent
19:52
that is specific to any single one of
19:55
these environments when you look at
19:56
those environments I'm showing some of
19:59
those here: the agent has a first-person
20:02
view so it is in like a maze-like
20:04
environment and the agent has a
20:06
first-person view camera input and it
20:08
can navigate around go forward backwards
20:10
and rotate around look up down jump and
20:13
those kinds of things and and it is
20:16
solving all different kinds of tasks
20:18
that are that are catered to test
20:19
different
20:20
kinds of kinds of abilities but the goal
20:22
is as I said again to solve all of them
20:24
at the same time. One thing that becomes really important in this case is of course the stability of our algorithms, because now we are not solving one single task, we are solving 30 of them, and we want really stable models, because we don't have the chance to tune hyperparameters for one single task anymore. And of course what becomes really important is task interference, right: hopefully what we expect, again by using deep learning, is that this is a multi-task setting, and in this multi-task setting we hope to see positive transfer rather than task interference, and we hope to demonstrate this in this challenging reinforcement learning domain too. OK, I sort of
21:03
realized that I needed to put a slide about why deep reinforcement learning, because, a little bit to my surprise, there was actually not much reinforcement learning at this conference this year, and I wanted to touch a little bit on why I think it is important for the deep learning community, for this community, to actually do deep reinforcement learning. To me, if one of the goals that we work towards here is AI, then it is at the core of all of it, right: reinforcement learning is a very general framework
21:33
for doing sequential decision-making for
21:36
learning sequential decision making
21:38
tasks. And deep learning, on the other hand, is of course the best model, the best set of algorithms we have to learn representations, and the combination of these two is the best answer we have so far in terms of learning very good state representations for very challenging tasks, not just for solving toy domains, but actually for solving challenging real-world problems. Of course there are many open problems there; some that are interesting at least to me are the idea of separating the computational power of a model from the number of weights or the number of layers it has, and, basically again going back to unsupervised learning, learning to transfer, so building these deep reinforcement learning models with the idea to actually generalize, to transfer. OK, so
22:39
the IMPALA agent is based on another piece of work that we did a couple of years ago called asynchronous advantage actor-critic, the A3C model. In the end it's a policy gradient method; what you have, as I tried to cartoonishly explain in the figure, is that at every time step the agent sees the environment, and at that time step the agent outputs a policy distribution and also a value function. The value function is the agent's expectation of the total amount of reward that it's going to get until the end of the episode, being in that state, and the policy is the distribution over the actions that the agent has. At every time step the agent looks at the environment and updates its policy, so that it can actually act in the environment, and it updates its value function. The way you train this is with the policy gradient; intuitively it is actually very simple: the gradient of the policy is scaled by the difference between the total reward that the agent actually gets in the environment and the baseline, and the baseline is the value function. So what it means is: if the agent ends up doing better than what the value function, its assumption, was, then it's a good thing, you have a positive gradient, you're going to reinforce your understanding of the environment; if the agent does worse than it expected, so the value was higher than the total reward that you got, then you have a negative gradient and you need to shuffle things around. And the way you learn the value function is by the usual n-step TD error.
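As a minimal sketch of that actor-critic update (not DeepMind's code; the segment length, discount and value-loss weighting below are illustrative assumptions):

```python
# Sketch of the advantage actor-critic loss: policy gradient scaled by
# (n-step return - value baseline), plus an n-step TD regression for the value.
import torch

def actor_critic_loss(log_probs, values, rewards, bootstrap_value, gamma=0.99):
    """log_probs, values: tensors of length n for one trajectory segment;
    rewards: list of floats; bootstrap_value: float, V(s_{t+n})."""
    returns, R = [], bootstrap_value
    for r in reversed(rewards):                    # n-step discounted returns
        R = r + gamma * R
        returns.append(R)
    returns = torch.tensor(returns[::-1])

    advantage = returns - values.detach()          # positive -> reinforce the action
    policy_loss = -(log_probs * advantage).mean()
    value_loss = (returns - values).pow(2).mean()  # fit V to the n-step target
    return policy_loss + 0.5 * value_loss
```

When the advantage is positive the log-probability of the taken action is pushed up, and when it is negative it is pushed down, which is exactly the intuition described above.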
24:17
Now, that was the actor-critic part; the asynchronous part is that the A3C algorithm is composed of multiple actors, and each actor independently operates in the environment: it collects observations, acts in the environment, computes the policy gradients with respect to the parameters of its network, and then sends those gradients back to the parameter server. The parameter server collects all these gradients from all the different actors, combines them together, and then shares the updated parameters with all the actors. Now, what happens in this case is that as you increase the number of actors, and this is the usual asynchronous stochastic gradient descent setup, the staleness of the gradients becomes a problem. So in the end, distributing the experience collection is actually something very advantageous, it's very good, but communicating gradients might become a bottleneck as you try to really scale things up. So for that, what we tried was a different architecture.
25:21
we tried was a different architecture
25:27
the idea of a sanctuary server is
25:31
actually quite useful but rather than
25:33
using it to just to just do the
25:36
accumulate the parameter updates the
25:39
idea of that learner is to make the
25:42
centralized component into a learner so
25:45
the all the whole learning algorithm is
25:46
is contained in that what the actors
25:48
does is only act in the environment not
25:50
compute the gradients or anything
25:52
send the observations back into learners
25:54
to the learner and the learner sends the
25:56
parameters back and in this in this way
25:58
what you are doing is you are completely
26:00
decoupling what happens about your
26:02
experience collection in your
26:04
environments from your learning
26:06
algorithm and in this way you are
26:07
actually gaining a lot of robustness
26:09
into noise in your environments
26:11
sometimes rendering times vary some some
26:14
environments are slow some environments
26:16
are fast
26:17
all that is completely decoupled from
26:18
your learning algorithm but of course
26:20
what you need is a good learning
26:22
algorithm to to be able to deal with
26:24
that kind of variation so in the end we
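A toy, self-contained sketch of that decoupled actor/learner pattern (Python threads and a queue standing in for the real distributed system; the "environment", "policy" and "gradient" here are trivial placeholders):

```python
# Toy sketch: actors only generate trajectories with possibly stale parameters;
# a single learner consumes them, updates the parameters and publishes them.
import queue, random, threading

trajectories = queue.Queue(maxsize=100)
params = {"weights": 0.0, "version": 0}            # shared parameter store
lock = threading.Lock()

def actor(actor_id, episodes=25):
    for _ in range(episodes):
        with lock:                                 # pull the newest parameters
            w, version = params["weights"], params["version"]
        # act in a fake environment with the (possibly stale) behaviour policy
        traj = [(w, random.random()) for _ in range(20)]   # (params used, reward)
        trajectories.put((version, traj))          # send observations, not gradients

def learner(steps=100):
    for _ in range(steps):
        version, traj = trajectories.get()         # may come from an older policy
        fake_grad = sum(r for _, r in traj) / len(traj)    # placeholder "gradient"
        with lock:                                 # all learning happens here
            params["weights"] += 0.01 * fake_grad
            params["version"] += 1

threads = [threading.Thread(target=actor, args=(i,)) for i in range(4)]
threads.append(threading.Thread(target=learner))
for t in threads:
    t.start()
for t in threads:
    t.join()
print(params)
```

The point of the pattern is visible even in this toy: the actors never block on gradient computation, and the learner never waits on a slow environment, which is exactly the robustness to variable rendering times described above.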
26:27
So in the end, in IMPALA, what we have is a very efficient, decoupled setup: actors generate trajectories, as I said, but that decoupling creates this off-policyness, right: the policy in the actors, the behaviour policy if you will, is separate from the policy in the learner, the target policy. So what we need is off-policy learning. Of course there are many off-policy learning algorithms, but we really wanted to have a policy gradient method, and for that we developed this new method called V-trace; it's an off-policy actor-critic algorithm. The advantage of V-trace is that it uses truncated importance sampling ratios to come up with an estimate for the value: because there is this lag between the learner and the actors, you need to correct for that difference. The good thing about this algorithm is that it is a smooth transition between the on-policy case and the off-policy case: when the actors and the learner are completely in sync, so you are in the on-policy case, the algorithm boils down to the usual A3C update with the n-step Bellman equation; if they become more separate, then the correction of the algorithm kicks in and you have the corrected estimate. The algorithm has two main components, two truncation factors, to control two different aspects of off-policy learning. One of them is rho-bar, which controls which value function the algorithm is going to converge towards: the value function that corresponds to the behaviour policy, or the value function that corresponds to the target policy in the learner. The other one, the c factor, controls the speed of convergence: by controlling the truncation it can increase or decrease the variance in learning, and so it can have an effect on the speed of convergence.
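A minimal sketch of the V-trace target as I understand it from the IMPALA paper (tensor shapes and the default truncation values below are assumptions for illustration; in practice the targets are computed without gradients):

```python
# Sketch of the V-trace value targets with truncated importance ratios.
import torch

def vtrace_targets(values, next_value, rewards, log_pi, log_mu,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """values: V(x_s) for s = 0..n-1; next_value: V(x_n) bootstrap;
    log_pi / log_mu: log-probs of the taken actions under the learner's
    (target) policy and the actor's (behaviour) policy."""
    ratios = torch.exp(log_pi - log_mu)
    rhos = torch.clamp(ratios, max=rho_bar)        # controls the fixed point
    cs = torch.clamp(ratios, max=c_bar)            # controls variance / speed

    values_t1 = torch.cat([values[1:], next_value.view(1)])
    deltas = rhos * (rewards + gamma * values_t1 - values)   # corrected TD errors

    # v_s = V(x_s) + delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1})), backwards
    acc = torch.zeros(())
    corrections = torch.zeros_like(values)
    for s in reversed(range(len(values))):
        acc = deltas[s] + gamma * cs[s] * acc
        corrections[s] = acc
    return values + corrections
```

When the behaviour and target policies match and the ratios are not truncated, the corrections reduce to the ordinary n-step return, which is the smooth on-policy/off-policy transition described above.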
28:24
Now, when we tested this, of course the goal is to test on all environments at once, but what we wanted to do first was look at single tasks, so we looked at five different environments, and we see that in these environments the IMPALA algorithm is always very stable and it performs at the top. The comparisons here are the IMPALA algorithm, the batched A2C methods, and then different versions of the A3C algorithm, and you can see that IMPALA and batched A2C are always performing at the top; IMPALA seems to be doing fine, it's the dark blue curve there,
29:03
and this gives us the feeling that OK, we have a nice algorithm. Now, of course, the other thing that is very important, and that is discussed a lot, is the stability of these algorithms. I actually really like these plots; since the A3C work we keep looking at these plots and we always put them in the papers. In the plot here, on the x-axis we have the hyperparameter combinations: when you train any model, what all of us do is some sort of hyperparameter sweep, and here we are looking at the final score achieved with every single hyperparameter setting, sorted. In this kind of plot, the curves that are at the top and that are the most flat are the better-performing and most stable algorithms, right, and what we see here is that IMPALA is of course achieving better results, but it's not achieving those results because of one lucky hyperparameter setting: it is consistently at the top. You can see that it's not completely flat, of course, because in the end we are searching over three orders of magnitude in parameter settings, but we can see
30:18
that the algorithm is actually quite
30:19
stable. Now, when we look at our main goal here: on the x-axis we have the wall-clock time and on the y-axis we have the normalized score, and the red line that you see there is A3C. You can see that IMPALA not only achieves much better scores, it of course achieves them much, much faster. The other thing is comparing the green and the orange lines there: that is the comparison between training IMPALA in an expert setting versus a multi-task setting, and we see that it achieves better scores, and faster, which again gives us the idea that we are actually seeing positive transfer. It's a like-for-like setting: all the details of the network and the agent are the same; in one case you train one network per task, and in the other case you train the same network on all the tasks, and what you achieve is a better result because of the positive transfer between those tasks. And what happens is that if you give IMPALA more resources, you end up with this almost vertical takeoff there, and you can actually solve this challenging 30-task domain in under 24 hours, given the resources. That is the kind of algorithmic power that we want, to be able to train these very highly scalable agents. Now, why do we want to do
31:38
that? That is the point that I want to come to next, and in the final part this is the new SPIRAL algorithm that I want to talk about. Now, just quickly going back to the original ideas that I talked about: unsupervised learning is also about explaining environments and generating samples, but maybe about generating samples by explaining environments. We talked about the fact that when we have these deep learning models like WaveNet we can generate amazing samples, but at the same time maybe there's a different, less implicit way we can do these things, in the sense that when we generate these samples they come with some explanation, and that explanation can go through using some tools. In this particular case, what we are going to do is use a painting tool, and we are going to learn to control this painting tool, a real drawing program, and we are going to basically generate a program that the painting tool will use to generate the image. The main idea that I want to convey is that by learning how to use tools that are already available, we can start thinking about different kinds of generalizations, which I'll try to demonstrate. So in the real world we have a
32:50
lot of examples of programs and their
32:53
executions and the results of those
32:55
programs they can be arithmetic programs
32:57
floating programs or even architectural
32:59
blueprints right and what we do is
33:02
because we know we have an information
33:06
on that generation process when we see
33:10
the results we can go and try to infer
33:13
what was the program what was the
33:14
blueprint that generated that that
33:16
particular input so we can do this and
33:18
the goal is to be able to do this with
33:20
our with our agents too
33:22
Specifically, we are going to use this environment called libmypaint; it is actually a professional-grade open-source drawing library and it's used worldwide by many artists. What we are doing is using a limited interface, basically learning to draw brushstrokes, and we are going to have an agent that does that. The agent, in the end called SPIRAL, has three main components. First of all there is the agent that generates the brushstrokes; I like to see that as writing the program. The second one is the environment, libmypaint: the brushstroke commands come in, the environment turns those into brushstrokes on the canvas, and that canvas goes into a discriminator. The discriminator is trained like in a GAN: it looks at the generated image and says, does this look like a real drawing, and then gives a score. And that score, as opposed to the usual GAN training, rather than propagating the gradients back, we take that score and we train our agent with that score as a reward. So when you think about this, all
34:20
these three components coming together, you have an unsupervised learning model similar to GANs, but rather than generating in the pixel space, we generate in this program space, and the training is done through the reward that the agent itself also learns. So we are sort of trusting another neural net, just like in the GAN setup, to actually guide learning, but not through its gradients, just through its score function, and in my opinion that makes it, in certain cases, very capable of using different kinds of tools. So, as I said,
34:52
this agent, the reinforcement learning part of the agent, is completely the same as IMPALA. So now that we have an agent that can solve really challenging reinforcement learning setups, we take it and put it into this environment, augmented with the ability to learn a discriminative function to actually produce the reward. To emphasize again, the important thing here is: yes, we have an agent, but there is no environment that says, OK, this is the reward that the agent should get; the reward generation is also inside the agent, thanks to all the unsupervised learning models that are being studied here, and we specifically use a GAN setup there.
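A toy sketch of that training signal (the agent, renderer and discriminator below are hypothetical stand-ins, and the plain REINFORCE-style update is a simplification of the IMPALA-based agent actually used):

```python
# Sketch: the discriminator's score on the finished canvas is used as a reward
# for the stroke-generating agent, instead of backpropagating GAN gradients.
import torch

def spiral_style_episode(agent, renderer, discriminator, n_strokes=20):
    canvas = renderer.blank_canvas()
    log_probs = []
    for _ in range(n_strokes):
        action, log_prob = agent.sample_stroke(canvas)   # one step of the "program"
        canvas = renderer.paint(canvas, action)          # libmypaint-like rendering
        log_probs.append(log_prob)
    # Reward comes only from the discriminator's score on the final image;
    # no gradient flows through the (non-differentiable) renderer.
    with torch.no_grad():
        reward = discriminator(canvas)
    loss = -(torch.stack(log_probs).sum() * reward)      # REINFORCE-style surrogate
    return loss, canvas
```

Because the reward is a score rather than a gradient, the renderer itself never needs to be differentiable, which is what makes it possible to drop in a real tool like libmypaint, or, later, a robot arm.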
35:31
So can we generate? The first thing we try, of course, when doing unsupervised learning from scratch, is to go back to MNIST, right, you start from MNIST, and initially of course it's generating various scribble-like things, but then through training it becomes better and better. Here in the middle you see that the agent learned, and these are completely unconditional samples, the ones that you see in the middle, it learned to create the strokes that generate these digits. To emphasize: this agent has never seen strokes coming from real people, how we draw digits; it learned by experimenting with these strokes, and it sort of built its own policy to create the strokes that would generate these images. Of course you can also train the whole setup as a conditional generation process to recreate a given image. I think the main thing about this is that it's learning, in an unsupervised way, to draw the strokes; I see it as the environment, the libmypaint environment, giving us a grounded bottleneck to actually create a meaningful representation space. Of course, the next
36:38
thing we tried was Omniglot, and again you see the same things: it can generate unconditional, meaningful, Omniglot-looking samples, or it can recreate Omniglot samples. But then, generalization, right: so here what we tried was to train the model on Omniglot and then ask it to generate MNIST digits; this is what you see in the middle row there. Can it draw MNIST digits? It has never seen MNIST digits before, but we all know that Omniglot is more general than MNIST, and it can do it: given an MNIST digit, it can actually draw it, even though the network itself has never seen any MNIST digits during its training. Then we tried
37:17
smileys, right, they're line drawings, and OK, given a smiley it can also draw smileys, that is great. So can we do more? We did this: we took this cartoon drawing, and this is done by chopping it up into 64 by 64 pieces, and it's a general line drawing, right; again this is the agent that was trained using Omniglot, and now you can see that it can actually recreate that drawing. Certain areas, right around the eyes and the insides, are really complicated, but in general you can see that it is actually capable of generating those drawings. So this gives you an idea of, OK, generalization: I can train on one domain and generalize to new ones.
38:01
So can I push it further? The next thing that we tried was: OK, the advantage of using a tool is that you have a meaningful representation space, so we can hopefully transfer that representation space into a new environment. So here what we do is, again with the same agent that is trained using Omniglot, we transfer from that simulated environment into the real world. The way we do that is that we took that same program, and our friends at the robotics group at DeepMind wrote a controller for a robotic arm to take that program and draw it. This whole experiment happened in under a week, really, and what we ended up with was the same agent, not fine-tuned for this setup or anything: the same agent generates its brushstroke programs, and then that program goes into a controller and can be realized by a real robotic arm. The reason we can do this is that the environment we used is a real environment; we didn't create that environment ourselves. The latent space, if you will, is not some arbitrary latent space that we invented: it's a latent space that is defined by us as a meaningful tool space, and the reason we create those tools in the first place is to solve many different problems anyway, right, and this is an example of that: using that tool space gives us the ability to actually transfer its capability. So with that, I want to
39:32
conclude. I tried to give an explanation of how I think about generative models and unsupervised learning, and of course I'm a hundred percent sure everyone agrees that our aim is not to just look at images, right; our aim is to do much more than that. I tried to give two different aspects: one of them is that the kind of generative models we can build right now can actually solve real-world problems, like we have seen with WaveNet, and also that we can think about a different kind of setup where we have agents actually training and generating interpretable programs. That is an important aspect; we have seen that conversation coming up here through several of the talks, that being able to generate interpretable programs is one of the bottlenecks that we face right now, because there are many critical applications that we want to solve and many tools that we want to utilize, and this is one sort of step towards that, the way I see it. Being able to do this requires us to create these very capable reinforcement learning agents that rely on new algorithms that we need to work on. With that, thank you very much, and I want to thank all my collaborators for their help on this. Thank you very much.
40:50
[Applause]
40:50
[Music]
40:57
[Applause]
41:06
we have time for maybe one or two
41:09
questions
41:24
OK, so I have a question: how do you think about scaling to more general domains, beyond simple strokes, how to generate realistic scenes? Right, so one thing that I haven't shown here: yes, creating realistic scenes is one case, one thing that I haven't talked about that is actually part of this work, it's in the paper, one thing that the team did. By the way, I have to mention this was worked on mostly by Yaroslav Ganin; he's actually a PhD student at MILA and he spent his summer with us doing his internship, so it's an amazing job to have done this during an internship, big congratulations to him. So one thing that we did was actually try to generate images: we took the CelebA dataset and used the same drawing program to actually draw those, and in that case our setup just scales towards those; the same setup actually scales, because it's a general drawing tool and you can control the color, and we can do that, but it requires a little bit more; it was one of the last experiments that we did, but it is sort of in the works.
42:42
Thanks for a great talk. I had a question about the IMPALA results: you had a slide with a curve where all workers are learning versus having one centralized learner, and the all-workers learning actually does better than the centralized learner, and I found that not quite surprising, but, you know, it's great, and it's great to see the positive transfer between tasks. Have you tried that on other suites of tasks? Do you think it's just because the tasks in this suite are very similar to each other? It definitely depends on that, but the reason we created those tasks is exactly for that reason, right: in the real world, the visual structure of our world is unified, so the kind of setup that we have in DeepMind Lab, in that task set, is that it's a unified visual environment: you have one kind of agent with a unified action space, and now you can focus on solving different kinds of tasks. Of course, that is the kind of thing that we were testing: given all of this, is it actually possible to get the multi-task positive transfer that we see in supervised learning cases, and we were able to see that in reinforcement learning, yeah.
44:01
Hello, this is exciting. I have a question about extending this to maybe more open domains: what is the challenge? Is the challenge the number of actions to pick, because the stroke space is maybe smaller? What are the other challenges to extend to open domains? What do you have in mind as open domains? The number of actions is definitely a challenge, right, it is definitely one of the big challenges that a lot of research in RL, as far as I know, goes into, but that is, I think, only one of the main challenges. The other challenge, of course, is the state representation; that is mainly why we used deep learning, because we expect that with deep learning we are going to be able to learn better representations, and that still remains a challenge, because being able to learn representations is not only an architectural problem, it is also about finding the right sort of training setup, and SPIRAL was an example of that, where we can get that reward function, that reward signal, in an unsupervised way. In many different domains there are many different ways we can do this, but actually finding those solutions is also part of that.
45:20
okay, so let's thank Koray again
45:24
[Music]
45:27
[Applause]