
@bmorphism
Created December 17, 2023 02:10
soft equivariances

Thank you very much for coming to our workshop.

Thank you, Mr. Joseph.

I'm super happy to be here.

I'm a bit of an imposter.

I know nearly nothing about neuroscience.

So I'm going to use neuroscience terms, but I'm basically flopping my way through those things.

You could take me a little bit more seriously for the machine learning things that I'm saying.

I changed my title to "Traveling Waves in Brains and Machines" because I think the title of this workshop is absolutely brilliant.

So I congratulate the organizers for thinking of that term.

And this is joint work with my previous student, Andy Keller, who sadly isn't my student anymore.

He's now a postdoc at Harvard, and if you want to talk to him, this is his picture, which I put up to embarrass him.

But he's somewhere here, and you can find him now.

So symmetries are amazingly powerful in physics.

And so here's a few examples that I always like to quote, which is in the turn of the 19th century.

So the electric fields and magnetic fields were considered two completely different things until Maxwell came around and figured out that really if you change the frame of the observer, you can turn one into the other, and it's really one phenomenon.

Similarly, Einstein figured out that if you're in a box, let's say in an elevator, and you drop a ball, you can't tell whether it's because you're in a box that's accelerating away or because you're in a gravitational field.

And from all of that, he figured out all of general relativity.

So one insight, a massive, impactful theory.

And of course, the entire standard model of elementary particles, staggeringly precise theory in physics, is actually organized in terms of the symmetry groups.

And here's a bunch of people who have worked in the lab, in AMLab.

And there's Mark in the audience here, who has also done great work; you've heard about him already.

So it started with Taco Cohen, working on equivariance and the idea of equivariance.

There's a beautiful book by Maurice Weiler here.

It's somewhat of a bible.

It's like 500 or 600 pages, beautifully illustrated, at all levels of sophistication.

So if you want to get into this field, it'll be good.

So the idea of equivariance is very simple.

You want your neural network to understand that if you change something at the input, that that's really the same thing happening, and you want the output to change in similar ways.

So if you have a gecko, and you translate the gecko in your input, then the filtered output should translate as well: filtering and then translating should give the same result as translating and then filtering.
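As a minimal sanity check of this "filter then shift equals shift then filter" property, here is a small numpy sketch (purely illustrative, not from the talk), using periodic boundaries so the equality is exact:

```python
# Translation equivariance of an ordinary convolution: filtering and then
# shifting gives the same result as shifting and then filtering.
import numpy as np
from scipy.ndimage import convolve

rng = np.random.default_rng(0)
image = rng.random((32, 32))      # stand-in for the gecko image
kernel = rng.random((3, 3))       # an arbitrary filter

def shift(x, dy, dx):
    return np.roll(x, (dy, dx), axis=(0, 1))   # periodic translation

filtered_then_shifted = shift(convolve(image, kernel, mode="wrap"), 5, 3)
shifted_then_filtered = convolve(shift(image, 5, 3), kernel, mode="wrap")

assert np.allclose(filtered_then_shifted, shifted_then_filtered)
```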

You can do this on images, and this is just one illustration of one of these beautiful pictures that's in this book.

But I want to point out that the transformations that are happening in the latent space do not have to be the same thing that are happening in the input space.

So here's a beautiful painting that we all know, and if you filter it, let's say, for horizontal things, so you filter for the eyes along this line and for the mouth along this line, you'll find these two detections, but you won't find anything here.

If you rotate the image, then you'll find the same detections, but from here, because the eyes have been rotated.

And this basically means that you're shifting from one of these columns here to the next column in your latent representation.

So it's not just a rotation of the detections, but it's also a shift in the space.

Okay, so at a sort of more general level, you can think of symmetries as input transformations that generate predictable transformations of the activation layers, and this is also known as homomorphic representations.

So you have input, which could be an image or something, you encode it using some neural net, and you want to maybe encode it as an equivariant network.

Then there is these hidden representations, and if you transform the input, it could be anything, it could be sort of transformations in real life, like you're rotating your head and the information in your head is changing, or the light might be turned on, or something like that.

And you want some kind of representation in these hidden layers whose transformations mirror the transformations of the input.

And then once you've done that, you should be able to predict, using the decoder, you should be able to predict the transformed input.

And so you also see here something that we'll talk about in a minute, the variational autoencoder structure.

But the big question that we're going to ask ourselves today is, what is this?

Now, for regular groups and representations that we know, we know what that is, because those are the actual actions on the irreducible representations of the group, so we know how that works.

But we are going to ask the question: how can we generalize this to things that are maybe not groups, just transformations, or things that we don't know because they're hidden in the data?
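Written out schematically, in my own notation: with an encoder E, a decoder D, and an observed input transformation T_x (a rotation, a lighting change, and so on), the question is what latent operator T_z should satisfy

E(T_x(x)) ≈ T_z(E(x)),   or equivalently   D(T_z(E(x))) ≈ T_x(x).

For a classical equivariant network, T_z is fixed in advance as a group representation rho(g); here it will instead be the flow of a learned dynamical system.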

And so, we probably all, certainly all the scientists in the room, know this type of picture, which is sort of the layout of the orientation selectivity of the neurons, the contrast, and so on; there are a lot of neuroscience words here, but I, of course, just read these in books.

And sort of each one of these colors now is an actual orientation.

And now we can sort of imagine that if we do something to the input, that we traverse a trajectory in this space, right?

If I rotate my head, you know, all my angles will change, and you can sort of imagine maybe that you're sort of, you're rotating in this space of this orientation somehow.

And so that's the thing that we're going to try to model.

This could be a path of distributions, right?

Because every input causes an actual distribution of activations in your brain, and the transformation is now changing and pushing forward this distribution through this representation.

Now, I'm going to differentiate between overdamped and underdamped dynamics.

If you do underdamped dynamics, it's sort of like an oscillator, things are oscillating.

If it's overdamped, you're basically taking a distribution, you're sort of pushing it, deforming and pushing it forward, but it's not deforming and coming back with inertia.
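To make the distinction concrete, here is a throwaway one-dimensional sketch (my own illustration, not the model from the talk): the same spring, once with heavy damping and once with light damping.

```python
# Overdamped vs. underdamped second-order dynamics for a single unit:
# heavy damping relaxes without oscillating, light damping rings with inertia.
import numpy as np

def simulate(gamma, k=1.0, z0=1.0, v0=0.0, dt=0.01, steps=2000):
    z, v, traj = z0, v0, []
    for _ in range(steps):
        a = -k * z - gamma * v   # restoring force plus damping
        v += dt * a              # semi-implicit Euler step for velocity
        z += dt * v              # ... and for position
        traj.append(z)
    return np.array(traj)

overdamped = simulate(gamma=5.0)    # decays monotonically toward 0
underdamped = simulate(gamma=0.1)   # oscillates: crosses 0 many times

print(np.sum(np.diff(np.sign(overdamped)) != 0),    # ~0 zero crossings
      np.sum(np.diff(np.sign(underdamped)) != 0))   # many zero crossings
```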

Right, and if you want to make it truly simple, when I say waves, I sort of mean something like this, like you see in a soccer stadium.

There's activations, right?

If you do like this, this neuron is very active, and it sort of moves through this, through these orientation maps or these representation maps.

Okay, and so, I'm not just fantasizing about this, although it is something that's quite recent, and this is what I get from talking to people like Lyle Muller and Terry Sejnowski.

There's been a whole bunch of very interesting papers on waves in the brain, so people have actually been detecting these waves in the brain.

And so here you see, this is from one of Lyle's papers, so you can actually measure the fact that there are these waves of activity going through a cortical area, actually over very large distances.

And this is by far the most exciting thing I've seen, so Andy showed it to me recently.

It's from this paper.

I can just, it's hard for me to stop watching this thing.

So, this is the dynamics that's happening in the brain, and these things here are the pinwheels, and these pinwheels are the places where, as you go around them, the phases go through 2 pi.

And in this particular dynamical model from this paper, there are these little vortices in a sort of fluid, if you wish, and they can be created in pairs; they have a charge, positive or negative, depending on which way the activation moves around them.

A positive and a negative pair can be created and destroyed, but you have to have charge conservation.
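As a small aside on what "charge" means here, a sketch (my own, not from the paper) of how one would assign it: the winding number of the phase field around a closed loop, i.e. how many times the phase wraps through 2 pi.

```python
import numpy as np

def winding_number(phase, loop):
    # phase: 2D array of phases; loop: (row, col) indices tracing a closed path.
    pts = loop + [loop[0]]
    values = np.array([phase[r, c] for r, c in pts])
    dphi = np.diff(values)
    dphi = (dphi + np.pi) % (2 * np.pi) - np.pi      # wrap increments into (-pi, pi]
    return int(np.round(dphi.sum() / (2 * np.pi)))   # +/-1 around a pinwheel, 0 elsewhere

# Synthetic pinwheel centred at (16, 16): the phase is the angle around the centre.
yy, xx = np.mgrid[0:32, 0:32]
phase = np.arctan2(yy - 16.0, xx - 16.0)

loop = [(12, c) for c in range(12, 21)] + [(r, 20) for r in range(12, 21)] \
     + [(20, c) for c in range(20, 11, -1)] + [(r, 12) for r in range(20, 11, -1)]
print(winding_number(phase, loop))   # -> 1, a single pinwheel of charge +1
```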

If you have a background in quantum field theory, like myself, you look at this and you think, wow, the brain is doing quantum field theory, and second quantization is actually creating and destroying particles, and one day neuroscientists will probably be drawing Feynman diagrams.

Okay, so, back to some more serious stuff.

Well, this is very serious, actually, it's measurements.

It's getting quite more serious than that, I guess.

So, how do we want to generalize this notion of equivariance?

And so the idea is, again: there is an input, and I'm going to change the input in some way. I have an encoder, a neural network, some neural pathway that gets the input encoded in the brain. And then I have to choose something here, so this is the big question mark; this is where the inductive bias we talked about this morning comes in. What inductive bias are we going to stick into this? That's the big question, right?

And so once we have chosen something here, we can now sort of mirror the stuff that's happening in the world by things that are happening in the brain, and then we should be able to decode or predict what's happening in the future.

We could just get an input, close our eyes, have some thoughts, and then predict the future, right?

And so also note that this is actually like an equivariance diagram, right?

Because if this were just an ordinary group transformation, then this diagram would say: if you first filter and then transform, or if you first transform and then filter, the diagram should commute.

But I've given myself a lot more freedom not to put irreducible representations there of a group, I'm going to do something far more relaxed.

Okay, so what we chose here is this set of oscillators, and I'll show in a minute that it was inspired by another paper; let me just quickly go there. So this paper here, this was a graph neural net where every neuron was basically modeled as an oscillator, and it was connected to its neighbors, right?

So you can now imagine that if you're connected to your neighbors, if you start doing something, it has an effect on the others, and everybody starts to oscillate.

What they found, though, is that backpropagation through this type of network was very effective: there was no vanishing or explosion of gradients.

But if you look at the actual dynamics in the latent space, it doesn't look like waves at all.

And then Andy came up with this interesting idea: he said, let's just replace this matrix multiplication by a convolution.

And if you do that in this model, you create waves.
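A rough PyTorch-style sketch of the change being described (a hypothetical paraphrase, not the authors' code): keep the second-order oscillator update, but swap the all-to-all coupling matrix for a small convolution, so each latent unit only talks to its neighbours on a 1D lattice.

```python
import torch
import torch.nn as nn

class OscillatorLayer(nn.Module):
    """Damped, driven oscillator units; local=True uses convolutional coupling."""
    def __init__(self, n_units, local=True, dt=0.1, gamma=1.0, eps=0.01):
        super().__init__()
        self.local, self.dt, self.gamma, self.eps = local, dt, gamma, eps
        self.couple = (nn.Conv1d(1, 1, kernel_size=5, padding=2) if local
                       else nn.Linear(n_units, n_units))

    def step(self, z, v, forcing):
        # coupling: neighbours only (conv) or all-to-all (dense matrix multiply)
        c = self.couple(z.unsqueeze(1)).squeeze(1) if self.local else self.couple(z)
        a = torch.tanh(c + forcing) - self.gamma * z - self.eps * v   # under-damped
        v = v + self.dt * a
        z = z + self.dt * v
        return z, v

layer = OscillatorLayer(n_units=64, local=True)
z, v = torch.zeros(8, 64), torch.zeros(8, 64)        # a batch of 8 latent states
z, v = layer.step(z, v, forcing=torch.randn(8, 64))  # one integration step
```

With local=False this is the dense, all-to-all version described above; with local=True the only change is the coupling, and spatially structured, wave-like activity can propagate across the lattice.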

And actually, this is a very robust feature. We found that, you know, it's similar to when I was starting in this field: I was interested in ICA, and it was just amazing that whatever you did, you always got Gabor filters.

You know, every week there was a new paper, with a new method, a new principle, and everybody got Gabor filters.

So my hope is that these waves are also extremely robust.

So, yeah, so here we go: the input transformation is maybe a rotation, the encoder is a neural network, and then here are these neurons, which I connect through these convolutional oscillators, and then we're going to train the whole thing.

So we're going to train it by giving it pairs of inputs and their transformed versions: images, objects, whatever, right?

And then we're going to ask it: train for me both the encoder and the decoder, and train for me the parameters of this PDE that's sitting there, this whole set of ODEs.

Okay, everything is going to be trained.
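A hedged sketch of what "train everything" could look like end to end, with made-up sizes and a plain reconstruction loss standing in for the variational objective that is actually used (discussed below):

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 64))               # frame -> forcing
decoder = nn.Sequential(nn.Linear(64, 32 * 32), nn.Unflatten(1, (32, 32)))  # latent -> frame
dyn = nn.Conv1d(1, 1, kernel_size=5, padding=2)                             # local latent coupling

params = list(encoder.parameters()) + list(decoder.parameters()) + list(dyn.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

def rollout(z, v, frames, dt=0.1, gamma=1.0, eps=0.01):
    preds = []
    for x in frames:                                  # frames: (T, B, 32, 32)
        force = encoder(x)                            # the data keeps forcing the oscillators
        c = dyn(z.unsqueeze(1)).squeeze(1)            # local (convolutional) coupling
        a = torch.tanh(c + force) - gamma * z - eps * v
        v = v + dt * a
        z = z + dt * v
        preds.append(decoder(z))                      # decode the latent waves back to a frame
    return torch.stack(preds)

frames = torch.rand(10, 8, 32, 32)                    # toy sequence of transformed inputs
z, v = torch.zeros(8, 64), torch.zeros(8, 64)
opt.zero_grad()
loss = ((rollout(z, v, frames) - frames) ** 2).mean() # reconstruct each frame through the waves
loss.backward()
opt.step()                                            # encoder, decoder and dynamics all update
```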

And if you want to know what kind of ODEs these are, this is second-order ODE, so you get some oscillations.

There's external forcing, because the information from the data is impinging on the neural representation.

It's locally connected, that's why we get wave patterns, and it's under-damped.

Okay?

Now I need to explain a little bit about variational autoencoders; this is work done by Durk Kingma in AMLab.

And so this idea of variational autoencoders, I guess most people know it, but it's like this: you start with some distribution that's living on some complicated data manifold, and you push it through layers of encoders, like a neural network, to map it onto a much easier sort of marginal distribution in the latent space.

Think of that as maybe the Gaussian distribution, and every data point gets mapped to some small distribution in that space.

And then there is a decoder distribution, where you pick a point from that sort of space, and you push it through the decoder, that's also typically a neural net, but could also be a simulator if you want, and then you generate a point, hopefully on the manifold, but of course, you know, maybe it's a bit fatter, but it's not exactly the same as the original one.

And you train that to make the one that's going up and the one that's going down to sort of become the same.

And that's very similar to diffusion models, if you like.

Okay, so now the idea is that this model that I just described is actually a variational autoencoder, because you encode, and then you transform, and you decode.

It's just this temporal sort of dynamics which we have now added to the variational autoencoder, and that's basically, you know, the PDE part.

But you can see how you can train this now.

You just write down, you know, all the equations, you write down the ELBO, optimize it, and you're done.
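For reference, a generic sequence ELBO of the kind being gestured at here (schematic, in my notation; the exact factorization in the paper may differ):

\log p_\theta(x_{1:T}) \ge \mathbb{E}_{q_\phi(z_{1:T} \mid x_{1:T})} \Big[ \sum_t \log p_\theta(x_t \mid z_t) \Big] - \mathrm{KL}\big( q_\phi(z_{1:T} \mid x_{1:T}) \,\|\, p_\theta(z_{1:T}) \big),

where the prior over latent trajectories p_\theta(z_{1:T}) is defined by the latent wave dynamics, q_\phi is the encoder, and p_\theta(x_t | z_t) is the decoder; maximizing the right-hand side trains all three jointly.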

Okay, once you have done all that, you get this Neural Wave Machine. So the idea is: you start with a five, you want to encode it, and you have to train this vector field, which can be a function of the input but also has learned components; and then as you traverse the latent space you get all sorts of nice waves, and then at the end you decode, and of course you can predict the future by basically keeping the waves propagating and decoding them.

And you can think of this basically as a Lagrangian prediction.

Here's some examples.

You know, bubbles going around each other.

So here is the, this is the reconstruction, and this is the ground truth.

Here you see the latent activations, and here are the phases.

Yeah, so these are not very wave-like, these are more, if you look at these, these are more like standing waves at this point.

As you see here, you know, it starts off okay, and then if you try to predict too far into the future, it sort of starts to fail.

It's not unexpected, but also these models didn't have like a huge amount of capacity yet, so we can improve certainly with more capacity.

Okay, so then this miracle happens, and I'm still not quite sure why, but, you know, there's some speculation here.

So here on the right-hand side, this is a measured representation from the brain.

So, in fact, once these waves are propagating, you can start to measure what the individual neurons are doing, and they actually become a little bit more effective.

So the idea is you get lots of waves in your latent space, and after a while, once you've done a lot of training, you look at the neurons, and you find that they are pretty good, so it's implemented.

Okay, so that's amazing.

I think it's helpful to have these observations.

Now, we also went on to see whether this phenomenon is actually robust: if you change the model, you still get waves.

And then the final thing I'm very excited about, which was presented here at NeurIPS, is to use this idea to think differently about disentangled representations.

And this might mean something to many of you here, or pretty much all of you.

So the idea is that, okay, at your input, you have images, and you have some transformation paths on your input.

You're going to map that onto some set of activations in your latent space.

But you're also going to have a vector, a transformation vector.

In fact, you have a whole bunch of transformation vectors.

One that changes that distribution in terms of orientation, one that changes it in scale, one in color, etc.

Alongside your actual representation, you also have directional vectors, which tell you in which directions you can change that representation.

And all of that can be learned; basically, it's the structure of the latent space that we need to think about.

So it's basically a Fokker-Planck equation.

So this is now an under-damped model, but you could also have an over-damped model.

So it's more like pushing around distributions rather than oscillating.

But I think you can do the same with oscillations as well.

So this is a Fokker-Planck equation where this velocity field tells you how to change a distribution, and that's instead of these vectors that are appearing here.

Yeah, and that's work with Yue Song, a visiting student.
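Schematically, in my own notation: each named transformation k gets its own velocity field u_k in the latent space, and the latent distribution p(z, t) is transported by a Fokker-Planck equation

\partial_t p(z, t) = -\nabla \cdot \big( p(z, t)\, u_k(z) \big) + D\, \Delta p(z, t),

where D is a diffusion constant. Following u_k for a while changes, say, the orientation of the decoded image while leaving other attributes alone, and in this overdamped regime the distribution is pushed along rather than oscillating.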

Okay, and then you can do funky things. So you can sort of train these representations and you can say, okay, now everything is fixed. Now first rotate, and then change color.

So you know these vectors. You say, okay, first follow one vector and change the rotation, and then follow another vector and just change the color, and see what happens on new data, right? And you can see beautifully that it keeps understanding what it means to change color, and scale, and transform.

And here it changes, for instance, two things in this image: first one hue attribute, and then the object hue.

But more excitingly, you can also change things in linear superpositions. So say you've only trained it on scale and on object hue separately.

And then you say, okay, now I'm going to change scale and object hue together.

And if it's a real vector space, that should work, right?

You should simply get something that still makes sense. But it's not a priori true that if you take a linear combination of transformations, you get a sensible solution.

But it usually works: it gives you the two transformations applied together.
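A tiny hypothetical sketch of this composition test; v_scale, v_hue, and the decoder mentioned in the comment are placeholders for whatever a trained model of this kind exposes, not real API:

```python
import torch

z = torch.randn(64)          # latent code of one image
v_scale = torch.randn(64)    # placeholder: learned "change scale" direction at z
v_hue = torch.randn(64)      # placeholder: learned "change object hue" direction at z

def traverse(z, direction, alpha, n_steps=10):
    """Move the code a little at a time along one learned direction."""
    for _ in range(n_steps):
        z = z + (alpha / n_steps) * direction
    return z

z_scaled = traverse(z, v_scale, alpha=1.0)                    # one factor at a time
z_combined = traverse(z, v_scale + 0.5 * v_hue, alpha=1.0)    # linear superposition
# Decoding z_combined should show both the scale change and the hue change at once,
# even though the model was never trained on the two happening together.
```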

Okay, so then to conclude, traveling waves are found in the cortex. That's just a true statement.

Traveling waves can also be learned in these variational models. We did that, and it's a pretty robust phenomenon.

And our interpretation is that it does implement a form of approximately generalized equivariance. In particular, you can think of a pinwheel.

If you rotate around a pinwheel, you rotate through all the orientations in the selectivity map. That's like a capsule in an equivariant representation where if you rotate something, you permute through all those states of this orientation.

And our last statement was that we can use these diffusion equations and wave equations to define a new kind of disentangled representation.

Thank you.

[Applause] Thank you, Max.

I think we're able to take a few questions.

Please.

Hi, Professor Welling. Thank you very much for the exciting talk. I have a maybe very stupid question regarding slide number 14, where you explain the example where you have a rotating five and you have a wave structure.

Yes, exactly.

This is the only one.

Yeah, my question is in this case, basically there's only one dimension.

Can I understand in this way that there's only this wave structure when there's only one traversal, like the traversal in one single dimension or what would happen if I have multiple traversals, like for example, I'm scaling the size of the five at the same time.

How does the structure look like?

Yes, it still looks like the wave. So this is very one-dimensional. So this is actually a torus.

So you would have to have a torus with multiple dimensions, but you could imagine the wave would sort of flow up that way or diagonally or linearly, sort of take any sort of complex form.

So these things here are already far more complex. But also this is just only a one-dimensional or two-dimensional representation of course, you would also want many sort of channels of that happening at the same time. So actually the real picture is far more complicated and these waves live in this higher dimensional space. But yes, so every one of these transformations will induce some set of waves through the space not necessarily straight, but they can go all sorts of directions.

Do these transformations, or do the latent vectors, have to be independent from each other?

Well, that's sort of the idea of this k-frame, where you learn sort of a basis, and I would interpret these directions as the statistically independent transformations that you're seeing in your data.

They get represented.

But then if you want some kind of combination of things you would take linear combinations of those.

So that's how I view, I mean it's a tangent space where every point has a set of these vectors attached to it.

And yeah, that's how I would actually do it. Okay, thank you very much.

Thank you professor for the super inspiring talk.

And my question is that whether the frequency or the speed of the traveling wave in the latent space is determined by external factors or internal factors.

Because I'm thinking about how this model could be related to the traveling waves in the brain. In the brain we have gamma waves, theta waves, and those, and most of the time the frequency is determined by some internal factors.

And that is my first question.

If it is determined by internal factors... Can we first do one question because my memory is so short that by the second question I forgot the first one.

So, where is my...

So here you can sort of see what determines the wave.

These parameters w here and d's here, they in principle determine the frequencies with which this thing runs.

But there's also this external factor, which is this one, which can also modulate.

So I guess you could speed up something or slow down a little bit by having this factor. But mostly I think frequencies are determined by these internal parameters here.

It's very interesting actually because I think also the frequency should be something that you could determine by the input and then it becomes even more flexible.

Thank you.

The second question is similar to the first one from the previous audience member: if there are superimposed or combined transformations, like at different time scales, whether those superimposed trajectories can be mapped to superimposed waves in the latent space.

So you're saying if it's being...

So I think at this point there's still in this model an issue with let's say the speed with which things happen. So if you train things changing in certain speeds and you would now ask it to say have one thing go a lot faster and then linearly combine it with something that's slower, I think that would not work at this point.

I think we would need to extend the model to be able to do that. So you found a little bug there. It's good to know.

Thank you.

Thanks for the super exciting talk.

Could it be that the wave nature is a consequence of the invertibility of the group? In the sense that, in a way, you have a representation of monoids, things that aren't necessarily invertible, in that any dynamical system is more like a monoid structure that doesn't per se need to be invertible.

But maybe these wave systems are like dynamical systems with a notion of invertibility. And could it be that because you're trying to learn to sort of represent the group in this kind of dynamical system, in this monoid that for this to work your dynamics need to become invertible and that's why you get waves. Is that something you've thought of?

Well, that's fascinating.

Yeah, I wouldn't know.

And with diffusion, you can't model a group, because you can't really diffuse back; but maybe with the wave you can.

Ah, alright.

So it's definitely true that if you don't add noise and you have a wave, it can be time-reversal invariant.

But I didn't quite get the relationship with the monoid structure, but maybe we can take that offline. It seems a very technical question. I'm sure you're onto something, but we should talk.

Yeah, thank you.

Hi Max, very cool talk. I think you should convert more to computational neuroscience or come to neuroscience meetings more often.

It's very hard with the echo. Oh, sorry.

My ears are also bad.

Maybe I should have touched the mic.

Is it better now?

Like this? Okay.

So I wanted to encourage your neuroscience side, and I have a quick question regarding generalization. You often, basically, want to look at the quotient group, right? So is it easy to obtain a representative of the manifold if, say, you want to do invariant recognition?

Can you easily obtain like a standardized version of, say, the object that you want to recognize from this type of representation?

That's a fascinating question.

It's true that you would want to average over orientations like that.

You could say it's maybe an average over the travelling waves somehow.

I don't know.

I'm not sure how to build invariance necessarily immediately, because for that you would sort of have to average. Well, I guess if you average over paths somehow, you might get invariance.

If the traversal actually shows you the transformation you want to become invariant to, then as this thing is happening you traverse a path, and if you just integrate over that path, that's going to be your invariant.

So my first guess would be to see if you can average over the path that you're generating, but maybe also over transformations.
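For what it's worth, a minimal sketch of that suggestion (purely illustrative): record the latent trajectory while the input transforms, and average over it, so the result no longer depends on where along the orbit you happen to be.

```python
import torch

def invariant_code(latent_trajectory):
    # latent_trajectory: (T, D) latent states visited while the input transforms;
    # averaging over the whole path discards the position along the orbit.
    return latent_trajectory.mean(dim=0)

traj = torch.randn(50, 64)   # stand-in for a recorded latent trajectory
z_inv = invariant_code(traj)
```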

We have time for one more question.

I just want to ask a simple question.

In a group convolution, we know that when the input is rotated...

I can't hear it.

Sorry.

In a group convolution network, when the input is rotated by 90 degrees, we know that the feature maps are rotated and permuted.

Is there a deterministic rule for how the wave will transform when the input undergoes a certain transformation?

Do we know how the feature map will change?

The feature map? Well, the feature map is learned, so the feature map is fixed; it's the activations in the latent space that are changing. And if you rotate something, I would think that if you want something that's equivariant, you would want the activity to flow around all the orientations in your orientation map.

And then again you can average over that to become invariant.

I'm not sure if that answers your question.

Yeah, so just, do we know what's the rule for how the activations...
