immaculate moments

that we had today, and so I will not introduce them again.

We're very lucky to have Stephan Mandt, who will also participate in the panel, and who's a professor at UC Irvine working on neural compression, among other things.

I don't see him.

He was there a moment ago.

Maybe he will join in a bit.

So OK, so the format is as follows.

The organizers have prepared a couple of questions to kind of kick off the discussion, but mostly we're hoping to have a discussion among the panelists and with you guys.

So please also think about questions you might want to ask and have the panelists debate or fight about.

So maybe we can start with a question from the organizers.

So the first question is quite general.

Why are you interested in information theory?

What do you think is the most exciting challenge that can be solved with information theory tools, and what cannot be solved with information theory tools?

Yeah, so my interest is mostly in theoretical biology, and I really like how mutual information actually relates to a large number of types of resources: materials used, energy (from some recent results in non-equilibrium thermodynamics), and time as well.

And so as a result, I think it's very interesting in that sense.

I feel unprepared.

I didn't originally intend to be on the panel, but actually I feel like my talk served as a decent answer to this question.

So for me, information theory provides what I feel like is the universal objective for representation learning, and what can't it do?

Yeah, I think there's a lot.

Anything that isn't trying to optimize for representation learning.

I don't know.

That's the best answer I can give right now.

Sorry.

So, yeah, do you want to do a later question before?

Yeah.

Or you can just get started.

Oh, you can do it.

OK, based on that, yeah, hi, I'm Stephan.

I'm an associate professor of computer science.

I'm a trained theoretical physicist, and I think, yeah, I'm excited.

I got into the field of information theory through statistical physics.

I was interested in non-equilibrium processes, and under which conditions they find an equilibrium distribution, or when the distribution can never converge to equilibrium, and things like that.

And then later on, latent variable models became a very natural extension.

So, yeah, I think the question was, in part, what can information theory not provide?

Right?

OK, so, yeah, I think one of the things that it cannot handle is basically, well, it assumes a stationary distribution, essentially, right?

I mean, if the distribution itself is time-dependent and changes faster than any learning algorithm can adapt, this would basically be an example where information theory would probably need to be at least extended to provide answers, I guess, since it's essentially an equilibrium theory.

Cool.

Yeah, I mean, cognitive scientists, including myself, have been interested in resource rational analyses for a while.

And when we did it in the past, it felt like we had to commit a little bit too specifically to what the resources were and what they did.

And so then we would get results that felt interesting, but maybe very contingent on assumptions.

And so I guess the thing that's exciting to me about applications of information theory recently is that they give us a much higher level way of talking about resources and asking, what would happen if there was a resource restriction, a resource constraint, and then getting these results that take that into account.

That's great.

It can't do all the rest of the stuff.

I think I want to focus on the last part, which is what it can and can't do.

And I think specifying what the assumptions are of the approach are really important, as Stephan just did.

But I also think it's important to understand and, well, emphasize the care it takes to choose what the resources are, or what it is that we're actually trying to quantify in terms of the information in the system, and how we link that to theories of how the system functions.

So in other words, there's a lot more behind the scenes in developing even the conceptual framework for the application of the method.

And I think that's a really big challenge.

I think I just thought of a better answer, maybe, than the one I gave before.

So not better than the rest of the panelists, sorry, if that was the implication.

But so I want to highlight, and I heard a bit of it come up, but I want to emphasize, I think, one of the greatest strengths of information theory for providing useful objectives is also one of its big weaknesses.

So the fact that it's reparameterization invariant makes me feel comfortable as a physicist because it means we're looking for solutions independent of the way we specify a problem.

But I also think that's a bit of a downside because you could have an arbitrarily complex bijection.

Like, I could hash the data or encrypt it.

And as far as information theory is concerned, that shouldn't affect the answers.

But undoing that can require a sometimes impossible computational cost.

And so I feel like that's one of the bigger weaknesses: information theory objectives are, by themselves, blind to the computational costs involved in some of these things.
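
To pin down the invariance being discussed, here is the standard statement in symbols (a textbook fact, not something said on the panel):

```latex
% Mutual information is invariant under bijections: for any invertible
% (measurable) maps f and g,
I(X; Z) = I\big(f(X); g(Z)\big).
% So hashing or encrypting the data with an invertible map leaves the
% mutual information unchanged, even though undoing the map may be
% computationally infeasible, which is the weakness described above.
```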

Just to riff off of that, however, maybe there's mutual information that talks about the computational cost in order to do that bijection.

So maybe everything does boil down to some sort of information theory objective.

So I don't know if any of you have a question, or we should continue poking as the organizers.

Yes?

On this note of the bijection and its complexity, do you think Kolmogorov complexity could enrich the story as an extension of information theory?

Who wants to talk about Kolmogorov complexity?

Sure.

OK.

So I'm familiar with Kolmogorov complexity only through Jim Crutchfield's take on it.

So this is going to be a completely biased take.

But his take is that it's basically an entropy rate.

And his take is also that it's uncomputable.

I don't know what to really make of that.

I think I'll agree.

Cover and Thomas, I think, have a nice chapter on Kolmogorov complexity.

And they make clear just how utterly intractable it is as a quantity.

So I think that makes it less useful as something that we could actually use as the objectives in our own learning procedures.

But I'm not going to say that Kolmogorov complexity isn't useful.

It obviously is.

It's really interesting and fascinating stuff.

But insofar as I do have a suggestion, something I do think provides some protection against some of the downsides of information theory by itself, of mutual information being fully reparameterization invariant: the variational bounds on it are more dependent on your choice of variational family.

And right along those lines there was the work on V-information, a relaxed notion of computable information defined by a choice of variational family that you're willing to search in.

And you might limit yourself to simple functions, linear functions, or something like that.

And I think that's one way to sort of push against some of the problems with the full invariance of mutual information by itself.
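
For reference, the definition that line of work introduces looks roughly like this (paraphrased from memory, so treat the notation as approximate):

```latex
% V-information, paraphrased from Xu et al., "A Theory of Usable
% Information under Computational Constraints".
% V is a "predictive family" of allowed predictors f.
H_V(Y \mid \varnothing) = \inf_{f \in V} \mathbb{E}_{y}\big[-\log f[\varnothing](y)\big]
H_V(Y \mid X) = \inf_{f \in V} \mathbb{E}_{x,y}\big[-\log f[x](y)\big]
I_V(X \to Y) = H_V(Y \mid \varnothing) - H_V(Y \mid X)
% With V unrestricted this recovers Shannon mutual information; with V
% restricted (say, to linear predictors), encrypting X can destroy I_V
% even though it preserves I(X; Y).
```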

Just to be contrarian, Kolmogorov complexity is uncomputable.

But MDL is a very useful objective function that's great for all sorts of stuff, like the success of things like Kevin Ellis's DreamCoder system, which is basically using an MDL approach as the objective.

To me, it just seems like a different thing than what's going on with most of the information theoretic analyses.

So I don't think it's competing; it's additive.

But I do actually think it's useful.

So maybe I'll ask a question.

So you're four physicists, I think, except Noah.

I'm a mathematician.

OK, even better.

So if I think about physics, statistical mechanics is, in a way, much richer than thermodynamics, because you have a lot more knowledge going into it.

And you have access to microscopic quantities.

But actually, it's much harder to compute things. And so the domain is much more limited, whereas thermodynamics gives you a very general framework with experimental predictions.

So how do you think, or do you think information theory can provide something like thermodynamics for cognitive processes?

And if so, how and what kind of information has to be put in?

And what kind of ad hoc constraints or information?

Yeah, for cognitive processes I actually think about things entirely differently, not in terms of thermodynamics.

And I think about it in terms of Marr's levels.

I don't know if anybody here is familiar with them.

But it's the idea that there are these three levels for understanding organisms.

The first level is this normative level, where you sort of write down what you think the objective function is of the organism.

And so that's where you talk about resource rational prediction, or some mutual information that they're trying to maximize, or whatever it is.

Next level is the algorithmic level.

And that's sort of like, what algorithm are they using to do this?

And then the mechanistic level is like, what molecules are actually bouncing around to get this done?

And I feel like stat mech maybe is sort of like the two lower levels a little bit.

And then thermodynamics is like the higher level, if I want to make that analogy, but I'm not really sure.

Yeah, I mean, in my view, information theory is essentially already thermodynamics, right?

I mean, it is already the stationary equilibrium theory, right?

As opposed to dealing with sort of time-dependent systems that are described by differential equations or stochastic differential equations.

So yeah, maybe I view it already as the statistical mechanics in some sense, or the thermodynamics.

I just want to make a pitch for everybody to go learn about the recent work in causal abstraction analysis that makes these sort of relationships precise.

And kind of the attempt to answer the question, when is a simple causal model a faithful abstraction of a much more complicated causal model?

And once you have that formalized, then you can use whatever tools you enjoy in information theory or other things.

As part of talking about the relationship between those levels.

Of course, with the caveat, since I'm among physicists, that as far as I know, the ergodic hypothesis still hasn't been proven.

And so we still don't actually know that there's a proper reduction.

Oh well, for the thermodynamics case.

I guess I just, so for the last point, I'm, no, no, no.

I also don't think it's proven, but I guess I'm sort of with Jaynes in the sense that I'm not sure I care.

Jaynes has these really remarkable papers called, what's the title of it?

"Information Theory and Statistical Mechanics" or something, in which he tries to demonstrate how you can think about statistical mechanics, or more broadly thermodynamics, as really just being almost like a Bayesian inferential procedure.

You have a complex system that might have some complex exact configuration.

You don't know which configuration the system is in.

And if you're willing to express your ignorance in the form of a probability distribution, then you can seek, in the space of all possible beliefs, the one that's least constrained, that doesn't have any additional structure except the few measurements that you make.

And doing this, you can kind of recover all of the usual ensembles in statistical mechanics.

And he argues strongly that this viewpoint doesn't require ergodicity or something in order to explain why you might end up in a Boltzmann distribution.

A Boltzmann distribution is just the consequence of the fact that you're willing to be maximally ignorant.

And it doesn't require, you know, I don't actually care if the system is ergodic.

I just don't know whether it is.

And so I'm willing to express my ignorance and believe in the Boltzmann distribution as a predictor.

Yeah.

Yeah.

I love James' work too.

I just was always bothered by the fact that you have to know what constraint you're putting in in order to get out the right answer sometimes.

So if you went and said, oh, I'm going to constrain the average energy, you get out the Boltzmann distribution.

If you went in and said, I'm going to constrain the average energy squared because I don't know what I'm doing, you get out something else.

And I was never, ever given a satisfactory answer for why you should constrain the average energy, like why you should know to constrain the average energy, which I think makes it hard to extend to the time-series domain.
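
For readers who want the calculation behind this exchange, the maximum entropy step goes roughly like this (standard material, with the constraint choice left open exactly as the complaint says):

```latex
% Maximum entropy sketch: among all distributions p consistent with a
% constraint on the average energy, pick the least committed one.
\max_{p}\; -\sum_s p(s)\log p(s)
\quad\text{subject to}\quad
\sum_s p(s)\,E(s) = \langle E\rangle, \qquad \sum_s p(s) = 1.
% Lagrange multipliers give the Boltzmann form:
p(s) = \frac{e^{-\beta E(s)}}{Z}, \qquad Z = \sum_s e^{-\beta E(s)}.
% Constraining the mean of E^2 instead yields p(s) \propto e^{-\lambda E(s)^2},
% so the answer does depend on which observable you choose to constrain.
```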

Maybe I was not very clear in my question.

My question was not about thermodynamics and statistical mechanics.

It was about cognitive processes.

And so what I wanted to say is that you introduce a bunch of quantities in thermodynamics like entropy, but then you need a few laws that are just inferred from experiments.

So you need to put in something more than just defining quantities or how to update beliefs.

So my question was, what kind of information would you have to put in, in addition to defining these quantities, to have kind of fruitful models of cognitive processes?

I don't think you're going to like the answer.

But again, I think as far as Jaynes is concerned, it might not matter, right, in the sense that you would be allowed to form a kind of thermodynamics for any choice of things you're willing to specify.

But then, I think, to Sarah's complaint, some of the things you choose to specify are going to end up being more or less useful.

And I'm not sure Jaynes can provide much guidance to you as to what a good set of things to measure or a good set of constraints to apply would be.

And I'm not sure there's a good answer for that.

But at least for the first version of your question, if you really just want to build a thermodynamics of cognition, I think you follow Jaynes and you have an exact prescription for how to do that.

You choose whatever variables you want and you end up with a kind of thermodynamics over those variables.

Whether or not you are happy with that, I can't tell you.

So I guess I think that we don't know the answer to that question.

And it's partly because we haven't fully grappled with the actual statistics of experience in the world.

Like the structure of the statistics will affect a lot whether you need anything other than just density estimation to do what humans are doing.

I think that the success of sequential density estimators like transformers, recently, at not just learning stuff, but learning stuff and exhibiting some surprisingly human-like cognitive biases and effects, makes me kind of take one step back and think: okay, I don't know whether I need to impose additional constraints, or if I can just look for the risk-minimizing density estimators for experience.

And that's that.

So I think one of the things many of us like about information theory is that it allows us not to think about specific codes, right?

So we can sort of abstract away.

For instance, in the domain of cognitive systems.

And I'm curious how you think about what we lose by that.

Sort of the counter-argument to Noah's point just now.

And perhaps also what you think machine learning research can do to fill in that gap in a sense.

In what sense it helps us to deliver some of these efficient codes.

Yeah, I don't know if I have exactly the answer to your question, but I can try to answer the second part.

There's actually this growing field of neural data compression, which doesn't necessarily reinvent coding schemes, but actually improves the density models that are the input to many codecs.

And of course you can learn or approximate certain probability distributions much better with neural networks than you could otherwise do.

So there's actually a lot of interesting development in this domain.

When it comes to completely new coding schemes, there's this interesting direction of compression without quantization or relative entropy coding, in which you basically sample from a Bayesian prior multiple times until you hit a sample that has high probability under a posterior distribution.

And then you transmit the index of how many times you had to sample.

And the receiver will then be able to reproduce exactly the same procedure if they share a common source of randomness.

And essentially decode that data point.

So I think that's a relatively novel idea, which I wasn't aware of in the existing information theory literature, and that was sort of inspired by the Bayesian machine learning perspective.
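
A toy sketch of the scheme just described, with shared randomness supplied by a common seed. This is illustrative only: the Gaussian prior and posterior and the rejection-style acceptance rule are made up for the example, and real relative entropy coding schemes are more careful about bounding the index length.

```python
# Toy sketch of "compression without quantization" / relative entropy
# coding as described above. Sender and receiver share a random seed;
# the sender draws candidates from the prior until one is accepted
# under its posterior, then transmits only the index of that draw.
import numpy as np
from scipy.stats import norm

def encode(post_mean, post_std, seed, max_tries=100_000):
    """Sender: index of the first prior sample accepted under the posterior."""
    rng = np.random.default_rng(seed)             # shared randomness
    prior, post = norm(0.0, 1.0), norm(post_mean, post_std)
    grid = np.linspace(-6, 6, 1001)               # crude bound on the
    M = np.max(post.pdf(grid) / prior.pdf(grid))  # density ratio for rejection
    for index in range(max_tries):
        z = rng.normal()                          # candidate from the prior
        if rng.random() < post.pdf(z) / (M * prior.pdf(z)):
            return index                          # transmit just this integer
    raise RuntimeError("no sample accepted")

def decode(index, seed):
    """Receiver: replay the shared random sequence and take the same draw."""
    rng = np.random.default_rng(seed)
    for _ in range(index + 1):
        z = rng.normal()                          # same candidates as sender
        rng.random()                              # consume the accept/reject draw
    return z

idx = encode(post_mean=1.5, post_std=0.3, seed=42)
z_hat = decode(idx, seed=42)                      # approximate posterior sample
```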

I guess I also want to add that I feel like this is maybe another advantage of variational approaches, like the KL divergence before you get to mutual information, because when I was presenting the talk, I chose to say that everything is just KL minimization, not mutual information maximization.

Because there's one more step involved.

You start with the variational bound, if I have a KL divergence between P and Q.

If I then optimize in the space of all possible Q, if I don't put any constraints on it, then what I get back is a whole bunch of different mutual information terms.

And I feel like the prevalence of mutual information in the older literature is because, in the past, it was easier to work analytically with what we thought was the optimum of what was otherwise a variational objective.

If you take the step of imagining that you could perfectly optimize it, suddenly there's a simplicity that happens and you can describe the answer and certain properties the answer might have.

But nowadays, we actually have, like Stefan mentioned, very rich and very complex and very exquisite neural density estimators where you don't have to take that step.

You don't have to go all the way to imagine that you've searched the space of all possible probability distributions. Instead, you can just search in a wickedly large space for a distribution that you get to hold in your hand.

And that, I think, induces other properties and lets you talk about coding, and, I think, answer some of the questions that sort of evaporate if you go all the way to infinity.

So I'm becoming more and more of a believer in: don't go all the way with the KL.

I'll stop at the point where you're sort of explicit about the fact that you're performing a search and maybe just be explicit about how you want to parameterize that search.

And I think there's more questions we can ask about those types of answers before we make it easy on ourselves and imagine we actually do the optimization in the space of all possible probability measures.

But, I don't know, I'll leave it at that.

Yeah, and then just from the point of view of understanding algorithms, again going back to Marr, I feel like the actual "what is the code" question comes in at the algorithmic level. And I would just say there are two levels of understanding, and it's okay to say that you're going to spend your entire lifetime working on the first level, the normative level, and never touching codes.

That's my honest answer.

Danny, did you want to maybe answer?

I have a question about to what extent information theory can help us interpret the flow of information in complex systems such as deep neural networks.

Yeah, I think Alex might also add to that, but there's this interesting framework of the variational information bottleneck, for example, where you have basically a pipeline: something goes into a neural network, there's an intermediate stochastic state, and ultimately you predict something. And maybe you want to compress data in this intermediate state because, for example, you might have a pipeline part of which is deployed on your phone and another part on an edge server, and you might wonder: how efficiently can I compress in this particular architecture, with a supervised goal? And here you can work again with variational inference, potentially deriving lower bounds, but definitely upper bounds, on the efficiency of transmitting information. So I think this would be one example that would quantify the efficiency of transmitting information in a particular neural architecture.
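
For reference, the variational information bottleneck objective being described has the following standard form (as in Alemi et al., "Deep Variational Information Bottleneck"):

```latex
% The information bottleneck objective: keep what predicts Y, discard
% the rest of X, with beta setting the trade-off.
\max\; I(Z; Y) - \beta\, I(Z; X)
% Variational surrogate (a decoder q(y|z) lower-bounds the first term,
% a prior r(z) upper-bounds the second), giving the trainable loss:
\mathcal{L} = \mathbb{E}_{p(x,y)}\,\mathbb{E}_{p(z \mid x)}\big[-\log q(y \mid z)\big]
  + \beta\, \mathbb{E}_{p(x)}\Big[ D_{\mathrm{KL}}\big(p(z \mid x) \,\|\, r(z)\big) \Big].
```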

Thank you, Stephan, I agree.

I also want to take this as an opportunity to pitch for another direction I think you can go.

I mean, there's obviously lots of work where people try to use information theoretic ideas to bound generalization error and talk about various properties of neural networks.

I've written papers like that myself.

But there's one direction that I wish more people kind of thought, which is, if we're talking about information theory, if we're talking about mutual information, we need to be sort of explicit about where the stochasticity is coming from.

And in especially some of the earlier work, there were some problems with people trying to apply information-theoretic ideas to neural networks, because on the face of it, neural networks are deterministic maps rather than stochastic ones. And so there's a bit of a mismatch, and things can get fuzzy, and you can get answers wrong if you push too hard and interpret the intermediate layer of an otherwise deterministic neural network as a random variable.

However, there are other forms of stochasticity. There's stochasticity in the training that we do of neural networks, and lots of people are trying to look at the particulars of the dynamics that we use for training, like Langevin dynamics or something, to characterize the complexity of the solutions found by neural networks.

But there's another level of stochasticity that I feel like people don't appreciate as much, which is the stochasticity in the initialization.

You know, the point that I have is: you train a neural network and what you get is some response, but empirically, it doesn't matter what my random initial seed was. I can train the neural network ten times and I'm sort of just as happy with each one of those ten training runs.

The RL people have it different; they're sort of sensitive to the seed. But most of the things that we do with neural networks are reasonably invariant to the starting seed. And that's a signal to me about the functions that they're learning. I wish more people thought about the neural network as being like a sample from a distribution.

I could describe an ensemble, which corresponds to following exactly the same training procedure a billion times. And I could think about any particular neural network as being a sample from that ensemble.

And I wish more people tried to characterize the properties of that ensemble, of the distribution from which we've drawn a single sample. And I think a lot of work gets confused or gets lost because they overfit to the fact that they have a single sample.

And they try to describe properties of the sample rather than the distribution it was drawn from. I think part of the problem is it's really hard to draw from that ensemble or to characterize it, but I wanted to say this in this room, because I think this room is full of people who might get on board and try things like that.
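
A minimal illustration of that proposal, with everything here (the tiny numpy network, the sine-wave data) invented for the example: train the same architecture from several seeds and study the spread of the resulting functions rather than any single one.

```python
# Sketch of the "ensemble over initializations" idea: train the same
# tiny model from several random seeds and treat each trained network
# as one sample from the distribution induced by the training procedure.
import numpy as np

def train(seed, X, y, epochs=500, lr=0.1, hidden=16):
    """Train a one-hidden-layer regressor from a seeded initialization."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 1.0 / np.sqrt(X.shape[1]), (X.shape[1], hidden))
    W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, 1))
    for _ in range(epochs):
        h = np.tanh(X @ W1)                       # forward pass
        err = h @ W2 - y                          # residuals for squared loss
        g2 = h.T @ err / len(X)                   # gradient w.r.t. W2
        g1 = X.T @ ((err @ W2.T) * (1 - h**2)) / len(X)  # gradient w.r.t. W1
        W1 -= lr * g1
        W2 -= lr * g2
    return lambda Xq: np.tanh(Xq @ W1) @ W2

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (128, 1))
y = np.sin(3 * X)
ensemble = [train(seed, X, y) for seed in range(10)]  # ten training runs
Xq = np.linspace(-1, 1, 50).reshape(-1, 1)
preds = np.stack([f(Xq) for f in ensemble])           # samples from the "ensemble"
print(preds.mean(axis=0).shape, preds.std(axis=0).mean())  # mean function, seed spread
```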

Yeah, just to mention another research direction that might be relevant: I know that people are coming up with really interesting things like partial information decomposition and stuff like that, and I feel like that might be useful in interpreting information flow in some of these neural networks.

Although, honestly, if you pressed me on exactly how that would work, I wouldn't know how.

So, yeah.

Slightly different question.

To what extent would you again agree that KL divergence is also kind of the proper form for formalizing any sort of bounded rationality constraints, or constraints on resource processing, where you kind of take a human as the prior, and the human then develops into some posterior, and it's KL divergence again?

I'm repeating Noga's answer from lunch.

So we were actually talking about this at lunch a little bit.

And my claim, my initial claim, was that you needed to have a utility function and you couldn't just have a KL divergence. So I was disagreeing with Alex's talk. And Noga was like: no, actually, you can take a KL divergence and turn it into any utility function you want if you choose your priors correctly. And I was like, okay, I guess that's true. But then you have to ask what that prior is, and I guess it's potato-potahto.

Goodbye.

I guess I made my position clear.

But I will caveat it, right? I am kind of a deep believer that the KL divergence is the right way to compare two probability distributions. And so I think another way to ask your question is: how comfortable are we with trying to explain something like human cognition or motivation as a story about two probability distributions?

If you're okay with that, then I'm going to say, go with KL. But I think a way out is if you don't think that that's the right kind of mental model, that they're probability distributions, instead they're something like a utility function or some other kind of mathematical object, then, okay, now the door's open and there's some other way to compare it.

So I'm going to...

If there are two distributions, you do the KL.

If one of those things isn't a distribution, do something else.

Just for the sake of completeness, even though I, in practice, agree a lot of the time, I think I'm going to take a tools sort of perspective that it's not always the right tool for the job when you want to compare two distributions.

In generative adversarial networks, the KL version of the distance between the distributions is just not stable.

The Wasserstein distance is just much better.

And I think there are lots of cases where that's true, where earth mover's is better, or where you're doing something where you don't have a preferred distribution and so you don't want the measure to be asymmetric.

And I don't know, I think I'm much more open to there being other answers sometimes.
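
The standard example behind the GAN point above (two point masses; this is the usual Wasserstein GAN motivation, not something said on the panel):

```latex
% A standard example with two point masses on the line:
P = \delta_0, \qquad Q_\theta = \delta_\theta.
% For every \theta \neq 0 the supports are disjoint, so
D_{\mathrm{KL}}(P \,\|\, Q_\theta) = +\infty,
\qquad
W_1(P, Q_\theta) = |\theta|.
% The earth mover's distance W_1 shrinks smoothly as \theta \to 0 and
% gives a usable training signal; the KL gives none.
```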

Yeah, I mean, same answer. I mean, depends on the optimization problem, right? If you're getting stuck in local minima, then maybe another measure could be more adequate, but you can also do annealing and things like that.

So I guess I'll do KL.

And I'll play the role of the extremist and I guess in my heart, the fact that GANs don't have a good characterization in terms of KL optimization is the reason why I don't like GANs as much as the alternatives.

Yeah, that's true.

Yes.

So you discussed a lot about the KL divergence, and then you essentially assume that the world is full of random variables and they obey exactly Kolmogorov's axioms.

If you don't want to assume that, then you get the Kolmogorov complexity and the computability issues.

Do you believe that in cognitive science we need yet another version of an information measure which is orthogonal to these two?

So kind of a Klein program, like the one we had for geometry, but now for information. And in which direction should that go?

I don't know.

I'll admit that I think we're all afraid to answer because that sounds like a fantastic question that I don't think any of us, I'll speak for everyone, I don't feel qualified to provide even a suggestion, but I think that sounds like the million dollar question.

Oh yeah, what's your answer?

No, I don't have an answer, but you know, if you now apply these types of concepts to society, you basically treat people as random variables, and I think most would be utterly offended if they knew what the consequences are.

So, but you may not want to go to the extreme where you know nothing about these objects, and that's why a description-length type of computational concept like Kolmogorov complexity is the right way to do it. So clearly I think there are some questions behind this where we would like to have more structure that we know about these objects, and they are not as nice as random variables, which give you mutual information as the touchstone for detecting whether things are statistically independent, so you know exactly when you can't predict anything.

So, yeah.

We live in a highly stylized world, and we should ask ourselves, did we make it too simple to really capture the complexity of reality?

So I suddenly have more to say.

Okay, so I agree with everything you said, but I guess again I feel a responsibility to say: what's the analogy? The analogy is something like a pseudo-random number generator.

A pseudo-random number generator doesn't actually produce a random variable, right? There are perfect correlations, it's perfectly deterministic, and you know exactly what the sequence of numbers is going to be. So that is definitely not a random variable. And even though that's totally true, I want to invoke the sort of Jaynesian sense in which it might not matter.

Because if I don't have the seed, I might be more successful in discussing that object as if it were a random variable. Because as far as I'm concerned, from my ignorant position, if I don't know the seed, I should just model that thing, which has more structure that is hidden from me, as if the thing itself were stochastic. And so I think there's a less mean way to interpret treating people as random variables, or human cognition as a random variable or something. It's not that you think people are behaving randomly, stochastically, or whatever. They could have a lot more structure. But insofar as we're ignorant of that structure, until we figure out something more intelligent to say, I don't think there's anything wrong, and I think there's a lot good, with modeling behavior, or basically anything in the world that you don't have anything else intelligent to say about, as a random variable.

Yeah, I kind of hesitate to say what I'm going to say next, but I'm going to say it anyway.

So I feel like you can say something about the laws of physics and quantum mechanics and how things are random, and that may explain everything, but I'm going to back off.

So I don't know if you have other questions. You can have the option of having the answer to your question also, as you just saw.

So, as you've heard throughout this workshop today, variational methods like variational autoencoders and the variational information bottleneck are becoming very popular, and they seem very crucial in order to scale up information-theoretic methods. But once we do that, we kind of lose Shannon's first-principles derivation and the theoretical guarantees that we have.

And so my question is, how should we think about these approximated measures? How should we interpret them, and what kind of guarantees do we have or can we have for those measures?

Thank you.

I feel fairly comfortable.

I think there are lots of...

I think you should be aware of the different approximations that you do when you apply them.

First of all, you decide on a function class.

Sometimes you also do amortization, in which you further restrict the function class to one where conditional distributions can be characterized by neural networks.

But I think, being aware of these, you can always do fixes to essentially get a better variational approximation.

You can do ensembles, you can do mixtures.

Even if you're satisfied with your variational family, you can do importance sampling, you can do annealed importance sampling.

So in principle, you know that even with annealed importance sampling, you get sort of the right answer. It's just computationally expensive.

I'm okay with that.

Yeah, that's all I have to say.
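
For concreteness, the importance sampling fix mentioned above is usually written as the following bound (the IWAE bound of Burda et al.):

```latex
% Importance sampling tightens a variational bound without changing q:
\log p(x) \;\ge\; \mathbb{E}_{z_1,\dots,z_K \sim q(z \mid x)}
  \left[ \log \frac{1}{K} \sum_{k=1}^{K} \frac{p(x, z_k)}{q(z_k \mid x)} \right].
% K = 1 is the usual evidence lower bound; the bound only improves as K
% grows, and annealed importance sampling pushes further still,
% trading compute for accuracy exactly as described above.
```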

I don't think you should be fussed unless you're a mathematician, because the really interesting things have big enough qualitative differences that the errors from, you know, fitting modern giant neural nets as your function approximation family are probably going to be absorbed. Just get on with the work.

I want to make another plug for, I'm embarrassed I don't remember the authors or whoever, or even the full title, but if you search for, like, V-information, right, there was a paper a few years ago where they sort of took seriously the notion of...

There you go, perfect.

Yeah, they took seriously the notion of the sort of variational approximation to mutual information and tried to show that it satisfies similar properties to mutual information: a sort of non-negativity if you do it in a certain way, a data-processing-type inequality, and stuff like that.

So I think there's a chance that you could recover some of the properties you really like about mutual information, but even in this more relaxed setting. And I think there's a lot more work to do there, but you should definitely check out that paper.

So I'd like to push back on two of the comments that were made earlier, and we'll see, maybe this sparks useful conversation. But I guess you had mentioned that you think of information theory as something equilibrium, and I worry that that could be misleading. For example, in non-equilibrium thermodynamics, there's been huge progress in the last decades in using information theory with forward and reverse processes, which actually, I've heard, is where some of the diffusion stuff in AI was inspired from. So in a very real sense, it's not equilibrium. But maybe you mean something like there just has to exist a probability distribution, I don't know.

And then maybe I'll just say the other thing then. It's a similar topic then.

While you're invoking these kinds of fluctuation relations, where you have these log probabilities and all that, you can't just take a Jaynesian stance of: ah, my environment is Gibbs because I'm ignorant. Actually, if you say that and then you do an experiment, the probabilities will not be as predicted unless the environment truly does follow that distribution.

Yeah.

That's a great comment.

Yeah, with fluctuation relations you basically apply the same tools that you're familiar with, and you can still think of them as being part of the information-theoretic toolbox as well.

I have to think a little bit more about that. I guess that was a question for me.

I was wondering if there might also be an additional assumption of local equilibrium in there. I'm not completely sure if you can be really far from equilibrium with this formalism.

But...

Yeah.

Yeah, okay. Good.

Proven wrong. Thank you.

I want to defend Jaynes a little bit more. So, again, I think you're totally right that if the world has structure that you are ignorant of and wrong about, and the structure is actually there, then yeah, you're not going to get as good answers as if you were aware of that structure.

I guess I just want to highlight, this isn't unique to science.

If I throw this microphone across the room, I can treat it as a point particle.

Meaning, I'm going to ignore all of the internal structure of this thing. I'll even ignore its rotational modes, and I'm just interested in the motion of its center of mass.

And that's wrong, right? And it's going to make inaccurate predictions at the level of 10 to the minus 4, but it's totally good enough if I just want to know whether it'll hit you in the head or somebody else.

Right?

So...

I...

We could try it out.

So I want to highlight that I think you're totally right.

If you're missing certain structure, you're going to get wrong answers.

But like...

I mean, are we so brazen as to think that we know all the structure that governs everything in the world anyway? No! Everything's an approximation. Everything's a model. Everything's some level of ignorance.

I think Jaynes was just more honest than most of us about that sort of fact.

And I just wanted to say that.

Yeah.

I just want to add on to what Alex said, which is that if you look at what Bill Bialek and colleagues have done in the area of biological modeling over the past few decades, they've used maximum entropy models to great effect. And basically, what happens when they discover that their model isn't good enough is they just add another constraint, which is basically what I think Alex is saying.

Yeah, because you learn when your probabilities are off.

I think there was a question there.

Alright, so this is intended as a fun question.

Do you see any role for quantum information theory in machine learning or cognitive science?

It's his idea of fun.

Okay. I want people to work on it.

So, personally, well, I'm sort of of two minds about it. On the one hand, I'm sort of skeptical or bearish on it yielding interesting fruit in the near term.

I think the things that we're interested in making predictions about are inherently classical, at higher energy scales. I don't think that they're fundamentally quantum mechanical in nature, and I don't think we need those kinds of things to do a good job of them.

But on the other hand, I really do think that quantum mechanics as a theory has interesting and different structure than classical probability theory.

And I've often wondered whether, by not making use of that structure, even if it's not manifest in the things that we want to predict, we're sort of leaving something on the table. And I do think that there's a modern analog to this.

Hopefully you guys have seen or been following all of this work on equivariant neural networks.

People have increasingly been able to build neural networks that explicitly reflect the structure of certain groups.

Because you might have data which is inherently geometric, like you have LIDAR or you actually are measuring points that move around in space or something. And so they build these networks which are sort of manifestly equivariant; they do the math carefully, they do the group theory, the representation theory, all this stuff.

And you can point to individual activations in this network and say: oh, this is the first irrep, this is the second irrep, and these are the different components of the irreducible representations of the group and how it acts on things.

And my understanding is, generically, they discover that even if their data is only a vector, if they allow intermediate layers which are tensor-like or scalar-like or even higher-order irreducible representations of the rotation group, and then maybe eventually only predict the scalar, they didn't need all of that to be there. That wasn't in the input, it's not in the output. But by putting it all there and letting the thing push some of the flops over to that part of the space, they get more efficient predictions. And I wish we had such an expert here to validate this, but this is my impression. And so in much the same way, even if the things that we're interested in making predictions about aren't manifestly quantum mechanical, if quantum mechanics provides additional structure beyond classical probability theory, I wonder whether, if we allowed our machines to operate in that space, we could make more efficient computations, even if we project it out at the end before we look at the answer. But that's just speculation.

So, if I may also speculate, yeah, so one way of using quantum mechanics is to kind of exploit the tunneling effect of course, right? And then there's this very promising area of quantum annealing that could at some point maybe be feasible and scalable.

If I would have to speculate, I could imagine that there might be another fully new revolution coming up which relates to discrete optimization.

And that could be completely unrelated to AI, right?

I mean, it might lead to a revival of whatever classical algorithms and solve a lot of other problems, right, for us.

But I don't see, like, an immediate selling point why quantum might be important for machine learning the way we currently use it.

But I'm not the best person to ask on that.

I just feel obliged to make the obvious point that as soon as we have useful quantum computers, quantum information theory will be useful for machine learning.

And I don't know when that is, but, like, probably some grad students in the room should get ready for that.

As Paul's well aware, there's also these things called quantum epsilon machines.

And I don't really know too much about them, except that you can basically...

If you guys remember from one of my slides, I took a two-state machine, and it turned into an epsilon machine that had an infinite number of states.

And quantum epsilon machines make them much smaller, and they have much lower complexity.

So from some perspective, yeah, I would agree.

But you definitely ask Paul if you want more details on that.

That's Paul.

Yeah, so I'm really excited by all this work which shows that being informationally optimal for some trade-off, bottleneck, or anything like that, can induce nice performance for machine learning algorithms, or that humans or animals actually operate near such optima, and so on.

But what I feel hasn't been studied a lot yet, maybe it's because I don't know it yet, is the reason why being so...

Being informationally optimal makes you...

The reason why there are organisms all along information trade-offs.

In other words, like, what is the structure which is induced by being informationally optimal?

And for instance, in Dani Bassett's presentation, this thing where for beta, which is neither zero nor infinity but something in between, in this case...

...identify the positive network.

It's like exactly the kind of things I have in mind, that being along the trade-off for nicely chosen values makes you extract some structure.

And my question is, are you aware of lines of research, let's say more on the theoretical side, which identify nice correspondences between the fact of being informationally optimal for some trade-offs, and inducing some topological or geometrical structure or something like that?

And if not, if there isn't, which tools do you think would be relevant in the direction?

Like, which mathematical tools, for instance?

I don't know, like, topology, information geometry, these kind of things.

Yeah, I don't have a really good answer to your question, but I do think that, again, for me, the tricky part is identifying what those two constraints are that produce the trade-off.

And that, to me, requires a conceptual understanding of the system and a deep connection to previous theories that have been developed for the system.

And so I think I'm constantly focused on that back end, rather than on, once you've figured that out, then look at the trade-off, and then identify the information optimal spot, and then see the structure.

That part seems straightforward.

I think it's the earlier part that always feels harder to me.

Yeah, sorry.

So, the way I like to think about why this information optimal stuff is happening at all is because I like to think about things in terms of reinforcement learning, but resource-rational reinforcement learning, sort of like resource-rational decision-making.

And I think everybody can agree that there are resources that organisms have to deal with, like a limited volume of the head.

They have to come out of their parents, of their mother, not their parents.

They have a limited amount of time to do stuff in.

They have limited energy.

They have limited materials.

And at the same time, they have to satisfy some utility function, or maximize some utility function, maybe.

Or maybe some version of that.

And I think basically what's going on is that those resources can be replaced profitably by a mutual information that has to do with some memory.

And the utility can be correlated with a mutual information that has something to do with predictive power.

And that's my answer.
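
One way to write that answer down as a single objective, in the spirit of the predictive information bottleneck (the exact variables here are placeholders, not anything stated on the panel):

```latex
% M is the organism's memory; "past" is its sensory history; "future"
% is what the utility depends on. These variables are placeholders.
\max_{p(m \mid \mathrm{past})}\;\; I(M; \mathrm{future}) \;-\; \beta\, I(M; \mathrm{past})
% I(M; future) stands in for predictive power (the utility side), and
% I(M; past) prices the resources: memory, volume, time, energy.
% Sweeping beta traces out exactly the kind of trade-off curve where
% organisms are found to sit.
```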

And then I guess just to add on, I think if you want references, I feel like Bill Bialek is sort of the king of this kind of philosophy.

Susanne Still has done a lot of work on the more theoretical side doing these kinds of things.

And then Stephanie Palmer has done experiments trying to measure predictive information in sensory neuron populations, showing that real neurons grown in a dish and shown movies seem to be operating near these kinds of information-theoretic optimal points.

And then they go beyond and in that paper, they're able to kind of try to work out what kinds of stimuli they're responding to and what kind of story that can tell about the encoding strategies that are being used and stuff like that.

So Bialek, Still, and Palmer seem to be good things to check out.

Oh, and Sarah, by the way.

Okay, so we have time for one more question.

So who wants to be the bouquet final?

I hope it will be a bouquet, not a burial bouquet, in the sense that maybe it's totally out of scope.

But, you know, we're talking about cognitive systems, right?

And I think a lot of information theory assumes that there is kind of an underlying probability distribution that the organism has to learn, and that you can approximate it with variational methods where you can, you know, put some bounds on it, because our computational cognitive resources are so different.

But I was kind of wondering what do you think about, you know, and not so much about like epistemic foraging or any kind of searching for more information to learn in the existing world, but really creativity.

I mean, is that in any way encompassed by, you know, this type of information theory?

Because, you know, in some sense you want to create some new probability distribution that you have never seen.

And then the question is, of course, people are going to take more generative methods to say it's creative, but they're not.

They're just exploring the existing space.

You can do transfer learning, and then you maybe switch from one representation to another.

It's still not creative.

So is this a fair question for information theory?

Why cognitive systems could be creative at least and, you know, and also conscious?

Maybe quantum is the conscious answer, but just creative in the physical world.

Okay, so I'll answer this by quoting some of my friends.

So one of them is Gautam, who works with me at the WMCAC Science Department.

I just want to challenge the idea that humans are super creative.

Like, they are creative enough, but he developed this game that you can play, and you can look at the strategies that people have over time.

And basically, apparently, they sort of stick with bad strategies for a really, really long time.

And it takes a while before they sort of get that creativity boost.

Once they do get that creativity boost, though... I would say Lavarshini might have done some work on creativity and information theory. I'm aware that he's done this work, but I don't know what he did, so I apologize for that.

Yeah, maybe that's an unpopular answer, but, I mean, after playing with ChatGPT and these generative models a lot, I have the impression they are creative.

I mean, they're very creative.

They're coming up with new sentences, new answers, new ideas.

They even inspired some of my scientific research.

The images they draw are, you know, yeah, I think they're creative, but it's probably a matter of definition.

I just want to plus-one that. I mean, people have now given GPT models all of the standard human tests of creativity, and they score highly; they're good. So, like, that doesn't mean there isn't some other notion of creativity that we haven't captured psychometrically, but I think, in order to answer the question, at least, you know, if creativity is a human thing, we would need to nail down what we mean by that a lot more before we can say that the models are not creative.

Okay, so let's thank the panelists, and thanks for the questions.
