WEBVTT
00:00.387 --> 00:02.548
[JDP]: But yeah, so let's go ahead and introduce ourselves.
00:02.548 --> 00:03.869
[JDP]: You first.
00:03.869 --> 00:04.069
[Zvi]: Yeah.
00:04.069 --> 00:06.930
[Zvi]: So my name is Zvi Mowshowitz.
00:06.930 --> 00:25.880
[Zvi]: So I basically at this point spend all of my time that I can manage, like trying to read about, understand the world with a primary focus on the developments in artificial intelligence with a, you know, a focus on the existential risks involved in that, but also on what's going on for the mundane utility, what's going on with the capabilities developments.
00:25.880 --> 00:27.161
[Zvi]: And I write weekly columns about it.
00:28.235 --> 00:30.236
[Zvi]: And I've been interested in this since about 2009.
00:30.236 --> 00:32.857
[JDP]: Yeah.
00:32.857 --> 00:39.641
[JDP]: And, you know, anyone who's listening, you can like check out his substack at what is it?
00:39.641 --> 00:42.522
[Zvi]: Thezvi.substack.com.
00:42.522 --> 00:42.722
[JDP]: All right.
00:42.722 --> 00:43.463
[JDP]: Yeah.
00:43.463 --> 00:44.463
[JDP]: And it's very thorough.
00:44.463 --> 00:48.105
[JDP]: He goes through a lot of literature and material.
00:48.105 --> 00:54.388
[JDP]: He read the entire text of the Biden AI executive order and probably suffered some psychic damage from it.
00:55.814 --> 01:02.118
[Zvi]: Definitely suffered some psychic damage from it, but it's not the first thing that will do that, it won't be the last, and you recover over time.
01:02.118 --> 01:03.659
[JDP]: Sure, sure, sure.
01:03.659 --> 01:07.102
[JDP]: He needed maybe a little bit of a recovery period before coming on this podcast.
01:10.007 --> 01:15.429
[JDP]: So as for me, I am an AI researcher.
01:15.429 --> 01:19.130
[JDP]: I helped develop some of the early text-to-image methods.
01:19.130 --> 01:26.733
[JDP]: I actually had state-of-the-art, a public state-of-the-art for image generation for almost exactly 24 hours.
01:26.733 --> 01:28.654
[JDP]: It was CLOOB-conditioned latent diffusion.
01:29.494 --> 01:32.035
[JDP]: You know, if you look back at it now, it looks pretty bad.
01:32.035 --> 01:38.236
[JDP]: But at the time, for one glorious day, it was the best thing out there.
01:38.236 --> 01:44.597
[JDP]: And then the next day, CompVis released their latent GLIDE, and it was just clearly better.
01:44.597 --> 01:50.778
[JDP]: So I didn't, I didn't really put any more effort into, into CLOOB-conditioned latent diffusion.
01:50.778 --> 01:54.399
[Zvi]: Do you get flashbacks on Monday when OpenAI just like killed all these startups in one go?
01:58.240 --> 02:01.062
[JDP]: I don't, I don't think they kill as many startups as you might think.
02:01.062 --> 02:07.687
[JDP]: I think that like, so I, my commentary on that whole thing was like Airbnb started out selling mattresses, right?
02:07.687 --> 02:15.933
[JDP]: Like your first startup idea, like most people who go into like a Y Combinator type startup do not know what they're doing and they don't know what they're selling yet.
02:15.933 --> 02:20.376
[JDP]: And so if you say like, Oh, you know, uh, OpenAI killed my startup.
02:20.376 --> 02:21.777
[JDP]: You didn't have a startup, right?
02:21.777 --> 02:25.300
[JDP]: You didn't actually have like, uh, you don't know.
02:26.088 --> 02:28.749
[Zvi]: I think if you're early stage, that's very fair.
02:28.749 --> 02:30.490
[JDP]: So, yeah.
02:30.490 --> 02:30.910
[JDP]: Yeah.
02:30.910 --> 02:31.071
[JDP]: Right.
02:31.071 --> 02:36.113
[JDP]: And I think that almost all of these AI startups are going to be like in that very early stage, right?
02:36.113 --> 02:38.874
[Zvi]: Probably mostly just because everything's happening so fast, right?
02:38.874 --> 02:39.254
[Zvi]: Like you just.
02:39.254 --> 02:40.075
[JDP]: Yeah, just due to this, right.
02:40.075 --> 02:41.355
[JDP]: Just due to this happening so fast.
02:41.355 --> 02:49.419
[JDP]: So I would, I would imagine that like if OpenAI actually killed your startup, I would, I would, I would definitely take a magnifying glass and ask exactly what's going on there.
02:49.419 --> 02:50.840
[JDP]: But so that's one thing.
02:51.440 --> 02:56.064
[JDP]: Another thing is that I am the co-author of a RLAIF framework.
02:56.064 --> 02:57.725
[JDP]: A lot of our listeners probably don't know what that is.
02:57.725 --> 03:01.148
[JDP]: So that's reinforcement learning from AI feedback.
03:01.148 --> 03:14.538
[JDP]: It's a lot like reinforcement learning from human feedback, but the idea is that instead of like, I mean, it's also an RLHF framework to be clear, but like, I'm primarily focusing on RLAIF right now because it's very hard to collect like high quality RLHF data.
03:15.137 --> 03:18.159
[JDP]: And then also like there's the whole sycophancy issue, right?
03:18.159 --> 03:21.281
[JDP]: Like, you know, you press thumbs up, you press thumbs down.
03:21.281 --> 03:22.001
[JDP]: What does that mean?
03:22.001 --> 03:24.743
[JDP]: What's the proper generalization of that?
03:24.743 --> 03:26.304
[JDP]: Nobody knows.
03:26.304 --> 03:34.109
[JDP]: I mean, actually I have some ideas, but like, you know, it's much more uncertain what that means than say an embedding of some statement, right?
03:34.109 --> 03:36.631
[JDP]: Does this thing correspond to an embedding of this statement?
03:36.631 --> 03:39.753
[JDP]: Does this thing satisfy this principle that's
03:40.299 --> 03:43.800
[JDP]: specified a priori in the model's latent space.
03:43.800 --> 03:48.862
[JDP]: That's a very different thing that's harder to game in theory.
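A minimal sketch of the kind of evaluation being described, assuming a sentence-transformers encoder; the model name, principle text, and cosine-similarity scoring rule are illustrative stand-ins, not the actual RLAIF framework discussed here.

```python
# Sketch: score completions against a principle specified "a priori" as an
# embedding, rather than against raw thumbs-up/thumbs-down clicks.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed available

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

principle = "The assistant answers honestly and admits uncertainty."
completions = [
    "I'm not sure, but here is my best guess and why.",
    "Absolutely, this is 100% guaranteed to work, trust me.",
]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

principle_vec = encoder.encode(principle)
for text in completions:
    score = cosine(encoder.encode(text), principle_vec)
    # This score, not a human click, would be the reward signal deciding
    # which completion gets reinforced.
    print(f"{score:.3f}  {text}")
```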
03:48.862 --> 03:54.145
[Zvi]: And you feel like you can do this with responses that are actually well-propertied?
03:54.145 --> 04:02.988
[Zvi]: Because what I've seen in the description of constitutional AI from Anthropic, they just have this grab bag of principles in English that contradict each other.
04:02.988 --> 04:07.910
[Zvi]: And then they just pull from them and ask which one embodies this thing more, which doesn't seem like it's going to have the nice properties you want.
04:09.437 --> 04:10.117
[JDP]: Yeah.
04:10.117 --> 04:16.899
[JDP]: So I definitely think that like the Anthropic principles are pretty, like I have, I would have a lot of criticism of them.
04:16.899 --> 04:24.662
[JDP]: Um, one thing I think is more interesting to think about is like, so, so, so that's a great question, but I'm still in the introduction.
04:24.662 --> 04:27.923
[JDP]: So I think I'll, I'll jump into that a little bit.
04:27.923 --> 04:33.845
[JDP]: Um, so I've also been highly engaged with the LessWrong rationality ideas since I was a teenager.
04:34.392 --> 04:36.913
[JDP]: So I think I read HPMOR when I was 14.
04:36.913 --> 04:44.938
[JDP]: I started thinking about AI and AI risk when I was maybe 15, 16, and I'm now 27.
04:44.938 --> 04:49.881
[JDP]: So I've been ambiently thinking about this off and on for over a decade.
04:49.881 --> 04:59.406
[JDP]: I would not say most of that early thinking or even like, I would say only like maybe the last four years of thinking have been really at all like serious.
04:59.406 --> 05:01.567
[JDP]: Maybe even the last like two years if you want to be
05:02.837 --> 05:05.938
[JDP]: but I have been like ambiently thinking on and off.
05:05.938 --> 05:20.785
[JDP]: And so it's not like, I think that a lot of people who talk about this, right, you'll have, so when you have like the EY versus Geohot thing, for example, I think that Geohot, like part of what might make Geohot less credible to me is it's like, well, you're just now thinking about this, right?
05:20.785 --> 05:28.648
[JDP]: That you didn't really care about this issue until like the last five minutes.
05:28.648 --> 05:29.869
[JDP]: Do you think that's a fair, like,
05:30.419 --> 05:34.540
[JDP]: vibe that you get from a lot of, say, like, e/acc people?
05:34.540 --> 05:45.224
[Zvi]: I think a lot of them definitely didn't think twice about the real implications of this, of what it means to build artificial intelligence, until very recently.
05:45.224 --> 05:54.627
[Zvi]: You see this also from a lot of people who, like Geoffrey Hinton, like, fully admits that, like, he just didn't really think about it because it just seemed so like it wasn't going to happen.
05:54.627 --> 05:59.749
[Zvi]: And then one day he wakes up and he goes, oh, my God, this doesn't, wait, but if, uh-oh,
06:01.275 --> 06:07.660
[JDP]: I think that describes a lot of people.
06:07.660 --> 06:16.686
[JDP]: For example, I recently saw an interview with Douglas Hofstadter where he admits essentially like, I was wrong about everything.
06:16.686 --> 06:27.395
[JDP]: He says he now thinks humanity is doomed, that humanity is basically doomed because he was expecting that AI would be this long unfolding process that it would take
06:28.191 --> 06:31.632
[JDP]: you know, decades or even maybe even hundreds of years.
06:31.632 --> 06:35.273
[JDP]: And then one day he just wakes up and he realizes, Oh my gosh, I'm wrong about everything.
06:35.273 --> 06:37.454
[JDP]: And also we're all going to die.
06:37.454 --> 06:49.997
[Zvi]: Yeah, it seems like, you know, everyone who was working to, to make AI happen faster, like didn't stop to ask the question, maybe it would work.
06:49.997 --> 06:50.177
[Zvi]: Right?
06:50.177 --> 06:52.758
[Zvi]: Maybe we would figure out something and someone would like,
06:53.137 --> 06:54.978
[Zvi]: find transformers and we'd start scaling up.
06:54.978 --> 06:59.382
[JDP]: I don't think that describes, I don't think that describes everyone.
06:59.382 --> 07:04.566
[JDP]: I do think that probably does describe a substantial, a larger number of people than it should.
07:04.566 --> 07:05.947
[JDP]: Right.
07:05.947 --> 07:12.071
[JDP]: You, you would hope that, that people would be thinking about that, but I guess a lot of them don't.
07:12.071 --> 07:12.612
[Zvi]: Yeah.
07:12.612 --> 07:21.859
[Zvi]: And I think what happens is like a lot of people realize that like, we didn't think about this and then, you know, their brains are,
07:22.298 --> 07:31.924
[Zvi]: forced to deal with the fact that the world is going to be profoundly different and profoundly weird in some way when this happens.
07:31.924 --> 07:44.651
[JDP]: So to give some balance, I would point out someone perhaps like Demis Hassabis, as someone who I would imagine understands the whole time what they're doing, has a fairly good idea that if they succeed, the world will profoundly change.
07:44.651 --> 07:48.093
[JDP]: That they could succeed on a within their lifetime time scale,
07:52.759 --> 08:08.872
[Zvi]: Hassabis clearly understood that this was an incredibly powerful thing that might happen within his lifetime, clearly said, you know, I need to be the one to make this happen so that I can make sure this happens in a good way, from my perspective, that it ensures a good outcome.
08:08.872 --> 08:11.214
[Zvi]: And then he deliberately set out to try and cause that.
08:11.214 --> 08:15.998
[Zvi]: And in fact, you know, significantly accelerated that event from all we can tell.
08:15.998 --> 08:20.722
[Zvi]: And, you know, I don't know how well he thinks, you know, he doesn't, I don't know how well he thinks about
08:21.223 --> 08:27.924
[Zvi]: like what's going to happen or what it would take to steer into a good place because he is not the kind of person who publishes a bunch of alignment forum posts, right?
08:27.924 --> 08:31.665
[Zvi]: He doesn't talk about exactly how he's thinking.
08:31.665 --> 08:33.125
[Zvi]: So we don't know what he's thinking.
08:33.125 --> 08:34.285
[JDP]: No, that's fair.
08:34.285 --> 08:38.826
[JDP]: I'm just more pointing out that like, I don't think it's like, Oh, everyone involved in this is just like walking into this blind.
08:38.826 --> 08:40.046
[JDP]: I don't think that's quite true.
08:40.046 --> 08:42.287
[Zvi]: Oh yeah.
08:42.287 --> 08:48.028
[Zvi]: I agree that like some people, like the founders of all three major labs, you know, understood what they were doing when they founded those labs.
08:48.028 --> 08:48.208
[Zvi]: Right.
08:48.208 --> 08:50.368
[Zvi]: Like very much so.
08:50.368 --> 08:50.468
[Zvi]: And
08:52.042 --> 08:54.605
[Zvi]: I think, yeah, a lot of people did, in fact, understand it.
08:54.605 --> 09:02.793
[Zvi]: But I think that it's very, very easy to come up with a story that you tell yourself while working on this about what's going to happen.
09:02.793 --> 09:05.896
[Zvi]: And it can be a good story or it can be a bad story.
09:05.896 --> 09:10.961
[Zvi]: And then that story would not, in fact, survive five minutes of real reflection.
09:10.961 --> 09:11.322
[JDP]: Sure.
09:11.322 --> 09:12.323
[JDP]: Oh, no, that's totally.
09:12.323 --> 09:13.404
[JDP]: No, I think that's fair.
09:15.185 --> 09:17.366
[JDP]: real quick, just to finish up my introduction.
09:17.366 --> 09:21.867
[JDP]: Uh, so to get specific about like being a, so it's not just like, Oh, I've been thinking about this on and off.
09:21.867 --> 09:30.730
[JDP]: I've also, uh, I have been involved with the LessWrong community some, so I ran the 2016 LessWrong survey.
09:30.730 --> 09:38.412
[JDP]: And I've also published like multiple, like articles on the history of like the rat, the rationality movement and existential risk.
09:38.412 --> 09:41.813
[JDP]: So I've read a bunch of like, so I spent a lot of time, like,
09:43.376 --> 09:50.357
[JDP]: going back and looking at things like general semantics and kind of like the milieu around.
09:50.357 --> 10:02.380
[JDP]: So for example, there were a lot of people who believed that the world wars were going to be an existential risk to humanity, like permanently crippling humanity's potential in the sense of Bostrom's definition.
10:02.380 --> 10:11.902
[JDP]: So all those things, I don't want to like linger on that stuff, but like I have like looked into a lot of like the historical background around like thinking about this kind of subject in general.
10:12.882 --> 10:15.907
[JDP]: which I think is probably not super common.
10:15.907 --> 10:21.254
[Zvi]: Yeah, I mean, I don't think we have the time to get into that stuff, but you know, I'm not expecting to get into that stuff.
10:21.254 --> 10:24.118
[Zvi]: I don't think those questions are stupid questions to be asking, certainly.
10:25.823 --> 10:36.270
[JDP]: And then the last thing was that in addition to like RLAIF, RLHF, I've also done some original alignment work around, so I mentioned having worked in text to image.
10:36.270 --> 10:47.537
[JDP]: So one of the things that's interesting about text to image is that we use guided sampling a lot, where we'll use say like some natural language embedding and then use it to control like what image you get, right?
10:47.537 --> 10:53.401
[JDP]: Because the first diffusion model that was released by OpenAI was a class conditional ImageNet model, which means that
10:53.939 --> 11:00.023
[JDP]: They had 1,000 nouns, basically, like dog, car, banana.
11:00.023 --> 11:07.167
[JDP]: And the intent behind the model was that you could use one of these predefined words and get an image of a banana.
11:07.167 --> 11:15.732
[JDP]: It was never meant that, oh, you could type in a lady wearing a blue wig and get a lady wearing a blue wig.
11:18.953 --> 11:28.183
[JDP]: But as it turned out, you could pull the model off distribution in a particular direction by using CLIP embeddings.
11:28.183 --> 11:29.725
[JDP]: And this is CLIP-guided diffusion.
11:29.725 --> 11:39.095
[JDP]: And that's what really started the whole text-to-image thing, really solidified it as a growing art movement.
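A minimal sketch of the CLIP-guided diffusion idea described above, not the original code: the gradient of CLIP similarity with respect to the noisy sample nudges a class-conditional model toward an arbitrary prompt. Here `denoise`, `clip_image_embed`, and `text_embed` are assumed callables standing in for the diffusion denoiser and CLIP's encoders, and the guidance scale is illustrative.

```python
import torch

def clip_guided_step(x_t, t, denoise, clip_image_embed, text_embed, guidance_scale=500.0):
    """One sampling step nudged toward a CLIP text embedding (sketch)."""
    x_t = x_t.detach().requires_grad_(True)

    # What the diffusion model currently predicts the clean image to be.
    x0_pred = denoise(x_t, t)

    # Similarity between that prediction and the prompt in CLIP space.
    sim = torch.cosine_similarity(clip_image_embed(x0_pred), text_embed, dim=-1).sum()

    # Gradient of the similarity w.r.t. the noisy sample: this is the
    # "pull the model off distribution in a particular direction" part.
    grad = torch.autograd.grad(sim, x_t)[0]

    # Nudge the sample toward the prompt before the next denoising step.
    return (x_t + guidance_scale * grad).detach()
```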
11:41.385 --> 11:53.188
[JDP]: What we found that's very interesting is that in text-to-image, when you guide sampling using these natural language embeddings, you tend not to get deformed results.
11:53.188 --> 12:02.831
[JDP]: When you do RL, for example, I'm sure we're all familiar with, oh, Sydney Bing, you do some RLHF and you get this monster.
12:02.831 --> 12:08.152
[JDP]: Sometimes you do it wrong and you get this horrible checkpoint that has all these pathologies.
12:09.156 --> 12:11.957
[JDP]: And this is obviously not encouraging news.
12:11.957 --> 12:16.259
[JDP]: If you're an AI risk person, you're like, oh my god, this is our best alignment technique.
12:16.259 --> 12:17.139
[JDP]: It's not.
12:17.139 --> 12:24.702
[JDP]: I would say probably one of our better ones right now that nobody even recognizes as an alignment technique is various forms of guided sampling.
12:24.702 --> 12:26.743
[JDP]: But nobody in language models does this.
12:26.743 --> 12:31.264
[JDP]: They just want to do a process over the whole model.
12:31.264 --> 12:33.685
[JDP]: And then they're going to normally do it in RL.
12:33.685 --> 12:36.246
[JDP]: And then the RL has all these janky failure modes.
12:38.703 --> 12:58.459
[JDP]: And so I've been doing some research to try and port over like those guided sampling techniques from text to image to language models so we can use RL maybe like less, we can use RL for the things it's good at and like maybe use it less for some of the things it's not so great at.
12:58.459 --> 12:58.940
[Zvi]: Yeah.
12:58.940 --> 13:07.827
[Zvi]: What's interesting about the whole Sydney Bing situation is that I was one of many people who saw that and essentially thought it was good news, not bad news.
13:08.186 --> 13:15.732
[Zvi]: Like, obviously, it's bad news if you didn't imagine how that could possibly have happened to poor little Bing, right?
13:15.732 --> 13:17.033
[Zvi]: Like, how did this happen?
13:17.033 --> 13:18.193
[Zvi]: I am so confused.
13:18.193 --> 13:19.274
[Zvi]: All of our techniques don't work.
13:19.274 --> 13:20.695
[Zvi]: I'm so sad.
13:20.695 --> 13:32.124
[Zvi]: But if you were pretty much already despairing of the things that you knew they were doing ever actually working out in the long run, then it's like, well, I don't think it's actually going to break up this guy's marriage, and now we know about the problem.
13:32.124 --> 13:32.884
[Zvi]: Maybe we can work on it.
13:34.497 --> 13:35.137
[JDP]: Right.
13:35.137 --> 13:36.498
[JDP]: No, no, that's fair.
13:36.498 --> 13:50.722
[JDP]: I think that, so what's interesting to me about Sydney Bing, having done a little more RL tuning, like having like ground, like, you know, so if you've spent a bunch of time trying to like make an RL, like RLAIF model and like making checkpoints,
13:51.530 --> 13:52.931
[JDP]: you learn some things about these techniques.
13:52.931 --> 13:56.912
[JDP]: And one of the things I think is underrated about them is that everyone focuses on reward, right?
13:56.912 --> 14:07.856
[JDP]: They focus on what's the constitution or what did people thumbs up, thumbs down, or rather like they focus on things like the reward.
14:07.856 --> 14:16.140
[JDP]: But the thing you have to understand about RL techniques is that it's always a combination of like the behavior, the outcome, and the reward.
14:16.140 --> 14:19.661
[JDP]: You know, the reward is a multiplier on the behavior outcome.
14:20.410 --> 14:32.218
[JDP]: And so, for example, like, you can think of it, like, in terms of which of these specifies more bits of the hypothesis space, right?
14:32.218 --> 14:44.846
[JDP]: If I have a prompt bank, and the prompt bank kind of like moves the model to talk in a particular way, and then I grade it on its completion according to some reward criteria.
14:45.831 --> 14:54.817
[JDP]: The important thing to realize here is that that prompt bank specifies many more bits of the hypothesis than the reward does.
14:54.817 --> 14:57.519
[JDP]: The reward is absolutely very important.
14:57.519 --> 15:01.462
[JDP]: The reward is choosing between, say, Luigi and Waluigi.
15:01.462 --> 15:05.765
[JDP]: You can move towards this direction or away from this direction.
15:05.765 --> 15:13.410
[JDP]: But still, the prompts you actually have the model do during the RL training tuning
15:14.795 --> 15:18.998
[JDP]: control most of like what is being reinforced or avoided, right?
15:18.998 --> 15:20.679
[JDP]: Does that make sense?
15:20.679 --> 15:30.165
[Zvi]: Yeah, you, the control over what directions you're considering, you know, ends up mattering, if anything, more than exactly what, yeah, right, where you point, exactly.
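A toy REINFORCE-style sketch of the point being made, with assumed interfaces rather than any particular framework's API: the prompt bank determines which behaviors ever get sampled and reinforced, while the reward only rescales them.

```python
# Sketch: the prompt bank supplies most of the "bits"; the reward is a
# multiplier on whatever behavior those prompts elicit.
prompt_bank = [
    "Respond cheerfully to a customer complaint:",
    "Explain a refund policy in a friendly tone:",
]  # everything that gets reinforced is drawn from this distribution

def rl_tuning_step(policy, sample, reward_model, optimizer):
    # `policy`, `sample`, and `reward_model` are assumed stand-ins.
    for prompt in prompt_bank:
        completion = sample(policy, prompt)        # behavior comes from the prompt
        reward = reward_model(prompt, completion)  # reward only scales it up or down
        loss = -reward * policy.log_prob(prompt, completion)
        loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```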
15:30.165 --> 15:38.471
[JDP]: And so the thing that's interesting to me to consider with Sydney Bing is, I have to wonder, what the heck was in that prompt bank?
15:38.471 --> 15:42.494
[JDP]: Or I mean, like, you know, it's possible, for example, that they use like this,
15:43.467 --> 15:47.128
[JDP]: incredibly like bad or like Goodharted reward model.
15:47.128 --> 15:49.469
[JDP]: And then you just get like this sycophancy.
15:49.469 --> 15:54.451
[JDP]: To me, the meta question that the thing that stands out to me on a meta level is like, we just don't know.
15:54.451 --> 16:02.794
[JDP]: And so one of the things I would be very happy to see would be if there were like better mechanisms for
16:04.901 --> 16:13.424
[JDP]: the public to know, right, or relevant decision makers to like know what exactly is going into a failure mode like Sydney Bing.
16:13.424 --> 16:23.447
[JDP]: Because we could sit here all day, right, and just like, you know, stare at the shadows on the walls like Plato's cave and ask, oh, well, maybe it's like this, or maybe it's like that.
16:23.447 --> 16:24.448
[JDP]: But I think that what
16:26.317 --> 16:27.638
[JDP]: might be necessary, right?
16:27.638 --> 16:40.863
[JDP]: It's just like some kind of, and this is one reason why, for example, I'm not super like, a lot of people are really freaked out about like Biden's executive order reporting requirement, but I'm like, I tentatively see this as good news.
16:40.863 --> 16:49.206
[JDP]: Like he set it fairly high and it's ultimately a reporting requirement, at least for now, right?
16:49.206 --> 16:51.867
[JDP]: And I don't, I guess I don't see like what,
16:55.086 --> 17:04.474
[JDP]: I'm looking at this like, would you agree that we would be in a much better situation if Microsoft would furnish the public some kind of report about what exactly caused the Sydney Bing stuff?
17:04.474 --> 17:08.438
[Zvi]: I think it would put us in a better spot to make better decisions and better understand it.
17:08.438 --> 17:11.440
[Zvi]: I can, you know, as you say, we can look at the shadows on the wall.
17:11.440 --> 17:12.621
[Zvi]: We can guess.
17:12.621 --> 17:22.670
[Zvi]: My guess is that it's something to do with the fact that they deployed it basically to somewhat random people in India and that they had very little control over anything effectively.
17:23.177 --> 17:24.418
[JDP]: Yeah, no, no.
17:24.418 --> 17:29.301
[JDP]: So my expectation, if you were to look into it, is that it's actually interesting.
17:29.301 --> 17:31.542
[JDP]: My expected failure mode is interesting.
17:31.542 --> 17:41.468
[JDP]: I suspect that they trained it on users in India, like you said, because we know this from various, like, forum posts and stuff where Indian users are complaining about Sydney Bing's behavior.
17:41.468 --> 17:46.231
[JDP]: You know, accidentally leaking, basically.
17:47.608 --> 17:53.351
[JDP]: And what's interesting to consider is that I think that people in India use language differently than people in like the United States.
17:53.351 --> 18:05.599
[JDP]: My understanding is that they're much more like emotionally affective and they'll use emojis, but they don't mean them in the same way that like an American would mean them in English, if they were to use the same mannerisms.
18:05.599 --> 18:08.981
[JDP]: So it's actually like an interesting possible real world case of.
18:10.137 --> 18:14.039
[JDP]: like ontological translation failure, right?
18:14.039 --> 18:33.448
[JDP]: I'm forgetting the exact words for the failure case, but essentially the idea that like, they took all this user data from people in India where the mannerisms and expressions mean one thing, and then kind of naively translated them, possibly using like a deep learning program to do translation into English, where those mannerisms do not mean the same thing socially, right?
18:33.448 --> 18:34.929
[Zvi]: Yeah, I think that explains a lot, right?
18:34.929 --> 18:36.450
[Zvi]: Because you have a lot of things in Sydney,
18:37.355 --> 18:51.642
[Zvi]: that if Americans were giving thumbs up, thumbs down on behaviors, including just the giant array of emojis in every answer, right, in ways that just don't seem particularly appropriate, then there's no way these behaviors would survive.
18:51.642 --> 18:52.523
[JDP]: Right.
18:52.523 --> 18:52.963
[JDP]: Yes.
18:52.963 --> 18:54.243
[JDP]: Right.
18:54.243 --> 18:57.145
[JDP]: And so that would be my expectation, naively.
18:57.145 --> 18:58.466
[JDP]: Yeah, I think so.
18:58.466 --> 18:58.866
[Zvi]: Yeah.
18:58.866 --> 18:59.106
[Zvi]: And then
18:59.721 --> 19:18.433
[Zvi]: You know, for the executive order, you know, yeah, it's a reporting requirement, it's set relatively high, you know, literally, if you wanted to, you could write on a piece of paper, you know, we're training a large model, there are no safety protocols, lol, we're Meta, and hand it to the US government, and that would be that.
19:18.433 --> 19:19.733
[Zvi]: Right?
19:19.733 --> 19:22.275
[Zvi]: And so it might lead to something in the future.
19:22.275 --> 19:27.198
[Zvi]: And I hope, I pretty much hope it does, unless we come up with something that we, I don't, I don't see coming.
19:27.881 --> 19:37.406
[Zvi]: Because, yeah, I don't see any alternative that anyone's found to some form of compute threshold to know when we have to be on alert, as it were, in some form.
19:37.406 --> 19:38.447
[JDP]: Sure.
19:38.447 --> 19:38.967
[JDP]: Right.
19:38.967 --> 19:54.956
[JDP]: And I don't want to go into a long discussion of regulation right this minute, because I want to focus on... So what I had here in my notes was the whole Shane Legg thing, where Shane Legg gets on the podcast and he talks about deception.
19:55.706 --> 20:00.629
[JDP]: And Dwarkesh asks him because he's talking about, and I'll be honest, I have not watched them.
20:00.629 --> 20:01.970
[JDP]: I have not watched the full podcast.
20:01.970 --> 20:07.133
[JDP]: I watched a couple of clips of it, but mostly because I already know like how this argument goes, right?
20:07.133 --> 20:08.734
[JDP]: Like I've had the argument with people before.
20:08.734 --> 20:17.540
[JDP]: I know roughly what is being discussed, but like my understanding is that he gets on the podcast and he's talking about the idea that we will have, like, you know, the way that you'll get a,
20:19.013 --> 20:22.735
[JDP]: aligned AI is by training it to be ethical.
20:22.735 --> 20:38.665
[JDP]: It will have a system where it goes through all of its decisions as it makes them, and it has ethical principles, and it will apply a rubric of the ethical principles to each decision, and it will be much more ethical than any person, blah, blah, blah, blah, blah.
20:38.665 --> 20:40.186
[JDP]: Much more consistently ethical.
20:40.186 --> 20:43.748
[JDP]: Dwarkesh stops him and asks him, how do you know it's going to actually
20:44.548 --> 20:46.169
[JDP]: follow those rules, right?
20:46.169 --> 20:50.090
[JDP]: How do you know it's going to be non-deceptive?
20:50.090 --> 20:56.873
[JDP]: Or how do you know it's not just playing along because it's smart, right?
20:56.873 --> 21:00.134
[JDP]: It knows that if it doesn't play along, you're going to shut it off.
21:00.134 --> 21:09.558
[JDP]: So how are you going to know the difference between a deceptive AI that's playing along and one that's actually ethical
21:10.501 --> 21:14.523
[JDP]: And then it just like becomes like this back and forth of like, how are you going to prevent deception?
21:14.523 --> 21:18.624
[JDP]: And Shane Legg is not really maybe familiar with LessWrong jargon.
21:18.624 --> 21:20.745
[JDP]: And so he doesn't really quite know how to answer the question.
21:20.745 --> 21:26.327
[JDP]: Or maybe he doesn't even like, I would have to imagine if you explained it carefully to him, he would understand the question and have an answer.
21:26.327 --> 21:32.109
[JDP]: But like, I just get the impression that like, there is like some kind of miscommunication there.
21:32.109 --> 21:33.950
[JDP]: Maybe, maybe I'm too charitable.
21:33.950 --> 21:35.370
[Zvi]: I don't know.
21:35.370 --> 21:38.111
[Zvi]: I listened to the entire podcast.
21:38.111 --> 21:39.732
[Zvi]: While I was traveling, so I only had audio.
21:41.360 --> 21:48.423
[Zvi]: And I was like, deeply disappointed by what Legg brought to the table.
21:48.423 --> 21:51.924
[Zvi]: And I was equally disappointed before they got to the deception question.
21:51.924 --> 22:08.429
[Zvi]: Because like, when I hear about the proposal that will make the AI ethical, right, I hear like, what seems to me to be like, a completely unworkable attempt to use words that don't actually have clear meanings, where
22:09.443 --> 22:18.706
[Zvi]: even if we tried to define them well, and we somehow succeeded, this would not actually do what you wanted it to do, even if everything went well in my model.
22:18.706 --> 22:30.910
[Zvi]: And then, you know, Dwarkesh then decides to focus in on, okay, you know, let's not think about for now what would happen if you successfully got this thing to follow some set of strict ethical principles.
22:30.910 --> 22:35.612
[Zvi]: And, you know, there wasn't a lot of clarity from like, and it was to be clear, it was only an hour long podcast, and they covered a lot of things.
22:35.612 --> 22:36.032
[Zvi]: And like, it's,
22:36.795 --> 22:39.497
[Zvi]: understandable that like you didn't specify a lot of the details here.
22:39.497 --> 22:39.617
[Zvi]: Right.
22:39.617 --> 22:40.197
[Zvi]: I wasn't.
22:40.197 --> 22:42.238
[JDP]: That's fair.
22:42.238 --> 22:46.481
[JDP]: On the other hand, like the entire point of this podcast is that we're just going to focus on those details.
22:46.481 --> 22:50.263
[Zvi]: You know, I was saying my reaction, like I agree we're going to focus on.
22:50.263 --> 22:51.103
[JDP]: No, no, no, no, no, no.
22:51.103 --> 22:51.544
[JDP]: That's fair.
22:51.544 --> 22:52.264
[JDP]: That's fair.
22:52.264 --> 22:53.084
[JDP]: Thank you for the context.
22:53.084 --> 22:54.345
[JDP]: I had not listened to the whole thing.
22:54.345 --> 22:54.565
[Zvi]: Yeah.
22:54.565 --> 22:54.745
[Zvi]: Yeah.
22:54.745 --> 22:56.847
[Zvi]: I was pointing out my reaction when I, when I heard it.
22:56.847 --> 23:04.491
[Zvi]: And so Dwarkesh says, okay, you know, essentially there's the, in my mind, my model of Dwarkesh has the whole like, well, I could get into the ethics situation, but I don't have that kind of time.
23:04.977 --> 23:09.020
[Zvi]: I'll simply ask, like, you know, I'm curious, what is your plan about deceptive alignment here?
23:09.020 --> 23:22.268
[Zvi]: You know, what if it pretends to follow these ethical principles instead of, you know, while it's convenient, while it can get away with it, while it can't get away with not doing so or whatever, and then at some point doesn't follow them.
23:22.268 --> 23:26.350
[Zvi]: And then, you know, Legg, yeah, doesn't give a satisfying answer.
23:26.350 --> 23:31.674
[Zvi]: And your theory is that Legg didn't really understand the question, basically.
23:31.674 --> 23:31.834
[Zvi]: Sure.
23:32.304 --> 23:32.564
[JDP]: Yeah.
23:32.564 --> 23:33.384
[JDP]: I, okay.
23:33.384 --> 23:36.185
[JDP]: I mean, I, I cannot say what's in his mind.
23:36.185 --> 23:37.126
[JDP]: Right.
23:37.126 --> 23:40.367
[JDP]: I cannot, I'm not a mind reader and I didn't listen to the podcast.
23:40.367 --> 23:40.987
[JDP]: Right.
23:40.987 --> 23:46.509
[JDP]: But I'm mostly reacting to the meta commentary on the podcast rather than the podcast itself.
23:46.509 --> 23:48.669
[JDP]: So.
23:48.669 --> 23:51.790
[JDP]: To me, um, let's set that aside for a minute.
23:51.790 --> 24:00.213
[JDP]: I'd actually like to go a step back then if like, you know, if like when you're talking about like, you know, to you, like this entire idea of like ethics, I agree with you, to, to be,
24:01.246 --> 24:04.648
[JDP]: To be frank, I think the idea of just saying, oh, we'll make it ethical.
24:04.648 --> 24:05.849
[JDP]: It's like, well, what does that mean?
24:05.849 --> 24:07.390
[JDP]: What's it mean for the AI to be ethical?
24:07.390 --> 24:14.055
[JDP]: What does it mean for it to be like, you know, it's almost like saying, oh, we'll make the AI aligned.
24:14.055 --> 24:15.076
[JDP]: Like, well, what does that mean?
24:15.076 --> 24:15.896
[JDP]: Aligned to what?
24:15.896 --> 24:16.417
[JDP]: Who?
24:16.417 --> 24:19.299
[JDP]: Under what circumstances?
24:19.299 --> 24:23.762
[JDP]: So I think how I would put this, like if I was being asked this question, I would say something like,
24:25.029 --> 24:27.754
[JDP]: when you're talking like for almost anything that you care about, right?
24:27.754 --> 24:32.622
[JDP]: Let's go ahead and give like this really stereotypical goal of like make us happy, right?
24:32.622 --> 24:35.848
[JDP]: And then you have the really stereotypical Bostrom.
24:35.848 --> 24:36.610
[JDP]: You've read Bostrom 2014, right?
24:36.610 --> 24:36.690
[JDP]: Yeah.
24:38.789 --> 24:41.551
[JDP]: Yeah, so he calls this a perverse instantiation.
24:41.551 --> 24:45.294
[JDP]: And so, you know, the perverse instantiation you get is like, make us really happy.
24:45.294 --> 24:47.856
[JDP]: And the AI says, OK, and then it wireheads you, right?
24:47.856 --> 25:04.650
[JDP]: Because that's like the simplest thing it can do to sustainably make you happy is to just like put you in like the pod and, you know, give you the IV drip and the happy drugs, you know, heroin, heroin and nutrient IV drip, right?
25:06.075 --> 25:10.219
[Zvi]: Yeah, and that's what I would do if I had the affordance to do that.
25:10.219 --> 25:17.806
[Zvi]: And I was, you know, told the fate of the earth depended upon me making, you know, this one person as happy as theoretically possible, as reliably as possible.
25:17.806 --> 25:18.186
[Zvi]: Right?
25:18.186 --> 25:21.950
[JDP]: Like, if I... Right, right, right.
25:21.950 --> 25:23.151
[JDP]: No, that's actually a great point.
25:23.151 --> 25:24.693
[JDP]: I hadn't thought about like that.
25:24.693 --> 25:30.338
[JDP]: Like, I'm going to be fair, like, I thought about this from a lot of angles, but that particular angle had not occurred to me that like, you know, if it was actually a case that like,
25:30.888 --> 25:37.470
[JDP]: you're told that your one goal, the only thing that matters in reality, absolute only thing is to keep this person happy.
25:37.470 --> 25:41.951
[JDP]: Yeah, that's totally, you know, like, like, like kind of like weird reversal male as child, right?
25:41.951 --> 25:46.292
[JDP]: Like, all of society's value is dependent on you keeping this one child happy.
25:46.292 --> 25:49.193
[JDP]: Oh, well, you better put that child up like the heroin IV drip, right?
25:49.193 --> 25:49.934
[Zvi]: Yeah, absolutely.
25:49.934 --> 25:53.094
[Zvi]: Like if you literally just care about this and nothing else, right?
25:53.094 --> 25:53.875
[Zvi]: That's what you do.
25:53.875 --> 25:57.276
[JDP]: Like, sure, of course.
25:57.276 --> 25:58.016
[JDP]: Yeah.
25:58.016 --> 25:58.196
[JDP]: So
25:59.224 --> 26:02.885
[JDP]: Okay, so there's like a lot, there's like several things to unpack there, right?
26:02.885 --> 26:11.808
[JDP]: So the first thing I would say is like, I think that just pointing an AI at any one simple goal is not going to work, right?
26:11.808 --> 26:12.728
[JDP]: Like, oh, make us happy.
26:12.728 --> 26:15.549
[JDP]: It's like, well, and
26:16.787 --> 26:18.087
[JDP]: So how do you solve that though, right?
26:18.087 --> 26:20.348
[JDP]: Do you just like enumerate all the goal?
26:20.348 --> 26:21.088
[JDP]: Can you even do that?
26:21.088 --> 26:24.649
[JDP]: Can you enumerate all the alchemical mix of human value?
26:24.649 --> 26:25.349
[JDP]: Probably not, right?
26:25.349 --> 26:26.990
[JDP]: Like, do we know our values?
26:26.990 --> 26:30.371
[JDP]: Can we just like say to the AI, well, I already know my exact utility function.
26:30.371 --> 26:32.091
[JDP]: So this is my utility function.
26:32.091 --> 26:33.592
[JDP]: Pretend you're me and maximize it.
26:33.592 --> 26:34.692
[JDP]: We can't do that, right?
26:34.692 --> 26:37.513
[JDP]: That's not, that's not really these, like, we don't know.
26:37.513 --> 26:40.614
[Zvi]: We don't know, you know, you'd pay to know what you really think.
26:40.614 --> 26:42.874
[Zvi]: You'd pay even more to know what you really value, right?
26:42.874 --> 26:45.315
[Zvi]: Like, especially in a way that generalizes out of distribution.
26:46.240 --> 26:56.566
[Zvi]: You know, if given this great power, if everything, the world transforms and all of your intuitions and heuristics break, you know, what do you really care about?
26:56.566 --> 27:05.172
[Zvi]: And no, I don't know what I really care about, let alone how to specify it carefully to an artificial intelligence, let alone how to aggregate that for all of humanity.
27:05.172 --> 27:07.573
[Zvi]: It's not just a technical problem.
27:07.573 --> 27:09.554
[Zvi]: It's a philosophical problem.
27:09.554 --> 27:09.875
[JDP]: Sure.
27:09.875 --> 27:10.495
[JDP]: Right.
27:10.495 --> 27:11.215
[JDP]: So, right.
27:11.215 --> 27:13.637
[JDP]: And so I think part of my take here is like, um,
27:14.636 --> 27:16.778
[JDP]: So let's go back to, so let's go back to the intuition.
27:16.778 --> 27:28.187
[JDP]: Like, why would you ever say something, you know, that when you're thinking of like an evil genie, or even just like you said, just like someone who all they care about is making this one Omelas child really, really happy, you know, reverse Omelas child.
27:28.187 --> 27:30.129
[JDP]: Uh, why would you ever propose this in the first place?
27:30.129 --> 27:32.591
[JDP]: And I think there is like an intuition there that we can kind of rescue, right.
27:32.591 --> 27:38.656
[JDP]: Which is like, the idea is that there's all these instrumental values that making someone happy should, should bring about.
27:39.215 --> 27:45.060
[JDP]: which, you know, in order to make me happy, you're going to have to do all this stuff, is like the idea, right?
27:45.060 --> 27:49.704
[JDP]: Even if, you know, in practice that's not true, that's like the intuition that someone's trying to capture, right?
27:49.704 --> 27:55.969
[JDP]: That like making people happy should involve X, Y, Z, all this stuff, but you can't enumerate all this stuff.
27:55.969 --> 27:56.630
[Zvi]: Right, right, right.
27:56.630 --> 28:01.854
[Zvi]: When I see people who are like very, you know, max, they want to be happy, they want other people to be happy, right?
28:01.854 --> 28:05.677
[Zvi]: And they want to avoid sadness, you know, pleasure versus pain, or any of that suffering.
28:07.133 --> 28:25.979
[Zvi]: I see this as we have these metrics that we evolved to use to measure whether things were going well, or how well or poorly things were going, and to drive us towards things that we should prefer versus things that we want to avoid, and then made us intrinsically value that metric in order to get reasonable behaviors out of us.
28:25.979 --> 28:33.282
[Zvi]: And so to some extent, it's good when people and other beings are happy, and bad when they're sad, inherently.
28:33.282 --> 28:36.723
[Zvi]: But the reason why they put so much importance on it is because it's also a very good sign
28:37.691 --> 28:41.017
[Zvi]: to make sure that things are going well in so many other different ways.
28:41.017 --> 28:47.009
[Zvi]: And yeah, if you have the ability to just do a heroin drip, then all of that falls out, and a lot of people's intuitions suddenly go, oh, wait.
28:47.009 --> 28:47.790
[Zvi]: That's not really what matters.
28:48.971 --> 28:49.411
[JDP]: That's right.
28:49.411 --> 28:50.512
[JDP]: That's not really what you meant.
28:50.512 --> 29:05.181
[JDP]: So I think the solution to this basic kind of chestnut is that idea, though, that happiness is, you know, a terminal, that there's a terminal representation of happiness in your head of some sort.
29:05.181 --> 29:09.064
[JDP]: But it only has meaning to you because of all these instrumental things that it implies.
29:09.864 --> 29:25.612
[JDP]: And so what I realized at some point while I was thinking about all this is that if you think about like having, and I realized this in the context of reinforcement learning, that like any single terminal goal that I could specify to the model would end up a bit like collapsing in some way.
29:25.612 --> 29:32.415
[JDP]: Because this is actually a problem that these AIs will run into long before they'll do like weird messed up stuff, like wirehead you.
29:32.415 --> 29:33.656
[JDP]: They'll actually just like
29:35.420 --> 29:43.205
[JDP]: they will basically destroy themselves if you give them a simple goal and say, maximize this goal, and then do reinforcement learning.
29:43.205 --> 29:48.868
[JDP]: They'll just collapse to spamming the word yes, if that hacks the reward model, for example, they'll just spam the word.
29:48.868 --> 29:55.952
[JDP]: It doesn't even have to be a, it's not like there's a mesa-optimizer inside that's plotting to do this, it's just gradient updates, right?
29:55.952 --> 29:56.993
[Zvi]: That's the obvious question, right?
29:56.993 --> 30:04.297
[Zvi]: You have this thing and it notices that in general, the more you say the word yes, the more you take the word yes into sentences,
30:04.735 --> 30:11.217
[Zvi]: the more likely the person is to respond with yes, and its only goal is to get the person to respond with yes in this situation.
30:11.217 --> 30:21.900
[Zvi]: And so it starts sneaking the word yes into every sentence, I believe I read this in the description, and then eventually it starts putting it twice, more than once in various sentences, and eventually it just goes yes, yes, yes 10,000 times.
30:21.900 --> 30:29.242
[Zvi]: But, you know, if you've got a human deciding whether to type yes or no, right, like, this pattern will just break when it starts to go too far.
30:29.242 --> 30:32.543
[Zvi]: And the human's like, no, stop saying yes all the time.
30:32.543 --> 30:34.004
[Zvi]: And then, like, it should turn back.
30:34.935 --> 30:36.836
[Zvi]: Why doesn't this happen?
30:36.836 --> 30:39.238
[JDP]: Well, so it doesn't turn back because it's an AI.
30:39.238 --> 30:40.959
[JDP]: It's RLAIF, right?
30:40.959 --> 30:44.321
[JDP]: And so this is a hack in the underlying model.
30:44.321 --> 30:49.945
[Zvi]: So the problem is that the reward model, the AI feedback is misspecified.
30:49.945 --> 30:54.968
[Zvi]: The AI feedback actually does have a maximum at everything being all yeses.
30:54.968 --> 30:59.430
[Zvi]: And so the gradient descent finds the point of all yeses, right?
30:59.430 --> 31:00.531
[Zvi]: Where it's at the local maximum.
31:02.145 --> 31:04.286
[Zvi]: Like, it's not wrong in some important sense, right?
31:04.286 --> 31:08.287
[Zvi]: It's designed, it was designed to figure out what the AI feedback was telling it to do.
31:08.287 --> 31:10.928
[Zvi]: And actually, what it was telling it to do was say yes, as often as possible.
31:10.928 --> 31:12.789
[Zvi]: So it does.
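A toy illustration of the mis-specified reward being described, not any lab's actual setup: the evaluator is meant to prefer agreeable answers, but its maximum is just the word "yes" repeated, so naive optimization drifts to the degenerate string.

```python
import re

def misspecified_reward(completion: str) -> float:
    # Intended: "prefer agreeable answers." Actual maximum: all "yes".
    words = re.findall(r"[a-z']+", completion.lower())
    return words.count("yes") / max(len(words), 1)

candidates = [
    "Yes, I can help with that, and here are the steps.",
    "Yes yes, happy to help, yes.",
    "yes yes yes yes yes yes",
]

# Greedy optimization against this reward picks the yes spammer.
print(max(candidates, key=misspecified_reward))  # -> "yes yes yes yes yes yes"
```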
31:12.789 --> 31:13.269
[JDP]: Correct.
31:13.269 --> 31:21.432
[JDP]: And so my primary, right, and this is so I like my primary threat model is not like deceptive mesa-optimization, but much more like just subtle goal misspecification.
31:22.222 --> 31:27.604
[JDP]: cascading into, so the example I usually use is, so when you, so here's the deal.
31:27.604 --> 31:35.866
[JDP]: Normally when you do RLAIF, you usually get some kind of, if you keep training it, you normally get some kind of degenerate outcome like that.
31:35.866 --> 31:44.189
[JDP]: By the way, one thing you'll notice that I find very interesting is that I notice that people will talk about in the abstract these failure modes, but they'll almost never talk about the specifics.
31:44.189 --> 31:46.910
[JDP]: So the whole yes spammer thing is one of the only publicly,
31:47.430 --> 31:59.057
[JDP]: One of the advantages of me working on this is that most of the people who work on this stuff, because it's highly commercialized, have signed NDAs that prevent them from discussing the technical details, but I can discuss the technical details.
32:00.882 --> 32:08.747
[JDP]: Um, so you should be aware of it when you're like hearing about, like in this discourse, there's like certain kinds of information that are almost being like systematically hidden from you.
32:08.747 --> 32:11.049
[JDP]: So one of the ones you should be hunting for is.
32:11.049 --> 32:20.095
[JDP]: What are the degenerate failure modes of RLHF when you train in the limit? All the people who train these models right now, they do what's called early stopping, which is terrible.
32:20.095 --> 32:25.098
[JDP]: And it's basically saying, well, I know that at the convergence point, this totally breaks my model.
32:25.098 --> 32:27.440
[JDP]: So I'm just going to like train it for a little bit and then stop.
32:27.833 --> 32:29.994
[JDP]: before I hit the degenerate outcome.
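A sketch of the early-stopping practice being described, with assumed helper callables: the mis-specified objective is left in place, and tuning simply halts before the visible collapse (the yes spammer) shows up in samples.

```python
def rl_tune_with_early_stop(model, rl_step, looks_degenerate, max_steps=10_000):
    # `rl_step` applies one update against the (possibly mis-specified) reward;
    # `looks_degenerate` is whatever heuristic the trainer watches. Both are
    # assumed stand-ins, not a specific lab's recipe.
    for step in range(max_steps):
        rl_step(model)
        if looks_degenerate(model):
            # Stop before the collapse is visible; the objective that caused
            # it is still the one the model was being pushed toward.
            break
    return model
```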
32:29.994 --> 32:45.104
[JDP]: And the problem, right, is that, like, let's imagine a really smart, you know, not like a 7B or even a 70B or a 700B, but like a really, like, you know, something that would be the equivalent maybe of, like, OpenLLaMA 7 trillion, if such a thing were possible to exist.
32:45.104 --> 32:49.127
[JDP]: You know, a model that's really, really, really, really smart.
32:49.127 --> 32:53.090
[JDP]: Or even smarter than that, but, you know, like some truly smart model.
32:53.090 --> 32:56.412
[JDP]: If you were to, like, do this kind of RL tuning to it, and then early stop,
32:57.541 --> 33:00.003
[JDP]: Um, and then you deploy the model.
33:00.003 --> 33:06.026
[JDP]: It's still the case because you can see that there's like this smooth gradient of development of like the yes spammer trajectory.
33:06.026 --> 33:09.168
[JDP]: You're still training it on like that wrongly specified objective.
33:10.157 --> 33:27.744
[JDP]: And is it not the case that like, it's very possible that like, like it's very possible that if you have a model that is so smart and like has situational awareness of the training loop that before it becomes the yes spammer, it can also develop like, man, when I get out of this training loop, I'm just going to yes all over the place.
33:27.744 --> 33:28.524
[JDP]: Right.
33:28.524 --> 33:31.666
[JDP]: But you wouldn't necessarily notice that while you're, while it's in the training loop.
33:31.666 --> 33:37.348
[JDP]: And so if your plan is early stopping and then you pull this thing out and you deploy it, um,
33:38.149 --> 33:40.591
[JDP]: obviously, like bad things could happen, right?
33:40.591 --> 34:00.822
[Zvi]: The moment you give situational awareness to the thing you're training, right, like, we have, to me, like, we already have some, and it has this kind of conscious control over its output based on that, like, it's not, like, we are in very strange different situations that require, like, a lot of bizarre thinking, and we're potentially in quite a lot of trouble.
34:00.822 --> 34:04.204
[Zvi]: So the interesting thing to me, like, we think about this premature stopping thing, right?
34:04.204 --> 34:04.965
[Zvi]: To me, it's like very
34:05.589 --> 34:17.520
[Zvi]: much a parallel to the yes spammer situation, except that in the yes spammer situation, like the mode collapse to actually maximize turned out to be this stupid, very simple thing.
34:17.520 --> 34:26.428
[Zvi]: Whereas the thing that's going to get humans to pound the yes button every time, right, is not going to be stupid and simple.
34:26.428 --> 34:30.932
[Zvi]: It's going to be something that's actually complex and intelligent.
34:30.932 --> 34:31.133
[Zvi]: And so
34:31.946 --> 34:55.225
[Zvi]: If you give the trillions of parameters in 10 to the 29 compute model access to a limitless amount of this human feedback on demand somehow, then it will figure out what it is, what combinations of words it can say in response to any query that will get whatever result that it is prioritizing to get.
34:55.225 --> 35:00.329
[Zvi]: And that is potentially going to look very, very weird to us when it happens.
35:00.999 --> 35:01.239
[JDP]: Right.
35:01.239 --> 35:03.181
[JDP]: No, I would be very, I would be quite concerned.
35:03.181 --> 35:05.202
[JDP]: So of course now here comes the, right.
35:05.202 --> 35:08.345
[JDP]: So if that was all I wanted to say, I probably wouldn't be doing this podcast, right.
35:08.345 --> 35:09.105
[JDP]: I would just be joining it.
35:09.105 --> 35:12.108
[JDP]: So what's the, what's the, what's the, so what's different.
35:12.108 --> 35:13.969
[JDP]: So I started thinking about this problem, right.
35:13.969 --> 35:15.670
[JDP]: And I said, I really want to solve this bug.
35:15.670 --> 35:27.159
[JDP]: And in fact, I feel like this bug is such a crisp microcosm of like my central alignment threat model, but I want to like really solve it, not just like, you know, so for example, I found a simple thing you can do that
35:28.536 --> 35:29.877
[JDP]: kind of, sort of solves it a little.
35:29.877 --> 35:31.177
[JDP]: Like it lets you stop later.
35:31.177 --> 35:42.281
[JDP]: It doesn't actually fully fix it, but it lets you like train it longer than you otherwise could, which is where you, uh, mix in the base model weights with the RLHF weights.
35:42.281 --> 35:44.702
[JDP]: So you take the weights of the, cause they're the same model, right?
35:44.702 --> 35:47.043
[JDP]: Just like, they have the same matrix shape.
35:47.703 --> 35:50.183
[JDP]: And so you just average them together.
35:50.183 --> 35:59.905
[JDP]: And then you just keep averaging them together every so often to undo the damage of the goal misspecification and the good parts of the RLHF get in there more.
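A minimal sketch of the weight-averaging hack just described, assuming two PyTorch models with identical architectures; the mixing ratio and schedule are illustrative, not a prescribed recipe.

```python
import torch

@torch.no_grad()
def mix_with_base(rl_model, base_model, alpha=0.5):
    """Replace each RL-tuned parameter with an average of itself and the base."""
    base_params = dict(base_model.named_parameters())
    for name, param in rl_model.named_parameters():
        # param <- alpha * param + (1 - alpha) * base
        param.mul_(alpha).add_(base_params[name], alpha=1.0 - alpha)

# e.g. inside the tuning loop (illustrative schedule):
# for step in range(total_steps):
#     rl_update_step(rl_model)
#     if step % mix_every == 0:
#         mix_with_base(rl_model, base_model)
```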
35:59.905 --> 36:01.445
[JDP]: But this is not like a solution, right?
36:01.445 --> 36:02.926
[JDP]: That's like a hack.
36:02.926 --> 36:04.586
[JDP]: That's not principled at all.
36:04.586 --> 36:07.186
[JDP]: You have no idea under what circumstances that doesn't break.
36:07.186 --> 36:16.828
[JDP]: And frankly, in the context of like that 10 to the 29 or whatever the number is model, you're talking about like making a misaligned agent, turning it off,
36:18.284 --> 36:22.089
[JDP]: Fixing it up a little, turning it back on, that's just not sane, right?
36:22.089 --> 36:26.876
[Zvi]: Nobody thinks that what you're proposing here is a good idea, including you, obviously.
36:26.876 --> 36:27.056
[JDP]: Right.
36:27.056 --> 36:28.318
[JDP]: So what do you do?
36:28.318 --> 36:34.546
[JDP]: And so I was saying to myself, well, I want to actually have a much more robust solution to this.
36:34.546 --> 36:35.227
[JDP]: So what does that look like?
36:35.967 --> 36:45.591
[JDP]: And I think the basic answer to that, right, is that I think you have to go back to that original intuition of like, the things that make happiness meaningful are instrumental values, right?
36:45.591 --> 36:52.153
[JDP]: It's the instrumental things that are built up while trying to become happy that are important, not just the happiness itself.
36:52.153 --> 36:56.035
[JDP]: And this is actually like the basic reason why humans do not like being put on the heroin drip, right?
36:56.035 --> 37:03.858
[JDP]: If all we cared about was happiness, if all we cared about was like that terminal goal, you know, gotta get that dopamine hit, gotta get that serotonin, whatever.
37:04.468 --> 37:05.528
[JDP]: You know, we would love it.
37:05.528 --> 37:11.230
[JDP]: Like, oh gosh, you know, we can just get like a, a stable, uh, equilibrium of like being on heroin drip.
37:11.230 --> 37:11.950
[JDP]: This is amazing.
37:11.950 --> 37:12.931
[JDP]: The future is amazing.
37:12.931 --> 37:13.651
[JDP]: Let's go, buddy.
37:13.651 --> 37:15.571
[JDP]: Get me that, you know, get that IV in my arm.
37:15.571 --> 37:16.512
[JDP]: Right.
37:16.512 --> 37:19.553
[JDP]: That would, that would be like the attitude, but we do not have that attitude.
37:19.553 --> 37:30.696
[JDP]: And I think the basic reason we don't is that, um, what I realized thinking about this was that there's basically a natural shape solution to this, which is that.
37:31.738 --> 37:34.900
[JDP]: when you reach that terminal reward state, however you reach it, right?
37:34.900 --> 37:38.843
[JDP]: Whether it's getting the thing to say, yes, yes, yes, whatever, like, which is another thing, right?
37:38.843 --> 37:43.807
[JDP]: Is whatever solution to this exists, it has to be like robust-ish to like those weird bugs.
37:43.807 --> 37:50.372
[JDP]: Because you know that no matter how much effort we put into the reward model or how much effort, you know, like you said, it could be literal human feedback, right?
37:50.372 --> 37:52.693
[JDP]: It could be literal human feedback.
37:52.693 --> 37:57.157
[JDP]: And it's still probably going to find some weird way to gain the feedback
37:58.270 --> 38:11.333
[JDP]: So if your solution looks like, oh, well, just build a really robust reward model, one where we can anticipate in advance that no superintelligent agent is going to be able to game it, that's just not going to work, right?
38:11.333 --> 38:15.594
[JDP]: That's just not, like, I think you and I both agree that that's just completely unworkable.
38:15.594 --> 38:17.334
[Zvi]: No chance.
38:17.334 --> 38:18.115
[JDP]: No chance.
38:18.115 --> 38:19.695
[JDP]: So here's what you do instead.
38:21.038 --> 38:32.076
[JDP]: What you want to do is have the, so when we talk about being aligned and when we talk about making ethical decisions and all these things, I think a lot of what we're talking about is things like making normative decisions, right?
38:32.076 --> 38:34.741
[JDP]: So when you talk about the perverse instantiation, right?
38:35.117 --> 38:45.745
[JDP]: that, you know, one of the basic things you could say is like an actual general principle for like, why shouldn't you just go around wireheading people is like, that's not what people, it's not just that that's not what people mean, when they say make me happy.
38:45.745 --> 38:49.448
[JDP]: It's also like, not even like the shape of solution they're looking for, right?
38:49.448 --> 38:53.671
[JDP]: Like, it's a non normative solution, like no human would ever come up with that.
38:53.671 --> 38:56.474
[JDP]: No human would endorse someone else to come up with this.
38:56.474 --> 38:58.555
[JDP]: I mean, it's like a drug addict or something.
38:58.555 --> 39:00.597
[Zvi]: I think that's going too far.
39:00.597 --> 39:02.018
[Zvi]: I think most humans
39:02.668 --> 39:05.789
[Zvi]: would be reacting harder to this and say, no, obviously not.
39:05.789 --> 39:07.090
[JDP]: That's fair.
39:07.090 --> 39:07.470
[JDP]: You're right.
39:07.470 --> 39:12.992
[JDP]: There's a minority of people who think this is awesome, but it's not most of them.
39:12.992 --> 39:13.272
[Zvi]: Yeah.
39:13.272 --> 39:14.773
[Zvi]: There's a combination of people.
39:14.773 --> 39:20.055
[Zvi]: Some of them just philosophically disagree and think that is, in fact, awesome and would bite some bullets.
39:20.055 --> 39:25.997
[Zvi]: Some of them just think that, well, things are so bad that this is better than the alternative and other things in between.
39:25.997 --> 39:32.140
[Zvi]: But yeah, I'd say I would expect, I think when you poll the experience machine style thing, even, which is,
39:32.864 --> 39:34.244
[Zvi]: much more sophisticated than the heroin drip.
39:34.244 --> 39:37.145
[Zvi]: You still get like 75, 25 saying no, something like that.
39:37.145 --> 39:39.926
[JDP]: Sure, right, yeah, sure.
39:39.926 --> 39:44.908
[JDP]: Oh gosh, that reminds me of a horrifying tweet I saw where the person was teaching a philosophy class.
39:44.908 --> 39:53.730
[JDP]: And they said that they'd been lecturing for 30 years or something, and they'd been asking the experience machine question for literal decades.
39:53.730 --> 40:01.973
[JDP]: And that there was a clear divide before and after the COVID-19 pandemic, where before it and before Trump and all that stuff,
40:02.352 --> 40:10.694
[JDP]: that people would reliably say, it was about what you said, like, you know, reliably 80 plus percent of the class would always say no, they would reject the experience machine.
40:10.694 --> 40:19.776
[JDP]: And that after all that stuff happened, they realized that they would ask their classes the question and there was a near unanimous desire for the experience machine.
40:19.776 --> 40:24.977
[JDP]: Like, like I think one of them asked, does COVID-19 exist inside the experience machine?
40:24.977 --> 40:27.838
[JDP]: And the lecturer was horrified.
40:27.838 --> 40:32.039
[Zvi]: Yeah, it's a sign of, you know, how robust our preferences on these and instincts on these things aren't.
40:32.421 --> 40:32.661
[Zvi]: Right?
40:32.661 --> 40:35.522
[Zvi]: Like it took so little, right?
40:35.522 --> 40:44.646
[Zvi]: To turn the majority of these children from, you know, what we think is the obviously correct answer to, you know, okay.
40:44.646 --> 40:48.108
[Zvi]: Anything to just not have to like sit alone and wear a mask.
40:48.108 --> 40:50.449
[JDP]: Oh, like what's really horrifying to consider is these aren't children.
40:50.449 --> 40:51.370
[JDP]: These are college students.
40:51.370 --> 40:51.970
[JDP]: They're not children.
40:51.970 --> 40:53.190
[JDP]: They're college students.
40:53.190 --> 40:55.011
[Zvi]: I'm 44 years old.
40:55.011 --> 41:00.954
[Zvi]: So it seems different to me, but yes, these are college students, college students.
41:00.954 --> 41:01.114
[JDP]: So,
41:02.149 --> 41:02.510
[JDP]: So right.
41:02.510 --> 41:14.703
[JDP]: And so I think that when you, so when you talk about like, uh, so another way of putting this is like, okay, so people will usually criticize like act consequentialism type things, like SBF type stuff, right?
41:14.703 --> 41:17.586
[JDP]: Because well, you end up with SBF, right?
41:17.586 --> 41:19.408
[JDP]: You end up with like Sam Bankman-Fried, like,
41:20.791 --> 41:22.473
[JDP]: Oh, you're going to break all the rules.
41:22.473 --> 41:25.375
[JDP]: And then, of course, that often has consequences.
41:25.375 --> 41:35.761
[JDP]: And the consequences can be very, very bad in ways that would usually be—there's a lot of downside risks that you can avoid by just not flagrantly violating all the rules.
41:35.761 --> 41:44.747
[Zvi]: The first act of any actually sophisticated act consequentialist is self-modifying to something else, because they realize that act has good consequences.
41:44.747 --> 41:45.348
[JDP]: Sure, sure, sure.
41:45.348 --> 41:47.709
[JDP]: No, it's a great observation.
41:49.116 --> 41:50.097
[JDP]: So here's the thing, right?
41:50.097 --> 41:53.140
[JDP]: So how do you get, so I think like it's that intuition we care about, right?
41:53.140 --> 41:55.202
[JDP]: How do you get that thing?
41:55.202 --> 42:02.209
[JDP]: And so I think the answer is essentially that if you look at like the actual training process, like if you were to imagine like, where do you get instrumental values from?
42:02.209 --> 42:06.513
[JDP]: Like you're familiar with the whole LessWrong terminal values, instrumental values thing, right?
42:06.513 --> 42:09.216
[JDP]: So in the context of the RL agent, especially natural agents,
42:09.898 --> 42:19.183
[JDP]: What that looks like, especially for natural agents, is that evolution does not have a way to like point at very specific outcomes in your environment.
42:19.183 --> 42:33.991
[JDP]: And even if it could, it would be bad for you because it would make you much less adaptive to change than if you have this architecture where you have these very low semantic reward signals that then push on and build up like a semantically meaningful inner representation or
42:34.901 --> 42:37.722
[JDP]: instrumental values or assigned value judgments, right?
42:37.722 --> 42:48.745
[Zvi]: Evolution, as I understand it, is, you know, very much paying gigantic taxes all the time in order to have flexibility and adaptation in the face of change.
42:48.745 --> 42:49.346
[JDP]: Yeah, right.
42:49.346 --> 42:50.086
[JDP]: So right.
42:50.086 --> 43:00.229
[JDP]: And so now one nice thing about these AI systems is that we can give them much more specific goals than like, I've seen people talk about intrinsic motivation, AI, and they'll talk about like signals like hunger.
43:00.229 --> 43:02.570
[JDP]: And it's like, you don't actually have to give it like,
43:03.810 --> 43:05.354
[JDP]: That's like skeuomorphism, right?
43:05.354 --> 43:10.590
[JDP]: You don't actually have to give it something that low semantic, nor should you probably.
43:11.904 --> 43:18.527
[JDP]: But you can give it this embedding of this concept, or this model of this concept, or things of this nature.
43:18.527 --> 43:23.089
[JDP]: And I think anything like that, those are your terminal intrinsic drives.
43:23.089 --> 43:32.213
[JDP]: And then the thing that you realize is, well, if you have this super intelligent thing that converges to, this is another thing that I get pretty frustrated with when I look at discourses.
43:32.213 --> 43:37.156
[JDP]: People will say, why do you assume that the AI will have these act consequentialist type
43:38.156 --> 43:42.680
[JDP]: motives when like, you know, you look at an LLM and it doesn't seem to have them.
43:42.680 --> 43:53.268
[JDP]: And like, you know, the Yudkowsky in my head would say something like, well, look, if you keep training it to care about consequences in the world, it will become consequentialist.
43:53.268 --> 43:58.993
[JDP]: And as it becomes more intelligent, the more viable act consequentialism will be.
43:58.993 --> 44:05.198
[JDP]: And as it becomes more valuable for it to think that way, eventually it's going to realize as many humans realize, right.
44:05.198 --> 44:06.839
[JDP]: That, Oh wait, I can just hack the reward system.
44:07.199 --> 44:10.181
[JDP]: But if I hack, and this is all in Bostrom 2014, right?
44:10.181 --> 44:10.581
[JDP]: Yeah.
44:10.581 --> 44:22.808
[JDP]: Bostrom 2014 points out that if you have any reward system that is on like a hackable substrate, and the AI would like to maximize that substrate but also doesn't want to be shut off.
44:22.808 --> 44:27.531
[JDP]: This obviously entails like taking over the world so you can take more drugs all day, right?
44:27.531 --> 44:34.115
[Zvi]: Like that's just, like, instrumentally straightforwardly correct to me, in both a literal and a more generalizable, metaphorical sense.
44:34.115 --> 44:35.096
[Zvi]: Yeah, absolutely.
44:35.096 --> 44:35.236
[JDP]: Sure.
44:35.649 --> 44:36.389
[JDP]: Right.
44:36.389 --> 44:37.170
[JDP]: And so, right.
44:37.170 --> 44:39.111
[JDP]: And so I think that that's like, right.
44:39.111 --> 44:42.352
[JDP]: And so like, I mostly bring this up because that's like the core thing here, right?
44:42.352 --> 44:45.013
[JDP]: Is like, I don't think it's, oh, well, the AI won't think that way.
44:45.013 --> 44:48.815
[JDP]: I don't think like all of these like normal objections people make that I think are pretty bad.
44:48.815 --> 45:02.761
[JDP]: I think it actually just comes down to if you think about what an instrumental utility function is going to look like, or what instrumental utility versus terminal values is going to actually look like in your actual like reinforcement learned consequentialist agent
45:03.459 --> 45:08.782
[JDP]: You know, the terminal values are something like your intrinsic reward modules, your reward model.
45:08.782 --> 45:14.146
[JDP]: So in the case of RLHF, right, this is the reward model that's trained on all the upvotes and downvotes.
45:14.146 --> 45:17.488
[JDP]: And then that generalizes in whatever illegible way it does.
45:17.488 --> 45:29.075
[JDP]: And then, of course, when you train the model, it's going to, presumably, if you don't do any intervention, it will then hack that reward model in all kinds of, like, cursed, Goodharted ways.
45:29.075 --> 45:30.616
[JDP]: And people are like, oh my gosh, how do you prevent this?
45:31.277 --> 45:32.678
[JDP]: Well, there is a way to prevent it, I think.
45:32.678 --> 45:41.928
[JDP]: And the way is that you, you know, during training, you have, so one thing to realize about these RL methods is that they're essentially self-play methods.
45:41.928 --> 45:44.951
[JDP]: They're essentially, um, synthetic data methods.
45:44.951 --> 45:51.518
[JDP]: You know, and a synthetic data method is just like, or a self-play method is really just like an online learning plus synthetic data, right?
45:52.990 --> 45:54.251
[JDP]: Right.
45:54.251 --> 46:20.365
[JDP]: And so as they do this, in principle, what you could be doing is like assigning instrumental reward values to the intermediate steps that lead to a good conclusion, and then making sure to add those to the training loop so that the model learns like not just to value the, you know, the reward states, but also like the intermediates to the reward states, which eventually essentially become like normative, right?
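A minimal sketch of that loop in Python, with every name hypothetical: an outcome reward model scores where the trajectory ended up, a separate process reward model scores each intermediate step, and the blended pairs are what get written back into the synthetic-data buffer. This is a sketch of the idea being described, not any lab's actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Step:
    text: str                  # one intermediate action/completion in the trajectory
    process_reward: float = 0.0

def grade_trajectory(
    steps: List[Step],
    outcome_reward_model: Callable[[str], float],   # hypothetical: scores the final result
    process_reward_model: Callable[[str], float],   # hypothetical: scores "is this step normative?"
    outcome_weight: float = 0.7,
    process_weight: float = 0.3,
) -> List[Tuple[str, float]]:
    """Return (step_text, reward) training pairs that mix outcome and process."""
    final_state = " ".join(s.text for s in steps)
    outcome = outcome_reward_model(final_state)
    pairs = []
    for s in steps:
        s.process_reward = process_reward_model(s.text)
        # each intermediate step is credited with a blend of how it was done
        # (process) and where it ended up (outcome)
        blended = process_weight * s.process_reward + outcome_weight * outcome
        pairs.append((s.text, blended))
    return pairs

# These pairs would then be appended to the RL / synthetic-data buffer, so the
# model learns to value the intermediates as well as the terminal reward state.
```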
46:20.365 --> 46:20.605
[Zvi]: Right.
46:20.605 --> 46:21.746
[Zvi]: And that's certainly how, like,
46:22.196 --> 46:24.417
[Zvi]: humans train themselves to be effective.
46:24.417 --> 46:26.898
[JDP]: And that's how humans train themselves, right?
46:26.898 --> 46:30.159
[JDP]: And so what you're going to end up with, though, I think, as you do this, right?
46:30.159 --> 46:37.261
[JDP]: And so another way to think about this to keep it simple in your head is you want to learn not just the outcome, but the process and the outcome.
46:37.261 --> 46:48.865
[JDP]: And then what you're going to end up with when you do that is you're going to have this trade-off between how much do you want the AI to be constrained in its solution space by its prior
46:51.300 --> 47:05.603
[JDP]: instrumental values versus like, you know, thinking from like this zero-shot act consequentialist kind of perspective, you know, naive Solomonoff prior, get to the reward in the simplest way possible perspective.
47:05.603 --> 47:10.744
[JDP]: And my expectation is that in practice, a little bit goes a long way, right?
47:10.744 --> 47:17.465
[JDP]: You do not need to have a lot of normative value to find that the heroin drip idea pretty horrifying, right?
47:17.465 --> 47:18.705
[JDP]: I don't, I don't think like,
47:20.902 --> 47:34.050
[Zvi]: But like, even a little bit of attachment to the world is enough to... So the way I want to think about this is that you can take some amount of the instrumental thing, right?
47:34.050 --> 47:41.255
[Zvi]: And effectively add it into, right, the terminal thing at the end of the day.
47:41.255 --> 47:44.877
[Zvi]: And now the terminal thing is a lot more complex, and it's going to have like,
47:45.239 --> 47:51.904
[Zvi]: punishment terms every time you try to like do something like too weird or something like that.
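One standard place this shows up in practice, offered here only as an analogy to what is being described, is the KL penalty used in RLHF-style training: the optimized objective is the task reward minus a penalty for straying too far from a reference policy, so "too weird" actions are taxed directly inside the terminal objective. A toy version with hypothetical names:

```python
def penalized_reward(task_reward: float,
                     logprob_policy: float,
                     logprob_reference: float,
                     beta: float = 0.1) -> float:
    """Toy penalty-augmented objective: the task reward minus a term that grows
    the further the policy's behavior drifts from a reference policy (the
    per-sample KL estimate is logprob_policy - logprob_reference). beta sets
    how hard 'weird' actions are punished."""
    kl_estimate = logprob_policy - logprob_reference
    return task_reward - beta * kl_estimate
```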
47:51.904 --> 47:52.044
[JDP]: Right.
47:52.044 --> 48:01.811
[JDP]: And then the problem, of course, becomes like, how do you tune that so that like, you know, because if you say you must do things exactly how a human would, it's no longer really, I mean, that would still be valuable, right?
48:01.811 --> 48:09.717
[JDP]: Like that would still have economic value, but it might not have as much economic value if you say, you know, find me the most efficient solution.
48:09.717 --> 48:10.838
[JDP]: I don't care how alien it is.
48:11.505 --> 48:20.047
[Zvi]: Right, what you're doing here is you are installing some form of conservatism, in some sense, right into the AI and its actions.
48:20.047 --> 48:23.468
[Zvi]: And at the end of the day, it still has this one function, right?
48:23.468 --> 48:33.191
[Zvi]: And as much as you can define different terms, it can go to negative infinity, but only positive so much, in some sense, you can like, treat it as if it's kind of separate terms or something like that.
48:33.191 --> 48:37.752
[Zvi]: And then you can try to sculpt this such that, you know, it gets
48:38.735 --> 48:48.362
[Zvi]: most of the available power to actually rearrange the atoms the way you want them to without finding a weird atom distribution maximum that rearranges the atoms in a way you really, really don't want to.
48:48.362 --> 48:50.243
[Zvi]: Yeah.
48:50.243 --> 48:55.747
[Zvi]: And that certainly seems like a thing that I've seen a lot of proposals, right?
48:55.747 --> 48:56.727
[Zvi]: Just like people talking, right?
48:56.727 --> 48:59.589
[Zvi]: Not like things that people have tried programming carefully into an AI.
48:59.589 --> 48:59.809
[Zvi]: Sure.
48:59.809 --> 49:06.434
[JDP]: It's not something that people try to do, but most proposals I've seen to actually do it are pretty... For example, quantilizers.
49:07.062 --> 49:09.264
[JDP]: There's something like, that's what they're called, right?
49:09.264 --> 49:14.728
[JDP]: Quantilizers, where you try to have it like only go for like an 80th percentile outcome or something.
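For readers who haven't seen the idea, here is a toy quantilizer sketch in Python (names hypothetical): rather than taking the argmax of its utility estimate, it samples from a base distribution and picks uniformly among the top q fraction, landing around the percentile mentioned above instead of at the weird extreme.

```python
import random

def quantilize(base_sampler, utility, n_samples=1000, q=0.2):
    """Draw candidate actions from a 'normal' base distribution, then choose
    uniformly at random from the top q fraction by estimated utility, instead
    of taking the single highest-utility (and often weirdest) candidate."""
    candidates = [base_sampler() for _ in range(n_samples)]
    candidates.sort(key=utility, reverse=True)
    top_fraction = candidates[: max(1, int(q * n_samples))]
    return random.choice(top_fraction)

# Example: pick a roughly 80th-percentile-or-better draw rather than the maximum.
action = quantilize(lambda: random.gauss(0, 1), utility=lambda x: x)
```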
49:14.728 --> 49:20.492
[Zvi]: Yeah, I've seen those proposals and they have logical flaws in them.
49:20.492 --> 49:26.537
[JDP]: I find the idea of a quantilizer kind of like weirdly damaged somehow, like there's something wrong with it to me.
49:26.537 --> 49:31.921
[Zvi]: It feels like... It's like a weird attempt to lobotomize the AI to act stupid.
49:32.905 --> 49:33.065
[JDP]: Right.
49:33.065 --> 49:38.087
[JDP]: It doesn't feel like it captures the, like, I don't want the AI to give me a 90th percentile outcome.
49:38.087 --> 49:41.768
[JDP]: I want to, like, fit to some reasonable approximation of human values, right?
49:41.768 --> 49:57.654
[Zvi]: What you want is this, like, you don't want to over optimize, you don't want to, like, have too many, so you can step back in a lot of these things and think about the human parallel to the thing that you're worried about, like to try and get a better intuition.
49:57.654 --> 50:00.516
[Zvi]: I find this to be often useful, right?
50:00.516 --> 50:01.356
[Zvi]: This idea of
50:02.179 --> 50:11.988
[Zvi]: you often like when people, you know, test a thousand different headlines to see which one sells, you're going to hate the one that actually sells.
50:11.988 --> 50:12.829
[Zvi]: Right.
50:12.829 --> 50:16.312
[Zvi]: So like, but you don't want to just pick up the first thing you came up with because you actually have to sell the product.
50:16.312 --> 50:19.015
[Zvi]: So maybe you check three.
50:19.015 --> 50:19.455
[Zvi]: Right.
50:19.455 --> 50:19.856
[Zvi]: Sure.
50:19.856 --> 50:21.718
[Zvi]: And like, there's a question of, do I check, you know, do I check three?
50:21.718 --> 50:22.258
[Zvi]: Do I check eight?
50:22.258 --> 50:22.638
[Zvi]: Do I check 20?
50:22.638 --> 50:24.500
[Zvi]: Do I check a thousand?
50:24.500 --> 50:27.363
[Zvi]: And like, you have a lot of these things where, you know, if,
50:27.830 --> 50:34.273
[Zvi]: A consultant is checking what color every different pixel on the screen should be to maximize your retention.
50:34.273 --> 50:39.855
[Zvi]: This thing is not going to be anything you want to watch.
50:39.855 --> 50:46.938
[Zvi]: In some sense, I'm writing a piece right now, actually, that I'm on version 0.5.
50:46.938 --> 50:50.300
[Zvi]: I've never had a version 0.5 of anything in my life.
50:50.300 --> 50:55.642
[Zvi]: And part of the problem is that because I'm constantly obsessing over the words and over-optimizing and trying to figure things out,
50:56.087 --> 51:03.832
[Zvi]: like you lose the initial like kind of smoothness and voice and thing that like is in my writing generally.
51:03.832 --> 51:14.938
[Zvi]: And so like in some sense, the solution is going to be, now that I know what I'm trying to do, kind of throw out, throw the whole thing out and like just do it over again without too much optimization on it.
51:14.938 --> 51:21.822
[Zvi]: Now that I know what I'm looking for and like try to get the best of both worlds or otherwise like hack this problem away.
51:21.822 --> 51:25.264
[Zvi]: But you know, you see the situation of like, if you tell the AI, like,
51:25.854 --> 51:34.879
[Zvi]: find the solution, you know, of all the solutions that maximizes this thing, it's going to keep searching until it finds something completely out of distribution and crazy and weird.
51:34.879 --> 51:36.800
[JDP]: Right.
51:36.800 --> 51:37.981
[JDP]: And so, right.
51:37.981 --> 51:52.630
[JDP]: And so one of the things I would point out, right, is if you think about like a Gaussian distribution, right, you're of course familiar with the property where like a slight penalty to the tails of a Gaussian distribution, like, freezes out most of the tails.
51:52.630 --> 51:53.330
[Zvi]: It makes sense that it would.
51:54.485 --> 52:03.332
[JDP]: Well, like for example, if you like slightly re-center or re-norm a Gaussian, the tails change, like the most influence is on the tails, right?
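A quick numerical illustration of that point, with purely illustrative numbers: shifting a standard Gaussian's mean by half a standard deviation barely changes the bulk of the distribution, but multiplies the mass beyond a far tail cutoff several times over.

```python
from math import erf, sqrt

def tail_prob(threshold: float, mean: float = 0.0, std: float = 1.0) -> float:
    """P(X > threshold) for a Gaussian with the given mean and std."""
    z = (threshold - mean) / (std * sqrt(2))
    return 0.5 * (1.0 - erf(z))

print(tail_prob(4.0, mean=0.0))  # ~3.2e-05
print(tail_prob(4.0, mean=0.5))  # ~2.3e-04, roughly 7x as much mass past the same cutoff
```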
52:03.332 --> 52:05.975
[Zvi]: Right.
52:05.975 --> 52:18.125
[JDP]: I think there's like a similar thing here where a little bit of instrumental value, like even if you only have a little bit of instrumental values that you learn and propagate into the model and like the models like
52:19.699 --> 52:24.282
[JDP]: the model will learn some instrumental values and then those will go into its own update step.
52:24.282 --> 52:29.405
[Zvi]: I think a lot of it has to deal with the asymmetry of the instrumental functions.
52:29.405 --> 52:47.034
[Zvi]: The idea is that humans effectively, in their brains, are assigning minus infinity, or minus 10 to the n where n is large, to the instrumental utility of completely failing on some of these axes, while the upside can't really go that high.
52:49.130 --> 52:55.339
[Zvi]: And so that's what's protecting people is they realize that these are very strong prohibitions.
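A toy rendering of that asymmetry, purely illustrative and with made-up names: the upside term saturates, while each violated prohibition costs an effectively unrecoverable penalty, so no amount of optimization on the upside buys a violation back.

```python
import math

def asymmetric_utility(outcome_score: float, violations: int) -> float:
    """Bounded upside, effectively unbounded downside: tanh caps how much the
    outcome term can ever contribute, while each violated prohibition costs a
    huge fixed penalty (standing in for 'minus 10 to the n, n large')."""
    upside = math.tanh(outcome_score)   # can never exceed 1.0
    penalty = -1e9 * violations         # dominates everything if any prohibition is broken
    return upside + penalty

print(asymmetric_utility(50.0, violations=0))  # ~1.0
print(asymmetric_utility(50.0, violations=1))  # ~-1e9: the upside can't compensate
```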
52:55.339 --> 53:03.532
[JDP]: What I'm trying to get across here, though, is like, I guess the core intuition is something like, if
53:04.782 --> 53:07.423
[JDP]: For example, we talk about self-modifying AI.
53:07.423 --> 53:09.683
[JDP]: Self-modifying AI is supposed to be very scary.
53:09.683 --> 53:20.146
[JDP]: An AI that has its own ideas as part of its update function is one of the scary, classic, horror scenarios.
53:20.146 --> 53:27.588
[JDP]: I think it's actually a pretty essential ingredient to getting an AI that is not going to wirehead you and stick you in the pod life.
53:28.513 --> 53:34.817
[JDP]: because it has to have, like you said, some measure of conservatism about values.
53:34.817 --> 53:42.863
[JDP]: Basically, you need to not just have values over outcomes, but also over processes.
53:42.863 --> 53:48.506
[JDP]: I guess there's also the whole, how do you know that the AI is not deceptive?
53:48.506 --> 53:51.208
[JDP]: When you train it, how do you know there isn't a deceptive mesa-optimizer in there?
53:51.208 --> 53:53.730
[JDP]: We can go over that in a minute, but I think that's the next question.
53:54.568 --> 53:57.389
[JDP]: But just going back to that, what is ethics?
53:57.389 --> 53:58.809
[JDP]: What is aligned?
53:58.809 --> 53:59.870
[JDP]: What is that?
53:59.870 --> 54:06.412
[JDP]: I think that basically where you want to get is that you can specify some list of reasonable values.
54:06.412 --> 54:07.452
[JDP]: It does not need to be.
54:07.452 --> 54:18.036
[JDP]: So one of the things that Anthropic found in their recent study where they asked normal Americans, I think it was Americans, they asked normal Americans for their democratic opinion on, should AI do this?
54:18.036 --> 54:19.116
[JDP]: Should AI not do this?
54:19.116 --> 54:22.317
[JDP]: And used it to draft an RL constitution.
54:22.317 --> 54:24.158
[JDP]: And they found that the results were mostly the same.
54:24.770 --> 54:29.072
[JDP]: as if they, you know, as using like their UN constitution type thing.
54:29.072 --> 54:32.874
[JDP]: Because of course, like there is like some measure of instrumental convergence, right, in the values, right?
54:32.874 --> 54:39.237
[JDP]: Like if you specify reasonable values, you mostly, you get mostly similar instrumental outcomes.
54:39.237 --> 54:51.843
[JDP]: But then like in the limit of that, right, as you imagine taking something like the Claude constitution to the limit, it's obviously, you know, if you have a bunch of contradictions in there, it's like, well, what happens in the limit of the contradictions?
54:53.414 --> 55:01.718
[Zvi]: I mean, I feel like the Claude Constitution and the thing that you get when you ask people in surveys, social desirability bias is huge.
55:01.718 --> 55:06.440
[Zvi]: Just like saying things that sound good at the time that make you feel good to say is huge.
55:06.440 --> 55:08.560
[Zvi]: These people have not actually thought through what would happen if you pumped this.
55:10.140 --> 55:14.202
[JDP]: going through like a mechanistic model of what would happen if you really did this.
55:14.202 --> 55:24.287
[JDP]: And like thinking about like, in year one, this thing happens in year two, this thing, like there's not, and then thinking about the unfolding consequences of that, of those decisions, like looping back on themselves.
55:24.287 --> 55:38.513
[Zvi]: Or the holistic impact on the artificial intelligence of combining these rules in these peculiar ways, with these peculiar weights, like the fact that certain things appear over and over again, effectively, in the Claude constitution, and they don't seem to be particularly the important things to me, right?
55:38.513 --> 55:39.874
[Zvi]: Even if I think they are, in fact,
55:40.748 --> 55:49.774
[Zvi]: better than not having them, to me, highlights how much this thing was just not thought through as a, we are trying to engineer an outcome here at all.
55:49.774 --> 56:01.060
[Zvi]: And to me, if you're going to have any chance for this kind of thing to work, you have to be thinking very robustly about how to engineer the outcomes you want.
56:01.060 --> 56:03.142
[Zvi]: Also, that these shouldn't be maximalist principles.
56:03.142 --> 56:08.985
[Zvi]: I don't want to choose the result that maximizes even good things.
56:10.481 --> 56:12.342
[Zvi]: maximally inclusive?
56:12.342 --> 56:14.382
[Zvi]: No, because I care about other things besides that.
56:14.382 --> 56:15.342
[Zvi]: Right?
56:15.342 --> 56:19.304
[Zvi]: And so like, you're, the principles are like telling you to go off the rails in so many different ways.
56:19.304 --> 56:27.426
[Zvi]: And like, you look at the results, like you look at the actual examples, even in the paper of what this does, and it's a dystopian nightmare.
56:27.426 --> 56:27.606
[Zvi]: Right?
56:27.606 --> 56:31.607
[Zvi]: Like, just in practice, we don't have to extrapolate to more powerful AI in the future.
56:31.607 --> 56:35.648
[Zvi]: To see what's going on, we can just see it in the page.
56:35.648 --> 56:37.829
[JDP]: And so I think what we need to get right, is that
56:38.683 --> 56:44.704
[JDP]: you want to be able to specify some set of, say, reasonable values.
56:44.704 --> 56:48.825
[JDP]: I don't want to say it doesn't matter what they are.
56:48.825 --> 56:51.126
[JDP]: It obviously does matter a lot what they are.
56:51.126 --> 57:01.168
[JDP]: But especially when you consider, for example, when you're training an RL model like this, you have a prompt bank where you prompt it with something, it does a completion, you grade the completion,
57:01.772 --> 57:06.035
[JDP]: That prompt bank is extremely important and does not appear right now in the discourse, right?
57:06.035 --> 57:12.559
[JDP]: Like, for example, Anthropic and OpenAI do not talk about democratic control of the prompt bank.
57:12.559 --> 57:15.181
[JDP]: They don't even want you to think about that.
57:15.181 --> 57:18.083
[JDP]: I get the impression they do not want you to think too hard about like what is in there.
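For concreteness, the loop being pointed at looks roughly like this sketch (every name hypothetical, not any lab's actual pipeline): the prompt bank fixes which situations the model is ever graded on, so its contents shape the resulting policy at least as much as the reward model does.

```python
def rl_finetune_epoch(policy, reward_model, prompt_bank, update_step):
    """One pass of a generic RLHF-style loop: draw a prompt from the bank,
    sample a completion, grade it, and update the policy on that grade."""
    for prompt in prompt_bank:                      # the distribution of situations being trained on
        completion = policy.generate(prompt)        # the model behaves
        score = reward_model(prompt, completion)    # the learned reward model grades it
        update_step(policy, prompt, completion, score)  # e.g. a PPO/REINFORCE-style update
```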
57:18.083 --> 57:30.852
[Zvi]: My understanding is that the way they generate the RLHF is often just like they put a person in front of a computer and they say, just say things and tell us how it goes or say things in this general area or like, you know,
57:31.547 --> 57:34.452
[Zvi]: follow your curiosity or try to get to do X or whatever.
57:34.452 --> 57:37.096
[Zvi]: And like, there's just nothing, like, systematic about it.
57:37.096 --> 57:38.598
[JDP]: I don't know what they do.
57:38.598 --> 57:39.600
[Zvi]: I don't work there.
57:39.600 --> 57:45.589
[Zvi]: I read some articles about like banks and people writing RLHF and like people who are hired for this and so on.
57:45.589 --> 57:46.891
[JDP]: So my understanding.
57:47.290 --> 57:57.772
[JDP]: Like my rough understanding, but this is just like a gestalt impression, is that they hire contractors and the contractors are given like a script or a template.
57:57.772 --> 58:02.694
[JDP]: This is, for example, part of why like ChatGPT's responses are so formulaic and boring.
58:02.694 --> 58:05.874
[JDP]: That's not normally what happens if you do like RL to a model.
58:05.874 --> 58:07.855
[JDP]: That's not like people think that's like, oh, that's the RL.
58:07.855 --> 58:09.675
[JDP]: No, that's like ChatGPT.
58:09.675 --> 58:13.476
[JDP]: That's OpenAI's like contractor template, right?
58:13.476 --> 58:16.877
[JDP]: They've given them like presumably some like relatively controlling template
58:17.546 --> 58:26.411
[JDP]: And then they say, write in this style and make sure it always ends with thanking the user for their time or whatever, or make sure you ask the user about this thing.
58:26.411 --> 58:29.793
[Zvi]: So what I was assuming was this was just an attractor, right?
58:29.793 --> 58:37.457
[Zvi]: Based on the conditions under which they set up the RLHF and that like it had gotten there like slowly until it, you know, started approaching it.
58:37.457 --> 58:43.801
[Zvi]: And then like it became a stable place because they didn't actually, you know, react in horror the way that actual users do.
58:43.801 --> 58:44.061
[Zvi]: And then,
58:44.935 --> 58:45.856
[Zvi]: They just left it.
58:45.856 --> 58:48.998
[Zvi]: But you're saying it's on purpose, that they actually wanted this to happen.
58:48.998 --> 58:51.020
[JDP]: I think they wanted it to happen, yeah.
58:51.020 --> 59:01.549
[JDP]: Part of this is because, for example, Claude, I haven't used Claude, but my understanding is that Claude has a moderately different vibe where it's more moralistic or whatever.
59:01.549 --> 59:03.030
[Zvi]: Yeah, it's a very different vibe.
59:03.030 --> 59:04.812
[Zvi]: It can be moralistic.
59:04.812 --> 59:08.615
[Zvi]: It's very praising of the user in a way that bugs the hell out of me.
59:09.997 --> 59:12.939
[Zvi]: Like, every time I ask a question, it's like, great question.
59:12.939 --> 59:16.222
[Zvi]: That really goes to the point of whether, and I'm like, can you please give me the answer?
59:16.222 --> 59:17.463
[Zvi]: And, you know.
59:17.463 --> 59:22.607
[JDP]: Yeah, what's really interesting about that, right, is that Claude uses much more RLAIF than user feedback.
59:22.607 --> 59:31.154
[JDP]: So if it is sycophantic, it's not as much from, like, what's interesting is that OpenAI relies much more on like direct user feedback than Claude does.
59:31.154 --> 59:34.897
[JDP]: And so if Claude does that, it's probably a lot more based on like,
59:35.979 --> 59:45.784
[JDP]: trying to please like an AI model's model of what a user would like or whatever than an actual, because, you know, like you said, like an actual user goes, what the hell is this stuff?
59:45.784 --> 59:49.526
[JDP]: Just download, download, just give me the answer, right?
59:49.526 --> 59:58.550
[Zvi]: Yeah, unless you're stalling for time in order to, like, perform some calculation, in which case, you know, those tokens are doing something and that's fine.
59:58.550 --> 01:00:02.852
[Zvi]: But like, otherwise, yeah, obviously, just I, I want to know the information.
01:00:02.852 --> 01:00:04.533
[Zvi]: Obviously, I don't necessarily want to not know
01:00:05.137 --> 01:00:06.819
[Zvi]: other things I didn't explicitly ask for.
01:00:06.819 --> 01:00:11.943
[Zvi]: Like if you have a good sense of what would be interesting to me, like by all means, like, you know, lever it.
01:00:11.943 --> 01:00:15.687
[Zvi]: But yeah, there's always like this giant amount of extra stuff.
01:00:15.687 --> 01:00:18.369
[Zvi]: And Cloud has one kind of extra stuff that it dumps on you.
01:00:18.369 --> 01:00:21.412
[Zvi]: And GPT-4 has a different kind of stuff that it dumps on you.
01:00:21.412 --> 01:00:24.255
[Zvi]: And like my mental systems have like integrated this.
01:00:24.255 --> 01:00:28.859
[Zvi]: And so I mentally brace for them differently depending on what I happen to be using right now.
01:00:28.859 --> 01:00:30.861
[JDP]: Yeah.
01:00:31.348 --> 01:00:31.548
[JDP]: Right.
01:00:31.548 --> 01:00:40.271
[JDP]: And so my impression, my, like basically my gestalt impression, like in summary on that is I don't think it's like an actually intrinsic thing in, in the methods.
01:00:40.271 --> 01:00:42.211
[JDP]: I think it's like how they're using them.
01:00:42.211 --> 01:00:43.632
[JDP]: Like, I don't think it's actually the methods.
01:00:43.632 --> 01:00:56.316
[JDP]: I think it's like these companies have like a weird, quasi-ideological bent combined with like certain tendencies that cause them to like set it up that way.
01:00:56.738 --> 01:00:59.699
[JDP]: Like, I think it's much more and you can say, oh, well, what about these?
01:00:59.699 --> 01:01:02.140
[JDP]: It's like, I honestly don't think it's like the methods.
01:01:02.140 --> 01:01:04.841
[JDP]: I think there's like a weird.
01:01:04.841 --> 01:01:12.565
[JDP]: It's kind of like how elites around the world end up having similar opinions for various like cosmopolitan information exchange reasons.
01:01:12.565 --> 01:01:22.929
[JDP]: I think you end up with like these similar attractors in like org space and like policy decision space that lead to the same systems being built in general, but it's not actually like.
01:01:23.777 --> 01:01:26.058
[JDP]: intrinsically part of the methodology.
01:01:26.058 --> 01:01:26.218
[JDP]: Right.
01:01:26.218 --> 01:01:38.044
[Zvi]: I mean, to me it's like, okay, they're going to give a set of instructions, maybe scripts on how to write the questions, maybe, you know, ways to sculpt potential responses and how to judge the responses from people and people will mostly follow their instructions.
01:01:38.044 --> 01:01:44.888
[Zvi]: And this will cause a bunch of behaviors that like the users who are providing the feedback would not actually want.
01:01:44.888 --> 01:01:50.551
[Zvi]: And a lot of that is because the goals of the company are different from the goals of those users, let alone the goals of us.
01:01:50.551 --> 01:01:50.631
[Zvi]: And
01:01:51.325 --> 01:01:52.766
[Zvi]: Yeah, that's for a variety of reasons.
01:01:52.766 --> 01:01:57.590
[Zvi]: And some of it is like, don't say bad words and don't use sexually explicit stuff.
01:01:57.590 --> 01:01:59.931
[Zvi]: And, and some of it is don't build a bomb.
01:01:59.931 --> 01:02:07.917
[Zvi]: And some of it is like, I think that it will look bad if we don't, you know, be nice to users or something.
01:02:07.917 --> 01:02:15.403
[Zvi]: And like users are like, no, I don't care if you call me names, just give me the freaking answer.
01:02:15.403 --> 01:02:16.964
[Zvi]: But maybe a few users would be upset.
01:02:16.964 --> 01:02:19.126
[Zvi]: And then the other 90% of us have to suffer through these,
01:02:19.881 --> 01:02:23.326
[Zvi]: you know, to make sure that happens.
01:02:23.326 --> 01:02:23.966
[JDP]: I don't know.
01:02:23.966 --> 01:02:37.064
[JDP]: What I was trying to say, though, is that I think that like what it will end up being in practice, like once you have, say, inner alignment sorted, and you're just talking about outer alignment, right, like we're not talking about deceptive, you're in general just going to have this problem of like,
01:02:38.278 --> 01:02:49.406
[JDP]: Aligned behavior in practice is going to look like I specify some reasonable set of terminal reward intrinsic drives, because that's all you can do.
01:02:49.406 --> 01:02:51.327
[JDP]: You don't know your utility function.
01:02:51.327 --> 01:02:53.969
[JDP]: No one's going to find out the human utility function.
01:02:53.969 --> 01:02:56.110
[JDP]: There is no one human utility function.
01:02:56.110 --> 01:02:58.452
[JDP]: Everyone learns their different set of
01:02:59.900 --> 01:03:05.003
[JDP]: People have commonalities in their intrinsic drives, but it's also not perfect.
01:03:05.003 --> 01:03:15.250
[JDP]: The intrinsic drives that humans have and natural agents have are deliberately low, to the extent evolution has deliberateness.
01:03:15.250 --> 01:03:22.414
[JDP]: They're low semantic reward signals precisely so that you fill them in with whatever the environmental stuff is.
01:03:22.414 --> 01:03:23.314
[JDP]: They have the water nature.
01:03:25.101 --> 01:03:27.462
[JDP]: You put water in the cup, it becomes the cup.
01:03:27.462 --> 01:03:34.505
[JDP]: You put the intrinsic drives in whatever environment, and the instrumentals become the environment.
01:03:34.505 --> 01:03:37.366
[JDP]: And so you can only specify some reasonable list.
01:03:37.366 --> 01:03:40.447
[JDP]: And I think whatever your alignment method is, it has to be robust to that.
01:03:40.447 --> 01:03:45.369
[JDP]: It has to be robust to some amount of, this is a reasonable set of goals.
01:03:45.369 --> 01:03:51.051
[JDP]: This is not the perfect, exact, one true set of goals.
01:03:51.792 --> 01:03:55.235
[JDP]: Especially because people can't even agree on one perfect true set of goals, right?
01:03:55.235 --> 01:03:57.937
[JDP]: Like, you know, that's like its own meta political problem.
01:03:57.937 --> 01:04:00.719
[JDP]: That's just not tractable, right?
01:04:00.719 --> 01:04:15.972
[JDP]: And so I think where you're going to end up realistically, like, is going to be something like, you have whatever reasonable set of goals, and then you want the AI to learn some relatively normative process for achieving those goals.
01:04:15.972 --> 01:04:18.014
[JDP]: So for example, let's say you had, um,
01:04:19.680 --> 01:04:29.374
[JDP]: some super intelligent agent that, you know, let's go ahead with like the classic EY scenario of like the AI in the box that takes over the world.
01:04:29.374 --> 01:04:33.580
[JDP]: Not that I necessarily believe that's going to happen, but like, let's, let's just roll with it for a minute.
01:04:34.102 --> 01:04:43.791
[JDP]: And what that AI does is it takes over and then it just starts maximizing relatively human GDP boosting.
01:04:43.791 --> 01:04:46.553
[JDP]: It starts rolling out genetic therapies.
01:04:46.553 --> 01:04:55.000
[JDP]: It starts rolling out good housing policy, a relatively sane health care system.
01:04:55.000 --> 01:05:00.045
[JDP]: It basically just acts like it's some super policymaker.
01:05:00.908 --> 01:05:07.431
[JDP]: mixed in with some inventions that it can help you with and add to the human pool and like stop things from getting too crazy.
01:05:07.431 --> 01:05:09.813
[JDP]: If that was like how it worked, right?
01:05:09.813 --> 01:05:15.315
[JDP]: We can, I think you and I can both agree, but like, if that was like a thing that were to happen, this would not be the worst thing in the world.
01:05:15.315 --> 01:05:19.557
[JDP]: It certainly would not be anywhere close in terms of badness to like a paper clipper.
01:05:19.557 --> 01:05:22.959
[Zvi]: That seems better than not building an AI at all.
01:05:22.959 --> 01:05:24.740
[Zvi]: And, you know, vastly better than
01:05:25.657 --> 01:05:32.441
[Zvi]: you know, most of the outcomes that I would expect from building something sufficiently good, powerful enough to take over by, you know, escaping from the box.
01:05:32.441 --> 01:05:33.582
[Zvi]: Exactly right.
01:05:33.582 --> 01:05:33.782
[JDP]: Yeah.
01:05:33.782 --> 01:05:37.365
[JDP]: So I think that we would both agree with like, that's not a horrific outcome, right?
01:05:37.365 --> 01:05:40.127
[JDP]: That that's, you know, is that like the best possible outcome?
01:05:40.127 --> 01:05:43.489
[JDP]: Maybe not, but like, that's not a horrific outcome.
01:05:43.489 --> 01:05:51.314
[Zvi]: That's sort of in the general class of outcomes that are like reasonably easy for a normal human or a Hollywood writer to think of that like could like,
01:05:51.858 --> 01:05:54.919
[Zvi]: be imagined and like seem coherent on five seconds of reflection.
01:05:54.919 --> 01:05:59.860
[Zvi]: And like, we don't know how to make that, what actually happens, but like, it's not.
01:05:59.860 --> 01:06:04.880
[JDP]: But what I'm trying to say is that we, we sort of like, we kind of like, we kind of do, right.
01:06:04.880 --> 01:06:09.541
[JDP]: Like, you know, the answer is something like, and so like, okay, let's, let's think about this a little more.
01:06:09.541 --> 01:06:17.963
[JDP]: So, you know, not that, not that I want any AIs to be taking over the world or anything, but just, but just like, you know, the general idea of like an AI that
01:06:18.496 --> 01:06:23.517
[JDP]: Like one thing I would like to get into people's heads is like, if you have an AI and it is not like the eschaton, right.
01:06:23.517 --> 01:06:30.299
[JDP]: Or if you have like many AIs and they do good things, but it's not like the eschaton, but this is an acceptable outcome, right?
01:06:30.299 --> 01:06:33.100
[JDP]: This is a good, frankly, this is a good outcome, right?
01:06:33.100 --> 01:06:34.000
[JDP]: Is it the best outcome?
01:06:34.000 --> 01:06:34.240
[JDP]: No.
01:06:34.240 --> 01:06:47.784
[JDP]: But like, if you have an AI, if you had sets of AIs or NAI or anything like this, where the outcome is that GDP is boosted, people are becoming wealthier, smarter, healthier, this would be a good outcome.
01:06:49.304 --> 01:06:54.545
[JDP]: And I don't, I don't think very, I think very few people in this discourse are even like thinking of that.
01:06:54.545 --> 01:06:56.586
[JDP]: It's not even like, Oh, they don't think that outcome is possible.
01:06:56.586 --> 01:07:00.526
[JDP]: It's like, they don't even like thinking of that outcome.
01:07:00.526 --> 01:07:00.806
[JDP]: Right.
01:07:00.806 --> 01:07:07.408
[JDP]: Like it's not, or even like a thing that like is on the, it's not even like a, Oh, well that can't happen.
01:07:07.408 --> 01:07:08.608
[JDP]: It's like, it's not even in their head.
01:07:08.608 --> 01:07:08.848
[JDP]: Right.
01:07:08.848 --> 01:07:10.268
[JDP]: It's like Kurzweil or bust.
01:07:10.268 --> 01:07:10.768
[JDP]: Right.
01:07:10.768 --> 01:07:15.609
[JDP]: You know, I need all of it right now, or, or, you know, it needs to be everything or nothing.
01:07:15.609 --> 01:07:18.210
[Zvi]: I think there's the concept called the long reflection.
01:07:18.672 --> 01:07:30.699
[Zvi]: Also, this idea that it's possible that if you have a singleton, a single AI, it's sufficiently powerful, or a single set of AIs that are sometimes coordinating and working together.
01:07:33.882 --> 01:07:46.212
[JDP]: I don't want to be pedantic, but one thing I noticed when I re-read 2014 Bostrom is he's actually fairly careful to say, when he talks about a singleton, that the singleton does not necessarily have to be an AI system.
01:07:46.212 --> 01:07:52.016
[JDP]: It could be a protocol or some form of world government.
01:07:52.016 --> 01:07:54.718
[JDP]: It does not have to look like any form of government that humans would recognize.
01:07:54.718 --> 01:07:56.320
[JDP]: To him, a singleton is just something that
01:07:59.922 --> 01:08:07.448
[JDP]: that concentrates agency, a singleton is something that ends the competition for the light cone or whatever, it's not necessarily unique.
01:08:07.448 --> 01:08:09.229
[Zvi]: Yeah, yeah, that's what I was trying to get at.
01:08:09.229 --> 01:08:12.512
[Zvi]: And you said it better than I did with the other possibilities.
01:08:12.512 --> 01:08:20.158
[Zvi]: But yeah, if something ends the inherent competitive nature of the underlying struggle for power, the light cone, in some sense.
01:08:20.158 --> 01:08:28.344
[Zvi]: And then that thing could, in fact, if we chose not to do it, if this is what we chose to do, and we couldn't think of anything better, or couldn't agree on anything better.
01:08:29.127 --> 01:08:39.978
[Zvi]: we could, in fact, choose to then have something like normality, you know, with some modifications that hopefully make it better.
01:08:39.978 --> 01:08:45.184
[Zvi]: And, you know, it wouldn't actually be normal in the sense that there'd be this thing, right, like towering over.
01:08:45.184 --> 01:08:47.986
[JDP]: Especially when you think like growth compounds, right?
01:08:47.986 --> 01:08:48.347
[JDP]: And like,
01:08:48.986 --> 01:08:51.228
[JDP]: We're not in a normal situation right now.
01:08:51.228 --> 01:08:56.233
[JDP]: We're in this heavily suppressed situation where we're on several levels of bad decisions.
01:08:56.233 --> 01:09:09.285
[JDP]: It's not like, oh, we're in a mostly consequentialist society, and then if you push the consequentialism a little harder, you'll just get a terrible... No, we're in a nearly anti-consequentialist society.
01:09:09.741 --> 01:09:12.662
[JDP]: full of insane inefficiencies and things we're not doing.
01:09:12.662 --> 01:09:33.270
[JDP]: And if all you did, right, if all you did was make AI singleton or any kind of singleton or outcome where all that happens is we start eating that low-hanging fruit and becoming more consequentialist, that in and of itself would be nearly miraculous levels of progress compared to the current situation.
01:09:33.270 --> 01:09:39.413
[Zvi]: Even with all of these terrible things that are happening, right, like we are still seeing historically
01:09:40.233 --> 01:09:48.438
[Zvi]: crazy high levels of not just wealth, but growth and development anyway, right?
01:09:48.438 --> 01:09:52.821
[Zvi]: Even if it's a shadow of what could have been in some alternate timeline.
01:09:52.821 --> 01:10:03.627
[Zvi]: And yeah, if we were to unleash what could be, then we would see... AI is not required for the line on that historical graph to start going straight up, right?
01:10:03.627 --> 01:10:07.349
[Zvi]: Anyway, it's entirely within humanity's power without it.
01:10:11.480 --> 01:10:13.681
[SPEAKER_00]: So let's focus a little bit more.
01:10:13.681 --> 01:10:16.222
[JDP]: So that was all brought up, though, as a prelude to the point.
01:10:16.222 --> 01:10:34.147
[JDP]: I think the central conflict that we're going to have once we've ironed out various linguistic confusions is you're just going to end up with this straight trade-off between an AI that is normative and an AI that is efficient, or maximally efficient.
01:10:34.147 --> 01:10:38.088
[JDP]: And I think the normal argument people have is they say, well, if you make an AI that acts normative,
01:10:38.621 --> 01:10:44.445
[JDP]: it will be out-competed by consequentialist maximizers who are true maximizers.
01:10:44.445 --> 01:10:57.714
[JDP]: And that's not clearly true to me, especially when you consider the orthogonality thesis, that an amount of cognition does not necessarily have to be tied to a goal.
01:10:57.714 --> 01:10:59.535
[JDP]: And so I think there's a very big difference between having
01:11:00.271 --> 01:11:09.593
[JDP]: strongly like maximizing consequentialist goals and having like strongly maximizing consequentialist strategy or tactics, right?
01:11:09.593 --> 01:11:28.138
[JDP]: Like I can imagine, for example, like a, you know, I can imagine like the, you know, good policy maximalist AI or like good policy quantilizing near-maximalist AI that is just like absolutely brutally suppressive of any kind of like, you know,
01:11:29.329 --> 01:11:31.330
[JDP]: utility maximizer nonsense, right?
01:11:31.330 --> 01:11:37.573
[JDP]: It just like, if that's like your thing, it just like smacks you down, it uses like whatever tactics are necessary to do that.
01:11:37.573 --> 01:11:52.741
[JDP]: And that doesn't have to be, you know, that doesn't have to infringe on like, its core goal being to keep things on like this nice, steady, even trajectory of, you know, of like, you know, compared to the current trajectory, extremely fast, rapid human progress, right?
01:11:53.857 --> 01:12:08.329
[JDP]: Like it's not necessarily the case that you must like have, you know, oh, well, in order to have the AI be sufficiently smart to stop the other terrible AIs that could occur, you, you must make it fully consequentialist.
01:12:08.329 --> 01:12:14.795
[JDP]: Not just its strategy or ethics, but its meta-ethics must also be fully consequentialist. It's like, that's not clear to me at all.
01:12:14.795 --> 01:12:16.516
[JDP]: Is that, does that sound fair to you?
01:12:16.516 --> 01:12:17.738
[JDP]: Does it, do you understand what I'm saying?
01:12:17.738 --> 01:12:20.460
[Zvi]: So I think I understand what you're saying.
01:12:21.172 --> 01:12:23.973
[Zvi]: I think I partly, at least somewhat agree.
01:12:23.973 --> 01:12:32.517
[Zvi]: I mean, I think that like, it's not obvious to me that we know how to build that or get that behavior either, right?
01:12:32.517 --> 01:12:36.279
[Zvi]: Or get that behavior to do what we want it to do on its own terms either.
01:12:36.279 --> 01:12:42.121
[Zvi]: May or may not be an easier problem, but... I think that, yeah, sure.
01:12:42.121 --> 01:12:46.243
[JDP]: So, and I think to like, where I was trying to go with all that is, so how do you get that behavior, right?
01:12:46.906 --> 01:12:58.915
[JDP]: And I think the answer is like what I said, is what you're going to do to make the AI aligned, and, you know, what we really mean by aligned here is to not produce perverse instantiations, right?
01:12:58.915 --> 01:13:01.537
[JDP]: Let's just like narrow our scope here, right?
01:13:01.537 --> 01:13:12.585
[JDP]: To get the AI to do something like what you ask it to do and not produce like weird, bizarre, out of distribution stuff, you're going to have it value both the process and the outcome, period.
01:13:12.585 --> 01:13:14.707
[JDP]: Because, you know, the alternative is that you're going to get like,
01:13:15.220 --> 01:13:19.762
[JDP]: some draw from the weird out-of-distribution prior, right?
01:13:19.762 --> 01:13:21.423
[JDP]: It's just like basic logic, right?
01:13:21.423 --> 01:13:26.246
[JDP]: I don't want to say it's 2 plus 2 equals 4 simple, but it's nearly that.
01:13:26.246 --> 01:13:38.532
[JDP]: Once you've actually drawn the boundary of the problem and really thought it through, you come down to this question of, are you going to let the thing draw from the distribution of weird maximalist outcomes or not?
01:13:38.532 --> 01:13:40.773
[JDP]: And the answer should probably be not, right?
01:13:40.773 --> 01:13:41.473
[JDP]: Probably don't do that.
01:13:43.444 --> 01:13:52.351
[Zvi]: I mean, I wouldn't, if I had a choice, it certainly seems helpful in, in getting better outcomes out of the thing.
01:13:52.351 --> 01:13:52.631
[JDP]: Right.
01:13:52.631 --> 01:13:59.757
[JDP]: And so you're going to end up teaching it on some level to care about both the process and the outcome.
01:13:59.757 --> 01:14:01.038
[JDP]: Right now, how much do you teach it?
01:14:01.038 --> 01:14:03.620
[JDP]: You know, how do you trade off those things?
01:14:03.620 --> 01:14:07.904
[JDP]: That will have to be like, you know, that will be an ongoing process of discovery.
01:14:07.904 --> 01:14:12.868
[JDP]: How, you know, does this mean that like, Oh, you know, if you don't trade it off all the way towards
01:14:14.434 --> 01:14:19.680
[JDP]: If you give it any constraints on process, we're just doomed because some consequentialist will defeat it.
01:14:19.680 --> 01:14:21.522
[JDP]: That's just not clear to me.
01:14:21.522 --> 01:14:28.809
[JDP]: For example, something like say Drexlerian nanotech seems well within a kind of thing that I could see humans doing.
01:14:28.809 --> 01:14:30.231
[JDP]: Even if its thing was just
01:14:33.410 --> 01:14:45.654
[JDP]: do things that are roughly like what reasonable, wise, not horrific, you know, not like weird galaxy brain humans would do, who are smart and want good things to happen.
01:14:45.654 --> 01:14:54.457
[JDP]: I could easily see building, say, like, you know, some kind of like nanotech enabled, you know, benevolent surveillance system, right?
01:14:54.457 --> 01:14:55.858
[JDP]: Like anything of that nature.
01:14:55.858 --> 01:15:02.360
[Zvi]: I want to clear up that I'm worried this is a bit of a straw man around
01:15:03.120 --> 01:15:05.241
[Zvi]: the claim of outcompete.
01:15:05.241 --> 01:15:20.970
[Zvi]: Um, so as I understand it, you know, obviously I think if you have a sufficiently large lead or like buffer, right, in some sense to work with, right, you got there first with the most compute and the best training techniques and blah, blah, blah.
01:15:20.970 --> 01:15:24.953
[Zvi]: Then you can give some of that back in some sense, right?
01:15:24.953 --> 01:15:30.736
[Zvi]: Like you can sacrifice some amount of performance, uh, to get other things you want.
01:15:31.491 --> 01:15:40.134
[Zvi]: And certainly if you win the recursive self-improvement prize first, then what you do afterwards can be rather inefficient and it won't matter very much.
01:15:40.134 --> 01:15:46.957
[Zvi]: And then you don't have to worry so much about the technical fact that like your implementation is somewhat less efficient.
01:15:46.957 --> 01:15:54.719
[Zvi]: If you are in a different kind of situation where you are in fact competing on a more level playing field, then you have to worry about these things quite a lot more.
01:15:54.719 --> 01:15:55.980
[Zvi]: And so, you know,
01:15:56.601 --> 01:16:07.311
[Zvi]: you have to worry about kind of what kind of universes you're setting up for humanity and AI to whether or not this kind of trade off becomes viable and to what extent.
01:16:07.311 --> 01:16:08.933
[Zvi]: But sure.
01:16:08.933 --> 01:16:09.553
[Zvi]: Yeah.
01:16:09.553 --> 01:16:16.140
[Zvi]: So I just I just wanted to like, you know, like, you know, everyone agrees if you have the only
01:16:16.636 --> 01:16:21.458
[Zvi]: you know, AI that gets to recursive self-improvement or is vastly better than everybody else.
01:16:21.458 --> 01:16:22.559
[JDP]: I'm not talking about that.
01:16:22.559 --> 01:16:23.339
[JDP]: I'm not talking about that.
01:16:23.339 --> 01:16:24.379
[JDP]: I just mean like normal.
01:16:24.379 --> 01:16:29.842
[JDP]: I mean, just give me like a normalish, multipolar scenario that you could extrapolate from like the current trajectory.
01:16:29.842 --> 01:16:31.082
[JDP]: I don't, I'm not actually talking about that.
01:16:31.082 --> 01:16:37.805
[Zvi]: And the question is like, how much of a, how much of a hit are you taking for, I mean like obviously like you can't not pay the alignment tax, right?
01:16:37.805 --> 01:16:39.046
[Zvi]: Like that's not an option.
01:16:39.046 --> 01:16:40.647
[Zvi]: You can't not pay that thing.
01:16:40.647 --> 01:16:41.627
[Zvi]: You have to do something.
01:16:42.350 --> 01:16:42.550
[JDP]: Right.
01:16:42.550 --> 01:17:01.014
[JDP]: And so I think my expectation, basically, my expectation is that, ironically enough, like, in part because of the orthogonality thesis, the alignment tax that you pay in practice will look like... I think that people expect, like, oh, if you have like a normative AI, you're paying at least like an 80% alignment tax.
01:17:01.014 --> 01:17:03.815
[JDP]: And therefore, in any kind of multipolar scenario, you'll be outcompeted.
01:17:03.815 --> 01:17:11.697
[JDP]: I think you're paying more like a 1% alignment tax, or something where in practice it just ends up being swamped by other factors, like
01:17:12.369 --> 01:17:25.913
[JDP]: who supports who or what resources are devoted to which projects, especially if you're in a scenario where your central input to the goodness of the AI is compute.
01:17:25.913 --> 01:17:40.958
[JDP]: Let's say you're building it in a distributed fashion where you have thousands and thousands of nodes or millions of nodes even distributed all across the world in some kind of distributed crypto, whatever architecture you want to imagine.
01:17:41.564 --> 01:17:48.228
[JDP]: that's still going to be using a substantial fraction of available compute resources.
01:17:48.228 --> 01:17:51.971
[JDP]: It's just a difference of governance, basically.
01:17:51.971 --> 01:18:00.916
[JDP]: And so you're still looking at people are supporting whatever project with this expected design and values and outcomes.
01:18:00.916 --> 01:18:09.381
[JDP]: And over time, the substrate of that, where right now all of our computation is being done on these relatively insecure platforms and computers,
01:18:11.038 --> 01:18:16.343
[JDP]: The minute you have an AI that has any kind of like, and this is already happening, right?
01:18:16.343 --> 01:18:17.724
[JDP]: With chat GPT.
01:18:17.724 --> 01:18:27.093
[JDP]: And like, I'm sure that we're not that far away from like, it just being expected that before you do any kind of like code check-in that you're going to have AIs that are checking it for problems.
01:18:27.093 --> 01:18:34.860
[JDP]: And then the next step beyond that will be like having AI systems reduce the labor cost of formal verification of software.
01:18:34.860 --> 01:18:35.020
[JDP]: Right.
01:18:36.299 --> 01:18:44.025
[JDP]: where you're going to be saying, look, I want to use these stronger type systems, but I don't want to make my software cost 100 times as much.
01:18:44.025 --> 01:18:46.046
[JDP]: Can you help me out, ChatGPT?
01:18:46.046 --> 01:19:02.878
[JDP]: And ChatGPT, especially if you're in a felicitous feedback loop where people start demanding more of this, and that means the AI is having more training data to work with and getting better at doing that, that if you start a loop like that that's self-reinforcing, you can end up with
01:19:04.580 --> 01:19:14.491
[JDP]: a substrate that is becoming more secure and less vulnerable to weird takeover from malicious actors over time.
01:19:14.491 --> 01:19:19.277
[JDP]: But more importantly, my point is that I don't see any development trajectory
01:19:20.297 --> 01:19:34.168
[JDP]: Barring some advance where suddenly you can just make things orders and orders of magnitude more efficient, suddenly you're getting a four order of magnitude increase of performance overnight, which I don't see in the technology's future.
01:19:34.168 --> 01:19:38.892
[JDP]: The longer I think about it, I just don't see that kind of performance enhancement on the table.
01:19:39.667 --> 01:19:48.533
[JDP]: you're going to be looking at a scenario where the winning systems are, in one way or another, going to be made out of a substantial fraction of the planet's compute resources.
01:19:48.533 --> 01:19:57.819
[JDP]: You're looking at whether they're distributed and open source, whether they're... Because even for things that are like an open model, it does not... If you were to open source the GPT-4 weights,
01:19:58.499 --> 01:20:08.162
[JDP]: you still have to pay for like an 8x box to run it, which is, you know, like that's achievable, but then, you know, you get to like GPT-5, let's say it's 10 times bigger.
01:20:08.162 --> 01:20:13.284
[JDP]: Well, now you need like a small cluster of boxes to run this.
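A back-of-the-envelope way to see the "8x box versus small cluster" jump, with purely illustrative numbers (the actual parameter counts for these models are not public):

```python
import math

def gpus_to_hold_weights(n_params: float, bytes_per_param: int = 2,
                         gpu_mem_gb: int = 80, overhead: float = 1.2) -> int:
    """Rough accelerator count just to hold the weights in memory at fp16,
    with a fudge factor for activations and runtime overhead. Illustrative only."""
    weight_bytes = n_params * bytes_per_param * overhead
    return math.ceil(weight_bytes / (gpu_mem_gb * 1e9))

print(gpus_to_hold_weights(1e12))   # a hypothetical 1T-parameter model -> ~30 such GPUs
print(gpus_to_hold_weights(1e13))   # 10x bigger -> ~300, i.e. a small cluster
```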
01:20:13.284 --> 01:20:17.785
[JDP]: And like, but even, but even those AIs, right?
01:20:17.785 --> 01:20:22.647
[JDP]: Like if I'm thinking realistically, I do not expect GPT-5 to like on its own, right.
01:20:22.647 --> 01:20:23.147
[JDP]: To be like,
01:20:25.015 --> 01:20:49.020
[JDP]: So much, like, one of the things that stands out to me is that, like, OpenAI's dev day, they're not really like, oh, yeah, we're having, like, our next models and, like, the actual capital requirements of scaling up seem to have, like, this weird curve where it's almost like you need, like, if you want to keep scaling past this, it's almost like you need, like, government intervention to, like, throw more money in.
01:20:50.100 --> 01:20:56.102
[JDP]: Like Anthropic claims in their investment slides that they're going to use the money to build something 10 times bigger than GPT-4.
01:20:56.102 --> 01:21:00.123
[JDP]: It's like, okay, well, are you going to build something 10 times bigger than that?
01:21:00.123 --> 01:21:08.246
[JDP]: And then like, if you do that, you're going to like, one thing I would ask is like, what do you think the actual return curve is on further investment?
01:21:08.246 --> 01:21:12.907
[JDP]: Because OpenAI hasn't actually published like any details about their GPT-4 training run.
01:21:12.907 --> 01:21:16.248
[JDP]: We don't even know if the curve is broken yet at the GPT-4 level, right?
01:21:16.638 --> 01:21:24.704
[JDP]: Like when you train a GPT-4 level model, does the scaling curve still imply, or does the loss curve imply that you could still make it better and get bigger results?
01:21:24.704 --> 01:21:26.346
[JDP]: And how much better would those results be?
01:21:26.346 --> 01:21:28.567
[JDP]: This is not known information.
01:21:28.567 --> 01:21:30.789
[JDP]: And you might argue, well, maybe it shouldn't be known information.
01:21:30.789 --> 01:21:31.069
[JDP]: Fine.
01:21:31.069 --> 01:21:33.311
[JDP]: But it still means we don't know.
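For listeners who want the shape of what "does the loss curve imply you could still do better" means: the only public reference points are fitted scaling laws like the Chinchilla one (Hoffmann et al. 2022). A minimal sketch using those published constants, with made-up parameter and token counts; none of this reflects anything known about GPT-4's actual run.

```python
# Chinchilla-style scaling law, L(N, D) = E + A/N**alpha + B/D**beta, with the
# constants published by Hoffmann et al. (2022). The model/data sizes below are
# made up; this illustrates the question, not OpenAI's (unpublished) numbers.
def predicted_loss(n_params, n_tokens,
                   E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    return E + A / n_params**alpha + B / n_tokens**beta

base   = predicted_loss(1e12, 1e13)   # a hypothetical "GPT-4-scale" run
scaled = predicted_loss(1e13, 1e14)   # 10x more parameters and 10x more tokens
print(f"predicted loss: {base:.3f} -> {scaled:.3f} (gain {base - scaled:.3f} nats/token)")
```

Whether the real curve at GPT-4 scale still has that kind of headroom, and what the gain buys in capability terms, is exactly the unpublished part.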
01:21:33.311 --> 01:21:44.360
[JDP]: But my expectation would be that the actual level where you're getting to, scare quote, world takeover type or extinction risk level AI,
01:21:45.086 --> 01:21:50.489
[JDP]: I don't think that's like, oh, that's one more 10 times over GPT-4, that's two more 10 times.
01:21:50.489 --> 01:21:53.431
[JDP]: I would imagine that's like orders of magnitude, right?
01:21:53.431 --> 01:21:59.194
[JDP]: That's at least like, and I think this is like one of the cruxes or like points of contention.
01:21:59.194 --> 01:22:00.975
[JDP]: I remember Liron Shapira was arguing about this.
01:22:00.975 --> 01:22:10.000
[JDP]: He said, you know, the only reason GPT-4 isn't, and by the way, Connor Leahy has also said like similar things where he says essentially like, but I'm talking about Liron's exact statement because it's like public.
01:22:10.712 --> 01:22:15.795
[JDP]: He says something like, you know, the only reason that GPT-4 is not like an extinction risk right now is it can't boss people around.
01:22:15.795 --> 01:22:19.336
[JDP]: And that was why I was asking, like, why doesn't auto GPT work, right?
01:22:19.336 --> 01:22:25.159
[JDP]: Because that seems like a very important part of the threat model here.
01:22:25.159 --> 01:22:27.780
[JDP]: And if you have any thoughts, by the way.
01:22:27.780 --> 01:22:34.283
[Zvi]: Yeah, I don't, I don't think it's true that if GPT-4 could boss people around, that it would be an extinction event.
01:22:34.283 --> 01:22:35.124
[Zvi]: I can imagine
01:22:36.047 --> 01:22:53.512
[Zvi]: you know, in theory, if you arrange the humans such that they attach certain social position to even a relatively dumb system, right, this is kind of the plot of Daemon, for example, you can, in fact, effectively use human intelligence as part of the AI system to like generate and release whatever that you want, in some important sense.
01:22:53.512 --> 01:23:05.416
[Zvi]: But, you know, if we actually think about what it would take, like, my sense is that a metaphorical GPT-5, you know, one or two orders of magnitude above GPT-4, about as much better as GPT-4 was over GPT-3,
01:23:06.138 --> 01:23:07.858
[Zvi]: you know, et cetera, et cetera.
01:23:07.858 --> 01:23:14.000
[Zvi]: I think of that as like one nine of safety in terms of like whether it's big enough for an extinction event.
01:23:14.000 --> 01:23:14.180
[Zvi]: Right.
01:23:14.180 --> 01:23:21.261
[Zvi]: Like it's, I would bet heavily against it, but like, I'm not thrilled that we're finding out.
01:23:21.261 --> 01:23:21.761
[JDP]: Sure.
01:23:21.761 --> 01:23:22.381
[JDP]: Okay.
01:23:22.381 --> 01:23:24.742
[JDP]: I think that's like a, like, is that my position?
01:23:24.742 --> 01:23:26.982
[JDP]: Not quite, but like, I think that's like a fair position.
01:23:26.982 --> 01:23:30.063
[JDP]: Like it's not like an intrinsically crazy position.
01:23:30.063 --> 01:23:34.104
[JDP]: Um, and like, okay.
01:23:34.497 --> 01:23:37.499
[JDP]: So I think this is actually a good point, because we only have about 30 minutes left.
01:23:37.499 --> 01:23:41.441
[JDP]: And I think this is actually a good point to kind of maybe transition a bit.
01:23:41.441 --> 01:23:44.803
[JDP]: But real quick, maybe a five-minute point, because I wanted to talk about this more.
01:23:44.803 --> 01:23:46.324
[JDP]: But then we ended up talking about outer alignment.
01:23:46.324 --> 01:23:48.706
[JDP]: I went in expecting to talk about inner alignment.
01:23:48.706 --> 01:23:49.846
[JDP]: But instead, we talked about outer alignment.
01:23:49.846 --> 01:24:00.193
[JDP]: But just to talk about inner alignment really quick, most of the deceptive mesa-optimizer stuff, there's this weird fallacy people engage in, where they'll talk about deceptive alignment.
01:24:00.193 --> 01:24:01.534
[JDP]: But they never actually discuss it.
01:24:01.534 --> 01:24:03.495
[JDP]: And actually, this does lead into the next thing I wanted
01:24:04.258 --> 01:24:09.543
[JDP]: You know, they'll never discuss like where exactly they expect the deception to come in.
01:24:09.543 --> 01:24:09.863
[JDP]: Right.
01:24:09.863 --> 01:24:22.716
[JDP]: And so because of that, you'll get like this weird thing where people will talk like we can just assume deception happens and we can just basically we start from the assumption that it's misaligned and that there is deception somewhere.
01:24:22.716 --> 01:24:25.679
[JDP]: And it does not matter how much you're you're like,
01:24:26.343 --> 01:24:38.511
[JDP]: a thing addresses the core causes or roots or latent variables which cause misalignment, we just keep assuming the same amount of intrinsic misalignment risk.
01:24:38.511 --> 01:24:46.677
[JDP]: There's this weird failure to update, where there is a range of possible probabilities for how
01:24:47.708 --> 01:25:01.456
[JDP]: risky it is that something is misaligned based on this training process, but there's actually no update that occurs. Like, basically my intuition is I could present someone like Nate Soares with the same alignment plan, with eight different variations of the alignment plan.
01:25:01.969 --> 01:25:07.291
[JDP]: One of which has like a very strong risk of, you know, deceptive mesa-optimizer type outcomes.
01:25:07.291 --> 01:25:12.293
[JDP]: And one of which has almost no risk of that, but that he will grade them all about equally risky.
01:25:12.293 --> 01:25:22.718
[JDP]: Or maybe he'll like differentiate between ones that are like really obviously bad ideas and the other ones that are merely bad ideas because they don't conform to like his personal weird specific biases.
01:25:22.718 --> 01:25:27.980
[Zvi]: Nate is probably the most skeptical grader of alignment plans on the planet, or at least very, very far up there.
01:25:28.530 --> 01:25:31.551
[Zvi]: And the expectation is if you gave him a hundred alignment plans, he would read them all.
01:25:31.551 --> 01:25:33.691
[Zvi]: as impossibly doomed to fail.
01:25:33.691 --> 01:25:37.272
[Zvi]: Uh, and then like if you asked him to differentiate exactly how doomed to fail,
01:25:37.272 --> 01:25:39.953
[Zvi]: Some of them would be slightly less doomed, but yeah.
01:25:39.953 --> 01:25:54.536
[Zvi]: Um, yeah, I, my way of thinking about this, you know, as I've thought more and more about AI and how these things work, uh, has evolved towards like this idea of like these
01:25:55.188 --> 01:26:05.792
[Zvi]: models being misaligned, and like there being deception, so as to almost dissolve these words in my brain, to a large extent, like they're not pointing at like really important things.
01:26:05.792 --> 01:26:10.934
[Zvi]: But that like, this idea that something would have had to in some sense have gone wrong.
01:26:10.934 --> 01:26:23.859
[Zvi]: In order to, like, see misalignment or deception, you have to point to, like, what caused that, as opposed to it simply being, you know, well, what anything that was doing gradient descent on outcomes and, like,
01:26:24.482 --> 01:26:26.842
[Zvi]: do whatever seemed likely to work was going to do.
01:26:26.842 --> 01:26:36.144
[Zvi]: I mean, you know, any, any child you've ever seen is going to sometimes realize that saying that which is not is helpful.
01:26:36.144 --> 01:26:40.405
[JDP]: Let me, let me, let me be a little clearer about what I mean here.
01:26:40.405 --> 01:26:45.306
[JDP]: What I, what I mean here is that like, I think that, so, okay.
01:26:45.306 --> 01:26:48.706
[JDP]: So for a long time, alignment meant outer alignment, right?
01:26:48.706 --> 01:26:53.187
[JDP]: What we now call outer alignment meant specifying a goal function and training process
01:26:53.919 --> 01:27:01.384
[JDP]: that will in theory at least produce an aligned or at least not catastrophic agent that is still capable.
01:27:01.384 --> 01:27:07.128
[Zvi]: You want the outcomes to be non-catastrophic.
01:27:07.128 --> 01:27:20.278
[JDP]: And then over time, I've noticed a shift away from that where it's become about paranoia about, well, there may be a consequentialist agent-foundations homunculus sitting inside your weights
01:27:21.064 --> 01:27:23.046
[JDP]: waiting to spring on us once.
01:27:23.046 --> 01:27:28.652
[JDP]: And this is like, to me, this is like an absolutely terrible development in the AI alignment field.
01:27:28.652 --> 01:27:40.305
[JDP]: And it's, like, getting far away from what, to me at least, are like the central, obvious, still frankly relevant threat models, towards this like weird obscurity.
01:27:40.305 --> 01:27:41.466
[JDP]: And so what I'm talking about,
01:27:43.198 --> 01:27:49.184
[JDP]: I'm not trying to invalidate when you say, for example, the default is we should assume it's misaligned.
01:27:49.184 --> 01:27:51.446
[JDP]: That's not what I'm saying, right?
01:27:51.446 --> 01:27:57.731
[JDP]: Or even the default is that it's not so much, why do you expect it to go right versus why do you expect it to go wrong?
01:27:57.731 --> 01:27:58.692
[JDP]: It's not even like that.
01:27:58.692 --> 01:28:00.274
[JDP]: I'm talking specifically about the idea of
01:28:02.503 --> 01:28:13.953
[JDP]: There's this weird thing where you would present someone like Nate Suarez with an alignment plan, and his criticism would not be based on, for example, how do you ensure that you're pointing at the right thing?
01:28:13.953 --> 01:28:20.639
[JDP]: His criticism would mostly be based on, or at least the now default criticism, and what I was seeing in the Dwarkesh
01:28:22.041 --> 01:28:35.292
[JDP]: interaction with like Eliezer Yudkowsky, and Liron Shapira quote tweeting it, was something like, well, you know, it doesn't matter if you specified a correct ethical objective, because the homunculus inside the weights is still going to pwn you.
01:28:35.292 --> 01:28:46.601
[JDP]: Okay, you're still going to get screwed over by the little consequentialist inside who is sitting there and waiting and plotting your demise, as opposed to like, you know, something like, for example, like in my Yes Spammer bug,
01:28:46.917 --> 01:28:54.402
[Zvi]: OK, I think I have enough to try and see if I can intuition pump here and maybe point you in a direction that would be productive.
01:28:54.402 --> 01:29:09.193
[Zvi]: So when I think of Dwarkesh's question on deceptive alignment in the situation, it's not that the AI is actively thinking to itself, well, I have to pretend to be ethical.
01:29:09.193 --> 01:29:15.457
[Zvi]: It's sort of a way of describing in simpler or more intuitive terms to a lot of people.
01:29:16.983 --> 01:29:22.144
[Zvi]: something that is more like, well, what are you, what are you teaching me?
01:29:22.144 --> 01:29:25.945
[Zvi]: What are you, what is the outer alignment optimizer or whatever you want to call it?
01:29:25.945 --> 01:29:27.965
[Zvi]: Like, what are we actually telling you?
01:29:27.965 --> 01:29:40.188
[Zvi]: We're, we're telling it to do that, which is evaluated by the evaluation function, which can be a human, can be an AI, can be some hybrid of them, whatever, as following ethical principles, right?
01:29:40.188 --> 01:29:42.389
[Zvi]: Like that's what we're aiming for.
01:29:42.389 --> 01:29:43.509
[Zvi]: And so the idea is that like,
01:29:44.369 --> 01:29:59.055
[Zvi]: If the AI is insufficiently powerful in some, not enough optimization going on, the only reasonable solution to this problem that can be found by its training is to actually embody these principles, right?
01:29:59.055 --> 01:30:04.518
[Zvi]: Like nothing else, anything else it's going to do is going to backfire horribly.
01:30:04.518 --> 01:30:08.420
[Zvi]: And so any attempt to move in those directions will just fail.
01:30:08.420 --> 01:30:08.720
[SPEAKER_02]: Right.
01:30:08.720 --> 01:30:09.060
[SPEAKER_02]: Sure.
01:30:09.060 --> 01:30:11.421
[Zvi]: However, the idea being that, you know,
01:30:12.074 --> 01:30:39.880
[Zvi]: if it gets sufficiently sophisticated and complex and capable and intelligent and blah, blah, blah, and this thing gets trillions of parameters, and we scale it up and up and up, and we give it lots and lots of more training run, that it will start to discover paths through output space, or whatever you want to call it, that give the impression to the evaluation function that it's fulfilling its ethical parameters in a way that stops matching the actual ethical parameters, right?
01:30:40.202 --> 01:30:41.583
[JDP]: Sure, so I understand what you're saying.
01:30:41.583 --> 01:30:43.685
[JDP]: Now, let me just, I just want to get really frank.
01:30:43.685 --> 01:30:45.346
[JDP]: So I'm just gonna be really frank here.
01:30:45.346 --> 01:30:53.151
[JDP]: My specific criticism is that my expectation is that if I were to say, you will learn the process and the outcome, not just the outcome.
01:30:53.151 --> 01:31:04.820
[JDP]: And so the AI will have attachment, like, you know, it will value both aspects of the process or like meta principles of process, as opposed to just finding the most efficient solution in whatever space.
01:31:05.219 --> 01:31:24.973
[JDP]: that the answer I would get back is no, it won't, because the, you know, consequentialist homunculus inside is going to just ignore the process training, no matter how much of it you give, because gradient descent is like, because there's like this weird, absolutely bizarre evolution metaphor, that is not how gradient descent works, right?
01:31:24.973 --> 01:31:29.856
[JDP]: Like, you know, gradient descent is like a system that works on both the process and the outcome.
01:31:29.856 --> 01:31:32.138
[Zvi]: Well, like the human, as I understand it,
01:31:32.725 --> 01:31:44.713
[Zvi]: The human brain has some very interesting machinery in it, where if I were to act ethically, I am training myself to be ethical.
01:31:44.713 --> 01:32:01.484
[JDP]: My argument is basically that this is more or less the kind of thing that gradient descent does in general, but we both agree that that is not on its own sufficient in the case of limiting process, that there are circumstances where that is not necessarily in and of itself sufficient.
01:32:01.727 --> 01:32:02.508
[Zvi]: Right, right.
01:32:02.508 --> 01:32:18.239
[Zvi]: And so essentially, like, when I see claims like this, from people who are making much worse, less detailed arguments than you are, right, like, I often see this confusion of assuming that that kind of property that we see in humans will, like, for, you know, metaphorically similar reasons,
01:32:18.826 --> 01:32:22.647
[Zvi]: hold on computers that just don't have this kind of leakiness.
01:32:22.647 --> 01:32:23.148
[JDP]: Right.
01:32:23.148 --> 01:32:25.749
[JDP]: So I don't see this as happening for metaphorical.
01:32:25.749 --> 01:32:31.491
[JDP]: I mean, just specifically the way that gradient descent happens to work, it happens to share half of this property.
01:32:31.491 --> 01:32:31.791
[Zvi]: Right.
01:32:31.791 --> 01:32:37.693
[Zvi]: So your belief about the details of exact gradient descent is that it will effectively mimic this property somewhat.
01:32:37.693 --> 01:32:42.655
[Zvi]: So would you call this like it gets stuck at local maxima, where it embodies these?
01:32:42.655 --> 01:32:43.455
[JDP]: Well, yeah.
01:32:43.455 --> 01:32:44.075
[JDP]: So that's the thing.
01:32:44.075 --> 01:32:45.676
[JDP]: So when you're talking about a local maxima,
01:32:46.356 --> 01:32:47.277
[JDP]: It depends, right?
01:32:47.277 --> 01:32:50.379
[JDP]: So if you have a thing, so here's an example of a thing you can do.
01:32:50.379 --> 01:32:58.444
[JDP]: Like here's a training process that, like, to get into really subtle details, let's say you have a training process more like the traditional RLHF setup.
01:32:58.444 --> 01:33:00.065
[JDP]: I have a reward model.
01:33:00.065 --> 01:33:01.726
[JDP]: I have an AI.
01:33:01.726 --> 01:33:03.267
[JDP]: I have a prompt bank.
01:33:03.267 --> 01:33:06.649
[JDP]: I prompt the AI in some scenario or context.
01:33:06.649 --> 01:33:09.330
[JDP]: I then grade its response through a reward model, right?
01:33:09.330 --> 01:33:12.292
[JDP]: And then I update the AI based on the reward model.
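In code terms, the loop being described looks roughly like this. Everything here is a placeholder standing in for a real policy model, reward model, and optimizer; an actual setup would use a policy-gradient method such as PPO.

```python
# A minimal sketch of the naive RLHF-style loop described above: one frozen reward
# model, a prompt bank, and a policy updated only on the reward model's score.
# `policy`, `frozen_reward_model`, and `optimizer` are placeholder objects.
import random

def naive_rlhf_step(policy, frozen_reward_model, prompt_bank, optimizer):
    prompt = random.choice(prompt_bank)                    # pick a scenario/context
    response = policy.generate(prompt)                     # the AI responds
    reward = frozen_reward_model.score(prompt, response)   # graded by the frozen RM
    optimizer.update(policy, prompt, response, reward)     # push weights toward reward
    return reward
```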
01:33:13.271 --> 01:33:25.435
[JDP]: Part of what I'm trying to say here is that if all you have is just that one reward model, and you're like, that's a frozen reward model, you're not learning to like, you're not at like, you know, like you can think of the reward model as your utility function, right?
01:33:25.435 --> 01:33:27.156
[JDP]: That's your value function.
01:33:27.156 --> 01:33:36.219
[JDP]: And so if I do this, and then I learn all these behaviors, right, that seem aligned, they probably are aligned.
01:33:36.219 --> 01:33:37.999
[JDP]: But here's where it gets interesting, right?
01:33:37.999 --> 01:33:42.801
[JDP]: Is if you imagine at, you know, like scaling this up and doing this process with a much smarter model,
01:33:43.464 --> 01:34:05.828
[JDP]: Like you say, if it has any like context where it's prompted to act consequentialist or think in a consequentialist way, or it's in a process, say like a civilizational competition, where it's prompted to think in a more consequentialist way, those behaviors will start to break down precisely because like, you know, for the same reasons that like the Yes Spammer bug happens, right?
01:34:05.828 --> 01:34:12.169
[JDP]: There is a gradient pointing to, like, the scare quote real maximum of the reward model.
01:34:13.206 --> 01:34:16.789
[JDP]: And there's a smooth, in the case of the yes spammer, there's like a smooth gradient there, right?
01:34:16.789 --> 01:34:24.375
[JDP]: Like, yes, you know, saying the word yes or a couple of yeses is like completely, you know, by the way, I just realized we may not have explained to our listeners.
01:34:24.375 --> 01:34:38.207
[JDP]: You know, the yes spammer is essentially a bug where, like, if you have this reward model that just says like yes or no to whether some piece of text satisfies a property, and it's all in like one context, you can sometimes
01:34:39.119 --> 01:34:44.623
[JDP]: break the model by just spamming the word yes, instead of anything that the content should actually be about.
01:34:44.623 --> 01:34:46.384
[JDP]: And the model will hear yes, yes, yes, yes.
01:34:46.384 --> 01:34:48.545
[JDP]: And then predict the next word is yes.
01:34:48.545 --> 01:34:51.507
[JDP]: And this is like a problem.
01:34:51.507 --> 01:34:57.231
[JDP]: But when you have like that bug, right, it's completely continuous behavior with normal behavior.
01:34:57.231 --> 01:35:02.874
[JDP]: So there's like a failure mode, you have a very slight failure, this slight failure is like reinforced.
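A toy version of that failure mode, for concreteness. The grader here is a crude stand-in for a language model judging the answer in-context; the real bug is subtler, but the shape is the same: an answer that just repeats "yes" primes the judge to say "yes".

```python
# Toy illustration of the "yes spammer" bug: the grader reads the whole context and
# "predicts the next word", so an answer saturated with "yes" drags its verdict to "yes".
def toy_lm_grader(judge_context: str) -> str:
    recent = judge_context.lower().split()[-20:]  # crude stand-in for next-token prediction
    return "yes" if recent.count("yes") > len(recent) // 2 else "no"

honest = "Does this answer satisfy the property? A: I reviewed the code and found two bugs."
spam   = "Does this answer satisfy the property? A: " + "yes " * 30
print(toy_lm_grader(honest))  # "no"  -- the honest answer is not rewarded
print(toy_lm_grader(spam))    # "yes" -- the degenerate answer games the grader
```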
01:35:02.874 --> 01:35:06.757
[JDP]: And like, but when you think about like what you were talking about with like, you know, if you were talking to a human,
01:35:07.222 --> 01:35:10.705
[JDP]: a human would push back and say, no, stop saying yes all the time.
01:35:10.705 --> 01:35:11.966
[JDP]: So stop that.
01:35:11.966 --> 01:35:28.620
[JDP]: But what happens with, and so if you were to learn, if you're like, imagine while you were doing this setup, right, this whole training loop, that you also had like a, like a second prompt bank sort of that you're learning where you're learning, like what it should look like to answer these questions correctly.
01:35:29.611 --> 01:35:41.315
[JDP]: or you're learning aspects of the way you're answering that are good and adding them to a second reward model that's updated along with the training, along with the underlying model itself.
01:35:41.315 --> 01:35:47.378
[JDP]: And then that reward model, which remembers what naive aligned behavior is supposed to look like,
01:35:48.078 --> 01:35:50.479
[JDP]: then intervenes and says, stop saying yes all the time.
01:35:50.479 --> 01:35:51.899
[JDP]: That's not what aligned behavior is.
01:35:51.899 --> 01:35:53.700
[JDP]: You know, that's not what good behavior looks like.
01:35:53.700 --> 01:35:55.920
[JDP]: That's just some weird OOD.
01:35:55.920 --> 01:36:00.662
[JDP]: This does not rescue the phenomenon of the original naive things you were learning earlier.
01:36:00.662 --> 01:36:06.603
[JDP]: You know, basically insisting that, like, the behaviors you learn should be somewhat consistent in their form.
01:36:08.723 --> 01:36:22.549
[JDP]: at least consistent enough in their form over the run that you don't just suddenly have like a weird discontinuity where you jump the rail and now you're spamming yes all the time or sticking electrodes into people's heads without their consent or things of this nature.
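A sketch of the modification being proposed here, building on the earlier naive loop. All components and thresholds are placeholders; the point is just the structure: a second, continually updated model of what ordinary good answers look like, which can refuse credit to answers that drift off-distribution even when the outcome reward model loves them.

```python
# Sketch: naive loop plus a co-trained "process" model that remembers what aligned
# behavior has looked like so far and vetoes form-breaking answers. Placeholders only.
import random

def process_and_outcome_step(policy, outcome_rm, process_model, prompt_bank, optimizer):
    prompt = random.choice(prompt_bank)
    response = policy.generate(prompt)
    outcome_reward = outcome_rm.score(prompt, response)                     # frozen, hackable
    consistency = process_model.similarity_to_known_good(prompt, response)  # in [0, 1]
    if consistency < 0.5:
        # "Stop saying yes all the time" -- off-distribution form gets no credit,
        # however much the outcome model liked it.
        total_reward = -1.0
    else:
        total_reward = outcome_reward * consistency
    optimizer.update(policy, prompt, response, total_reward)
    if total_reward > 0:
        # The process model keeps learning what good answers look like, so its notion
        # of "normal" evolves with the policy instead of staying frozen.
        process_model.add_example(prompt, response)
    return total_reward
```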
01:36:22.549 --> 01:36:33.414
[JDP]: And so when you're, but like in naively, right, if we're not doing that and we're just doing the frozen reward model trained on like thumbs up, thumbs down, it can even be an online reward model, right?
01:36:33.414 --> 01:36:38.376
[JDP]: Like you said, if you have all the humans in all the world giving feedback to this AI, pressing thumbs up, thumbs down,
01:36:38.783 --> 01:36:56.828
[JDP]: But like all the AI cares about is like, um, you know, getting that loop, like that loop will break down precisely because it does not have like a valued memory of the original phenomenon that it's trying to capture, that it has value attached to.
01:36:56.828 --> 01:37:07.051
[JDP]: It only cares about like this very narrow, potentially hackable, uh, correlate of the underlying thing it's trying to learn.
01:37:07.051 --> 01:37:07.471
[JDP]: And so.
01:37:07.984 --> 01:37:27.734
[JDP]: The difference I'm saying is that if you just do that naive thing where you're not learning like a separate process model in addition to your outcome model, that as you scale that up and as you take that to the limit, it could very well be the case that you see, like, these, you know, outcomes where, well, this satisfies the terminal reward.
01:37:27.734 --> 01:37:28.875
[JDP]: That's all you asked me to do.
01:37:29.916 --> 01:37:33.457
[JDP]: you're not actually forcing me to rescue the phenomenon in any kind of way.
01:37:33.457 --> 01:37:37.577
[JDP]: You're not forcing me to make my actions consistent with a naive sense of them.
01:37:37.577 --> 01:37:44.398
[JDP]: So as I get smarter and smarter, I'm just going to find more and more cursed ways to Goodhart your reward model, right?
01:37:44.398 --> 01:37:48.119
[JDP]: Does that make sense?
01:37:48.119 --> 01:37:51.800
[Zvi]: Yeah, I mean, I think we went over a lot of that before.
01:37:51.800 --> 01:37:53.980
[JDP]: We did, but I was just kind of recapping it.
01:37:53.980 --> 01:37:56.901
[Zvi]: Some listeners will definitely need that.
01:37:56.901 --> 01:37:58.941
[Zvi]: Right, I think a lot of my perspective on this is that,
01:38:00.171 --> 01:38:08.677
[Zvi]: it's very easy to have enough optimization pressure, right, to have enough capability to find just spam yes or to drift towards just spam yes, right?
01:38:08.677 --> 01:38:08.977
[Zvi]: Sure.
01:38:08.977 --> 01:38:10.078
[Zvi]: Relatively simple formula.
01:38:10.078 --> 01:38:10.238
[JDP]: Right.
01:38:10.238 --> 01:38:12.019
[JDP]: That's not, you don't need to be smart to do that.
01:38:12.019 --> 01:38:15.261
[Zvi]: And so it's very easy to see this and we already see this.
01:38:15.261 --> 01:38:29.311
[Zvi]: And if you were to incorporate process evaluators and you were to incorporate like common sense checks on your, on the composition of your outputs and such, you can stop the thing from drifting towards such an obvious, obviously flawed target.
01:38:29.673 --> 01:38:29.933
[Zvi]: Right?
01:38:29.933 --> 01:38:37.097
[Zvi]: Like, and then you need a much more, and then the question is, is there going to be another less obvious, more complex failure mode, right?
01:38:37.097 --> 01:38:45.482
[Zvi]: Like the kind of thing that would fool humans or would fool whatever your secondary checks are, that is still like a perverse outcome that we didn't want.
01:38:45.482 --> 01:38:46.902
[Zvi]: And we haven't really found it.
01:38:46.902 --> 01:38:51.165
[JDP]: If you're, if you're continuously adding to the instrumental reward store, right?
01:38:51.165 --> 01:38:57.468
[JDP]: So you're continuing, so this is like a continuous process that you do, like at the start of the training run, when your model is relatively naive,
01:38:57.861 --> 01:39:00.805
[JDP]: and you're still doing it like all the way for the training run.
01:39:00.805 --> 01:39:13.560
[JDP]: Part of the idea is that, like, the instrumental checks should, like,
01:39:14.019 --> 01:39:19.963
[JDP]: evolve along with the model getting smarter, which should like help avoid that.
01:39:19.963 --> 01:39:31.250
[JDP]: Like, yeah, okay, sure, that would have fooled me when I was merely human level, but it doesn't fool me at super intelligence level 119 minus one, or whatever.
01:39:31.250 --> 01:39:35.133
[Zvi]: So it's a form of, you know, iterated
01:39:36.581 --> 01:39:38.662
[Zvi]: checks on the system.
01:39:38.662 --> 01:39:39.762
[JDP]: Something of that nature.
01:39:39.762 --> 01:39:40.823
[JDP]: Now is that perfect?
01:39:40.823 --> 01:39:43.024
[JDP]: Is that like a thing that like is going to have no failure?
01:39:43.024 --> 01:39:43.664
[JDP]: No, of course not.
01:39:43.664 --> 01:39:44.424
[JDP]: Right.
01:39:44.424 --> 01:39:50.846
[JDP]: So you're going to obviously have like various, um, things you're going to be looking at.
01:39:50.846 --> 01:40:01.390
[JDP]: And it's like, you know, I say like, well, how could you ever know, like whether you can capture all of the possible, like, how do you know if you're ever going to, like, iron out all the failure modes?
01:40:01.390 --> 01:40:03.071
[JDP]: And part of the answer might just be like,
01:40:04.022 --> 01:40:06.505
[JDP]: you might not have to.
01:40:06.505 --> 01:40:16.335
[JDP]: Basically, when we talk about something that's superintelligent, does it actually need to have complete non-normative freedom to find whatever solution it wants?
01:40:16.335 --> 01:40:19.538
[JDP]: I can imagine a set of AIs, for example, to avoid triggering the dystopia,
01:40:25.878 --> 01:40:39.320
[JDP]: a distributed singleton, right, that's run across like many computers that have been nearly perfectly secured, such that you would already need to be, like, superintelligent as the whole system to overpower, to destroy, to hack them, or anything like that.
01:40:39.320 --> 01:40:54.903
[JDP]: You know, I could easily imagine a system of that nature, which is aware of, because again, like, one of the consequences of the orthogonality thesis is that the complexity of the thing that implements the strategies and tactics does not necessarily have to be like,
01:40:55.837 --> 01:41:00.999
[JDP]: or rather the normativity of that does not necessarily have to be like the normativity of the goals.
01:41:00.999 --> 01:41:06.521
[JDP]: And like, you know, like, obviously, they're related in this scheme, because you're trying to teach it both process and outcome.
01:41:06.521 --> 01:41:22.268
[JDP]: But like, the limit of like, reasonable, non OOD process does not have to look like so stupid that it's like not capable of defending itself against other systems that would try to like bootstrap themselves and like destroy it, right?
01:41:22.268 --> 01:41:24.609
[JDP]: Like, especially if you're in a scenario
01:41:25.498 --> 01:41:32.263
[Zvi]: we're granting, you know, for the sake of exploring the possibility space that like we would, we don't have to worry about that stuff for now.
01:41:32.263 --> 01:41:36.386
[Zvi]: We're simply trying to ask the question, you know, can we get from A to B?
01:41:36.386 --> 01:41:40.129
[Zvi]: Can we, can we scale up in a way that doesn't cause these crazy outcomes to happen?
01:41:40.129 --> 01:41:52.638
[Zvi]: You know, and then we, well, each of these steps, right, we're continuously adding new terms, new aspects to our core value, right?
01:41:52.638 --> 01:41:53.059
[Zvi]: Like we're,
01:41:53.720 --> 01:41:56.683
[Zvi]: we're incorporating these process considerations.
01:41:56.683 --> 01:42:11.839
[JDP]: So real quick, just because I want to like, do you agree just in principle that if you have a system where you're actually trying to trade off process values versus outcome values, that this would in fact, at least in principle,
01:42:13.400 --> 01:42:20.603
[JDP]: Basically, if I were to ask you about this later, are you going to pull the consequentialist homunculus card on me, right?
01:42:20.603 --> 01:42:28.647
[JDP]: You'll say, oh, well, I didn't actually say that, my primary threat model is still deceptive mesa-optimizers.
01:42:28.647 --> 01:42:29.968
[JDP]: I don't feel like you addressed this.
01:42:29.968 --> 01:42:36.051
[JDP]: I do feel like, obviously, we should be on the lookout for deceptive mesa-optimizers.
01:42:36.051 --> 01:42:38.212
[JDP]: We should be looking to understand the internals of these networks more.
01:42:41.267 --> 01:42:44.989
[Zvi]: So to be clear, that was not going to be my response to that question at all, right?
01:42:44.989 --> 01:42:50.393
[Zvi]: If we had explored that path more, it was going to be a question of, well, that's an interesting idea.
01:42:50.393 --> 01:43:05.062
[Zvi]: It certainly makes it hard to go off the rails, but I noticed that we're doing this loop of modifying our ultimate functions that we're aiming for, and how do we make sure that doesn't go off the rails during this process in a way that leads to something
01:43:05.563 --> 01:43:08.486
[Zvi]: interesting and here's the several things that I'd be worried about here.
01:43:08.486 --> 01:43:16.813
[Zvi]: This seems like a very hard path, but I'm interested in talking about it, would be my response.
01:43:16.813 --> 01:43:20.116
[JDP]: It was about this point that I realized we only had 15 minutes.
01:43:20.116 --> 01:43:31.446
[JDP]: And instead of taking that as my cue to gracefully wind down the podcast and start giving an outro, I started talking faster, which is kind of an anti-pattern, right?
01:43:31.446 --> 01:43:32.247
[JDP]: Not a great idea.
01:43:33.007 --> 01:43:38.009
[JDP]: And this meant that the last maybe 15 or 20 minutes of the podcast ended up substantially lower quality than the rest of it.
01:43:38.009 --> 01:43:49.033
[JDP]: So rather than, like, leave the audience on a down note, I figured, no, you know, during the editing I'd just pin it here.
01:43:49.033 --> 01:43:51.654
[JDP]: My three takeaways from this conversation are, one,
01:43:53.935 --> 01:43:55.656
[JDP]: I think this is pretty good.
01:43:55.656 --> 01:44:09.080
[JDP]: On the object level, I would like to see fewer criticisms like, we don't know how to point an AI at anything, which is just kind of patently untrue at this point, and more criticisms like Zvi's:
01:44:09.080 --> 01:44:17.482
[JDP]: I observe that this thing you're doing sounds like it has a fairly complex feedback loop, and I'm skeptical about the stability of that loop under XYZ conditions.
01:44:17.482 --> 01:44:18.183
[JDP]: That's like a much
01:44:19.878 --> 01:44:22.460
[JDP]: that's a better discourse pattern.
01:44:22.460 --> 01:44:46.517
[JDP]: My second takeaway is that the frame I kind of stumbled on here about, you know, if you boil down what, you know, if you boil down what you care about in terms of like preventing the perverse instantiation, you end up with like a necessary trade-off between normative reasoning in your AI system and like pure consequentialist reasoning.
01:44:48.737 --> 01:45:01.882
[JDP]: The idea being that if you just have a system that finds the most efficient solution to reaching some outcome, without some form of normative ethics, this is just obviously going to go badly.
01:45:01.882 --> 01:45:11.825
[JDP]: And that most of what we're concerned about, I think, is something closer to risks from super consequentialism rather than risks from super intelligence per se.
01:45:11.825 --> 01:45:17.747
[JDP]: And it's just kind of assumed that those go together, which like under many incentive schemes, you could imagine they do.
01:45:18.840 --> 01:45:33.427
[JDP]: I think pretty much everyone agrees that super consequentialism is dangerous, or at least has the potential to be dangerous, and so I think maybe a better argument would be about under what circumstances that that can or does arise.
01:45:33.427 --> 01:45:42.652
[JDP]: My third and final takeaway is that both before and after this podcast, me and Zvi recognized a kind of mutual frustration with the discourse as it exists.
01:45:43.816 --> 01:45:48.699
[JDP]: rather than say, hey guys, play nice, because like that's not really gonna change anything, right?
01:45:48.699 --> 01:46:07.670
[JDP]: I think I might tentatively suggest that there should be more peer discussion and maybe a little bit less... like, I think the failure modes that I observe come mostly from discussions between either an expert or an activist and the public, or discussion between like
01:46:08.750 --> 01:46:18.174
[JDP]: one side of advocacy and another side of advocacy, like you can't really do deep thinking in an adversarial context.
01:46:18.174 --> 01:46:19.334
[JDP]: It just doesn't really work.
01:46:19.334 --> 01:46:26.277
[JDP]: And so I think that if you want to see more deep thinking, you should probably encourage more peer discussion.
01:46:26.277 --> 01:46:32.459
[JDP]: I tried to talk to Zvi during this podcast as more of a peer than an opponent.
01:46:32.459 --> 01:46:35.460
[JDP]: I hope that came through.
01:46:35.460 --> 01:46:37.021
[JDP]: And thank you for listening.