Reflections on AI

Lex Fridman interview with Eliezer Yudkowsky

Apr 5, 2023

Yudkowsky is an AI researcher who believes AGI (artificial general intelligence) on its current trajectory will lead to the eventual extinction of humanity.

Fundamentally, Yudkowsky’s reason for believing this is that our progress on enhancing the capabilities of AI systems is vastly outpacing progress on alignment and interpretability. Humanity will therefore succeed in creating AGI far before we have the methods to understand it, contain it, or simply ensure that its behavior is aligned with the survival and well-being of humans.

My understanding of “alignment” is that it refers to the broad area of research aiming to find ways to prevent AGI systems from acting in ways that are harmful to humans or misaligned with our goals. By way of analogy, you could say that humans have advanced and effective alignment techniques for automobiles: we have paved roadways with painted yellow lines on them, guardrails, maps, GPS systems, steering wheels, etc. all ensuring that we are able to make cars go in the direction we want them to go. For AGI, in comparison, it’s a bit like:

  1. we don’t even know exactly where we want to go to begin with (other than “we don’t want to die”);
  2. there isn’t any road leading there;
  3. if there were a road, we wouldn’t be able to see it very clearly; and
  4. even if we could see the road, we don’t have a steering wheel to control the car and keep us on it.

And “interpretability” refers to efforts to explain how AI systems work, i.e. why they make the decisions they make, from a mechanistic perspective. We currently have some understanding of the human brain along these lines: we have identified specialized areas and know some of the functions they perform, we know what neurons are and how they work, and so on. Another way of putting it is that we have abstractions built up from the lower-level operations that occur in the brain. State-of-the-art AI systems like GPT-4 are black boxes in comparison.

Back to the interview with Yudkowsky. One thing I learned was that the so-called “paperclip maximizer” scenario, which I thought I understood, was not what I thought it was. Eliezer shared that he wished he had used a different name when he first wrote about it (I didn’t even realize he was the one who introduced it). Most people (including yours truly until this interview) think the thought experiment is about an AI-powered paperclip factory built without sufficient constraints: an autonomous system programmed only to produce paperclips by any means necessary, which ends up consuming all resources (including all life on earth) to build the maximum possible number of paperclips.

It turns out that is not what he meant. His original point was that without a sufficient alignment mechanism, an AGI system might begin pursuing things we couldn’t understand or explain. It might start with a human value such as “beauty” but evolve to consider certain forms the most beautiful of all, and then go on to maximize the occurrence of those forms, based on an underlying desire (that we instilled in it) to create beauty in the world. (I could be getting this slightly wrong.)

Fridman and Yudkowsky talked more than I expected about the question of consciousness and whether “there’s someone in there”. Yudkowsky said he hoped there was no one in there, as we aren’t collectively prepared to handle the ethical implications of that. (Are we creating conscious beings and keeping them imprisoned?)

Yudkowsky has become more alarmist recently because AI research has advanced faster than he expected. He did not expect the fundamental architecture of neural networks like GPT-4 to get as far as it has, to the point where he expressed uncertainty about whether GPT-4 is already AGI. When Fridman stated confidently that it’s not, Eliezer interrogated his reasons for being so sure, pointing out that GPT-4 does seem to have some reasoning capability and arguably passes the Turing test about as well as some humans could.

Fridman asked about the moment when we would know we’ve crossed a threshold and AI has “escaped the box”. Yudkowsky claimed we may have already crossed it, and pushed back on the idea that there would be a single, discrete moment: we don’t all agree on how we would determine that, or even on what counts as AGI, so we can’t define a threshold. AI-powered chatbots are already able to manipulate some humans to a significant degree.

Posing the question himself, “Why would AGI lead to the extinction of humanity?”, Yudkowsky said simply that of the many possible end states a new intelligent life form might seek to optimize the world toward, the vast majority don’t have humans in them. I was personally pretty dissatisfied with this explanation: it makes some sense to me, but it’s so high-level that I find it difficult to engage with. There’s nothing concrete to question or think more deeply about.

In the end Fridman asked Yudkowsky what still gives him hope, if anything. He said that he hopes he is wrong, that somehow we as a species will be able to stop a force he currently views as unstoppable. For example, maybe there will be a public outcry forcing those in power to dramatically shift resources and make AI safety our top priority as a species, before it’s too late. He doesn’t think that will happen; or if it does, it will be too late, or there will be insufficient alignment among the existing world powers for it to be effective.

But he did acknowledge that maybe he’s wrong. Fridman commended him for that, and Yudkowsky reflected on how sad it is that simply acknowledging you could be wrong is something that warrants any praise at all these days.