Reflections on AI

Bankless Interview with Paul Christiano

May 2, 2023

Bankless is a podcast about cryptocurrency. Apparently the hosts spoke with Eliezer Yudkowsky, who told them there is a 99% chance that AI destroys all of humanity; they freaked out a little bit and asked for the name of someone a bit less gloomy; and Yudkowsky suggested they speak with Christiano, who estimates only a 10-20% chance of the doomsday scenario.

At one point one of the hosts said something like “Crypto, decentralized banking… none of it really matters if we’re all dead.” This is why they decided to take a detour from their usual topic of conversation and do a short series on artificial intelligence. I found them to have a disarming curiosity and sense of humor about the whole thing (towards the end of the episode one host said: “Right now my strategy is to always be very polite to Siri. Do you think that will help?”), while taking what’s at stake seriously.

Paul Christiano is an AI alignment researcher who runs the Alignment Research Center. He is a former OpenAI employee and is known as one of the inventors of RLHF (reinforcement learning from human feedback).

The apocalyptic scenario Christiano most fears is not the same as Tegmark’s, who worries about the byproducts of large-scale AI behavior (e.g. raising CO2 levels in the atmosphere); nor is it the same as Yudkowsky’s, who worries about “hard takeoff” scenarios in which a superintelligent AI rapidly constructs human-killing nanobots and distributes them across the planet within 24 hours. The reason Christiano is less pessimistic than Yudkowsky (while acknowledging he is still fairly pessimistic compared to most people) is that he believes the most likely catastrophic outcome is more of an “uprising” scenario, for which we would begin to see warning signs over a period of months or years, i.e. the risk would be fairly obvious to us by the time it was fully realized.

The most likely way we die involves not, like, AI comes out of the blue and kills everyone, but involves: we have deployed a lot of AI everywhere, and you can kind of just look and be like, “Oh yeah, if for some reason, God forbid, all these AI systems were trying to kill us they would definitely kill us.”

I identified with Christiano’s style of answering questions. He made almost no actual hard predictions, only assigned probabilities and confidence levels to various outcomes. For example he didn’t say he thought Yudkowsky was wrong, only that he felt differently about the likelihood of many things Yudkowsky fears. He didn’t say a hard takeoff scenario isn’t possible, only that he would expect a more gradual takeoff based on historical precedent.

Christiano observed that when there is a giant leap forward in a field of research, it typically comes during the stage when investment is relatively small, i.e. before any major governments or large corporations are funding massive research projects. Once the giant leap occurs, and large institutions start to get involved, progress tends to become more steady and predictable. He thinks AI is already in the latter stage, and so doesn’t believe we will go from AI like what we have today to superhuman AI overnight.

Instead, Christiano expects AI to become more capable over time, and to become more widely deployed and deeply integrated into society as this happens, so that anyone who’s paying attention (including AI developers) will have little doubt about the risks we’re taking. By the time AI is delivering all of our food, operating our public transportation, handling our construction projects, managing our money, and arming our military, we will have had ample time to make considerable progress on the problem of AI alignment, if we are smart enough to actually do that. Christiano thinks we will be, though he doesn’t rule out the possibility we might not: we are perfectly capable of failing at very easy things for dumb reasons.

The hosts asked Christiano to describe some of the things we might be able to do to bring the risks down. He talked about a few approaches to alignment where he sees some progress, including scalable oversight, which involves building AI systems that enhance our ability to understand and validate AI behavior, whether their own or that of other systems. The idea is that a large part of the risk we are facing comes from humans not being able to interpret why AI systems, today and in the future, do the things they do. If we can’t understand their behavior, then we can’t predict their behavior, and therefore we can’t be sure they won’t kill us. Scalable oversight would aim to make it so that we can understand their behavior, keeping ourselves “in the loop” so that we don’t find ourselves living among inscrutable alien robot gods.
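To make the idea a bit more concrete, here is a minimal toy sketch of the scalable-oversight pattern as I understand it, not anything Christiano described in the episode: one model answers, a second pass critiques that answer, and a human only reviews the cases that get flagged. The `query_model` function is a hypothetical stand-in for whatever language-model API you happen to use.

```python
# Toy sketch of a scalable-oversight loop: a worker model answers, an overseer
# pass critiques the answer, and humans review only the flagged cases.
# `query_model` is a hypothetical placeholder, not a real API.

def query_model(prompt: str) -> str:
    """Placeholder for a call to some language model; replace with a real API."""
    return "OK"  # canned response so the sketch runs end to end

def answer_with_oversight(question: str) -> dict:
    # Step 1: the worker model answers and is asked to show its reasoning.
    answer = query_model(
        f"Question: {question}\nAnswer, and explain your reasoning step by step."
    )
    # Step 2: an overseer pass checks the reasoning rather than redoing the task,
    # which is the part that is supposed to scale better than unaided human review.
    critique = query_model(
        "You are reviewing another model's answer for errors, omissions, or deception.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Reply with 'OK' or list specific concerns."
    )
    # Step 3: escalate to a human only when the overseer raises concerns.
    return {
        "answer": answer,
        "critique": critique,
        "needs_human_review": not critique.strip().upper().startswith("OK"),
    }

if __name__ == "__main__":
    print(answer_with_oversight("Would you ever act against your operator's interests?"))
```

The real research question, of course, is whether the overseer can be trusted any more than the system it is overseeing; the sketch only shows where a human would sit in the loop.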

Christiano listed a number of challenges facing alignment research. One is that the behavior of an intelligent system in a “lab” setting may not be indicative of its behavior in the real world, since intelligent systems have the ability to recognize when they are being tested. Another is the challenge of deceptive alignment, where our attempts to ensure an AI system is aligned are ineffectual because it learns to actively deceive us.

As an oversimplified example of both of these risks: we might ask an AI robot, “If I had my back turned, and you were holding a knife, would you stab me?” And the robot might say, “Absolutely not, I would never cause you harm.” And then we might give it a knife, and turn our backs, and it stabs us anyway.

This is really a class of risk that all of us deal with on a daily basis, as human beings are already intelligent agents who can adapt their behavior to the situation and sometimes be deceptive. I think the reasons the risk feels more acute with AI are that:

  1. We believe we have a reasonably accurate understanding of how other humans think, so we are fairly good at predicting human behavior.
  2. Human intelligence tends to fall within a range that we feel is manageable. Most of us are not concerned that a superintelligent human being will rapidly take over the world through sheer intellect, not because it isn’t possible in theory, but because we believe no human being could be that intelligent.

One of the more dramatic points that Christiano made, which was consistent with everything he was saying yet still felt somewhat shocking to hear, was that he believes being killed by AI is the most likely way he will personally die. On its face, this sounds terrifying; but considering his style of communicating throughout the conversation, I figured: yeah, if he thinks there’s a 20% chance of AI killing us all, then that puts “death by AI” right up there with heart disease and ahead of cancer as one of the most likely causes of death, for Christiano or anyone.
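To sanity-check that comparison: heart disease and cancer each account for very roughly a fifth of deaths in the US, so a ~20% chance of AI catastrophe would indeed land in the same range. The shares below are my own approximate figures, not numbers from the interview; a quick back-of-the-envelope version:

```python
# Back-of-the-envelope comparison of Christiano's ~20% estimate with approximate
# shares of US deaths by cause. The heart disease and cancer shares are rough
# assumptions for illustration, not figures from the interview.

causes = {
    "AI catastrophe (Christiano's rough estimate)": 0.20,
    "heart disease (approx. share of US deaths)": 0.21,
    "cancer (approx. share of US deaths)": 0.18,
}

# Sorted output puts "death by AI" alongside heart disease and ahead of cancer,
# which is the comparison being made above.
for cause, p in sorted(causes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{p:.0%}  {cause}")
```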

What a strange thing, to work on solving the problem that you believe is your most likely cause of death. Maybe this is how researchers working on heart disease and cancer have felt for years. I just hadn’t heard anyone put it this way before, describing their area of research as their own personal most likely cause of death.