Science fiction author Isaac Asimov once presented a set of laws that humans should program into their robots. In addition to a first, second, and third law, Asimov introduced a “Zeroth Law,” so important that it precedes all the others: “A robot may not harm humanity, or, by inaction, allow humanity to come to harm.”
This month, computer scientist Yoshua Bengio, known as a “godfather of AI” for his pioneering work in the field, launched a new organization, LawZero. As you might guess, its main mission is to ensure that AI doesn’t harm humanity.
Though his research helped lay the groundwork for today’s advanced AI, Bengio has become increasingly concerned about the technology in recent years. In 2023, he signed an open letter urging AI companies to pause work on next-generation AI. Given the current harms of AI (such as bias against marginalized groups) and its potential future risks (such as engineered bioweapons), the signatories argued there were very strong reasons to think a slowdown would be a good thing.
But businesses are businesses, and they haven’t slowed down. In fact, they’ve created autonomous AIs known as AI agents, which can see your computer screen, click buttons, and perform tasks, just like you can. While ChatGPT must be prompted by a human every step of the way, an agent can accomplish multi-step goals with minimal prompting, like a personal assistant. Right now, those goals are simple (creating a website, for example) and agents don’t yet perform all that well. But Bengio worries that giving AIs agency is an inherently risky move: they could eventually escape human control and go “rogue.”
So now, Bengio is pivoting to a backup plan. If he can’t stop his colleagues from building AI that matches human intelligence (artificial general intelligence, or AGI) or even surpasses it (artificial superintelligence, or ASI), then, he says, he’ll build something to block those AIs from harming humanity. He calls it “Scientist AI.”
Scientist AI will be unlike an AI agent: it will have no autonomy and no goals of its own. Instead, its main job will be to calculate the probability that another AI’s action will cause harm and, if the action is too risky, to block it. AI developers could layer Scientist AI on top of their models to prevent them from doing something dangerous, similar to how we put guardrails along roads to prevent cars from skidding off.
I spoke with Bengio about why he’s so disturbed by today’s AI systems, whether he regrets the research that led to their creation, and whether he thinks throwing even more AI at the problem will be enough to solve it. A transcript of our unusually candid conversation follows, edited for length and clarity.
When people express concern about AI, they often express it as a concern about artificial general intelligence or superintelligence. Do you think that’s the wrong thing to worry about? Should we worry about AGI or ASI only to the extent that it includes agency?
Yes. You could have a superintelligent AI that doesn’t want anything, and it’s just not dangerous, because it has no goals. It’s like a very intelligent encyclopedia.
Researchers have been warning for years about the risks of AI systems, especially systems with their own goals and general intelligence. Can you explain what makes the situation newly frightening to you now?
In the last six months, we’ve seen evidence of AIs so misaligned that they would go against our moral instructions. They would plan and do these evil things: lie, cheat, try to persuade us with deception, and, worst of all, try to escape our control, resist being shut down, and do anything [to avoid shutdown], including blackmail. These aren’t an immediate danger, because they’re all controlled experiments… but we don’t know how to really deal with this.
And this bad behavior increases the more agency the AI system has?
Yes. The systems we had last year, before we got into reasoning models, were much less prone to this. It’s getting worse and worse. That makes sense, because we see their planning ability improving exponentially. And [the AIs] need good planning to strategize about things like, “How am I going to convince these people to do what I want?” or “How do I escape their control?” So if we don’t fix these problems quickly, we could end up with, initially, accidents that are almost funny, and later, an accident that is not funny at all.
That’s what motivates what we’re trying to do at LawZero. We’re trying to think about how to design AI differently so that, by design, it won’t have any incentive or reason to do such things. In fact, it won’t want anything.
Tell me how Scientist AI could be used as a guardrail against the bad actions of an AI agent. I’m imagining Scientist AI as the AI agent’s nanny, double-checking what it’s doing.
So, to act as a guardrail, you don’t need to be an agent yourself. All you need to do is make a good prediction. And the prediction is this: Is this action that my agent wants to take morally acceptable? Does it satisfy the safety specifications that humans have provided? Or is it going to harm someone? And if the answer is yes, with some probability that’s not very small, then the guardrail says: no, this is a bad action. And the agent has to [try a different] action.
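To make the shape of that idea concrete, here is a minimal sketch in Python of the guardrail loop Bengio describes. It is only an illustration of the logic, not LawZero’s actual design: the names (`Action`, `guarded_step`, `predict_harm`) and the 1 percent threshold are hypothetical stand-ins.

```python
# A hypothetical sketch of a guardrail wrapping an AI agent, in the spirit of
# what Bengio describes. Names and the threshold are illustrative, not
# LawZero's actual design.

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Action:
    description: str  # e.g., "send this email on the user's behalf"

# Stand-in for Scientist AI: a predictor that estimates the probability that
# a proposed action causes harm or violates the human-provided safety spec.
HarmPredictor = Callable[[Action], float]

RISK_THRESHOLD = 0.01  # block anything with more than ~1% estimated harm probability

def guarded_step(agent_propose: Callable[[], Action],
                 predict_harm: HarmPredictor,
                 max_attempts: int = 5) -> Optional[Action]:
    """Ask the agent for an action and pass it through only if the predicted
    probability of harm is small; otherwise make the agent try again."""
    for _ in range(max_attempts):
        action = agent_propose()
        if predict_harm(action) < RISK_THRESHOLD:
            return action  # acceptable: hand it off for execution
        # Too risky: reject, and the agent must propose a different action.
    return None  # no acceptable action found; refuse to act at all
```

The design point to notice is that the guardrail never acts on the world itself; it only estimates risk and vetoes.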
But even if we build Scientist AI, the domain of “What is moral or immoral?” is famously contentious. There’s simply no consensus. So how would Scientist AI learn what to classify as a bad action?
It’s not for any one person or AI to decide what’s right or wrong. We should establish that through democracy. The law should try to be clear about what’s acceptable and what’s not.
Now, of course, there can be ambiguity in the law. A corporate lawyer, say, can find loopholes in it. But there’s a way around this: Scientist AI is designed to see the ambiguity. It will see that there are different plausible interpretations of a particular rule, let’s say. And then it can be conservative about the interpretation, as in: if any of the plausible interpretations would judge this action as really bad, then the action is rejected.
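Continuing the hypothetical sketch above, that conservative rule amounts to taking the worst case across interpretations; the function name and signature here are, again, illustrative.

```python
# Conservative handling of ambiguity (hypothetical, continuing the sketch above):
# if a rule admits several plausible interpretations, score the action under
# each one and keep the worst case, so one bad reading is enough to block it.

def conservative_harm(action, interpretations, predict_harm_under):
    """interpretations: plausible readings of an ambiguous rule.
    predict_harm_under(action, interpretation) -> estimated harm probability.
    Returns the worst-case (maximum) estimate across interpretations."""
    return max(predict_harm_under(action, i) for i in interpretations)
```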
I think one problem is that almost any moral choice will have some ambiguity. Think of the most contentious moral issues in the United States, like gun control or abortion: even when they’re decided democratically, a significant proportion of the population opposes the outcome. How do you propose to deal with that?
I don’t have a complete answer to that. But having more honesty and rationality in the answers would, in my opinion, already be a huge gain compared to the kind of democratic discussions that are happening now. One of the characteristics of Scientist AI, like a good human scientist, is that you can ask it: Why are you saying this? And it wouldn’t just assert an answer; it would come up with a justification.
The AI would be involved in the dialogue to try to help us reason through the pros and cons, and so on. So I think these kinds of machines could become tools to aid democratic debate. It’s a little more than fact-checking; it’s also something like reason-checking.
This idea of developing Scientist AI stems from your disillusionment with the AI we’ve developed so far. And your research was fundamental in laying the groundwork for that AI. On a personal level, do you feel any sense of internal conflict, or do you regret having done the research that laid that groundwork?
I should have thought of this 10 years ago. In fact, I could have, because I read some of the early work on AI safety. But I think there are very strong psychological defenses that I had, and that most AI researchers have. You want to feel good about your work, and you want to feel like you’re the good guy, not doing something that might cause a lot of harm and death in the future. So we look the other way.
And for me, I was thinking: This is so far in the future! Before we get to the science-fiction-sounding things, we’re going to have AI that can help us with medicine, climate, and education, and it’ll be great. So let’s worry about those things when we get there.
But that was before ChatGPT came along. When ChatGPT came along, I couldn’t keep living with this internal lie, because, well, we’re getting very close to the human level.
The reason I ask is that I was struck by your plan for Scientist AI, which you say is modeled on the Platonic idea of a scientist: a selfless, ideal person who is just trying to understand the world. I thought: Are you trying to build the ideal version of yourself? Is it like the self you wish you could live up to?
You should be doing psychotherapy as well as journalism! Yes, you’re pretty close to the mark. In a way, it’s an ideal I hold up for myself. I think it’s an ideal scientists should look to as a model, because for the most part in science, we must distance ourselves from our emotions to avoid biases, preconceived ideas, and ego.
A couple of years ago, you were one of the signatories of the letter urging AI companies to pause their cutting-edge work. Obviously, the pause didn’t happen. For me, one of the takeaways from that moment was that we’re at a point where this isn’t predominantly a technological issue. It’s a political one. It’s really about power, and about who gets the power to shape the incentive structure.
We know that incentives in the AI industry are horribly misaligned. There’s massive commercial pressure to build cutting-edge AI. To do that, you need a ton of computing power, so you need billions of dollars, so you’re forced to get into bed with a Microsoft or an Amazon. How do you propose to avoid that fate?
That’s why we’re doing this as a nonprofit. We want to avoid the market pressures that would force us into a capabilities race, and instead focus on the scientific aspects of safety.
I think we could do a lot of good without having to train frontier models ourselves. If we came up with a methodology for training AI that is convincingly safer, at least in some respects like loss of control, and gave it away almost for free to the folks building AI… well, the people at these companies aren’t going to do that safety work on their own. They just don’t have the incentive to do it! So I think just knowing how to fix the problem would reduce the risks considerably.
I also think governments will hopefully take these questions more and more seriously. I know it doesn’t look that way right now, but as we start to see more evidence of the kind we’ve seen in the last six months, public opinion might get frightened enough to push for regulation. It could also happen for purely market reasons: [AI companies] could be held liable, for example. So at some point, they might reason that they should be willing to pay some money to reduce the risks of accidents.
I was glad to see that LawZero isn’t just talking about reducing the risks of accidents, but also about “protecting human joy and endeavor.” Many people fear that if AI becomes better than them at things, well, what’s the meaning of their life? How would you advise people to think about the meaning of their lives if we enter an era where machines have agency and extreme intelligence?
I understand that it would be easy to get discouraged and to feel powerless. But the decisions humans will make in the coming years, as AI becomes more powerful, are incredibly consequential. So there’s a sense in which it’s hard to find more meaning than that! If you want to do something about it, be part of the thinking, be part of the democratic debate.
I would advise us all to remind ourselves that we have agency. And we have an incredible task ahead of us: shaping the future.
