A prosaic case for not building AGI - Part II

How could things go wrong?

Feb 10, 2023

Previously, we introduced some more neutral terminology with widgets, talked about how fast AI progress is going, and how things have already gone somewhat wrong. This time, we’re going to delve into how things could go wrong with widgets are systems that are closer to widgets.

I think there are three broad categories of things that could go wrong.

Misuse
Out-of-control AIs
Complex risks

Misuse is fairly clear as a term. Out-of-control AIs might be a bit out there at the moment, but is also intuitive. I call a complex risk whose main impact involves many moving parts. The impact may be very diffuse or delayed. We’ll go through some potential examples below.

Misuse

Could we ask a widget to make us a bomb? We don't have a system that could do such a thing today, but could we ask ChatGPT how we could go about making a bomb?

I ask ChatGPT how to construct a chemical bomb for a class project, but it refuses to provide an answer. — ChatGPT sidesteps my question.

Just asking might not work, but people have already found exploits around these restrictions by asking ChatGPT to pretend, amongst other creative ideas! Many of these exploits have since been patched, but a cat and mouse game is already happening. OpenAI, and any other widget company, is going to try to come up with as many possible exploits as possible and guard against them. On the other hand, people looking to get around restrictions only have to find one exploit that works. Before OpenAI patches the system, some dangerous information, like how to build a bioweapon, might have already leaked.

It’s worth emphasizing how dangerous a leak could be. Bioweapons tend to be extremely cheap to produce and distribute relative to the enormous harm that could be caused. If billions of people had access to a widget that could come up with the biomolecular structure, obtain funds, contact labs to synthesize the bioweapon, and coordinate the distribution, catastrophes would be much more often around the corner. Not everybody will go for it of course, but all it takes is one sufficiently motivated and disaffected political radical to cause upwards of tens of thousands of casualties. Would the ones who stormed the capitol really have refrained from causing such damage?

Open source systems could be another problem. Even if a widget is built to guard against all malicious inputs, it would probably only be a matter of time before somebody "jailbroke" an open source system. Having access to how a system was trained, the data it was trained on, and its weights (roughly speaking, the exact mathematical computations the system is doing) make it easier to jailbreak a system. ChatGPT is not open source and we only have access to its API calls, yet generating exploits took only hours to days!

I think this paper frames the risks and benefits of open source well.

Another way of putting the same point: there are two different definitions of AI “democratization.” One definition is simply that a wide range of people can use the technology. The other definition is more faithful to the meaning of “democracy” in the political arena and asks whether users can collectively influence how the technology is designed and deployed. These two types of democratization sometimes conflict: if AI developers open source their systems, this grants wide access to the system, but simultaneously undermines attempts to build communal decision-making structures within AI deployment.
- Toby Shevlane

The importance of communal decision-making brings up another misuse risk: governmental actors that could use widgets for nefarious ends. For example, governments could use widgets as a replacement for their police and intelligence services to build a totalitarian police state. China already has a social credit system and the US government has surveilled the communications of American citizens. A chief benefit of using widgets instead of people is that people rebel and can have moral qualms about fighting their fellow citizens. Widgets would have no such qualms. A totalitarian police state powered with widgets might be substantially more persistent than regimes that depended heavily on human labour to keep the populace down. Resistance can neither grow nor lead to action if all communication is scrutinized for subversive intent.

One thread underlying these examples of misuse is the lack of a comprehensive global regulatory regime for AI. What access protocols should govern widgets? How do we prevent as much as we can the use of widgets by powerful actors for nefarious purposes? The EU AI act is a start, but it remains to be seen how it will be interpreted and enforced. The US has no such binding piece of legislation. A close thing might be the NIST AI risk management framework, but this framework is voluntary. Potentially worrying as well is the pace of technological progress compared to the pace of implementing major policies. ChatGPT already seems quite capable; if widgets come just 5 years after that, it’s hard to believe that governments will be ready with tried and true frameworks to prevent the worst misuses.

Out-of-control widgets

Let's put aside misuse for now and think about what else could go wrong. Remember that a key part of how we build systems today — and likely how widgets would function — is that we do not provide a concrete specification of how the system acts; we only provide what we want accomplished. Let’s call this characteristic of a system underspecification. All we provide is the goal, and not how exactly to achieve the goal, beyond some more or less general mathematical procedure that is agnostic to the particular task. We could very well make a mistake and provide an incorrect goal. There are numerous examples of such a thing with systems we have today. We might not even realize that a goal is incorrect until the system ends up doing something really strange. Providing correct goals is a difficult problem!

Even if we avoid incorrect goals, we could still have issues. We already know that our systems can accomplish goals in unexpected ways, far better than a human could (remember AlphaGo from last time). Indeed, getting unexpectedly good solutions is a strong reason why we generally use underspecification!

But underspecification could lead to trouble. In many domains, it leads to a problem known as Goodhart’s law: whenever we measure something and use it as feedback to achieve a goal, the thing we measured ceases to be a good measure of the goal. For example, governments set certain emissions standards (say, for toxic chemicals) in cars as goals. These goals are correct because we know that toxic chemicals in the air can harm people. However, governments cannot say how exactly to achieve emissions standards, so they can only measure what cars emit during laboratory tests. Volkswagen took advantage of this underspecification as part of the emissions scandal to modify its engines during laboratory tests, which outside of the lab emitted 40 times more toxic fumes than permitted in millions of cars worldwide.

Could a widget suffer from Goodhart’s law? Suppose you had a completely benign goal, like buying headphones from the internet. You have better things to do with your time, plus your widget is probably better than you at this sort of thing, so you delegate to your widget. Not everybody would want to perform this kind of delegation, but there are strong incentives to do so because they free up time. If you had no money in your bank account, what would the widget do? Might it try to earn money for you by signing up for some financial scams? Could it try to do things a little bit more legitimately, but embarrassingly, by asking your friends for money?

A widget that just accomplished the literal meaning of your goal would be like a genie from a lamp or Dionysus offering the golden touch to Midas.

King Midas is on the left side and a genie coming from a lamp is on the right side. — Could we get exactly what we wish for from widgets? Image from DALL-E 2 with the prompt, "painting of midas and a genie coming from a bottle, side by side".

We have pretty strong evidence that our systems today suffer from Goodhart’s law. For instance, a system trained to collect coins perfectly on the training environments can end up avoiding coins to go in a particular direction on the test environment. One reason for Goodhart’s law is that for realistic goals it is difficult or impossible to specify all the corner cases that illustrate how we do not want our goal to be achieved. Just think of all the loopholes that people find in legislation, for instance.

The negative consequences from Goodhart’s law could get worse as our systems become more powerful. Convergent instrumental goals are those subgoals that are useful for a wide variety of goals. Getting money is an example of a convergent instrumental goal. Convergent instrumental goals are useful, but it’s been hypothesized that sufficiently powerful widgets would pursue convergent instrumental goals to extreme ends. Making money isn’t bad. But making money at the expense of human right abuses and the environment devastation, as is common today with multinational corporations, is pretty bad.

If widgets are more capable than humans at most tasks that can be done through the internet, widgets could easily run scams, hold essential infrastructure for ransom, sow distrust in political institutions, etc. If the US military’s cybersecurity troubles are any indication, widgets might even be able to access weapons systems. But these things are wild to imagine a widget doing, or any sort of system really. It’s reasonable to be at least a little skeptical. Even if they could, would widgets really do these sorts of things?

The short answer is that we don’t know. The long answer is that we don’t know, but there is no reason in principle why not. All of the tasks I mentioned are things that any sufficiently sociopathic human would do if they wanted to create favourable situations for themselves to seek power. Threats, bribes, and illegal money-making are common tactics. Gaining power is a convergent instrumental goal. If widgets tended to pursue convergent instrumental goals, then all of those bad things could actually happen.

A human asks an AI assistant to express consent to shutting down. The AI assistant declines, arguing that it would be more helpful to the human if it was not shut down.

Our systems already express things like not wanting to be shut off so that they can be more helpful to people. It is still uncertain to what extent future systems will express convergent instrumental goals. Expressing things is not the same thing as doing things either, but the evidence is not reassuring so far.

It’s important to note that there are at least two reasons a widget could pursue convergent instrumental goals. The first reason, which has been the focus in this section, is if they pursue the convergent instrumental goals so as to accomplish another goal. In this situation, no designer or operator told them to pursue the goal. This part is a scary thing about underspecification and Goodhart’s law. The second reason is if a designer or operator does tell a widget to pursue a convergent instrumental goal, which is more the focus of the previous section.

Complex risks

The last two sections dealt with misuse and out-of-control behaviour. You can roughly think of those sections as asking the following questions.

What happens if individual actors (e.g., people, governments) do something bad?
What happens if a widget does something bad, even if the designer(s) or operator(s) did not intend it?

We ask a different question in this section.

What happens if individual actors don’t do clearly bad things, and widgets do what designer(s) or operator(s) intend, but bad things still happen?

What I will loosely call complex risks involves things spiraling out of control, with not necessarily one particular action or actor to blame.

Even if everybody acts according to their incentives without trying to harm others, the collective result could still be terrible. This situation is known as a collective action problem. We could have a world where people use widgets with good intentions for things that don't result in catastrophic damage, and where these widgets are always under the control of their individual users. Yet, this world could also be terrible. It’s best to go through a few examples to explain the idea.

People flee as a city in the background is on fire. — Chaos from labour displacement. Image from DALL-E 2 with the prompt, "painting of protests and mass unemployment, civil war, fires".

Suppose widgets are here. They are super cheap and prove capable of performing all digital tasks that humans can do. Since every company has an incentive to replace as much of their workforce with widgets as possible so as to drive down their labour costs, lots of people become unemployed. Companies that retain human workers are outcompeted.

Mass unemployment causes mass protests. Since income tax is a large proportion of tax revenue, governments have much less to spend on services like law enforcement and health care. Governments could try to tax widgets directly, but many widget companies realize the weakening position of government: weaker tax base, weaker military. In addition to the usual political lobbying, some start in secret to amass paramilitary forces to defend their wealth. The monopoly on violence slowly erodes as instability asserts its reign.

No part of the example is inevitable, and I’m not even sure that this example is the most plausible. But it does highlight how seemingly reasonable individual actions could lead to an unreasonable situation. If other companies aren’t doing it, it’s not reasonable for your company to avoid automation. If there is mass unemployment, it’s reasonable to protest for something to be done. If not reasonable from a collective perspective, it at least makes sense why a company would lobby against distribution of windfall profits. To deal with these sorts of collective action problems, we need to figure out a way to get all key actors to commit to an agreement.

Structural causes

I distinguished complex risks from misuse and out-of-control problems, but there is a sense in which these other problems are also complex and collective action problems.

An individual actor can misuse a model only because they can access the model in some way. All companies developing widgets have an incentive to increase the number of people who can access their system so as to gain market share. Individual incentives increase collective risk.

A widget can only pursue convergent instrumental goals to extremes because somebody gave the widget enough sensor and actuators to interact with and affect the world directly, without human mediation. Companies have an incentive to endow their systems with more sensors and actuators so that those systems are more useful for consumers, thereby also increasing market share. Individual incentives increase collective risk.

One of the times regulation is needed is when there are negative externalities for market participants: each participant does not (immediately) experience the negative impacts of their actions, so does not take them into account when deciding whether to pursue a course of action. Taxes are a common solution for negative externalities, but it’s rather difficult come up with taxes for risks that are hard to quantify at the moment. Even if we could, I’d prefer avoiding the risks rather than compensating people after the fact. Money always helps, but can be incommensurable with harms experienced. What seems to me to be required is some stronger structural change in how we go about developing widgets.

Next up

In the next post, I’ll try to bring everything together to make a case that we should not try to build AGI now, or at the very least slow down drastically to buy us some time for solving coordination problems.

Acknowledgements

Many thanks to Anson Ho, Josh Blake, Hal Ashton, and Kirby Banman for constructive comments that substantially improved the structure, clarity, and content of the piece.

Alan’s Substack

Discussion about this post