Why aligning AI to our values may be harder than we think

Can we stop a rogue AI by teaching it ethics? That might be easier said than done.

Eerie looking supercomputer.

Credit: STR/JIJI PRESS/AFP via Getty Images
  • One way we might prevent AI from going rogue is by teaching our machines ethics so they don't cause problems.
  • The question of what we should, or even can, teach computers remains open.
  • How we pick the values artificial intelligence follows might be the most important thing.


Plenty of scientists, philosophers, and science fiction writers have wondered how to keep a potential super-human AI from destroying us all. While the obvious answer of "unplug it if it tries to kill you" has many supporters (and it worked on the HAL 9000), it isn't too difficult to imagine that a sufficiently advanced machine would be able to prevent you from doing that. Alternatively, a very powerful AI might be able to make decisions too rapidly for humans to review for ethical correctness or to correct for the damage they cause.

The issue of keeping a potentially super-human AI from going rogue and hurting people is called the "control problem," and there are many potential solutions to it. One of the more frequently discussed is "alignment," which involves syncing AI to human values, goals, and ethical standards. The idea is that an artificial intelligence designed with the proper moral system wouldn't act in a way that is detrimental to human beings in the first place.

However, with this solution, the devil is in the details. What kind of ethics should we teach the machine, what kind of ethics can we make a machine follow, and who gets to answer those questions?


Iason Gabriel considers these questions in his new essay, "Artificial Intelligence, Values, and Alignment." He addresses those problems while pointing out that answering them definitively is more complicated than it seems.

What effect does how we build the machine have on what ethics the machine can follow?


Humans are really good at explaining ethical problems and discussing potential solutions. Some of us are very good at teaching entire systems of ethics to other people. However, we tend to do this using language rather than code. We also teach people whose learning capabilities are similar to our own, rather than machines with different abilities. Shifting from people to machines may introduce some limitations.

Many different methods of machine learning could be applied to ethical theory. The trouble is, they may prove to be very capable of absorbing one moral stance and utterly incapable of handling another.

Reinforcement learning (RL) is a way to teach a machine to do something by having it maximize a reward signal. Through trial and error, the machine is eventually able to learn how to get as much reward as possible efficiently. With its built-in tendency to maximize what is defined as good, this system clearly lends itself to utilitarianism, with its goal of maximizing the total happiness, and other consequentialist ethical systems. How to use it to effectively teach a different ethical system remains unknown.
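
To make the trial-and-error idea concrete, here is a minimal sketch of an epsilon-greedy learner, one of the simplest reinforcement learning setups. It is an illustration rather than anything taken from Gabriel's essay: the function name, the three actions, and their reward probabilities are all assumptions chosen for the example.

```python
import random

def train_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Estimate each action's average reward purely by trial and error."""
    rng = random.Random(seed)
    n_actions = len(reward_probs)
    counts = [0] * n_actions      # how often each action was tried
    values = [0.0] * n_actions    # running estimate of each action's reward
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: occasionally try a random action.
            action = rng.randrange(n_actions)
        else:
            # Exploit: take the action that currently looks best.
            action = max(range(n_actions), key=lambda a: values[a])
        # The environment pays a reward of 1 with a hidden probability.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        # Incremental average: nudge the estimate toward the new reward.
        values[action] += (reward - values[action]) / counts[action]
    return values

# Hypothetical reward probabilities for three actions (assumed for illustration).
estimates = train_bandit([0.2, 0.5, 0.8])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
```

The learner never sees a rule about what is right. It only sees a number to push upward, which is why this style of training maps so naturally onto "maximize the good" theories and so awkwardly onto anything else.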

Alternatively, apprenticeship or imitation learning allows a programmer to give a computer a long list of data or an exemplar to observe, and then lets the machine infer values and preferences from it. Thinkers concerned with the alignment problem often argue that this could teach a machine our preferences and values through action rather than idealized language. It would just require us to show the machine a moral exemplar and tell it to copy what that exemplar does. The idea has more than a few similarities to virtue ethics.
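
A toy version of "copy what the exemplar does" might look like the sketch below. It is the crudest form of imitation (often called behavior cloning) and is only an assumed illustration: the situations, the actions, and the `clone_behavior` helper are hypothetical, and real systems generalize from demonstrations rather than looking them up in a table.

```python
from collections import Counter, defaultdict

def clone_behavior(demonstrations):
    """Build a policy that repeats the exemplar's most common action per situation."""
    counts = defaultdict(Counter)
    for situation, action in demonstrations:
        counts[situation][action] += 1
    # For each observed situation, keep the action the exemplar chose most often.
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

# Hypothetical record of what an exemplar did (assumed for illustration).
demos = [
    ("found_wallet", "return_it"),
    ("found_wallet", "return_it"),
    ("found_wallet", "keep_it"),
    ("asked_for_help", "help"),
]
policy = clone_behavior(demos)
```

Two limits show up even at this scale: the policy has nothing to say about situations the exemplar never faced, and it faithfully inherits whatever the exemplar got wrong.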

The problem of who is a moral exemplar for other people remains unsolved, and who, if anybody, we should have computers try to emulate is equally up for debate.

At the same time, there are some moral theories that we don't know how to teach to machines. Deontological theories, known for creating universal rules to stick to all the time, typically rely on a moral agent to apply reason to the situation they find themselves in along particular lines. No machine in existence is currently able to do that. Even the more limited idea of rights, and the concept that they should not be violated no matter what any optimization tendency says, might prove challenging to code into a machine, given how specific and clearly defined you'd have to make these rights.
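
The idea of rights that override any optimization tendency can be sketched as a hard filter applied before the machine maximizes anything. This is a simplified illustration under stated assumptions, not a method from the essay: the `choose_action` function, the `utility` numbers, and the `harms_person` flag are all hypothetical.

```python
def choose_action(actions, violates_rights):
    """Return the highest-utility action among those that violate no right."""
    # Rights act as a filter, not as one more term to trade off against utility.
    permitted = [a for a in actions if not violates_rights(a)]
    if not permitted:
        return None  # no permissible action, so decline to act
    return max(permitted, key=lambda a: a["utility"])

# Hypothetical options; the harms_person flag is assumed to be known in advance.
actions = [
    {"name": "harm_for_profit", "utility": 10.0, "harms_person": True},
    {"name": "safe_option", "utility": 6.0, "harms_person": False},
]
chosen = choose_action(actions, lambda a: a["harms_person"])
```

The filter itself is trivial. All of the difficulty described above is hidden in the `harms_person` flag: someone has to define, precisely enough for a machine to check, what counts as violating a right.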

After discussing these problems, Gabriel notes that:

"In the light of these considerations, it seems possible that the methods we use to build artificial agents may influence the kind of values or principles we are able to encode."

This is a very real problem. After all, if you have a super AI, wouldn't you want to teach it ethics with the learning technique best suited for how you built it? What do you do if that technique can't teach it anything besides utilitarianism very well but you've decided virtue ethics is the right way to go?

If philosophers can't agree on how people should act, how are we going to figure out how a hyper-intelligent computer should function?

The important thing might not be to program a machine with the one true ethical theory, but rather to make sure that it is aligned with values and behaviors that everybody can agree to. Gabriel puts forth several ideas on how to decide what values AI should follow.

A set of values could be found through consensus, he argues. There is a fair amount of overlap in human rights theory among a cross-section of African, Western, Islamic, and Chinese philosophy. A scheme of values, with notions like "all humans have the right to not be harmed, no matter how much economic gain might result from harming them," could be devised and endorsed by large numbers of people from all cultures.

Alternatively, philosophers might find values for an AI to follow by using the "Veil of Ignorance," a thought experiment in which people are asked to find principles of justice that they would support if they didn't know what their self-interests and societal status would be in a world that followed those principles. The values they select would, presumably, be ones that protect everyone from any mischief the AI could cause and assure that its benefits reach everyone.

Lastly, we could vote on the values. Instead of figuring out what people would endorse under certain circumstances or based on the philosophies they already subscribe to, people could just vote on a set of values they want any super AI to be bound to.

All of these ideas are also burdened by the present lack of a super AI. There isn't a consensus opinion on AI ethics yet, and the current debate hasn't been as cosmopolitan as it would need to be. The thinkers behind the Veil of Ignorance would need to know the features of the AI they are planning for when coming up with a scheme of values, as they would be unlikely to choose a value set that an AI wasn't designed to process effectively. A democratic system would face tremendous difficulties in assuring that a just and legitimate "election" of values that everybody can agree on had been carried out correctly.

Despite these limitations, we will need an answer to this question sooner rather than later; coming up with what values we should tie an AI to is something you want to do before you have a supercomputer that could cause tremendous harm if it doesn't have some variation of a moral compass to guide it.

While artificial intelligence powerful enough to operate outside of human control is still a long way off, the problem of how to keep such machines in line when they do arrive is still an important one. Aligning them with human values and interests through ethics is one possible way of doing so, but the questions of what those values should be, how to teach them to a machine, and who gets to decide the answers remain unsolved.
