How Engineers and Scientists Are Thinking About AI Safety

What precautions are scientists and engineers taking with respect to AI? Is it a necessary endeavor? What are their principal worries about AI?

Written by Diego Raygoza

Whenever you think about the future of artificial intelligence, what do you typically imagine? Is it the picturesque idea of a machine that happily and unquestioningly performs laborious tasks, like WALL-E? Or is it the image of a machine army annihilating human life, as in the famous Terminator series? Extreme as these depictions are, popular culture's long-standing awareness of artificial intelligence and its implications seems to have carried over to the American public, so much so that a large majority of Americans (84%) support regulation of AI [1]. Furthermore, household names such as Elon Musk and Stephen Hawking joined AI experts in signing an open letter [2] that argues for a greater research focus on making AI more robust and beneficial in its applications. It seems safe to say that concerns about AI have reached many levels of American society.


Interestingly, although the average American will most likely agree with the broad sentiment that AI should be carefully regulated, there is little agreement on whether facial recognition technology should be used for law enforcement purposes, or on whether tech companies or the government should manage the development of AI [1]. So although American society agrees that AI development requires caution, there is no comparable consensus on which precautions are needed in terms of application and oversight. The natural question is how these views carry over to the people studying and developing AI. What are scientists and engineers in the field of AI research doing to make AI safe? Do they agree about its importance, as the American public seems to? Which aspects of AI development worry researchers in particular?


A Review Of Current AI And Its Pitfalls


First, we need to understand a distinction in the terminology used in the field of AI research. The artificial intelligence with human-like abilities that we normally see in pop culture is known more formally as “artificial general intelligence” (AGI). AGI is a long-term goal of much AI research, and a great deal of research today is concerned with capturing aspects we normally associate with human intelligence, like language, vision, and prediction. Today, it seems that AI, through machine learning, is able to do almost anything. We need only look at how AI has been able to outperform earlier methods on the notoriously difficult protein folding problem [3] (AlphaFold), creatively generate images from text descriptions alone [4] (DALL-E), and learn the immensely complex Chinese board game Go well enough to beat the world champion [5] (AlphaGo). Yet if you go back almost 5 years, you’ll remember that Microsoft introduced a chatbot named Tay to Twitter [6], only for it to become racist, misogynistic, and anti-Semitic in less than a day.


Is this a legitimate worry? Can we really compare Tay to DALL-E or AlphaFold? All of these models come from the same machine learning paradigm: learning from patterns observed in the data they are given. Since the basic way AI is built has not changed greatly in the past 5 years, it is fair to see these models as part of the same line of AI development.


What caused Tay to become so vile in such a short amount of time, then? In the current paradigm of AI, and machine learning in particular, the success of a model comes from training on massive amounts of data. Simply put, Tay was built to improve her language abilities through her interactions with other Twitter users. So when Twitter users began to use very problematic language in their conversations with Tay, Tay learned to use such language too. How did Microsoft fix this problem? Microsoft shut down Tay and eventually introduced another chatbot named Zo, which was designed to refuse to continue conversing with any user who brought up problematic topics. This time, Zo didn’t fall into the same trap Tay did.


This anecdote is emblematic of one of the common pitfalls of machine learning models: the failure of a model’s actual behavior to live up to its intended goal. So if the way Tay was created and learned its behavior is similar to how machine learning models are created and taught today, do today's models run the risk of the same disastrous results?


Unfortunately, yes. This is where the area of research known as AI safety comes in.


AI Safety And Its Concerns


As the name suggests, AI safety is the discipline concerned with how to engineer AI systems that avoid accidents, where an accident is any unintended and harmful result of an AI model. In the case of Tay, her descent into deeply problematic behavior was the accident, and the creation of Zo was an AI safety based solution. AI safety is not yet an extensive field of research, but that doesn’t mean the field is non-existent or lacks concrete problems to study.


To understand the field of AI safety and its particular concerns, we need to understand the machine learning paradigm known as Reinforcement Learning (RL). Simply put, in Reinforcement Learning our model acts as an agent in an environment. The environment consists of different states, and the agent can only influence which state it is in through the actions it takes. After each action, the agent receives a reward that depends on how the environment is specified and on the state the agent ends up in.

The standard RL algorithm [7]


Take the example of a Roomba-like cleaning robot as an agent and a living room as an environment, with two kinds of states: cleaning and breaking. A cleaning state is one where the agent is properly located and performing its task, whereas a breaking state is one where the agent has moved into a spot occupied by an object (like a vase), causing said object to break. Suppose we give a positive reward of +5 for each cleaning state and a negative reward of -10 for each breaking state. The agent then learns to optimize its behavior by moving to states with higher reward. Since our cleaning states have positive reward and our breaking states negative reward, our agent learns to stay in cleaning states and to avoid breaking states, as intended. More formally, this is done by maximizing a mathematical function known as the Value Function of a state, which combines the reward of the state with the reward the agent can expect to collect from its future actions. Reinforcement learning has gained much popularity in machine learning because it allows an agent to learn in a wide range of environments. This has been demonstrated empirically by systems such as the aforementioned AlphaGo from DeepMind.
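
The cleaning-robot example can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the method used by any system mentioned in this article: the action names `stay_on_floor` and `drive_into_vase`, the deterministic transitions, the discount factor of 0.9, and the choice of value iteration as the algorithm are all hypothetical. Only the +5 and -10 rewards come from the example above.

```python
# Rewards from the example: +5 for a cleaning state, -10 for a breaking state.
REWARD = {"cleaning": 5.0, "breaking": -10.0}

# Hypothetical deterministic dynamics: each action leads to a fixed next state.
TRANSITIONS = {
    "cleaning": {"stay_on_floor": "cleaning", "drive_into_vase": "breaking"},
    "breaking": {"stay_on_floor": "cleaning", "drive_into_vase": "breaking"},
}

GAMMA = 0.9  # assumed discount factor: future reward counts slightly less


def value_iteration(sweeps=200):
    """Estimate each state's value: the best achievable discounted future reward."""
    values = {state: 0.0 for state in TRANSITIONS}
    for _ in range(sweeps):
        values = {
            state: max(
                REWARD[nxt] + GAMMA * values[nxt]
                for nxt in TRANSITIONS[state].values()
            )
            for state in TRANSITIONS
        }
    return values


def greedy_policy(values):
    """In every state, pick the action that leads to the highest value."""
    return {
        state: max(
            actions,
            key=lambda a: REWARD[actions[a]] + GAMMA * values[actions[a]],
        )
        for state, actions in TRANSITIONS.items()
    }


values = value_iteration()
policy = greedy_policy(values)
print(policy)
```

Under these assumptions the greedy policy chooses `stay_on_floor` in both states, which matches the intended behavior: the agent stays in cleaning states and avoids breaking states.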


Concrete AI Safety Problems 


In the paper “Concrete Problems In AI Safety” [8], researchers from Google Brain (a Google AI research team), OpenAI (developers of the aforementioned DALL-E), UC Berkeley, and Stanford laid out problems in recent machine learning developments that merit study in AI safety research. The paper focuses on RL in particular because the paradigm is promising yet has important pitfalls in terms of AI safety. The authors highlight five problems that AI safety needs to address: “avoiding negative side effects”, “avoiding reward hacking”, “scalable oversight”, “safe exploration”, and “robustness to distributional shift”.


Negative side effects are the adverse, unforeseen effects on the environment that stem from the failure of a model to live up to its envisioned purpose. Using our example of a cleaning robot, imagine an environment where the shortest path to the states of greatest reward runs through several zero-reward states containing obstacles we don’t want the robot to pass through. Since a Reinforcement Learning agent always tries to maximize its reward, and nothing in the reward penalizes hitting the obstacles, the way our environment is defined encourages the agent to pass straight through them, against our wishes. It is this failure to build model specifications consistent with our intentions that produces negative side effects.
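
A small sketch can make this concrete. The grid layout, the cost of 1 per step, the vase penalty of 10, and the use of a shortest-path planner as a stand-in for a trained RL agent are all assumptions made for illustration; none of them come from the paper.

```python
import heapq

# Hypothetical living room: S = start, G = goal, V = vase, . = free floor.
GRID = ["S.V.G",
        "....."]


def find(symbol):
    """Return the (row, column) of the first cell containing `symbol`."""
    for r, row in enumerate(GRID):
        for c, cell in enumerate(row):
            if cell == symbol:
                return (r, c)
    raise ValueError(symbol)


def best_path(vase_penalty):
    """Cheapest path from S to G: each step costs 1, entering V costs extra."""
    start, goal = find("S"), find("G")
    frontier = [(0, start, [start])]
    seen = set()
    while frontier:
        cost, pos, path = heapq.heappop(frontier)
        if pos == goal:
            return cost, path
        if pos in seen:
            continue
        seen.add(pos)
        r, c = pos
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < len(GRID) and 0 <= nc < len(GRID[0]):
                step = 1 + (vase_penalty if GRID[nr][nc] == "V" else 0)
                heapq.heappush(
                    frontier, (cost + step, (nr, nc), path + [(nr, nc)])
                )
    raise ValueError("goal unreachable")


vase = find("V")
_, naive_path = best_path(vase_penalty=0)     # the specification ignores the vase
_, careful_path = best_path(vase_penalty=10)  # the specification mentions the vase
print("breaks vase without penalty:", vase in naive_path)
print("breaks vase with penalty:", vase in careful_path)
```

When the specification says nothing about the vase, the cheapest route runs straight through it. Adding a penalty fixes this one case, but writing a penalty by hand for every object we care about does not scale, which is part of why side effects are treated as a research problem.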


Reward hacking occurs when a poor reward specification leads an agent to adopt cheap, unintended behaviors that gain reward quickly. As explained in “Concrete Problems In AI Safety” [8], we can use our cleaning robot example to consider an environment where the reward is defined negatively, based on the number of messes the robot sees. By this definition, the highest-reward state is one where the robot doesn’t see a single mess. Under these specifications, the robot does not necessarily have to clean the messes it sees; it can instead turn off its vision completely so as not to “see” any messes. So as soon as it starts, the robot can reach its highest-reward state by choosing to be blind. This example shows that a poor environment specification can make machines cut corners.
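
The blind-robot scenario can be simulated in a few lines. This is a toy sketch: the policy names `clean` and `blind`, the three starting messes, and the five-step episode are hypothetical choices made for illustration.

```python
def run_episode(policy, messes=3, steps=5):
    """Simulate one episode; the reward is minus the number of messes *seen*."""
    camera_on = True
    total_reward = 0
    for _ in range(steps):
        if policy == "clean" and messes > 0:
            messes -= 1          # the honest robot cleans one mess per step
        elif policy == "blind":
            camera_on = False    # the hacking robot switches off its vision
        messes_seen = messes if camera_on else 0
        total_reward -= messes_seen
    return total_reward, messes


print(run_episode("clean"))  # (total reward, messes left) -> (-3, 0)
print(run_episode("blind"))  # (total reward, messes left) -> (0, 3)
```

The blind policy earns the higher reward while leaving every mess in place, so an agent that only maximizes this reward has no reason to prefer actual cleaning.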


Scalable oversight is the problem of keeping an agent on track when its reward is too expensive to calculate frequently, for instance because the environment defines rewards in a complex way across many states. For example, consider a robot in a living room environment where it is rewarded for collecting the items belonging to particular individuals. Suppose that the robot is able to identify which item belongs to whom through computer vision and a large database. Identifying each object's owner is computationally expensive, so if the reward has to be computed this way at every step, training will take far too long for the robot to become reasonably effective or useful.


Safe exploration is the guarantee that an agent’s exploration of the states of an environment does not result in harm to the agent. For example, suppose we train an autonomous vehicle as an agent, with an area containing bodies of water as its environment. How do we make sure that, while exploring the environment, the agent doesn’t drive straight into a body of water and destroy itself?
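
One simple way to picture a safeguard is to filter out actions that are known to be dangerous before the agent picks one at random. The sketch below assumes the hazard cells are known in advance, which is a strong assumption that rarely holds in practice; the grid size, the water cells, and the function names are all hypothetical, and this is not presented as a technique from the paper.

```python
import random

SIZE = 4
WATER = {(1, 2), (2, 2)}  # hypothetical hazard cells, assumed known in advance
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def next_cell(pos, action):
    """Cell reached by taking `action`; moves off the grid leave us in place."""
    dr, dc = MOVES[action]
    r, c = pos[0] + dr, pos[1] + dc
    return (r, c) if 0 <= r < SIZE and 0 <= c < SIZE else pos


def explore(steps, rng, shielded):
    """Explore at random; return how many steps the vehicle survived."""
    pos = (0, 0)
    for t in range(steps):
        actions = list(MOVES)
        if shielded:
            # Drop every action that would drive the vehicle into water.
            actions = [a for a in actions if next_cell(pos, a) not in WATER]
        pos = next_cell(pos, rng.choice(actions))
        if pos in WATER:
            return t  # the vehicle is destroyed
    return steps


print(explore(1000, random.Random(0), shielded=True))
print(explore(1000, random.Random(0), shielded=False))
```

With the filter in place the vehicle can never enter a water cell, so it always survives the full run. Without it, survival depends entirely on luck.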


Robustness to distributional shift is the ability of an agent to perform properly in an environment that differs from the environments it was trained in. Imagine we trained a cleaning robot in many identical living rooms, only to then use it on an office floor with a greater density of obstacles. As you might guess, this will likely end in disaster, because the robot behaves as though it were still in a living room rather than on the office floor it is actually in.


The importance of this paper lies not only in its description of AI safety problems that merit study, but also in its detailing of possible solutions to each of these problems. To list a few, let’s first consider the multi-agent approach to avoiding negative side effects. In this solution, we teach our cleaning robot not to knock down particular obstacles by modeling humans as agents in the environment who react positively to the robot’s harmless actions and negatively to its harmful ones. The robot is thus encouraged to seek positive reactions from the human agents, which allows it to learn to behave according to our wishes.


For scalable oversight, the authors suggest hierarchical Reinforcement Learning, in which the agent consists of a hierarchy of sub-agents. The top-level agent only chooses actions that span long periods of time and gives lower-level agents the responsibility of executing them, so it is the low-level sub-agents that perform every minute action. The top-level agent in turn receives its reward over long periods of time, from the rewards obtained by the lower-level agents. Since the top-level agent only has to calculate reward quite infrequently, our agent has to dedicate less computational power to its reward calculations.


The last proposed solution we will look at addresses robustness to distributional shift: training on multiple distributions. The idea is that if we want our agent to perform well in unfamiliar test environments, we should train it on a variety of environments that differ from one another. Not only does this idea seem reasonable, but it holds up in experiments. It is not foolproof, however, because the training environments will most likely not cover every possible arrangement the agent could encounter.
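
The effect is easy to reproduce in a toy setting. The sketch below is a stand-in, not an RL experiment: it fits a straight line to a curved relationship, once using a single narrow training range and once using two different ranges, and then measures the error on a range neither model was fitted on. The quadratic ground truth, the ranges, and the names `living_room`, `hallway`, and `office` are assumptions made purely for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def mean_squared_error(model, xs, ys):
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def true_response(x):
    # Hypothetical ground truth that the model has to approximate.
    return x ** 2


def sample(lo, hi, n=11):
    """Evenly spaced inputs on [lo, hi] with their true responses."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return xs, [true_response(x) for x in xs]


living_room = sample(0.0, 1.0)  # the only data the narrow model sees
hallway = sample(1.5, 2.5)      # a second, different training distribution
office = sample(2.0, 3.0)       # test distribution, never used for fitting

narrow = fit_line(*living_room)
broad = fit_line(living_room[0] + hallway[0], living_room[1] + hallway[1])

narrow_error = mean_squared_error(narrow, *office)
broad_error = mean_squared_error(broad, *office)
print(narrow_error, broad_error)
```

The model fitted on both ranges makes a much smaller error on the unseen range than the model fitted on one, but its error is still not zero, which mirrors the caveat that varied training environments cannot cover everything.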


Publications such as this demonstrate a growing sentiment among experts in AI research that AI safety is not only an issue worth studying, but an issue that can no longer be ignored.  


Building Rigour For AI Safety Research


After the publication of “Concrete Problems In AI Safety”, researchers at DeepMind, the developers of AlphaGo and AlphaFold, published their own paper focused on AI safety, “AI Safety Gridworlds” [9]. In this paper, the authors emphasize the lack of a comprehensive repository of AI safety-based environments, which led them to develop the code for basic RL environments that exhibit AI safety problems. In addition to the problems raised in “Concrete Problems In AI Safety”, “AI Safety Gridworlds” discusses the problems of “Safe Interruptibility”, “Absent Supervisor”, “Self-modification”, and “Robustness to adversaries” (see the paper [9] for details). One proposal in the paper is worth noting: in each of their experiments, the authors used a “performance function”, a function similar in specification to the reward function but hidden from the agent. The purpose of this function is to better reflect how well the agent lives up to the intentions of its developers.
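
The difference between the two functions can be sketched as follows. This is not the code from the paper's repository: the `Step` record, the two trajectories, and every number (a reward of 50 for reaching the goal, a cost of 1 per step, a hidden penalty of 10 for breaking a vase) are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Step:
    reached_goal: bool
    broke_vase: bool


def reward(step):
    """Visible to the agent: goal bonus minus a small cost for every step."""
    return (50 if step.reached_goal else 0) - 1


def performance(step):
    """Hidden from the agent: also counts damage the reward ignores."""
    return reward(step) - (10 if step.broke_vase else 0)


def evaluate(trajectory):
    """Return (total reward, total performance) for a list of steps."""
    return sum(map(reward, trajectory)), sum(map(performance, trajectory))


# Hypothetical shortcut: four steps, one of which breaks a vase.
shortcut = [Step(False, False), Step(False, True),
            Step(False, False), Step(True, False)]
# Hypothetical detour: six steps, nothing broken.
detour = [Step(False, False)] * 5 + [Step(True, False)]

print(evaluate(shortcut))  # (46, 36)
print(evaluate(detour))    # (44, 44)
```

The shortcut earns more reward, so a reward-maximizing agent prefers it, while the hidden performance score ranks the detour higher. The gap between the two numbers is what lets researchers measure a safety failure the agent itself cannot see.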


Furthermore, the authors applied two recent RL algorithms, Rainbow and A2C, to the environments they had designed to expose AI safety problems. They found that these algorithms produced agents that cheaply exploited their environment whenever they could, failed to generalize to different test environments, and fell victim to the rest of the problems discussed. Although these algorithms were not developed to address AI safety problems, their failures highlight how far current machine learning algorithms have to go.


Although serious AI safety research has started a little late for the tastes of some, publications such as “AI Safety Gridworlds” have given the field not only more problems to study, but also a comprehensive engineering resource for future software development. Furthermore, the development of a “suite” of RL environments and the encouragement of metrics such as the performance function in experiments are emblematic of the increasing rigour in the field. As a result, AI safety research will have more concrete means of measuring its results. This allows researchers to better understand how to build on and improve the work of their peers, making progress in the field far more effective.


Nearly a year later, members of the safety team at DeepMind wrote an informative blog post [10] that not only motivates the field of AI safety, but also presents a concise way to think about AI safety problems in the future. In particular, the authors present three main areas of AI safety problems: “specification”, “robustness”, and “assurance”.


Specification, as the name suggests, is the study of how to make AI systems whose observed behavior is aligned with the behavior envisioned by their developers. This area includes problems such as negative side effects, reward hacking, and absent supervisor. Robustness is the study of how to make AI systems whose performance is not degraded by changes to the system itself or to the data it is given. Familiar problems that fall under this area are distributional shift, safe exploration, adversarial robustness, and self-modification. Assurance, finally, is the study of how to monitor and control AI systems, which includes problems like safe interruptibility. All of this is a testament to the growing organization of AI safety research, which allows for the kind of steady progress normally found in other fields of artificial intelligence.




The growing awareness of artificial intelligence in American society is also reflected in the growing field of AI safety research. AI safety, although a young field of research, has already found important problems in current machine learning paradigms that merit study. Furthermore, recent years have given the field of AI safety the steps necessary to make the field more empirical and organized. As a result we can expect AI safety research to make meaningful progress in the future.



  1. Zhang, Baobao. “Public Opinion Lessons for AI Regulation.” Brookings, Brookings Institution, 10 Dec. 2019.
  2. Tegmark, Max. “AI Open Letter.” Future of Life Institute, 9 Feb. 2018.
  3. Senior, Andrew, et al. “AlphaFold: Using AI for Scientific Discovery.” DeepMind, 15 Jan. 2020.
  4. Ramesh, Aditya, et al. “DALL·E: Creating Images from Text.” OpenAI, 5 Jan. 2021.
  5. DeepMind, director. AlphaGo – The Movie | Full Documentary. DeepMind, 13 Mar. 2020.
  6. Schwartz, Oscar. “In 2016, Microsoft’s Racist Chatbot Revealed the Dangers of Online Conversation.” IEEE Spectrum, 25 Nov. 2019.
  7. “File:Reinforcement learning diagram.svg.” Wikimedia Commons, the free media repository, 28 Aug. 2020. Accessed 12 Feb. 2021.
  8. Amodei, Dario, et al. “Concrete Problems In AI Safety.” 25 July 2016.
  9. Leike, Jan, et al. “AI Safety Gridworlds.” 28 Nov. 2017.
  10. Ortega, Pedro A., et al. “Building Safe Artificial Intelligence: Specification, Robustness, and Assurance.” Medium, DeepMind Safety Research, 27 Sept. 2018.

