Shielded RL

Shielding is a popular technique for safe reinforcement learning. A shield monitors the environment and prevents the execution of actions that would violate a formally defined safety constraint. Shields are automatically synthesized from a discrete, symbolic model of the environment.
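Conceptually, the shield sits between the agent and the environment and restricts the agent's choice to actions that are provably safe with respect to the specification. The following sketch is purely illustrative; the names shield.query and agent.choose are hypothetical placeholders, not part of tempestpy's API.

```python
# Purely illustrative sketch of a shielded interaction loop; `shield.query`
# and `agent.choose` are hypothetical placeholders, not tempestpy API.

def shielded_step(env, agent, shield, observation, threshold=0.9):
    """Let the agent choose only among actions the shield classifies as safe."""
    # Per action: probability of satisfying the safety specification
    # from the symbolic state corresponding to the current observation.
    safety_values = shield.query(observation)
    safe_actions = [a for a, p in safety_values.items() if p >= threshold]
    action = agent.choose(observation, allowed=safe_actions)
    return env.step(action)
```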

In order to synthesize a shield and integrate it into state-of-the-art RL frameworks, the following ingredients are needed:

Symbolic Models

A symbolic model of the environment dynamics. This model can either be a Markov decision process (MDP) or a stochastic multiplayer game (SMG). Tempest needs these models to be expressed in the PRISM modelling language.
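To make this ingredient concrete, the snippet below writes a tiny, made-up PRISM MDP to disk: a robot moves along a corridor of five cells, its moves may fail with probability 0.1, and the last cell is labelled as unsafe. The model, its variable names and the file name are illustrative only and not taken from the tempestpy distribution.

```python
# Illustrative only: a minimal PRISM MDP with a single unsafe location,
# written to a file that a shield-synthesis tool such as Tempest could read.

PRISM_MODEL = """
mdp

module robot
  // position of the robot in a corridor of five cells; cell 4 is unsafe
  x : [0..4] init 0;

  [left]  x>0 -> 0.9 : (x'=x-1) + 0.1 : (x'=x);
  [right] x<4 -> 0.9 : (x'=x+1) + 0.1 : (x'=x);
endmodule

label "unsafe" = (x=4);
"""

with open("corridor.prism", "w") as f:
    f.write(PRISM_MODEL)
```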

Safety Specifications and Thresholds

A formal specification in probabilistic computation tree logic (PCTL) or alternating-time temporal logic (ATL) is used to define unsafe behaviour. A specification for an MDP can, for example, state that the probability of never reaching an unsafe state should be maximized: “Pmax=? [ G !unsafe ]”. An action is then classified as safe if the probability of satisfying the specification, “G !unsafe”, is greater than or equal to the specified safety threshold. For SMGs, the specification needs to be formulated in ATL. In ATL, the coalition of players that aims to satisfy the specification has to be defined; all other players modelled in the SMG try to prevent the specification from being satisfied.
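The following strings illustrate what such specifications can look like for the toy corridor model sketched above; the label "unsafe", the player name shield and the threshold value are assumptions of this example, not fixed by tempestpy.

```python
# Illustrative property strings for the toy corridor model above; the label
# "unsafe", the player name "shield" and the threshold are example choices.

# MDP (PCTL): maximise the probability of never reaching an unsafe state.
mdp_spec = 'Pmax=? [ G !"unsafe" ]'

# SMG (ATL-style coalition syntax): the coalition <<shield>> tries to satisfy
# the property, while all other players of the game act adversarially.
smg_spec = '<<shield>> Pmax=? [ G !"unsafe" ]'

# An action is classified as safe if its probability of satisfying
# "G !unsafe" is at least this threshold.
SAFETY_THRESHOLD = 0.9
```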

Translations from Observations to Symbolic States

The model used to compute the shield can be strictly smaller than the original environment used to train the agent. While the RL training harnesses the full feature set of the environment observations, the set of safe actions has to be queried using the symbolic states of the model. This is done by providing tempestpy with a translation method from observations to symbolic states. Typically, an MDP or SMG used for shielding models only the safety-critical aspects of the environment that are needed to decide which actions are safe.
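As an example, a translation for a Minigrid-style grid world could map the agent's position and viewing direction to a symbolic state string. The attributes agent_pos and agent_dir are exposed by Minigrid; the returned state format is a hypothetical placeholder and has to match the variables of your PRISM model.

```python
# Hypothetical translation from a Minigrid-style observation to a symbolic
# state; the returned format is a placeholder and must agree with the
# variable names used in the PRISM model of the environment.

def observation_to_symbolic_state(env) -> str:
    """Map the concrete environment state to the symbolic state used by the shield."""
    x, y = env.unwrapped.agent_pos      # agent position exposed by Minigrid
    view = env.unwrapped.agent_dir      # viewing direction (0-3) exposed by Minigrid
    return f"xAgent={x} & yAgent={y} & viewAgent={view}"
```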

Maskable Algorithms

In order to easily integrate the shield into the training loop, we recommend using a maskable algorithm, i.e. an algorithm that supports masking out unsafe actions. Libraries like Stable-Baselines3 and RLlib provide implementations of maskable PPO and DQN.
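A minimal training sketch with action masking in Stable-Baselines3 (via the sb3-contrib package) could look as follows. The environment is only a stand-in, and compute_safe_action_mask is a placeholder for a function that queries the shield via the observation-to-state translation; it is not part of tempestpy's documented API.

```python
# Minimal sketch of training with action masking in sb3-contrib.

import gymnasium as gym
import numpy as np
from sb3_contrib import MaskablePPO
from sb3_contrib.common.wrappers import ActionMasker


def compute_safe_action_mask(env) -> np.ndarray:
    """Return a boolean mask with True for every action the shield allows."""
    mask = np.ones(env.action_space.n, dtype=bool)  # placeholder: all actions allowed
    # In a shielded setup, actions that are unsafe in the current symbolic
    # state would be set to False here, based on the shield's query result.
    return mask


env = gym.make("CartPole-v1")                     # stand-in for a safety-critical task
env = ActionMasker(env, compute_safe_action_mask)  # apply the mask at every step

model = MaskablePPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)
```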

Getting Started

tempestpy, in combination with MinigridSafe and Minigrid2PRISM, provides a good starting point for getting familiar with shielded reinforcement learning.

If you want to try out shielded reinforcement learning, we strongly recommend starting with Tempest in action. This repository provides a way to install the framework in a virtualized environment using Docker. You can then interact with the source code through Jupyter notebooks.