🌍 Environments

Overview

GEM supports a diverse range of environments and makes it easy to add your own. It provides four main categories of environments (games, math, code, and question-answering), each designed for a different type of agent training and evaluation, plus all tasks from Reasoning Gym.

All GEM environments follow a consistent interface pattern:

  • env.reset() - Initialize/reset the environment
  • env.step(action) - Take an action and receive the next observation, reward, and termination signals
  • env.sample_random_action() - Get a random valid action

This design closely follows the Gymnasium standard, making it easy to integrate with existing RL frameworks and tools.
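
A minimal interaction loop is sketched below. It assumes the package is imported as `gem`, that `make` accepts the environment IDs listed in the tables on this page, and that `step` returns the standard Gymnasium five-tuple.

```python
import gem  # assumed import name for the GEM package

# Create an environment by its registered ID (see the tables below).
env = gem.make("game:GuessTheNumber")

obs, info = env.reset()  # initialize and receive the first observation
for _ in range(10):
    action = env.sample_random_action()  # random valid action as a stand-in policy
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
```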

Games

Interactive game environments including Sudoku, Minesweeper, Wordle, and more from the TextArena collection.

We maintain local versions of many of the TextArena games with (i) an improved, denser game-reward design and (ii) a compatible gym-style interface.

Available Game Environments

| Environment | Description |
| --- | --- |
| game:GuessTheNumber | The agent has multiple guesses to find the hidden number; after each guess it learns whether the hidden number is higher or lower than its guess. |
| game:Mastermind | The agent has multiple guesses to crack the hidden code; after each guess it receives black and white pegs indicating the number of correct digits in the right and wrong places. |
| game:Minesweeper | The agent must reveal all safe grid squares without revealing a mine. Each revealed square shows the number of adjacent squares that contain mines. |
| game:Wordle | The agent must guess the hidden word. After each turn it receives feedback ("G" = correct letter, correct position; "Y" = correct letter, incorrect position; "X" = incorrect letter). |
| game:FifteenPuzzle | Arrange the tiles on the board into ascending order by using the empty space to slide tiles into new positions. |
| game:Hangman | Guess the hidden word by proposing one letter at a time or the entire word. |
| game:Sudoku | The classic Sudoku game; the `easy` variant renders a 4x4 board. |
| game:TowerofHanoi | A classic single-player puzzle in which the objective is to move a stack of disks from one tower to another following specific rules. |

Difficulty Variants

Each game environment also has `-easy`, `-hard`, and `-random` variants, where `-random` means the environment is set to a random difficulty level at each reset.
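
For example, assuming the suffix is appended directly to the environment ID:

```python
import gem

env_easy = gem.make("game:Sudoku-easy")    # fixed easy difficulty (4x4 board)
env_rand = gem.make("game:Wordle-random")  # difficulty re-sampled at every reset
```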

Adding New Games

Adding new games is easy: implement `.reset()` and `.step()` methods and register the environment under a new name, as in the sketch below.
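
The following is a hypothetical toy game to illustrate the idea; the `Env` base-class location and the `register` signature are assumptions and may differ from GEM's actual registry API.

```python
import random

import gem
from gem.core import Env  # assumed base-class location


class GuessParity(Env):
    """Toy single-turn game: guess whether a hidden number is odd or even."""

    def reset(self, seed=None):
        self._hidden = random.Random(seed).randint(1, 100)
        return "Is the hidden number 'odd' or 'even'?", {}

    def step(self, action):
        guessed_odd = action.strip().lower() == "odd"
        correct = guessed_odd == (self._hidden % 2 == 1)
        reward = 1.0 if correct else 0.0
        # Single-turn game: terminate after one guess.
        return ("Correct!" if correct else "Wrong."), reward, True, False, {}


# Register under a new name (signature assumed; adjust to GEM's registry API).
gem.register("game:GuessParity", GuessParity)
```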

Math

Mathematical reasoning environments with automatic answer parsing and checking, compatible with various math datasets.

GEM’s math environment class includes automatic answer parsing and checking and is designed to be compatible with any math dataset. To add a new environment, simply register the dataset. A typical use case is to combine these environments with access to the Python tool so that the agent learns to use code, as sketched below.
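
In the sketch that follows, the tool and wrapper names (`PythonCodeTool`, `ToolEnvWrapper`) and the `\boxed{}` answer format are assumptions about GEM's API; check the package for the exact imports.

```python
import gem
# Assumed tool/wrapper import paths.
from gem.tools.python_code_tool import PythonCodeTool
from gem.tools.tool_env_wrapper import ToolEnvWrapper

# Wrap a math environment so the agent can execute Python during an episode.
env = ToolEnvWrapper(gem.make("math:GSM8k"), tools=[PythonCodeTool()])

obs, info = env.reset()  # obs contains the math problem text
# The agent's final answer is parsed and checked automatically.
obs, reward, terminated, truncated, info = env.step(r"The answer is \boxed{42}.")
```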

Available Math Environments

| Environment | Dataset |
| --- | --- |
| math:ASDIV2k | ASDIV-2k |
| math:GSM8k | GSM-8k |
| math:Math12k | MATH-12k |
| math:ORZ57k | ORZ-57k |

Features

  • Automatic Answer Parsing: Built-in parsing for mathematical expressions and numerical answers
  • Answer Checking: Automatic validation of agent responses against ground truth
  • Dataset Compatibility: Works with any math dataset that follows the standard format
  • Tool Integration: Designed to work seamlessly with the Python tool for computational assistance

Code

Code generation and evaluation environments that automatically test solutions in a sandbox.

GEM’s code environment class automatically evaluates success by running test cases in a sandbox. It can be used with any code dataset that provides a task description and test cases.
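
A minimal usage sketch, assuming the same `make`/`step` interface as above; the submitted solution here is a placeholder.

```python
import gem

env = gem.make("code:Taco8k")
obs, info = env.reset()  # obs contains the programming task description

# Submit a candidate program; the environment runs the dataset's test cases
# in the sandbox and scores the solution accordingly.
solution = "..."  # placeholder for agent-generated code
obs, reward, terminated, truncated, info = env.step(solution)
```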

Available Code Environments

| Environment | Dataset |
| --- | --- |
| code:CodeContest | CodeContest |
| code:Taco8k | TACO-8k |

Features

  • Automatic Code Evaluation: Runs test cases in a secure sandbox environment
  • Test Case Validation: Compares agent-generated code against provided test cases
  • Sandbox Diversity: Two execution options are available:
    • A sandboxed environment using bubblewrap
    • An implementation using Python’s subprocess module
  • Dataset Diversity: Compatible with any code dataset that includes problems and test cases

Question-Answering

QA environments designed for integrated search tool usage to train agents in information retrieval and reasoning.

GEM’s question-answering environments integrate search-tool usage, training the agent to retrieve and reason over external information (see the sketch below). Additional question-answering environments can be added by simply registering the dataset.
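
A sketch of attaching a search tool, following the same pattern as the math example above; `SearchTool` and the wrapper import paths are assumed names.

```python
import gem
# Assumed tool/wrapper import paths, mirroring the math example.
from gem.tools.search_tool import SearchTool
from gem.tools.tool_env_wrapper import ToolEnvWrapper

env = ToolEnvWrapper(gem.make("qa:HotpotQA"), tools=[SearchTool()])

obs, info = env.reset()  # obs contains the question
# The agent may issue search-tool calls before committing to a final answer.
obs, reward, terminated, truncated, info = env.step("final answer text")
```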

Available QA Environments

| Environment | Dataset |
| --- | --- |
| qa:NaturalQuestions | NaturalQuestions |
| qa:HotpotQA | HotpotQA |
| logic:RuleTaker-d0 | RuleTaker-d0-70k |
| logic:RuleTaker-d1 | RuleTaker-d1-70k |
| logic:RuleTaker-d2 | RuleTaker-d2-70k |
| logic:RuleTaker-d3 | RuleTaker-d3-70k |
| logic:RuleTaker-d5 | RuleTaker-d5-70k |

Environment Types

  • Natural Questions: Real-world questions that people ask search engines, requiring factual knowledge and reasoning
  • HotpotQA: Multi-hop reasoning questions that require gathering information from multiple sources
  • RuleTaker: Logical reasoning environments with varying complexity levels (d0 through d5), where agents must apply rules to derive conclusions

Reasoning Gym

We include all tasks from Reasoning Gym in our package; they can be used simply by calling `make("rg:[sub_task_name]")`.
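
For example (the sub-task name below is illustrative; substitute any Reasoning Gym task name):

```python
import gem

env = gem.make("rg:leg_counting")  # "leg_counting" is one Reasoning Gym task
obs, info = env.reset()
```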