🌍 Environments

Overview

GEM supports a diverse range of environments and makes it easy to add your own. It provides four main categories of environments (games, math, code, and question-answering), each designed for a different type of agent training and evaluation, plus all tasks from Reasoning Gym.

All GEM environments follow a consistent interface pattern:

  • env.reset() - Initialize/reset the environment
  • env.step(action) - Take an action and receive the next observation, reward, and termination signals
  • env.sample_random_action() - Get a random valid action

This design closely follows the Gymnasium standard, making it easy to integrate with existing RL frameworks and tools.
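
A minimal interaction loop is sketched below. It assumes the package is imported as `gem`, that `make` accepts the environment IDs listed in the tables on this page, and that `step` returns the standard Gymnasium five-tuple.

```python
import gem  # assumed import name for the GEM package

# Create an environment by its registered ID (see the tables below).
env = gem.make("game:GuessTheNumber")

obs, info = env.reset()  # initialize and receive the first observation
for _ in range(10):
    action = env.sample_random_action()  # random valid action as a stand-in policy
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
```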

Games

Interactive game environments including Sudoku, Minesweeper, Wordle, and more from the TextArena collection.

We maintain local versions of many of the TextArena games with (i) an improved, denser game-reward design and (ii) a compatible gym-style interface.

Available Game Environments

| Environment | Description |
| --- | --- |
| game:GuessTheNumber | The agent has multiple guesses to find the hidden number; after each guess it learns whether the hidden number is higher or lower than its guess. |
| game:Mastermind | The agent has multiple guesses to crack the hidden code; after each guess it receives black and white pegs indicating the number of correct digits in the right and wrong places. |
| game:Minesweeper | The agent must reveal all safe grid squares without revealing a mine. Each revealed square shows the number of adjacent squares that contain mines. |
| game:Wordle | The agent must guess the hidden word. After each turn it receives feedback ("G" = correct letter, correct position; "Y" = correct letter, incorrect position; "X" = incorrect letter). |
| game:FifteenPuzzle | Arrange the tiles on the board into ascending order by using the empty space to slide tiles into new positions. |
| game:Hangman | Guess the hidden word by proposing one letter at a time or the entire word. |
| game:Sudoku | The classic Sudoku game; the `easy` variant renders a 4x4 board. |
| game:TowerofHanoi | A classic single-player puzzle in which the objective is to move a stack of disks from one tower to another following specific rules. |

Difficulty Variants

Each game environment also has `-easy`, `-hard`, and `-random` variants, where `-random` means the environment is set to a random difficulty level at each reset.
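
For example, assuming the suffix is appended directly to the environment ID:

```python
import gem

env_easy = gem.make("game:Sudoku-easy")    # fixed easy difficulty (4x4 board)
env_rand = gem.make("game:Wordle-random")  # difficulty re-sampled at every reset
```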

Adding New Games

Adding new games is easy: implement `.reset()` and `.step()` methods and register the environment under a new name, as in the sketch below.
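
The following is a hypothetical toy game to illustrate the idea; the `Env` base-class location and the `register` signature are assumptions and may differ from GEM's actual registry API.

```python
import random

import gem
from gem.core import Env  # assumed base-class location


class GuessParity(Env):
    """Toy single-turn game: guess whether a hidden number is odd or even."""

    def reset(self, seed=None):
        self._hidden = random.Random(seed).randint(1, 100)
        return "Is the hidden number 'odd' or 'even'?", {}

    def step(self, action):
        guessed_odd = action.strip().lower() == "odd"
        correct = guessed_odd == (self._hidden % 2 == 1)
        reward = 1.0 if correct else 0.0
        # Single-turn game: terminate after one guess.
        return ("Correct!" if correct else "Wrong."), reward, True, False, {}


# Register under a new name (signature assumed; adjust to GEM's registry API).
gem.register("game:GuessParity", GuessParity)
```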

Math

Mathematical reasoning environments with automatic answer parsing and checking, compatible with various math datasets.

GEM’s math environment class includes automatic answer parsing and checking and is designed to be compatible with any math dataset. To add a new environment, simply register the dataset. A typical use case is to combine these environments with access to the Python tool so that the agent learns to use code, as sketched below.
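
In the sketch that follows, the tool and wrapper names (`PythonCodeTool`, `ToolEnvWrapper`) and the `\boxed{}` answer format are assumptions about GEM's API; check the package for the exact imports.

```python
import gem
# Assumed tool/wrapper import paths.
from gem.tools.python_code_tool import PythonCodeTool
from gem.tools.tool_env_wrapper import ToolEnvWrapper

# Wrap a math environment so the agent can execute Python during an episode.
env = ToolEnvWrapper(gem.make("math:GSM8k"), tools=[PythonCodeTool()])

obs, info = env.reset()  # obs contains the math problem text
# The agent's final answer is parsed and checked automatically.
obs, reward, terminated, truncated, info = env.step(r"The answer is \boxed{42}.")
```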

Available Math Environments

| Environment | Dataset |
| --- | --- |
| math:ASDIV2k | ASDIV-2k |
| math:GSM8k | GSM-8k |
| math:Math12k | MATH-12k |
| math:ORZ57k | ORZ-57k |

Features

  • Automatic Answer Parsing: Built-in parsing for mathematical expressions and numerical answers
  • Answer Checking: Automatic validation of agent responses against ground truth
  • Dataset Compatibility: Works with any math dataset that follows the standard format
  • Tool Integration: Designed to work seamlessly with the Python tool for computational assistance

Code

Code generation and evaluation environments that automatically test solutions in a sandbox.

GEM’s code environment class automatically evaluates success by running test cases in a sandbox. It can be used with any code dataset that provides a task description and test cases.
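
A minimal usage sketch, assuming the same `make`/`step` interface as above; the submitted solution here is a placeholder.

```python
import gem

env = gem.make("code:Taco8k")
obs, info = env.reset()  # obs contains the programming task description

# Submit a candidate program; the environment runs the dataset's test cases
# in the sandbox and scores the solution accordingly.
solution = "..."  # placeholder for agent-generated code
obs, reward, terminated, truncated, info = env.step(solution)
```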

Available Code Environments

| Environment | Dataset |
| --- | --- |
| code:CodeContest | CodeContest |
| code:Taco8k | TACO-8k |

Features

  • Automatic Code Evaluation: Runs test cases in a secure sandbox environment
  • Test Case Validation: Compares agent-generated code against provided test cases
  • Sandbox Diversity: Two execution options are available:
    • A sandboxed environment using bubblewrap
    • An implementation using Python’s subprocess module
  • Dataset Diversity: Compatible with any code dataset that includes problems and test cases

Question-Answering

QA environments designed for integrated search tool usage to train agents in information retrieval and reasoning.

GEM’s question-answering environments integrate search-tool usage, training the agent to retrieve and reason over external information (see the sketch below). Additional question-answering environments can be added by simply registering the dataset.
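
A sketch of attaching a search tool, following the same pattern as the math example above; `SearchTool` and the wrapper import paths are assumed names.

```python
import gem
# Assumed tool/wrapper import paths, mirroring the math example.
from gem.tools.search_tool import SearchTool
from gem.tools.tool_env_wrapper import ToolEnvWrapper

env = ToolEnvWrapper(gem.make("qa:HotpotQA"), tools=[SearchTool()])

obs, info = env.reset()  # obs contains the question
# The agent may issue search-tool calls before committing to a final answer.
obs, reward, terminated, truncated, info = env.step("final answer text")
```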

Available QA Environments

| Environment | Dataset |
| --- | --- |
| qa:NaturalQuestions | NaturalQuestions |
| qa:HotpotQA | HotpotQA |
| logic:RuleTaker-d0 | RuleTaker-d0-70k |
| logic:RuleTaker-d1 | RuleTaker-d1-70k |
| logic:RuleTaker-d2 | RuleTaker-d2-70k |
| logic:RuleTaker-d3 | RuleTaker-d3-70k |
| logic:RuleTaker-d5 | RuleTaker-d5-70k |

Environment Types

  • Natural Questions: Real-world questions that people ask search engines, requiring factual knowledge and reasoning
  • HotpotQA: Multi-hop reasoning questions that require gathering information from multiple sources
  • RuleTaker: Logical reasoning environments with varying complexity levels (d0 through d5), where agents must apply rules to derive conclusions

Reasoning Gym

We include all tasks from Reasoning Gym in our package; they can be used simply by calling `make("rg:[sub_task_name]")`.
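
For example (the sub-task name below is illustrative; substitute any Reasoning Gym task name):

```python
import gem

env = gem.make("rg:leg_counting")  # "leg_counting" is one Reasoning Gym task
obs, info = env.reset()
```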