🧱 Advanced
Overview
Advanced GEM features, custom environments, and training.
Custom Environments
GEM makes it simple to create custom environments. To create a new environment, implement the .reset() and .step() methods, then register the environment with register from gem.envs.registration. See the examples for more information.
gem.core.Env.reset()
Returns:
- obs (str) - Initial question/observation from the environment.
- info (dict) - Any extra info, e.g. for logging or to aid debugging.
gem.core.Env.step(action)
Returns:
- obs (str) - Next observation/output from the environment.
- reward (float) - Environment reward.
- terminated (bool) - Whether the episode is terminated.
- truncated (bool) - Included to follow the Gym environment interface, but currently unused.
- info (dict) - Any extra info.
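The two methods above are enough to drive a full episode. The following is a minimal sketch of that loop, assuming only the return shapes documented above. CountdownEnv and run_episode are hypothetical names used for illustration and are not part of GEM; a real environment would subclass gem.core.Env instead.

```python
from typing import Any, Callable, Dict, Tuple


class CountdownEnv:
    """Hypothetical stand-in that mimics the documented reset()/step() contract."""

    def __init__(self, start: int = 3):
        self.start = start
        self.remaining = start

    def reset(self) -> Tuple[str, Dict[str, Any]]:
        self.remaining = self.start
        return f"Count down from {self.remaining}.", {}

    def step(self, action: str) -> Tuple[str, float, bool, bool, Dict[str, Any]]:
        # Reward 1.0 when the action names the current number, else 0.0.
        reward = float(action.strip() == str(self.remaining))
        self.remaining -= 1
        terminated = self.remaining == 0
        obs = "Done." if terminated else f"Next: {self.remaining}."
        return obs, reward, terminated, False, {"remaining": self.remaining}


def run_episode(env, policy: Callable[[str], str], max_steps: int = 100) -> float:
    """Run one episode and return the summed reward."""
    obs, info = env.reset()
    total = 0.0
    for _ in range(max_steps):
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total += reward
        # Stop on either flag, even though truncated is documented as unused.
        if terminated or truncated:
            break
    return total
```

Here policy is any callable that maps an observation string to an action string, for example a wrapper around a language model call. The max_steps cap is an added safeguard for this sketch, not a GEM setting.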
Creating a Custom Environment
- Inherit from gem.core.Env: Your environment should extend the base environment class
- Implement Required Methods: Add your custom .reset() and .step() logic
- Register the Environment: Add your environment to the registry for easy access
- Test and Validate: Ensure your environment works correctly with GEM’s ecosystem
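For the "Test and Validate" step, one lightweight option is to check that an environment returns the shapes and types documented for reset() and step(). The sketch below is an assumption about what such a check could look like; check_env_contract and FixedAnswerEnv are hypothetical helpers, not GEM utilities, and the type checks simply mirror the documented return types.

```python
from typing import Any, Dict, Tuple


def check_env_contract(env, sample_action: str) -> None:
    """Assert that reset() and step() return the documented shapes and types."""
    reset_out = env.reset()
    assert len(reset_out) == 2, "reset() should return (obs, info)"
    obs, info = reset_out
    assert isinstance(obs, str), "obs should be a str"
    assert isinstance(info, dict), "info should be a dict"

    step_out = env.step(sample_action)
    assert len(step_out) == 5, (
        "step() should return (obs, reward, terminated, truncated, info)"
    )
    obs, reward, terminated, truncated, info = step_out
    assert isinstance(obs, str), "obs should be a str"
    # bool is a subclass of int, so exclude it explicitly.
    assert isinstance(reward, (int, float)) and not isinstance(reward, bool)
    assert isinstance(terminated, bool) and isinstance(truncated, bool)
    assert isinstance(info, dict), "info should be a dict"


class FixedAnswerEnv:
    """Hypothetical single-turn environment used only to exercise the check."""

    def reset(self) -> Tuple[str, Dict[str, Any]]:
        return "Say 'hello'.", {}

    def step(self, action: str) -> Tuple[str, float, bool, bool, Dict[str, Any]]:
        return "Done.", float(action == "hello"), True, False, {}


check_env_contract(FixedAnswerEnv(), "hello")
```

A check like this catches the most common mistakes, such as returning four values from step() or an integer flag instead of a bool, before the environment is used in training.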
Example Structure
import random
import string

from gem.core import Env
from gem.envs.registration import register
# Assumed import paths for the two helpers used below; verify against your GEM version.
from gem.utils.constants import TERMINAL_STATE
from gem.utils.parsing import extract_last_boxed_answer


class ReverseStringEnv(Env):
    def __init__(self, str_len: int = 5):
        super().__init__()
        self.str_len = str_len

    def _get_instructions(self) -> str:
        return (
            "You are tasked to reverse a given string.\n"
            "You may provide your response in any manner. Only the content wrapped inside \\boxed{} will be considered as your final answer.\n"
            f"Please reverse the string: {self.gt_str}.\n"
        )

    def reset(self, seed=None):
        super().reset(seed)
        characters = string.ascii_letters + string.digits  # A-Z, a-z, 0-9
        self.gt_str = "".join(random.choices(characters, k=self.str_len))
        return self._get_instructions(), {}

    def step(self, action):
        # Parse the final \boxed{...} answer from the model response.
        clean_action = extract_last_boxed_answer(action)
        if clean_action is None:
            reward = 0.0
        else:
            # Correct when reversing the answer recovers the original string.
            reward = float(clean_action[::-1] == self.gt_str)
        # Single-turn task: the episode ends after one step.
        return TERMINAL_STATE, reward, True, True, {}


# Register your environment
register("custom:ReverseString", ReverseStringEnv)
Best Practices
- Clear Instructions: Provide clear, unambiguous instructions in your observations
- Consistent Rewards: Design a reward structure that encourages desired behavior
- Proper Termination: Clearly define when episodes should end
- Informative Output: Use the info dictionary to provide debugging information
- Documentation: Document your environment’s behavior and expected usage
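As an illustration of the "Consistent Rewards" and "Informative Output" points, the sketch below scores a string-reversal answer and returns an info dictionary that records why a response failed. It is a minimal sketch under stated assumptions: last_boxed and score_reversal are hypothetical helpers, the regular expression only handles \boxed{...} without nested braces, and none of this is GEM's own parsing code.

```python
import re
from typing import Any, Dict, Optional, Tuple

# Matches \boxed{...} with no nested braces; a simplification for this sketch.
_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def last_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} span, or None if absent."""
    matches = _BOXED.findall(text)
    return matches[-1] if matches else None


def score_reversal(action: str, target: str) -> Tuple[float, Dict[str, Any]]:
    """Return (reward, info) for a response that should contain target reversed."""
    answer = last_boxed(action)
    if answer is None:
        # Same reward as a wrong answer, but info records the distinct cause.
        return 0.0, {"parsed_answer": None, "failure": "no_boxed_answer"}
    correct = answer[::-1] == target
    info = {
        "parsed_answer": answer,
        "failure": None if correct else "wrong_answer",
    }
    return float(correct), info
```

Keeping the reward binary while putting the failure reason in info keeps the training signal simple and still makes logs easy to debug.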