src.buffers package

Submodules

src.buffers.PPOBuffer module

Buffer for PPO.

class src.buffers.PPOBuffer.PPOBuffer(obs_dim: int, act_dim: int, size: int, batch_size: int, gamma: float = 0.99, lam: float = 0.95, eps: float = 0.001)

Bases: object

A buffer for storing trajectories experienced by a PPO agent interacting with the environment, and using Generalized Advantage Estimation (GAE-Lambda) for calculating the advantages of state-action pairs.

Initialize PPOBuffer

Parameters
  • obs_dim (int) – Observation Dimension

  • act_dim (int) – Action Dimension

  • size (int) – Size of Replay Buffer

  • batch_size (int) – Batch Size

  • gamma (float, optional) – Discount factor (gamma). Defaults to 0.99.

  • lam (float, optional) – GAE-Lambda parameter (lambda). Defaults to 0.95.

  • eps (float, optional) – Epsilon. Defaults to 1e-3.

finish_path(action_obj=None)

Call this at the end of a trajectory, or when one gets cut off by an epoch ending. This looks back in the buffer to where the trajectory started, and uses rewards and value estimates from the whole trajectory to compute advantage estimates with GAE-Lambda, as well as compute the rewards-to-go for each state, to use as the targets for the value function. The “last_val” argument should be 0 if the trajectory ended because the agent reached a terminal state (died), and otherwise should be V(s_T), the value function estimated for the last state. This allows us to bootstrap the reward-to-go calculation to account for timesteps beyond the arbitrary episode horizon (or epoch cutoff).
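The computation described above can be sketched as follows. This is a minimal, dependency-free illustration and not the package's implementation: the function name gae_finish_path and its arguments are hypothetical, and plain lists stand in for whatever array type the buffer uses internally.

```python
def gae_finish_path(rewards, values, last_val=0.0, gamma=0.99, lam=0.95):
    """Return (advantages, rewards_to_go) for a single trajectory.

    Hypothetical helper: `rewards` and `values` hold one entry per timestep,
    and `last_val` is 0 for a terminal state or V(s_T) for a cut-off path.
    """
    n = len(rewards)
    vals = list(values) + [last_val]  # append the bootstrap value
    advantages = [0.0] * n
    rewards_to_go = [0.0] * n
    adv_next = 0.0       # GAE accumulator for timestep t + 1
    ret_next = last_val  # bootstrapped return for timestep t + 1
    for t in reversed(range(n)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * vals[t + 1] - vals[t]
        adv_next = delta + gamma * lam * adv_next
        ret_next = rewards[t] + gamma * ret_next
        advantages[t] = adv_next
        rewards_to_go[t] = ret_next
    return advantages, rewards_to_go


# Terminal trajectory (last_val = 0) versus one cut off with V(s_T) = 2.
adv_done, ret_done = gae_finish_path([1.0, 1.0, 1.0], [0.0, 0.0, 0.0],
                                     last_val=0.0, gamma=1.0, lam=1.0)
adv_cut, ret_cut = gae_finish_path([1.0, 1.0, 1.0], [0.0, 0.0, 0.0],
                                   last_val=2.0, gamma=0.5, lam=1.0)
```

With a nonzero last_val, the value of the final state flows back into every earlier reward-to-go, which is the bootstrapping the description refers to.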

classmethod instantiate_from_config(config_file_location)

Initialize class from config file

Parameters

config_file_location (path) – Path to config file

Raises

ValueError – Error loading file

Returns

object from class.

Return type

cls

classmethod instantiate_from_config_dict(config)

Initialize class from config dictionary

Parameters

config (dictionary) – Configuration dictionary from which to create the instance.

Returns

Object from class and config.

Return type

cls
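As an illustration of the factory pattern above, the sketch below builds an object from a configuration dictionary. BufferStandIn is a hypothetical stand-in that only mirrors the documented constructor signature; the key checks are an assumption about how validation might behave, not a description of the real class.

```python
class BufferStandIn:
    """Hypothetical stand-in with the documented PPOBuffer constructor arguments."""

    REQUIRED = ("obs_dim", "act_dim", "size", "batch_size")
    OPTIONAL = ("gamma", "lam", "eps")

    def __init__(self, obs_dim, act_dim, size, batch_size,
                 gamma=0.99, lam=0.95, eps=0.001):
        self.obs_dim = obs_dim
        self.act_dim = act_dim
        self.size = size
        self.batch_size = batch_size
        self.gamma = gamma
        self.lam = lam
        self.eps = eps

    @classmethod
    def instantiate_from_config_dict(cls, config):
        # Assumed behavior: reject missing or unknown keys, then unpack.
        missing = [k for k in cls.REQUIRED if k not in config]
        unknown = [k for k in config if k not in cls.REQUIRED + cls.OPTIONAL]
        if missing or unknown:
            raise ValueError(f"missing={missing}, unknown={unknown}")
        return cls(**config)


config = {"obs_dim": 4, "act_dim": 2, "size": 100, "batch_size": 10, "gamma": 0.9}
buf = BufferStandIn.instantiate_from_config_dict(config)
```

Keys left out of the dictionary, such as lam and eps here, fall back to the constructor defaults.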

sample_batch()

Call this at the end of an epoch to get all of the data from the buffer, with advantages appropriately normalized (shifted to have mean zero and std one). Also, resets some pointers in the buffer.
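The normalization step can be sketched as below. The function name normalize_advantages is hypothetical, and adding eps to the standard deviation is an assumption about how the eps constructor argument might be used; the source does not state its role.

```python
import statistics


def normalize_advantages(advantages, eps=1e-3):
    """Shift to mean zero and scale to (approximately) unit std.

    Assumption: eps guards against division by zero when all
    advantages are identical.
    """
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)  # population standard deviation
    return [(a - mean) / (std + eps) for a in advantages]


normalized = normalize_advantages([1.0, 2.0, 3.0], eps=0.0)
constant = normalize_advantages([5.0, 5.0], eps=1e-3)
```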

schema = Map({'obs_dim': Int(), 'act_dim': Int(), 'size': Int(), 'batch_size': Int(), Optional("gamma"): Float(), Optional("lam"): Float(), Optional("eps"): Float()})

store(buffer_dict)

Append one timestep of agent-environment interaction to the buffer.

src.buffers.PPOBuffer.discount_cumsum(x, discount)

Wrapper around discounted cumulative summation.

Parameters
  • x (np.array) – Array to sum over

  • discount (float) – discount parameter.

Returns

output

Return type

float
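For reference, a discounted cumulative sum can be sketched in plain Python as follows. This is an illustration of the operation only; the documented function takes an np.array, and which routine it wraps is not stated here. The sketch returns a list with one entry per input element.

```python
def discount_cumsum(x, discount):
    """Return y with y[t] = x[t] + discount * x[t+1] + discount**2 * x[t+2] + ..."""
    out = [0.0] * len(x)
    running = 0.0
    # Walk backwards so each entry reuses the sum already computed after it.
    for t in reversed(range(len(x))):
        running = x[t] + discount * running
        out[t] = running
    return out


result = discount_cumsum([1.0, 1.0, 1.0], 0.5)
```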

src.buffers.SimpleReplayBuffer module

Default Replay Buffer.

class src.buffers.SimpleReplayBuffer.SimpleReplayBuffer(obs_dim: int, act_dim: int, size: int, batch_size: int)

Bases: object

A simple FIFO experience replay buffer for SAC agents.

Initialize simple replay buffer

Parameters
  • obs_dim (int) – Observation dimension

  • act_dim (int) – Action dimension

  • size (int) – Buffer size

  • batch_size (int) – Batch size
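The FIFO behavior can be illustrated with a small ring buffer. TinyFIFOBuffer is a hypothetical sketch, not the class documented here: it stores whole dictionaries in a Python list and samples uniformly with replacement, both of which are assumptions.

```python
import random


class TinyFIFOBuffer:
    """Hypothetical ring buffer: the oldest entry is overwritten once full."""

    def __init__(self, size, batch_size):
        self.max_size = size
        self.batch_size = batch_size
        self.data = [None] * size
        self.ptr = 0   # next slot to write
        self.size = 0  # number of valid entries

    def store(self, buffer_dict):
        self.data[self.ptr] = buffer_dict
        self.ptr = (self.ptr + 1) % self.max_size  # wrap around when full
        self.size = min(self.size + 1, self.max_size)

    def sample_batch(self, rng=random):
        # Uniform sampling with replacement over the valid entries.
        idxs = [rng.randrange(self.size) for _ in range(self.batch_size)]
        keys = self.data[idxs[0]].keys()
        return {k: [self.data[i][k] for i in idxs] for k in keys}


fifo = TinyFIFOBuffer(size=3, batch_size=2)
for i in range(5):
    fifo.store({"obs": i, "rew": float(i)})
batch = fifo.sample_batch()
```

After five stores into a buffer of size three, only the three most recent transitions remain available for sampling.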

finish_path(action_obj=None)

Call this at the end of a trajectory, or when one gets cut off by an epoch ending. This looks back in the buffer to where the trajectory started, and uses rewards and value estimates from the whole trajectory to compute advantage estimates with GAE-Lambda, as well as compute the rewards-to-go for each state, to use as the targets for the value function. The “last_val” argument should be 0 if the trajectory ended because the agent reached a terminal state (died), and otherwise should be V(s_T), the value function estimated for the last state. This allows us to bootstrap the reward-to-go calculation to account for timesteps beyond the arbitrary episode horizon (or epoch cutoff).

classmethod instantiate_from_config(config_file_location)

Initialize class from config file

Parameters

config_file_location (path) – Path to config file

Raises

ValueError – Error loading file

Returns

object from class.

Return type

cls

classmethod instantiate_from_config_dict(config)

Initialize class from config dictionary

Parameters

config (dictionary) – Configuration dictionary from which to create the instance.

Returns

Object from class and config.

Return type

cls

sample_batch()

Sample a batch of stored data from the buffer.

Returns

Dictionary of batched information.

Return type

dict

schema = Map({'obs_dim': Int(), 'act_dim': Int(), 'size': Int(), 'batch_size': Int()})

store(buffer_dict)

Store the data from buffer_dict in the buffer.

Parameters

buffer_dict (dict) – Dictionary of data to store.

Module contents

Replay buffer definitions, for storing past data for learning.