ml.rl.training package

Submodules

ml.rl.training.actor_predictor module

class ml.rl.training.actor_predictor.ActorPredictor(pem, ws, predict_net=None)

Bases: ml.rl.training.sandboxed_predictor.SandboxedRLPredictor

actor_prediction(float_state_features: List[Dict[int, float]]) → List[Dict[str, float]]
predict(float_state_features: List[Dict[int, float]]) → List[Dict[str, float]]

Returns values for each state.

:param float_state_features: A list of feature -> float value dict examples
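
As a rough sketch of the expected call format (the feature ids 1001 and 1002 below are hypothetical), each state is a sparse dict of feature id -> float value:

    # Hypothetical input: a batch of two states, each a dict of
    # integer feature id -> float value. Feature ids are made up.
    float_state_features = [
        {1001: 0.5, 1002: -1.2},
        {1001: 0.0, 1002: 3.4},
    ]
    # results = predictor.predict(float_state_features)
    # -> one dict of output name -> float per input state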

ml.rl.training.ddpg_predictor module

class ml.rl.training.ddpg_predictor.DDPGPredictor(net, init_net, parameters)

Bases: ml.rl.training.rl_predictor_pytorch.RLPredictor

actor_prediction(float_state_features)

Actor Prediction - Returns an action for each float_feature state.

:param float_state_features: A list of feature -> float value dict examples

critic_prediction(float_state_features, actions)

Critic Prediction - Returns values for each state/action pair.

:param float_state_features: states as a list of feature -> float value dicts
:param actions: actions as a list of feature -> value dicts
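
A minimal sketch of the argument format, with made-up state feature ids (1001, 1002) and a made-up action feature id (2001):

    float_state_features = [{1001: 0.5, 1002: -1.2}, {1001: 0.0, 1002: 3.4}]
    actions = [{2001: 0.7}, {2001: -0.3}]
    # q_values = predictor.critic_prediction(float_state_features, actions)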

classmethod export_actor(trainer, state_normalization_parameters, action_feature_ids, min_action_range_tensor_serving, max_action_range_tensor_serving, model_on_gpu=False)
Export caffe2 preprocessor net and pytorch actor forward pass as one caffe2 net.

:param trainer: DDPGTrainer
:param state_normalization_parameters: state NormalizationParameters
:param min_action_range_tensor_serving: pytorch tensor that specifies the min action value for each dimension
:param max_action_range_tensor_serving: pytorch tensor that specifies the max action value for each dimension
:param model_on_gpu: boolean indicating if the model is a GPU model or CPU model

classmethod export_critic(trainer, state_normalization_parameters, action_normalization_parameters, model_on_gpu=False)

Export caffe2 preprocessor net and pytorch DQN forward pass as one caffe2 net.

:param trainer: ParametricDQNTrainer
:param state_normalization_parameters: state NormalizationParameters
:param action_normalization_parameters: action NormalizationParameters
:param model_on_gpu: boolean indicating if the model is a GPU model or CPU model

classmethod generate_train_net(trainer, model, min_action_range_tensor_serving, max_action_range_tensor_serving, model_on_gpu)
policy()

TODO: Return actions when exporting final net and fill in this function.

ml.rl.training.ddpg_trainer module

class ml.rl.training.ddpg_trainer.ActorNet(layers, activations, fl_init)

Bases: torch.nn.Module

forward(state) → torch.FloatTensor

Forward pass for actor network. Assumes activation names are valid pytorch activation names.

:param state: state as a list of state features

class ml.rl.training.ddpg_trainer.ActorNetModel(layers, activations, fl_init, state_dim, action_dim, use_gpu, use_all_avail_gpus, dnn=None)

Bases: ml.rl.models.base.ModelBase

cpu_model()

Override this in DistributedDataParallel models

forward(input: ml.rl.types.StateInput) → ml.rl.types.ActorOutput

Forward pass for actor network. Assumes activation names are valid pytorch activation names.

:param input: StateInput containing float_features

input_prototype() → ml.rl.types.StateInput

This function provides the input for ONNX graph tracing.

The return value should be what is expected by forward().
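
The sketch below is a generic illustration of why a prototype input is needed, using a made-up TinyActor module rather than this package's models: ONNX export traces the graph by executing forward() on a concrete example input, which is the role input_prototype() plays here.

    import torch

    # Made-up module standing in for an actor network.
    class TinyActor(torch.nn.Module):
        def __init__(self, state_dim=4, action_dim=2):
            super().__init__()
            self.fc = torch.nn.Linear(state_dim, action_dim)

        def forward(self, state):
            return torch.tanh(self.fc(state))

    model = TinyActor()
    prototype = torch.randn(1, 4)  # plays the role of input_prototype()
    torch.onnx.export(model, prototype, "tiny_actor.onnx")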

class ml.rl.training.ddpg_trainer.CriticNet(layers, activations, fl_init, action_dim)

Bases: torch.nn.Module

forward(state_action) → torch.FloatTensor

Forward pass for critic network. Assumes activation names are valid pytorch activation names.

:param state_action: tensor of concatenated states and actions

forward_split(state, action) → torch.FloatTensor
class ml.rl.training.ddpg_trainer.CriticNetModel(layers, activations, fl_init, state_dim, action_dim, use_gpu, use_all_avail_gpus, dnn=None)

Bases: ml.rl.models.base.ModelBase

cpu_model()

Override this in DistributedDataParallel models

forward(input: ml.rl.types.StateAction) → ml.rl.types.SingleQValue

Forward pass for critic network. Assumes activation names are valid pytorch activation names.

:param input: ml.rl.types.StateAction of combined states and actions

input_prototype() → ml.rl.types.StateAction

This function provides the input for ONNX graph tracing.

The return value should be what is expected by forward().

class ml.rl.training.ddpg_trainer.DDPGTrainer(actor_network, critic_network, parameters, state_normalization_parameters: Dict[int, ml.rl.thrift.core.ttypes.NormalizationParameters], action_normalization_parameters: Dict[int, ml.rl.thrift.core.ttypes.NormalizationParameters], min_action_range_tensor_serving: torch.Tensor, max_action_range_tensor_serving: torch.Tensor, use_gpu: bool = False, use_all_avail_gpus: bool = False)

Bases: ml.rl.training.rl_trainer_pytorch.RLTrainer

internal_prediction(states, noisy=False) → numpy.ndarray

Returns a list of actions output from the actor network.

:param states: a list of states to produce actions for

train(training_batch: ml.rl.types.TrainingBatch) → None
class ml.rl.training.ddpg_trainer.OrnsteinUhlenbeckProcessNoise(action_dim, theta=0.15, sigma=0.2, mu=0)

Bases: object

Exploration noise process with temporally correlated noise. Used to explore in physical environments with momentum. Outlined in the DDPG paper.

clear() → None
get_noise() → numpy.ndarray

dx = theta * (mu − prev_noise) + sigma * new_gaussian_noise
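
A minimal numpy sketch of this update rule, assuming the same parameter names as the constructor (theta, sigma, mu); the real class may differ in details such as the initial noise value:

    import numpy as np

    class OUNoiseSketch:
        """Illustrative sketch of the temporally correlated noise process."""

        def __init__(self, action_dim, theta=0.15, sigma=0.2, mu=0.0):
            self.action_dim = action_dim
            self.theta, self.sigma, self.mu = theta, sigma, mu
            self.clear()

        def clear(self):
            # Reset the noise state to the mean.
            self.noise = np.ones(self.action_dim) * self.mu

        def get_noise(self):
            # dx = theta * (mu - prev_noise) + sigma * new_gaussian_noise
            gaussian = np.random.randn(self.action_dim)
            dx = self.theta * (self.mu - self.noise) + self.sigma * gaussian
            self.noise = self.noise + dx
            return self.noise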

ml.rl.training.ddpg_trainer.fan_in_init(weight_tensor, bias_tensor) → None

Fan-in initialization as described in the DDPG paper.
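
A sketch of what fan-in initialization means, based on the DDPG paper's description (uniform in [-1/sqrt(fan_in), 1/sqrt(fan_in)]); the bounds below are an assumption drawn from the paper, not a copy of this module's code:

    import math
    import torch

    def fan_in_init_sketch(weight_tensor: torch.Tensor, bias_tensor: torch.Tensor) -> None:
        # fan_in = number of input units feeding each output unit.
        fan_in = weight_tensor.size(1)
        bound = 1.0 / math.sqrt(fan_in)
        weight_tensor.data.uniform_(-bound, bound)
        bias_tensor.data.uniform_(-bound, bound)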

ml.rl.training.dqn_predictor module

class ml.rl.training.dqn_predictor.DQNPredictor(pem, ws, predict_net=None)

Bases: ml.rl.training.sandboxed_predictor.SandboxedRLPredictor

predict(float_state_features)

Returns values for each state.

:param float_state_features: A list of feature -> float value dict examples

ml.rl.training.dqn_trainer module

class ml.rl.training.dqn_trainer.DQNTrainer(q_network, q_network_target, reward_network, parameters: ml.rl.thrift.core.ttypes.DiscreteActionModelParameters, use_gpu=False, q_network_cpe=None, q_network_cpe_target=None, metrics_to_score=None, imitator=None)

Bases: ml.rl.training.dqn_trainer_base.DQNTrainerBase

calculate_cpes(training_batch, states, next_states, all_next_action_scores, logged_action_idxs, discount_tensor, not_done_mask)
get_detached_q_values(state) → Tuple[ml.rl.types.AllActionQValues, Optional[ml.rl.types.AllActionQValues]]

Gets the q values from the model and target networks

internal_prediction(input)

Only used by Gym

internal_reward_estimation(input)

Only used by Gym

num_actions
train(training_batch)

ml.rl.training.dqn_trainer_base module

class ml.rl.training.dqn_trainer_base.DQNTrainerBase(parameters, use_gpu, metrics_to_score=None, actions: Optional[List[str]] = None)

Bases: ml.rl.training.rl_trainer_pytorch.RLTrainer

boost_rewards(rewards: torch.Tensor, actions: torch.Tensor) → torch.Tensor
get_max_q_values(q_values, possible_actions_mask)

Used in Q-learning update.

Parameters
  • q_values – shape (batch_size, action_dim); the Q-value of each action in each state.

  • possible_actions_mask – shape (batch_size, action_dim); possible_actions_mask[i][j] = 1 iff the agent can take action j from state i.
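
One common way to realize this, shown purely as an illustrative sketch (not the exact implementation), is to push impossible actions down with a large negative constant before taking the row-wise max:

    import torch

    def masked_max_q(q_values: torch.Tensor, possible_actions_mask: torch.Tensor):
        # Impossible actions get a very low Q-value so they never win the max
        # (cf. RLTrainer.ACTION_NOT_POSSIBLE_VAL).
        ACTION_NOT_POSSIBLE_VAL = -1e9
        masked_q = q_values + (1 - possible_actions_mask) * ACTION_NOT_POSSIBLE_VAL
        max_q, max_idx = masked_q.max(dim=1, keepdim=True)
        return max_q, max_idx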

get_max_q_values_with_target(q_values, q_values_target, possible_actions_mask)

Used in Q-learning update.

Parameters
  • q_values – shape (batch_size, action_dim); Q-values from the online network.

  • q_values_target – shape (batch_size, action_dim); Q-values from the target network.

  • possible_actions_mask – shape (batch_size, action_dim); possible_actions_mask[i][j] = 1 iff the agent can take action j from state i.
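
A sketch of the double Q-learning variant this method supports: the online network's Q-values pick the action, the target network's Q-values score it. Purely illustrative; masking follows the same idea as the sketch above:

    import torch

    def double_q_max(q_values, q_values_target, possible_actions_mask):
        # Select the best possible action with the online network...
        masked_q = q_values + (1 - possible_actions_mask) * -1e9
        best_actions = masked_q.argmax(dim=1, keepdim=True)
        # ...but evaluate it with the target network.
        return q_values_target.gather(1, best_actions), best_actions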

ml.rl.training.imitator_training module

class ml.rl.training.imitator_training.ImitatorTrainer(imitator, parameters: ml.rl.thrift.core.ttypes.DiscreteActionModelParameters, use_gpu=False)

Bases: ml.rl.training.rl_trainer_pytorch.RLTrainer

train(training_batch, train=True)
ml.rl.training.imitator_training.get_valid_actions_from_imitator(imitator, input, drop_threshold)

Create mask for non-viable actions under the imitator.

ml.rl.training.loss_reporter module

class ml.rl.training.loss_reporter.BatchStats(td_loss, reward_loss, imitator_loss, logged_actions, logged_propensities, logged_rewards, logged_values, model_propensities, model_rewards, model_values, model_values_on_logged_actions, model_action_idxs)

Bases: tuple

static add_custom_scalars(action_names: Optional[List[str]])
imitator_loss

Alias for field number 2

logged_actions

Alias for field number 3

logged_propensities

Alias for field number 4

logged_rewards

Alias for field number 5

logged_values

Alias for field number 6

model_action_idxs

Alias for field number 11

model_propensities

Alias for field number 7

model_rewards

Alias for field number 8

model_values

Alias for field number 9

model_values_on_logged_actions

Alias for field number 10

reward_loss

Alias for field number 1

td_loss

Alias for field number 0

write_summary(actions: List[str])
class ml.rl.training.loss_reporter.LossReporter(action_names: Optional[List[str]] = None)

Bases: object

RECENT_WINDOW_SIZE = 100
static calculate_recent_window_average(arr, window_size, num_entries)
flush()
get_last_n_td_loss(n)
get_logged_action_distribution()
get_model_action_distribution()
get_recent_imitator_loss()
get_recent_reward_loss()
get_recent_td_loss()
log_to_tensorboard(epoch: int) → None
num_batches
report(**kwargs)
class ml.rl.training.loss_reporter.StatsByAction(actions)

Bases: object

append(stats)
items()
ml.rl.training.loss_reporter.merge_tensor_namedtuple_list(l, cls)

ml.rl.training.parametric_dqn_predictor module

class ml.rl.training.parametric_dqn_predictor.ParametricDQNPredictor(pem, ws, predict_net=None)

Bases: ml.rl.training.sandboxed_predictor.SandboxedRLPredictor

predict(float_state_features, actions)

Returns values for each state.

:param float_state_features: A list of feature -> float value dict examples

ml.rl.training.parametric_dqn_trainer module

class ml.rl.training.parametric_dqn_trainer.ParametricDQNTrainer(q_network, q_network_target, reward_network, parameters: ml.rl.thrift.core.ttypes.ContinuousActionModelParameters, use_gpu: bool = False)

Bases: ml.rl.training.dqn_trainer_base.DQNTrainerBase

get_detached_q_values(state, action) → Tuple[ml.rl.types.SingleQValue, ml.rl.types.SingleQValue]

Gets the q values from the model and target networks

internal_prediction(state, action)

Only used by Gym

internal_reward_estimation(state, action)

Only used by Gym

train(training_batch) → None

ml.rl.training.parametric_inner_product module

class ml.rl.training.parametric_inner_product.ParametricInnerProduct(state_model, action_model, state_dim, action_dim)

Bases: torch.nn.Module

forward(input)

ml.rl.training.rl_dataset module

class ml.rl.training.rl_dataset.RLDataset(file_path)

Bases: object

insert(**kwargs)
insert_pre_timeline_format(mdp_id, sequence_number, state, timeline_format_action, reward, terminal, possible_actions, time_diff, action_probability, possible_actions_mask)

Insert a new sample into the dataset in the pre-timeline JSON format. This format is needed for running the timeline operator and for uploading the dataset to Hive.

insert_replay_buffer_format(state, action, reward, next_state, next_action, terminal, possible_next_actions, possible_next_actions_mask, time_diff, possible_actions, possible_actions_mask, policy_id)

Insert a new sample into the dataset in the same format as the replay buffer.

load()

Load samples from a gzipped JSON file.

save()

Save samples as a pickle file or JSON file.

ml.rl.training.rl_exporter module

class ml.rl.training.rl_exporter.ActorExporter(dnn, feature_extractor=None, output_transformer=None, state_preprocessor=None, predictor_class=<class 'ml.rl.training.actor_predictor.ActorPredictor'>, **kwargs)

Bases: ml.rl.training.rl_exporter.SandboxedRLExporter

classmethod from_state_action_normalization(dnn, state_normalization, action_normalization, state_preprocessor=None, predictor_class=<class 'ml.rl.training.actor_predictor.ActorPredictor'>, **kwargs)
class ml.rl.training.rl_exporter.DQNExporter(dnn, feature_extractor=None, output_transformer=None, state_preprocessor=None, predictor_class=<class 'ml.rl.training.dqn_predictor.DQNPredictor'>, preprocessing_class=None, **kwargs)

Bases: ml.rl.training.rl_exporter.SandboxedRLExporter

class ml.rl.training.rl_exporter.ParametricDQNExporter(dnn, feature_extractor=None, output_transformer=None, state_preprocessor=None, action_preprocessor=None)

Bases: ml.rl.training.rl_exporter.SandboxedRLExporter

classmethod from_state_action_normalization(dnn, state_normalization, action_normalization, state_preprocessor=None, action_preprocessor=None, **kwargs)
class ml.rl.training.rl_exporter.RLExporter(dnn, feature_extractor=None, output_transformer=None)

Bases: object

export()
class ml.rl.training.rl_exporter.SandboxedRLExporter(dnn, predictor_class, preprocessing_class, feature_extractor=None, output_transformer=None, state_preprocessor=None, action_preprocessor=None, **kwargs)

Bases: ml.rl.training.rl_exporter.RLExporter

export()

ml.rl.training.rl_predictor_pytorch module

class ml.rl.training.rl_predictor_pytorch.RLPredictor(net, init_net, parameters, ws=None)

Bases: object

analyze(named_features)
get_predictor_export_meta()

Returns a PredictorExportMeta object

in_order_dense_to_sparse(dense)

Convert a dense observation to a sparse observation, assuming in-order feature ids.
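
An illustrative sketch of the conversion, assuming "in order" means that feature id i corresponds to column i of the dense observation:

    def dense_to_sparse_sketch(dense_rows):
        # Each dense row becomes a dict of feature id -> float value.
        return [
            {feature_id: float(value) for feature_id, value in enumerate(row)}
            for row in dense_rows
        ]

    # dense_to_sparse_sketch([[0.5, -1.2]]) -> [{0: 0.5, 1: -1.2}]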

classmethod load(db_path, db_type)

Creates Predictor by loading from a database

:param db_path: see load_from_db
:param db_type: see load_from_db

predict(float_state_features)

Returns values for each state.

:param float_state_features: A list of feature -> float value dict examples

predict_net
save(db_path, db_type)

Saves network to db

:param db_path: see save_to_db
:param db_type: see save_to_db

ml.rl.training.rl_trainer_pytorch module

class ml.rl.training.rl_trainer_pytorch.RLTrainer(parameters, use_gpu, metrics_to_score=None, actions: Optional[List[str]] = None)

Bases: object

ACTION_NOT_POSSIBLE_VAL = -1000000000.0
FINGERPRINT = 12345
internal_prediction(input)

Q-network forward pass method for internal domains.

:param input: input to network

internal_reward_estimation(input)

Reward-network forward pass for internal domains.

load_state_dict(state_dict)
state_dict()
train(training_samples, evaluator=None) → None
warm_start_components()

The trainer should specify what members to save and load

ml.rl.training.rl_trainer_pytorch.rescale_torch_tensor(tensor: torch.tensor, new_min: torch.tensor, new_max: torch.tensor, prev_min: torch.tensor, prev_max: torch.tensor)

Rescale column values in an N x M torch tensor to be in a new range. Each column m in the input tensor will be rescaled from the range [prev_min[m], prev_max[m]] to [new_min[m], new_max[m]].
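
A minimal sketch of the linear rescaling described above (not the function's actual body):

    import torch

    def rescale_sketch(tensor, new_min, new_max, prev_min, prev_max):
        # Map each column from [prev_min[m], prev_max[m]] to [new_min[m], new_max[m]].
        prev_range = prev_max - prev_min
        new_range = new_max - new_min
        return (tensor - prev_min) / prev_range * new_range + new_min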

ml.rl.training.sac_trainer module

class ml.rl.training.sac_trainer.SACTrainer(q1_network, value_network, value_network_target, actor_network, parameters: ml.rl.thrift.core.ttypes.SACModelParameters, q2_network=None, min_action_range_tensor_training=None, max_action_range_tensor_training=None, min_action_range_tensor_serving=None, max_action_range_tensor_serving=None)

Bases: ml.rl.training.rl_trainer_pytorch.RLTrainer

Soft Actor-Critic trainer as described in https://arxiv.org/pdf/1801.01290

The actor is assumed to implement the reparameterization trick.
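
For reference, a minimal sketch of what that assumption means for a Gaussian policy; the tanh squashing and the parameter names below are illustrative assumptions, not this trainer's exact actor:

    import torch

    def reparameterized_action(mu: torch.Tensor, log_sigma: torch.Tensor) -> torch.Tensor:
        # Sample eps ~ N(0, 1) and transform it deterministically, so the
        # gradient can flow back through mu and log_sigma.
        eps = torch.randn_like(mu)
        raw_action = mu + log_sigma.exp() * eps
        return torch.tanh(raw_action)  # squash into [-1, 1]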

internal_prediction(states)

Returns a list of actions output from the actor network.

:param states: a list of states to produce actions for

train(training_batch) → None

IMPORTANT: the input action here is assumed to be preprocessed to match the range of the output of the actor.

warm_start_components()

The trainer should specify what members to save and load

ml.rl.training.sandboxed_predictor module

class ml.rl.training.sandboxed_predictor.SandboxedRLPredictor(pem, ws, predict_net=None)

Bases: ml.rl.training.rl_predictor_pytorch.RLPredictor

classmethod load(db_path, db_type, *args, **kwargs)

Creates Predictor by loading from a database

:param db_path: see load_from_db
:param db_type: see load_from_db

predict_net
save(db_path, db_type)

Saves network to db

:param db_path: see save_to_db
:param db_type: see save_to_db

ml.rl.training.training_data_page module

class ml.rl.training.training_data_page.TrainingDataPage(mdp_ids: Optional[numpy.ndarray] = None, sequence_numbers: Optional[torch.Tensor] = None, states: Optional[torch.Tensor] = None, actions: Optional[torch.Tensor] = None, propensities: Optional[torch.Tensor] = None, rewards: Optional[torch.Tensor] = None, possible_actions_mask: Optional[torch.Tensor] = None, possible_actions_state_concat: Optional[torch.Tensor] = None, next_states: Optional[torch.Tensor] = None, next_actions: Optional[torch.Tensor] = None, possible_next_actions_mask: Optional[torch.Tensor] = None, possible_next_actions_state_concat: Optional[torch.Tensor] = None, not_terminal: Optional[torch.Tensor] = None, time_diffs: Optional[torch.Tensor] = None, metrics: Optional[torch.Tensor] = None, step: Optional[torch.Tensor] = None, max_num_actions: Optional[int] = None)

Bases: object

actions
as_discrete_maxq_training_batch()
as_discrete_sarsa_training_batch()
as_parametric_maxq_training_batch()
as_parametric_sarsa_training_batch()
max_num_actions
mdp_ids
metrics
next_actions
next_states
not_terminal
possible_actions_mask
possible_actions_state_concat
possible_next_actions_mask
possible_next_actions_state_concat
propensities
rewards
sequence_numbers
set_type(dtype)
size() → int
states
step
time_diffs

Module contents