Rewards normalising in reinforcement learning

I have 20 copies of an environment that gives a reward of 0.1 when the agent reaches its target and 0 otherwise.

What I want to do is make sense of how to normalise the rewards. Suppose I run the environments for 300 time steps, so the reward matrix has shape 300x20 (timesteps x environments).

I usually normalise by doing:

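import torch

# rewards: tensor of shape (300, 20) = (timesteps, environments)
discounted_rewards = torch.zeros_like(rewards, dtype=torch.float32, device=device)
running_add = torch.zeros_like(discounted_rewards[0])  # one running return per environment
for t in reversed(range(len(rewards))):
    running_add = rewards[t] + discount * running_add
    discounted_rewards[t] = running_add

# dim 0 is the time axis: one mean/std per environment, taken over its 300 timesteps
mean = discounted_rewards.mean(0, keepdim=True)
std = discounted_rewards.std(0, keepdim=True) + 1e-10
discounted_rewards = (discounted_rewards - mean) / std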

However, while doing an assignment I instead computed the mean and standard deviation across the environments rather than across the timesteps, i.e.

# dim 1 is the environment axis: one mean/std per timestep, taken over the 20 environments
mean = discounted_rewards.mean(1, keepdim=True)
std = discounted_rewards.std(1, keepdim=True) + 1e-10
discounted_rewards = (discounted_rewards - mean) / std

and this seems to train faster. I was using PPO, if that helps.
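For anyone who wants to reproduce the comparison, here is a minimal self-contained sketch of the two variants. The synthetic reward tensor, the 5% reward probability, the discount of 0.99 and the helper names `discount_rewards` and `normalise` are all assumptions made for illustration; none of them come from my actual training code.

```python
import torch

torch.manual_seed(0)
T, N = 300, 20       # timesteps x environments, as described above
discount = 0.99      # assumed value, for illustration only

# Synthetic sparse rewards: 0.1 with (assumed) 5% probability, otherwise 0
rewards = 0.1 * (torch.rand(T, N) < 0.05).float()


def discount_rewards(rewards, discount):
    """Discounted return at every timestep, computed backwards in time."""
    out = torch.zeros_like(rewards)
    running_add = torch.zeros_like(rewards[0])  # one running return per environment
    for t in reversed(range(len(rewards))):
        running_add = rewards[t] + discount * running_add
        out[t] = running_add
    return out


def normalise(x, dim, eps=1e-10):
    """Subtract the mean and divide by the std taken along `dim`."""
    mean = x.mean(dim, keepdim=True)
    std = x.std(dim, keepdim=True) + eps
    return (x - mean) / std


returns = discount_rewards(rewards, discount)
per_env = normalise(returns, dim=0)   # statistics over time, one pair per environment
per_step = normalise(returns, dim=1)  # statistics over environments, one pair per timestep

# Each variant is zero-mean along the axis it was normalised over
print(per_env.mean(0).abs().max())
print(per_step.mean(1).abs().max())
```

The only difference between the two variants is the `dim` argument: `dim=0` compares each return against other timesteps of the same environment, while `dim=1` compares it against the other environments at the same timestep.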

So my questions are:

Which way are you supposed to normalise?

I read this SO post which mentions that normalisation "doesn't mess with the sign of the gradient". However, if the reward is less than the mean, normalisation does change the sign of the gradient, doesn't it?

(Optional) Why does normalisation even work?
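To make the sign question concrete, here is a small sketch with made-up numbers. The list `returns` is hypothetical, and the sketch assumes a policy-gradient style update in which each log-probability is weighted by its (normalised) return, so the sign of that weight decides the direction of the update.

```python
from statistics import mean, stdev

# Hypothetical discounted returns for one environment; all non-negative,
# as they would be with rewards of 0 or 0.1
returns = [0.0, 0.05, 0.1, 0.3]

mu = mean(returns)
sigma = stdev(returns) + 1e-10  # sample std, matching the default of torch's .std()

# Dividing by a positive number cannot change any sign
scaled_only = [r / sigma for r in returns]

# Subtracting the mean makes every below-mean return negative
normalised = [(r - mu) / sigma for r in returns]

for r, s, n in zip(returns, scaled_only, normalised):
    print(f"return={r:.2f}  scaled={s:+.2f}  normalised={n:+.2f}")
```

In this example, three of the four returns lie below the mean and come out negative after normalisation, even though every raw return is non-negative. The division by the standard deviation alone leaves all signs unchanged.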

asked Nov 21 '18 at 0:44

sachinruk


deep-learning reinforcement-learning
