From 696b66b8168d7e8f689fa71058681b95da94a713 Mon Sep 17 00:00:00 2001
From: antoine_galataud
Date: Tue, 14 May 2024 14:58:25 +0200
Subject: [PATCH] work on doc

---
 doc/source/overview/index.rst | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/doc/source/overview/index.rst b/doc/source/overview/index.rst
index c5132e0..421a23c 100644
--- a/doc/source/overview/index.rst
+++ b/doc/source/overview/index.rst
@@ -23,7 +23,7 @@ In the continuous case, the expected value of :math:`f(x)` under :math:`p(x)` is
 
 .. math::
 
-    E_{x \sim p}[f(x)] = \int p(x) f(x) dx
+    E_{x \sim p(x)}[f(x)] = \int p(x) f(x) dx
 
 Now, let's say we don't have access to samples from :math:`p(x)`, but we have samples from another
 distribution :math:`q(x)`.
@@ -31,12 +31,12 @@ We can still compute the expected value of :math:`f(x)` by using the samples fro
 
 .. math::
 
-    E_{x \sim p}[f(x)] = \int p(x) f(x) dx
+    E_{x \sim p(x)}[f(x)] = \int p(x) f(x) dx
         = \int p(x) \frac{q(x)}{q(x)} f(x) dx
         = \int q(x) \frac{p(x)}{q(x)} f(x) dx
-        = E_{x \sim q}[\frac{p(x)}{q(x)} f(x)]
+        = E_{x \sim q(x)}[\frac{p(x)}{q(x)} f(x)]
 
-Wrapping that up, :math:`E_{x \sim p}[f(x)] = E_{x \sim q}[\frac{p(x)}{q(x)} f(x)]`, which means
+Wrapping that up, :math:`E_{x \sim p(x)}[f(x)] = E_{x \sim q(x)}[\frac{p(x)}{q(x)} f(x)]`, which means
 the expected value of :math:`f(x)` under :math:`p(x)` is equal to the expected value of
 :math:`\frac{p(x)}{q(x)} f(x)` under :math:`q(x)`.
 
@@ -83,6 +83,7 @@ Among other generic considerations, there are two assumptions that must be satis
 - *coverage*: the behavior policy must have a non-zero probability of taking all the actions that the
   evaluation policy could take. In Hopes, we deal with this as much as possible by ensuring a small
   probability of taking all the actions in the behavior policy, especially in the deterministic case.
-- *positivity*: the rewards must be non-negative to be able to get a lower bound estimate of the target policy.
+- *positivity*: the rewards must be non-negative to be able to get a lower bound estimate of the target policy. In Hopes,
+  rewards can be rescaled to make them positive (using MinMaxScaler).
 
 .. [#] in the context of off-policy policy gradient methods, but that's out of the scope of this project.
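
A quick numerical sanity check of the identity touched by this patch (illustrative only, not part of the Hopes API): the sketch below estimates :math:`E_{x \sim p(x)}[f(x)]` using only samples drawn from :math:`q(x)`, reweighting each sample by :math:`\frac{p(x)}{q(x)}`. The particular choices of :math:`p`, :math:`q` and :math:`f` are arbitrary; :math:`q` is deliberately wider than :math:`p` so the coverage assumption holds.

.. code-block:: python

    # Minimal sketch: importance sampling estimate of E_{x~p}[f(x)] from samples of q(x).
    # p, q and f below are arbitrary illustrative choices, not anything defined in Hopes.
    import numpy as np
    from scipy import stats

    p = stats.norm(loc=1.0, scale=1.0)   # target distribution p(x); E_p[x^2] = 1^2 + 1 = 2
    q = stats.norm(loc=0.0, scale=2.0)   # sampling (behavior) distribution q(x), wider than p

    def f(x):
        return x ** 2                    # function whose expectation we want

    x = q.rvs(size=200_000, random_state=0)   # samples drawn from q only
    weights = p.pdf(x) / q.pdf(x)             # importance ratios p(x)/q(x)

    is_estimate = np.mean(weights * f(x))     # E_{x~q}[(p/q) f] ~= E_{x~p}[f]
    mc_estimate = np.mean(f(p.rvs(size=200_000, random_state=1)))  # direct check with p samples

    print(f"importance sampling estimate: {is_estimate:.3f}")  # both should be close to 2.0
    print(f"direct Monte Carlo estimate:  {mc_estimate:.3f}")

If :math:`q(x)` assigned zero probability somewhere :math:`p(x)` does not (a coverage violation), the ratio would be undefined there and the estimate would be biased, which is exactly why the coverage assumption in the patched section matters.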