Talk about gradients in algorithmic_differentiation.md
#457
base: main
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
algorithmic_differentiation.md
I really like this -- it's a fantastic extension to the docs that ties a lot together. I just have a few small comments, but think it's basically good to go.
The role of the adjoint is revealed when we consider ``f := \mathcal{l} \circ g``, where ``g : \mathcal{X} \to \mathcal{Y}``, ``\mathcal{l}(y) := \langle \bar{y}, y \rangle``, and ``\bar{y} \in \mathcal{Y}`` is some fixed vector.
Noting that ``D \mathcal{l} [y](\dot{y}) = \langle \bar{y}, \dot{y} \rangle``, we apply the chain rule to obtain
An alternative characterisation is that ``\nabla f(x)`` is the vector pointing in the direction of steepest ascent on ``f`` at ``x``, with magnitude equal to the directional derivative in that steepest direction.
I think I'm not quite clear what is meant by "with magnitude equal to the directional derivative in that steepest direction" -- is there a precise mathematical statement by which you can explain what this means?
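For what it's worth, one precise statement consistent with the quoted text (a sketch, assuming ``\nabla f(x) \neq 0``) would be

```math
\frac{\nabla f(x)}{\|\nabla f(x)\|} = \arg\max_{\|v\| = 1} D f[x](v), \qquad \|\nabla f(x)\| = \max_{\|v\| = 1} D f[x](v),
```

which follows from Cauchy-Schwarz applied to ``D f[x](v) = \langle \nabla f(x), v \rangle``.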
Notice that the value of the gradient depends on how the inner product on ``\mathcal{X}`` is defined.
Indeed, different choices of inner product result in different values of ``\nabla f``.
Adjoints such as ``D f[x]^*`` are also inner product dependent.
However, the actual derivative ``D f[x]`` is of course invariant -- it makes no reference to the inner product.
Is this correct, technically? We make use of the norms for both X and Y in the definition of the Fréchet derivative, which I've been assuming we take to be the norms induced by whichever inner products we pick on X and Y. Would it be more accurate to point out that the definition is invariant because all norms are equivalent in finite dimensions?
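To make the distinction concrete, here is a minimal plain-Julia sketch (illustrative only, not Mooncake code; ``f`` and ``M`` are made up): the number ``D f[x](v)`` is the same however we represent it, while the gradient vector representing it changes with the choice of inner product.

```julia
# Plain-Julia sketch (illustrative only, not Mooncake's API).
# For f(x) = x[1]^2 + 2x[2] the derivative acts as Df[x](v) = dot(g, v),
# where g is the Euclidean gradient. Under a weighted inner product
# <u, v>_M = u' * M * v, the vector representing the same action is M \ g.
using LinearAlgebra

f(x) = x[1]^2 + 2x[2]
x = [3.0, 1.0]
g = [2x[1], 2.0]              # Euclidean gradient of f at x
M = Diagonal([4.0, 1.0])      # a different choice of inner product
grad_M = M \ g                # gradient of f at x w.r.t. <., .>_M

v = [0.5, -2.0]
dot(g, v) ≈ dot(grad_M, M * v)  # true: same Df[x](v), different gradient vectors
```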
In practice, Mooncake uses the Euclidean inner product, extended in the "obvious way" to other composite data types (that is, as if everything is flattened and embedded in ``\mathbb{R}^N``).
But we endeavour to keep the discussion general in order to make the role of the inner product explicit.
Suggested change:
In practice, Mooncake uses the Euclidean inner product, extended in the "obvious way" to other composite data types (that is, as if everything is flattened and embedded in ``\mathbb{R}^N``), but we endeavour to keep the discussion general in order to make the role of the inner product explicit.
grammar
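On the quoted sentence itself, here is a small sketch of what "flattened and embedded in ``\mathbb{R}^N``" could look like in code (the ``flatten`` helper is hypothetical and purely illustrative, not part of Mooncake's API):

```julia
# Sketch of the "obvious" extension of the Euclidean inner product to composite
# data: flatten everything into one vector in R^N and take the usual dot product.
using LinearAlgebra

flatten(x::Real) = [float(x)]
flatten(x::AbstractArray{<:Real}) = vec(float.(x))
flatten(x::Union{Tuple, NamedTuple}) = reduce(vcat, map(flatten, values(x)))

euclidean_inner(a, b) = dot(flatten(a), flatten(b))

a = (x = 2.0, y = (3.0, 4.0))
b = (x = 1.0, y = (0.5, 2.0))
euclidean_inner(a, b)  # 2*1 + 3*0.5 + 4*2 = 11.5
```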
from which we conclude that ``D g [x]^\ast (\bar{y})`` is the gradient of the composition ``l \circ g`` at ``x``.
where the second equality follows from the gradient's implicit definition.
Suggested change:
where the second equality follows from the gradient's definition.
Reading this, I briefly thought that we had multiple definitions of the gradient lying around, and the one you are using here is the "implicit" one, before realising you're just trying to point out that our definition of the gradient is implicit. I wonder if others might read it in the same way, meaning that it's better just to refer to the "gradient's definition"?
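For readers following the excerpt, the chain of equalities in question is (a sketch reconstructed from the definitions quoted above)

```math
D f [x](\dot{x}) = D \mathcal{l} [g(x)] \left( D g [x] (\dot{x}) \right) = \langle \bar{y}, D g [x] (\dot{x}) \rangle = \langle D g [x]^\ast (\bar{y}), \dot{x} \rangle ,
```

and comparing the last expression with ``D f[x](\dot{x}) = \langle \nabla f(x), \dot{x} \rangle`` gives ``\nabla (\mathcal{l} \circ g)(x) = D g [x]^\ast (\bar{y})``.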
_**Example**_
The adjoint derivative of ``f(x, y) = x + y_1 y_2`` (see [above](#AD-of-a-Julia-function:-a-slightly-less-trivial-example)) immediately gives |
Suggested change:
The adjoint of the derivative of ``f(x, y) = x + y_1 y_2`` (see [above](#AD-of-a-Julia-function:-a-slightly-less-trivial-example)) immediately gives
nit-pick: I don't believe we refer to the "adjoint derivative" anywhere, but we do refer to the "adjoint" and the "adjoint of the derivative" interchangeably. Is this a typo, or ought we to be talking about the "adjoint derivative"?
```math
\nabla f(x, y) = D f[x, y]^\ast (1) = (1, (y_2, y_1)) .
```
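As a quick sanity check on this example, here is a plain-Julia sketch (the names ``df`` and ``df_adjoint`` are illustrative, not Mooncake's API) verifying the adjoint identity ``\langle \bar{y}, D f[x, y](\dot{x}, \dot{y}) \rangle = \langle D f[x, y]^\ast(\bar{y}), (\dot{x}, \dot{y}) \rangle`` numerically:

```julia
# Sketch: check that (1, (y[2], y[1])) really is Df[x, y]^*(1) for
# f(x, y) = x + y[1] * y[2], by verifying the adjoint identity for a few inputs.
f(x, y) = x + y[1] * y[2]

# Forwards: the derivative's action on a tangent (xdot, ydot).
df(x, y, xdot, ydot) = xdot + ydot[1] * y[2] + y[1] * ydot[2]

# Reverse: the adjoint's action on a cotangent ybar.
df_adjoint(x, y, ybar) = (ybar, (ybar * y[2], ybar * y[1]))

x, y = 2.0, (3.0, 5.0)
xdot, ydot = 0.7, (-1.0, 4.0)
ybar = 1.3

lhs = ybar * df(x, y, xdot, ydot)
xbar, yadj = df_adjoint(x, y, ybar)
rhs = xbar * xdot + yadj[1] * ydot[1] + yadj[2] * ydot[2]
lhs ≈ rhs  # true; with ybar = 1 the adjoint returns the gradient (1, (y[2], y[1]))
```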
_**Aside: Adjoint Derivatives as Gradients**_
Same here -- "Adjoint Derivatives" vs "Adjoint" or "Adjoint of the Derivative" etc
To compute the gradient in forwards-mode, we need to evaluate the forwards pass ``\dim \mathcal{X}`` times.
We also need to refer to a basis ``\{\mathbf{e}_i\}`` of ``\mathcal{X}`` and its reciprocal basis ``\{\mathbf{e}^i\}`` defined by ``\langle \mathbf{e}_i, \mathbf{e}^j \rangle = \delta_i^j``.
(For any basis there exists such a reciprocal basis, and they are the same if the basis is orthonormal.)
Suggested change:
For any basis there exists such a reciprocal basis, and they are the same for orthonormal bases such as the standard basis. As a result, you can replace any occurrences of ``\{\mathbf{e}^i\}`` with ``\{\mathbf{e}_i\}`` in what follows and still have a correct understanding of the mathematics underpinning Mooncake.
nit-pick: I don't think we need the brackets here, and I think it would be good to allude to the standard basis.
What are your thoughts on my bit about "replace occurrences of..."? I'm not 100% certain I've phrased this perfectly, but I would like to reassure readers that this is indeed the consequence of the previous sentence. Maybe I'm overthinking it...
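To make the forwards-mode recipe concrete with the standard basis (where ``\mathbf{e}^i = \mathbf{e}_i``), here is a plain-Julia sketch (illustrative only; the directional derivatives are approximated with finite differences rather than computed by Mooncake):

```julia
# Sketch: with an orthonormal basis the gradient is recovered in forwards-mode
# as sum_i Df[x](e_i) * e_i, i.e. one forwards pass per dimension of X.
function forwards_mode_gradient(f, x::Vector{Float64}; h = 1e-6)
    grad = zeros(length(x))
    for i in eachindex(x)
        e = zeros(length(x)); e[i] = 1.0                 # standard basis vector e_i
        grad[i] = (f(x + h * e) - f(x - h * e)) / (2h)   # ≈ Df[x](e_i)
    end
    return grad
end

f(x) = x[1] + x[2] * x[3]
forwards_mode_gradient(f, [2.0, 3.0, 5.0])  # ≈ [1.0, 5.0, 3.0]
```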
This PR contains the addition of another "aside" section explaining why the choice of inner product does not affect the gradient of functions. I also took the liberty to change the wording in some places.