Talk about gradients in algorithmic_differentiation.md #457

Open
Jollywatt wants to merge 2 commits into main from docs

Conversation

Jollywatt

This PR adds another "aside" section explaining why the choice of inner product does not affect the gradient of functions. I also took the liberty of changing the wording in some places.


codecov bot commented Feb 4, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

@Jollywatt Jollywatt force-pushed the docs branch 3 times, most recently from 1bdbe0b to ccf16e1 Compare February 5, 2025 15:35
@Jollywatt Jollywatt changed the title Minor additions to algorithmic_differentiation.md Talk about gradients in algorithmic_differentiation.md Feb 5, 2025
willtebbutt (Member) left a comment


I really like this -- it's a fantastic extension to the docs that ties a lot together. I just have a few small comments, but think it's basically good to go.


The role of the adjoint is revealed when we consider ``f := \mathcal{l} \circ g``, where ``g : \mathcal{X} \to \mathcal{Y}``, ``\mathcal{l}(y) := \langle \bar{y}, y \rangle``, and ``\bar{y} \in \mathcal{Y}`` is some fixed vector.
Noting that ``D \mathcal{l} [y](\dot{y}) = \langle \bar{y}, \dot{y} \rangle``, we apply the chain rule to obtain
An alternative characterisation is that ``\nabla f(x)`` is the vector pointing in the direction of steepest ascent on ``f`` at ``x``, with magnitude equal to the directional derivative in that steepest direction.
willtebbutt (Member):

I think I'm not quite clear what is meant by "with magnitude equal to the directional derivative in that steepest direction" -- is there a precise mathematical statement by which you can explain what this means?
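
One way to make this precise, assuming the norm induced by the chosen inner product:

```math
\max_{\lVert v \rVert = 1} D f[x](v) = \lVert \nabla f(x) \rVert,
\qquad \text{attained at } v = \frac{\nabla f(x)}{\lVert \nabla f(x) \rVert},
```

which follows from ``D f[x](v) = \langle \nabla f(x), v \rangle`` together with the Cauchy--Schwarz inequality.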

Notice that the value of the gradient depends on how the inner product on ``\mathcal{X}`` is defined.
Indeed, different choices of inner product result in different values of ``\nabla f``.
Adjoints such as ``D f[x]^*`` are also inner product dependent.
However, the actual derivative ``D f[x]`` is of course invariant -- it makes no reference to the inner product.
willtebbutt (Member):

Is this correct, technically? We make use of the norms for both X and Y in the definition of the Fréchet derivative, which I've been assuming we take to be the norms induced by whichever inner products we pick on X and Y. Would it be more accurate to point out that the definition is invariant because all norms are equivalent in finite dimensions?
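
A small Julia sketch of the inner-product dependence discussed here; the function and weight matrix below are arbitrary illustrations, not anything from Mooncake:

```julia
using LinearAlgebra

# f(x) = x₁² + 2x₂; its derivative at x is the linear map
# ẋ ↦ 2x₁ẋ₁ + 2ẋ₂, with coefficient vector df.
x = [3.0, 1.0]
df = [2 * x[1], 2.0]

# Euclidean inner product: the gradient is just df.
grad_euclidean = df

# Weighted inner product ⟨a, b⟩_W = a' * W * b, with W symmetric positive definite.
# The gradient g must satisfy ⟨g, ẋ⟩_W = D f[x](ẋ) for all ẋ, i.e. W * g = df.
W = [2.0 0.0; 0.0 1.0]
grad_weighted = W \ df   # [3.0, 2.0], versus grad_euclidean = [6.0, 2.0]

# The derivative itself is unchanged: both gradients reproduce D f[x](ẋ).
ẋ = [1.0, 0.0]
@assert grad_weighted' * W * ẋ ≈ grad_euclidean' * ẋ
```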

Comment on lines +433 to +434
In practice, Mooncake uses the Euclidean inner product, extended in the "obvious way" to other composite data types (that is, as if everything is flattened and embedded in ``\mathbb{R}^N``).
But we endeavour to keep the discussion general in order to make the role of the inner product explicit.
willtebbutt (Member):

Suggested change
In practice, Mooncake uses the Euclidean inner product, extended in the "obvious way" to other composite data types (that is, as if everything is flattened and embedded in ``\mathbb{R}^N``).
But we endeavour to keep the discussion general in order to make the role of the inner product explicit.
In practice, Mooncake uses the Euclidean inner product, extended in the "obvious way" to other composite data types (that is, as if everything is flattened and embedded in ``\mathbb{R}^N``), but we endeavour to keep the discussion general in order to make the role of the inner product explicit.

grammar
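
As a concrete illustration of that "obvious way", here is a hypothetical `flatten`-based extension of the Euclidean inner product to tuples and arrays (a sketch, not Mooncake's actual machinery):

```julia
# Flatten composite data into a vector in ℝ^N, then take the usual dot product.
flatten(x::Real) = [float(x)]
flatten(x::AbstractArray{<:Real}) = vec(float.(x))
flatten(x::Tuple) = reduce(vcat, map(flatten, x))

inner(a, b) = sum(flatten(a) .* flatten(b))

inner((1.0, [2.0, 3.0]), (4.0, [5.0, 6.0]))  # 1*4 + 2*5 + 3*6 = 32.0
```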

from which we conclude that ``D g [x]^\ast (\bar{y})`` is the gradient of the composition ``l \circ g`` at ``x``.
where the second equality follows from the gradient's implicit definition.
willtebbutt (Member):

Suggested change
where the second equality follows from the gradient's implicit definition.
where the second equality follows from the gradient's definition.

Reading this, I briefly thought that we had multiple definitions of the gradient lying around, and the one you are using here is the "implicit" one, before realising you're just trying to point out that our definition of the gradient is implicit. I wonder if others might read it in the same way, meaning that it's better just to refer to the "gradient's definition"?
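
For reference, the implicit definition in question characterises ``\nabla f(x)`` as the unique vector satisfying

```math
\langle \nabla f(x), \dot{x} \rangle = D f[x](\dot{x}) \quad \text{for all } \dot{x} \in \mathcal{X},
```

so matching the quoted chain of equalities against this definition identifies ``D g[x]^\ast (\bar{y})`` as the gradient of ``l \circ g`` at ``x``.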


_**Example**_

The adjoint derivative of ``f(x, y) = x + y_1 y_2`` (see [above](#AD-of-a-Julia-function:-a-slightly-less-trivial-example)) immediately gives
willtebbutt (Member):

Suggested change
The adjoint derivative of ``f(x, y) = x + y_1 y_2`` (see [above](#AD-of-a-Julia-function:-a-slightly-less-trivial-example)) immediately gives
The adjoint of the derivative of ``f(x, y) = x + y_1 y_2`` (see [above](#AD-of-a-Julia-function:-a-slightly-less-trivial-example)) immediately gives

nit-pick: I don't believe we refer to the "adjoint derivative" anywhere, but we do refer to the "adjoint" and the "adjoint of the derivative" interchangeably. Is this a typo, or ought we to be talking about the "adjoint derivative"?

```math
\nabla f(x, y) = D f[x, y]^\ast (1) = (1, (y_2, y_1)) .
```
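
A quick numerical sanity check of this example in plain Julia, using central finite differences (illustrative only):

```julia
# Check ∇f(x, y) = (1, (y₂, y₁)) for f(x, y) = x + y₁y₂.
f(x, y) = x + y[1] * y[2]

x, y, h = 2.0, (3.0, 5.0), 1e-6
dfdx  = (f(x + h, y) - f(x - h, y)) / 2h                        # ≈ 1.0
dfdy1 = (f(x, (y[1] + h, y[2])) - f(x, (y[1] - h, y[2]))) / 2h  # ≈ y[2] = 5.0
dfdy2 = (f(x, (y[1], y[2] + h)) - f(x, (y[1], y[2] - h))) / 2h  # ≈ y[1] = 3.0
```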

_**Aside: Adjoint Derivatives as Gradients**_
willtebbutt (Member):

Same here -- "Adjoint Derivatives" vs "Adjoint" or "Adjoint of the Derivative" etc


To compute the gradient in forwards-mode, we need to evaluate the forwards pass ``\dim \mathcal{X}`` times.
We also need to refer to a basis ``\{\mathbf{e}_i\}`` of ``\mathcal{X}`` and its reciprocal basis ``\{\mathbf{e}^i\}`` defined by ``\langle \mathbf{e}_i, \mathbf{e}^j \rangle = \delta_i^j``.
(For any basis there exists such a reciprocal basis, and they are the same if the basis is orthonormal.)
willtebbutt (Member):

Suggested change
(For any basis there exists such a reciprocal basis, and they are the same if the basis is orthonormal.)
For any basis there exists such a reciprocal basis, and they are the same for orthonormal bases such as the standard basis. As a result, you can replace any occurrences of ``\{\mathbf{e}^i\}`` with ``\{\mathbf{e}_i\}`` in what follows and still have a correct understanding of the mathematics underpinning Mooncake.

nit-pick: I don't think we need the brackets here, and I think it would be good to allude to the standard basis.

What are your thoughts on my bit about "replace occurrences of..."? I'm not 100% certain I've phrased this perfectly, but I would like to reassure readers that this is indeed the consequence of the previous sentence. Maybe I'm overthinking it...
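
A compact Julia sketch of the reciprocal basis, together with the standard forwards-mode gradient construction ``\nabla f(x) = \sum_i D f[x](\mathbf{e}_i) \, \mathbf{e}^i`` (the basis and function below are arbitrary examples):

```julia
using LinearAlgebra

# Basis vectors e₁, e₂ as the columns of E (Euclidean inner product assumed).
E = [1.0 1.0; 0.0 1.0]
# Reciprocal basis: ⟨eᵢ, eʲ⟩ = δᵢʲ means E' * R == I, so R = inv(E)'.
R = inv(E)'
@assert E' * R ≈ I   # for an orthonormal E (e.g. the standard basis), R == E

# Forwards-mode gradient: one derivative evaluation per basis vector,
# i.e. dim(𝒳) evaluations in total.
f(v) = v[1]^2 + v[1] * v[2]
Df(v, v̇) = (2 * v[1] + v[2]) * v̇[1] + v[1] * v̇[2]  # D f[v](v̇)

v = [1.0, 2.0]
∇f = sum(Df(v, E[:, i]) * R[:, i] for i in 1:2)  # [4.0, 1.0]
```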
