The next chapter will question some fundamental aspects of the formulations so far -- namely the gradients -- and aim for an even tighter integration of physics and learning. The approaches explained previously all integrate physical models into deep learning algorithms, either as a physics-informed (PI) loss function or via differentiable physics (DP) operators embedded into the network. In the PI case, the simulator is only required at training time, while for DP approaches it is also employed at inference time, which enables an end-to-end training of NNs and numerical solvers. Both rely on first order derivatives to drive optimizations and learning processes, and so far we haven't questioned whether this is actually the best choice.
A central insight of the following chapter is that regular gradients are often a sub-optimal choice for learning problems involving physical quantities.
Treating network and simulator as separate systems instead of a single black box, we'll derive a different update step that replaces the gradient of the simulator.
As this gradient is closely related to a regular gradient, but computed via physical model equations,
we refer to this update (proposed by Holl et al. {cite}`holl2021pg`) as the _physical gradient_ (PG).
Below, we'll proceed in the following steps:
- we'll first show the problems with regular gradient descent, especially for functions that combine small and large scales,
- a central insight will be that an _inverse gradient_ is a lot more meaningful than the regular one,
- finally, we'll show how to use inverse functions (and especially inverse PDE solvers) to compute a very accurate update that includes higher-order terms.
All NNs of the previous chapters were trained with gradient descent (GD) via backpropagation. GD, and hence backpropagation, was also employed for the PDE solver (the simulator).
In the field of classical optimization, techniques such as Newton's method or BFGS variants are commonly used to optimize numerical processes, since they can offer better convergence speed and stability. These methods likewise employ gradient information, but substantially differ from GD in the way they compute the update step, typically via higher order derivatives {cite}`nocedal2006numerical`.
The PG, which we'll derive below, can take nonlinearities into account to produce better optimization updates when a (full or approximate) inverse simulator is available. In contrast to classic optimization techniques, we show how a differentiable or invertible physics simulator can be leveraged to compute the PG without requiring higher-order derivatives of the simulator.
In the following, we will stop using GD for everything, and instead use the aforementioned PGs for the simulator. This update is combined with a GD-based step for updating the weights of the NNs. This setup, consisting of two fundamentally different optimization schemes, will result in an improved end-to-end training.
% TODO figure (name: pg-training): visual overview of PG training
We'll start by revisiting the most commonly used optimization methods -- gradient descent (GD) and quasi-Newton methods -- and describe their fundamental limits and drawbacks on a theoretical level.
As before, let $L(x)$ denote a scalar loss function that we want to minimize with respect to the parameters $x$.
The optimization updates of GD are then given by
$$ \Delta x = -\eta \cdot \frac{\partial L}{\partial x} $$ (GD-update)
where $\eta$ denotes the learning rate (step size).
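As a minimal sketch (a toy example of our own, not from the original text), this is what the GD update of equation {eq}`GD-update` looks like in code for a simple quadratic loss:

```python
import numpy as np

def gd_update(x, grad_L, eta):
    """One gradient descent step, Delta x = -eta * dL/dx."""
    return x - eta * grad_L(x)

# toy loss L(x) = 0.5 * ||x - target||^2 with gradient (x - target)
target = np.array([1.0, -2.0])
grad_L = lambda x: x - target

x = np.zeros(2)
for _ in range(200):
    x = gd_update(x, grad_L, eta=0.05)
print(x)   # close to [1., -2.]
```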
Units 📏
A first indicator that something is amiss with GD is that it inherently misrepresents dimensions.
Assume two parameters $x_1$ and $x_2$ that are measured in different physical units. The gradient components $\partial L / \partial x_i$ then carry the units of $L$ divided by the units of the respective $x_i$, i.e. the GD update {eq}`GD-update` does not have the units of the parameter it is applied to, and a single learning rate $\eta$ cannot yield sensibly scaled steps for both parameters at the same time.
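To make this concrete, here is a small numerical illustration (a toy setup of our own): two parameters describe the same distance of one kilometer, once in meters and once in kilometers, and the loss measures the squared deviation from a two kilometer target.

```python
# x1 is measured in meters, x2 in kilometers; both currently represent 1 km,
# and the loss L = (x1/1000 - 2)^2 + (x2 - 2)^2 is measured in km^2.
dL_dx1 = lambda x1, x2: 2.0 * (x1 / 1000.0 - 2.0) / 1000.0   # units: km^2 per m
dL_dx2 = lambda x1, x2: 2.0 * (x2 - 2.0)                     # units: km^2 per km

eta = 0.1
x1, x2 = 1000.0, 1.0
print(-eta * dL_dx1(x1, x2))   # 0.0002  -> a 0.2 mm step for the meter-valued parameter
print(-eta * dL_dx2(x1, x2))   # 0.2     -> a 200 m step for the km-valued parameter
```

Both parameters need the same physical correction of one kilometer, yet GD proposes steps that differ by six orders of magnitude, purely because of the choice of units.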
Function sensitivity 🔍
GD has inherent problems when functions are not normalized.
Assume the output of $L$ reacts strongly to changes of one parameter, but only very weakly to changes of another. The corresponding gradient components, and hence the GD updates, then differ by the same disproportionate factor, even if both parameters require corrections of similar size, and a single learning rate can hardly be chosen to suit both.
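The same issue shows up when a function is poorly normalized; a quick sketch with a scaled toy loss (again our own example):

```python
# L_s(x) = s * (x - 1)^2 has its minimum at x = 1 regardless of the scale s,
# but the GD step is proportional to s.
grad = lambda x, s: 2.0 * s * (x - 1.0)

x, eta = 3.0, 0.1
print(-eta * grad(x, s=1.0))    # -0.4     a reasonable step towards the optimum
print(-eta * grad(x, s=1e-3))   # -0.0004  same optimum, but a ~1000x smaller step
```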
Convergence near optimum 💎
The loss landscape of any differentiable function necessarily becomes flat close to an optimum
(the gradient approaches zero upon convergence).
Therefore the GD updates {eq}`GD-update` likewise vanish when approaching an optimum, and convergence slows down exactly where we would like to pin down the solution most accurately.
This is an important point, and we will revisit it below. It is also somewhat surprising at first: the diminishing updates can actually stabilize the training, but on the other hand they make the learning process difficult to control.
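A few GD iterations on a simple parabola (a toy example) show the effect: the closer we get to the minimum, the smaller the steps become.

```python
x, eta = 1.0, 0.1
for i in range(5):
    step = -eta * 2.0 * x        # gradient of L(x) = x^2 is 2x
    x += step
    print(i, step, x)            # each step closes only 20% of the remaining distance to the optimum
```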
Quasi-Newton methods, such as BFGS and its variants, evaluate the gradient $\frac{\partial L}{\partial x}$ and (an approximation of) the Hessian $\frac{\partial^2 L}{\partial x^2}$ to compute the update

$$ \Delta x = -\eta \cdot \left( \frac{\partial^2 L}{\partial x^2} \right)^{-1} \frac{\partial L}{\partial x}. $$ (quasi-newton-update)

where $\eta$ is again a step size; for the full Newton step it equals one.
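As a sketch of how such an update behaves (using an exact Newton step on a toy quadratic of our own choosing, rather than an actual BFGS implementation):

```python
import numpy as np

def newton_update(x, grad, hess, eta=1.0):
    """Newton step Delta x = -eta * H^{-1} grad, cf. equation (quasi-newton-update)."""
    return x - eta * np.linalg.solve(hess(x), grad(x))

# badly scaled quadratic L(x) = 0.5 * x^T A x with A = diag(1e-6, 1e6)
A = np.diag([1e-6, 1e6])
grad = lambda x: A @ x
hess = lambda x: A

x = np.array([1000.0, 0.001])
print(newton_update(x, grad, hess))   # [0., 0.] -- one step reaches the optimum despite the scaling
```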
Units 📏
Quasi-Newton methods definitely provide a much better handling of physical units than GD.
The quasi-Newton update from equation {eq}`quasi-newton-update` produces the correct units for all parameters to be optimized, and $\eta$ can hence remain a simple dimensionless factor. The dimensional bookkeeping is sketched below.
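Writing $[\,\cdot\,]$ for the physical unit of a quantity (a shorthand of our own), the units work out as

$$ \left[ \left( \frac{\partial^2 L}{\partial x^2} \right)^{-1} \frac{\partial L}{\partial x} \right] = \frac{[x]^2}{[L]} \cdot \frac{[L]}{[x]} = [x] $$

so the update is measured in the units of $x$ itself, for every parameter.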
Convergence near optimum 💎
Quasi-Newton methods also exhibit much faster convergence when the loss landscape is relatively flat.
Instead of slowing down, they take larger steps, even for a fixed $\eta$: as the loss landscape flattens, the Hessian shrinks together with the gradient, and multiplying by its inverse restores a usefully sized update.
Consistency in function compositions
So far, quasi-Newton methods address both shortcomings of GD. However, similar to GD, the update of an intermediate space still depends on all functions before that. This behavior stems from the fact that the Hessian of a function composition carries non-linear terms of the gradient.
Consider a function composition $L(z(x))$, where $z$ is an intermediate result, e.g. the output of a simulator, and $L$ the final scalar objective. The Hessian of the composed function mixes first and second derivatives of both $z$ and $L$ (see the worked chain rule below), so even after inverting it, the update for the intermediate quantity $z$ remains entangled with the properties of the other functions in the chain.
% chain of function evaluations: Hessian of an outer function is influenced by inner ones; inversion corrects and yields quantity similar to IG, but nonetheless influenced by "later" derivatives
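For a scalar composition this mixing is easy to see from the second-order chain rule:

$$ \frac{d^2}{dx^2} L\big(z(x)\big) = \frac{\partial^2 L}{\partial z^2} \left( \frac{\partial z}{\partial x} \right)^2 + \frac{\partial L}{\partial z} \frac{\partial^2 z}{\partial x^2} $$

The curvatures of the inner and outer functions appear only in combination, weighted by first derivatives of the respective other part, so inverting this Hessian cannot cleanly separate an update for the intermediate quantity $z$ from the rest of the chain.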
Dependence on Hessian 🎩
In addition, a fundamental disadvantage of quasi-Newton methods is their dependence on the Hessian of the full function.
The first obvious drawback is the computational cost.
While evaluating the exact Hessian only adds one extra pass to every optimization step, this pass involves higher-dimensional tensors than the computation of the gradient.
As the number of optimized parameters grows, the size of the Hessian grows quadratically, and both its memory footprint and the cost of handling it quickly become prohibitive for large models.
The quasi-Newton update above additionally requires the inverse Hessian matrix. Thus, a Hessian that is close to being non-invertible typically causes numerical stability problems, while inherently non-invertible Hessians require a fallback to a first order GD update.
Another related limitation of quasi-Newton methods is that the objective function needs to be twice-differentiable. While this may not seem like a big restriction, note that many common neural network architectures use ReLU activation functions of which the second-order derivative is zero. % Related to this is the problem that higher-order derivatives tend to change more quickly when traversing the parameter space, making them more prone to high-frequency noise in the loss landscape.
_Quasi-Newton Methods_
are still a very active research topic, and hence many extensions have been proposed that can alleviate some of these problems in certain settings. E.g., the memory requirement problem can be sidestepped by storing only lower-dimensional vectors that can be used to approximate the Hessian. However, these difficulties illustrate the problems that often arise when applying methods like BFGS.
%\nt{In contrast to these classic algorithms, we will show how to leverage invertible physical models to efficiently compute physical update steps. In certain scenarios, such as simple loss functions, computing the inverse gradient via the inverse Hessian will also provide a useful building block for our final algorithm.} %, and how to they can be used to improve the training of neural networks.
As a first step towards physical gradients, we introduce inverse gradients (IGs), which naturally solve many of the aforementioned problems.
Instead of the gradient of a scalar loss, we now consider changes of the (possibly vector-valued) output $z$ of our function, and define
$$ \Delta x = \frac{\partial x}{\partial z} \cdot \Delta z. $$ (IG-def)
to be the IG update.
Here, the Jacobian $\frac{\partial x}{\partial z}$, i.e. the inverse of the forward Jacobian $\frac{\partial z}{\partial x}$ used by backpropagation, tells us how the input changes in response to a change of the output.
Note that instead of using a learning rate, here the step size is determined by the desired increase or decrease of the value of the output, $\Delta z$.
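A minimal sketch of the IG update in code, for a toy two-component function of our own choosing (one badly scaled linear component, one nonlinear component):

```python
import numpy as np

def ig_update(x, jac, dz):
    """Inverse gradient step, Delta x = (dz/dx)^{-1} * Delta z, cf. equation (IG-def)."""
    return x + np.linalg.solve(jac(x), dz)

# toy function z(x) = (1e-3 * x1, x2^3) and its Jacobian
z   = lambda x: np.array([1e-3 * x[0], x[1] ** 3])
jac = lambda x: np.array([[1e-3, 0.0], [0.0, 3.0 * x[1] ** 2]])

x0 = np.array([2000.0, 2.0])
dz = np.array([1.0, 1.0]) - z(x0)      # desired change of the output, towards z* = (1, 1)
x_new = ig_update(x0, jac, dz)
print(x_new, z(x_new))   # the linear component hits its target exactly, the cubic one only approximately
```

Despite the factor of $10^6$ between the sensitivities of the two components, both updates are sensibly scaled; the remaining error for the nonlinear component is what the physical gradients discussed below will address.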
% Units
Positive Aspects
IGs scale with the inverse derivative. Hence the updates automatically have the same units as the parameters, without requiring an arbitrary learning rate: the Jacobian $\frac{\partial x}{\partial z}$ carries units of $x$ per unit of $z$, and multiplying it with $\Delta z$ yields a quantity measured in the units of $x$.
% Function sensitivity
They also don't have problems with normalization: where the output is only weakly sensitive to a parameter, the inverse Jacobian is correspondingly large, so the parameter updates from the sensitivity example above are automatically rescaled to a sensible magnitude.
% Convergence near optimum
IGs show the opposite behavior of GD close to an optimum: they typically produce very accurate updates, which don't vanish near an optimum. This leads to fast convergence, as we will demonstrate in more detail below.
% Consistency in function compositions
Additionally, IGs are consistent in function composition.
The change in an intermediate quantity of a chained function is obtained by applying the inverse Jacobians of the functions that follow it, and is hence independent of the functions evaluated before it; each intermediate update is correctly scaled on its own.
Note that even Newton's method with its inverse Hessian didn't fully get this right. The key here is that if the Jacobian is invertible, we'll directly get the correctly scaled direction at a given layer, without "helpers" such as the inverse Hessian.
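The composition consistency can be checked with a quick sketch (toy functions of our own choosing): for a chain $x \rightarrow y \rightarrow z$, the IG update of the intermediate $y$ only involves the outer function, regardless of how $x$ is mapped to $y$.

```python
import numpy as np

g_prime  = lambda y: 3.0 * y ** 2      # outer function g(y) = y^3
f1_prime = lambda x: 1e-3              # inner function variant 1: f1(x) = 1e-3 * x
f2_prime = lambda x: np.cos(x)         # inner function variant 2: f2(x) = sin(x)

y0, dz = 0.5, 0.1
dy = dz / g_prime(y0)                  # IG update for y: identical for both chains
dx1 = dy / f1_prime(500.0)             # continue backwards through f1 (x0 = 500 gives y0 = 0.5)
dx2 = dy / f2_prime(np.arcsin(0.5))    # continue backwards through f2 (x0 = arcsin(0.5))
print(dy, dx1, dx2)
```

The intermediate update $\Delta y$ is the same for both chains; only the final step back to $x$ differs, rescaled by the respective inner function.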
Limitations
So far so good.
The above properties make the advantages of IGs clear, but we're not done, unfortunately. There are strong limitations to their applicability.
%
The IG requires the Jacobian $\frac{\partial x}{\partial z}$ to exist, i.e. the function needs a square and non-singular Jacobian, and, being only a linearization, it is accurate just in a small neighborhood of the current point.
Thus, we now consider the fact that inverse gradients are linearizations of inverse functions and show that using inverse functions provides additional advantages while retaining the same benefits.
Physical processes can be described as a trajectory in state space where each point represents one possible configuration of the system. A simulator typically takes one such state space vector and computes a new one at another time. The Jacobian of the simulator is, therefore, necessarily square. % As long as the physical process does not destroy information, the Jacobian is non-singular. In fact, it is believed that information in our universe cannot be destroyed so any physical process could in theory be inverted as long as we have perfect knowledge of the state.
While evaluating the IGs directly can be done through matrix inversion or taking the derivative of an inverse simulator, we now consider what happens if we use the inverse simulator directly in backpropagation.
Let $\mathcal P$ denote our simulator, mapping an input state $x$ to an output $z = \mathcal P(x)$, and let $(x_0, z_0)$ be the current evaluation point with $z_0 = \mathcal P(x_0)$. Writing $\mathcal P_{(x_0,z_0)}^{-1}$ for an inverse of the simulator around this point, we define the PG for a desired output change $\Delta z$ as
$$ \frac{\Delta x}{\Delta z} \equiv \big( \mathcal P_{(x_0,z_0)}^{-1} (z_0 + \Delta z) - x_0 \big) / \Delta z = \frac{\partial x}{\partial z} + \mathcal O(\Delta z) $$ (PG-def)

Note that this PG is equal to the IG from the section above up to first order, but it contains nonlinear terms stemming from the inverse simulator, as indicated by the $\mathcal O(\Delta z)$ remainder.
% We now show that these terms can help produce more stable updates than the IG alone, provided that
The intuition for why the PG update is a good one is that when applying the update $\Delta x = \mathcal P_{(x_0,z_0)}^{-1}(z_0 + \Delta z) - x_0$, the simulator output for the updated input matches the target exactly, $\mathcal P(x_0 + \Delta x) = z_0 + \Delta z$, no matter how nonlinear $\mathcal P$ is, provided the inverse is exact.
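A small sketch makes this concrete (a toy simulator of our own choosing, with a known analytic inverse):

```python
import numpy as np

P     = lambda x: x ** 3        # toy "simulator"
P_inv = lambda z: np.cbrt(z)    # its exact inverse

x0 = 2.0
z0 = P(x0)                      # 8.0
dz = -7.0                       # desired output change, towards z0 + dz = 1.0

dx_ig = dz / (3.0 * x0 ** 2)    # IG update: linearization of the inverse
dx_pg = P_inv(z0 + dz) - x0     # PG update: query the inverse simulator directly

print(P(x0 + dx_ig))   # ~2.84 -- misses the target of 1.0 because of the nonlinearity
print(P(x0 + dx_pg))   # 1.0   -- the inverse simulator accounts for the higher-order terms
```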
Fundamental theorem of calculus
To more clearly illustrate the advantages in non-linear settings, we
apply the fundamental theorem of calculus to rewrite the ratio $\Delta x / \Delta z$ from above:
$$ \frac{\Delta x}{\Delta z} = \frac{\int_{z_0}^{z_0+\Delta z} \frac{\partial x}{\partial z} \, dz}{\Delta z} $$
Here the expression inside the integral is the local gradient $\frac{\partial x}{\partial z}$, and we assume it exists at all points between $z_0$ and $z_0+\Delta z$.
The equation naturally generalizes to higher dimensions by replacing the integral with a path integral along any differentiable path connecting $z_0$ and $z_0+\Delta z$.
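For the toy inverse $x(z) = z^{1/3}$ from the example above (i.e. $\mathcal P(x) = x^3$), we can verify this average-gradient identity numerically with a simple midpoint quadrature:

```python
import numpy as np

z0, dz = 8.0, -7.0
x      = lambda z: z ** (1.0 / 3.0)                  # inverse function x(z)
dx_dz  = lambda z: (1.0 / 3.0) * z ** (-2.0 / 3.0)   # its local gradient

zs  = np.linspace(z0, z0 + dz, 100001)
mid = 0.5 * (zs[:-1] + zs[1:])
avg_grad = np.sum(dx_dz(mid) * np.diff(zs)) / dz     # (1/dz) * integral of dx/dz
print(avg_grad, (x(z0 + dz) - x(z0)) / dz)           # both ~0.142857 = 1/7
```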
Let $\mathcal P^{-1}$ denote the exact, global inverse of the simulator, i.e. a function satisfying $\mathcal P^{-1}(\mathcal P(x)) = x$ for every state $x$.
Instead of using this "perfect" inverse, we can relax the requirements: a global inverse has to be accurate over the whole state space, which is rarely achievable in practice.
By contrast, a _local inverse_, defined at the point $(x_0, z_0)$, only needs to invert $\mathcal P$ accurately in a neighborhood of $x_0$ and $z_0$.
With the local inverse, the PG is defined as
$$ \frac{\Delta x}{\Delta z} \equiv \big( \mathcal P_{(x_0,z_0)}^{-1} (z_0 + \Delta z) - x_0 \big) / \Delta z $$ (local-PG-def)
For differentiable functions, a local inverse is guaranteed to exist by the inverse function theorem as long as the Jacobian is non-singular.
That is because the inverse Jacobian $\frac{\partial x}{\partial z}$ itself constitutes a valid local inverse, albeit only a first-order accurate one.
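As a sketch of how a local inverse could be obtained in practice when no analytic inverse is at hand, the snippet below simply runs a few Newton iterations on the forward function around the current point (only one possible choice, not the chapter's specific inverse solver):

```python
import numpy as np

def local_inverse(P, dP_dx, x0, z_target, iterations=20):
    """Approximate P^{-1}_{(x0,z0)}(z_target) by Newton iterations started at x0."""
    x = x0
    for _ in range(iterations):
        x = x - (P(x) - z_target) / dP_dx(x)
    return x

P     = lambda x: x + 0.1 * x ** 3      # toy simulator without a simple closed-form inverse
dP_dx = lambda x: 1.0 + 0.3 * x ** 2

x0, dz = 2.0, -1.5
z0 = P(x0)                                            # 2.8
dx = local_inverse(P, dP_dx, x0, z0 + dz) - x0        # Delta x, i.e. the PG times Delta z
print(x0 + dx, P(x0 + dx))                            # the updated input reproduces z0 + dz = 1.3
```

This local inverse is only valid near $(x_0, z_0)$, but that is all the PG update in equation {eq}`local-PG-def` requires.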
The update obtained with a regular gradient descent method has surprising shortcomings. The physical gradient instead allows us to more accurately backpropagate through nonlinear functions, provided that we have access to good inverse functions.
Before moving on to including PGs in NN training processes, the next section will illustrate the differences between these approaches with a practical example.