Author manuscript; available in PMC: 2024 Apr 1.

Published in final edited form as: Transact Mach Learn Res. 2023 Jun;2023:https://openreview.net/forum?id=K0CAGgjYS1.

Table 1:

Effects of per-sample gradient clipping on gradient flow. Here ✓/✗ means that the property is guaranteed or not guaranteed, and the loss is the training-set loss. “Loss convergence” is conditioned on $H(t) \succ 0$.

Clipping type	NTK matrix	Symmetric NTK	Positive in quadratic form	Positive in eigenvalues	Loss convergence	Monotone loss decay	To zero loss
No clipping	$H \equiv \sum_{r} H_{r}$	✓	✓	✓	✓	✓	✓
Batch clipping	$c H \equiv c \sum_{r} H_{r}$	✓	✓	✓	✓	✓	✓
Large R clipping (Flat & layerwise)	$H \equiv \sum_{r} H_{r}$	✓	✓	✓	✓	✓	✓
Small R clipping (Flat)	$HC$	✗	✗	✓	✗	✗	✓
Small R clipping (Layerwise)	$\sum_{r} H_{r} C_{r}$	✗	✗	✗	✗	✗	✗
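
The three matrix-property columns of Table 1 can be checked numerically on a toy example. The following is a minimal sketch, not the paper's code: the helper `ntk_properties`, the two rank-one "layer" kernels `H1` and `H2`, and the diagonal clipping-factor matrices are all hypothetical values chosen for illustration, with clipping factors in $(0, 1]$. It relies on two standard facts: $x^\top M x$ equals the quadratic form of the symmetric part $(M + M^\top)/2$, and $HC$ is similar to $H^{1/2} C H^{1/2}$ when $H \succ 0$, so its eigenvalues stay positive even though $HC$ is not symmetric.

```python
import numpy as np

def ntk_properties(M, tol=1e-12):
    """Return (symmetric, positive quadratic form, positive eigenvalues) for a square matrix M."""
    symmetric = bool(np.allclose(M, M.T))
    # x^T M x only depends on the symmetric part of M, so the quadratic form
    # is positive exactly when (M + M^T)/2 is positive definite.
    quad_positive = bool(np.linalg.eigvalsh((M + M.T) / 2).min() > tol)
    # "Positive in eigenvalues" is read here as: every eigenvalue has positive real part.
    eig_positive = bool(np.linalg.eigvals(M).real.min() > tol)
    return symmetric, quad_positive, eig_positive

# Hypothetical per-layer kernels (each positive semi-definite); their sum H is positive definite.
H1 = np.outer([1.0, 1.0], [1.0, 1.0])
H2 = np.outer([1.0, 2.0], [1.0, 2.0])
H = H1 + H2

# Hypothetical per-sample clipping factors, placed on the diagonal.
C = np.diag([1.0, 0.01])   # flat clipping: one factor per sample
C1 = np.diag([0.1, 1.0])   # layerwise clipping: factors for layer 1
C2 = np.diag([1.0, 0.1])   # layerwise clipping: factors for layer 2

no_clip = H
batch_clip = 0.5 * H               # c H with a scalar c > 0
flat_clip = H @ C                  # H C
layerwise_clip = H1 @ C1 + H2 @ C2  # sum_r H_r C_r

print(ntk_properties(no_clip))         # (True, True, True)
print(ntk_properties(batch_clip))      # (True, True, True)
print(ntk_properties(flat_clip))       # (False, False, True)
print(ntk_properties(layerwise_clip))  # (False, False, False)
```

A single example can only show that a property is not guaranteed. It cannot show that one is. The last two rows are therefore counterexamples consistent with the ✗ entries, while the ✓ entries rest on the arguments in the paper.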