Gradient Descent

Gradient Descent

Concavity is valuable in connection with the Gradient Method of minimizing $J({\hat \theta})$ with respect to ${\hat \theta}$ .

Definition. The gradient of the error measure $J({\hat \theta})$ is defined as the ${\hat N}\times 1$ column vector

$\displaystyle {J^\prime}({\hat \theta})\mathrel{\stackrel{\mathrm{\Delta}}{=}}\... ...ial}{\partial \theta}{J(\theta)}{a_{n_a}}\left({\hat a}_{n_a}\right)\right]^T.$

Definition. The Gradient Method (Cauchy) is defined as follows.

Given ${\hat \theta}_0\in{\hat \Theta}$ , compute

$\displaystyle {\hat \theta}_{n+1} = {\hat \theta}_n - t_n {J^\prime}({\hat \theta}_n),\qquad n=1,2,\ldots\,,$

where ${J^\prime}({\hat \theta}_n)$ is the gradient of

at ${\hat \theta}_n$ , and $t_n\in\Re$ is chosen as the smallest nonnegative local minimizer of

$\displaystyle \Phi_n(t) \mathrel{\stackrel{\mathrm{\Delta}}{=}}J\left({\hat \theta}_n-t{J^\prime}({\hat \theta}_n)\right).$

Cauchy originally proposed to find the value of $t_n\geq 0$ which gave a global minimum of $\Phi_n(t)$ . This, however, is not always feasible in practice.

Some general results regarding the Gradient Method are given below.

Theorem. If ${\hat \theta}_0$ is a local minimizer of $J({\hat \theta})$ , and ${J^\prime}({\hat \theta}_0)$ exists, then ${J^\prime}({\hat \theta}_0)=0$ .

Theorem. The gradient method is a descent method, i.e., $J({\hat \theta}_{n+1})\leq J({\hat \theta}_n)$ .

Definition. $J:{\hat \Theta}\rightarrow \Re ^1$ , ${\hat \Theta}\subset\Re ^{\hat N}$ , is said to be in the class ${\cal C}_k({\hat \Theta})$ if all th order partial derivatives of $J({\hat \theta})$ with respect to the components of ${\hat \theta}$ are continuous on ${\hat \Theta}$ .

Definition. The Hessian ${J^{\prime\prime}}({\hat \theta})$ of at ${\hat \theta}$ is defined as the matrix of second-order partial derivatives,

$\displaystyle {J^{\prime\prime}}({\hat \theta}) [i,j] \mathrel{\stackrel{\mathr... ...c{\partial^2 J(\theta)}{\partial \theta[i]\partial \theta[j]} ({\hat \theta}),$

where $\theta[i]$ denotes the

th component of $\theta$ , $i=1,\ldots\,,{\hat N}={n_a}+{n_b}+1$ , and

denotes the matrix entry at the

th row and

th column.

The Hessian of every element of ${\cal C}_2({\hat \Theta})$ is a symmetric matrix [7]. This is because continuous second-order partials satisfy

$\displaystyle \frac{\partial^2}{\partial x_1\partial x_2} = \frac{\partial^2}{\partial x_2\partial x_1} .$

Theorem. If $J\in {\cal C}_1({\hat \Theta})$ , then any cluster point ${\hat \theta}_\infty$ of the gradient sequence ${\hat \theta}_n$ is necessarily a stationary point, i.e., ${J^\prime}({\hat \theta}_\infty)=0$ .

Theorem. Let $\overline{{\hat \Theta}}$ denote the concave hull of ${\hat \Theta}\subset\Re ^{\hat N}$ . If $J\in {\cal C}_2({\hat \Theta})$ , and there exist positive constants and such that

$\displaystyle c\left\Vert\,\eta\,\right\Vert^2\leq \eta^T {J^{\prime\prime}}({\hat \theta}) \eta \leq C\left\Vert\,\eta\,\right\Vert^2,$

(4)

for all ${\hat \theta}\in{\hat \Theta}$ and for all $\eta\in\Re ^{{\hat N}}$ , then the gradient method beginning with any point in ${\hat \Theta}$ converges to a point ${\hat \theta}^\ast$ . Moreover, ${\hat \theta}^\ast$ is the unique global minimizer of

in $\Re ^{{\hat N}}$ .

By the norm equivalence theorem [4], Eq. (4) is satisfied whenever ${J^{\prime\prime}}({\hat \theta})$ is a norm on ${\hat \Theta}$ for each ${\hat \theta}\in{\hat \Theta}$ . Since ${J^{\prime\prime}}$ belongs to ${\cal C}_2({\hat \Theta})$ , it is a symmetric matrix. It is also bounded since it is continuous over a compact set. Thus a sufficient requirement is that ${J^{\prime\prime}}$ be positive definite on ${\hat \Theta}$ . Positive definiteness of ${J^{\prime\prime}}$ can be viewed as ``positive curvature'' of at each point of ${\hat \Theta}$ which corresponds to strict concavity of on ${\hat \Theta}$ .

Download gradient.pdf
Download the whole thesis containing this material.

``Elementary Gradient-Based Parameter Estimation'', by Julius O. Smith III, from ``Techniques for Digital Filter Design and System Identification, with Application to the Violin,'' Julius O. Smith III, Ph.D. Dissertation, CCRMA, Department of Electrical Engineering, Stanford University, June 1983.
Copyright © 2006-01-03 by Julius O. Smith III
Center for Computer Research in Music and Acoustics (CCRMA), Stanford University
[Automatic-links disclaimer]