Optimising smooth functions with gradient descent

Gradient descent

What is gradient descent?

Rather than solving a normal equation, gradient descent takes the derivative of the loss function with respect to each parameter.

Small adjustments are then made to each parameter in the direction of steepest descent (the negative gradient), producing better parameters.

As the derivative term gets smaller, the algorithm converges. The largest changes to the parameters occur early on in the algorithm.

We can stop once the loss is no longer decreasing by much between iterations.
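As a minimal sketch (the function, names and stopping rule here are illustrative, not from these notes), gradient descent on a smooth one-variable function looks like this:

```python
# A minimal sketch of gradient descent on a smooth 1-D function.
# Here f(x) = (x - 3)^2, whose derivative is 2 * (x - 3).

def gradient_descent(grad, x0, alpha=0.1, tol=1e-8, max_iters=10_000):
    """Repeatedly step against the gradient until the step is tiny."""
    x = x0
    for _ in range(max_iters):
        step = alpha * grad(x)
        x -= step
        if abs(step) < tol:  # stop once we are no longer moving by much
            break
    return x

minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(minimum)  # approximately 3.0
```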

Local minima

Gradient descent is not guaranteed to arrive at a global minimum. For some loss functions there are multiple local minima, and gradient descent can end up in the wrong one.

Linear regression does not have this issue, because its squared-error loss is convex.

As a result, convexity is very important when we design loss functions. If the loss surface is convex, we cannot get stuck in a local minimum.
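A small illustrative sketch of the problem (the function here is an arbitrary non-convex example, not from these notes): the same descent routine lands in different minima depending on where it starts.

```python
# A non-convex function where gradient descent's answer depends on the
# starting point. f(x) = x^4 - 3x^2 + x has two minima;
# its derivative is f'(x) = 4x^3 - 6x + 1.

def descend(grad, x, alpha=0.01, iters=5_000):
    for _ in range(iters):
        x -= alpha * grad(x)
    return x

grad = lambda x: 4 * x**3 - 6 * x + 1
print(descend(grad, x=-2.0))  # converges to the left (global) minimum
print(descend(grad, x=+2.0))  # converges to the right (local) minimum
```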

Momentum gradient descent

Batch gradient descent

\(:=\) is used to denote an update of a variable, as in programming, e.g. x = x + 1.

\(\theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)\)

The learning rate \(\alpha\) sets the rate of descent.

\(\theta_0 := \theta_0 - \dfrac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)\)

\(\theta_j := \theta_j - \dfrac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}\)

We can check whether \(J(\theta)\) is increasing across iterations; if it is, something is wrong and \(\alpha\) should be lowered.

Run for a fixed number of iterations and evaluate \(J(\theta)\) to monitor convergence.

Each step can be vectorised using matrices, updating all the parameters at once.
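A hedged sketch of the vectorised update for linear regression (the data and function names are illustrative), using NumPy so that every \(\theta_j\) is updated in a single matrix operation:

```python
import numpy as np

# Batch gradient descent for linear regression, vectorised.
# X is m x n (first column of ones for the intercept), y has length m.

def batch_gradient_descent(X, y, alpha=0.1, iters=1_000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        error = X @ theta - y                 # h_theta(x) - y for every sample
        theta -= (alpha / m) * (X.T @ error)  # update all parameters at once
    return theta

# Toy data: y = 1 + 2x plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x + rng.normal(0, 0.01, 100)
print(batch_gradient_descent(X, y))  # approximately [1, 2]
```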

Convergence can be checked by averaging the cost over the last 1,000 or so samples, rather than over all of them.

A smaller learning rate can reach a better solution: with a larger rate, the parameters can circle the drain, oscillating around the minimum rather than settling into it.

Slowly decreasing the learning rate over time can also give better solutions, for example:

\(\alpha = \dfrac{\text{const}_1}{i + \text{const}_2}\), where \(i\) is the iteration number.
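A tiny sketch of that schedule (the constant values are placeholders to be tuned, not recommendations):

```python
# Decaying learning-rate schedule; const1 and const2 are hand-tuned
# constants, named after the formula above.

def learning_rate(i, const1=1.0, const2=100.0):
    return const1 / (i + const2)

for i in (0, 100, 1_000, 10_000):
    print(i, learning_rate(i))  # the rate shrinks as iterations grow
```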

Batch gradient descent performs each update using all of the samples.

The standard gradient descent algorithm above is also known as batch gradient descent. There are other implementations.

Mini-batch gradient descent

Mini-batch gradient descent uses \(b\) samples on each iteration, where \(b\) is a parameter. It sits between stochastic and batch gradient descent.

\(b = 2\) to \(100\), for example.
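A hedged sketch of mini-batch gradient descent under the same linear-regression setup as above (names and defaults are illustrative); note that \(b = 1\) recovers stochastic gradient descent and \(b = m\) recovers batch gradient descent:

```python
import numpy as np

# Mini-batch gradient descent: each update uses only b samples.

def minibatch_gradient_descent(X, y, b=32, alpha=0.1, epochs=50, seed=0):
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        order = rng.permutation(m)            # shuffle once per epoch
        for start in range(0, m, b):
            idx = order[start:start + b]      # the next mini-batch
            error = X[idx] @ theta - y[idx]
            theta -= (alpha / len(idx)) * (X[idx].T @ error)
    return theta

# Usage mirrors the batch example: minibatch_gradient_descent(X, y, b=10)
```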

Stochastic gradient descent

Stochastic gradient descent performs each update using just a single sample.

Each individual step is not guaranteed to move towards the minimum, but each step is much faster to compute.
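A sketch of stochastic gradient descent for the same setup (illustrative, not a definitive implementation), including the windowed cost check mentioned above:

```python
import numpy as np

# Stochastic gradient descent: one sample per update. The cost is
# tracked as an average over the most recent updates (cf. "check cost
# over the last 1,000") rather than recomputed over all the data.

def sgd(X, y, alpha=0.01, epochs=5, window=1_000, seed=0):
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(seed)
    recent_costs = []
    for _ in range(epochs):
        for i in rng.permutation(m):
            error = X[i] @ theta - y[i]      # scalar residual, one sample
            theta -= alpha * error * X[i]
            recent_costs.append(0.5 * error ** 2)
            if len(recent_costs) >= window:  # report the windowed average
                print(np.mean(recent_costs))
                recent_costs = []
    return theta
```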

Stochastic gradient descent with momentum

With momentum, the gradient we use is not determined by the single sample alone; it is a moving average of the gradients from past samples.
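A minimal sketch of the momentum update, assuming the usual exponentially weighted moving-average form with a decay parameter \(\beta\) (the notes do not specify the exact form):

```python
import numpy as np

# SGD with momentum: the update direction is a moving average of past
# single-sample gradients. beta controls how much history is kept.

def sgd_momentum(X, y, alpha=0.01, beta=0.9, epochs=5, seed=0):
    m, n = X.shape
    theta = np.zeros(n)
    velocity = np.zeros(n)
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(m):
            grad = (X[i] @ theta - y[i]) * X[i]     # single-sample gradient
            velocity = beta * velocity + (1 - beta) * grad
            theta -= alpha * velocity               # step along the average
    return theta
```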

Epochs

An epoch is one full pass through the dataset; the number of epochs is the number of times the whole dataset has been run through.

Adaptive learning rates (Adagrad, Adadelta, RMSProp, Adaptive Moment Estimation (Adam))

Adagrad

Adadelta

RMSProp

Adaptive Moment Estimation (Adam)

Differentiable on a single axis

Coordinate descent

Twice-differentiable functions

Algorithmic efficiency

An algorithm takes memory and time to run. Analysing these characteristics lets us choose between algorithms effectively.

Complexity is described using big-O notation. So an algorithm with parameters \(\theta\) would have a time efficiency of \(O(f(\theta))\), where \(f(\theta)\) is a function of \(\theta\).

Generally we expect \(f(\theta)\) to be weakly increasing for all \(\theta\): adding additional inputs should not decrease the time or space requirements of the algorithm.

An algorithm whose complexity did not change with its inputs would have a constant as the largest term, so we would write \(O(c)\).

An algorithm which increases linearly with its inputs could be written \(O(\theta)\).

An algorithm which increased exponentially could be written \(O(e^\theta )\).
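An illustrative sketch of the three growth classes, using an input size \(n\) in place of \(\theta\):

```python
# Three growth classes, from constant through linear to exponential.

def constant(xs):          # O(c): one lookup regardless of input size
    return xs[0]

def linear(xs):            # O(n): touches every element once
    return sum(xs)

def exponential(n):        # O(2^n): enumerates every subset of n items
    if n == 0:
        return [[]]
    rest = exponential(n - 1)
    return rest + [s + [n] for s in rest]

print(len(exponential(10)))  # 2**10 = 1024 subsets
```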

Complexity can differ between worst-case, best-case and average-case scenarios.

We can describe logical systems by completeness (all true statements are theorems) and soundness (all theorems are true). We have similar definitions for algorithms.

An algorithm which returns outputs for all possible inputs is complete. An algorithm which never returns an incorrect output is optimal.

Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm

Sort

Floating point

Integer

Tree path

Array

E.g. the 8 queens problem

Search tree

In a search tree, each node has a state which is used for the goal test. The state could be the ID of a node (for path finding), or the path history and cost (for the travelling salesman problem).

The frontier is the set of nodes that have been generated but not yet expanded (the term "frontier" is used here rather than "open list").

Backward search is only possible if the end state is clearly defined, e.g. in a maze. It is not possible when the goal state is not known in advance, e.g. in the 8 queens problem.

We can run breadth-first search forwards and backwards simultaneously (bidirectional search).

A search problem has: an initial state, a set of actions, and a transition model.

The transition model is \(T(s_n, a) \to s_{n+1}\): given a state and an action, it returns the new state.

A goal test is applied to each state.

There is a path cost for each successor.
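A minimal sketch of these components on a toy grid world (all names here are illustrative, not from the notes):

```python
# A search problem bundles: initial state, actions, transition model,
# goal test, and path cost.

class GridProblem:
    def __init__(self, start, goal, width, height):
        self.initial_state = start
        self.goal = goal
        self.width, self.height = width, height

    def actions(self, state):            # actions are a function of state
        x, y = state
        moves = {"up": (x, y + 1), "down": (x, y - 1),
                 "left": (x - 1, y), "right": (x + 1, y)}
        return [a for a, (nx, ny) in moves.items()
                if 0 <= nx < self.width and 0 <= ny < self.height]

    def transition(self, state, action): # T(s_n, a) -> s_{n+1}
        x, y = state
        dx, dy = {"up": (0, 1), "down": (0, -1),
                  "left": (-1, 0), "right": (1, 0)}[action]
        return (x + dx, y + dy)

    def goal_test(self, state):          # applied to each generated state
        return state == self.goal

    def path_cost(self, state, action):  # cost of each successor step
        return 1
```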

In the search tree, we expand a node by applying the available actions to generate its successors.

The open list holds the unexplored nodes.

Loopy paths: if we go \(a \to b\), we don't need to go back \(b \to a\). If we are searching for a goal it brings us no closer, and if we are minimising cost it only increases the cost.

Redundant paths: if we have already reached a state \(c\), there is no need to explore reaching it from somewhere else when searching for a goal.

If we have already reached \(c\) at a lower cost, there is no point revisiting it when minimising cost.

The available actions are a function of the state.

Keep explored states in the explored set (the closed list), so they are not expanded again.
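A hedged sketch tying these pieces together: breadth-first search over the GridProblem sketch above, with a frontier of unexpanded nodes and an explored set, so loopy and redundant paths are never expanded twice.

```python
from collections import deque

# Breadth-first search using a frontier (FIFO queue of unexpanded
# nodes) and an explored set (closed list) of states already seen.

def breadth_first_search(problem):
    start = problem.initial_state
    frontier = deque([(start, [start])])   # (state, path so far)
    explored = {start}
    while frontier:
        state, path = frontier.popleft()
        if problem.goal_test(state):
            return path
        for action in problem.actions(state):
            child = problem.transition(state, action)
            if child not in explored:      # skip loopy/redundant paths
                explored.add(child)
                frontier.append((child, path + [child]))
    return None

print(breadth_first_search(GridProblem((0, 0), (2, 2), 3, 3)))
```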