14.1.2 Bp Connection Specifications

In addition to the weight itself, the connection type in Bp, BpCon, has two additional variables:

float dwt
The most recently computed change in the weight value. It is computed in the UpdateWeights function.
float dEdW
The accumulating derivative of the error with respect to the weight. It is computed in the Compute_dWt function, and accumulates until the UpdateWeights function is called, which happens either on a trial-by-trial basis or once per epoch (batch mode).
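
Putting these together, a minimal sketch of the connection type looks something like the following (an illustration of the fields described above, not the actual class declaration, which inherits the weight and other members from the base connection type):

  class BpCon {
  public:
    float wt;     // the weight itself
    float dwt;    // most recent weight change, computed in UpdateWeights
    float dEdW;   // accumulated error derivative, computed in Compute_dWt
  };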

The connection specifications control the behavior and updating of connections (see section 10.5 Connections). Thus, in Bp, this is where you will find things like the learning rate and momentum parameters. A detailed description of the parameters is given below:

float lrate
The learning rate parameter. It controls how fast the weights are updated along the computed error gradient. It should generally be less than 1, and harder problems typically require smaller learning rates.
float momentum
The momentum parameter determines how much of the previous weight change will be retained in the present weight change computation. Thus, weight changes can build up momentum over time if they all head in the same direction, which can speed up learning. Typical values are from .5 to .9, with anything much lower than .5 making little difference.
MomentumType momentum_type
There are a couple of different ways of thinking about how momentum should be applied, and this variable controls which one is used. With AFTER_LRATE, momentum is added to the weight change after the learning rate has been applied:
  cn->dwt = lrate * cn->dEdW + momentum * cn->dwt;
  cn->wt += cn->dwt;
This was used in the original pdp software. The BEFORE_LRATE model holds that momentum is something to be applied to the gradient computation itself, not to the actual weight changes made. Thus, momentum is computed before the learning rate is applied to the weight gradient:
  cn->dwt = cn->dEdW + momentum * cn->dwt;
  cn->wt += lrate * cn->dwt;
Finally, both of the previous forms of momentum introduce a learning rate confound since higher momentum values result in larger effective weight changes when the previous weight change points in the same direction as the current one. This is controlled for in the NORMALIZED momentum update, which normalizes the total contribution of the previous and current weight changes (it also uses the BEFORE_LRATE model of when momentum should be applied):
  cn->dwt = (1.0 - momentum) * cn->dEdW + momentum * cn->dwt;
  cn->wt += lrate * cn->dwt;
Note that NORMALIZED actually uses a variable called momentum_c, which is pre-computed as 1.0 - momentum, so that this extra subtraction is not incurred needlessly during actual weight updates.
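
Putting the three variants together, the core of a weight update might look roughly like the following sketch (an illustration assembled from the snippets above, not the actual UpdateWeights source; the final reset of dEdW is implied by its accumulating behavior described earlier):

  switch(momentum_type) {
  case AFTER_LRATE:     // original pdp style
    cn->dwt = lrate * cn->dEdW + momentum * cn->dwt;
    cn->wt += cn->dwt;
    break;
  case BEFORE_LRATE:    // momentum applied to the raw gradient
    cn->dwt = cn->dEdW + momentum * cn->dwt;
    cn->wt += lrate * cn->dwt;
    break;
  case NORMALIZED:      // momentum_c is the pre-computed 1.0 - momentum
    cn->dwt = momentum_c * cn->dEdW + momentum * cn->dwt;
    cn->wt += lrate * cn->dwt;
    break;
  }
  cn->dEdW = 0.0f;      // start accumulating afresh for the next trial/epoch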
float decay
Controls the magnitude of weight decay, if any. If the corresponding decay_fun is NULL, weight decay is not performed; if it is set, the amount of decay is scaled by this parameter. Note that weight decay is applied before either momentum or the learning rate, so that its effects are relatively invariant with respect to manipulations of these other parameters, as sketched below.
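
In terms of the code above, this ordering means the decay function adjusts cn->dEdW in place before the momentum and learning rate computations run, roughly as follows (the call signature shown for the function pointer is hypothetical, for illustration only):

  if(decay_fun != NULL)
    (*decay_fun)(this, cn);   // hypothetical call: decay modifies cn->dEdW first
  cn->dwt = lrate * cn->dEdW + momentum * cn->dwt;  // momentum and lrate then
  cn->wt += cn->dwt;                                // see the decayed value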
decay_fun
The decay function to be used in computing weight decay. This is a pointer to a function, which means that the user can add additional decay functions as desired. The default ones are Bp_Simple_WtDecay, which simply subtracts a fraction of the current weight value, and Bp_WtElim_WtDecay, which uses the "weight elimination" procedure of Weigend, Rumelhart, and Huberman, 1991. This procedure allows large weights to escape strong decay pressure, while encouraging small weights to be eliminated:
  float denom = 1.0 + (cn->wt * cn->wt);
  cn->dEdW -= spec->decay * ((2.0 * cn->wt) / (denom * denom));
The underlying cost term, wt^2 / denom, is roughly proportional to the squared weight for small weights, but approaches a constant for weights larger than 1; the resulting decay pressure (the derivative subtracted above) thus acts like simple decay on small weights while leaving large weights essentially untouched.
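
For comparison, the simple decay case presumably just folds a fraction of the weight itself into the accumulated derivative, along these lines:

  cn->dEdW -= spec->decay * cn->wt;   // Bp_Simple_WtDecay: subtract a fraction of the weight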