How is the bias in the neural network updated

What role does bias play in neural networks?


I think that biases are almost always helpful. In effect, a bias value allows you to shift the activation function to the left or right, which may be critical for successful learning.

It might help to look at a simple example. Consider this network with 1 input and 1 output that has no bias:

The output of the network is computed by multiplying the input (x) by the weight (w0) and passing the result through an activation function (e.g. a sigmoid function).

Here is the function that this network computes for various values of w0:

Changing the weight w0 essentially changes the "steepness" of the sigmoid. That is useful, but what if you want the network to output 0 when x is 2? Just changing the steepness of the sigmoid won't really work: you want to be able to shift the entire curve to the right.

That is exactly what the bias allows you to do. If we add a bias to this network, like so:

... then the output of the network becomes sig(w0 * x + w1 * 1.0). Here is what the output of the network looks like for various values of w1:

A weight of -5 for w1 shifts the curve to the right, which allows us to have a network that outputs a value close to 0 when x is 2.
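Since the plots are not reproduced here, a minimal Python sketch of the same idea may help. It assumes w0 = 1.0 for the biased case (the text above does not state the value used), and the function names are only illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w0, w1=0.0):
    # w1 is the weight on a constant bias input of 1.0 (w1 = 0 means "no bias").
    return sigmoid(w0 * x + w1 * 1.0)

# Without a bias the curve always passes through 0.5 at x = 0,
# whatever the steepness w0 is.
midpoints_no_bias = [neuron(0.0, w0) for w0 in (0.5, 1.0, 2.0)]

# With w0 = 1.0 (assumed) and w1 = -5.0 the midpoint moves to x = -w1 / w0 = 5.
out_no_bias = neuron(2.0, w0=1.0)             # sig(2), well above 0.5
out_with_bias = neuron(2.0, w0=1.0, w1=-5.0)  # sig(-3), close to 0

print(midpoints_no_bias, out_no_bias, out_with_bias)
```

The output at x = 2 only gets close to 0 once the bias weight moves the midpoint of the curve to the right of 2.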

Just to add my two cents.

A simpler way to understand what the bias is: it is somewhat like the constant b of a linear function

y = ax + b

It allows you to move the line up and down to fit the prediction to the data better. Without b, the line always goes through the origin (0, 0) and you may get a poorer fit.

This thread really helped me develop my own project. Here are some more figures showing the result of a simple 2-layer feed-forward neural network with and without bias units on a two-variable regression problem. The weights are initialized randomly and the standard ReLU activation is used. As the answers before me noted, without the bias the ReLU network cannot deviate from zero at (0,0).
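The behavior at (0,0) can be checked with a small sketch. This is not the network from my project: the layer sizes, the seed and the helper names are assumptions made only for illustration.

```python
import random

random.seed(0)  # arbitrary seed, only for reproducibility
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(4)]  # 2 inputs -> 4 hidden
W2 = [[random.uniform(-1, 1) for _ in range(4)]]                    # 4 hidden -> 1 output
B1 = [random.uniform(-1, 1) for _ in range(4)]
B2 = [random.uniform(-1, 1)]

def layer(x, W, b):
    # Weighted sum per neuron, plus that neuron's bias.
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def net(x, use_bias):
    b1 = B1 if use_bias else [0.0] * len(B1)
    b2 = B2 if use_bias else [0.0] * len(B2)
    hidden = [max(0.0, h) for h in layer(x, W1, b1)]  # ReLU
    return layer(hidden, W2, b2)[0]

origin_no_bias = net([0.0, 0.0], use_bias=False)
origin_bias = net([0.0, 0.0], use_bias=True)
print(origin_no_bias, origin_bias)
```

Without bias units every weighted sum at the origin is zero and ReLU(0) = 0, so the output is pinned to 0 no matter which weights were drawn.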

While training an ANN, two different kinds of parameters can be adjusted: the weights and the value in the activation functions. This is impractical, and it would be easier if only one of the parameters had to be adjusted. To cope with this problem, a bias neuron is invented. The bias neuron lies in one layer, is connected to all the neurons in the next layer but to none in the previous layer, and always emits 1. Since the bias neuron emits 1, the weights connected to the bias neuron are added directly to the combined sum of the other weights (equation 2.1), just like the t value in the activation functions. [1]

The reason this is impractical is that you are adjusting the weight and the value simultaneously, so any change to the weight can neutralize a change to the value that was useful for a previous data instance. Adding a bias neuron whose value does not change allows you to control the behavior of the layer.

In addition, with the bias, you can use a single neural network to represent similar cases. Consider the Boolean AND function represented by the following neural network:


  • w0 corresponds to b .
  • w1 corresponds to x1 .
  • w2 corresponds to x2 .

A single perceptron can be used to represent many Boolean functions.

For example, assuming Boolean values of 1 (true) and -1 (false), one way to use a two-input perceptron to implement the AND function is to set the weights w0 = -.8 and w1 = w2 = .5. This perceptron can be made to represent the OR function instead by changing the threshold to w0 = .3. In fact, AND and OR can be viewed as special cases of m-of-n functions: that is, functions where at least m of the n inputs to the perceptron must be true. The OR function corresponds to m = 1 and the AND function to m = n. Any m-of-n function is easily represented with a perceptron by setting all input weights to the same value (e.g. 0.5) and then setting the threshold w0 accordingly.

Perceptrons can represent all of the primitive Boolean functions AND, OR, NAND (¬AND) and NOR (¬OR). (Machine Learning, Tom Mitchell)

The threshold is the bias and w0 is the weight associated with the bias / threshold neuron.
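Here is a rough sketch of that perceptron under the 1 / -1 encoding. The threshold weights (w0 = -0.8 for AND, w0 = 0.3 for OR) are assumptions picked so that all four input rows come out right under this encoding, and the function name is made up for the example.

```python
def perceptron(x1, x2, w0, w1=0.5, w2=0.5):
    # w0 is the weight on a constant input of 1, i.e. the bias / threshold weight.
    return 1 if w0 * 1 + w1 * x1 + w2 * x2 > 0 else -1

inputs = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

# Same input weights for both functions; only the bias weight w0 differs.
and_out = [perceptron(a, b, w0=-0.8) for a, b in inputs]
or_out = [perceptron(a, b, w0=0.3) for a, b in inputs]

print(and_out)
print(or_out)
```

Switching from AND to OR touches nothing but w0, which is the sense in which one network represents similar cases.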

Bias is not an NN-specific term; it is a generic algebra term to consider.

Y = M*X + C (straight line equation)

If C (the bias) = 0, then the line will always pass through the origin, that is, (0,0), and it depends on only one parameter, that is, M, the slope, so we have fewer things to play with.

C, the bias, can take any number and acts to shift the graph, and is thus able to represent more complex situations.

In logistic regression, the expected value of the target is transformed by a link function in order to restrict its value to the unit interval. In this way, model predictions can be viewed as primary outcome probabilities, as follows: Sigmoid function on Wikipedia

This is the final activation layer in the NN map, which turns the neuron on and off. Here, too, the bias plays a role: it shifts the curve flexibly to help us map the model.

A layer in an un-biased neural network is nothing more than the multiplication of an input vector by a matrix. (The output vector may be passed through a sigmoid function for normalization and subsequent use in multilayer ANN, but this is not important.)

This means that you are using a linear function and therefore an input of all zeros will always map to an output of all zeros. This may be a reasonable solution for some systems, but is generally too restrictive.

With a bias, you effectively add another dimension to your input space, one that always takes the value one, so you avoid an input vector of all zeros. You do not lose any generality by this, because your trained weight matrix need not be surjective, so it can still map to all values that were previously possible.

2D ANN:

For an ANN that maps two dimensions to one dimension, as in reproducing the AND or OR (or XOR) functions, you can think of a neural network as doing the following:

Mark all positions of the input vectors in the 2D plane. So for Boolean values you want to mark (-1,-1), (1,1), (-1,1), (1,-1). What your ANN now does is draw a straight line in the 2D plane, separating the positive output values from the negative ones.

Without bias, this straight line has to go through zero, whereas with bias you are free to put it anywhere. So you will find that without bias you run into a problem with the AND function, because you cannot put both (1,-1) and (-1,1) on the negative side. (They are not allowed to be on the line.) The problem is the same for the OR function. With a bias, however, it is easy to draw the line.

Note that the XOR function cannot be solved in this situation, even with a bias.
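To make the AND case concrete, here is a small brute-force sketch. The weight grid is an arbitrary assumption, so this is an illustration, not a proof, and it requires every point to lie strictly on its side of the line, as in the remark above.

```python
points = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
and_targets = [-1, -1, -1, 1]

def separates(w1, w2, b):
    # Every point must fall strictly on the side given by its target (none on the line).
    return all(t * (w1 * x + w2 * y + b) > 0
               for (x, y), t in zip(points, and_targets))

grid = [i / 10 for i in range(-20, 21)]  # candidate values from -2.0 to 2.0

found_without_bias = any(separates(w1, w2, 0.0) for w1 in grid for w2 in grid)
found_with_bias = any(separates(w1, w2, b)
                      for w1 in grid for w2 in grid for b in grid)

print(found_without_bias, found_with_bias)
```

The bias-free search has to fail: (1,-1) and (-1,1) are mirror images through the origin, so a line through zero always gives them opposite signs.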

When you use ANNs, you seldom know about the internals of the systems you are trying to learn. Some things cannot be learned without a bias. For example, look at the following data: (0, 1), (1, 1), (2, 1), basically a function that maps every x to 1.

If you have a single-layer network (or a linear mapping), you cannot find a solution. However, if you have a bias, it's trivial!

In an ideal setting, a bias could also map all points to the mean of the target points and let the hidden neurons model the differences from that point.
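For the three data points above, this can be worked through with ordinary least squares. The sketch below uses the textbook closed-form formulas; the variable names are just for the example.

```python
xs = [0.0, 1.0, 2.0]
ys = [1.0, 1.0, 1.0]
n = len(xs)

# Bias-free model y = w * x: least-squares w = sum(x*y) / sum(x*x).
w_only = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
sse_no_bias = sum((y - w_only * x) ** 2 for x, y in zip(xs, ys))

# Model with bias y = w * x + b: slope from the centered data, then the intercept.
mean_x, mean_y = sum(xs) / n, sum(ys) / n
w = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
b = mean_y - w * mean_x
sse_bias = sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys))

print(w_only, sse_no_bias)
print(w, b, sse_bias)
```

The best bias-free line still misses every point (it is forced through (0, 0)), while the biased model recovers w = 0, b = 1 with zero error.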

Modifying the neuron weights alone only manipulates the shape / curvature of your transfer function, not its equilibrium / zero-crossing point.

Introducing bias neurons allows you to move the transfer function curve horizontally (left / right) along the input axis while leaving its shape / curvature unchanged. In this way, the network can produce arbitrary outputs that differ from the defaults, so you can adapt / shift the input-to-output mapping to your particular needs.

A graphic explanation can be found here:

Just to add to all of this something that is very much missing and that the rest, most probably, didn't know.

When working with images, you may actually prefer not to use a bias at all. In theory, that way your network is more independent of data magnitude, e.g. whether the picture is dark or bright and vivid. The net will then learn to do its job by studying the relative relationships inside your data. Many modern neural networks use this.

For other data, biases can be critical. It depends on what kind of data you are dealing with. If your information is magnitude-invariant, that is, if inputting [1,0,0.1] should produce the same result as inputting [100,0,10], you may be better off without a bias.

In some experiments in my master's thesis (e.g. page 59) I found that the bias might be important for the first layer(s), but especially at the fully connected layers at the end it does not seem to play a big role.
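Here is a minimal sketch of that magnitude argument, assuming a tiny ReLU network with random weights (sizes, seed and helper names are arbitrary choices for illustration). Without biases, scaling the input by a positive factor scales every output by the same factor, so the ranking of the outputs is unchanged.

```python
import math
import random

random.seed(1)  # arbitrary seed, only for reproducibility
W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]  # 3 inputs -> 4 hidden
W2 = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]  # 4 hidden -> 2 outputs
B1 = [random.uniform(-1, 1) for _ in range(4)]
B2 = [random.uniform(-1, 1) for _ in range(2)]

def layer(x, W, b):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def forward(x, use_bias):
    b1 = B1 if use_bias else [0.0] * len(B1)
    b2 = B2 if use_bias else [0.0] * len(B2)
    hidden = [max(0.0, h) for h in layer(x, W1, b1)]  # ReLU
    return layer(hidden, W2, b2)

def proportional(use_bias, factor=100.0):
    small, large = [1.0, 0.0, 0.1], [100.0, 0.0, 10.0]
    return all(math.isclose(l, factor * s, rel_tol=1e-9, abs_tol=1e-9)
               for s, l in zip(forward(small, use_bias), forward(large, use_bias)))

scaled_ok = proportional(use_bias=False)   # outputs scale exactly with the input
scaled_with_bias = proportional(use_bias=True)  # generally no longer proportional
print(scaled_ok, scaled_with_bias)
```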

This can be very dependent on the network architecture / dataset.

The bias determines by how much of an angle you want your weight to rotate.

In a two-dimensional chart, weight and bias help us find the decision boundary of the outputs. Say we need to build an AND function; the input (p) - output (t) pairs should be

{p = [0,0], t = 0}, {p = [1,0], t = 0}, {p = [0,1], t = 0}, {p = [1,1], t = 1}

Now we need to find the decision boundary, and the ideal boundary should be:

See? W is perpendicular to our boundary. So we say that W decides the direction of the boundary.

However, it is hard to find the right W the first time. Most of the time we choose the initial W value at random. The first boundary may therefore be:

Now the boundary is parallel to the y-axis.

We want to rotate the boundary. How?

By changing W.

So we use the learning rule function: W' = W + P:

W' = W + P is equivalent to W' = W + bP, where b = 1.

Therefore, by changing the value of b (the bias), you can decide the angle between W' and W. That is "the learning rule of ANN".

You can also read Neural Network Design by Martin T. Hagan / Howard B. Demuth / Mark H. Beale, Chapter 4, "Perceptron Learning Rule".
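As a rough sketch of how such a rule plays out on the AND pairs above, the code below uses the commonly stated perceptron rule, W' = W + e*P and b' = b + e with e = t - a, which reduces to W' = W + P when e = 1. The zero initialization, the hard-limit activation and the epoch cap are assumptions made for this example.

```python
def hardlim(n):
    # Hard-limit activation: 1 if the net input is non-negative, else 0.
    return 1 if n >= 0 else 0

samples = [([0, 0], 0), ([1, 0], 0), ([0, 1], 0), ([1, 1], 1)]

W = [0.0, 0.0]  # assumed starting weights
b = 0.0         # assumed starting bias

for epoch in range(100):  # assumed cap; AND is linearly separable, so this converges
    errors = 0
    for p, t in samples:
        a = hardlim(W[0] * p[0] + W[1] * p[1] + b)
        e = t - a  # +1, 0 or -1
        if e != 0:
            W = [W[0] + e * p[0], W[1] + e * p[1]]  # rotate W toward / away from p
            b = b + e                               # move the boundary off the origin
            errors += 1
    if errors == 0:
        break

predictions = [hardlim(W[0] * p[0] + W[1] * p[1] + b) for p, _ in samples]
print(W, b, predictions)
```

Note that the bias is updated by the same error signal as the weights, as if it were a weight on a constant input of 1.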

In particular, Nate's answer, zfy's answer, and Pradi's answer are great.

In simpler terms, biases allow more and more variations of weights to be learned / stored ... (side note: sometimes given some threshold). In any case, more variations mean that biases add a richer representation of the input space to the model's learned / stored weights. (Where better weights can improve the neural network's guessing power.)

For example, in learning models the hypothesis / guess is desirably bounded by y = 0 or y = 1 given some input, possibly in a classification task ... i.e. some y = 0 for some x = (1,1) and some y = 1 for some x = (0,1). (The condition on the hypothesis / outcome is the threshold I talked about above. Note that my examples set up the inputs X so that each x is a double or two-valued vector, instead of Nate's single-valued x inputs of some collection X.)

If we ignore the bias, many inputs may end up being represented by a lot of the same weights (i.e. the learned weights mostly occur close to the origin (0,0)). The model would then be limited to poorer quantities of good weights, instead of the many, many more good weights it could learn with a bias. (Where poorly learned weights lead to poorer guesses, or a decrease in the neural network's guessing power.)

It is therefore optimal that the model learns both close to the origin and in as many places as possible inside the threshold / decision boundary. With the bias, we can allow degrees of freedom close to the origin without being limited to the origin's immediate region.

Expanding on @zfy's explanation ... The equation for one input, one neuron and one output should look like this:

y = a * x + b * 1    and    out = f(y)

where x is the value of the input node and 1 is the value of the bias node; y can be your output directly or be passed to a function, often a sigmoid function. Also note that the bias could be any constant, but to make things simpler we always pick 1 (and that's probably so common that @zfy did it without showing and explaining it).

Your network tries to learn the coefficients a and b to adapt to your data. So you can see why adding the element b * 1 allows it to fit more data better: now you can change both the slope and the intercept.

If you have more than one input, your equation will look like this:

y = a0 * x0 + a1 * x1 + ... + aN * 1

Note that the equation still describes a one-neuron, one-output network; if you have more neurons, you just add one dimension to the coefficient matrix, to multiplex the inputs to all nodes and sum back each node's contribution.

This can be written in vectorized format as

A = [a0, a1, ..., aN], X = [x0, x1, ..., 1]
Y = A . XT

So if you put the coefficients in one array and (inputs + bias) in another array, you have your desired solution as the dot product of the two vectors (you need to transpose X for the shape to be correct; I wrote XT for 'X transposed').

So in the end, you can also see your bias as just one more input that represents the part of the output that is actually independent of your input.
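A tiny sketch of that "bias as one more input" view, with made-up coefficient values and a hypothetical helper name:

```python
def neuron(inputs, coeffs):
    # Append the constant bias input 1.0, then take the dot product A . X.
    X = list(inputs) + [1.0]
    return sum(a * x for a, x in zip(coeffs, X))

A = [2.0, -1.0, 0.5]  # a0, a1 and the bias coefficient aN (illustrative values)

y = neuron([3.0, 4.0], A)          # 2*3 + (-1)*4 + 0.5*1
y_at_zero = neuron([0.0, 0.0], A)  # only the bias coefficient remains

print(y, y_at_zero)
```

With all inputs at zero, the output is exactly the bias coefficient, which is the input-independent part of the output.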

Other than the answers already mentioned, I would like to add a few other points.

Bias acts as our anchor. It's a way for us to have some kind of baseline that we don't go below. In terms of a graph, for y = mx + b, think of it as the y-intercept of this function.

output = input times the weight value, plus a bias value, with an activation function then applied.

To put it simply, if you have y = w1 * x, where y is your output and w1 is the weight, imagine a condition where x = 0: then y = w1 * x equals 0. If you want to update your weight, you have to compute how much to change it by, delw = target - y, where target is your target output. In this case delw will not change, since y is always computed as 0. So suppose you can add some extra value: it helps to have y = w1 * x + w0 * 1, where bias = 1 and the weight w0 can be adjusted to get the correct bias. Consider the following example.

In terms of lines, slope-intercept form is a specific form of linear equation.

y = mx + b

Check the image.

Here, b is (0,2).

If you want to raise it to (0,3), how would you do it? By changing the value of b, which is your bias.
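The x = 0 case described above can be sketched with a plain delta-rule update. Scaling each weight's change by its own input is an assumption here (the text only writes delw = target - y), and the learning rate and starting values are made up.

```python
lr = 0.5               # assumed learning rate
x, target = 0.0, 1.0   # the input is stuck at 0, but we want an output of 1
w1, w0 = 0.3, 0.0      # w1 weights the input, w0 weights the constant bias input 1

for step in range(20):
    y = w1 * x + w0 * 1.0
    err = target - y
    w1 += lr * err * x     # always zero here, because x = 0
    w0 += lr * err * 1.0   # the bias weight still receives an update

y_final = w1 * x + w0 * 1.0
print(w1, w0, y_final)
```

The input weight never moves, because its update is multiplied by x = 0. Only the bias weight can carry the output toward the target.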

In all of the ML books I have studied, W is always defined as the connectivity index between two neurons: the higher the connectivity between two neurons, the more strongly signals are transmitted from the firing neuron to the target neuron, or Y = w * X. As a result, in order to maintain the biological character of neurons, we need to keep 1 >= W >= -1, but in real regression W will end up with |W| >= 1, which contradicts how neurons work. So I propose W = cos(theta), with 1 >= |cos(theta)|, and Y = a * X = W * X + b, with a = b + W = b + cos(theta), where b is an integer.

In neural networks:

  1. Every neuron has a bias
  2. You can view the bias as a threshold (generally the opposite value of the threshold).
  3. The weighted sum from the input layers + bias decides the activation of the neuron
  4. Bias increases the flexibility of the model.

In the absence of a bias, the neuron cannot be activated just by taking into account the weighted sum from the input layer. If the neuron is not activated, the information from that neuron will not be passed through the rest of the neural network.

The value of the bias is learnable.

Effectively, bias = -threshold. You can think of the bias as how easy it is to get the neuron to output a 1: with a really large bias it is very easy for the neuron to output a 1, but when the bias is very negative it is difficult.

In summary: Bias helps control the level at which the activation function is triggered.
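A short sketch of the bias = -threshold relationship, with made-up weights and a hypothetical threshold of 0.8:

```python
def fires_with_threshold(weights, inputs, threshold):
    # The neuron fires when the weighted sum reaches the threshold.
    return sum(w * x for w, x in zip(weights, inputs)) >= threshold

def fires_with_bias(weights, inputs, bias):
    # Same neuron with the threshold moved to the left-hand side as a bias.
    return sum(w * x for w, x in zip(weights, inputs)) + bias >= 0

weights = [0.5, 0.5]   # illustrative values
threshold = 0.8
cases = [(0, 0), (0, 1), (1, 0), (1, 1)]

same = all(fires_with_threshold(weights, c, threshold)
           == fires_with_bias(weights, c, -threshold) for c in cases)

# A very positive bias makes firing easy; a very negative one makes it hard.
easy = [fires_with_bias(weights, c, 10.0) for c in cases]
hard = [fires_with_bias(weights, c, -10.0) for c in cases]
print(same, easy, hard)
```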

The bias term is used to adjust the final output matrix, as the y-intercept does. For example, in the classic equation y = mx + c, if c = 0 then the line always passes through 0. Adding the bias term gives our neural network model more flexibility and better generalization.

In general, in machine learning we have this basic formula for the bias-variance trade-off: Error = Bias² + Variance + Irreducible error. Because NNs suffer from overfitting (a model generalization problem where small changes in the data lead to large changes in the model result), and therefore from large variance, introducing a small bias could help a lot. Considering the formula above, where the bias is squared, introducing a small bias could lead to a large reduction in the variance. So introduce bias when you have large variance and are at risk of overfitting.

The bias helps to get a better equation.

Think of the input and output as a function y = ax + b: you need to put the right line between the input (x) and the output (y) to minimize the global error between each point and the line. If you keep the equation as y = ax, you have only one parameter to adjust; even if you find the best a to minimize the global error, the result will be rather far from the value you want.

You can say that the bias makes the equation more flexible so that it can adapt to the best values.
