11 common pitfalls in deep learning neural networks and their countermeasures

The author lists 11 common problems that may be encountered when building a neural network, covering data preprocessing, regularization, learning rate, activation functions, weight initialization, and more, and provides solutions and explanations of the underlying reasons, which is useful for deep learning practice.

You forgot to normalize the data.

Problem Description

When using neural networks, it is important to think carefully about how to normalize your data. This is a step that cannot be skipped - if it is not done carefully and correctly, your network will hardly work at all. Because this step is so essential and so well known in the deep learning community, it is rarely mentioned in papers, so beginners often get it wrong.


How to solve it?

In general, normalization means subtracting the mean from the data and dividing by its standard deviation. Usually this is done separately for each input and output feature, but you may sometimes want to normalize groups of features together, or treat the normalization of certain features with special care.

why?

The main reason we need to normalize the data is that most neural network pipelines assume that both the input and output data are distributed with a standard deviation of about one and a mean of about zero. These assumptions appear everywhere in the deep learning literature, from weight initialization and activation functions to the optimization algorithms used to train the network.

Also need to pay attention

Untrained neural networks typically output values roughly in the range -1 to 1. Problems arise if you want to output values in some other range (for example, RGB images stored as bytes in the range 0-255). At the start of training the network will be very unstable, because when the expected value is, say, 255, the network produces a value near -1 or 1, which most of the optimization algorithms used to train neural networks treat as a serious error. This can create excessive gradients that may lead to exploding gradients. Even if the gradients do not explode, the first few stages of training are wasted, because the first thing the network learns is to scale its outputs into roughly the expected range. If the data is normalized (in this case, simply divide the values by 128 and subtract 1), none of this is a problem.
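As a concrete illustration of the byte-image case above, here is a minimal sketch assuming NumPy and an 8-bit RGB image batch stored as a uint8 array (the array name and shape are made up for the example):

```python
import numpy as np

# Hypothetical batch of 8-bit RGB images: values in [0, 255], shape (N, H, W, 3).
images_uint8 = np.random.randint(0, 256, size=(4, 64, 64, 3), dtype=np.uint8)

# Map [0, 255] roughly onto [-1, 1]: divide by 128 and subtract 1, so the
# targets fall in the range an untrained network naturally produces.
images = images_uint8.astype(np.float32) / 128.0 - 1.0

print(images.min(), images.max())  # roughly -1.0 and just under 1.0
```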

In general, the scale of a feature in a neural network also determines its importance. If one of the output features has a large scale, it will produce a larger error than the other features. Similarly, large-scale features in the input will dominate the network and cause larger changes downstream. Therefore, the automatic normalization provided by neural network libraries is often not enough: these libraries blindly subtract the mean and divide by the variance on a per-feature basis. You might have an input feature that typically ranges from 0.0 to 0.001 - is the range of this feature so small because it is an unimportant feature (in which case you may not want to rescale it), or because it has small units compared with the other features (in which case you do)?

Similarly, be careful with features whose range is so small that their variance is close to or equal to zero; normalizing them can produce instabilities or NaNs. It is important to think carefully about these issues - consider what each of your features really represents, and think of normalization as the process of equalizing the "units" of all input features. This is one of the few aspects of deep learning where I think a human really needs to be in the loop.
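The kind of careful per-feature normalization discussed above, with a guard against near-constant features, might look like the following sketch (NumPy; the data and the epsilon threshold are made up for illustration):

```python
import numpy as np

def normalize_features(x, eps=1e-8):
    """Subtract the per-feature mean and divide by the per-feature standard
    deviation, but leave near-constant features unscaled to avoid NaNs."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    safe_std = np.where(std < eps, 1.0, std)  # do not divide by ~zero
    return (x - mean) / safe_std

# Hypothetical data: 100 samples, 3 features, including a tiny-range feature
# and a constant feature whose variance is exactly zero.
x = np.column_stack([
    np.random.normal(50.0, 10.0, 100),      # large-scale feature
    np.random.normal(0.0005, 0.0001, 100),  # tiny-scale feature: rescale or not?
    np.full(100, 3.0),                      # constant feature, std == 0
])
print(normalize_features(x).std(axis=0))
```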

You forgot to check the results.

Problem Description

You have trained the network for a few epochs and you can see the error decreasing. Does this mean you are done? Unfortunately, it almost certainly means there is still a problem somewhere in your code. There may be bugs in the data preprocessing, the training code, or even the inference. Just because the error rate is dropping does not mean your network is learning anything useful.

How to solve it?

It is important to check that the data is correct at every stage of the pipeline. Usually, you need to find some way to visualize the results. If you have image data, this is easy, and animation data can also be visualized without much trouble. For other kinds of data, you have to find a way to inspect the results to make sure that each step of preprocessing, training, and inference is correct, and to compare the results to ground truth data.

why?

Unlike traditional programming, machine learning systems fail silently in almost all situations. In traditional programming we are used to the computer throwing an error when something goes wrong, and using that as the signal to go back and look for bugs. Unfortunately, this does not work for machine learning, so we should be very careful to check the process at every stage with our own eyes, so that we know when a bug has appeared and when it is time to go back and inspect the code more thoroughly.

Also need to pay attention

There are many ways to check whether the network is working. Part of this is being precise about what the reported training error actually means. Visualize the results of the network applied to the training set - how do your network's outputs compare to ground truth in practice? You may drive the error from 100 down to 1 during training, but if an error of 1 is still unacceptable, the results are still unusable. If the network works on the training set, then check it on the validation set - does it still work on data it has never seen before? My advice is to get used to visualizing everything from the start - do not visualize only when the network stops working - and make sure you have checked the complete pipeline before you start experimenting with different neural network structures. This is the only way to accurately evaluate a number of potentially different approaches.
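One way to make this checking a habit is to plot network outputs against ground truth for both the training and validation sets. The sketch below assumes Matplotlib and uses made-up prediction and target arrays; for image or animation data you would display the actual frames instead:

```python
import numpy as np
import matplotlib.pyplot as plt

def compare_to_ground_truth(pred, target, title):
    """Scatter predictions against targets; a network that has really
    learned something should hug the diagonal y = x."""
    plt.figure()
    plt.scatter(target, pred, s=8, alpha=0.5)
    lo, hi = target.min(), target.max()
    plt.plot([lo, hi], [lo, hi], "r--", label="perfect prediction")
    plt.xlabel("ground truth")
    plt.ylabel("network output")
    plt.title(title)
    plt.legend()
    plt.show()

# Hypothetical predictions on training and validation data.
t_train = np.random.randn(200); p_train = t_train + 0.1 * np.random.randn(200)
t_val   = np.random.randn(200); p_val   = t_val   + 0.4 * np.random.randn(200)
compare_to_ground_truth(p_train, t_train, "training set")
compare_to_ground_truth(p_val, t_val, "validation set")
```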

You forgot to preprocess the data.

Problem Description

Most data is tricky - often data that we know represents similar things can have wildly different numerical representations. Take character animation as an example: if we represent the data using the 3D positions of the character's joints relative to the center of the motion-capture studio, then performing an action in one location or facing one direction can have a very different numerical representation from the same action performed in a different location or facing a different direction. Instead, we need to represent the data differently - for example, in some local reference frame (such as relative to the character's center of mass) - so that similar actions have similar numerical representations.

How to solve it?

Think about what your features actually mean - is there some simple transformation you can apply to ensure that data points representing similar things always get similar numerical representations? Is there a more natural local coordinate system in which to represent your data - perhaps a better color space or a different format?

why?

The neural network makes only a few basic assumptions about the input data, but one of them is that the space the data lives in is somewhat continuous - that for most of the space, a point between two data points is at least some kind of "mix" of the two, and that two nearby data points represent "similar" things in some sense. Large discontinuities in the data space, or large numbers of separated points that all represent the same thing, make the learning task much harder.

Also need to pay attention

Another way to think about data preprocessing is as an attempt to reduce the combinatorial explosion of data variations required. For example, if a neural network trained on character animation data has to learn the same set of motions in every location and in every direction, then a great deal of the network's capacity is wasted and much of the learning process is duplicated.
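For the character animation example, a transformation into a local reference frame might be sketched as below (NumPy; the joint layout, the choice of a root joint instead of the center of mass, and the "forward" direction are all assumptions made for illustration):

```python
import numpy as np

def to_local_frame(joints, root_index=0):
    """Convert world-space joint positions, shape (num_joints, 3), into a
    frame relative to a chosen root joint, so that the same motion performed
    in different places and directions gets a similar numerical representation."""
    root = joints[root_index]
    local = joints - root                    # remove the global translation
    # Also remove the facing direction by rotating about the vertical (y) axis.
    forward = joints[root_index + 1] - root  # hypothetical "hips -> spine" vector
    angle = np.arctan2(forward[0], forward[2])
    c, s = np.cos(-angle), np.sin(-angle)
    rot_y = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return local @ rot_y.T

# Hypothetical pose: 5 joints in world coordinates, far from the studio origin.
pose = np.random.randn(5, 3) + np.array([10.0, 0.0, -4.0])
print(to_local_frame(pose))
```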

You forgot to use regularization

Problem Description

Regularization - usually in the form of dropout, noise, or some kind of stochastic process in the network - is another non-negotiable aspect of training neural networks. Even if you think you have far more data than parameters, or that overfitting does not matter in your case, or that overfitting seems impossible, you should still add dropout or some other form of noise.

How to solve it?

The most basic way to regularize a neural network is to add dropout before each linear layer (convolutional or dense layer) in the network. Start with a medium to high retention probability, such as 0.75 or 0.9, and adjust it based on how likely you think overfitting is. If you still believe overfitting is very unlikely, you can set the retention probability very high, for example 0.99.
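A minimal sketch of this placement, assuming PyTorch (note that `nn.Dropout` takes the drop probability, so a retention probability of 0.9 corresponds to `p=0.1`):

```python
import torch
import torch.nn as nn

drop_p = 0.1  # retention probability 0.9 -> drop probability 0.1

model = nn.Sequential(
    nn.Dropout(p=drop_p),   # dropout before the first dense layer
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Dropout(p=drop_p),   # and before the next one
    nn.Linear(128, 10),
)

model.train()               # dropout is active in training mode
x = torch.randn(32, 64)     # hypothetical batch of 32 examples
print(model(x).shape)       # torch.Size([32, 10])
model.eval()                # dropout is disabled at inference time
```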

why?

Regularization is not just about controlling overfitting. By introducing some random process into training, you are in some sense "smoothing out" the loss landscape. This speeds up training, helps deal with outliers in the data, and prevents extreme weight configurations in the network.

Also need to pay attention

Data augmentation and other kinds of noise can also act as regularization, just like dropout. Although dropout is generally seen as a technique for combining the predictions of many random sub-networks, it can also be viewed as a way to dynamically expand the training set by generating many variations of the input data during training. And as we know, the best way to avoid overfitting and improve network accuracy is to have more data that the network has never seen.

You used too large a batch size

Problem Description

Using too large a batch size can have a negative effect on the accuracy of the network during training, because it reduces the stochasticity of gradient descent.

How to solve it?

Find the smallest batch size you can tolerate during training. The batch size that makes the best use of GPU parallelism during training is not necessarily the best for accuracy, because at some point a larger batch requires training for more epochs to reach the same accuracy. Do not be afraid to start with a very small batch, such as 16, 8, or even 1.
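In most frameworks the batch size is just an argument to the data-loading code. A sketch using PyTorch's `DataLoader` with a deliberately small batch and made-up tensors:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical dataset: 1000 examples with 20 features and scalar targets.
inputs = torch.randn(1000, 20)
targets = torch.randn(1000, 1)
dataset = TensorDataset(inputs, targets)

# Start small (16, 8, or even 1) and only grow the batch if accuracy allows.
loader = DataLoader(dataset, batch_size=16, shuffle=True)

for batch_inputs, batch_targets in loader:
    print(batch_inputs.shape)  # torch.Size([16, 20])
    break
```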

why?

Using a smaller batch size produces choppier, more stochastic weight updates. This has two major benefits. First, it can help training "jump" out of local minima in which it might otherwise get stuck; second, it can cause training to settle in "flatter" minima, which generally indicates better generalization performance.

Also need to pay attention

Some other elements of the data can effectively act like a batch size. For example, processing images at double the resolution can have an effect similar to multiplying the batch size by four. Intuitively, in a CNN the weight update for each filter is averaged over all the pixels it was applied to in the input image, as well as over every image in the batch. Doubling the image resolution averages over four times as many pixels, much like quadrupling the batch size. Overall, it is important to consider how much the final gradient update is averaged in each iteration, and to balance the negative effect of this averaging against making use of as much GPU parallelism as possible.

The learning rate is incorrect

Problem Description

The learning rate can have a huge impact on how well your network trains. If you are new to the field, it is almost certain that you have set it incorrectly, thanks to the various default options in common deep learning frameworks.

How to solve it?

Turn off gradient clipping. Find the highest learning rate at which the error does not explode during training. Then set the learning rate a little lower than this value - it is likely to be very close to the optimal learning rate.
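A rough version of this search is sketched below, assuming PyTorch, a small made-up regression problem, and no gradient clipping anywhere in the loop. The candidate rates and the divergence test are illustrative choices, not a fixed recipe:

```python
import torch
import torch.nn as nn

def diverges(lr, steps=200):
    """Train a tiny model for a few steps at the given learning rate and
    report whether the loss blows up (NaN/inf or grows enormously)."""
    torch.manual_seed(0)
    x = torch.randn(512, 10)
    y = x.sum(dim=1, keepdim=True) + 0.1 * torch.randn(512, 1)
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        if not torch.isfinite(loss) or loss.item() > 1e6:
            return True
        loss.backward()   # note: no gradient clipping here
        opt.step()
    return False

# Sweep from large to small, keep the highest rate that stays stable,
# then train with a learning rate a little below it.
for lr in [1.0, 0.3, 0.1, 0.03, 0.01, 0.003, 0.001]:
    print(lr, "diverges" if diverges(lr) else "stable")
```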

why?

Gradient clipping is enabled by default in many deep learning frameworks. This option prevents over-optimization during training by enforcing a maximum amount by which the weights can change at each step. This can be useful, especially when there are many outliers in the data, because outliers produce large errors, which lead to large gradients and large weight updates. However, having this option on by default makes it very difficult to find the best learning rate by hand. I have found that most beginners in deep learning set the learning rate too high because of gradient clipping, which makes the overall training behavior slow and makes the effect of changing the learning rate unpredictable.

Also need to pay attention

If you have cleaned the data properly, removed most of the outliers, and set the learning rate correctly, then you do not really need gradient clipping. If, after turning it off, you find that training errors occasionally explode, then by all means turn gradient clipping back on. But keep in mind that frequent error explosions during training almost always indicate some other anomaly in your data - clipping is just a temporary remedy.
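If you do decide to turn clipping back on as a stopgap, in PyTorch it is one extra call between the backward pass and the optimizer step. The sketch below is self-contained with made-up data, and `max_norm=1.0` is an arbitrary example value:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

x, y = torch.randn(64, 10), torch.randn(64, 1)   # hypothetical batch
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
# Cap the total gradient norm before the update; a temporary remedy,
# not a substitute for cleaning the data or fixing the learning rate.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```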

The wrong activation function was used on the last layer

Problem Description

Using an activation function on the last layer can sometimes mean that your network is unable to produce the full range of values required. The most common mistake is to use ReLU on the last layer, which makes the network only able to output positive values.

How to solve it?

If you are doing regression, then most of the time you will not want any activation function on the last layer, unless you know something specific about the kind of values you want to output.

why?

Think again about what your data values actually represent and what their ranges are after normalization. The most likely scenario is that your output is unbounded, positive or negative, in which case you should not use an activation function on the final layer. If your output values only make sense within a certain range, for example a probability in 0-1, then the final layer should have a specific activation function, such as the sigmoid.
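Two small sketches of this choice, assuming PyTorch (the layer sizes are arbitrary): one for an unbounded regression output, one for a probability in [0, 1]:

```python
import torch
import torch.nn as nn

# Regression with an unbounded target: no activation on the final layer.
regressor = nn.Sequential(
    nn.Linear(16, 64),
    nn.ReLU(),
    nn.Linear(64, 1),    # raw, unbounded output
)

# Output that only makes sense as a probability in [0, 1]: use a sigmoid.
prob_model = nn.Sequential(
    nn.Linear(16, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
    nn.Sigmoid(),        # squashes the output into (0, 1)
)

x = torch.randn(8, 16)
print(regressor(x).min().item(), prob_model(x).max().item())
```

(If the probability model is trained with a loss such as `nn.BCEWithLogitsLoss`, the sigmoid is folded into the loss and the final layer again stays linear.)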

Also need to pay attention

There are several things to watch out for when using an activation function on the last layer. Perhaps you know that your system will eventually clip the output to [-1, 1]. It might then seem to make sense to add this clipping as the activation of the final layer, since it ensures that the network's error function does not penalize values greater than 1 or less than -1. However, no error also means there is no gradient for values greater than 1 or less than -1, which in some cases can make your network untrainable. Or you might be tempted to use tanh on the last layer, since its output lies in [-1, 1]; but this can also cause problems, because the gradient of tanh becomes very small near 1 and -1, and producing outputs of exactly -1 or 1 may force the weights to grow very large. In general, it is best to play it safe and use no activation function on the last layer, rather than trying something clever that can backfire.

Bad gradients in the network

Problem Description

Deep networks that use the ReLU activation function often suffer from so-called "dead neurons", which are caused by bad gradients. This can hurt the performance of the network and, in some cases, make it impossible to train at all.

How to solve it?

If you find that the training error does not change after many epochs, it may be that all of your neurons have died because of the ReLU activation function. Try switching to another activation function, such as leaky ReLU or ELU, and see whether the problem persists.
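Swapping the activation is usually a one-line change. A sketch assuming PyTorch:

```python
import torch.nn as nn

def make_mlp(activation):
    """Build the same small network with a configurable activation."""
    return nn.Sequential(
        nn.Linear(32, 64), activation,
        nn.Linear(64, 64), activation,
        nn.Linear(64, 1),
    )

relu_net  = make_mlp(nn.ReLU())                           # can suffer dead neurons
leaky_net = make_mlp(nn.LeakyReLU(negative_slope=0.01))   # small gradient for x < 0
elu_net   = make_mlp(nn.ELU())                            # smooth, nonzero gradient for x < 0
```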

why?

The gradient of the ReLU activation function is 1 for positive inputs and 0 for negative inputs, because a small change in an input below zero does not affect the output. This may not seem like a problem in the short term, because the gradient for positive values is large. However, layers are stacked on top of each other, and negative weights can turn positive values with large gradients into negative values with zero gradient; it can often happen that some or even all of the hidden units have zero gradient with respect to the cost function, no matter what the input is. In this situation, we say the network is "dead", because the weights are completely unable to update.

Also need to pay attention

Any operation with a zero gradient (such as clipping, rounding, or max/min) will produce a bad gradient when used in computing the derivative of the cost function with respect to the weights. Be very careful if these operations appear anywhere in your computation graph, as they tend to cause unexpected problems.

Network weights are not initialized correctly

Problem Description

If you do not initialize your neural network weights correctly, the network is very unlikely to train at all. Many other components of a neural network assume some correct or standardized form of weight initialization, and setting the weights to zero, or using your own custom random initialization, will not work.

How to solve it?

The "he", "lecun", and "xavier" weight initializations are all popular choices and work well in almost any situation. Just pick one (my favorite is "lecun"), and once your neural network is working, you are free to experiment.
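A sketch of applying one of these initializations explicitly, assuming PyTorch, which exposes He initialization as `kaiming_*` and Xavier/Glorot as `xavier_*` (LeCun initialization is not built in under that name and is omitted here):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

def init_weights(module):
    """Apply He ("kaiming") initialization to linear layers; zero the biases."""
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        nn.init.zeros_(module.bias)

model.apply(init_weights)   # recursively applies init_weights to every submodule
```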

why?

You may already have heard that you can initialize neural network weights with "small random numbers", but it is not that simple. All of the initializations above are based on complex and detailed mathematics, which explains why they work so well. Even more importantly, other neural network components are built around these initializations and tested against them empirically - using your own initialization may make it harder to reproduce other researchers' results.

Also need to pay attention

Other layers may also need careful initialization. Network biases are initialized to zero, while other, more complex layers (such as parametric activation functions) may come with their own initializations, which are just as important to get right.

The neural network you use is too deep

Problem Description

Is deeper always better? Well, not always... Deeper neural networks are generally better when we are chasing benchmarks and trying to improve accuracy on some task by another 1% here and there. But if your small 3-to-5-layer network is not learning anything, then I can assure you that a 100-layer network will do no better, and likely worse.

How to solve it?

Start with a shallow neural network of 3 to 8 layers. Only once your network is running and learning something should you explore ways to improve accuracy and try making the network deeper.

why?

Over the past decade, the improvements in neural networks have come from small, fundamental changes that apply just as much to the performance of smaller networks as to deeper ones. If your network does not work, the problem is more likely something other than depth.

Also need to pay attention

Starting with a small network also means faster training, faster inference, and faster iteration over different designs and settings. Initially, all of these things will matter far more for accuracy than simply stacking more layers.

The number of hidden units used is incorrect

Problem Description

In some cases, using too many or too few hidden units can make the network hard to train. With too few hidden units, the network may not have the capacity to express the required task; with too many, it may become slow and unwieldy to train, and residual noise may be hard to eliminate.

How to solve it?

Start with somewhere between 256 and 1024 hidden units. Then look at how many hidden units other researchers use for similar applications and take that as a reference. If other researchers use numbers very different from yours, there may be a specific reason for it.

why?

When deciding on the number of hidden units, the key is to consider roughly the minimum number of real values you think are needed to convey the information required for the task. You should then scale this number up a bit, to allow for dropout and to let the network use a more redundant representation. If you are doing classification, you can use five to ten times the number of classes; if you are doing regression, you may want two to three times the number of input or output variables. Of course, all of this is highly context-dependent, and there is no simple automatic solution - a good intuition is still the most important factor in deciding the number of hidden units.
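The rough rules of thumb above can be written down as a tiny, admittedly crude helper; it is purely illustrative, and the multipliers are taken straight from the heuristics in the text:

```python
def suggest_hidden_units(task, num_classes=None, num_inputs=None, num_outputs=None):
    """Crude starting-point heuristic: 5-10x the number of classes for
    classification, 2-3x the number of inputs/outputs for regression."""
    if task == "classification":
        return 5 * num_classes, 10 * num_classes
    if task == "regression":
        width = max(num_inputs, num_outputs)
        return 2 * width, 3 * width
    raise ValueError("task must be 'classification' or 'regression'")

print(suggest_hidden_units("classification", num_classes=50))            # (250, 500)
print(suggest_hidden_units("regression", num_inputs=30, num_outputs=3))  # (60, 90)
```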

Also need to pay attention

In practice, the number of hidden units usually has little effect on neural network performance compared with other factors, and in many cases overestimating the number required barely slows down training. Once your network is working, if you are still worried, just try many different numbers and measure the accuracy until you find the one that works best.
