Original question:

I am working on learning neural networks, and I am a bit unclear on the benefits of the cross-entropy loss function for multi-class image classification. I am hoping someone can help point me in the right direction. I am going to outline my thought process and what I think I know.

A simple way to measure loss is to take the difference between the prediction (passed through a sigmoid so all values are between 0 and 1) and my y_truth. In a single-class classifier (taken from the new fastai book by Sylvain and Jeremy on /fastai/fastbook) it looks like this:

    def mnist_loss(predictions, targets):
        return torch.where(targets == 1, 1 - predictions, predictions).mean()

For example, if we have 2 images that we are classifying into 3 classes, we may have something like pred = tensor(...). If I take predictions that are all between 0 and 1, why not just take the difference? The more confident the model is about a wrong class, the more it adds to the loss; the less confident it is about a correct class, the more that adds to the loss as well. We could even have one image with multiple classes by having the target have multiple columns set to 1 in the same row. That seems to do what we want, and the more we minimize it, the more closely the model matches the targets.

Why should I use cross-entropy rather than extending this? My understanding is that cross-entropy does something similar, but rather than only putting each prediction between 0 and 1, it turns each class prediction into a likelihood (so they all sum to 1). To me, this seems like it is doing roughly the same thing, just with an additional conversion, and I don't really understand why this would make it easier or faster for the model to train. My goal is to build an understanding of loss functions so that I can understand when and how they should be changed for specific problems. Can anyone point me in the right direction for where I need to either expand or correct my understanding?

Reply:

Basically, the reason using softmax (i.e. cross-entropy) is better for gradient descent is that additional information is baked into the loss function: if the probability of one class increases, the probabilities of one or more of the other classes decrease. This makes sense when each image has a single class, and it makes the space you need to search smaller. Softmax passes through information in a way that says there is only one label per image, since the outputs all sum to 1. If that is not true, then softmax should not be used. This is why, for multi-label problems, cross-entropy should be replaced with binary cross-entropy, which uses a sigmoid instead of a softmax for exactly this reason.

Follow-up from the original poster:

I have been looking, and your response helped a ton. I am going to summarize what I think I learned. In addition to what SamJoel said, these two quotes from /fastai/fastbook proved very helpful to me. I had read them before, but they only made sense after what SamJoel said:

"When we first take the softmax, and then the log likelihood of that, that combination is called cross-entropy loss. In PyTorch, this is available as nn.CrossEntropyLoss (which, in practice, actually does log_softmax and then nll_loss)."

"Note that because we have a one-hot-encoded dependent variable, we can't directly use nll_loss or softmax (and therefore we can't use cross_entropy): softmax, as we saw, requires that all predictions sum to 1, and tends to push one activation to be much larger than the others (due to the use of exp); however, we may well have multiple objects that we're confident appear in an image, so restricting the maximum sum of activations to 1 is not a good idea. By the same reasoning, we may want the sum to be less than 1, if we don't think any of the categories appear in an image."

Another reply:

As I understand it, many different functions can serve as a loss function. It only needs to penalize the wrong guesses, reward the correct ones, and be mostly differentiable, so the function you posted above would work fine. Each loss function has its own justifications and characteristics, for example how sharply it separates classes and how much it penalizes outliers. Cross-entropy loss is the loss function most often used in machine learning; its strength is that it measures the divergence between an unknown probability distribution and a predicted distribution. I have used hinge loss and found it trained to higher accuracy for one problem. See this article for examples of standard loss functions. I am not a real mathematician, so experts, feel free to make corrections.
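A few PyTorch sketches follow to make the discussion above concrete. First, a minimal sketch of the difference-based loss from the question, applied to the 2-image, 3-class setup. The tensor values are made up for illustration, and the predictions are assumed to already be sigmoid outputs between 0 and 1.

```python
import torch

# Two images, three classes. These values are illustrative only.
# Each row of `targets` marks the class that is present with a 1.
pred = torch.tensor([[0.9, 0.4, 0.2],
                     [0.3, 0.8, 0.1]])
targets = torch.tensor([[1., 0., 0.],
                        [0., 1., 0.]])

def mnist_loss(predictions, targets):
    # Where the target is 1, the loss is (1 - prediction);
    # everywhere else the loss is the prediction itself.
    return torch.where(targets == 1, 1 - predictions, predictions).mean()

print(mnist_loss(pred, targets))  # tensor(0.2167): low, since predictions mostly match targets
```

As the question notes, the same shape of target with more than one column set to 1 in a row would also describe a multi-label image.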
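On the first fastbook quote, here is a small sketch (with assumed logits and integer class indices as targets) showing that PyTorch's cross-entropy is log-softmax followed by negative log likelihood:

```python
import torch
import torch.nn.functional as F

# Raw, unnormalized model outputs (logits) for two images and three classes,
# plus the index of the correct class for each image. Illustrative values.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.5,  0.3]])
targets = torch.tensor([0, 1])

# The two steps done by hand: softmax in log space, then negative log likelihood.
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# The combined version that nn.CrossEntropyLoss / F.cross_entropy provides.
combined = F.cross_entropy(logits, targets)

print(manual, combined)  # the two values are identical
```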
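And on the second quote, a sketch of why the sum-to-1 constraint of softmax is a poor fit when an image can contain more than one class, and why a sigmoid with binary cross-entropy is used instead. The logits and the multi-hot target are again illustrative values.

```python
import torch
import torch.nn.functional as F

# One image that genuinely contains class 0 AND class 1 (a multi-hot target).
logits = torch.tensor([[3.0, 2.5, -2.0]])
multi_hot = torch.tensor([[1., 1., 0.]])

# Softmax forces the outputs to sum to 1, so the two true classes compete
# and neither can get a probability near 1.
print(F.softmax(logits, dim=1))

# Sigmoid scores each class independently, so both true classes can be near 1.
print(torch.sigmoid(logits))

# Binary cross-entropy treats each class as its own yes/no question,
# which is why it replaces softmax cross-entropy for multi-label problems.
print(F.binary_cross_entropy_with_logits(logits, multi_hot))
```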