This paper investigates the relationship between the loss function, the type of regularization, and the resulting model sparsity of discriminatively trained multiclass linear models. The effects on sparsity of optimizing log loss are straightforward: L2 regularization produces very dense models while L1 regularization produces much sparser models. However, optimizing hinge loss yields more nuanced behavior. We give experimental evidence and theoretical arguments that, for a class of problems that arises frequently in natural-language processing, both L1- and L2-regularized hinge loss lead to sparser models than L2-regularized log loss, but less sparse models than L1-regularized log loss. Furthermore, we give evidence and arguments that for models with only indicator features, there is a critical threshold on the weight of the regularizer below which L1- and L2-regularized hinge loss tend to produce models of similar sparsity.
Index Terms: regularization, hinge loss, support vector machines, SVMs, sparsity