Deep Averaging Networks (DAN)

Paper: Deep unordered methods rival syntactic methods for text classification


Natural language processing applications have progressed from representing words using sparse one-hot vectors to using dense low-dimensional vectors. These low dimensional vectors are referred to as word embeddings.

If we can model words using word embeddings, how should we model sentences and documents? To achieve this, we need a compositionality function that composes the embeddings of different words into a single vector that best characterizes the composition. Compositionality functions can be broadly described into two types - models that don’t take into account the order of words and syntactic structure and models that do.

Models that don’t take word order or syntactic structure into account are called Bag of Word (BOW) models. One would expect that BOW models should be inferior to syntactic models since word order and syntax are highly informative elements. However, the previous statement is predicated on the ability of the model to efficiently incorporate syntax/word order. There is also the issue of computation time, where BOW models are more efficient.

This paper shows that we should not discard BOW models so easily. According to the results presented in this paper, unordered compositionality functions outperform or perform as well as syntactic compositionality functions for certain tasks like sentiment analysis and question answering.

The authors of this paper introduce a model called Deep Averaging Networks (which essentially means BOW model with averaged inputs) The DAN model works as follows:

  1. The word embeddings of the input sequence are averaged.
  2. The resulting input vector is then passed through a neural network with hidden layer(s)
  3. Classification is performed on the final layer’s representation (through a softmax)
  4. A dropout like regularization called ‘word dropout’ which randomly drops a percentage of the words before calculating the vector average of the embeddings is used

The authors show highly competitve results on tasks like sentiment analysis and question answering when compared to state-of-the-art techniques like recurrent neural networks.

Key takeaways from this paper:

  1. Adding non-linearities to the model is more important /equally important as taking into account syntax and word order.
  2. Randomly dropping a fixed percentage of tokens every iteration before calculating the embedding average is a form of regularization comparable to dropout.
  3. BOW models trained on out-of-domain data can perform better than syntactic models as the latter requires training and test data to have similar syntactic structures
  4. Pre-trained embeddings are preferred since they already capture some sentiment.
  5. Syntactically aware models make similar errors as BOW models, suggesting that current syntactic models are still deficient in accurately incorporating syntax.
  6. Averaging word embeddings work better than summing them up for the tasks considered.

This paper shows that more complex models need not always be ‘better’ models. Simple unordered models are still competitive, and should be part of any practitioner’s tookit, especially if computational resources are limited.

Why does averaging work?

For a number of problems, word ordering is simply not that important. The nonlinearities in the model simply represent unordered compositions of the input, which is enough to produce reasonable results on tasks like sentiment analysis. Also, it is actually possible to recover some ordering from averaged embeddings due to the underlying data distribution where certain words are more likely to occur before/after other words. Language is predictable, so some orderings are much more likely to occur than others. Suffice to say, this phenomenon is not yet fully understood by the research community.