Solving Fine-grained Entity Type Classification using Neural Networks

Paper: Neural Fine-grained Entity Classification with Hierarchy-Aware loss


Fine-grained Entity Type Classification (FETC) is an NLP task that aims to assign entity instances occurring in text to one or more entity types that are organized in a hierarchy. As an example, consider the sentence ‘In Guitar World’s poll, Hetfield was placed as the 19th greatest guitarist of all time’. The entity ‘Hetfield’ can be seen as belonging to the entity types ‘person’, ‘musician’, and ‘guitarist’, with these three types arranged in a hierarchy (guitarist being a subtype of musician and musician being a subtype of person).

A key challenge in solving the FETC task is the high cost associated with labeling a very large training corpus with fine-grained types. Thus, current methods rely heavily on distant supervision, where entity instances are labeled with all the types in the knowledge graph that are associated with that entity. This can lead to noisy training data. Specifically, this paper identifies two sources of noise - Out-of-context labels and overly-specific labels.

Consider the sentence ‘Hetfield enjoys a variety of activities, most notably hunting; farming and beekeeping;’ This sentence in itself does not provide any evidence that the types ‘musician’ and ‘guitarist’ are associated with the entity. We can say that ‘musician’ and ‘guitarist’ are overly specific labels for this particular entity instance. Overly-specific labels can cause the model to be biased towards more fine-grained entity types. They are also ‘out-of-context’ labels, given that the context in which this entity appears has nothing to do with either ‘musician’ or ‘guitarist’.

Heuristic methods such as discarding training examples where multiple types are assigned to a single entity instance can hurt the accuracy of the model. Thus, a more robust method is desirable to deal with noisy labels. Current FETC systems also heavily rely on hand-crafted features derived from NLP tools, thus introducing its own source of noise.

This paper attempts to solve these issues by using a neural network model. It attempts to solve the problem of overly-specific labels by using a hierarchical loss function, so that penalties can vary depending on how far apart the ground truth and predicted labels are in the hierarchy. It attempts to mitigate the effect of out-of-context labels by using a variant of the cross-entropy loss function.


The training data for FETC consists of

  1. A set of extracted entity mentions from a corpus of text.
  2. The context for each entity mention (tokens surrounding the entity mention bounded by a context window).
  3. Candidate entity type sets that are automatically generated for each entity.

The authors conduct their experiments on two standard datasets:

  1. FIGER - which contains sentences from Wikipedia with entity assignment automatically generated using distant supervision, mapping Wikipedia identifiers to types in Freebase, a knowledge base.
  2. OntoNotes - which contains sentences from the OntoNotes text corpus, a collection of newswire articles. DBpedia spotlight, a tool for automatic annotation, was used to map entity mentions in these sentences to types in Freebase.

Model Inputs and their representations:

The input features to the model are the entity mentions and the context surrounding them. They are represented as follows:

Entity representation:

GloVe embeddings of each of the tokens in the entity are averaged to form an entity vector. In order to capture more semantic information about the entity mention, the token immediately preceding and following the entity mention is added to the entity mention and passed through an LSTM. The final output of the LSTM is then concatenated with the entity vector that was calculated by averaging word embeddings.

Entity context representation:

The entity context input representation for each token in the context is the concatenation of the word embedding for the token and a word position embedding, based on the distance of the token from the entity mention. The resulting vector formed by the concatenation of the representations of all tokens in the context is then passed through a bidirectional LSTM with attention, in order for the model to be able to pick the most informative words in the context.

The mention representation and the context representation form the overall input representation. The input is then passed through a softmax classifier which selects the best entity type that fits the entity mention given its context.

In order to handle data with out-of-context noise, a variant of the cross-entropy loss function is used. The authors only take into account the entity type with the maximum probability and ignore other entity types present in ground truth labels. In order to handle data with overly-specific noise, hierarchical loss normalization is used, which adjusts the probability calculation function by adding a tunable penalty factor.

Key takeaways from the paper:

  1. Yet another task where bidirectional LSTM’s with attention shows its power.
  2. Using hierarchical loss normalization can be used to deal with overly-specific labels in the training data (although this seems obvious)
  3. A simple variant of the cross-entropy loss function can be used to deal with out-of-context labels in the training data