Relation extraction using Joint Bootstrapping Machines

Paper: Joint Bootstrapping Machines for High Confidence Relation Extraction


This paper proposes a method for effective semi-supervised bootstrapping in the context of relation extraction. Deep learning techniques have shown promise in solving entity extraction and relation extraction tasks. However, these techniques require a large amount of hand-labeled data, which is expensive and time-consuming to produce. Hence, semi-supervised bootstrapping techniques are used to simplify and automate extraction from the corpus.

Bootstrapping methods start off with an initial set of positive and negative seed instances (say, examples of entity pairs for the target relation). Occurrences of positive seed instances in the corpus are converted into extraction patterns, which are used to extract new entity pairs that augment the initial seed set. This process is repeated iteratively.

Since this is an iterative process that builds upon previous extractions, false positives can have a highly detrimental effect. False positives added during a particular iteration can affect extractions in all subsequent iterations, thus leading to semantic drift.

This paper tries to solve this problem by proposing a new bootstrapping method that aims to reduce false positives by using rigorous confidence measures and using a combination of two types of seed sets.


For a target relation like ‘X acquired Y’, the goal is to extract all entity pairs in the corpus for which this relationship holds true. The algorithm requires an initial set of positive and negative seeds. Seeds can be entity pairs, templates, or both. A template consists of three vectors: the first represents the text before the relation occurrence, the second represents the text between the two entities participating in the relation, and the third represents the text after the relation occurrence. An instance refers to the combination of an entity pair and its template (context).

The first step is to extract a set of instances from the corpus. The extracted instances are compared to the seeds according to a similarity measure. Instances that pass the similarity threshold are collected and clustered. This process is repeated for three hops (i.e., at the next hop, all instances similar to a hop-1 instance are added, and so on).
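The hop-based collection can be sketched as follows. This is only a minimal illustration, not the paper's implementation: instances are reduced to plain vectors, cosine similarity stands in for the similarity measure, and the names `cosine` and `collect_hops` and the threshold value are assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def collect_hops(seeds, candidates, threshold, hops=3):
    """Collect candidates similar to the seeds (hop 1), then candidates
    similar to the hop-1 matches (hop 2), and so on."""
    collected = []
    frontier = list(seeds)
    remaining = list(candidates)
    for _ in range(hops):
        matched = [c for c in remaining
                   if any(cosine(c, f) >= threshold for f in frontier)]
        if not matched:
            break
        collected.extend(matched)
        remaining = [c for c in remaining if c not in matched]
        frontier = matched  # the next hop compares against the new matches
    return collected


# Toy vectors (hypothetical): the second candidate is only reachable via hop 2.
seeds = [(1.0, 0.0)]
candidates = [(0.9, 0.1), (0.7, 0.5), (0.0, 1.0)]
collected = collect_hops(seeds, candidates, threshold=0.85)
```

In this toy run the second candidate is not similar enough to the seed itself but is similar to a hop-1 match, which is exactly the mechanism that increases recall and also opens the door to semantic drift.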

Each collected instance is added to the extracted set only if it passes a confidence assessment. For an instance, the confidence of a single extractor is defined as the product of the reliability of the cluster and the similarity of the instance to the cluster. The similarity of an instance to a cluster is defined as the maximum similarity between the instance and any member of the cluster. The reliability of the cluster is an instance-independent score.
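As a small sketch of this definition, the snippet below computes the confidence as reliability times the maximum similarity to any cluster member. The Jaccard similarity over token sets, the function names, and the reliability value of 0.9 are assumptions chosen for illustration; the similarity measure is left abstract in the summary above.

```python
def jaccard(a, b):
    """Jaccard similarity between two token collections (assumed measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_similarity(instance, cluster):
    """Maximum similarity between the instance and any cluster member."""
    return max(jaccard(instance, member) for member in cluster)


def extractor_confidence(instance, cluster, reliability):
    """Confidence = cluster reliability * instance-to-cluster similarity."""
    return reliability * cluster_similarity(instance, cluster)


# Toy cluster of context token sets (hypothetical data).
cluster = [{"acquired"}, {"acquired", "by", "the"}]
confidence = extractor_confidence({"acquired", "by"}, cluster, reliability=0.9)
```

Here the best-matching member shares two of three tokens with the instance, so the similarity is 2/3 and the confidence is about 0.6.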

One of the factors that affect the reliability score of a cluster is the ratio of the number of instances in the cluster that match negative seeds to the number of instances in the cluster that match positive seeds. If the ratio is close to zero, false positive extractions are less likely than true positive extractions. Conversely, a high likelihood of false positive extractions reduces the reliability score of the cluster.
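To make the effect of this ratio concrete, here is a toy sketch. Only the ratio itself comes from the description above; `toy_reliability` is a hypothetical stand-in that merely decreases as the ratio grows, and it should not be read as the paper's reliability formula, which includes other factors.

```python
def negative_to_positive_ratio(n_pos, n_neg):
    """Ratio of cluster instances matching negative seeds to those
    matching positive seeds."""
    if n_pos == 0:
        return float("inf")  # no positive evidence at all
    return n_neg / n_pos


def toy_reliability(n_pos, n_neg):
    """Hypothetical stand-in score in [0, 1]: 1 / (1 + ratio).
    An assumption for illustration, not the paper's formula."""
    return 1.0 / (1.0 + negative_to_positive_ratio(n_pos, n_neg))


clean_cluster = toy_reliability(n_pos=9, n_neg=1)   # ratio close to zero
noisy_cluster = toy_reliability(n_pos=2, n_neg=6)   # ratio of 3
```

Under this stand-in, the cluster with few negative matches scores far higher than the noisy one, which is the qualitative behavior described above.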

All instances that pass the confidence assessment are added to the extracted set and the algorithm is repeated for k iterations.

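The number of iterations and the similarity threshold trade off against each other: a lower threshold admits more instances per iteration, while a stricter threshold generally needs more iterations to reach comparable coverage.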

Why use both entity and template seeds?

In this algorithm, instance matching is disjunctive, i.e., an instance need only match either an entity pair seed or a template seed, which gives a higher hit rate in terms of matched instances. Using both entity pair seeds and template seeds can also act as a confidence booster, with instances matching both kinds of seeds getting a higher confidence score.
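Disjunctive matching can be sketched as below. This is an illustrative sketch under assumptions: templates are reduced to token sets, Jaccard similarity with a threshold of 0.5 decides template matches, and the function name `match_instance` and the seed data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two token collections (assumed measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_instance(entity_pair, template_tokens, seed_pairs, seed_templates,
                   threshold=0.5):
    """Return (matched, matched_both). An instance matches if EITHER its
    entity pair is a seed OR its template is similar to a seed template."""
    entity_hit = entity_pair in seed_pairs
    template_hit = any(jaccard(template_tokens, t) >= threshold
                       for t in seed_templates)
    return entity_hit or template_hit, entity_hit and template_hit


# Toy seeds (hypothetical data).
seed_pairs = {("Google", "YouTube")}
seed_templates = [{"acquired"}]

by_entity = match_instance(("Google", "YouTube"), {"bought"},
                           seed_pairs, seed_templates)
by_template = match_instance(("Facebook", "Instagram"), {"acquired"},
                             seed_pairs, seed_templates)
by_both = match_instance(("Google", "YouTube"), {"acquired"},
                         seed_pairs, seed_templates)
```

The second flag marks instances that match both seed types; those are the ones that would receive the confidence boost.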

Solving Fine-grained Entity Type Classification using Neural Networks

Paper: Neural Fine-grained Entity Classification with Hierarchy-Aware loss


Fine-grained Entity Type Classification (FETC) is an NLP task that aims to assign entity instances occurring in text to one or more entity types that are organized in a hierarchy. As an example, consider the sentence ‘In Guitar World’s poll, Hetfield was placed as the 19th greatest guitarist of all time’. The entity ‘Hetfield’ can be seen as belonging to the entity types ‘person’, ‘musician’, and ‘guitarist’, with these three types arranged in a hierarchy (guitarist being a subtype of musician and musician being a subtype of person).
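A type hierarchy like this one can be represented as a simple child-to-parent map. The sketch below is illustrative only; the `PARENT` dictionary and the `type_path` helper are names assumed for the example.

```python
# Child -> parent map for the example hierarchy; None marks the root.
PARENT = {"person": None, "musician": "person", "guitarist": "musician"}


def type_path(entity_type, parent=PARENT):
    """Return the path from the root type down to `entity_type`."""
    path = []
    current = entity_type
    while current is not None:
        path.append(current)
        current = parent[current]
    return list(reversed(path))


path = type_path("guitarist")
```

Walking up from ‘guitarist’ yields the full path person, musician, guitarist, which is the set of types the example entity belongs to.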

A key challenge in solving the FETC task is the high cost associated with labeling a very large training corpus with fine-grained types. Thus, current methods rely heavily on distant supervision, where entity instances are labeled with all the types in the knowledge graph that are associated with that entity. This can lead to noisy training data. Specifically, this paper identifies two sources of noise: out-of-context labels and overly-specific labels.

Consider the sentence ‘Hetfield enjoys a variety of activities, most notably hunting, farming and beekeeping’. This sentence in itself does not provide any evidence that the types ‘musician’ and ‘guitarist’ are associated with the entity. We can say that ‘musician’ and ‘guitarist’ are overly-specific labels for this particular entity instance. Overly-specific labels can cause the model to be biased towards more fine-grained entity types. They are also ‘out-of-context’ labels, given that the context in which this entity appears has nothing to do with either ‘musician’ or ‘guitarist’.
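To see how distant supervision produces this kind of noise, consider the toy sketch below. The knowledge base `KB_TYPES` and the helper `distant_labels` are hypothetical; the point is only that labels are looked up per entity, without looking at the sentence.

```python
# Toy knowledge base mapping an entity to all of its known types (hypothetical).
KB_TYPES = {"Hetfield": ["person", "musician", "guitarist"]}


def distant_labels(mention, kb=KB_TYPES):
    """Label a mention with every type the knowledge base lists for it,
    regardless of the sentence it appears in."""
    return list(kb.get(mention, []))


sentences = [
    "In Guitar World's poll, Hetfield was placed as the 19th greatest "
    "guitarist of all time",
    "Hetfield enjoys a variety of activities, most notably hunting, "
    "farming and beekeeping",
]
labeled = [(sentence, distant_labels("Hetfield")) for sentence in sentences]
```

Both sentences receive the same three labels, even though only the first one supports ‘musician’ and ‘guitarist’.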

Heuristic methods such as discarding training examples where multiple types are assigned to a single entity instance can hurt the accuracy of the model. Thus, a more robust method is desirable to deal with noisy labels. Current FETC systems also rely heavily on hand-crafted features derived from NLP tools, which introduce their own noise.

This paper attempts to solve these issues by using a neural network model. It attempts to solve the problem of overly-specific labels by using a hierarchical loss function, so that penalties can vary depending on how far apart the ground truth and predicted labels are in the hierarchy. It attempts to mitigate the effect of out-of-context labels by using a variant of the cross-entropy loss function.


The training data for FETC consists of:

  1. A set of extracted entity mentions from a corpus of text.
  2. The context for each entity mention (tokens surrounding the entity mention bounded by a context window).
  3. Candidate entity type sets that are automatically generated for each entity.
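As an illustration of the second item, the sketch below cuts a context window around a mention. The function name `mention_context`, the whitespace tokenization, and the window size of 3 are assumptions for the example, not values taken from the paper.

```python
def mention_context(tokens, start, end, window):
    """Return up to `window` tokens on each side of the mention
    occupying tokens[start:end]."""
    left = tokens[max(0, start - window):start]
    right = tokens[end:end + window]
    return left, right


tokens = ("In Guitar World's poll , Hetfield was placed as the 19th "
          "greatest guitarist").split()
start = tokens.index("Hetfield")
left, right = mention_context(tokens, start, start + 1, window=3)
```

With a window of 3, the left context is the three tokens before ‘Hetfield’ and the right context is the three tokens after it; windows are truncated at sentence boundaries.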

The authors conduct their experiments on two standard datasets:

  1. FIGER, which contains sentences from Wikipedia, with entity types assigned automatically using distant supervision by mapping Wikipedia identifiers to types in Freebase, a knowledge base.
  2. OntoNotes, which contains sentences from the OntoNotes text corpus, a collection of newswire articles. DBpedia Spotlight, a tool for automatic annotation, was used to map entity mentions in these sentences to types in Freebase.

Model Inputs and their representations:

The input features to the model are the entity mentions and the context surrounding them. They are represented as follows:

Entity representation:

GloVe embeddings of the tokens in the entity mention are averaged to form an entity vector. In order to capture more semantic information about the entity mention, the tokens immediately preceding and following the mention are added to it, and the extended mention is passed through an LSTM. The final output of the LSTM is then concatenated with the entity vector that was calculated by averaging word embeddings.
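The two parts of the entity representation can be sketched as below. This is a rough illustration under assumptions: the 2-dimensional embeddings in `EMB` are made up rather than real GloVe vectors, the LSTM itself is not implemented (its final output is passed in as a placeholder), and all function names are hypothetical.

```python
# Made-up 2-dimensional embeddings standing in for GloVe vectors.
EMB = {"James": [1.0, 0.0], "Hetfield": [0.0, 1.0]}


def average_embedding(mention_tokens, emb=EMB):
    """Average the word embeddings of the mention tokens."""
    vectors = [emb[token] for token in mention_tokens]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def extended_mention(tokens, start, end):
    """Mention tokens[start:end] plus one token on each side: the
    sequence that would be fed to the LSTM."""
    return tokens[max(0, start - 1):end + 1]


def entity_representation(avg_vector, lstm_final_output):
    """Concatenate the averaged vector with the LSTM's final output."""
    return avg_vector + lstm_final_output


tokens = ["guitarist", "James", "Hetfield", "was", "placed"]
avg = average_embedding(tokens[1:3])
lstm_input = extended_mention(tokens, 1, 3)
# Placeholder for the LSTM's final output; a real model would compute this.
representation = entity_representation(avg, [0.0, 0.0])
```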

Entity context representation:

The input representation for each token in the context is the concatenation of the word embedding for the token and a word position embedding, based on the distance of the token from the entity mention. The resulting sequence of token representations is then passed through a bidirectional LSTM with attention, so that the model can pick out the most informative words in the context.
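The sketch below illustrates the two ingredients mentioned here: a relative position for each token and attention-weighted pooling over per-token states. It is an assumption-laden simplification: how distances are indexed and how attention scores are computed are not specified in this summary, so the scores are simply passed in, and the function names are hypothetical.

```python
import math


def relative_positions(num_tokens, start, end):
    """Signed distance of each token from the mention tokens[start:end];
    tokens inside the mention get 0."""
    positions = []
    for i in range(num_tokens):
        if i < start:
            positions.append(i - start)
        elif i >= end:
            positions.append(i - end + 1)
        else:
            positions.append(0)
    return positions


def attention_pool(hidden_states, scores):
    """Softmax the scores into weights and return (weights, weighted sum
    of the hidden states)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    pooled = [sum(w * h[i] for w, h in zip(weights, hidden_states))
              for i in range(dim)]
    return weights, pooled


positions = relative_positions(5, start=2, end=3)
weights, pooled = attention_pool([[1.0, 0.0], [0.0, 1.0]], scores=[0.0, 0.0])
```

In a real model the hidden states would come from the bidirectional LSTM and the scores from a learned scoring function; tokens with higher scores contribute more to the pooled context vector.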

The mention representation and the context representation form the overall input representation. The input is then passed through a softmax classifier which selects the best entity type that fits the entity mention given its context.

In order to handle data with out-of-context noise, a variant of the cross-entropy loss function is used. The authors only take into account the entity type with the maximum probability and ignore other entity types present in ground truth labels. In order to handle data with overly-specific noise, hierarchical loss normalization is used, which adjusts the probability calculation function by adding a tunable penalty factor.
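Both ideas can be sketched in a few lines. The loss below follows the description above: among the ground-truth candidate types, only the one with the highest predicted probability contributes. The normalization is one plausible reading, assumed here rather than taken from this summary: each type's probability is increased by `beta` times the summed probabilities of its ancestors, and the result is renormalized. All names and numbers are illustrative.

```python
import math

# Child -> parent map for the example hierarchy (root has no entry).
PARENT = {"musician": "person", "guitarist": "musician"}


def noisy_label_loss(probs, candidate_types):
    """Negative log-probability of the most likely ground-truth type;
    the other candidate types are ignored."""
    best = max(candidate_types, key=lambda t: probs[t])
    return -math.log(probs[best])


def hierarchical_normalize(probs, parent=PARENT, beta=0.5):
    """Assumed form: p'(t) = p(t) + beta * sum of ancestor probabilities,
    followed by renormalization so the values sum to 1."""
    adjusted = {}
    for entity_type, p in probs.items():
        ancestor_sum = 0.0
        current = parent.get(entity_type)
        while current is not None:
            ancestor_sum += probs[current]
            current = parent.get(current)
        adjusted[entity_type] = p + beta * ancestor_sum
    total = sum(adjusted.values())
    return {t: v / total for t, v in adjusted.items()}


probs = {"person": 0.5, "musician": 0.3, "guitarist": 0.2}
loss = noisy_label_loss(probs, ["musician", "guitarist"])
normalized = hierarchical_normalize(probs, beta=0.5)
```

With `beta=0` the probabilities are unchanged, so the hyperparameter controls how strongly related types along the same path share probability mass.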

Key takeaways from the paper:

  1. Yet another task where bidirectional LSTMs with attention show their power.
  2. Hierarchical loss normalization can be used to deal with overly-specific labels in the training data (although this seems obvious).
  3. A simple variant of the cross-entropy loss function can be used to deal with out-of-context labels in the training data.