Modeling applause in campaign speeches

Paper: Please Clap: Modeling Applause in Campaign Speeches

Introduction

This paper attempts to build a model for predicting audience applause during speeches.

The audience of public speeches can be seen as participating in a coordination game. Each member of the audience must make a split-second decision whether to applaud at a particular utterance. It is socially undesirable if the member applauds when no one else does. Therefore, each member has to utilize knowledge and awareness of social dynamics and speech content to judge whether the rest of the audience will applaud along with them.

The authors focus on three major types of factors that can influence applause: the content of the speech itself, the delivery of the speech (speaker pitch, silence duration, gaze, etc.), and the design of the speech (rhetorical devices used to induce applause).

This paper deals with modeling applause for a particular category of public speaking events: campaign speeches. Campaign speeches are well suited to this type of analysis for several reasons. Speakers have a vested interest in eliciting applause and are thus more likely to employ rhetorical techniques to induce it. Campaign speeches are usually self-contained and can encompass a complete rhetorical strategy. They are also delivered in front of a partisan crowd, which means applause and cheers are welcomed.

A key concept introduced in this paper is the rhetorical strategy of tension and release. Building up tension and subsequently releasing it is an oft-used strategy in literature, film, and music (suspense in literature, for example). The authors investigate whether a tension-release model is applicable to campaign speeches as well.

Methodology:

The authors have released a new dataset of text and audio from speeches at campaign events leading up to the 2016 US presidential election, annotated with markers denoting audience applause. The dataset includes the speech audio along with the corresponding closed-caption transcriptions.

An acoustic model trained using PennSound, a collection of poetry readings, is used to distinguish speech from applause. This model is run over the audio data to automatically generate applause tags. To match the detected applause with the speech text, forced alignment of the audio with the text is run using the Kaldi toolkit. Words for which forced alignment failed are simply discarded.

The text is then segmented into a series of utterances, where an utterance is defined as a sequence of words bounded by periods of silence that exceed a threshold. This segmentation lends itself naturally to binary classification, since applause usually happens at pauses between utterances. Moreover, since the closed-caption text doesn’t contain punctuation, sentence boundary detectors would be ineffective.

Each utterance is thus tagged with a positive or negative label: positive if there was applause within 1.5 seconds of the end of the utterance, and negative otherwise.

Different speakers speak at different speeds, hence the threshold may have to be adjusted accordingly.
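The segmentation and labeling steps above can be sketched as follows. This is only an illustration, not the authors' code: the `Word` record, the function names, and the 0.7-second default silence threshold are hypothetical, and the sketch assumes that applause counts only if it starts at or after the end of the utterance.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds, from forced alignment
    end: float    # seconds, from forced alignment


def segment_utterances(words, silence_threshold=0.7):
    """Split aligned words into utterances at silences longer than the threshold.

    The 0.7 s default is a placeholder; the summary does not state the value.
    """
    utterances, current = [], []
    for word in words:
        if current and word.start - current[-1].end > silence_threshold:
            utterances.append(current)
            current = []
        current.append(word)
    if current:
        utterances.append(current)
    return utterances


def label_utterances(utterances, applause_starts, window=1.5):
    """Label an utterance True if applause starts within `window` seconds of its end."""
    labels = []
    for utterance in utterances:
        end = utterance[-1].end
        labels.append(any(0.0 <= t - end <= window for t in applause_starts))
    return labels
```

Since speaking rates differ, `silence_threshold` is exposed as a parameter so that it can be tuned per speaker.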

Features:

LIWC categories: LIWC (Linguistic Inquiry and Word Count) is a dictionary of words annotated with syntactic and semantic categories they belong to.

Euphony: Euphony refers to the sounds of words or phrases that are aesthetically pleasing. Features that are used to identify euphonic words are:

  1. Plosives - Consonants whose pronunciation requires blocking the vocal tract so that all airflow ceases. The plosive score for an utterance is the ratio of plosive sounds in the utterance to the total number of phonemes in the utterance.
  2. Rhyme - Rhymes are repeating patterns of similar sounds. The rhyme score for an utterance is the number of repeated sounds at the beginning or end of words in the utterance.
  3. Alliteration - Alliteration is a stylistic device where a series of words have the same first consonant sound (for example, ‘She sells sea-shells by the sea shore’). The alliteration score for an utterance is the ratio of the number of repeated prefix phonemes in the utterance to the total number of phonemes.
  4. Homogeneity - The homogeneity score is a measure of the distinct phonemes in an utterance.
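As a rough illustration of the two ratio-based scores in the list above, the sketch below computes a plosive score and an alliteration score. It assumes each word has already been converted to a list of ARPAbet-style phoneme symbols without stress markers, and it treats a word-initial phoneme as "repeated" if it also begins another word in the same utterance. Both choices are assumptions made for this example and may differ from the paper's exact definitions.

```python
from collections import Counter

# ARPAbet symbols for the English plosives (stop consonants).
PLOSIVES = {"P", "B", "T", "D", "K", "G"}


def plosive_score(words):
    """Ratio of plosive phonemes to all phonemes in the utterance."""
    phonemes = [p for word in words for p in word]
    if not phonemes:
        return 0.0
    return sum(p in PLOSIVES for p in phonemes) / len(phonemes)


def alliteration_score(words):
    """Ratio of repeated word-initial phonemes to all phonemes in the utterance."""
    total = sum(len(word) for word in words)
    if total == 0:
        return 0.0
    initials = Counter(word[0] for word in words if word)
    repeated = sum(count for count in initials.values() if count > 1)
    return repeated / total
```

For instance, "she sells sea shells" becomes `[["SH", "IY"], ["S", "EH", "L", "Z"], ["S", "IY"], ["SH", "EH", "L", "Z"]]`, where all four initial phonemes are repeated out of twelve phonemes in total.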

Lexical: Bigrams that occur more than 5 times in the text are included as features in the model.
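A minimal sketch of this bigram filtering, assuming utterances are already tokenized into lowercase words and that each retained bigram becomes a binary presence feature (the encoding is an assumption; the summary only states the frequency cutoff). The function names are hypothetical.

```python
from collections import Counter


def bigrams(tokens):
    """Return the adjacent word pairs of a token list."""
    return list(zip(tokens, tokens[1:]))


def build_bigram_vocab(utterances, min_count=5):
    """Keep bigrams that occur more than `min_count` times across all utterances."""
    counts = Counter(bg for tokens in utterances for bg in bigrams(tokens))
    return sorted(bg for bg, count in counts.items() if count > min_count)


def bigram_features(tokens, vocab):
    """Binary indicator vector: 1 if the vocabulary bigram appears in the utterance."""
    present = set(bigrams(tokens))
    return [1 if bg in present else 0 for bg in vocab]
```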

Embeddings: Sentence embeddings learned from a CNN are used. The authors use a Skip-Thought model.

Acoustic features: Acoustic features include the max, min, mean, standard deviation, and range of the utterance’s pitch and energy.
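The five summary statistics can be sketched as below, assuming the pitch and energy tracks for an utterance are already available as non-empty lists of per-frame values. Extracting those tracks from audio is out of scope here, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def summarize_track(track, name):
    """Max, min, mean, standard deviation, and range of one per-frame track."""
    lowest, highest = min(track), max(track)
    return {
        f"{name}_max": highest,
        f"{name}_min": lowest,
        f"{name}_mean": mean(track),
        f"{name}_std": pstdev(track),  # population std; an assumption
        f"{name}_range": highest - lowest,
    }


def acoustic_features(pitch, energy):
    """Combine the pitch and energy summaries into one feature dictionary."""
    return {**summarize_track(pitch, "pitch"), **summarize_track(energy, "energy")}
```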

Repeated words: Some rhetorical strategies rely on repeating sub-units of speech. To accommodate this, the following features are used:

  1. Repeated words: This feature is calculated by taking the proportion of words in the current utterance that also occur in the previous utterance.
  2. Longest Common Subsequence: Repeating the same phrase at the start of successive sentences can be used to build tension. The paper provides this example:

‘We will not allow the party of Lincoln and Reagan to fall into the hands of a con artist. We will not allow the next president of the United States to be a socialist like Bernie Sanders. And we will not allow the next president of the United States to be someone under FBI investigation like Hillary Clinton.’ - Marco Rubio

To capture this, the longest common subsequence between the current utterance and the previous utterance is used as a feature.
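Both repetition features can be sketched in a few lines. The version below works at the word level and returns the length of the longest common subsequence; whether the paper operates on words or characters, and whether it normalizes the length, is not stated in this summary, so treat those details as assumptions.

```python
def repeated_word_ratio(current, previous):
    """Proportion of words in the current utterance that also occur in the previous one."""
    if not current:
        return 0.0
    previous_words = set(previous)
    return sum(word in previous_words for word in current) / len(current)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev_row = [0] * (len(b) + 1)
    for x in a:
        row = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                row.append(prev_row[j - 1] + 1)
            else:
                row.append(max(prev_row[j], row[j - 1]))
        prev_row = row
    return prev_row[-1]
```

Applied to shortened versions of the first two sentences of the Rubio example, "we will not allow the party" and "we will not allow the next president", the shared opening "we will not allow the" gives an LCS length of 5.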

Delta: Every feature in the model is accompanied by a delta feature, which measures the difference between the feature's value at time t and its value at time t-1. Delta features are informative because rhetorical theory suggests that highly similar or highly different neighboring utterances can indicate dramatic moments.
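A small sketch of how delta features might be computed, assuming each utterance is represented as a dictionary of numeric features in speech order. Setting the delta of the first utterance to zero, and treating a feature missing from the previous utterance as 0.0, are assumptions made for this example.

```python
def add_delta_features(rows):
    """Return copies of the feature rows with a delta_<name> entry for every feature.

    delta_<name> at time t is the feature value at t minus its value at t-1.
    The first row has no predecessor, so its deltas are set to 0.0 (an assumption).
    """
    enriched_rows = []
    previous = None
    for row in rows:
        enriched = dict(row)
        for name, value in row.items():
            prior = value if previous is None else previous.get(name, 0.0)
            enriched[f"delta_{name}"] = value - prior
        enriched_rows.append(enriched)
        previous = row
    return enriched_rows
```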

Features inspired by rhetorical structure theory: Rhetorical structure theory (RST) is a framework for understanding the relationships between components of a text and how they interact to form a meaningful whole. The basic unit is termed an EDU (Elementary Discourse Unit). EDUs are connected to each other by relationships, typically hierarchical. An RST parser is used to extract the RST structure of the text. The features used are:

  1. RST labels: For each utterance, all the relationships between the EDUs it encompasses are taken. The label consists of the type of relationship and its directionality.
  2. Rhetorical phrase closures: This feature gives the number of rhetorical phrases closed by the units contained in an utterance.

Logistic regression and LSTM models are then used to predict applause using these features.
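To make the classification step concrete, here is a bare-bones logistic regression trained with batch gradient descent on per-utterance feature vectors. It is a didactic stand-in rather than the authors' setup: the learning rate, epoch count, and absence of regularization are arbitrary choices for illustration, and the LSTM variant is not shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, weights, bias):
    """Predicted probability that the utterance is followed by applause."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic_regression(X, y, lr=0.1, epochs=1000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_samples, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, label in zip(X, y):
            error = predict_proba(x, weights, bias) - label
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias
```

In practice, an off-the-shelf implementation with regularization would be the more sensible choice; the loop above is only meant to show what "predicting applause from these features" amounts to.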

Analysis:

The authors report these features as most informative:

High pitch, high energy, and a broad pitch range are influential. Bigram features are generally useful, and are especially useful when the model is trained on the same speaker, since speakers use their trademark catchphrases. The strongest bigrams include moral declaratives such as ‘should not’ and ‘right to’. Other strong bigram features include politically charged topics and call-outs to the audience. The most informative LIWC categories are FOCUSFUTURE (will, gonna, going), BODY (heart, hands, brain), and REWARD (success). The RST relation categories ANTITHESIS and PURPOSE seem to be the most informative.

What applications can this research be useful for?

  1. Understanding social dynamics.
  2. Learning to give better talks.
  3. Automatically summarizing videos.
  4. Analyzing situations where applause was expected but not received.
  5. Discovering paid clappers.