Modelling, representing, & generating language

Outline and objectives

Modelling and generation are big topics

Language modelling - Building a probabilistic model of language. Language generation - Using templates and probabilistic models of language to create natural-sounding testing

Language Modelling

Probabilistic Language Models

Core problem of language modelling
Given a sequence of tokens, predict the next token

Difficult to predict what word is next/accurate, is it 'I like feeding pigeons' or 'I like feeding bludgeons'. It is difficult to observe as what if that sentence doesn't appear in our data set.

Each word i depends only on a restricted set of preceding words
Assumption: Language is a stochastic, memoryless process

ULM - We only care about the probabilities of individual words BLM - We care about each words preceding word

Maximum Likelihood Estimation and its problems

$P(w_i|w_{i-1}) = \frac{\#(w_{i-1},w_i)}{\#(w_{i-1})}$ is the Maximum Likelihood Estimate (MLE)

Sample from the Berkeley Restaurant Project Corpus

Smoothing

What if we apply our language model on a new sentence
- One novel bigram would pull the probability to 0
Additive smoothing or Laplace smoothing, when counting occurrences, add a quantity to the numerator, now everything is possible
Additive smoothing is rarely used in real language models
Modern methods are a lot more complex
Perplexity = inverse probability of the test dataset (normalised by number of words)

Conclusion

Smallest unit of thought is the n-gram
- Unigrams let us work with single words
- Bigrams let us work with pairs of consecutive words
- Trigrams let us work with triples of consecutive words
Language models let us compare the likelihood of documents

Language Generation

What is NLG

NLG is the subfield of ai and computational linguistics that focuses on computer systems that can produce understandable texts
Is the process of deliberately constructing a natural language text, in order to meet specified communicative goals

Multiple schools of NLG

Gap-filling systems - Simplest, no novelty
Rule-based systems - Expanded gap-filling system
Grammatical function - Complex template system, low novelty, medium error rate
Dynamic generation - Abstract representation, fully dynamic, high novelty, high error rate

NLP = Natural Language Understanding + Natural Language Generation

Stages of template-based NLG

Content Determination - what should we put in the text
Document Structuring - order of information, and grouping of sentences
Lexical Choice - the content words we use depend on genre, context, perception
Referring Expression Generation - Create of expressions referring to entities
Aggregation - Merging syntactic or conceptual constituents
Realisation - Building the actual output

Conclusion

NLG is on a spectrum from fixed gap-filling to completely dynamic generation
NLG needs to be evaluated on two criteria, more dynamic, the hard it is to convey communication goal
More fixed NLG systems are boring but safe
More free NLG systems are interesting but risky

Modelling, representing, & generating language

Outline and objectives​

Language Modelling​

Probabilistic Language Models​

Maximum Likelihood Estimation and its problems​

Smoothing​

Conclusion​

Language Generation​

What is NLG​

Multiple schools of NLG​

Stages of template-based NLG​

Conclusion​

Outline and objectives

Language Modelling

Probabilistic Language Models

Maximum Likelihood Estimation and its problems

Smoothing

Conclusion

Language Generation

What is NLG

Multiple schools of NLG

Stages of template-based NLG

Conclusion