Modelling, representing, & generating language
Outline and objectives
- Modelling and generation are big topics
Language modelling - Building a probabilistic model of language
Language generation - Using templates and probabilistic models of language to create natural-sounding text
Language Modelling
Probabilistic Language Models
- Core problem of language modelling
- Given a sequence of tokens, predict the next token
It is difficult to predict the next word accurately: is it 'I like feeding pigeons' or 'I like feeding bludgeons'? The probability is also hard to estimate by observation, because the sentence may never appear in our data set.
- Each word w_i depends only on a restricted set of preceding words
- Assumption: Language is a stochastic, memoryless process
Unigram language model (ULM) - We only care about the probabilities of individual words
Bigram language model (BLM) - We care about each word's preceding word
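To make the unigram/bigram distinction concrete, here is a minimal sketch that counts both over a toy sentence. The helper name `ngrams` and the example sentence are illustrative assumptions, not part of the lecture material.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy sentence, made up for illustration.
tokens = "i like feeding pigeons and i like feeding ducks".split()

unigram_counts = Counter(ngrams(tokens, 1))  # individual words
bigram_counts = Counter(ngrams(tokens, 2))   # each word with its preceding word

print(unigram_counts[("like",)])            # how often "like" occurs
print(bigram_counts[("like", "feeding")])   # how often "like feeding" occurs
```

The same helper gives trigrams with `n=3`; only the window size changes.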
Maximum Likelihood Estimation and its problems
For a bigram model, P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}) is the Maximum Likelihood Estimate (MLE)
Sample from the Berkeley Restaurant Project Corpus
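As a rough illustration of maximum likelihood estimation for a bigram model, the sketch below divides the count of a word pair by the count of its first word. The three-sentence corpus is made up (it is not taken from the Berkeley Restaurant Project Corpus), and the `<s>` / `</s>` sentence markers and the name `train_bigram_mle` are assumptions for this example.

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Return a function prob(word, prev) giving the MLE bigram probability."""
    context_counts = Counter()  # how often each word appears as a context
    bigram_counts = Counter()   # how often each (prev, word) pair appears
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1

    def prob(word, prev):
        if context_counts[prev] == 0:
            return 0.0  # context never observed, so no estimate is available
        return bigram_counts[(prev, word)] / context_counts[prev]

    return prob

# Toy corpus, made up for illustration.
corpus = ["i like feeding pigeons", "i like feeding ducks", "we like pigeons"]
p = train_bigram_mle(corpus)

print(p("feeding", "like"))       # 2 of the 3 occurrences of "like" precede "feeding"
print(p("bludgeons", "feeding"))  # never observed, so the estimate is 0.0
```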
Smoothing
What if we apply our language model to a new sentence?
- One novel bigram would pull the probability to 0
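The zero-probability problem can be seen in a small sketch: a sentence's probability under an unsmoothed bigram model is the product of its bigram probabilities, so a single unseen pair makes the whole product zero. The two-sentence corpus, the `<s>` / `</s>` markers, and the function name are assumptions made for this example.

```python
from collections import Counter

# Toy training corpus, made up for illustration.
corpus = ["i like feeding pigeons", "i like feeding ducks"]

context_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, word in zip(tokens, tokens[1:]):
        context_counts[prev] += 1
        bigram_counts[(prev, word)] += 1

def sentence_probability(sentence):
    """Multiply the unsmoothed MLE bigram probabilities along the sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    probability = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if context_counts[prev] == 0:
            return 0.0  # the context itself was never seen
        probability *= bigram_counts[(prev, word)] / context_counts[prev]
    return probability

seen = sentence_probability("i like feeding pigeons")
novel = sentence_probability("i like feeding bludgeons")
print(seen, novel)  # the novel bigram "feeding bludgeons" forces the second value to 0.0
```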
Additive smoothing (also called Laplace smoothing): when counting occurrences, add a small quantity to each count in the numerator (and the matching total to the denominator), so that every n-gram has a non-zero probability
Additive smoothing is rarely used in real language models
Modern methods are a lot more complex
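A minimal sketch of additive smoothing for a bigram model follows. It adds a constant `k` to every pair count and `k` times the vocabulary size to the denominator so the probabilities still sum to one. The corpus, the default `k=1.0`, and the name `train_additive_bigram` are assumptions; words outside the training vocabulary would need separate handling that is not shown here.

```python
from collections import Counter

def train_additive_bigram(sentences, k=1.0):
    """Return (prob, vocabulary) for an add-k smoothed bigram model; assumes k > 0."""
    context_counts, bigram_counts = Counter(), Counter()
    vocabulary = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocabulary.update(tokens[1:])  # every token that can be predicted
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    vocab_size = len(vocabulary)

    def prob(word, prev):
        # k is added to the pair count; k * vocab_size keeps the distribution normalised
        return (bigram_counts[(prev, word)] + k) / (context_counts[prev] + k * vocab_size)

    return prob, vocabulary

# Toy corpus, made up for illustration.
corpus = ["i like feeding pigeons", "i like feeding ducks"]
p, vocabulary = train_additive_bigram(corpus)

print(p("feeding", "like"))  # seen bigram: (2 + 1) / (2 + 6)
print(p("pigeons", "like"))  # unseen bigram: (0 + 1) / (2 + 6), no longer zero
```

With `k=1.0` this is the classic add-one variant; smaller values of `k` move less probability mass away from observed bigrams.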
Perplexity = inverse probability of the test dataset (normalised by number of words)
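The definition above can be sketched as the inverse of the test set's probability, raised to the power 1/N, computed in log space to avoid numerical underflow. The function name `perplexity` and the input format (one model probability per test token) are assumptions for this example.

```python
import math

def perplexity(token_probabilities):
    """Perplexity from the probabilities a model assigned to each test token."""
    if not token_probabilities:
        raise ValueError("need at least one token probability")
    if any(p == 0.0 for p in token_probabilities):
        return math.inf  # a zero-probability token makes the inverse unbounded
    n = len(token_probabilities)
    log_probability = sum(math.log(p) for p in token_probabilities)
    # exp(-(1/N) * log P) equals P ** (-1/N)
    return math.exp(-log_probability / n)

# A model that gives every token probability 1/4 is as uncertain as a
# uniform choice among 4 words, so its perplexity is close to 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Lower perplexity means the model found the test text less surprising, which is why it is used to compare language models on the same test data.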
Conclusion
- Smallest unit of thought is the n-gram
- Unigrams let us work with single words
- Bigrams let us work with pairs of consecutive words
- Trigrams let us work with triples of consecutive words
- Language models let us compare the likelihood of documents
Language Generation
What is NLG
- NLG is the subfield of AI and computational linguistics that focuses on computer systems that can produce understandable texts
- Is the process of deliberately constructing a natural language text, in order to meet specified communicative goals
Multiple schools of NLG
- Gap-filling systems - Simplest, no novelty
- Rule-based systems - Expanded gap-filling system
- Grammatical function - Complex template system, low novelty, medium error rate
- Dynamic generation - Abstract representation, fully dynamic, high novelty, high error rate
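The two simplest schools can be contrasted in a short sketch: a gap-filling system fills fixed slots in one sentence, while a rule-based system first applies a rule to pick which template to fill. The weather domain, the 30-degree threshold, and both function names are invented for this illustration.

```python
def gap_fill(slots):
    """Gap-filling: one fixed sentence with slots, so the form never varies."""
    return "Tomorrow will be {condition} with a high of {high} degrees.".format(**slots)

def rule_based(slots):
    """Rule-based: a rule chooses the template before the gaps are filled."""
    if slots["high"] >= 30:  # hypothetical threshold for a "hot" day
        template = "Expect a hot, {condition} day tomorrow, reaching {high} degrees."
    else:
        template = "Tomorrow will be {condition} with a high of {high} degrees."
    return template.format(**slots)

forecast = {"condition": "sunny", "high": 31}
print(gap_fill(forecast))
print(rule_based(forecast))
```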
NLP = Natural Language Understanding + Natural Language Generation
Stages of template-based NLG
- Content Determination - what should we put in the text
- Document Structuring - order of information, and grouping of sentences
- Lexical Choice - the content words we use depend on genre, context, perception
- Referring Expression Generation - Creation of expressions referring to entities
- Aggregation - Merging syntactic or conceptual constituents
- Realisation - Building the actual output
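The six stages above can be sketched as a chain of small functions, one per stage. This is a toy pipeline under heavy assumptions: the weather record, its field names, the word choices, and every function name are invented for illustration, and a real system would do far more at each step.

```python
def determine_content(record):
    """Content determination: keep only the fields worth reporting."""
    return {key: record[key] for key in ("city", "high", "rain_mm")}

def structure_document(content):
    """Document structuring: report temperature first, then rainfall."""
    return [("high", content["high"]), ("rain_mm", content["rain_mm"])]

def choose_words(field, value):
    """Lexical choice: map raw values to content words."""
    if field == "high":
        return "hot" if value >= 30 else "mild"
    return "dry" if value == 0 else "wet"

def refer_to(city, first_mention):
    """Referring expression generation: name first, pronoun afterwards."""
    return city if first_mention else "it"

def aggregate(subject, descriptions):
    """Aggregation: merge clauses sharing a subject into a single clause."""
    return f"{subject} will be {' and '.join(descriptions)}"

def realise(clause):
    """Realisation: capitalise and punctuate the final surface string."""
    return clause[0].upper() + clause[1:] + "."

def generate(record):
    content = determine_content(record)
    plan = structure_document(content)
    descriptions = [choose_words(field, value) for field, value in plan]
    subject = refer_to(content["city"], first_mention=True)
    return realise(aggregate(subject, descriptions))

# Hypothetical input record; "sensor_id" is dropped by content determination.
record = {"city": "Leeds", "high": 31, "rain_mm": 0, "sensor_id": "x17"}
print(generate(record))
```

Without the aggregation step the output would be two sentences ("Leeds will be hot. It will be dry."), which is where the pronoun branch of `refer_to` would come into play.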
Conclusion
- NLG is on a spectrum from fixed gap-filling to completely dynamic generation
- NLG needs to be evaluated on two criteria; the more dynamic the system, the harder it is to convey the communicative goal
- More fixed NLG systems are boring but safe
- More free NLG systems are interesting but risky