
Modelling, representing, & generating language

Outline and objectives

  • Modelling and generation are big topics

Language modelling - Building a probabilistic model of language
Language generation - Using templates and probabilistic models of language to create natural-sounding text

Language Modelling

Probabilistic Language Models

  • Core problem of language modelling
  • Given a sequence of tokens, predict the next token

It is difficult to predict the next word accurately: is it 'I like feeding pigeons' or 'I like feeding bludgeons'? It is also difficult to estimate these probabilities directly, since a given sentence may never appear in our dataset at all.

  • Assumption: language is a stochastic, memoryless (Markov) process
  • Consequence: each word $w_i$ depends only on a restricted set of preceding words

ULM (unigram language model) - we only care about the probabilities of individual words
BLM (bigram language model) - each word's probability is conditioned on its preceding word
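
Under the bigram (memoryless) assumption, the probability of a whole sequence factorises into per-word conditionals, with $w_0$ a start-of-sentence marker:

$$P(w_1, \dots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})$$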

Maximum Likelihood Estimation and its problems

$P(w_i \mid w_{i-1}) = \frac{\#(w_{i-1}, w_i)}{\#(w_{i-1})}$ is the Maximum Likelihood Estimate (MLE)
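
A minimal sketch of computing these MLE bigram probabilities from raw counts (the toy corpus and function name are illustrative, not from the source):

```python
from collections import Counter

def bigram_mle(sentences):
    """Estimate P(w_i | w_{i-1}) = #(w_{i-1}, w_i) / #(w_{i-1}) from raw counts."""
    context_counts = Counter()   # #(w_{i-1})
    bigram_counts = Counter()    # #(w_{i-1}, w_i)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return {(prev, w): count / context_counts[prev]
            for (prev, w), count in bigram_counts.items()}

probs = bigram_mle(["I like feeding pigeons", "I like tea"])
print(probs[("i", "like")])  # 1.0 - "like" follows "i" in every sentence of the toy corpus
```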

(Worked example: sample counts from the Berkeley Restaurant Project corpus.)

Smoothing

  • What happens if we apply our language model to a new sentence?

    • A single novel bigram (count 0) pulls the probability of the whole sentence to 0
  • Additive (Laplace) smoothing - when counting occurrences, add a small constant to every bigram count (and adjust the denominator so probabilities still sum to 1); now every bigram is possible, as shown below
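
With vocabulary size $V$ and smoothing constant $\alpha$ (Laplace smoothing is the special case $\alpha = 1$), the smoothed estimate becomes:

$$P(w_i \mid w_{i-1}) = \frac{\#(w_{i-1}, w_i) + \alpha}{\#(w_{i-1}) + \alpha V}$$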

  • Additive smoothing is rarely used in real language models

  • Modern smoothing methods (e.g. backoff, interpolation, Kneser-Ney) are considerably more sophisticated

  • Perplexity = inverse probability of the test dataset, normalised by the number of words; lower perplexity means a better model (formula below)
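
For a test set of $N$ words:

$$PP(w_1, \dots, w_N) = P(w_1, \dots, w_N)^{-\frac{1}{N}}$$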

Conclusion

  • Smallest unit of thought is the n-gram
    • Unigrams let us work with single words
    • Bigrams let us work with pairs of consecutive words
    • Trigrams let us work with triples of consecutive words
  • Language models let us compare the likelihood of documents
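
As an illustration of comparing document likelihoods, the following continues the bigram_mle sketch from above (log-probabilities avoid numerical underflow; all names are assumptions):

```python
import math

def log_likelihood(sentence, probs):
    """Sum of log bigram probabilities; -inf if any bigram is unseen."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for pair in zip(tokens, tokens[1:]):
        p = probs.get(pair, 0.0)
        if p == 0.0:
            return float("-inf")  # one novel bigram pulls the whole product to 0
        total += math.log(p)
    return total

# `probs` comes from the bigram_mle sketch above
print(log_likelihood("I like feeding pigeons", probs) >
      log_likelihood("I like feeding bludgeons", probs))  # True
```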

Language Generation

What is NLG

  • NLG is the subfield of AI and computational linguistics that focuses on computer systems that can produce understandable texts
  • It is the process of deliberately constructing a natural-language text in order to meet specified communicative goals

Multiple schools of NLG

  • Gap-filling systems - the simplest approach: fixed text with slots to fill; no novelty (see the sketch after this list)
  • Rule-based systems - gap-filling expanded with rules
  • Grammatical-function systems - complex template systems; low novelty, medium error rate
  • Dynamic generation - generates from an abstract representation; fully dynamic, high novelty, high error rate
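
As a concrete illustration of the simplest end of the spectrum, a gap-filling system is little more than string substitution (template and data invented for illustration):

```python
from string import Template

# Gap-filling: fixed text with slots; no novelty, near-zero error rate
template = Template("$team beat $opponent $score in yesterday's match.")
print(template.substitute(team="United", opponent="City", score="2-1"))
# United beat City 2-1 in yesterday's match.
```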

NLP = Natural Language Understanding + Natural Language Generation

Stages of template-based NLG

  • Content Determination - what should we put in the text
  • Document Structuring - order of information, and grouping of sentences
  • Lexical Choice - the content words we use depend on genre, context, perception
  • Referring Expression Generation - Creation of expressions referring to entities
  • Aggregation - Merging syntactic or conceptual constituents
  • Realisation - Building the actual output
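
A toy sketch of how the later stages (lexical choice, aggregation, realisation) might fit together for a weather report; all data and rules here are invented for illustration:

```python
def generate_report(city, temp_c, wind_ms):
    # Lexical choice: content words depend on the underlying data
    temp_word = "warm" if temp_c >= 20 else "cold"
    wind_word = "windy" if wind_ms >= 10 else "calm"
    # Aggregation: merge two facts into one coordinated constituent
    description = f"{temp_word} and {wind_word}"
    # Referring expression generation + realisation: build the final output
    return f"In {city}, tomorrow will be {description}."

print(generate_report("Copenhagen", temp_c=22, wind_ms=4))
# In Copenhagen, tomorrow will be warm and calm.
```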

Conclusion

  • NLG is on a spectrum from fixed gap-filling to completely dynamic generation
  • NLG systems trade off two competing criteria: the more dynamic the system, the harder it is to reliably convey the communicative goal
  • More fixed NLG systems are boring but safe
  • More free NLG systems are interesting but risky