
Natural Language Processing

Overview of NLP

Why Language

Language is the basis of interaction

  • Natural Language Understanding - finding insights and understanding from text
  • Natural Language Generation - generating text for a purpose

Language is the key to interaction

  • Interaction between people, between people and machines, between machines, etc.
  • Advances in speech recognition and speech generation (thanks to deep learning) brought natural language interfaces back
  • Natural language -> languages which naturally evolved over time

Natural Language Understanding

  • NLU systems focus on analysing natural language
  • Humans are obsessive hoarders of data
  • We can gain insights from the text data

Natural Language Generation

  • NLG systems focus on generating natural language
    • From chatbots
    • From structured data
    • From media data, etc.

NLP: the early years

  • Focused on Rule-Based Systems
  • Extract elements from text with regular expressions
  • Classify text in a category based on lexicons
  • Use conditions, extracted elements, and rules to build dynamically generated reports to send to users
  • Rule-based systems are still used today, because they are reliable, easy to understand, and require no training data (see the sketch below)
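
A minimal sketch of such a rule-based pipeline: a regular expression extracts dates and a tiny hand-written lexicon classifies the text. The lexicons, pattern, and function name below are invented purely for illustration.

```python
import re

# Hypothetical sentiment lexicons (illustrative only).
POSITIVE_LEXICON = {"great", "excellent", "reliable"}
NEGATIVE_LEXICON = {"broken", "awful", "useless"}

DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")  # e.g. 12/03/2024

def analyse(text: str) -> dict:
    words = set(re.findall(r"[a-z']+", text.lower()))
    # Rule: classify by comparing lexicon hits.
    positive_hits = len(words & POSITIVE_LEXICON)
    negative_hits = len(words & NEGATIVE_LEXICON)
    label = "positive" if positive_hits >= negative_hits else "negative"
    # Rule: extract dates with the regular expression.
    return {"label": label, "dates": DATE_PATTERN.findall(text)}

print(analyse("Delivered on 12/03/2024; the service was excellent and reliable."))
# {'label': 'positive', 'dates': ['12/03/2024']}
```

Such rules and extracted elements can then be plugged into templates to generate reports dynamically.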

Better computers brought a new obsession with benchmarking

Then the deep learning craze appeared.

Common subfields of NLP

  • Opinion mining, sentiment analysis
  • Argument Stance Detection
  • Recognising Textual Entailment
  • Automated Summarisation
  • Question Answering
  • Information Retrieval
  • Conversational User Interfaces

The NLP Pipeline

[Figure: the NLP pipeline]

Preliminary notions

  • Each unit of interest that we analyse is a document
  • A collection of documents of interest is a corpus
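
A minimal sketch of the two notions, with invented example texts: each string is one document, and the list of documents is the corpus.

```python
# Each string is a document; the list of documents is the corpus.
corpus = [
    "The dog barked at the postman.",
    "Dogs are loyal companions.",
    "Rule-based systems are still used today.",
]
document = corpus[0]  # one unit of analysis
```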

1. Tokenisation

  • A document is made of tokens
  • Tokens can be of arbitrary length
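
A minimal sketch of tokenisation; the regular expression below is an illustrative assumption rather than the course's tokeniser (libraries such as NLTK or spaCy are typically used in practice).

```python
import re

def tokenise(document: str) -> list[str]:
    # Keep runs of letters/digits/apostrophes as word tokens, punctuation as its own token.
    return re.findall(r"[A-Za-z0-9']+|[.,!?;]", document)

print(tokenise("The dog barked at the postman!"))
# ['The', 'dog', 'barked', 'at', 'the', 'postman', '!']
```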

2. Annotation (Part of Speech)

  • Part-of-speech tagging labels each word with its grammatical function in a sentence
  • Different languages have different parts of speech
  • Different POS parsers have different POS tags for the same language
  • POS tags help disambiguate meaning using grammatical context
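
A small sketch of POS tagging with NLTK's default English tagger (the tool choice and model names are assumptions; the notes do not prescribe one). This tagger uses the Penn Treebank tag set, which differs from the tag sets of other parsers such as spaCy.

```python
import nltk

# Assumes the standard NLTK models can be downloaded.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("The dogs barked loudly.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('dogs', 'NNS'), ('barked', 'VBD'), ('loudly', 'RB'), ('.', '.')]
```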

3. Word standardisation

  • Words are inflected during usage
  • Lemmatisation: Reducing terms to their lemma (dictionary form)
    • Dogs, Dog, dog = dog.
    • Produces more consistent content, but is slower
    • Often requires text to be Part of Speech-annotated
  • Stemming: Reducing terms to their word stem (common part of the inflections)
    • programming = program, programmer = programm
    • Stemming may produce stems that are not real words, but it is faster
    • Simply runs a bunch of transformations on character strings
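
A hedged comparison of the two approaches using NLTK's Porter stemmer and WordNet lemmatiser (the tool choice is an assumption; the lemmatiser needs the WordNet data and, as noted above, works best when told each word's part of speech).

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatiser = WordNetLemmatizer()

for word in ["dogs", "programming", "programmer"]:
    print(f"{word}: stem={stemmer.stem(word)}, lemma={lemmatiser.lemmatize(word, pos='n')}")

# dogs: stem=dog, lemma=dog
# programming: stem=program, lemma=programming
# programmer: stem=programm, lemma=programmer   <- stems need not be real words
```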

4. Filtering: stop words

  • Not all words matter equally
    • Some words are noise (e.g. determiners)
    • Want to use a smaller vocabulary
  • Less important words are listed in stop-lists and removed from the text
  • We can also filter words by frequency
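
A minimal sketch of stop-word and frequency filtering; the tiny stop list below is an illustrative assumption rather than a standard list such as NLTK's.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "at", "of", "and", "is"}

tokens = ["the", "dog", "barked", "at", "the", "dog", "next", "door"]

# Remove stop words, then optionally drop very rare words as well.
filtered = [t for t in tokens if t not in STOP_WORDS]
counts = Counter(filtered)
kept = [t for t in filtered if counts[t] > 1]

print(filtered)  # ['dog', 'barked', 'dog', 'next', 'door']
print(kept)      # ['dog', 'dog']
```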

Pairs of words

  • A sequence of n consecutive words is called an n-gram
    • Unigram: single word
    • Bigram: two consecutive words
    • Trigram: three consecutive words
    • 4-gram: 4 consecutive words
    • etc
  • The choice of n is limited mainly by the sparsity of the representation: longer n-grams occur ever more rarely
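
A minimal sketch of building n-grams from a token list in plain Python (the helper name is ours; NLTK provides an equivalent nltk.ngrams).

```python
def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    # Slide a window of size n across the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "dog", "barked", "at", "the", "postman"]
print(ngrams(tokens, 2))  # bigrams
# [('the', 'dog'), ('dog', 'barked'), ('barked', 'at'), ('at', 'the'), ('the', 'postman')]
print(ngrams(tokens, 3))  # trigrams
# [('the', 'dog', 'barked'), ('dog', 'barked', 'at'), ('barked', 'at', 'the'), ('at', 'the', 'postman')]
```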