
Representation and Similarity

Fixed-size representation - Bag-of-words model

  • NLP algorithms don't understand words
  • Several algorithms work with fixed-size vectors
  • Many tasks rely on comparing documents

Those documents need to be represented in a common space. A vector space is a natural fit to represent them:

  • Each word/term is a dimension of that vector space
  • Each document is a vector in that vector space
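
As a minimal sketch of this vector space model (the vocabulary and documents below are toy examples, not from the notes), each document maps to a vector of term counts:

```python
# Each term is a dimension, each document a vector of term counts.
vocabulary = ["cat", "dog", "sat", "mat"]

def to_vector(document: str) -> list[int]:
    tokens = document.lower().split()
    return [tokens.count(term) for term in vocabulary]

print(to_vector("the cat sat on the mat"))  # [1, 0, 1, 1]
print(to_vector("the dog sat"))             # [0, 1, 1, 0]
```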

Distance and Similarity in Vector Space

Similarity measures

  • At the core of many algorithms
    • Information retrieval algorithms (search by similarity)
    • Clustering algorithms (cluster by similarity)
    • Some classifiers, recommender systems, etc.
  • Intrinsically related to distance measures

Similarity in vector spaces

  • Jaccard index - Works with binary weights
  • Euclidean distance - Standard geometry distance
  • Cosine similarity - Angle of the document vectors. Dot product of those vectors divided by the product of their norms (see the sketch after this list)
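
A minimal sketch of the three measures; the vectors and sets are toy data, and the Jaccard version assumes binary weights (sets of terms):

```python
import math

def jaccard(a: set, b: set) -> float:
    # Jaccard index over binary weights: |A ∩ B| / |A ∪ B|
    return len(a & b) / len(a | b)

def euclidean(u, v) -> float:
    # Standard geometric distance between two vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def cosine(u, v) -> float:
    # Dot product divided by the product of the norms
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

print(jaccard({"cat", "sat"}, {"dog", "sat"}))  # 1/3
print(cosine([1, 0, 1, 1], [0, 1, 1, 0]))       # ≈ 0.41
```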

Representation

  • Binary term weighting - 1 if the term occurs in the document, 0 otherwise
  • Log-frequency term weighting
  • TF-IDF (Term Frequency - Inverse Document Frequency) term weighting - Modification of the log frequency, adding an inverse document frequency factor that down-weights terms appearing in many documents (see the sketch after this list)
  • There are more advanced and specialised representations
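
A sketch of these weighting schemes under the common textbook definitions (1 + log tf for log-frequency, multiplied by log(N/df) for TF-IDF); exact variants differ between implementations:

```python
import math

def binary_weight(tf: int) -> float:
    # Binary term weighting: 1 if the term occurs, 0 otherwise
    return 1.0 if tf > 0 else 0.0

def log_tf(tf: int) -> float:
    # Log-frequency weighting: 1 + log10(tf) for tf > 0, else 0
    return 1.0 + math.log10(tf) if tf > 0 else 0.0

def tf_idf(tf: int, df: int, n_docs: int) -> float:
    # Log-frequency scaled by inverse document frequency, which
    # down-weights terms that appear in many documents
    return log_tf(tf) * math.log10(n_docs / df)

# A term occurring 3 times in a document, appearing in 10 of 1000 docs:
print(tf_idf(3, df=10, n_docs=1000))  # ≈ 1.477 * 2 ≈ 2.95
```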

Applications

  • Can be fed into ML algorithms (classifiers, clustering etc)
  • Can be used for search engines, since it gives us a notion of distance and similarity

Variable-length representations - One-hot encoding and word embeddings

One-hot encoding

  • Assumption: all words are equally unrelated
  • No semantics
  • Large sparse matrices (mostly 0s; see the sketch after this list)
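
A minimal one-hot sketch with a toy four-word vocabulary, showing the mostly-zero structure:

```python
import numpy as np

# A document of L tokens over a vocabulary of size V becomes
# an L x V matrix that is mostly zeros.
vocabulary = {"the": 0, "cat": 1, "sat": 2, "mat": 3}

def one_hot(tokens: list[str]) -> np.ndarray:
    matrix = np.zeros((len(tokens), len(vocabulary)))
    for row, token in enumerate(tokens):
        matrix[row, vocabulary[token]] = 1.0
    return matrix

print(one_hot(["the", "cat", "sat"]))
# [[1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 1. 0.]]
```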

Word Embeddings

  • Semantics - the study of meaning
  • Distributional semantics - modelling meaning based on the distribution of language in large samples of data
  • A word embedding is a dense vector representation of words
  • Tend to preserve some properties of words

The larger the neural network, the more expensive it is to train.
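
A toy illustration of the "preserve some properties" point: the three-dimensional vectors below are invented for the example; real embeddings such as word2vec or GloVe are learned from large corpora and have hundreds of dimensions:

```python
import numpy as np

# Toy 3-d "embeddings", invented for illustration only.
embeddings = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "car": np.array([0.0, 0.9, 0.7]),
}

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Related words end up with similar (dense) vectors:
print(cosine(embeddings["cat"], embeddings["dog"]))  # ≈ 0.98, high
print(cosine(embeddings["cat"], embeddings["car"]))  # ≈ 0.09, low
```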

Variable-length representations

  • One-hot encoding - Document of L tokens with a vocabulary of V is an L × V matrix.
    • Sum the matrix across its length to obtain a BOW model with frequency
    • Take the max across the length to obtain a BOW model with binary encoding
  • Embeddings - Document of L tokens with an M-dimensional embedding is an L × M matrix.
    • Average the matrix across its length to obtain a Mean Embedding
  • Can be transformed into fixed-length representations, but not the other way round (see the sketch after this list).
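
A minimal numpy sketch of these reductions, using a toy one-hot document and a random stand-in for an embedded document:

```python
import numpy as np

L, V, M = 5, 4, 3  # toy sizes: 5 tokens, vocabulary 4, embedding dim 3

one_hot = np.array([  # L x V one-hot document, toy data
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
])

bow_freq = one_hot.sum(axis=0)    # BOW with term frequencies: [2 2 1 0]
bow_binary = one_hot.max(axis=0)  # BOW with binary weights:   [1 1 1 0]

embedded = np.random.rand(L, M)         # L x M embedded document (random stand-in)
mean_embedding = embedded.mean(axis=0)  # fixed-size M-dimensional vector
```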

Concatenating - Compute a fixed-size vector from variable-sized representations
Metric learning - Train a NN to estimate similarity

Applications of similarity

  • Corpus of documents represented in a common space
  • Relevance function that takes a query and a document
  • Returns a relevance score between 0 and 1

Key part of a search engine, subject to lots of research.
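
A minimal sketch of such a relevance function, assuming non-negative (e.g. TF-IDF) query and document vectors so that cosine similarity falls in [0, 1]; the function and variable names are my own:

```python
import numpy as np

def relevance(query_vec: np.ndarray, doc_vec: np.ndarray) -> float:
    # Cosine similarity of non-negative vectors lies in [0, 1]
    denom = np.linalg.norm(query_vec) * np.linalg.norm(doc_vec)
    return float(np.dot(query_vec, doc_vec) / denom) if denom else 0.0

def search(query_vec, doc_vecs):
    # Rank the corpus by relevance to the query, most relevant first
    scores = [(i, relevance(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

docs = np.array([[2, 0, 1], [0, 3, 1]])  # toy term-frequency vectors
query = np.array([1, 0, 1])
print(search(query, docs))  # document 0 ranks above document 1
```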

K-Nearest neighbours

  • Given a corpus of labelled documents
  • Given a new document d to label
  • Look at the K documents most similar to d from our corpus and find the majority label
  • Output label as classification

Usually with a Bag-of-Words representation
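
A minimal sketch of this procedure over fixed-size (e.g. Bag-of-Words) vectors; the names and the choice of cosine similarity are assumptions for the example:

```python
from collections import Counter
import numpy as np

def knn_label(new_doc, corpus, labels, k=3):
    # Similarity of the new document to every labelled document
    def cos(u, v):
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return np.dot(u, v) / denom if denom else 0.0

    sims = [cos(new_doc, doc) for doc in corpus]
    # Indices of the k most similar documents
    top_k = np.argsort(sims)[-k:]
    # Majority label among the k nearest neighbours
    return Counter(labels[i] for i in top_k).most_common(1)[0][0]

corpus = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1]])  # toy BOW vectors
labels = ["sports", "politics", "sports"]
print(knn_label(np.array([1, 0, 1]), corpus, labels, k=3))  # "sports"
```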