Representation and Similarity
Fixed-size representation - Bag-of-words model
- NLP algorithms don't understand words
- Several algorithms work with fixed-size vectors
- Many tasks rely on comparing documents
- Those documents need to be represented in a common space
- A vector space is a natural fit for this
- Each word/term is a dimension of that vector space
- Each document is a vector in that vector space
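A minimal sketch of this mapping (the toy documents and vocabulary are invented for illustration):

```python
from collections import Counter

# Toy corpus; each document becomes a vector over the shared vocabulary.
docs = ["the cat sat on the mat", "the dog sat on the log"]
tokenised = [d.split() for d in docs]

# The vocabulary defines the dimensions of the common vector space.
vocab = sorted({tok for doc in tokenised for tok in doc})

def to_vector(tokens, vocab):
    """One fixed-size vector per document: term counts over the vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

vectors = [to_vector(doc, vocab) for doc in tokenised]
print(vocab)       # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(vectors[0])  # [1, 0, 0, 1, 1, 1, 2]
```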
Distance and Similarity in Vector Space
Similarity measures
- At the core of many algorithms
- Information retrieval algorithms (search by similarity)
- Clustering algorithms (cluster by similarity)
- Some classifiers, recommender systems, etc.
- Intrinsically related to distance measures
Similarity in vector spaces
- Jaccard index - Works with binary weights
- Euclidean distance - Standard geometry distance
- Cosine similarity - Angle between the document vectors: the dot product of the vectors divided by the product of their norms
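Pure-Python sketches of these three measures (token lists for Jaccard, numeric vectors for the other two):

```python
import math

def jaccard(a, b):
    """Jaccard index over binary weights: |A & B| / |A | B| on term sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def euclidean(u, v):
    """Standard geometric distance between two document vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def cosine(u, v):
    """Dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

print(jaccard("the cat sat".split(), "the dog sat".split()))  # 0.5
print(cosine([1, 2, 0], [1, 2, 0]))                           # ~1.0 (same direction)
```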
Representation
- Binary term weighting - 1 if the term occurs in the document, 0 otherwise
- Log-frequency term weighting - dampens raw counts: w = 1 + log(tf) when tf > 0, else 0
- TF-IDF (Term Frequency - Inverse Document Frequency) term weighting - Modification of the log frequency: the log-frequency weight is scaled by the inverse document frequency, log(N/df), so rare terms count more (see the sketch after this list)
- There are more advanced and specialised representations
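A sketch of the three weightings above, assuming one common convention (1 + log10(tf) for log-frequency, log10(N/df) for the idf part; exact formulas vary between textbooks):

```python
import math
from collections import Counter

docs = [d.split() for d in ["the cat sat", "the dog sat", "a cat and a dog"]]
N = len(docs)

# df: number of documents each term appears in (for the idf part).
df = Counter(term for doc in docs for term in set(doc))

def weights(doc):
    tf = Counter(doc)
    binary = {t: 1 for t in tf}                             # binary weighting
    log_tf = {t: 1 + math.log10(c) for t, c in tf.items()}  # log-frequency
    # TF-IDF: log-frequency scaled by how rare the term is across the corpus.
    tf_idf = {t: w * math.log10(N / df[t]) for t, w in log_tf.items()}
    return binary, log_tf, tf_idf

print(weights(docs[2])[2])  # terms unique to one document get boosted by idf
```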
Applications
- Can be fed into ML algorithms (classifiers, clustering, etc.)
- Can be used for search engines, since they give us a notion of distance and similarity
Variable-length representations - One-hot encoding and word embeddings
One-hot encoding
- Assumptions: all words are equally unrelated
- No semantics
- Large sparse matrices (mostly 0s)
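A minimal numpy sketch showing the sparsity (toy vocabulary invented for illustration):

```python
import numpy as np

vocab = ["cat", "dog", "sat", "the"]             # toy vocabulary, V = 4
index = {term: i for i, term in enumerate(vocab)}

def one_hot(tokens):
    """Document of L tokens -> L x V matrix with a single 1 per row."""
    matrix = np.zeros((len(tokens), len(vocab)))
    for row, tok in enumerate(tokens):
        matrix[row, index[tok]] = 1.0
    return matrix

print(one_hot("the cat sat".split()))
# [[0. 0. 0. 1.]   <- 'the'
#  [1. 0. 0. 0.]   <- 'cat'
#  [0. 0. 1. 0.]]  <- 'sat'
# Mostly zeros, and every pair of distinct words is equally far apart.
```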
Word Embeddings
- Semantics - the study of meaning
- Distributional semantics - modelling meaning based on the distribution of language in large samples of data
- A word embedding is a dense vector representation of words
- Tend to preserve some properties of words
The larger the neural network used to learn them, the more expensive they are to train
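As a hedged illustration (assuming gensim and its downloadable pretrained GloVe vectors, neither of which is part of these notes), the "preserved properties" can be probed with vector arithmetic:

```python
import gensim.downloader

# Downloads a small pretrained GloVe model on first use.
vectors = gensim.downloader.load("glove-wiki-gigaword-50")

print(vectors["king"].shape)  # (50,) dense vector, not a sparse one-hot
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# Nearest neighbours typically include 'queen': meaning from distribution.
```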
Variable-length representations
- One-hot encoding - Document of L tokens with a vocabulary of size V is an L x V matrix.
- Sum the matrix across its length to obtain a BOW model with frequency
- Take the max across the length to obtain a BOW model with binary encoding
- Embeddings - Document of L tokens with an M-dimensional embedding is an L x M matrix.
- Average the matrix across its length to obtain a Mean Embedding
- Can be transformed into fixed-length representations, but not the other way round.
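A numpy sketch of these reductions, using random token ids and a random embedding table as stand-ins for real data:

```python
import numpy as np

rng = np.random.default_rng(0)
L, V, M = 5, 8, 4                       # 5 tokens, vocabulary of 8, 4-dim embeddings

# One-hot document: L x V matrix (toy token ids stand in for real text).
token_ids = rng.integers(0, V, size=L)
one_hot = np.eye(V)[token_ids]

bow_freq = one_hot.sum(axis=0)          # sum over length -> term frequencies
bow_binary = one_hot.max(axis=0)        # max over length -> binary presence

# Embedded document: L x M matrix via a (random, toy) embedding table.
embedding_table = rng.normal(size=(V, M))
embedded = embedding_table[token_ids]
mean_embedding = embedded.mean(axis=0)  # average over length -> fixed M-dim vector

print(bow_freq, bow_binary, mean_embedding, sep="\n")
```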
- Concatenating - Compute a fixed-size vector from variable-sized representations
- Metric learning - Train a neural network to estimate similarity (see the sketch below)
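A minimal PyTorch sketch of the metric-learning idea, with random tensors standing in for document pairs and +1/-1 similarity labels (a real setup would use a labelled pair dataset and a stronger encoder, e.g. a siamese network):

```python
import torch
import torch.nn as nn

# Encoder maps M-dim document vectors into a space where cosine similarity
# matches the pair label (+1 similar, -1 dissimilar).
M = 4
encoder = nn.Sequential(nn.Linear(M, 16), nn.ReLU(), nn.Linear(16, 8))
loss_fn = nn.CosineEmbeddingLoss()
optimiser = torch.optim.Adam(encoder.parameters(), lr=1e-2)

x1 = torch.randn(32, M)                      # first document of each pair
x2 = torch.randn(32, M)                      # second document of each pair
labels = torch.randint(0, 2, (32,)) * 2 - 1  # +1 / -1 pair labels

for _ in range(100):
    optimiser.zero_grad()
    loss = loss_fn(encoder(x1), encoder(x2), labels.float())
    loss.backward()
    optimiser.step()
```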
Applications of similarity
- Corpus of documents represented in a common space
- Relevance function that takes a query and a document
- Returns a relevance score between 0 and 1
A key part of a search engine, and the subject of much research.
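A toy relevance function along these lines, using cosine similarity over raw count vectors (non-negative weights keep the score in [0, 1]); the corpus and query are invented:

```python
import math
from collections import Counter

corpus = ["the cat sat on the mat", "dogs chase cats", "stock markets fell"]

def vectorise(text):
    return Counter(text.split())

def relevance(query, document):
    """Cosine similarity of count vectors: a relevance score in [0, 1]."""
    q, d = vectorise(query), vectorise(document)
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * \
           math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

query = "cat on a mat"
ranked = sorted(corpus, key=lambda doc: relevance(query, doc), reverse=True)
print(ranked[0])  # "the cat sat on the mat"
```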
K-Nearest neighbours
- Given a corpus of labelled documents
- Given a new document to label
- Look at the K documents from our corpus most similar to it and find the majority label
- Output that label as the classification
Usually with a Bag-of-Words representation
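A self-contained sketch with an invented labelled corpus, using cosine similarity over bag-of-words counts:

```python
import math
from collections import Counter

def cosine(q, d):
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * \
           math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

# Labelled corpus as bag-of-words count vectors.
corpus = [
    (Counter("the striker scored a goal".split()), "sport"),
    (Counter("the team won the match".split()), "sport"),
    (Counter("parliament passed the bill".split()), "politics"),
    (Counter("the minister gave a speech".split()), "politics"),
]

def knn_label(text, k=3):
    query = Counter(text.split())
    # Rank the corpus by similarity to the new document.
    neighbours = sorted(corpus, key=lambda item: cosine(query, item[0]), reverse=True)
    top_labels = [label for _, label in neighbours[:k]]
    return Counter(top_labels).most_common(1)[0][0]  # majority label

print(knn_label("the team scored in the match"))     # expected: 'sport'
```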