Representation and Similarity
Fixed-size representation - Bag-of-words model
- NLP algorithms don't understand words
- Several algorithms work with fixed-size vectors
- Many tasks rely on comparing documents
- Those documents need to be represented in a common space
- A vector space is a natural fit for this
- Each word/term is a dimension of that vector space
- Each document is a vector in that vector space
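A minimal sketch of this mapping (the toy documents and vocabulary are invented for illustration):

```python
from collections import Counter

# Toy corpus; each document becomes a vector over the shared vocabulary.
docs = ["the cat sat on the mat", "the dog sat on the log"]
tokenised = [d.split() for d in docs]

# The vocabulary defines the dimensions of the common vector space.
vocab = sorted({tok for doc in tokenised for tok in doc})

def to_vector(tokens, vocab):
    """One fixed-size vector per document: term counts over the vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

vectors = [to_vector(doc, vocab) for doc in tokenised]
print(vocab)       # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(vectors[0])  # [1, 0, 0, 1, 1, 1, 2]
```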
Distance and Similarity in Vector Space
Similarity measures
- At the core of many algorithms
- Information retrieval algorithms (search by similarity)
- Clustering algorithms (cluster by similarity)
- Some classifiers, recommender systems, etc.
- Intrinsically related to distance measures
Similarity in vector spaces
- Jaccard index - Works with binary weights
- Euclidean distance - Standard geometry distance
- Cosine similarity - Angle between the document vectors: the dot product of the vectors divided by the product of their norms
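Pure-Python sketches of these three measures (token lists for Jaccard, numeric vectors for the other two):

```python
import math

def jaccard(a, b):
    """Jaccard index over binary weights: |A & B| / |A | B| on term sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def euclidean(u, v):
    """Standard geometric distance between two document vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def cosine(u, v):
    """Dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

print(jaccard("the cat sat".split(), "the dog sat".split()))  # 0.5
print(cosine([1, 2, 0], [1, 2, 0]))                           # ~1.0 (same direction)
```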
Representation
- Binary term weighting - 1 if the term occurs in the document, 0 otherwise
- Log-frequency term weighting - dampens raw counts: w = 1 + log(tf) when tf > 0, else 0
- TF-IDF (Term Frequency - Inverse Document Frequency) term weighting - Modification of the log frequency: the log-frequency weight is scaled by the inverse document frequency, log(N/df), so rare terms count more (see the sketch after this list)
- There are more advanced and specialised representations
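A sketch of the three weightings above, assuming one common convention (1 + log10(tf) for log-frequency, log10(N/df) for the idf part; exact formulas vary between textbooks):

```python
import math
from collections import Counter

docs = [d.split() for d in ["the cat sat", "the dog sat", "a cat and a dog"]]
N = len(docs)

# df: number of documents each term appears in (for the idf part).
df = Counter(term for doc in docs for term in set(doc))

def weights(doc):
    tf = Counter(doc)
    binary = {t: 1 for t in tf}                             # binary weighting
    log_tf = {t: 1 + math.log10(c) for t, c in tf.items()}  # log-frequency
    # TF-IDF: log-frequency scaled by how rare the term is across the corpus.
    tf_idf = {t: w * math.log10(N / df[t]) for t, w in log_tf.items()}
    return binary, log_tf, tf_idf

print(weights(docs[2])[2])  # terms unique to one document get boosted by idf
```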
Applications
- Can be fed into ML algorithms (classifiers, clustering, etc.)
- Can be used for search engines, since they give us a notion of distance and similarity
Variable-length representations - One-hot encoding and word embeddings
One-hot encoding
- Assumptions: all words are equally unrelated
- No semantics
- Large sparse matrices (mostly 0s)
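A minimal numpy sketch showing the sparsity (toy vocabulary invented for illustration):

```python
import numpy as np

vocab = ["cat", "dog", "sat", "the"]             # toy vocabulary, V = 4
index = {term: i for i, term in enumerate(vocab)}

def one_hot(tokens):
    """Document of L tokens -> L x V matrix with a single 1 per row."""
    matrix = np.zeros((len(tokens), len(vocab)))
    for row, tok in enumerate(tokens):
        matrix[row, index[tok]] = 1.0
    return matrix

print(one_hot("the cat sat".split()))
# [[0. 0. 0. 1.]   <- 'the'
#  [1. 0. 0. 0.]   <- 'cat'
#  [0. 0. 1. 0.]]  <- 'sat'
# Mostly zeros, and every pair of distinct words is equally far apart.
```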
Word Embeddings
- Semantics - the study of meaning
- Distributional semantics - modelling meaning based on the distribution of language in large samples of data
- A word embedding is a dense vector representation of words
- Tend to preserve some properties of words
The larger the neural network used to learn them, the more expensive they are to train
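As a hedged illustration (assuming gensim and its downloadable pretrained GloVe vectors, neither of which is part of these notes), the "preserved properties" can be probed with vector arithmetic:

```python
import gensim.downloader

# Downloads a small pretrained GloVe model on first use.
vectors = gensim.downloader.load("glove-wiki-gigaword-50")

print(vectors["king"].shape)  # (50,) dense vector, not a sparse one-hot
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# Nearest neighbours typically include 'queen': meaning from distribution.
```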
Variable-length representations
- One-hot encoding - Document of L tokens with a vocabulary of size V is an L x V matrix.
- Sum the matrix across its length to obtain a BOW model with frequency
- Take the max across the length to obtain a BOW model with binary encoding
- Embeddings - Document of L tokens with an M-dimensional embedding is an L x M matrix.
- Average the matrix across its length to obtain a Mean Embedding
- Can be transformed into fixed-length representations, but not the other way round.
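A numpy sketch of these reductions, using random token ids and a random embedding table as stand-ins for real data:

```python
import numpy as np

rng = np.random.default_rng(0)
L, V, M = 5, 8, 4                       # 5 tokens, vocabulary of 8, 4-dim embeddings

# One-hot document: L x V matrix (toy token ids stand in for real text).
token_ids = rng.integers(0, V, size=L)
one_hot = np.eye(V)[token_ids]

bow_freq = one_hot.sum(axis=0)          # sum over length -> term frequencies
bow_binary = one_hot.max(axis=0)        # max over length -> binary presence

# Embedded document: L x M matrix via a (random, toy) embedding table.
embedding_table = rng.normal(size=(V, M))
embedded = embedding_table[token_ids]
mean_embedding = embedded.mean(axis=0)  # average over length -> fixed M-dim vector

print(bow_freq, bow_binary, mean_embedding, sep="\n")
```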
- Concatenating - Compute a fixed-size vector from variable-sized representations
- Metric learning - Train a neural network to estimate similarity (see the sketch below)
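A minimal PyTorch sketch of the metric-learning idea, with random tensors standing in for document pairs and +1/-1 similarity labels (a real setup would use a labelled pair dataset and a stronger encoder, e.g. a siamese network):

```python
import torch
import torch.nn as nn

# Encoder maps M-dim document vectors into a space where cosine similarity
# matches the pair label (+1 similar, -1 dissimilar).
M = 4
encoder = nn.Sequential(nn.Linear(M, 16), nn.ReLU(), nn.Linear(16, 8))
loss_fn = nn.CosineEmbeddingLoss()
optimiser = torch.optim.Adam(encoder.parameters(), lr=1e-2)

x1 = torch.randn(32, M)                      # first document of each pair
x2 = torch.randn(32, M)                      # second document of each pair
labels = torch.randint(0, 2, (32,)) * 2 - 1  # +1 / -1 pair labels

for _ in range(100):
    optimiser.zero_grad()
    loss = loss_fn(encoder(x1), encoder(x2), labels.float())
    loss.backward()
    optimiser.step()
```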
Applications of similarity
- Corpus of documents represented in a common space
- Relevance function that takes a query and a document
- Returns a relevance score between 0 and 1
A key part of a search engine, and the subject of much research.
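A toy relevance function along these lines, using cosine similarity over raw count vectors (non-negative weights keep the score in [0, 1]); the corpus and query are invented:

```python
import math
from collections import Counter

corpus = ["the cat sat on the mat", "dogs chase cats", "stock markets fell"]

def vectorise(text):
    return Counter(text.split())

def relevance(query, document):
    """Cosine similarity of count vectors: a relevance score in [0, 1]."""
    q, d = vectorise(query), vectorise(document)
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * \
           math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

query = "cat on a mat"
ranked = sorted(corpus, key=lambda doc: relevance(query, doc), reverse=True)
print(ranked[0])  # "the cat sat on the mat"
```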
K-Nearest neighbours
- Given a corpus of labelled documents
- Given a new document to label
- Look at the K documents from our corpus most similar to it and find the majority label
- Output that label as the classification
Usually with a Bag-of-Words representation
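A self-contained sketch with an invented labelled corpus, using cosine similarity over bag-of-words counts:

```python
import math
from collections import Counter

def cosine(q, d):
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * \
           math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

# Labelled corpus as bag-of-words count vectors.
corpus = [
    (Counter("the striker scored a goal".split()), "sport"),
    (Counter("the team won the match".split()), "sport"),
    (Counter("parliament passed the bill".split()), "politics"),
    (Counter("the minister gave a speech".split()), "politics"),
]

def knn_label(text, k=3):
    query = Counter(text.split())
    # Rank the corpus by similarity to the new document.
    neighbours = sorted(corpus, key=lambda item: cosine(query, item[0]), reverse=True)
    top_labels = [label for _, label in neighbours[:k]]
    return Counter(top_labels).most_common(1)[0][0]  # majority label

print(knn_label("the team scored in the match"))     # expected: 'sport'
```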