Text classification

Classification as generalisation

Core assumption: our data is representative of the real world (sampled from the same distribution). You are never observing the real world, only a sample of it
e.g. Language competency classifier trained in this class

Arising Issues

Geographical difference, language
Cultural, words changing meaning based on dominant culture, youth slang
Time difference, change in vocabulary, concept drift

ML is generalisation

Machine learning - techniques that enable a computer to learn from examples Deep learning - neural networks with multiple layers Training a supervised model - iteratively modifying parameters of the model to minimise training error in a way that (we think) leads to minimising generalisation error, which we estimate with testing error

Machine Learning Examples

Traditional Machine Learning

Linear models (Logistical Regression)
Tree models (Decision Tree)
Probabilistic models
Similarity-based models
Representation is explicitly defined by the user and fed to the model Deep Learning
Neural network models
Convolutional neural networks
Recurrent neural networks
Transformer models
Representation is typically learnt by the model as part of the training process

ML Approaches

Supervised - There is known ground truth about the data
Unsupervised - There is no ground truth known about the data
Semi-supervised - There is ground truth known about some of the data
Reinforcement learning - We are trying to learn policies/sequences of action

Neural Networks

Also used in LLM

Overfitting

Each additional hidden layer increases expressiveness
- Network can learn more complex concepts
- Training error gets smaller
Drawbacks:
- Data-hungry
- Inefficient
- Overfitting (memorises the training data)

Training and generalisation error

Minimise the training error
Generalisation error - the one which matters. Cant be sure that training error matches generalisation error
The process of finding the best trade-off between simplicity and training error is regularisation
Want to find the best compromise between;
- Minimising training error
- Minimising model complexity
Simpler model is more robust

Text classification

Classification as generalisation​

Arising Issues​

ML is generalisation​

Machine Learning Examples​

ML Approaches​

Neural Networks​

Overfitting​

Training and generalisation error​