Text classification
Classification as generalisation
- Core assumption: our data is representative of the real world, i.e. sampled from the same distribution. We never observe the real world itself, only a sample of it
- e.g. Language competency classifier trained in this class
Arising Issues
- Geographical differences - language and dialect vary by region
- Cultural differences - words change meaning with the dominant culture, e.g. youth slang
- Time differences - vocabulary changes over time (concept drift); see the toy sketch below
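A toy sketch of the last point, using hypothetical example sentences and scikit-learn (an assumption; any bag-of-words classifier shows the same effect): a model trained on older vocabulary has nothing to go on when the same sentiment arrives in newer slang.

```python
# Toy illustration (hypothetical data): vocabulary drift breaks a
# bag-of-words sentiment classifier trained on older language.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["this film was excellent", "truly wonderful acting",
               "a terrible boring movie", "awful plot and dreadful pacing"]
train_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# The same sentiments expressed in newer slang the model has never seen:
drifted_texts = ["that flick slaps", "total snooze fest"]
print(model.predict(drifted_texts))
# None of these words appear in the training vocabulary, so both feature
# vectors are all zeros and the model can only output its default
# (intercept-based) prediction.
```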
ML is generalisation
Machine learning - techniques that enable a computer to learn from examples
Deep learning - neural networks with multiple layers
Training a supervised model - iteratively modifying parameters of the model to minimise training error in a way that (we think) leads to minimising generalisation error, which we estimate with testing error
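A minimal sketch of that training loop on synthetic data (NumPy logistic regression trained by gradient descent; the data, learning rate, and step count are made up for illustration): parameters are updated iteratively to reduce training error, and held-out data estimates generalisation error.

```python
# Minimal sketch: training = iterative parameter updates that reduce
# training error; accuracy on held-out data estimates generalisation.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))               # synthetic features
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # synthetic labels
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

w = np.zeros(2)
for step in range(500):                          # gradient descent
    p = 1 / (1 + np.exp(-X_train @ w))           # predicted probabilities
    grad = X_train.T @ (p - y_train) / len(y_train)
    w -= 0.5 * grad                              # step that lowers training loss

test_acc = ((1 / (1 + np.exp(-X_test @ w)) > 0.5) == y_test).mean()
print(f"test accuracy: {test_acc:.2f}")          # estimate of generalisation
```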
Machine Learning Examples
Traditional Machine Learning
- Linear models (Logistic Regression)
- Tree models (Decision Tree)
- Probabilistic models
- Similarity-based models
- Representation is explicitly defined by the user and fed to the model
Deep Learning
- Neural network models
- Convolutional neural networks
- Recurrent neural networks
- Transformer models
- Representation is typically learnt by the model as part of the training process
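A small sketch of the contrast, assuming scikit-learn and toy data: in the traditional setting the representation (here TF-IDF word weights) is defined by the user and handed to the model; in deep learning the analogous feature step (e.g. an embedding layer) would instead be learnt during training.

```python
# Traditional ML: the representation is chosen by the user, the model
# only ever sees the resulting feature vectors. In deep learning this
# feature step would be learnt jointly with the classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["great movie", "boring movie", "great acting", "boring acting"]
labels = [1, 0, 1, 0]  # hypothetical toy data

vectoriser = TfidfVectorizer()
features = vectoriser.fit_transform(texts)        # explicit, user-defined
clf = LogisticRegression().fit(features, labels)  # model sees only features
print(clf.predict(vectoriser.transform(["great film"])))
```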
ML Approaches
- Supervised - There is known ground truth about the data
- Unsupervised - There is no ground truth known about the data
- Semi-supervised - There is ground truth known about some of the data
- Reinforcement learning - We are trying to learn policies/sequences of action
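A minimal sketch of the first two settings on the same synthetic data (scikit-learn, made-up points): the supervised model is given the ground-truth labels, while the unsupervised one must find structure without them.

```python
# Supervised vs unsupervised on the same synthetic data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),        # synthetic group A
               rng.normal(4, 1, (50, 2))])       # synthetic group B
y = np.array([0] * 50 + [1] * 50)                # known ground truth

supervised = LogisticRegression().fit(X, y)             # uses the labels
unsupervised = KMeans(n_clusters=2, n_init=10).fit(X)   # no labels used
print(supervised.predict(X[:5]), unsupervised.labels_[:5])
```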
Neural Networks
Neural networks are also the basis of large language models (LLMs)
Overfitting
- Each additional hidden layer increases expressiveness
- Network can learn more complex concepts
- Training error gets smaller
- Drawbacks:
- Data-hungry
- Inefficient
- Overfitting (memorises the training data); see the sketch below
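A sketch of this effect with scikit-learn's MLPClassifier on noisy synthetic data (the layer sizes and data are assumptions): as the hidden layers grow, training accuracy typically keeps climbing while test accuracy stalls or drops, because the extra capacity is spent memorising noise.

```python
# Growing network capacity: training error keeps shrinking, but the
# gap to test error typically widens once the model starts memorising.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = ((X[:, 0] + 0.5 * rng.normal(size=300)) > 0).astype(int)  # noisy labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for layers in [(4,), (64, 64), (256, 256, 256)]:   # growing expressiveness
    net = MLPClassifier(hidden_layer_sizes=layers, max_iter=2000,
                        random_state=0).fit(X_tr, y_tr)
    print(layers, "train:", round(net.score(X_tr, y_tr), 2),
          "test:", round(net.score(X_te, y_te), 2))
```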
Training and generalisation error
- Minimise the training error
- Generalisation error - the one that matters; we can't be sure that training error matches generalisation error
- The process of finding the best trade-off between simplicity and training error is regularisation
- Want to find the best compromise between:
- Minimising training error
- Minimising model complexity
- A simpler model is more robust; the regularisation sketch below illustrates the trade-off
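A sketch of that compromise using the L2 penalty in scikit-learn's LogisticRegression (the dataset is synthetic; smaller C means a stronger penalty, i.e. a simpler model with smaller weights): weak regularisation fits the training set best, while some regularisation usually generalises better on noisy data.

```python
# Regularisation = trading training error against model complexity.
# Smaller C -> stronger L2 penalty -> simpler (smaller-weight) model.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=30, n_informative=5,
                           flip_y=0.1, random_state=0)  # noisy synthetic data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for C in [100.0, 1.0, 0.01]:                # weaker -> stronger regularisation
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
    print(f"C={C}: train={clf.score(X_tr, y_tr):.2f} "
          f"test={clf.score(X_te, y_te):.2f}")
```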