Skip to main content

Text classification

Classification as generalisation

  • Core assumption: our data is representative of the real world (sampled from the same distribution). You are never observing the real world, only a sample of it
  • e.g. Language competency classifier trained in this class

Arising Issues

  • Geographical difference, language
  • Cultural, words changing meaning based on dominant culture, youth slang
  • Time difference, change in vocabulary, concept drift

ML is generalisation

Machine learning - techniques that enable a computer to learn from examples Deep learning - neural networks with multiple layers Training a supervised model - iteratively modifying parameters of the model to minimise training error in a way that (we think) leads to minimising generalisation error, which we estimate with testing error

Machine Learning Examples

Traditional Machine Learning

  • Linear models (Logistical Regression)
  • Tree models (Decision Tree)
  • Probabilistic models
  • Similarity-based models
  • Representation is explicitly defined by the user and fed to the model Deep Learning
  • Neural network models
  • Convolutional neural networks
  • Recurrent neural networks
  • Transformer models
  • Representation is typically learnt by the model as part of the training process

ML Approaches

  • Supervised - There is known ground truth about the data
  • Unsupervised - There is no ground truth known about the data
  • Semi-supervised - There is ground truth known about some of the data
  • Reinforcement learning - We are trying to learn policies/sequences of action

Neural Networks

Also used in LLM

Overfitting

  • Each additional hidden layer increases expressiveness
    • Network can learn more complex concepts
    • Training error gets smaller
  • Drawbacks:
    • Data-hungry
    • Inefficient
    • Overfitting (memorises the training data)

Training and generalisation error

  • Minimise the training error
  • Generalisation error - the one which matters. Cant be sure that training error matches generalisation error
  • The process of finding the best trade-off between simplicity and training error is regularisation
  • Want to find the best compromise between;
    • Minimising training error
    • Minimising model complexity
  • Simpler model is more robust