1. ML Basics
1.0.1. Supervised vs Unsupervised Learning
1.0.2. Dataset and Data-loaders
- ML is data-driven.
- Data is split into training, validation, and test sets.
- We minimize the loss over the training set, which typically leads to some overfitting.
- Preventing data leakage is crucial.
- Good performance on the test set is essential.
- Common splits: 80/10/10 or 70/15/15; training should be 50% or more.
- Validation and test sets must be large enough to be good estimators of the distribution.
- Shuffling data is important to avoid biases (e.g., daytime vs. nighttime images).
- The Dataset class in PyTorch needs `__len__` and `__getitem__` methods. `__len__` returns the length of the dataset; `__getitem__` returns the i-th element of the dataset, which can be loaded from disk, normalized, transformed, etc.
- The DataLoader takes a dataset, creates batches, and loads them into RAM efficiently.
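A minimal sketch of such a Dataset/DataLoader pair; the class name, dummy tensors, and batch size below are illustrative placeholders, not from the notes:

```python
# Minimal sketch of a custom PyTorch Dataset wrapped in a DataLoader.
import torch
from torch.utils.data import Dataset, DataLoader

class ImageListDataset(Dataset):
    def __init__(self, samples, transform=None):
        # samples: list of (image_tensor, label) pairs, assumed prepared elsewhere
        self.samples = samples
        self.transform = transform

    def __len__(self):
        # number of samples in the dataset
        return len(self.samples)

    def __getitem__(self, i):
        # return the i-th (input, label) pair; loading/decoding/normalizing happens here
        x, y = self.samples[i]
        if self.transform is not None:
            x = self.transform(x)
        return x, y

# Dummy data: 100 random "images" with binary labels (placeholder for real files on disk).
data = [(torch.randn(3, 32, 32), torch.randint(0, 2, (1,)).item()) for _ in range(100)]
dataset = ImageListDataset(data)

# The DataLoader shuffles, batches, and (optionally) loads samples in parallel workers.
loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=0)
for images, labels in loader:
    print(images.shape, labels.shape)  # torch.Size([16, 3, 32, 32]) torch.Size([16])
    break
```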
1.0.3. Data Augmentation
- Artificially augmenting data improves training.
- Techniques include flipping, rotating, cropping images, Gaussian blur, etc., with a probability \(p\).
- Acts as a regularization method to prevent overfitting.
- Not applied to validation and test sets, unlike data processing.
- Should be problem-specific (e.g., cannot flip a 9).
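A possible augmentation pipeline along these lines, assuming torchvision is available; the specific transforms, probability \(p\), and image sizes are illustrative:

```python
# Sketch of image augmentations applied with probability p, using torchvision.
import torch
from torchvision import transforms

p = 0.5
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=p),                                    # flip with probability p
    transforms.RandomApply([transforms.RandomRotation(15)], p=p),            # small random rotation
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=3)], p=p),   # Gaussian blur
    transforms.RandomResizedCrop(32, scale=(0.8, 1.0)),                      # random crop and resize
])

# Validation/test pipelines omit augmentation and keep only deterministic processing.
eval_transform = transforms.Compose([
    transforms.Resize(32),
    transforms.CenterCrop(32),
])

x = torch.rand(3, 64, 64)          # placeholder image tensor in [0, 1]
print(train_transform(x).shape)    # torch.Size([3, 32, 32])
```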
1.0.4. Data Processing
- Applied to all samples.
- Normalization ensures all dimensions are on the same scale, easing learning and preventing exploding gradients.
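A small sketch of this for image tensors, assuming torchvision: the per-channel mean and standard deviation are computed on the training split only and the resulting Normalize transform is reused for every split (the dummy data and shapes are placeholders):

```python
# Sketch: compute per-channel statistics on the TRAINING split only,
# then apply the same Normalize transform to all splits.
import torch
from torchvision import transforms

train_images = torch.rand(1000, 3, 32, 32)   # placeholder training images in [0, 1]

# Statistics come from the training set only, to avoid data leakage.
mean = train_images.mean(dim=(0, 2, 3))
std = train_images.std(dim=(0, 2, 3))

normalize = transforms.Normalize(mean=mean.tolist(), std=std.tolist())

# The same normalization is reused for train, validation, and test samples.
x_normalized = normalize(train_images[0])
print(mean, std)
```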
1.0.5. Exam Questions
Difference between supervised and unsupervised learning?
- Supervised: Trained on existing, fixed ground-truth data, e.g., class labels for classification, depth maps for depth estimation.
- Unsupervised: No labels, find patterns in the data. Clustering, dimensionality reduction, anomaly detection.
Difference between classification and regression?
- Classification predicts discrete labels, while regression predicts continuous values.
Advantages of unsupervised learning over supervised learning?
- Labeled data is expensive and sometimes impossible to obtain. Unsupervised learning can find patterns in the data without prior knowledge.
- Self-supervised approaches create labels from the data itself, as in masked language modeling.
Example of using unsupervised learning to improve accuracy.
- Autoencoders for feature extraction (sketched below).
- Pre-training a language model on a masked language modeling task and then fine-tuning it on a classification task.
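A minimal sketch of the autoencoder idea, with illustrative layer sizes: the encoder is trained without labels on a reconstruction loss and later reused as a feature extractor for a supervised head.

```python
# Sketch: unsupervised pre-training of an autoencoder, then reuse of the encoder.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

x = torch.rand(64, 784)                        # unlabeled batch (placeholder data)
recon = decoder(encoder(x))
recon_loss = nn.functional.mse_loss(recon, x)  # unsupervised reconstruction objective

# Later: freeze or fine-tune the encoder and attach a small classification head.
classifier = nn.Sequential(encoder, nn.Linear(32, 10))
logits = classifier(x)                         # supervised fine-tuning uses labeled data
print(recon_loss.item(), logits.shape)
```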
K-means:
- What is the K-means clustering algorithm?
- Unsupervised learning algorithm used for clustering data into \(k\) groups. Each data point is assigned to the cluster with the nearest centroid, then the centroids are recalculated as the mean of their assigned points. Iterative algorithm that converges to a local minimum (see the sketch after this block).
- What is its key hyperparameter?
- The number of clusters, \(k\).
- Consequences of a too small or too large value of its hyperparameter?
- Too small a \(k\) underfits by merging distinct groups (oversimplifying the data), while too large a \(k\) overfits by splitting natural groups into arbitrary fragments.
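A short sketch of the K-means loop described above (the random data, \(k\), and iteration count are illustrative):

```python
# Sketch of K-means: assign points to the nearest centroid, then recompute centroids.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids as k randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to the cluster with the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged (to a local minimum)
        centroids = new_centroids
    return labels, centroids

X = np.random.default_rng(1).normal(size=(200, 2))   # placeholder 2-D data
labels, centroids = kmeans(X, k=3)
print(centroids)
```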
PCA:
- What is PCA?
- Unsupervised dimensionality reduction technique. Uses a linear transformation to project data into a lower-dimensional space that preserves the most variance.
- What Deep Learning architecture could perform a similar task? Why is the deep learning method preferable for real-world problems?
- Autoencoders: they can learn non-linear mappings to a lower-dimensional latent space, whereas PCA is limited to linear projections, which makes autoencoders better suited to complex real-world data.
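A sketch of PCA via the SVD of the centered data matrix (the data shape and number of kept components are illustrative):

```python
# Sketch of PCA: center the data, project onto the top-k principal directions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))          # 500 samples, 10 features (placeholder)
k = 2

X_centered = X - X.mean(axis=0)
# Rows of Vt are the principal directions, ordered by explained variance.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_reduced = X_centered @ Vt[:k].T       # linear projection to k dimensions

explained_variance = (S ** 2) / (len(X) - 1)
print(X_reduced.shape, explained_variance[:k])
```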
Data splits:
- Importance of shuffling data before splitting?
- To avoid biases and ensure the data distribution is uniform across splits.
- Common split ratios?
- 80/10/10 or 70/15/15 for training/validation/test (a shuffled random_split sketch follows this list).
- Why is the training set bigger?
- To provide more data for the model to learn from.
- When can we relax the ratio between the splits to be more even? When the other way around?
- More even when the dataset is small; more skewed towards training when the dataset is large.
- Consequence of a too small validation set?
- It may not provide a reliable estimate of model performance.
- How to overcome a too small training set?
- Use data augmentation to artificially increase the dataset size.
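The shuffled 80/10/10 split mentioned above can be sketched with torch.utils.data.random_split (the dummy dataset and seed are illustrative):

```python
# Sketch: shuffled 80/10/10 train/validation/test split.
import torch
from torch.utils.data import TensorDataset, random_split

dataset = TensorDataset(torch.randn(1000, 3, 32, 32), torch.randint(0, 10, (1000,)))

n = len(dataset)
n_train = int(0.8 * n)
n_val = int(0.1 * n)
n_test = n - n_train - n_val

# random_split shuffles indices before splitting; the generator makes it reproducible.
train_set, val_set, test_set = random_split(
    dataset, [n_train, n_val, n_test],
    generator=torch.Generator().manual_seed(42),
)
print(len(train_set), len(val_set), len(test_set))  # 800 100 100
```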
Why not use the validation set within the training set and just use the test set for validation?
- To monitor overfitting and perform hyperparameter tuning without biasing the test set.
Main issue with modern dataset sizes? Why is it so?
- They can be extremely large, making them difficult to handle and process efficiently.
- Still limited and cannot represent the entire population.
To which part of the dataset should we apply data augmentation? Why?
- To the training set, to increase the diversity of data the model learns from and reduce overfitting.
Difference between data augmentation and data processing techniques like normalization?
- Data augmentation: artificially increases the effective dataset size by applying random transformations to the training data; it acts as a regularization method.
- Data processing: applied to all samples and all splits. Normalization ensures all dimensions are on the same scale, easing learning and preventing exploding gradients.
- Why is the mean and standard deviation of the data calculated only over the training set?
- To prevent data leakage and keep validation and test sets unbiased.