1. ML Basics
1.0.1. Supervised vs Unsupervised Learning
1.0.2. Dataset and Data-loaders
- ML is data-driven.
- Data is split into training, validation, and test sets.
- We minimize the loss over the training set, which typically leads to some overfitting.
- Preventing data leakage is crucial.
- Good performance on the test set is essential.
- Common splits: 80/10/10 or 70/15/15; training should be 50% or more.
- Validation and test sets must be large enough to be good estimators of the distribution.
- Shuffling data is important to avoid biases (e.g., daytime vs. nighttime images).
- The Dataset class in PyTorch needs `__len__` and `__getitem__` methods. `__len__` returns the length of the dataset; `__getitem__` returns the i-th element of the dataset, which can be loaded from disk, normalized, transformed, etc.
- The DataLoader takes a dataset, creates batches, and loads them into RAM efficiently.
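A minimal sketch of such a Dataset/DataLoader pair; the class name, dummy tensors, and batch size below are illustrative placeholders, not from the notes:

```python
# Minimal sketch of a custom PyTorch Dataset wrapped in a DataLoader.
import torch
from torch.utils.data import Dataset, DataLoader

class ImageListDataset(Dataset):
    def __init__(self, samples, transform=None):
        # samples: list of (image_tensor, label) pairs, assumed prepared elsewhere
        self.samples = samples
        self.transform = transform

    def __len__(self):
        # number of samples in the dataset
        return len(self.samples)

    def __getitem__(self, i):
        # return the i-th (input, label) pair; loading/decoding/normalizing happens here
        x, y = self.samples[i]
        if self.transform is not None:
            x = self.transform(x)
        return x, y

# Dummy data: 100 random "images" with binary labels (placeholder for real files on disk).
data = [(torch.randn(3, 32, 32), torch.randint(0, 2, (1,)).item()) for _ in range(100)]
dataset = ImageListDataset(data)

# The DataLoader shuffles, batches, and (optionally) loads samples in parallel workers.
loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=0)
for images, labels in loader:
    print(images.shape, labels.shape)  # torch.Size([16, 3, 32, 32]) torch.Size([16])
    break
```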
1.0.3. Data Augmentation
- Artificially augmenting data improves training.
- Techniques include flipping, rotating, cropping images, Gaussian blur, etc., with a probability \(p\).
- Acts as a regularization method to prevent overfitting.
- Not applied to validation and test sets, unlike data processing.
- Should be problem-specific (e.g., cannot flip a 9).
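A possible augmentation pipeline along these lines, assuming torchvision is available; the specific transforms, probability \(p\), and image sizes are illustrative:

```python
# Sketch of image augmentations applied with probability p, using torchvision.
import torch
from torchvision import transforms

p = 0.5
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=p),                                    # flip with probability p
    transforms.RandomApply([transforms.RandomRotation(15)], p=p),            # small random rotation
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=3)], p=p),   # Gaussian blur
    transforms.RandomResizedCrop(32, scale=(0.8, 1.0)),                      # random crop and resize
])

# Validation/test pipelines omit augmentation and keep only deterministic processing.
eval_transform = transforms.Compose([
    transforms.Resize(32),
    transforms.CenterCrop(32),
])

x = torch.rand(3, 64, 64)          # placeholder image tensor in [0, 1]
print(train_transform(x).shape)    # torch.Size([3, 32, 32])
```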
1.0.4. Data Processing
- Applied to all samples.
- Normalization ensures all dimensions are on the same scale, easing learning and preventing exploding gradients.
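A small sketch of this for image tensors, assuming torchvision: the per-channel mean and standard deviation are computed on the training split only and the resulting Normalize transform is reused for every split (the dummy data and shapes are placeholders):

```python
# Sketch: compute per-channel statistics on the TRAINING split only,
# then apply the same Normalize transform to all splits.
import torch
from torchvision import transforms

train_images = torch.rand(1000, 3, 32, 32)   # placeholder training images in [0, 1]

# Statistics come from the training set only, to avoid data leakage.
mean = train_images.mean(dim=(0, 2, 3))
std = train_images.std(dim=(0, 2, 3))

normalize = transforms.Normalize(mean=mean.tolist(), std=std.tolist())

# The same normalization is reused for train, validation, and test samples.
x_normalized = normalize(train_images[0])
print(mean, std)
```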
1.0.5. Exam Questions
Difference between supervised and unsupervised learning?
- Supervised: Trained on existing, fixed ground-truth data, e.g., class labels for classification, depth maps for depth estimation.
- Unsupervised: No labels, find patterns in the data. Clustering, dimensionality reduction, anomaly detection.
Difference between classification and regression?
- Classification predicts discrete labels, while regression predicts continuous values.
Advantages of unsupervised learning over supervised learning?
- Labeled data is expensive and sometimes impossible to obtain. Unsupervised learning can find patterns in the data without prior knowledge.
- Self-supervised approaches create labels from the data itself, as in masked language modeling.
Example of using unsupervised learning to improve accuracy.
- Autoencoders for feature extraction (sketched below).
- Pre-training a language model on a masked language modeling task and then fine-tuning it on a classification task.
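A minimal sketch of the autoencoder idea, with illustrative layer sizes: the encoder is trained without labels on a reconstruction loss and later reused as a feature extractor for a supervised head.

```python
# Sketch: unsupervised pre-training of an autoencoder, then reuse of the encoder.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

x = torch.rand(64, 784)                        # unlabeled batch (placeholder data)
recon = decoder(encoder(x))
recon_loss = nn.functional.mse_loss(recon, x)  # unsupervised reconstruction objective

# Later: freeze or fine-tune the encoder and attach a small classification head.
classifier = nn.Sequential(encoder, nn.Linear(32, 10))
logits = classifier(x)                         # supervised fine-tuning uses labeled data
print(recon_loss.item(), logits.shape)
```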
K-means:
- What is the K-means clustering algorithm?
- Unsupervised learning algorithm used for clustering data into \(k\) groups. Each data point is assigned to the cluster with the nearest centroid, then the centroids are recalculated as the mean of their assigned points. Iterative algorithm that converges to a local minimum (see the sketch after this block).
- What is its key hyperparameter?
- The number of clusters, \(k\).
- Consequences of a too small or too large value of its hyperparameter?
- Too small a \(k\) underfits by merging distinct groups (oversimplifying the data), while too large a \(k\) overfits by splitting natural groups into arbitrary fragments.
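A short sketch of the K-means loop described above (the random data, \(k\), and iteration count are illustrative):

```python
# Sketch of K-means: assign points to the nearest centroid, then recompute centroids.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids as k randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to the cluster with the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged (to a local minimum)
        centroids = new_centroids
    return labels, centroids

X = np.random.default_rng(1).normal(size=(200, 2))   # placeholder 2-D data
labels, centroids = kmeans(X, k=3)
print(centroids)
```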
PCA:
- What is PCA?
- Unsupervised dimensionality reduction technique. Uses a linear transformation to project data into a lower-dimensional space that preserves the most variance.
- What Deep Learning architecture could perform a similar task? Why is the deep learning method preferable for real-world problems?
- Autoencoders: they can learn non-linear mappings to a lower-dimensional latent space, whereas PCA is limited to linear projections, which makes autoencoders better suited to complex real-world data.
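A sketch of PCA via the SVD of the centered data matrix (the data shape and number of kept components are illustrative):

```python
# Sketch of PCA: center the data, project onto the top-k principal directions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))          # 500 samples, 10 features (placeholder)
k = 2

X_centered = X - X.mean(axis=0)
# Rows of Vt are the principal directions, ordered by explained variance.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_reduced = X_centered @ Vt[:k].T       # linear projection to k dimensions

explained_variance = (S ** 2) / (len(X) - 1)
print(X_reduced.shape, explained_variance[:k])
```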
Data splits:
- Importance of shuffling data before splitting?
- To avoid biases and ensure the data distribution is uniform across splits.
- Common split ratios?
- 80/10/10 or 70/15/15 for training/validation/test (a shuffled random_split sketch follows this list).
- Why is the training set bigger?
- To provide more data for the model to learn from.
- When can we relax the ratio between the splits to be more even? When the other way around?
- More even when the dataset is small; more skewed towards training when the dataset is large.
- Consequence of a too small validation set?
- It may not provide a reliable estimate of model performance.
- How to overcome a too small training set?
- Use data augmentation to artificially increase the dataset size.
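The shuffled 80/10/10 split mentioned above can be sketched with torch.utils.data.random_split (the dummy dataset and seed are illustrative):

```python
# Sketch: shuffled 80/10/10 train/validation/test split.
import torch
from torch.utils.data import TensorDataset, random_split

dataset = TensorDataset(torch.randn(1000, 3, 32, 32), torch.randint(0, 10, (1000,)))

n = len(dataset)
n_train = int(0.8 * n)
n_val = int(0.1 * n)
n_test = n - n_train - n_val

# random_split shuffles indices before splitting; the generator makes it reproducible.
train_set, val_set, test_set = random_split(
    dataset, [n_train, n_val, n_test],
    generator=torch.Generator().manual_seed(42),
)
print(len(train_set), len(val_set), len(test_set))  # 800 100 100
```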
Why not use the validation set within the training set and just use the test set for validation?
- To monitor overfitting and perform hyperparameter tuning without biasing the test set.
Main issue with modern dataset sizes? Why is it so?
- They can be extremely large, making them difficult to handle and process efficiently.
- Still limited and cannot represent the entire population.
To which part of the dataset should we apply data augmentation? Why?
- To the training set, to increase the diversity of data the model learns from and reduce overfitting.
Difference between data augmentation and data processing techniques like normalization?
- Data augmentation: artificially increases the effective dataset size by applying random transformations to the training data; it acts as a regularization method.
- Data processing: applied to all samples and all splits. Normalization ensures all dimensions are on the same scale, easing learning and preventing exploding gradients.
- Why is the mean and standard deviation of the data calculated only over the training set?
- To prevent data leakage and keep validation and test sets unbiased.