Core Wiki

Machine Learning

Insights / Concepts

ML Training Strategy

Evaluation Metrics

  • Precision (of examples classified as cats what % actually are cats?)
  • Recall (what % of actual cats are correctly classified?)
  • F1 score
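
The three metrics above can be computed directly from raw counts; a minimal sketch (the tp/fp/fn values are made up for illustration):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp)   # of examples classified as cats, what % are cats
    recall = tp / (tp + fn)      # of actual cats, what % are correctly classified
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(p, r, f1)
```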

Supervised learning assumptions

  • You can fit the training set pretty well. (Avoidable bias)
  • The training set performance generalizes well to the dev/test set. (Variance)

Train/Dev/Test splits?

  • In the past, a 60/20/20 split was fine for small training sets (roughly 100–100k examples)
  • With very large datasets (~1M examples), a 98/1/1 split is common

Reducing avoidable bias

  • Train bigger model
  • Train longer/better optimization (momentum, RMSprop, Adam)
  • Change NN architecture, hyperparameter search, activations

Reducing variance

  • More data
  • Regularization (L2, Dropout, Data Augmentation)
  • Change NN architecture, hyperparameter search, activations

Does not fit training set well on cost function?

  • Bigger network
  • Better optimizer (Adam)

Does not fit dev set well on cost function?

  • Regularization
  • Bigger training set

Does not fit test set well on cost function?

  • Bigger dev set

Does not perform well in real world?

  • Change dev set
  • Change cost function

Classification example

  • Human error 1%
  • Training error 8%
  • Dev error 10%

Big gap between Human error and Training error, focus on reducing bias.

  • Human error 7.5%
  • Training error 8%
  • Dev error 10%

Small gap between Human and Training error, doing fine on the Training set, want to reduce variance between Training and Dev sets.

Typically human error is close to Bayes error.

  • Avoidable bias = Training error - Bayes error
  • Variance = Dev error - Training error
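
The classification examples above can be turned into a tiny diagnostic: compare the two gaps and work on the larger one (thresholds here are just the comparison itself, no extra assumptions):

```python
def diagnose(human_err, train_err, dev_err):
    """Suggest a focus area, treating human-level error as a proxy for Bayes error."""
    avoidable_bias = train_err - human_err   # gap to (approximate) Bayes error
    variance = dev_err - train_err           # generalization gap
    return "reduce bias" if avoidable_bias > variance else "reduce variance"

print(diagnose(0.01, 0.08, 0.10))    # bias gap 7% > variance gap 2%
print(diagnose(0.075, 0.08, 0.10))   # bias gap 0.5% < variance gap 2%
```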

CNN Output Size

Output size is given by ⌊(n - f + 2p)/s⌋ + 1

  • n x n (input size)
  • f x f (filter/kernel size)
  • p (padding)
  • s (stride)
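
The formula above as a helper, using floor division for ⌊·⌋:

```python
def conv_output_size(n, f, p=0, s=1):
    """Spatial output size of a convolution or pooling layer."""
    return (n - f + 2 * p) // s + 1

print(conv_output_size(n=28, f=5))          # 5x5 conv, no padding, stride 1 -> 24
print(conv_output_size(n=28, f=2, s=2))     # 2x2 pooling with stride = f   -> 14
print(conv_output_size(n=7, f=3, p=1))      # "same" padding keeps size     -> 7
```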

Common settings

  • Convolution typically has stride = 1, padding = 0
  • Pooling typically has stride = f, padding = 0

Convolutional NN

Motivation

  • Parameter sharing - a feature detector (e.g. a vertical edge detector) that is useful in one part of an image is useful in another part of the image. Filter matrices have some level of universality.
  • Sparsity of connections - in each layer, each output value depends only on a small number of inputs. An output value after convolution depends on only a small patch of the image.
  • Translational invariance - if you shift a “cat” in the photo, convolution filters will still be able to pick up the features
  • Convolutional layers typically shrink the spatial dimensions of the output data (width, height of an image)
  • 1×1 convolutional filters can be used to shrink the channel dimension of an input volume. Say the input volume is 28x28x192; applying 32 filters of size 1x1x192 outputs 28x28x32
  • 1×1 convolutions can decrease the number of required arithmetic operations (bottleneck layer)
  • The Inception network constructs outputs using different-size filters and concatenates them into the final output.
  • A fully connected layer (say 400 neurons) has a convolutional implementation as a 1x1x400 output volume.
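
To make the bottleneck point concrete, here is a back-of-the-envelope multiplication count with illustrative (not prescriptive) sizes: a 5×5 convolution taking 28x28x192 to 28x28x32, computed directly versus via a 1×1×16 bottleneck:

```python
def conv_mults(out_h, out_w, n_filters, f, in_channels):
    # Each output value costs f*f*in_channels multiplications,
    # and there are out_h * out_w * n_filters output values.
    return out_h * out_w * n_filters * f * f * in_channels

direct = conv_mults(28, 28, 32, 5, 192)                               # ~120M
bottleneck = conv_mults(28, 28, 16, 1, 192) + conv_mults(28, 28, 32, 5, 16)  # ~12.4M
print(direct, bottleneck, direct / bottleneck)
```

The bottleneck cuts the arithmetic by roughly a factor of 10 for this shape.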

Similarity function

When one needs to compare two images (face verification, logo verification), a similarity function can be trained once, without a need to retrain the network when a new image enters the database.

d(img1, img2) = degree of difference between images

A Siamese network obtains encodings f(x_i). The goal is to learn parameters such that

  • if x_i, x_j are the same person, then ||f(x_i) - f(x_j)||² is small
  • if x_i, x_j are different persons, then ||f(x_i) - f(x_j)||² is large

Training could be done using Triplet Loss which has objective for anchor image A, positive image P and negative image N

Naively: d(A,P) < d(A,N)

For technical reasons, to avoid the trivial solution f(x) = 0 for all x, we introduce a margin α and require d(A,P) - d(A,N) + α ≤ 0.

Given A, P, N we can construct the loss L(A,P,N) = max(d(A,P) - d(A,N) + α, 0). The total loss is the sum over all training triplets A, P, N. The idea is that as long as you manage to get d(A,P) - d(A,N) + α ≤ 0, the loss is 0; otherwise the loss is positive (not good).
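
A minimal NumPy sketch of this loss, with d taken as the squared Euclidean distance between encodings (the example encodings are made up):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Triplet loss for encodings f(A), f(P), f(N) with margin alpha."""
    d_ap = np.sum((f_a - f_p) ** 2)   # squared distance anchor-positive
    d_an = np.sum((f_a - f_n) ** 2)   # squared distance anchor-negative
    return max(d_ap - d_an + alpha, 0.0)

f_a = np.array([0.0, 1.0])
f_p = np.array([0.0, 1.1])   # same person: close encoding
f_n = np.array([1.0, 0.0])   # different person: far encoding
print(triplet_loss(f_a, f_p, f_n))  # 0.0 — the margin constraint is satisfied
```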

  • Face verification solves 1:1 matching problem, face recognition addresses harder 1:K matching problem
  • Triplet Loss is an effective loss function for training a neural network to learn an encoding of a face
  • The same encoding can be used for verification and recognition by using a distance metric to know how two images are different from each other

Basic recipe for ML

High bias solutions (underfitting training set)

  • Bigger network
  • Train longer
  • Different NN architecture

High variance solutions (overfitting training set, low dev set accuracy)

  • More data
  • Regularization
  • Different NN architecture

Regularization techniques:

  • L2 regularization
  • Dropout (inverted dropout)
  • Data augmentation (flipping horizontally, zooming, cropping, rotating, distorting)
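
Inverted dropout (listed above) can be sketched in a few lines for one layer's activations; the rescale by keep_prob at train time keeps the expected activation unchanged, so nothing special is needed at test time:

```python
import numpy as np

rng = np.random.default_rng(0)

def inverted_dropout(a, keep_prob=0.8):
    """One forward-pass step of inverted dropout on activations a."""
    mask = rng.random(a.shape) < keep_prob   # keep each unit with prob keep_prob
    return a * mask / keep_prob              # rescale so the expectation matches a

a = np.ones((3, 4))
print(inverted_dropout(a))   # entries are either 0 or 1/0.8 = 1.25
```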

Vanishing and exploding gradients can be mitigated via appropriate weight initialization, such as He initialization.
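
A sketch of He-style initialization for a ReLU layer: weights are drawn from N(0, 2/n_in), which keeps activation variance roughly stable from layer to layer:

```python
import numpy as np

def he_init(n_in, n_out, rng=np.random.default_rng(0)):
    """He initialization: Gaussian weights with variance 2/n_in."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

W = he_init(512, 256)
print(W.shape, W.std())   # sample std should be near sqrt(2/512) ≈ 0.0625
```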

Concepts to look into

  • autoencoders
  • activation maximisation
  • localized encodings
  • deep image generation network
  • factorized encodings
  • segmentation
  • multilayer perceptron

Useful

Hyperparameter testing

from collections import OrderedDict
from collections import namedtuple
from itertools import product
 
class RunBuilder:
  @staticmethod
  def get_runs(params):
    # One named tuple per combination in the cartesian product of the values
    Run = namedtuple('Run', params.keys())
    runs = [Run(*v) for v in product(*params.values())]
    return runs
 
params = OrderedDict(
    lr = [0.01, 0.001],
    batch_size = [100, 1000]
)
 
for run in RunBuilder.get_runs(params):
  print(f"{run}, {run.lr}")
  # training

Custom confusion matrix

import itertools
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

cmt = confusion_matrix(targets, predictions)
 
def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Blues):
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
 
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
 
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt), horizontalalignment="center", color="white" if cm[i, j] > thresh else "black")
 
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

names = (
    'T-shirt/top'
    ,'Trouser'
    ,'Pullover'
    ,'Dress'
    ,'Coat'
    ,'Sandal'
    ,'Shirt'
    ,'Sneaker'
    ,'Bag'
    ,'Ankle boot'
)
plt.figure(figsize=(10,10))
plot_confusion_matrix(cmt, names)