Core Wiki

Machine Learning

Insights / Concepts

ML Training Strategy

Evaluation Metrics

  • Precision (of examples classified as cats what % actually are cats?)
  • Recall (what % of actual cats are correctly classified?)
  • F1 score
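
The three metrics above can be computed directly from raw counts; a minimal sketch (the tp/fp/fn values are made up for illustration):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp)   # of examples classified as cats, what % are cats
    recall = tp / (tp + fn)      # of actual cats, what % are correctly classified
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(p, r, f1)
```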

Supervised learning assumptions

  • You can fit the training set pretty well. (Avoidable bias)
  • The training set performance generalizes well to the dev/test set. (Variance)

Train/Dev/Test splits?

  • In the past, a 60/20/20 split was fine for small training sets (roughly 100–100k examples)
  • With very large datasets (~1M examples), a 98/1/1 split is common

Reducing avoidable bias

  • Train bigger model
  • Train longer/better optimization (momentum, RMSprop, Adam)
  • Change NN architecture, hyperparameter search, activations

Reducing variance

  • More data
  • Regularization (L2, Dropout, Data Augmentation)
  • Change NN architecture, hyperparameter search, activations

Does not fit training set well on cost function?

  • Bigger network
  • Better optimizer (Adam)

Does not fit dev set well on cost function?

  • Regularization
  • Bigger training set

Does not fit test set well on cost function?

  • Bigger dev set

Does not perform well in real world?

  • Change dev set
  • Change cost function

Classification example

  • Human error 1%
  • Training error 8%
  • Dev error 10%

Big gap between Human error and Training error, focus on reducing bias.

  • Human error 7.5%
  • Training error 8%
  • Dev error 10%

Small gap between Human and Training error, doing fine on the Training set, want to reduce variance between Training and Dev sets.

Typically human error is close to Bayes error.

  • Avoidable bias = Training error - Bayes error
  • Variance = Dev error - Training error
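
The classification examples above can be turned into a tiny diagnostic: compare the two gaps and work on the larger one (thresholds here are just the comparison itself, no extra assumptions):

```python
def diagnose(human_err, train_err, dev_err):
    """Suggest a focus area, treating human-level error as a proxy for Bayes error."""
    avoidable_bias = train_err - human_err   # gap to (approximate) Bayes error
    variance = dev_err - train_err           # generalization gap
    return "reduce bias" if avoidable_bias > variance else "reduce variance"

print(diagnose(0.01, 0.08, 0.10))    # bias gap 7% > variance gap 2%
print(diagnose(0.075, 0.08, 0.10))   # bias gap 0.5% < variance gap 2%
```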

CNN Output Size

Output size is given by ⌊(n - f + 2p)/s⌋ + 1

  • n x n (input size)
  • f x f (filter/kernel size)
  • p (padding)
  • s (stride)
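
The formula above as a helper, using floor division for ⌊·⌋:

```python
def conv_output_size(n, f, p=0, s=1):
    """Spatial output size of a convolution or pooling layer."""
    return (n - f + 2 * p) // s + 1

print(conv_output_size(n=28, f=5))          # 5x5 conv, no padding, stride 1 -> 24
print(conv_output_size(n=28, f=2, s=2))     # 2x2 pooling with stride = f   -> 14
print(conv_output_size(n=7, f=3, p=1))      # "same" padding keeps size     -> 7
```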

Common settings

  • Convolution typically has stride = 1, padding = 0
  • Pooling typically has stride = f, padding = 0

Convolutional NN

Motivation

  • Parameter sharing - a feature detector (e.g. a vertical edge detector) that is useful in one part of an image is useful in another part of the image. Filter matrices have some level of universality.
  • Sparsity of connections - in each layer, each output value depends only on a small number of inputs. An output value after convolution depends on only a small patch of the image.
  • Translational invariance - if you shift a “cat” in the photo, convolution filters will still be able to pick up the features
  • Convolutional layers typically shrink the spatial dimensions of the output data (width, height of an image)
  • 1×1 convolutional filters can be used to shrink the channel dimension of an input volume. Say the input volume is 28x28x192; applying 32 filters of size 1x1x192 outputs 28x28x32
  • 1×1 convolutions can decrease the number of required arithmetic operations (bottleneck layer)
  • The Inception network constructs outputs using different-size filters and concatenates them into the final output.
  • A fully connected layer (say 400 neurons) has a convolutional implementation as a 1x1x400 output volume.
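
To make the bottleneck point concrete, here is a back-of-the-envelope multiplication count with illustrative (not prescriptive) sizes: a 5×5 convolution taking 28x28x192 to 28x28x32, computed directly versus via a 1×1×16 bottleneck:

```python
def conv_mults(out_h, out_w, n_filters, f, in_channels):
    # Each output value costs f*f*in_channels multiplications,
    # and there are out_h * out_w * n_filters output values.
    return out_h * out_w * n_filters * f * f * in_channels

direct = conv_mults(28, 28, 32, 5, 192)                               # ~120M
bottleneck = conv_mults(28, 28, 16, 1, 192) + conv_mults(28, 28, 32, 5, 16)  # ~12.4M
print(direct, bottleneck, direct / bottleneck)
```

The bottleneck cuts the arithmetic by roughly a factor of 10 for this shape.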

Similarity function

When one needs to compare two images (face verification, logo verification), a similarity function can be trained once, without a need to retrain the network when a new image enters the database.

d(img1, img2) = degree of difference between images

A Siamese network obtains encodings f(x_i). The goal is to learn parameters such that

  • if x_i, x_j are the same person, then ||f(x_i) - f(x_j)||² is small
  • if x_i, x_j are different persons, then ||f(x_i) - f(x_j)||² is large

Training could be done using Triplet Loss which has objective for anchor image A, positive image P and negative image N

Naively: d(A,P) < d(A,N)

For technical reasons, to avoid the trivial solution f(x) = 0 for all x, we introduce a margin α and require d(A,P) - d(A,N) + α ≤ 0.

Given A, P, N we can construct the loss L(A,P,N) = max(d(A,P) - d(A,N) + α, 0). The total loss is the sum over all training triplets A, P, N. The idea is that as long as you manage to get d(A,P) - d(A,N) + α ≤ 0, the loss is 0; otherwise the loss is positive (not good).
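
A minimal NumPy sketch of this loss, with d taken as the squared Euclidean distance between encodings (the example encodings are made up):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Triplet loss for encodings f(A), f(P), f(N) with margin alpha."""
    d_ap = np.sum((f_a - f_p) ** 2)   # squared distance anchor-positive
    d_an = np.sum((f_a - f_n) ** 2)   # squared distance anchor-negative
    return max(d_ap - d_an + alpha, 0.0)

f_a = np.array([0.0, 1.0])
f_p = np.array([0.0, 1.1])   # same person: close encoding
f_n = np.array([1.0, 0.0])   # different person: far encoding
print(triplet_loss(f_a, f_p, f_n))  # 0.0 — the margin constraint is satisfied
```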

  • Face verification solves 1:1 matching problem, face recognition addresses harder 1:K matching problem
  • Triplet Loss is an effective loss function for training a neural network to learn an encoding of a face
  • The same encoding can be used for verification and recognition by using a distance metric to know how two images are different from each other

Basic recipe for ML

High bias solutions (underfitting training set)

  • Bigger network
  • Train longer
  • Different NN architecture

High variance solutions (overfitting training set, low dev set accuracy)

  • More data
  • Regularization
  • Different NN architecture

Regularization techniques:

  • L2 regularization
  • Dropout (inverted dropout)
  • Data augmentation (flipping horizontally, zooming, cropping, rotating, distorting)
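
Inverted dropout (listed above) can be sketched in a few lines for one layer's activations; the rescale by keep_prob at train time keeps the expected activation unchanged, so nothing special is needed at test time:

```python
import numpy as np

rng = np.random.default_rng(0)

def inverted_dropout(a, keep_prob=0.8):
    """One forward-pass step of inverted dropout on activations a."""
    mask = rng.random(a.shape) < keep_prob   # keep each unit with prob keep_prob
    return a * mask / keep_prob              # rescale so the expectation matches a

a = np.ones((3, 4))
print(inverted_dropout(a))   # entries are either 0 or 1/0.8 = 1.25
```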

Vanishing and exploding gradients can be mitigated via appropriate weight initialization, such as He initialization.
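
A sketch of He-style initialization for a ReLU layer: weights are drawn from N(0, 2/n_in), which keeps activation variance roughly stable from layer to layer:

```python
import numpy as np

def he_init(n_in, n_out, rng=np.random.default_rng(0)):
    """He initialization: Gaussian weights with variance 2/n_in."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

W = he_init(512, 256)
print(W.shape, W.std())   # sample std should be near sqrt(2/512) ≈ 0.0625
```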

Concepts to look into

  • autoencoders
  • activation maximisation
  • localized encodings
  • deep image generation network
  • factorized encodings
  • segmentation
  • multilayer perceptron

Useful

Hyperparameter testing

from collections import OrderedDict
from collections import namedtuple
from itertools import product
 
class RunBuilder:
  @staticmethod
  def get_runs(params):
    # One named tuple per combination in the cartesian product of the values
    Run = namedtuple('Run', params.keys())
    runs = [Run(*v) for v in product(*params.values())]
    return runs
 
params = OrderedDict(
    lr = [0.01, 0.001],
    batch_size = [100, 1000]
)
 
for run in RunBuilder.get_runs(params):
  print(f"{run}, {run.lr}")
  # training

Custom confusion matrix

import itertools
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

cmt = confusion_matrix(targets, predictions)
 
def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Blues):
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
 
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
 
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt), horizontalalignment="center", color="white" if cm[i, j] > thresh else "black")
 
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

names = (
    'T-shirt/top'
    ,'Trouser'
    ,'Pullover'
    ,'Dress'
    ,'Coat'
    ,'Sandal'
    ,'Shirt'
    ,'Sneaker'
    ,'Bag'
    ,'Ankle boot'
)
plt.figure(figsize=(10,10))
plot_confusion_matrix(cmt, names)