Introduction to NLP (Wk.6)


Ch.8 Deep Learning


8-1) Perceptron


Introduction to Perceptron



The graph above is a step function.
Each input is multiplied by its weight and sent to the artificial neuron. If the sum of these products reaches or exceeds the threshold, the neuron returns 1; otherwise, it returns 0.
$$\text{if } \sum_i^{n} w_{i}x_{i} \geq \theta \rightarrow y = 1$$
$$\text{if } \sum_i^{n} w_{i}x_{i} < \theta \rightarrow y = 0$$
$$\theta = \text{threshold}$$
We can move the threshold to the left-hand side and express it as the bias b (that is, b = -θ).
The bias is also fed into the perceptron as an input, as below.

$$\text{if } \sum_i^{n} w_{i}x_{i} + b \geq 0 \rightarrow y = 1$$
$$\text{if } \sum_i^{n} w_{i}x_{i} + b < 0 \rightarrow y = 0$$
Like the weights, b is a variable whose optimal value must be found during training.
Activation Function: the function that turns the neuron's weighted sum into its return value. The following are all activation functions:
  • step function
  • sigmoid function
  • softmax function
The difference between an artificial neuron that performs logistic regression and the perceptron above is the activation function.
    Artificial Neuron: f can be any activation function
    $$f(\sum_i^{n} w_{i}x_{i} + b)$$
    Perceptron (one kind of artificial neuron): f is the step function
    $$f(\sum_i^{n} w_{i}x_{i} + b)$$

    Single-Layer Perceptron



    A single-layer perceptron has only two stages: input and output. Each stage is called a 'layer'.

    Single-layer perceptron for the AND gate. (* Many other values of w1, w2, and b also work.)
    def AND_gate(x1, x2):
        w1 = 0.5
        w2 = 0.5
        b = -0.7
        result = x1*w1 + x2*w2 + b
        if result <= 0:
            return 0
        else:
            return 1

    Single-layer perceptron for the NAND gate.
    def NAND_gate(x1, x2):
        w1 = -0.5
        w2 = -0.5
        b = 0.7
        result = x1*w1 + x2*w2 + b
        if result <= 0:
            return 0
        else:
            return 1

    Single-layer perceptron for the OR gate.
    def OR_gate(x1, x2):
        w1 = 0.6
        w2 = 0.6
        b = -0.5
        result = x1*w1 + x2*w2 + b
        if result <= 0:
            return 0
        else:
            return 1

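    However, a single-layer perceptron can only classify data that is linearly separable, i.e., data whose classes can be divided by a single straight line.
    The outputs of the XOR gate cannot be divided this way, so it is impossible to implement an XOR gate with a single-layer perceptron.
    Refer to the graphs below.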




    MultiLayer Perceptron (MLP)


    We now add one or more layers between the input layer and the output layer.
    Such a layer is called a 'hidden layer'.
    Below is how to implement an XOR gate with an MLP.
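A minimal sketch of one common construction, assuming the hidden layer consists of a NAND perceptron and an OR perceptron and the output layer is an AND perceptron. The helper name `perceptron` is hypothetical, and the weights are the ones used in the gate functions above.

```python
def perceptron(x1, x2, w1, w2, b):
    """Single perceptron with a step activation (same rule as the gates above)."""
    return 1 if x1 * w1 + x2 * w2 + b > 0 else 0


def xor_gate(x1, x2):
    # Hidden layer: two perceptrons that both see the raw inputs.
    h1 = perceptron(x1, x2, -0.5, -0.5, 0.7)   # NAND weights
    h2 = perceptron(x1, x2, 0.6, 0.6, -0.5)    # OR weights
    # Output layer: one perceptron that sees only the hidden outputs.
    return perceptron(h1, h2, 0.5, 0.5, -0.7)  # AND weights


for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_gate(a, b))
```

The output is 1 only when exactly one input is 1, which no single perceptron above can produce on its own.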

    Below is an MLP with two hidden layers.

    A neural network with two or more hidden layers is called a 'Deep Neural Network (DNN)'. The term is not limited to multilayer perceptrons; any neural network with two or more hidden layers is a DNN.
    In machine learning, we do not set the weights and biases manually.
    We automate the process so that the machine finds the optimal values by itself, and this is called 'training'.
    We use a loss function and an optimizer for it.
    If the neural network being trained is a DNN, we call it 'deep learning'.

    8-2) Artificial Neural Network


    Feed-Forward Neural Network (FFNN)



    Above is a feed-forward neural network.

    Above is a recurrent neural network (RNN).
    In an RNN, the hidden layer's output is sent to the output layer and is also fed back in as input to the hidden layer.

    Fully-Connected Layer (FC)


    It is also called a dense layer.
    An FC layer is a layer in which every neuron is connected to every neuron of the previous layer.
    A feed-forward neural network that consists only of FC layers is called a 'fully-connected FFNN'.

    Activation Function



    Nonlinear Function


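    Stacking multiple hidden layers that use a linear activation function is meaningless, because a composition of linear functions is itself a linear function. The result is equivalent to adding a single layer.
    Therefore, we usually use a nonlinear activation function in hidden layers.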
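To check this claim numerically, here is a small sketch in plain Python. The weights, biases, and helper names (`matvec`, `matmul`, `linear`) are made up for illustration. Two stacked linear layers give exactly the same output as one layer with W = W2·W1 and b = W2·b1 + b2.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrix A by matrix B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def linear(W, b, x):
    """A layer with no nonlinearity: W x + b."""
    return [s + bi for s, bi in zip(matvec(W, x), b)]


# Hypothetical parameters of two stacked linear layers.
W1, b1 = [[1.0, 2.0], [0.0, 1.0]], [1.0, -1.0]
W2, b2 = [[2.0, 0.0], [1.0, 3.0]], [0.0, 2.0]
x = [3.0, 4.0]

two_layers = linear(W2, b2, linear(W1, b1, x))

# The same mapping written as a single linear layer.
W = matmul(W2, W1)
b = linear(W2, b2, b1)  # W2 b1 + b2
one_layer = linear(W, b, x)

print(two_layers, one_layer)
```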

    Step Function


    It is rarely used nowadays.

    Sigmoid Function



    An artificial neural network performs 'forward propagation' on the given input, computes the loss value, calculates the gradient of the loss by differentiation, and then performs 'back propagation' to update the weights.

    Vanishing Gradient



    In the orange part of the sigmoid function above, the derivative is very close to 0.
    During back propagation, these very small gradients are multiplied together layer by layer, so the gradient is barely propagated to the layers near the input.
    As a result, w is not updated there, and learning does not proceed.

    Therefore, the sigmoid function should be avoided in hidden layers.
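A rough numeric sketch of why this happens. The derivative of the sigmoid is s(x)(1 - s(x)), which is at most 0.25. The 10-layer depth is an assumption for illustration, and the weight factors in the chain rule are ignored.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


max_grad = sigmoid_grad(0.0)       # the largest possible value, at x = 0
saturated = sigmoid_grad(10.0)     # far from 0 the derivative is almost 0

# Best case for a stack of 10 sigmoid layers: every factor equals max_grad.
upper_bound = max_grad ** 10

print(max_grad, saturated, upper_bound)
```

Even in the best case the product is below one millionth, so the layers near the input receive almost no gradient.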

    Hyperbolic Tangent Function



    ReLU Function



    Leaky ReLU Function



    Softmax Function
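The four functions named in the headings above (hyperbolic tangent, ReLU, Leaky ReLU, softmax) can be sketched with their standard definitions as follows. The slope `alpha=0.01` for Leaky ReLU is a commonly used value assumed here, not one given in these notes.

```python
import math


def tanh(x):
    """Hyperbolic tangent: output in (-1, 1), centered at 0."""
    return math.tanh(x)


def relu(x):
    """ReLU: 0 for negative input, identity for positive input."""
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: a small slope alpha instead of 0 for negative input."""
    return x if x > 0 else alpha * x


def softmax(zs):
    """Softmax: turns a list of scores into probabilities that sum to 1."""
    m = max(zs)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


print(tanh(0.5), relu(-2.0), leaky_relu(-2.0), softmax([1.0, 2.0, 3.0]))
```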



    8-3) Matrix Multiplication


    The computation of a layer, and therefore its number of parameters (weights w + biases b), can be expressed with matrix multiplication (also known as the matrix product), as below.
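As a sketch, assume a hypothetical layer with 3 inputs and 2 outputs. Its forward pass is y = xW + b, where W has shape 3 x 2 and b has 2 entries, so the layer has 3 * 2 + 2 = 8 parameters. The function names and values are illustrative only.

```python
def dense_forward(x, W, b):
    """y = x W + b, where W has len(x) rows and len(b) columns."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]


def count_parameters(n_inputs, n_outputs):
    """One weight per input-output pair plus one bias per output."""
    return n_inputs * n_outputs + n_outputs


x = [1.0, 2.0, 3.0]                       # 3 inputs
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 x 2 weights
b = [0.5, -0.5]                           # 2 biases

y = dense_forward(x, W, b)
print(y, count_parameters(3, 2))
```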


    8-4) Learning Method


    Loss Function



    We usually use MSE for regression and cross-entropy for classification.
    The purpose of training is to find the values of w and b that minimize the loss function. Therefore, the choice of loss function is very important.

    Mean Squared Error (MSE)


    The average of the squared errors.
    We use it when predicting a continuous value.
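A minimal sketch of the definition above, with made-up target and prediction values:

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Errors are 1, 0, and -2, so the loss is (1 + 0 + 4) / 3.
loss = mse([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
print(loss)
```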

    Cross-Entropy


    The loss value gets bigger when the prediction is correct but made with low probability, or wrong but made with high probability.
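The behavior described above can be seen in a small sketch, assuming a one-hot target and predicted probabilities (e.g. from softmax). The probability values are made up, and `eps` is a hypothetical guard against log(0).

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


y = [0, 1, 0]  # the true class is index 1

confident_right = cross_entropy(y, [0.05, 0.90, 0.05])
unsure_right = cross_entropy(y, [0.30, 0.40, 0.30])
confident_wrong = cross_entropy(y, [0.90, 0.05, 0.05])

print(confident_right, unsure_right, confident_wrong)
```

The loss grows from the first case to the last, matching the description above.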

    Optimizer



    Batch: the amount of data used for one parameter update. (It can be the entire dataset or a specific amount.)
    Epoch: the number of complete passes over the training data.

    Batch Gradient Descent


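    It uses the entire training data for each update, so the parameters are updated once per epoch.
    It takes a lot of memory and time per update, but the updates are stable. For a convex loss function, it can find the global minimum.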

    Stochastic Gradient Descent (SGD)



    It computes each update from one randomly selected data sample instead of from all of the data.
    Each update uses less data and therefore takes less time, but the updates are noisier and less accurate.

    Mini-Batch Gradient Descent


    It computes each update from a predetermined number of data samples instead of from all of the data or from a single sample.
    It is faster than BGD and more accurate (stable) than SGD.
    It is a widely used gradient descent method.
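The three variants above differ only in the batch size, which the following sketch tries to show. It fits y = wx + b with an MSE loss on a tiny noise-free toy dataset. The function name `train`, the learning rate, and the epoch count are assumptions for illustration.

```python
import random


def train(xs, ys, batch_size, epochs=500, lr=0.05, seed=0):
    """Gradient descent on y = w*x + b with MSE loss.

    batch_size == len(xs) gives batch GD, batch_size == 1 gives SGD,
    anything in between gives mini-batch GD.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the data in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the MSE over the current batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [2 * x + 1 for x in xs]  # generated from w = 2, b = 1

w_bgd, b_bgd = train(xs, ys, batch_size=len(xs))  # 1 update per epoch
w_sgd, b_sgd = train(xs, ys, batch_size=1)        # 4 updates per epoch
w_mini, b_mini = train(xs, ys, batch_size=2)      # 2 updates per epoch
print(w_bgd, b_bgd, w_sgd, b_sgd, w_mini, b_mini)
```

On this noise-free toy data all three settings recover w and b; the trade-offs described above show up on larger, noisier datasets.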

    Momentum



    It adds inertia from previous updates to the current update, which helps the optimizer pass through a local minimum instead of mistaking it for the global minimum.
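A minimal sketch of the momentum update rule on a one-dimensional toy loss f(w) = (w - 3)^2. The hyperparameters (`lr=0.1`, `momentum=0.9`, 300 steps) and the function name are assumptions for illustration. This toy loss has a single minimum, so it only shows how the velocity term is accumulated, not an escape from a local minimum.

```python
def sgd_momentum(grad_fn, w, lr=0.1, momentum=0.9, steps=300):
    """Gradient descent where each step keeps part of the previous step."""
    velocity = 0.0
    for _ in range(steps):
        # Keep a fraction of the previous velocity, then add the new gradient step.
        velocity = momentum * velocity - lr * grad_fn(w)
        w = w + velocity
    return w


# Gradient of f(w) = (w - 3)^2 is 2 * (w - 3); the minimum is at w = 3.
w_final = sgd_momentum(lambda w: 2 * (w - 3), w=0.0)
print(w_final)
```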

    Adagrad


    A variable that has changed a lot gets a lower learning rate.
    A variable that has changed little gets a higher learning rate.

    RMSprop


    With Adagrad, the learning rate may decrease too much as training continues.
    RMSprop mitigates this by replacing Adagrad's accumulated sum of squared gradients with an exponentially decaying average.
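The difference can be sketched by tracking only the effective step size, assuming a constant gradient of 1.0 for 1000 steps. The hyperparameters `lr`, `decay`, and `eps` are commonly used values assumed here.

```python
import math


def adagrad_step_sizes(grads, lr=0.1, eps=1e-8):
    """Adagrad: divide lr by the root of the SUM of all squared gradients."""
    acc, sizes = 0.0, []
    for g in grads:
        acc += g * g
        sizes.append(lr / (math.sqrt(acc) + eps))
    return sizes


def rmsprop_step_sizes(grads, lr=0.1, decay=0.9, eps=1e-8):
    """RMSprop: divide lr by the root of a decaying AVERAGE of squared gradients."""
    avg, sizes = 0.0, []
    for g in grads:
        avg = decay * avg + (1 - decay) * g * g
        sizes.append(lr / (math.sqrt(avg) + eps))
    return sizes


grads = [1.0] * 1000
ada = adagrad_step_sizes(grads)
rms = rmsprop_step_sizes(grads)
print(ada[0], ada[-1], rms[-1])
```

Adagrad's step size keeps shrinking as the sum grows, while RMSprop's settles near `lr`.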

    Adam


    RMSprop + Momentum.
    It adapts both the update direction (from Momentum) and the learning rate (from RMSprop).
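A minimal sketch of the Adam update on the toy loss f(w) = (w - 3)^2. The defaults `beta1=0.9`, `beta2=0.999`, and `eps=1e-8` are the commonly used values; `lr=0.1` and the step count are assumptions for this toy problem.

```python
import math


def adam(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g        # Momentum part: average of gradients
        v = beta2 * v + (1 - beta2) * g * g    # RMSprop part: average of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Gradient of f(w) = (w - 3)^2 is 2 * (w - 3); the minimum is at w = 3.
w_final = adam(lambda w: 2 * (w - 3), w=0.0)
print(w_final)
```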

    8-3) BackPropagation


    Refer to 3-4. (On this velog, refer to Wk.4.)

    8-4) Epochs and Batch size and Iteration



    Epoch


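    One epoch is complete when forward and back propagation have been performed once over the entire training data.
    If the number of epochs is too large, overfitting can occur.
    If the number of epochs is too small, underfitting can occur.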

    Batch Size


    The number of data samples used for one parameter update.

    Iteration


    The number of batches (parameter updates) needed to complete one epoch.
    SGD's batch size is 1, so it uses one data sample for gradient descent in every iteration.
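The relation between the three terms is simple arithmetic, sketched below with a hypothetical dataset of 2,000 samples:

```python
import math


def iterations_per_epoch(n_samples, batch_size):
    """Number of parameter updates needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


print(iterations_per_epoch(2000, 200))  # mini-batch of 200
print(iterations_per_epoch(2000, 1))    # SGD: one sample per iteration
print(iterations_per_epoch(2000, 2000)) # batch GD: one update per epoch
```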

    8-5) How to Prevent Overfitting


    Introduction to Overfitting


    Overfitting means that the model has learned even the noise in the training data.
    Therefore, it may work well on the training data but poorly on test or new data.

    Increasing Data


    If we provide more data, the model can learn the general pattern instead of the noise or patterns specific to a few samples.
    If there is too little data, we can slightly modify the existing data and add the modified copies. This is called 'Data Augmentation'.
    It is widely and actively used in the image processing field.

    Simplifying Model


    The complexity of a model is determined by the number of hidden layers and parameters.
    We can prevent overfitting by reducing them.
    The number of parameters of a model is called the model's 'capacity'.

    Applying Weight Regularization


    L1 regularization: adds the sum of the absolute values of the weights w to the cost function. It is also known as 'L1 norm' regularization.
    L2 regularization: adds the sum of the squares of the weights w to the cost function. It is also known as 'L2 norm' regularization.
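A minimal sketch of the two penalty terms. The strength `lam`, the weights, and the base loss are made-up values. Some formulations scale the L2 term by 1/2, which is not done here.

```python
def l1_penalty(weights, lam):
    """lam * sum of |w|."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """lam * sum of w^2."""
    return lam * sum(w * w for w in weights)


weights = [0.5, -2.0, 0.0]
base_loss = 1.0  # hypothetical value of the unregularized cost

loss_l1 = base_loss + l1_penalty(weights, lam=0.1)  # 1.0 + 0.1 * 2.5
loss_l2 = base_loss + l2_penalty(weights, lam=0.1)  # 1.0 + 0.1 * 4.25
print(loss_l1, loss_l2)
```

Because large weights increase the cost, the optimizer is pushed toward smaller weights and a simpler model.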

    Dropout


    During training, we randomly leave out a certain proportion of the neurons at each training step.
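A minimal sketch of dropout applied to a list of activations. It assumes the 'inverted dropout' variant, in which the kept activations are scaled by 1 / (1 - rate) so that nothing needs to change at test time. This scaling detail and the `rng` parameter are assumptions, not something stated in these notes.

```python
import random


def dropout(activations, rate, training=True, rng=random):
    """Zero each activation with probability `rate` during training."""
    if not training or rate == 0.0:
        return list(activations)  # at test time every neuron is used
    keep = 1.0 - rate
    # Kept values are scaled up so the expected sum stays the same.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


out = dropout([1.0] * 1000, rate=0.5, rng=random.Random(0))
print(out.count(0.0), "of 1000 neurons were dropped in this step")
```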

    8-6) Gradient Vanishing and Exploding


    Introduction to Gradient Vanishing and Exploding


    During back propagation, the gradient can get gradually smaller toward the input layer, so the weights of the layers close to the input layer are not updated well and the optimal model cannot be found. This is called 'gradient vanishing'.
    The opposite case also exists: the gradient can get gradually bigger until the weights become excessively large. This is called 'gradient exploding', and it can happen in RNNs.

    Using ReLU and its Variations


    Use ReLU or one of its variants, such as Leaky ReLU, instead of the sigmoid or hyperbolic tangent function in hidden layers.

    Gradient Clipping


    We can clip the gradient so that it does not exceed a threshold, which prevents gradient exploding.
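A minimal sketch of one common variant, clipping by the L2 norm: if the norm of the gradient vector exceeds the threshold, the whole vector is rescaled to have exactly that norm. The function name and the threshold value are illustrative assumptions. Clipping each element to a fixed range is another variant.

```python
import math


def clip_by_norm(grads, threshold):
    """Rescale grads so that their L2 norm is at most `threshold`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= threshold:
        return list(grads)  # small gradients are left untouched
    scale = threshold / norm
    return [g * scale for g in grads]


clipped = clip_by_norm([3.0, 4.0], threshold=1.0)    # norm 5 is scaled down to 1
unchanged = clip_by_norm([0.3, 0.4], threshold=1.0)  # norm 0.5 is already small enough
print(clipped, unchanged)
```

The direction of the gradient is preserved; only its length is limited.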

    Weight Initialization


    The training result of a model may differ depending on the initial values of its weights.
    Therefore, initializing the weights appropriately can mitigate gradient vanishing and exploding.

    Xavier Initialization


    It is also known as 'Glorot Initialization'.
    http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf

    He Initialization


    https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/He_Delving_Deep_into_ICCV_2015_paper.pdf
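As a sketch of the two schemes, assume the uniform form of Xavier initialization, with limit sqrt(6 / (n_in + n_out)), and the normal form of He initialization, with standard deviation sqrt(2 / n_in). Both papers also allow other forms. The function names and the `rng` parameter are hypothetical.

```python
import math
import random


def xavier_uniform(n_in, n_out, rng=random):
    """Uniform weights in [-limit, limit], limit = sqrt(6 / (n_in + n_out))."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_out)]
            for _ in range(n_in)]


def he_normal(n_in, n_out, rng=random):
    """Normal weights with mean 0 and std = sqrt(2 / n_in)."""
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_out)]
            for _ in range(n_in)]


rng = random.Random(0)
W_xavier = xavier_uniform(4, 2, rng)  # limit = sqrt(6 / 6) = 1
W_he = he_normal(4, 2, rng)           # std = sqrt(2 / 4)
print(W_xavier)
print(W_he)
```

Xavier initialization is usually paired with sigmoid or tanh, and He initialization with ReLU-type activations.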

    Batch Normalization


    It normalizes the input of each layer using the mean and variance of the mini-batch.
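A minimal sketch for a single feature over one mini-batch: subtract the batch mean, divide by the batch standard deviation, then apply a scale `gamma` and a shift `beta`. In a real network `gamma` and `beta` are learned; the fixed values and `eps` here are assumptions for illustration.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across the mini-batch, then scale and shift."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


# Values of one feature for a mini-batch of 4 samples.
normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
print(normalized)
```

After normalization the feature has a mean of about 0 and a variance of about 1 within the mini-batch.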

    Internal Covariate Shift


    Batch Normalization


    Limitations

  • It is too dependent on the size of the mini-batch.
  • It is difficult to apply to RNNs.
    Layer Normalization


    (Figures: Batch Normalization, Layer Normalization)