Back propagation
33252 ワード
Prologue
多くの教材や講義でbackpropを整理する際にnodeの観点から見るのはもったいない.ではscratchでニューラルネットワークを実現するとback propで詰まります.動的なニューラルネットワークを構築したいです.だからもっと深く掘った.
Basic archetecture
人工ニューラルネットワークは入力値1を受け入れる.Weight(WWW)に乗算し、biasを省略します.2.アクティブ化関数(σ\sigmaσ, 結果値を信号関数でエクスポートします.
f=x⋅Wf=x\cdot Wf=x⋅W
z=σ(f)z=\sigma(f)z=σ(f)
この2つの関数は1つの層を構成し,前の層の結果値を次の層の入力値とし,多層重畳によりニューラルネットワークを形成する.私たちがしなければならないのは、結果値をできるだけ実際の値と等しくする重みを見つけることです.重み付け値をあちこち押して探すことができますが、このように最適な値を探すと、耳を覆って目を閉じて友達を探すようになります.これよりも科学的に近い方法はback propagationとgradient descentである.このときの核心は連鎖法則である.
All we need is Chain rule
上記の例では,zzz関数のWWWの変化量を理解することを望んでいるが,直接相関していない.賢い人がその時に使うために作ったチェーンルール.
∂z∂W=∂z∂f∂f∂W{\partial z\over\partial W}={\partial z\over\partial f}{\partial f\over\partial W}∂W∂z=∂f∂z∂W∂f
この点をよく覚えておくと、ついて行くのが簡単だと思います.
Calculating gradient
3層ネットワークにおける重み値の更新に必要な変化量を考慮する前に,以下に示すように,ニューラルネットワークを視覚的に解いた.
これにより,複雑な合成関数においても,所望の変化量をうまく捉えることができる.前述したように、各ウェイトが変化すると、費用関数がどのように変化するかを見出します.この角度から見ると、必要なのは∂C∂W 1 ∂C∂W2, ∂C∂W3{\partial C\over\partial W_1}\\{\partial C\over\partial W_2},\\{\partial C\over\partial W_3}∂W1∂C ∂W2∂C, ∂W 3∂Cのような3種類損失関数の各重み付け値の変化量をグラフに沿ってまとめると以下のようになる.
∂C∂W3=∂C∂z3∂z3∂f3∂f3∂W3=∂Cost⋅∂σ(f3)⋅z2{\partial C\over\partial W_3}={\partial C\over\partial z_3}{\partial z_3\over\partial f_3}{\partial f_3\over\partial W_3}=\partial\text{Cost}\cdot\partial\sigma(f3)\cdot z_2∂W3∂C=∂z3∂C∂f3∂z3∂W3∂f3=∂Cost⋅∂σ(f3)⋅z2
∂C∂W2=∂C∂z3∂z3∂f3∂f3∂z2∂z2∂f2∂f2∂W2=∂Cost⋅∂σ(f3)⋅W3⋅∂σ(f2)⋅z1{\partial C\over\partial W_2}={\partial C\over\partial z_3}{\partial z_3\over\partial f_3}{\partial f_3\over\partial z_2}{\partial z_2\over\partial f_2}{\partial f_2\over\partial W2}=\partial\text{Cost}\cdot\partial\sigma(f3)\cdot W_3\cdot\partial\sigma(f2)\cdot z_1∂W2∂C=∂z3∂C∂f3∂z3∂z2∂f3∂f2∂z2∂W2∂f2=∂Cost⋅∂σ(f3)⋅W3⋅∂σ(f2)⋅z1
∂C∂W1=∂C∂z3∂z3∂f3∂f3∂z2∂z2∂f2∂f2∂z1∂z1∂f1∂f1∂W1=∂Cost⋅∂σ(f3)⋅W3⋅∂σ(f2)⋅W2⋅∂σ(f1)⋅Input{\partial C\over\partial W_1} = {\partial C\over\partial z_3}{\partial z_3\over\partial f_3}{\partial f_3\over\partial z_2}{\partial z_2\over\partial f_2}{\partial f_2\over\partial z_1}{\partial z_1\over\partial f_1}{\partial f_1\over\partial W1}=\partial\text{Cost}\cdot\partial\sigma(f3)\cdot W_3\cdot\partial\sigma(f2)\cdot W_2\cdot\partial\sigma(f1)\cdot\text{Input}∂W1∂C=∂z3∂C∂f3∂z3∂z2∂f3∂f2∂z2∂z1∂f2∂f1∂z1∂W1∂f1=∂Cost⋅∂σ(f3)⋅W3⋅∂σ(f2)⋅W2⋅∂σ(f1)⋅Input
もちろん,この順序で行列を乗じると,演算が悪いという問題に直面する.従って,各層に局所勾配を求める場合,このモードで計算した.
∂C=C′(zn)⊙σ′(fn)\partial C = C'(z_n)\odot\sigma'(f_n)∂C=C′(zn)⊙σ′(fn)
δn=wn+1T⋅δn+1⊙σ′(fn)\delta_n = w_{n+1}^T\cdot\delta_{n+1}\odot\sigma'(f_n)δn=wn+1T⋅δn+1⊙σ′(fn)
ここで、 ⊙\cdot,\\\odot⋅, ∮この2つの演算はいずれも行列積であり,それぞれ内(点)積,Hadamard積と呼ぶ.現在,この課題では活性化関数を信号とし,パターンに従って整理するとこうなる.
∂C∂W3=2(f3−Gt)⋅z2T{\partial C\over\partial W_3}=2(f_3-Gt)\cdot z_2^T∂W3∂C=2(f3−Gt)⋅z2T
∂C∂W2=W3T⋅2(f3−Gt)⊙σ(f2)(1−σ(f2))⋅z1T{\partial C\over\partial W_2}= {W_3}^T\cdot2(f_3-Gt)\odot\sigma(f2)(1-\sigma(f2))\cdot z_1^T∂W2∂C=W3T⋅2(f3−Gt)⊙σ(f2)(1−σ(f2))⋅z1T
∂C∂W1=W2T⋅{W3T⋅2(f3−Gt)⊙σ(f2)(1−σ(f2)}⊙σ(f1)(1−σ(f1))⋅xT{\partial C\over\partial W_1}= {W_2}^T\cdot\{{W_3}^T\cdot2(f_3-Gt)\odot\sigma(f2)(1-\sigma(f2)\}\odot\sigma(f1)(1-\sigma(f1))\cdot x^T∂W1∂C=W2T⋅{W3T⋅2(f3−Gt)⊙σ(f2)(1−σ(f2)}⊙σ(f1)(1−σ(f1))⋅xT
これで、各レイヤのグラデーションを計算するルールが表示されます.
Gradient=local gradient×output of prior layer\text{Gradient}=\text{local gradient}\times\text{output of prior layer}Gradient=local gradient×output of prior layer
Put these all together import matplotlib.pyplot as plt
import numpy as np
def layer(X: np.array, units: int, activation: str = None):
"""
X: input matrices of each layer and it decides weight matrix shape.
units: decides number of nodes in a hidden layer
activation: decides weight initialization. He, Xaiver and LeCun initialization are available.
"""
h, w = X.shape
if activation == 'relu':
weight = np.random.randn(w, units) * np.sqrt(2 / w) # He initialization
elif activation == 'tanh' or activation == 'softmax':
weight = np.random.randn(w, units) * np.sqrt(1 / w) # Xavier initialization
elif activation is None:
weight = np.random.randn(w, units) * 0.01 # LeCun initialization
return weight
def relu(X, props: str = 'fwd'):
"""
X: input matrices.
props: decides propagate direction e.g. fwd means forward pass and bwd means backward pass.
"""
if props == 'fwd':
return np.maximum(0, X)
elif props == 'bwd':
return np.greater(0, X).astype(int)
def softmax(X):
"""
X: input matrices.
I didn't create derivitives of softmax because when back propagation it's much efficient
to calculate it with cross entropy
"""
score = np.exp(X)
return score / np.sum(score, axis = 1, keepdims = True)
def cross_entropy(X, Y, props: str = 'fwd'):
"""
X: input matrices
Y: label vector
props: decides propagate direction e.g. fwd means forward pass and bwd means backward pass.
"""
n = len(X)
if props == 'fwd':
likelyhood = -np.log(X[range(n), Y])
return np.sum(likelyhood) / n
elif props == 'bwd':
X[range(n), Y] -= 1
return X / n
def sse(X, Y, props: str = 'fwd'):
"""
X: input matrices
Y: label vector
props: decides propagate direction e.g. fwd means forward pass and bwd means backward pass.
"""
if props == 'fwd':
return 0.5 * np.square(X - Y)
elif props == 'bwd':
return X - Y
def forward(X, w, activation: str = None):
"""
X: Input matrices that flows into the layer.
w: weight matrices
activation: Decides which activation function to use. Relu and softmax are available for now.
"""
d = X.dot(w)
if activation == 'relu':
a = relu(d)
elif activation == 'softmax':
a = softmax(d)
elif activation == None:
a = d
cache = a, d
return a, cache
def backward(upstream, cache, w, activation: str = None):
"""
upstream: gradient value from adjacent layer
cache: contains intermidiate step values within layers
activation: Decides which activation function to use. Relu and softmax are available for now.
"""
c1, c2 = cache
local = upstream.dot(w.T)
if activation == 'relu':
upstream = local * relu(c2[1], 'bwd')
elif activation == 'tanh':
upstream = local * tanh(c2[1], 'bwd')
elif activation is None:
upstream = local
grad = c1[0].T.dot(upstream)
return grad, upstream
def momentum(weight, grad, cache: dict, lr=1e-3, mu=0.9):
if len(cache) == 0:
for layer in range(len(grad)):
v = -lr * grad[layer]
cache.append(v)
weight[layer] = weight[layer] + v
else:
for layer in range(len(grad)):
v = mu * cache[layer] - lr * grad[layer]
cache[layer] = v
weight[layer] = weight[layer] + v
def fit(x, y, epochs: int, lr: float=1e-2, mu: float=0.9):
cache = []
vel = []
loss = []
grad = [np.zeros_like(layer) for layer in params]
for epoch in range(epochs):
# forward pass
X, c = forward(x, params[0])
cache.append(c)
for layer in range(1, len(params)-1):
X, c = forward(X, params[layer], 'relu')
cache.append(c)
X, c = forward(X, params[-1], 'softmax')
cache.append(c)
# loss calculation
cost = cross_entropy(X, y, 'fwd')
loss.append(cost)
# backward pass
upstream = cross_entropy(X, y, 'bwd')
grad[-1] = cache[-2][0].T.dot(upstream)
for layer in range(len(grad)-2, 0, -1):
grad[layer], upstream = backward(upstream, cache[layer-1:layer+1], params[layer+1], 'relu')
local = upstream.dot(params[1].T)
upstream = local * relu(cache[0][1], 'bwd')
grad[0] = x.T.dot(upstream)
# SGD momentum optimizer
momentum(params, grad, cache = vel, lr=1e-2, mu=0.9)
def predict(X):
X, _ = forward(x, params[0])
for layer in range(1, len(params)-1):
X, _ = forward(X, params[layer], 'relu')
X, _ = forward(X, params[-1], 'softmax')
print(X)
Epilogue
人工ニューラルネットワークは入力値1を受け入れる.Weight(WWW)に乗算し、biasを省略します.2.アクティブ化関数(σ\sigmaσ, 結果値を信号関数でエクスポートします.
f=x⋅Wf=x\cdot Wf=x⋅W
z=σ(f)z=\sigma(f)z=σ(f)
この2つの関数は1つの層を構成し,前の層の結果値を次の層の入力値とし,多層重畳によりニューラルネットワークを形成する.私たちがしなければならないのは、結果値をできるだけ実際の値と等しくする重みを見つけることです.重み付け値をあちこち押して探すことができますが、このように最適な値を探すと、耳を覆って目を閉じて友達を探すようになります.これよりも科学的に近い方法はback propagationとgradient descentである.このときの核心は連鎖法則である.
All we need is Chain rule
上記の例では,zzz関数のWWWの変化量を理解することを望んでいるが,直接相関していない.賢い人がその時に使うために作ったチェーンルール.
∂z∂W=∂z∂f∂f∂W{\partial z\over\partial W}={\partial z\over\partial f}{\partial f\over\partial W}∂W∂z=∂f∂z∂W∂f
この点をよく覚えておくと、ついて行くのが簡単だと思います.
Calculating gradient
3層ネットワークにおける重み値の更新に必要な変化量を考慮する前に,以下に示すように,ニューラルネットワークを視覚的に解いた.
これにより,複雑な合成関数においても,所望の変化量をうまく捉えることができる.前述したように、各ウェイトが変化すると、費用関数がどのように変化するかを見出します.この角度から見ると、必要なのは∂C∂W 1 ∂C∂W2, ∂C∂W3{\partial C\over\partial W_1}\\{\partial C\over\partial W_2},\\{\partial C\over\partial W_3}∂W1∂C ∂W2∂C, ∂W 3∂Cのような3種類損失関数の各重み付け値の変化量をグラフに沿ってまとめると以下のようになる.
∂C∂W3=∂C∂z3∂z3∂f3∂f3∂W3=∂Cost⋅∂σ(f3)⋅z2{\partial C\over\partial W_3}={\partial C\over\partial z_3}{\partial z_3\over\partial f_3}{\partial f_3\over\partial W_3}=\partial\text{Cost}\cdot\partial\sigma(f3)\cdot z_2∂W3∂C=∂z3∂C∂f3∂z3∂W3∂f3=∂Cost⋅∂σ(f3)⋅z2
∂C∂W2=∂C∂z3∂z3∂f3∂f3∂z2∂z2∂f2∂f2∂W2=∂Cost⋅∂σ(f3)⋅W3⋅∂σ(f2)⋅z1{\partial C\over\partial W_2}={\partial C\over\partial z_3}{\partial z_3\over\partial f_3}{\partial f_3\over\partial z_2}{\partial z_2\over\partial f_2}{\partial f_2\over\partial W2}=\partial\text{Cost}\cdot\partial\sigma(f3)\cdot W_3\cdot\partial\sigma(f2)\cdot z_1∂W2∂C=∂z3∂C∂f3∂z3∂z2∂f3∂f2∂z2∂W2∂f2=∂Cost⋅∂σ(f3)⋅W3⋅∂σ(f2)⋅z1
∂C∂W1=∂C∂z3∂z3∂f3∂f3∂z2∂z2∂f2∂f2∂z1∂z1∂f1∂f1∂W1=∂Cost⋅∂σ(f3)⋅W3⋅∂σ(f2)⋅W2⋅∂σ(f1)⋅Input{\partial C\over\partial W_1} = {\partial C\over\partial z_3}{\partial z_3\over\partial f_3}{\partial f_3\over\partial z_2}{\partial z_2\over\partial f_2}{\partial f_2\over\partial z_1}{\partial z_1\over\partial f_1}{\partial f_1\over\partial W1}=\partial\text{Cost}\cdot\partial\sigma(f3)\cdot W_3\cdot\partial\sigma(f2)\cdot W_2\cdot\partial\sigma(f1)\cdot\text{Input}∂W1∂C=∂z3∂C∂f3∂z3∂z2∂f3∂f2∂z2∂z1∂f2∂f1∂z1∂W1∂f1=∂Cost⋅∂σ(f3)⋅W3⋅∂σ(f2)⋅W2⋅∂σ(f1)⋅Input
もちろん,この順序で行列を乗じると,演算が悪いという問題に直面する.従って,各層に局所勾配を求める場合,このモードで計算した.
∂C=C′(zn)⊙σ′(fn)\partial C = C'(z_n)\odot\sigma'(f_n)∂C=C′(zn)⊙σ′(fn)
δn=wn+1T⋅δn+1⊙σ′(fn)\delta_n = w_{n+1}^T\cdot\delta_{n+1}\odot\sigma'(f_n)δn=wn+1T⋅δn+1⊙σ′(fn)
ここで、 ⊙\cdot,\\\odot⋅, ∮この2つの演算はいずれも行列積であり,それぞれ内(点)積,Hadamard積と呼ぶ.現在,この課題では活性化関数を信号とし,パターンに従って整理するとこうなる.
∂C∂W3=2(f3−Gt)⋅z2T{\partial C\over\partial W_3}=2(f_3-Gt)\cdot z_2^T∂W3∂C=2(f3−Gt)⋅z2T
∂C∂W2=W3T⋅2(f3−Gt)⊙σ(f2)(1−σ(f2))⋅z1T{\partial C\over\partial W_2}= {W_3}^T\cdot2(f_3-Gt)\odot\sigma(f2)(1-\sigma(f2))\cdot z_1^T∂W2∂C=W3T⋅2(f3−Gt)⊙σ(f2)(1−σ(f2))⋅z1T
∂C∂W1=W2T⋅{W3T⋅2(f3−Gt)⊙σ(f2)(1−σ(f2)}⊙σ(f1)(1−σ(f1))⋅xT{\partial C\over\partial W_1}= {W_2}^T\cdot\{{W_3}^T\cdot2(f_3-Gt)\odot\sigma(f2)(1-\sigma(f2)\}\odot\sigma(f1)(1-\sigma(f1))\cdot x^T∂W1∂C=W2T⋅{W3T⋅2(f3−Gt)⊙σ(f2)(1−σ(f2)}⊙σ(f1)(1−σ(f1))⋅xT
これで、各レイヤのグラデーションを計算するルールが表示されます.
Gradient=local gradient×output of prior layer\text{Gradient}=\text{local gradient}\times\text{output of prior layer}Gradient=local gradient×output of prior layer
Put these all together import matplotlib.pyplot as plt
import numpy as np
def layer(X: np.array, units: int, activation: str = None):
"""
X: input matrices of each layer and it decides weight matrix shape.
units: decides number of nodes in a hidden layer
activation: decides weight initialization. He, Xaiver and LeCun initialization are available.
"""
h, w = X.shape
if activation == 'relu':
weight = np.random.randn(w, units) * np.sqrt(2 / w) # He initialization
elif activation == 'tanh' or activation == 'softmax':
weight = np.random.randn(w, units) * np.sqrt(1 / w) # Xavier initialization
elif activation is None:
weight = np.random.randn(w, units) * 0.01 # LeCun initialization
return weight
def relu(X, props: str = 'fwd'):
"""
X: input matrices.
props: decides propagate direction e.g. fwd means forward pass and bwd means backward pass.
"""
if props == 'fwd':
return np.maximum(0, X)
elif props == 'bwd':
return np.greater(0, X).astype(int)
def softmax(X):
"""
X: input matrices.
I didn't create derivitives of softmax because when back propagation it's much efficient
to calculate it with cross entropy
"""
score = np.exp(X)
return score / np.sum(score, axis = 1, keepdims = True)
def cross_entropy(X, Y, props: str = 'fwd'):
"""
X: input matrices
Y: label vector
props: decides propagate direction e.g. fwd means forward pass and bwd means backward pass.
"""
n = len(X)
if props == 'fwd':
likelyhood = -np.log(X[range(n), Y])
return np.sum(likelyhood) / n
elif props == 'bwd':
X[range(n), Y] -= 1
return X / n
def sse(X, Y, props: str = 'fwd'):
"""
X: input matrices
Y: label vector
props: decides propagate direction e.g. fwd means forward pass and bwd means backward pass.
"""
if props == 'fwd':
return 0.5 * np.square(X - Y)
elif props == 'bwd':
return X - Y
def forward(X, w, activation: str = None):
"""
X: Input matrices that flows into the layer.
w: weight matrices
activation: Decides which activation function to use. Relu and softmax are available for now.
"""
d = X.dot(w)
if activation == 'relu':
a = relu(d)
elif activation == 'softmax':
a = softmax(d)
elif activation == None:
a = d
cache = a, d
return a, cache
def backward(upstream, cache, w, activation: str = None):
"""
upstream: gradient value from adjacent layer
cache: contains intermidiate step values within layers
activation: Decides which activation function to use. Relu and softmax are available for now.
"""
c1, c2 = cache
local = upstream.dot(w.T)
if activation == 'relu':
upstream = local * relu(c2[1], 'bwd')
elif activation == 'tanh':
upstream = local * tanh(c2[1], 'bwd')
elif activation is None:
upstream = local
grad = c1[0].T.dot(upstream)
return grad, upstream
def momentum(weight, grad, cache: dict, lr=1e-3, mu=0.9):
if len(cache) == 0:
for layer in range(len(grad)):
v = -lr * grad[layer]
cache.append(v)
weight[layer] = weight[layer] + v
else:
for layer in range(len(grad)):
v = mu * cache[layer] - lr * grad[layer]
cache[layer] = v
weight[layer] = weight[layer] + v
def fit(x, y, epochs: int, lr: float=1e-2, mu: float=0.9):
cache = []
vel = []
loss = []
grad = [np.zeros_like(layer) for layer in params]
for epoch in range(epochs):
# forward pass
X, c = forward(x, params[0])
cache.append(c)
for layer in range(1, len(params)-1):
X, c = forward(X, params[layer], 'relu')
cache.append(c)
X, c = forward(X, params[-1], 'softmax')
cache.append(c)
# loss calculation
cost = cross_entropy(X, y, 'fwd')
loss.append(cost)
# backward pass
upstream = cross_entropy(X, y, 'bwd')
grad[-1] = cache[-2][0].T.dot(upstream)
for layer in range(len(grad)-2, 0, -1):
grad[layer], upstream = backward(upstream, cache[layer-1:layer+1], params[layer+1], 'relu')
local = upstream.dot(params[1].T)
upstream = local * relu(cache[0][1], 'bwd')
grad[0] = x.T.dot(upstream)
# SGD momentum optimizer
momentum(params, grad, cache = vel, lr=1e-2, mu=0.9)
def predict(X):
X, _ = forward(x, params[0])
for layer in range(1, len(params)-1):
X, _ = forward(X, params[layer], 'relu')
X, _ = forward(X, params[-1], 'softmax')
print(X)
Epilogue
3層ネットワークにおける重み値の更新に必要な変化量を考慮する前に,以下に示すように,ニューラルネットワークを視覚的に解いた.
これにより,複雑な合成関数においても,所望の変化量をうまく捉えることができる.前述したように、各ウェイトが変化すると、費用関数がどのように変化するかを見出します.この角度から見ると、必要なのは∂C∂W 1 ∂C∂W2, ∂C∂W3{\partial C\over\partial W_1}\\{\partial C\over\partial W_2},\\{\partial C\over\partial W_3}∂W1∂C ∂W2∂C, ∂W 3∂Cのような3種類損失関数の各重み付け値の変化量をグラフに沿ってまとめると以下のようになる.
∂C∂W3=∂C∂z3∂z3∂f3∂f3∂W3=∂Cost⋅∂σ(f3)⋅z2{\partial C\over\partial W_3}={\partial C\over\partial z_3}{\partial z_3\over\partial f_3}{\partial f_3\over\partial W_3}=\partial\text{Cost}\cdot\partial\sigma(f3)\cdot z_2∂W3∂C=∂z3∂C∂f3∂z3∂W3∂f3=∂Cost⋅∂σ(f3)⋅z2
∂C∂W2=∂C∂z3∂z3∂f3∂f3∂z2∂z2∂f2∂f2∂W2=∂Cost⋅∂σ(f3)⋅W3⋅∂σ(f2)⋅z1{\partial C\over\partial W_2}={\partial C\over\partial z_3}{\partial z_3\over\partial f_3}{\partial f_3\over\partial z_2}{\partial z_2\over\partial f_2}{\partial f_2\over\partial W2}=\partial\text{Cost}\cdot\partial\sigma(f3)\cdot W_3\cdot\partial\sigma(f2)\cdot z_1∂W2∂C=∂z3∂C∂f3∂z3∂z2∂f3∂f2∂z2∂W2∂f2=∂Cost⋅∂σ(f3)⋅W3⋅∂σ(f2)⋅z1
∂C∂W1=∂C∂z3∂z3∂f3∂f3∂z2∂z2∂f2∂f2∂z1∂z1∂f1∂f1∂W1=∂Cost⋅∂σ(f3)⋅W3⋅∂σ(f2)⋅W2⋅∂σ(f1)⋅Input{\partial C\over\partial W_1} = {\partial C\over\partial z_3}{\partial z_3\over\partial f_3}{\partial f_3\over\partial z_2}{\partial z_2\over\partial f_2}{\partial f_2\over\partial z_1}{\partial z_1\over\partial f_1}{\partial f_1\over\partial W1}=\partial\text{Cost}\cdot\partial\sigma(f3)\cdot W_3\cdot\partial\sigma(f2)\cdot W_2\cdot\partial\sigma(f1)\cdot\text{Input}∂W1∂C=∂z3∂C∂f3∂z3∂z2∂f3∂f2∂z2∂z1∂f2∂f1∂z1∂W1∂f1=∂Cost⋅∂σ(f3)⋅W3⋅∂σ(f2)⋅W2⋅∂σ(f1)⋅Input
もちろん,この順序で行列を乗じると,演算が悪いという問題に直面する.従って,各層に局所勾配を求める場合,このモードで計算した.
∂C=C′(zn)⊙σ′(fn)\partial C = C'(z_n)\odot\sigma'(f_n)∂C=C′(zn)⊙σ′(fn)
δn=wn+1T⋅δn+1⊙σ′(fn)\delta_n = w_{n+1}^T\cdot\delta_{n+1}\odot\sigma'(f_n)δn=wn+1T⋅δn+1⊙σ′(fn)
ここで、 ⊙\cdot,\\\odot⋅, ∮この2つの演算はいずれも行列積であり,それぞれ内(点)積,Hadamard積と呼ぶ.現在,この課題では活性化関数を信号とし,パターンに従って整理するとこうなる.
∂C∂W3=2(f3−Gt)⋅z2T{\partial C\over\partial W_3}=2(f_3-Gt)\cdot z_2^T∂W3∂C=2(f3−Gt)⋅z2T
∂C∂W2=W3T⋅2(f3−Gt)⊙σ(f2)(1−σ(f2))⋅z1T{\partial C\over\partial W_2}= {W_3}^T\cdot2(f_3-Gt)\odot\sigma(f2)(1-\sigma(f2))\cdot z_1^T∂W2∂C=W3T⋅2(f3−Gt)⊙σ(f2)(1−σ(f2))⋅z1T
∂C∂W1=W2T⋅{W3T⋅2(f3−Gt)⊙σ(f2)(1−σ(f2)}⊙σ(f1)(1−σ(f1))⋅xT{\partial C\over\partial W_1}= {W_2}^T\cdot\{{W_3}^T\cdot2(f_3-Gt)\odot\sigma(f2)(1-\sigma(f2)\}\odot\sigma(f1)(1-\sigma(f1))\cdot x^T∂W1∂C=W2T⋅{W3T⋅2(f3−Gt)⊙σ(f2)(1−σ(f2)}⊙σ(f1)(1−σ(f1))⋅xT
これで、各レイヤのグラデーションを計算するルールが表示されます.
Gradient=local gradient×output of prior layer\text{Gradient}=\text{local gradient}\times\text{output of prior layer}Gradient=local gradient×output of prior layer
Put these all together import matplotlib.pyplot as plt
import numpy as np
def layer(X: np.array, units: int, activation: str = None):
"""
X: input matrices of each layer and it decides weight matrix shape.
units: decides number of nodes in a hidden layer
activation: decides weight initialization. He, Xaiver and LeCun initialization are available.
"""
h, w = X.shape
if activation == 'relu':
weight = np.random.randn(w, units) * np.sqrt(2 / w) # He initialization
elif activation == 'tanh' or activation == 'softmax':
weight = np.random.randn(w, units) * np.sqrt(1 / w) # Xavier initialization
elif activation is None:
weight = np.random.randn(w, units) * 0.01 # LeCun initialization
return weight
def relu(X, props: str = 'fwd'):
"""
X: input matrices.
props: decides propagate direction e.g. fwd means forward pass and bwd means backward pass.
"""
if props == 'fwd':
return np.maximum(0, X)
elif props == 'bwd':
return np.greater(0, X).astype(int)
def softmax(X):
"""
X: input matrices.
I didn't create derivitives of softmax because when back propagation it's much efficient
to calculate it with cross entropy
"""
score = np.exp(X)
return score / np.sum(score, axis = 1, keepdims = True)
def cross_entropy(X, Y, props: str = 'fwd'):
"""
X: input matrices
Y: label vector
props: decides propagate direction e.g. fwd means forward pass and bwd means backward pass.
"""
n = len(X)
if props == 'fwd':
likelyhood = -np.log(X[range(n), Y])
return np.sum(likelyhood) / n
elif props == 'bwd':
X[range(n), Y] -= 1
return X / n
def sse(X, Y, props: str = 'fwd'):
"""
X: input matrices
Y: label vector
props: decides propagate direction e.g. fwd means forward pass and bwd means backward pass.
"""
if props == 'fwd':
return 0.5 * np.square(X - Y)
elif props == 'bwd':
return X - Y
def forward(X, w, activation: str = None):
"""
X: Input matrices that flows into the layer.
w: weight matrices
activation: Decides which activation function to use. Relu and softmax are available for now.
"""
d = X.dot(w)
if activation == 'relu':
a = relu(d)
elif activation == 'softmax':
a = softmax(d)
elif activation == None:
a = d
cache = a, d
return a, cache
def backward(upstream, cache, w, activation: str = None):
"""
upstream: gradient value from adjacent layer
cache: contains intermidiate step values within layers
activation: Decides which activation function to use. Relu and softmax are available for now.
"""
c1, c2 = cache
local = upstream.dot(w.T)
if activation == 'relu':
upstream = local * relu(c2[1], 'bwd')
elif activation == 'tanh':
upstream = local * tanh(c2[1], 'bwd')
elif activation is None:
upstream = local
grad = c1[0].T.dot(upstream)
return grad, upstream
def momentum(weight, grad, cache: dict, lr=1e-3, mu=0.9):
if len(cache) == 0:
for layer in range(len(grad)):
v = -lr * grad[layer]
cache.append(v)
weight[layer] = weight[layer] + v
else:
for layer in range(len(grad)):
v = mu * cache[layer] - lr * grad[layer]
cache[layer] = v
weight[layer] = weight[layer] + v
def fit(x, y, epochs: int, lr: float=1e-2, mu: float=0.9):
cache = []
vel = []
loss = []
grad = [np.zeros_like(layer) for layer in params]
for epoch in range(epochs):
# forward pass
X, c = forward(x, params[0])
cache.append(c)
for layer in range(1, len(params)-1):
X, c = forward(X, params[layer], 'relu')
cache.append(c)
X, c = forward(X, params[-1], 'softmax')
cache.append(c)
# loss calculation
cost = cross_entropy(X, y, 'fwd')
loss.append(cost)
# backward pass
upstream = cross_entropy(X, y, 'bwd')
grad[-1] = cache[-2][0].T.dot(upstream)
for layer in range(len(grad)-2, 0, -1):
grad[layer], upstream = backward(upstream, cache[layer-1:layer+1], params[layer+1], 'relu')
local = upstream.dot(params[1].T)
upstream = local * relu(cache[0][1], 'bwd')
grad[0] = x.T.dot(upstream)
# SGD momentum optimizer
momentum(params, grad, cache = vel, lr=1e-2, mu=0.9)
def predict(X):
X, _ = forward(x, params[0])
for layer in range(1, len(params)-1):
X, _ = forward(X, params[layer], 'relu')
X, _ = forward(X, params[-1], 'softmax')
print(X)
Epilogue
import matplotlib.pyplot as plt
import numpy as np
def layer(X: np.array, units: int, activation: str = None):
"""
X: input matrices of each layer and it decides weight matrix shape.
units: decides number of nodes in a hidden layer
activation: decides weight initialization. He, Xaiver and LeCun initialization are available.
"""
h, w = X.shape
if activation == 'relu':
weight = np.random.randn(w, units) * np.sqrt(2 / w) # He initialization
elif activation == 'tanh' or activation == 'softmax':
weight = np.random.randn(w, units) * np.sqrt(1 / w) # Xavier initialization
elif activation is None:
weight = np.random.randn(w, units) * 0.01 # LeCun initialization
return weight
def relu(X, props: str = 'fwd'):
"""
X: input matrices.
props: decides propagate direction e.g. fwd means forward pass and bwd means backward pass.
"""
if props == 'fwd':
return np.maximum(0, X)
elif props == 'bwd':
return np.greater(0, X).astype(int)
def softmax(X):
"""
X: input matrices.
I didn't create derivitives of softmax because when back propagation it's much efficient
to calculate it with cross entropy
"""
score = np.exp(X)
return score / np.sum(score, axis = 1, keepdims = True)
def cross_entropy(X, Y, props: str = 'fwd'):
"""
X: input matrices
Y: label vector
props: decides propagate direction e.g. fwd means forward pass and bwd means backward pass.
"""
n = len(X)
if props == 'fwd':
likelyhood = -np.log(X[range(n), Y])
return np.sum(likelyhood) / n
elif props == 'bwd':
X[range(n), Y] -= 1
return X / n
def sse(X, Y, props: str = 'fwd'):
"""
X: input matrices
Y: label vector
props: decides propagate direction e.g. fwd means forward pass and bwd means backward pass.
"""
if props == 'fwd':
return 0.5 * np.square(X - Y)
elif props == 'bwd':
return X - Y
def forward(X, w, activation: str = None):
"""
X: Input matrices that flows into the layer.
w: weight matrices
activation: Decides which activation function to use. Relu and softmax are available for now.
"""
d = X.dot(w)
if activation == 'relu':
a = relu(d)
elif activation == 'softmax':
a = softmax(d)
elif activation == None:
a = d
cache = a, d
return a, cache
def backward(upstream, cache, w, activation: str = None):
"""
upstream: gradient value from adjacent layer
cache: contains intermidiate step values within layers
activation: Decides which activation function to use. Relu and softmax are available for now.
"""
c1, c2 = cache
local = upstream.dot(w.T)
if activation == 'relu':
upstream = local * relu(c2[1], 'bwd')
elif activation == 'tanh':
upstream = local * tanh(c2[1], 'bwd')
elif activation is None:
upstream = local
grad = c1[0].T.dot(upstream)
return grad, upstream
def momentum(weight, grad, cache: dict, lr=1e-3, mu=0.9):
if len(cache) == 0:
for layer in range(len(grad)):
v = -lr * grad[layer]
cache.append(v)
weight[layer] = weight[layer] + v
else:
for layer in range(len(grad)):
v = mu * cache[layer] - lr * grad[layer]
cache[layer] = v
weight[layer] = weight[layer] + v
def fit(x, y, epochs: int, lr: float=1e-2, mu: float=0.9):
cache = []
vel = []
loss = []
grad = [np.zeros_like(layer) for layer in params]
for epoch in range(epochs):
# forward pass
X, c = forward(x, params[0])
cache.append(c)
for layer in range(1, len(params)-1):
X, c = forward(X, params[layer], 'relu')
cache.append(c)
X, c = forward(X, params[-1], 'softmax')
cache.append(c)
# loss calculation
cost = cross_entropy(X, y, 'fwd')
loss.append(cost)
# backward pass
upstream = cross_entropy(X, y, 'bwd')
grad[-1] = cache[-2][0].T.dot(upstream)
for layer in range(len(grad)-2, 0, -1):
grad[layer], upstream = backward(upstream, cache[layer-1:layer+1], params[layer+1], 'relu')
local = upstream.dot(params[1].T)
upstream = local * relu(cache[0][1], 'bwd')
grad[0] = x.T.dot(upstream)
# SGD momentum optimizer
momentum(params, grad, cache = vel, lr=1e-2, mu=0.9)
def predict(X):
X, _ = forward(x, params[0])
for layer in range(1, len(params)-1):
X, _ = forward(X, params[layer], 'relu')
X, _ = forward(X, params[-1], 'softmax')
print(X)
やってみました!
Reference
この問題について(Back propagation), 我々は、より多くの情報をここで見つけました https://velog.io/@iissaacc/Back-propagationテキストは自由に共有またはコピーできます。ただし、このドキュメントのURLは参考URLとして残しておいてください。
Collection and Share based on the CC Protocol