PyTorch Learning (15): A Collection of Miscellaneous Notes


Preface
Writing down small pieces of knowledge makes them easier to look up later. (Work in progress.)

How to write a custom RNN: Implementation of Multiplicative LSTM

How to write custom second-order derivatives: How to implement customized layer with second order derivatives; How to customize the double backward

Feeding a Custom Gradient into LSTMs

Discussion of retain_graph

Defining the parameters that need to be updated (2019.3.13)

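Notes on defining the parameters to be updated
In short, the parameters that need to be updated are generally created in the __init__ function. Existing layers such as Conv2d and Linear can be used directly: they are all Module subclasses, so their weight, bias, and so on are already defined as Parameter objects, which default to requires_grad=True. Note that only tensors registered as Parameter objects are handed to the optimizer through model.parameters(). This works because the network we define inherits from nn.Module.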
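The registration rule above can be checked with a small sketch. TinyNet and its attribute names are hypothetical and exist only for illustration; the point is that a plain tensor attribute, even with requires_grad=True, does not show up in model.parameters().

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        # Existing layer: its weight and bias are already nn.Parameter objects.
        self.fc = nn.Linear(4, 2)
        # Custom trainable tensor: wrapping it in nn.Parameter registers it.
        self.scale = nn.Parameter(torch.ones(2))
        # Plain tensor attribute: tracks gradients but is NOT registered.
        self.offset = torch.zeros(2, requires_grad=True)

    def forward(self, x):
        return self.fc(x) * self.scale + self.offset


model = TinyNet()
names = [name for name, _ in model.named_parameters()]
print(names)  # 'offset' is missing, so an optimizer would never update it
```

If a tensor should be trained, wrapping it in nn.Parameter inside __init__ is usually the simplest way to make it visible to the optimizer.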
optimizer
Typically, the model.parameters() method of the Module class returns every parameter in the network that is registered as a Parameter. Likewise, model.zero_grad() clears the gradients of all of those parameters. A manual parameter update therefore looks like this:

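with torch.no_grad():
    # update each parameter in place (lr is the learning rate);
    # no_grad keeps the update itself out of the autograd graph
    for p in model.parameters(): p -= p.grad * lr
    model.zero_grad()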

To simplify parameter updates, PyTorch provides optimizer classes that wrap these operations.

my_optimizer.step()
my_optimizer.zero_grad()
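As a rough check that the two styles do the same thing, the sketch below runs one manual update and one torch.optim.SGD update on identical copies of a small model. The model, data, and learning rate are made up for the example, and the equivalence only holds for plain SGD without momentum or weight decay.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)
lr = 0.1  # assumed learning rate for this example

model_a = nn.Linear(3, 1)
model_b = copy.deepcopy(model_a)  # identical starting weights
x = torch.randn(8, 3)
target = torch.randn(8, 1)

# Manual update, as in the loop above.
loss_a = ((model_a(x) - target) ** 2).mean()
loss_a.backward()
with torch.no_grad():
    for p in model_a.parameters():
        p -= p.grad * lr
model_a.zero_grad()

# The same step expressed with an optimizer.
my_optimizer = torch.optim.SGD(model_b.parameters(), lr=lr)
loss_b = ((model_b(x) - target) ** 2).mean()
loss_b.backward()
my_optimizer.step()
my_optimizer.zero_grad()

same = all(
    torch.allclose(pa, pb)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print(same)
```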

In optimizer.py, the definition of zero_grad shows that it resets to zero the gradients of every parameter that was passed to the optimizer:

    def zero_grad(self):
        r"""Clears the gradients of all optimized :class:`torch.Tensor` s."""
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is not None:
                    p.grad.detach_()
                    p.grad.zero_()

The implementation of .step() depends on the optimization method being used.
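To make this concrete, here is a minimal sketch of a custom optimizer whose step() performs plain gradient descent. PlainSGD is a hypothetical class written for this note, not part of torch.optim; it only relies on the torch.optim.Optimizer base class and its param_groups.

```python
import torch
from torch.optim import Optimizer


class PlainSGD(Optimizer):  # hypothetical optimizer, for illustration only
    def __init__(self, params, lr=0.01):
        super().__init__(params, defaults=dict(lr=lr))

    @torch.no_grad()
    def step(self, closure=None):
        # Each optimization method supplies its own update rule here.
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is not None:
                    p.add_(p.grad, alpha=-group["lr"])  # p = p - lr * grad


w = torch.nn.Parameter(torch.tensor([1.0]))
opt = PlainSGD([w], lr=0.1)
(w * 2).sum().backward()  # d(2w)/dw = 2
opt.step()                # w = 1.0 - 0.1 * 2
opt.zero_grad()
print(w.item())
```

Optimizers such as Adam follow the same structure but keep extra per-parameter state and use a different update formula inside step().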

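import torch
import torch.nn as nn
import torch.nn.functional as F

class MyNet3(nn.Module):
    def __init__(self):
        super(MyNet3, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, 3, padding=1)
        self.relu1 = nn.ReLU(True)
        # move this to forward
        # self.conv2 = nn.Conv2d(64, 1, 3, padding=1)

    def forward(self, input):
        # Pitfall: this Parameter is created on every forward call,
        # after the optimizer has already collected model.parameters()
        self.conv2_filters = nn.Parameter(torch.randn(1, 64, 3, 3))
        return F.conv2d(self.relu1(self.conv1(input)), self.conv2_filters, padding=1)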

conv2_filters here also receives a gradient, but because the optimizer never obtained conv2_filters when it was constructed, it cannot update it.
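One possible fix, sketched under the assumption that the filters are meant to be trained, is to create the Parameter once in __init__ so that model.parameters() includes it. MyNet3Fixed is a hypothetical name, and the input size and learning rate are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MyNet3Fixed(nn.Module):  # hypothetical corrected variant of MyNet3
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, 3, padding=1)
        self.relu1 = nn.ReLU(True)
        # Registered once, so the optimizer receives it via model.parameters().
        self.conv2_filters = nn.Parameter(torch.randn(1, 64, 3, 3))

    def forward(self, x):
        return F.conv2d(self.relu1(self.conv1(x)), self.conv2_filters, padding=1)


model = MyNet3Fixed()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

before = model.conv2_filters.detach().clone()
out = model(torch.randn(2, 3, 8, 8))
out.mean().backward()
optimizer.step()

# The filters now change after a step because the optimizer knows about them.
changed = not torch.equal(before, model.conv2_filters.detach())
print(tuple(out.shape), changed)
```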
Discussion of retain_graph
In pytorch, every time you perform a computation with Variables, you create a graph, then if you call backward on the last Variable, it will traverse this graph to compute the gradients for everything in it (and delete the graph as it goes through it if retain_graph=False). So in your command line, you created a single graph, trying to backprop twice through it (without retain_graph) so it will fail. If now inside your forward loop you redo the forward computation, then the Variable on which you call backward is not the same, and the graph attached to it is not the same as the previous iteration one. So no error here.
The common mistake (that would raise the mentioned error while you're not supposed to share a graph) that can happen is that you perform some computation just before the loop, and so even though you create new graphs in the loop, they share a common part outside the loop like below:

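import torch

# requires_grad=True is needed so that a graph is actually built
a = torch.rand(3, 3, requires_grad=True)

# This will be shared by both iterations and will make the second backward fail !
b = a * a

for i in range(10):
    d = b * b
    # The first here will work but the second will not !
    # (sum() reduces d to a scalar so backward() needs no argument)
    d.sum().backward()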
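Two common ways around this error are sketched below: keep the shared part of the graph alive with retain_graph=True, or rebuild the shared computation inside the loop. The variable names and the loop count are arbitrary, and the last part reproduces the failure so it can be observed safely.

```python
import torch

# Option 1: keep the shared graph alive across backward calls.
a = torch.rand(3, 3, requires_grad=True)
b = a * a  # built once, outside the loop
for i in range(3):
    d = b * b
    d.sum().backward(retain_graph=True)
# Gradients accumulate: d = a**4, so each pass adds 4 * a**3 to a.grad.

# Option 2: redo the whole forward computation in every iteration.
a2 = torch.rand(3, 3, requires_grad=True)
for i in range(3):
    b2 = a2 * a2
    d2 = b2 * b2
    d2.sum().backward()

# For comparison: the failing pattern from above, wrapped in try/except.
a3 = torch.rand(3, 3, requires_grad=True)
b3 = a3 * a3
failed = False
try:
    for i in range(2):
        (b3 * b3).sum().backward()
except RuntimeError:
    failed = True
print(failed)
```

Option 1 keeps the saved intermediate tensors in memory, so Option 2 is often preferable when recomputing the shared part is cheap.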

Autograd

import torch

# unbroken gradient, backward goes all the way to x
x = torch.ones(2, 2, requires_grad=True)
y = 2 * x + 2
z = y * y * 3
out = z.mean()
out.backward()
print(x.grad)

# broken gradient, ends at _y
x = torch.ones(2, 2, requires_grad=True)
y = 2 * x + 2

# detach y from the graph and turn the copy into a new leaf that tracks gradients
_y = y.detach().clone().requires_grad_(True)
z = _y * _y * 3
out = z.mean()
out.backward()
print(x.grad)   # None: the gradient never reached x
print(_y.grad)

# we can however, take the grad of _y and put it in manually!
y.backward(_y.grad)
print(x.grad)

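If the tensor is a scalar, no grad_tensors argument is needed: the scalar is simply differentiated with respect to the vector. This is why the docs call mean() before backward(). A loss is likewise a scalar, so you can call .backward() on it directly. The other form is torch.autograd.backward(tensors, grad_tensors=None, retain_graph=None, create_graph=False, grad_variables=None):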

Computes the sum of gradients of given tensors w.r.t. graph leaves.

The graph is differentiated using the chain rule. If any of tensors are non-scalar (i.e. their data has more than one element) and require gradient, the function additionally requires specifying grad_tensors. It should be a sequence of matching length, that contains gradient of the differentiated function w.r.t. corresponding tensors (None is an acceptable value for all tensors that don't need gradient tensors).
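A small sketch of what this means in practice, reusing the x and y from the Autograd example: y is non-scalar, so backward needs an explicit gradient of matching shape. The specific weights (all ones, then 0.5) are arbitrary choices for illustration.

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
y = 2 * x + 2  # non-scalar: y.backward() with no argument would raise

# Passing ones is equivalent to calling y.sum().backward().
y.backward(torch.ones_like(y))
g1 = x.grad.clone()  # dy/dx = 2 everywhere

# The functional form takes sequences of tensors and matching grad_tensors.
x.grad = None  # reset so the second result is not added to the first
y2 = 2 * x + 2
torch.autograd.backward([y2], grad_tensors=[torch.full_like(y2, 0.5)])
g2 = x.grad.clone()  # 2 * 0.5 = 1 everywhere

print(g1)
print(g2)
```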

This function accumulates gradients in the leaves - you might need to zero them before calling it.
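The accumulation behaviour can be seen in a few lines. This is only an illustrative sketch with a made-up leaf tensor w; it is also the reason zero_grad() is called between optimizer steps.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w * 3).sum().backward()
first = w.grad.clone()   # gradient of 3*w summed is 3 per element

(w * 3).sum().backward()
second = w.grad.clone()  # the new gradient was added to the old one

w.grad.zero_()           # clear the accumulated gradient
(w * 3).sum().backward()
third = w.grad.clone()   # back to a single backward pass worth of gradient

print(first, second, third)
```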
