PyTorch Study Notes (2): Splitting a Dataset


Environment

OS: macOS Mojave

Python version: 3.7

PyTorch version: 1.4.0

IDE: PyCharm

Contents

0. Preface

1. Sorting images into subdirectories by label

2. Splitting into training, validation, and test sets

0. Preface

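For deep learning tasks in computer vision, data handling involves the following steps:

Split the dataset into a training set, a validation set, and a test set.

Preprocess the images, including data augmentation and normalisation.

Read the data in batches and feed it to the model.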

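When PyTorch reads data for training through a folder-based loader such as torchvision.datasets.ImageFolder, the data must follow a specific directory structure (one subdirectory per class). Splitting the dataset therefore amounts to arranging the files into that directory format.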
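
As a rough illustration of what that layout implies, the sketch below lists the class subdirectories of a split directory and gives each one an index by sorted name. As far as I understand it, this mirrors how torchvision's ImageFolder derives class indices, but it is only a standard-library stand-in; the helper name find_classes_sketch is hypothetical, and the directories are created in a temporary folder rather than taken from the real dataset.

```python
import os
import tempfile


def find_classes_sketch(split_dir):
    """Return (classes, class_to_idx) for a directory with one subdirectory per class.

    Hypothetical helper: class names are the subdirectory names, sorted as strings.
    """
    classes = sorted(
        entry.name for entry in os.scandir(split_dir) if entry.is_dir()
    )
    class_to_idx = {name: idx for idx, name in enumerate(classes)}
    return classes, class_to_idx


# Build a tiny throwaway layout to demonstrate (not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    for name in ['0.1', '2.0', '10.0', '100.0']:
        os.makedirs(os.path.join(tmp, 'train', name))
    classes, class_to_idx = find_classes_sketch(os.path.join(tmp, 'train'))

print(classes)
print(class_to_idx)
```

The names are sorted as strings, so '10.0' and '100.0' come before '2.0'. With directory names like the face values used later in these notes, the class index is therefore not in numeric order of face value, which is worth remembering when mapping predictions back to labels.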

1. Sorting images into subdirectories by label

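A training image dataset, as obtained, is not always in a directory format that PyTorch can read conveniently. Take the training set of the TinyMind RMB face-value recognition task as an example: it contains 39,620 images, and train_face_value_label.csv holds the label for each image.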

├── train
│   ├── 39620 images.jpeg
└── train_face_value_label.csv

Create categorise.py to store the images in separate directories by class, producing the following directory structure:

├── categorise.py
├── train
│   ├── 0.1
│   │   └── 4233 images.jpg
│   ├── 0.2
│   │   └── 4373 images.jpg
│   ├── 0.5
│   │   └── 4407 images.jpg
│   ├── 1.0
│   │   └── 4424 images.jpg
│   ├── 2.0
│   │   └── 4411 images.jpg
│   ├── 5.0
│   │   └── 4413 images.jpg
│   ├── 10.0
│   │   └── 4283 images.jpg
│   ├── 50.0
│   │   └── 4408 images.jpg
│   └── 100.0
│       └── 4668 images.jpg
└── train_face_value_label.csv

import os
import time
import shutil
import pandas as pd

label_path = os.path.join(os.curdir, 'train_face_value_label.csv')
labels = pd.read_csv(label_path)

# copy each image into the directory for its class
since = time.time()
data_dir = os.path.join(os.curdir, 'train')
for root, dirs, files in os.walk(data_dir):
    for file in files:
        image_name = file  # if the CSV stores names without the extension, use file.split('.')[0]

        # look up the class the image belongs to
        label = labels[labels['name'] == image_name]['label'].values.item()  # float, e.g. 0.1

        out_dir = os.path.join(data_dir, str(label))  # Note: float to str for 'label', e.g. '0.1'
        os.makedirs(out_dir, exist_ok=True)

        to_path = os.path.join(out_dir, file)
        from_path = os.path.join(data_dir, file)
        shutil.copy(from_path, to_path)  # use shutil.move instead to avoid leaving duplicates in train
    break  # only the files directly under data_dir; do not descend into the new class directories

time_taken = time.time() - since
print('Time taken: {:.0f}m {:.0f}s'.format(time_taken // 60, time_taken % 60))
# Time taken: 3m 0s
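
The row filter labels[labels['name'] == image_name] scans the whole table once per image, which may account for part of the three minutes above. One possible alternative, sketched here with only the standard library, is to load the CSV into a dict once and look each name up directly. The column names name and label are taken from the script above; load_label_map is a hypothetical helper, the sample rows are invented, and the float round-trip is an assumption meant to reproduce directory names such as 100.0.

```python
import csv
import io


def load_label_map(csv_file):
    """Map image name -> class directory name, read from an open CSV file."""
    label_map = {}
    for row in csv.DictReader(csv_file):
        # str(float(...)) turns both '100' and '100.0' into '100.0'
        label_map[row['name']] = str(float(row['label']))
    return label_map


# Demonstration on made-up rows (the file names are invented for the example).
sample = io.StringIO("name,label\na.jpg,0.1\nb.jpg,100\nc.jpg,5.0\n")
label_map = load_label_map(sample)
print(label_map['b.jpg'])
```

With the real file, the map would be built once before the loop, for example with open(label_path, newline='') as f: label_map = load_label_map(f), and label_map[image_name] would then replace the DataFrame filter.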

2. Splitting into training, validation, and test sets

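Next, split the categorised image data. Create split.py, which moves a portion of the images into a val directory to serve as the validation set. In this example, the ratio is training set : validation set = 99% : 1%.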

├── categorise.py
├── split.py
├── train
│   ├── 0.1
│   │   └── images.jpg
│   ├── 0.2
│   │   └── images.jpg
│   ├── 0.5
│   │   └── images.jpg
│   ├── 1.0
│   │   └── images.jpg
│   ├── 2.0
│   │   └── images.jpg
│   ├── 5.0
│   │   └── images.jpg
│   ├── 10.0
│   │   └── images.jpg
│   ├── 50.0
│   │   └── images.jpg
│   └── 100.0
│       └── images.jpg
├── train_face_value_label.csv
└── val
    ├── 0.1
    │   └── 42 images.jpg
    ├── 0.2
    │   └── 43 images.jpg
    ├── 0.5
    │   └── 44 images.jpg
    ├── 1.0
    │   └── 44 images.jpg
    ├── 2.0
    │   └── 44 images.jpg
    ├── 5.0
    │   └── 44 images.jpg
    ├── 10.0
    │   └── 42 images.jpg
    ├── 50.0
    │   └── 44 images.jpg
    └── 100.0
        └── 46 images.jpg
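
The per-class validation counts in the listing above follow from taking 1% of each class size and truncating to an integer. The short sketch below reproduces that arithmetic; the class sizes are copied from the directory listing in section 1, and the variable names are only illustrative.

```python
# Class sizes copied from the directory listing in section 1.
class_sizes = {
    '0.1': 4233, '0.2': 4373, '0.5': 4407,
    '1.0': 4424, '2.0': 4411, '5.0': 4413,
    '10.0': 4283, '50.0': 4408, '100.0': 4668,
}
split_pct = 0.01

# int() truncates, so each class gives up floor(size * 1%) images.
val_sizes = {name: int(size * split_pct) for name, size in class_sizes.items()}
train_sizes = {name: class_sizes[name] - val_sizes[name] for name in class_sizes}

print(val_sizes)
print(sum(val_sizes.values()), sum(train_sizes.values()))
```

Because of the truncation, the validation set holds 393 of the 39,620 images, slightly under 1%, and 39,227 images remain for training.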

split.py can be written as follows:

import os
import random
import shutil

random.seed(0)

split_name = 'val'
split_pct = 0.01  # split out 1% for validation

data_dir = os.path.join(os.curdir, 'train')
for root, dirs, files in os.walk(data_dir):
    for sub_dir in dirs:  # the class directories directly under data_dir
        # get a sorted list of image names, then shuffle it
        # (os.listdir order is arbitrary, so sorting keeps the seeded shuffle reproducible)
        images = os.listdir(os.path.join(root, sub_dir))
        images = sorted(filter(lambda x: x.endswith('.jpg'), images))
        num_images = len(images)
        random.shuffle(images)

        num_split = int(num_images * split_pct)  # how many images will be split out

        for i in range(num_split):
            out_dir = os.path.join(os.curdir, split_name, sub_dir)
            os.makedirs(out_dir, exist_ok=True)

            to_path = os.path.join(out_dir, images[i])
            from_path = os.path.join(data_dir, sub_dir, images[i])
            shutil.move(from_path, to_path)
    break  # only the first iteration (the top level of data_dir) is needed
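
After running a split like this, it can be reassuring to confirm that no image ended up in both directories. The following is a minimal sketch of such a check, assuming the one-subdirectory-per-class layout shown earlier; list_images is a hypothetical helper, and the demonstration uses a temporary folder with invented file names rather than the real dataset.

```python
import os
import tempfile


def list_images(split_dir):
    """Return the set of (class_name, file_name) pairs under split_dir."""
    found = set()
    for class_name in os.listdir(split_dir):
        class_dir = os.path.join(split_dir, class_name)
        if os.path.isdir(class_dir):
            for file_name in os.listdir(class_dir):
                found.add((class_name, file_name))
    return found


# Throwaway layout with invented file names, just to exercise the check.
with tempfile.TemporaryDirectory() as tmp:
    for split, names in [('train', ['a.jpg', 'b.jpg']), ('val', ['c.jpg'])]:
        class_dir = os.path.join(tmp, split, '0.1')
        os.makedirs(class_dir)
        for name in names:
            open(os.path.join(class_dir, name), 'w').close()
    train_set = list_images(os.path.join(tmp, 'train'))
    val_set = list_images(os.path.join(tmp, 'val'))

overlap = train_set & val_set
print(len(train_set), len(val_set), len(overlap))
```

Since the script moves files rather than copying them, the overlap should be empty; a non-empty result would suggest that copies were left behind in train.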

If a test set is also needed, run the script again with split_name changed to 'test' and the split ratio split_pct adjusted accordingly.
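
One reason the ratio needs adjusting is that the second run draws from what is left in train, not from the original dataset. The sketch below shows the arithmetic under that assumption; adjusted_pct is a hypothetical helper, not part of split.py.

```python
def adjusted_pct(target_frac, already_removed_frac):
    """Fraction of the *remaining* images to move so that the new split
    ends up as target_frac of the *original* dataset."""
    remaining = 1.0 - already_removed_frac
    if not 0.0 < target_frac <= remaining:
        raise ValueError('target fraction must be positive and fit in what remains')
    return target_frac / remaining


# After moving 1% to val, a test set worth 1% of the original
# needs slightly more than 0.01 of the remaining images.
test_pct = adjusted_pct(0.01, 0.01)
print(round(test_pct, 6))
```

For a 1% test set after a 1% validation split the difference is tiny (0.01 / 0.99), but it grows with larger splits: taking 20% for testing after 20% has already gone to validation would need split_pct = 0.25. The int() truncation in split.py still applies per class, so the final counts can be slightly lower than the target.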
