How to avoid GPU out-of-memory when fine-tuning mBART or mT5 with HuggingFace transformers, or rather, a record of a lesson learned


Background

I was trying to run fine-tuning locally using the HuggingFace transformers example code.

PyTorch: setting up devices
12/20/2020 09:34:39 - WARNING - __main__ -   Process rank: -1,
 device: cuda:0, n_gpu: 1distributed training: False,
 16-bits training: False

RuntimeError: CUDA out of memory. Tried to allocate * MiB 
(GPU 0; * GiB total capacity; * GiB already allocated; * free; 
* GiB reserved in total by PyTorch)

Hmm. I am well aware that I don't have enough GPU memory.
Normally I would just move over to Google Colab or the like, but this time I really wanted to keep working locally.
 

*I have skipped over too much of the setup here. I plan to write up the full fine-tuning workflow later. For now, see:

 Fine-tuning Japanese text classification with Huggingface Transformers
 https://note.com/npaka/n/n6df2be2a91c5
 Sequence to Sequence Training and Evaluation
 https://github.com/huggingface/transformers/tree/master/examples/seq2seq

Problem

・Fine-tune with the HuggingFace transformers code on a local machine that does not have enough GPU memory.

Possible solutions

1. Disable CUDA in the system settings → it does not actually get disabled
2. Disable the GPU / CUDA from the transformers side
3. ...

Method 2
・Do I have to specify the CPU explicitly?
 → read through the code
 → training_args.py has a line that sets device = torch.device("cpu")
 → and it turns out there is a --no_cuda argument... (see the sketch below)
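
For the record, a minimal sketch of the same thing done from Python (my own sketch, not taken from the example script; "./out" is just a placeholder, and I am assuming the TrainingArguments API of the transformers version used here):

# Minimal sketch: forcing CPU from the transformers side.
# Assumption: "./out" is a placeholder output directory.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./out",  # placeholder
    no_cuda=True,        # same effect as passing --no_cuda on the command line
)

print(args.device)  # -> cpu
print(args.n_gpu)   # -> 0

The example scripts parse these same TrainingArguments fields from the command line, which is where the --no_cuda flag comes from.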

・It worked.

PyTorch: setting up devices
12/20/2020 10:05:57 - WARNING - __main__ -   Process rank: -1,
 device: cpu, n_gpu: 0distributed training: False,
 16-bits training: False

0.24 it/s
Slow! But not having to worry about a runtime being terminated, and being able to focus on other work in the meantime, is a real relief.

Conclusion

・Pass --no_cuda.
・Check the available arguments properly.

*The Training/evaluation parameters printed at startup also list no_cuda=False. I should have noticed there that the argument exists.
*If the name had contained "gpu" or "cpu" I would have spotted it right away, but "cuda"...

→ An mT5 fine-tuning (non-GPU) example (partially abridged) has been appended at the end.

Appendix

・List of run_glue.py arguments

usage: run_glue.py [-h] --model_name_or_path MODEL_NAME_OR_PATH
                   [--config_name CONFIG_NAME]
                   [--tokenizer_name TOKENIZER_NAME] [--cache_dir CACHE_DIR]
                   [--no_use_fast_tokenizer] [--task_name TASK_NAME]
                   [--max_seq_length MAX_SEQ_LENGTH] [--overwrite_cache]
                   [--no_pad_to_max_length] [--train_file TRAIN_FILE]
                   [--validation_file VALIDATION_FILE] --output_dir OUTPUT_DIR
                   [--overwrite_output_dir] [--do_train] [--do_eval]
                   [--do_predict] [--evaluate_during_training]
                   [--evaluation_strategy {EvaluationStrategy.NO,EvaluationStrategy.STEPS,EvaluationStrategy.EPOCH}]
                   [--prediction_loss_only]
                   [--per_device_train_batch_size PER_DEVICE_TRAIN_BATCH_SIZE]
                   [--per_device_eval_batch_size PER_DEVICE_EVAL_BATCH_SIZE]
                   [--per_gpu_train_batch_size PER_GPU_TRAIN_BATCH_SIZE]
                   [--per_gpu_eval_batch_size PER_GPU_EVAL_BATCH_SIZE]
                   [--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS]
                   [--eval_accumulation_steps EVAL_ACCUMULATION_STEPS]
                   [--learning_rate LEARNING_RATE]
                   [--weight_decay WEIGHT_DECAY] [--adam_beta1 ADAM_BETA1]
                   [--adam_beta2 ADAM_BETA2] [--adam_epsilon ADAM_EPSILON]
                   [--max_grad_norm MAX_GRAD_NORM]
                   [--num_train_epochs NUM_TRAIN_EPOCHS]
                   [--max_steps MAX_STEPS] [--warmup_steps WARMUP_STEPS]
                   [--logging_dir LOGGING_DIR] [--logging_first_step]
                   [--logging_steps LOGGING_STEPS] [--save_steps SAVE_STEPS]
                   [--save_total_limit SAVE_TOTAL_LIMIT] [--no_cuda]
                   [--seed SEED] [--fp16] [--fp16_opt_level FP16_OPT_LEVEL]
                   [--local_rank LOCAL_RANK] [--tpu_num_cores TPU_NUM_CORES]
                   [--tpu_metrics_debug] [--debug] [--dataloader_drop_last]
                   [--eval_steps EVAL_STEPS]
                   [--dataloader_num_workers DATALOADER_NUM_WORKERS]
                   [--past_index PAST_INDEX] [--run_name RUN_NAME]
                   [--disable_tqdm DISABLE_TQDM] [--no_remove_unused_columns]
                   [--label_names LABEL_NAMES [LABEL_NAMES ...]]
                   [--load_best_model_at_end]
                   [--metric_for_best_model METRIC_FOR_BEST_MODEL]
                   [--greater_is_better GREATER_IS_BETTER]
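
(This listing is the argparse help output; since the usage line itself shows [-h], it should be reproducible with something like the following.)

python run_glue.py --help   # or %run run_glue.py --help inside a notebook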

 

*Maybe it's about time I migrated from the raw BERT code I've been using for roughly 1.8 years to the BERT in Huggingface transformers.

*Options, command-line arguments, parameters: I still haven't grasped the point of distinguishing these terms. https://note.nkmk.me/python-command-line-arguments/

*Example of mT5-base fine-tuning (non-GPU, partially abridged)
On my machine, processing about 500 sentence pairs for 3 epochs takes roughly 2 hours, at about 50% CPU usage and 20 GB of RAM.
To get usable results you really want 20,000 to 50,000 sentence pairs for 3 epochs; scaling linearly from 2 hours per 500 pairs, that is 80 to 200 hours, i.e. 3 to 8 days (measured: from about 35 hours).
I want a proper GPU.

%cd C:/Users/0000/transformers/examples/seq2seq/
%run run_seq2seq.py \
    --model_name_or_path C:/Users/0000/mt5model \
    --do_train \
    --do_eval \
    --task summarization \
    --train_file C:/Users/0000/t.json \
    --validation_file C:/Users/0000/v.json \
    --output_dir C:/Users/0000/mt5model/mt5model_finetuned \
    --overwrite_output_dir \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --predict_with_generate \
    --max_train_samples 500 \
    --max_val_samples 500 \
    --no_cuda

For reference, an excerpt of the data arguments defined in run_seq2seq.py:

    task: str = field(
        default="summarization",
        metadata={
            "help": "The name of the task, should be summarization (or summarization_{dataset} for evaluating "
            "pegasus) or translation (or translation_{xx}_to_{yy})."
        },
    )
    dataset_name: Optional[str] = field(
        default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}
    )
    dataset_config_name: Optional[str] = field(
        default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
    )
    text_column: Optional[str] = field(
        default=None,
        metadata={"help": "The name of the column in the datasets containing the full texts (for summarization)."},
    )
    summary_column: Optional[str] = field(
        default=None,
        metadata={"help": "The name of the column in the datasets containing the summaries (for summarization)."},
    )
    train_file: Optional[str] = field(
        default=None, metadata={"help": "The input training data file (a jsonlines or csv file)."}
    )
    validation_file: Optional[str] = field(
        default=None,
        metadata={
            "help": "An optional input evaluation data file to evaluate the metrics (rouge/sacreblue) on "
            "(a jsonlines or csv file)."
        },
    )
    test_file: Optional[str] = field(
        default=None,
        metadata={
            "help": "An optional input test data file to evaluate the metrics (rouge/sacreblue) on "
            "(a jsonlines or csv file)."
        },
    )
    overwrite_cache: bool = field(
        default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
    )
    preprocessing_num_workers: Optional[int] = field(
        default=None,
        metadata={"help": "The number of processes to use for the preprocessing."},
    )
    max_source_length: Optional[int] = field(
        default=1024,
        metadata={
            "help": "The maximum total input sequence length after tokenization. Sequences longer "
            "than this will be truncated, sequences shorter will be padded."
        },
    )
    max_target_length: Optional[int] = field(
        default=128,
        metadata={
            "help": "The maximum total sequence length for target text after tokenization. Sequences longer "
            "than this will be truncated, sequences shorter will be padded."
        },
    )
    val_max_target_length: Optional[int] = field(
        default=None,
        metadata={
            "help": "The maximum total sequence length for validation target text after tokenization. Sequences longer "
            "than this will be truncated, sequences shorter will be padded. Will default to `max_target_length`."
            "This argument is also used to override the ``max_length`` param of ``model.generate``, which is used "
            "during ``evaluate`` and ``predict``."
        },
    )
    pad_to_max_length: bool = field(
        default=False,
        metadata={
            "help": "Whether to pad all samples to model maximum sentence length. "
            "If False, will pad the samples dynamically when batching to the maximum length in the batch. More "
            "efficient on GPU but very bad for TPU."
        },
    )
    max_train_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": "For debugging purposes or quicker training, truncate the number of training examples to this "
            "value if set."
        },
    )
    max_val_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": "For debugging purposes or quicker training, truncate the number of validation examples to this "
            "value if set."
        },
    )
    max_test_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": "For debugging purposes or quicker training, truncate the number of test examples to this "
            "value if set."
        },
    )
    source_lang: Optional[str] = field(default=None, metadata={"help": "Source language id for translation."})
    target_lang: Optional[str] = field(default=None, metadata={"help": "Target language id for translation."})
    num_beams: Optional[int] = field(
        default=None,
        metadata={
            "help": "Number of beams to use for evaluation. This argument will be passed to ``model.generate``, "
            "which is used during ``evaluate`` and ``predict``."
        },
    )
    ignore_pad_token_for_loss: bool = field(
        default=True,
        metadata={
            "help": "Whether to ignore the tokens corresponding to padded labels in the loss computation or not."
        },
    )
    source_prefix: Optional[str] = field(
        default=None, metadata={"help": "A prefix to add before every source text (useful for T5 models)."}
    )

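For reference, a sketch of how the train_file / validation_file contents might be prepared. The "text" and "summary" keys below are an assumption on my part; they have to match whatever columns the script ends up using (see the text_column / summary_column fields above), so check them against your own data:

# Sketch: writing a jsonlines training file for the summarization task.
# Assumption: one JSON object per line with "text" (source) and "summary" (target) keys;
# the key names may need to be adjusted via --text_column / --summary_column.
import json

pairs = [
    {"text": "Source sentence containing a typo.", "summary": "Corrected sentence."},
    {"text": "Another source sentence.", "summary": "Its corrected counterpart."},
]

with open("t.json", "w", encoding="utf-8") as f:  # corresponds to --train_file
    for pair in pairs:
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")
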
Use Trainer or pytorch lightning, you say? Well, I don't fine-tune often enough for that...

*I think data-to-text is an exciting area.

*I should upload the typo-correction model fine-tuned on 20,000 sentence pairs somewhere. Is the Huggingface model hub the right place?

*Recent Advances in Language Model Fine-tuning
https://ruder.io/recent-advances-lm-fine-tuning/
"A practical problem when fine-tuning pre-trained models is that performance can vary significantly from run to run, particularly on small datasets ~
Furthermore, when fine-tuning BERT, using a small learning rate and increasing the number of epochs is recommended."

*Google Colab Pro is now available in Japan. Whether its GPU memory would be enough is a close call.