How to avoid running out of GPU memory when fine-tuning mbart or mt5 with HuggingFace transformers, a.k.a. a record of lessons learned
Background
I was trying to fine-tune locally using the HuggingFace transformers example code.
PyTorch: setting up devices
12/20/2020 09:34:39 - WARNING - __main__ - Process rank: -1,
device: cuda:0, n_gpu: 1, distributed training: False,
16-bits training: False
~
RuntimeError: CUDA out of memory. Tried to allocate * MiB
(GPU 0; * GiB total capacity; * GiB already allocated; * free;
* GiB reserved in total by PyTorch)
Hmm. I was well aware that the GPU was short on memory.
Normally I would just run this on Google Colab, but this time I really wanted to keep going locally.
*I skipped too much of the setup here. I plan to write up the full fine-tuning workflow at some point; for now, see the following:
Fine-tuning Japanese text classification with Huggingface Transformers
https://note.com/npaka/n/n6df2be2a91c5
Sequence to Sequence Training and Evaluation
https://github.com/huggingface/transformers/tree/master/examples/seq2seq
Problem
・Fine-tune with the HuggingFace transformers example code on a local machine that does not have enough GPU memory.
Possible solutions
1 Disable CUDA in the system settings → it did not get disabled
2 Disable the GPU / CUDA from the transformers side
3 ・・・
Approach 2
・Do I have to specify the CPU explicitly?
→ Read through the code
→ training_args.py contains a line that sets device = torch.device("cpu")
→ It turns out a --no_cuda argument already existed...
・It ran fine.
PyTorch: setting up devices
12/20/2020 10:05:57 - WARNING - __main__ - Process rank: -1,
device: cpu, n_gpu: 0, distributed training: False,
16-bits training: False
~
0.24 it/s
Slow! But the peace of mind of not having to worry about the runtime being terminated, and being able to focus on other work in the meantime, is quite something.
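The device-selection behavior described above can be sketched roughly as follows. This is a standalone simplification for illustration, not the actual code in training_args.py:

```python
# Simplified sketch of how the Trainer's device resolution behaves.
# The real logic lives in transformers' TrainingArguments; this version
# only mirrors the no_cuda branch for illustration.
def pick_device(no_cuda: bool, cuda_available: bool) -> str:
    """Return the device string the Trainer would end up using."""
    if no_cuda or not cuda_available:
        return "cpu"      # --no_cuda forces CPU even when a GPU exists
    return "cuda:0"       # default: first GPU

print(pick_device(no_cuda=True, cuda_available=True))   # cpu
print(pick_device(no_cuda=False, cuda_available=True))  # cuda:0
```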
Conclusion
・Pass --no_cuda.
・Check the available arguments carefully.
*The Training/evaluation parameters printed at startup also listed no_cuda=False. I should have noticed from that line that the argument existed.
*If it had been named after "gpu" or "cpu" I would have spotted it immediately, but "cuda"...
→ An mT5 fine-tuning (non-GPU) example (partially abridged) is appended at the end.
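For reference, the behavior of such a flag is easy to reproduce with plain argparse. Transformers actually generates --no_cuda from a TrainingArguments dataclass field via HfArgumentParser, but this minimal argparse sketch shows why it prints as no_cuda=False when the flag is omitted:

```python
import argparse

# Plain-argparse equivalent of the flag, for illustration only.
parser = argparse.ArgumentParser()
parser.add_argument("--no_cuda", action="store_true",
                    help="Do not use CUDA even when it is available")

args = parser.parse_args([])             # flag not passed
print(args.no_cuda)                      # False -> shown as no_cuda=False in the logs

args = parser.parse_args(["--no_cuda"])  # flag passed
print(args.no_cuda)                      # True
```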
Appendix
・run_glue.py argument list
usage: run_glue.py [-h] --model_name_or_path MODEL_NAME_OR_PATH
[--config_name CONFIG_NAME]
[--tokenizer_name TOKENIZER_NAME] [--cache_dir CACHE_DIR]
[--no_use_fast_tokenizer] [--task_name TASK_NAME]
[--max_seq_length MAX_SEQ_LENGTH] [--overwrite_cache]
[--no_pad_to_max_length] [--train_file TRAIN_FILE]
[--validation_file VALIDATION_FILE] --output_dir OUTPUT_DIR
[--overwrite_output_dir] [--do_train] [--do_eval]
[--do_predict] [--evaluate_during_training]
[--evaluation_strategy {EvaluationStrategy.NO,EvaluationStrategy.STEPS,EvaluationStrategy.EPOCH}]
[--prediction_loss_only]
[--per_device_train_batch_size PER_DEVICE_TRAIN_BATCH_SIZE]
[--per_device_eval_batch_size PER_DEVICE_EVAL_BATCH_SIZE]
[--per_gpu_train_batch_size PER_GPU_TRAIN_BATCH_SIZE]
[--per_gpu_eval_batch_size PER_GPU_EVAL_BATCH_SIZE]
[--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS]
[--eval_accumulation_steps EVAL_ACCUMULATION_STEPS]
[--learning_rate LEARNING_RATE]
[--weight_decay WEIGHT_DECAY] [--adam_beta1 ADAM_BETA1]
[--adam_beta2 ADAM_BETA2] [--adam_epsilon ADAM_EPSILON]
[--max_grad_norm MAX_GRAD_NORM]
[--num_train_epochs NUM_TRAIN_EPOCHS]
[--max_steps MAX_STEPS] [--warmup_steps WARMUP_STEPS]
[--logging_dir LOGGING_DIR] [--logging_first_step]
[--logging_steps LOGGING_STEPS] [--save_steps SAVE_STEPS]
[--save_total_limit SAVE_TOTAL_LIMIT] [--no_cuda]
[--seed SEED] [--fp16] [--fp16_opt_level FP16_OPT_LEVEL]
[--local_rank LOCAL_RANK] [--tpu_num_cores TPU_NUM_CORES]
[--tpu_metrics_debug] [--debug] [--dataloader_drop_last]
[--eval_steps EVAL_STEPS]
[--dataloader_num_workers DATALOADER_NUM_WORKERS]
[--past_index PAST_INDEX] [--run_name RUN_NAME]
[--disable_tqdm DISABLE_TQDM] [--no_remove_unused_columns]
[--label_names LABEL_NAMES [LABEL_NAMES ...]]
[--load_best_model_at_end]
[--metric_for_best_model METRIC_FOR_BEST_MODEL]
[--greater_is_better GREATER_IS_BETTER]
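Among the arguments above, --gradient_accumulation_steps is also worth knowing when a GPU is present but short on memory: a smaller per-device batch combined with more accumulation steps keeps the effective batch size unchanged while reducing peak memory. A minimal sketch of the arithmetic (my own illustration, not transformers code):

```python
def effective_batch_size(per_device_batch: int,
                         accumulation_steps: int,
                         n_devices: int = 1) -> int:
    # Gradients are accumulated over `accumulation_steps` mini-batches
    # before each optimizer step, so memory scales with per_device_batch
    # while the optimizer effectively sees the product.
    return per_device_batch * accumulation_steps * n_devices

# e.g. a batch of 16 that does not fit -> batch 4 with 4 accumulation steps
print(effective_batch_size(4, 4))  # 16
```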
*Maybe it is time to migrate from the raw BERT I have been using for about 1.8 years to the Huggingface transformers BERT.
*Options, command-line arguments, parameters: I still do not quite see the point of distinguishing these terms. https://note.nkmk.me/python-command-line-arguments/
*mT5-base fine-tuning (non-GPU) example (partially abridged)
On my machine, processing about 500 sentence pairs for 3 epochs takes roughly 2 hours, at about 50% CPU usage and 20 GB of memory.
To get decent output, though, you want 20,000 to 50,000 sentence pairs over 3 epochs. Scaling linearly, that is 80 to 200 hours, i.e. 3 to 8 days (measured: 35 hours and up).
I want a proper GPU.
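The 80-to-200-hour figure is just linear scaling from the measured 500-pair run; as a sanity check:

```python
# Linear extrapolation of training time from the measured run
# (about 2 hours for 500 sentence pairs x 3 epochs on CPU).
measured_pairs = 500
measured_hours = 2.0

def estimated_hours(pairs: int) -> float:
    return measured_hours * pairs / measured_pairs

print(estimated_hours(20_000))  # 80.0 hours (~3.3 days)
print(estimated_hours(50_000))  # 200.0 hours (~8.3 days)
```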
%cd C:/Users/0000/transformers/examples/seq2seq/
%run run_seq2seq.py \
--model_name_or_path C:/Users/0000/mt5model \
--do_train \
--do_eval \
--task summarization \
--train_file C:/Users/0000/t.json \
--validation_file C:/Users/0000/v.json \
--output_dir C:/Users/0000/mt5model/mt5model_finetuned \
--overwrite_output_dir \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--predict_with_generate \
--max_train_samples 500 \
--max_val_samples 500 \
--no_cuda
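The --train_file and --validation_file inputs are jsonlines (or csv) files. The exact key names depend on text_column/summary_column, so the "text"/"summary" keys below are assumptions; a hypothetical t.json could be built like this:

```python
import json

# Hypothetical example of writing a jsonlines training file for the
# summarization task. The key names ("text"/"summary") are assumptions;
# they must match whatever run_seq2seq.py's text_column/summary_column expect.
pairs = [
    {"text": "Long source sentence number one.", "summary": "Short target one."},
    {"text": "Long source sentence number two.", "summary": "Short target two."},
]

with open("t.json", "w", encoding="utf-8") as f:
    for pair in pairs:
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```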
task: str = field(
default="summarization",
metadata={
"help": "The name of the task, should be summarization (or summarization_{dataset} for evaluating "
"pegasus) or translation (or translation_{xx}_to_{yy})."
},
)
dataset_name: Optional[str] = field(
default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}
)
dataset_config_name: Optional[str] = field(
default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
)
text_column: Optional[str] = field(
default=None,
metadata={"help": "The name of the column in the datasets containing the full texts (for summarization)."},
)
summary_column: Optional[str] = field(
default=None,
metadata={"help": "The name of the column in the datasets containing the summaries (for summarization)."},
)
train_file: Optional[str] = field(
default=None, metadata={"help": "The input training data file (a jsonlines or csv file)."}
)
validation_file: Optional[str] = field(
default=None,
metadata={
"help": "An optional input evaluation data file to evaluate the metrics (rouge/sacreblue) on "
"(a jsonlines or csv file)."
},
)
test_file: Optional[str] = field(
default=None,
metadata={
"help": "An optional input test data file to evaluate the metrics (rouge/sacreblue) on "
"(a jsonlines or csv file)."
},
)
overwrite_cache: bool = field(
default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
)
preprocessing_num_workers: Optional[int] = field(
default=None,
metadata={"help": "The number of processes to use for the preprocessing."},
)
max_source_length: Optional[int] = field(
default=1024,
metadata={
"help": "The maximum total input sequence length after tokenization. Sequences longer "
"than this will be truncated, sequences shorter will be padded."
},
)
max_target_length: Optional[int] = field(
default=128,
metadata={
"help": "The maximum total sequence length for target text after tokenization. Sequences longer "
"than this will be truncated, sequences shorter will be padded."
},
)
val_max_target_length: Optional[int] = field(
default=None,
metadata={
"help": "The maximum total sequence length for validation target text after tokenization. Sequences longer "
"than this will be truncated, sequences shorter will be padded. Will default to `max_target_length`."
"This argument is also used to override the ``max_length`` param of ``model.generate``, which is used "
"during ``evaluate`` and ``predict``."
},
)
pad_to_max_length: bool = field(
default=False,
metadata={
"help": "Whether to pad all samples to model maximum sentence length. "
"If False, will pad the samples dynamically when batching to the maximum length in the batch. More "
"efficient on GPU but very bad for TPU."
},
)
max_train_samples: Optional[int] = field(
default=None,
metadata={
"help": "For debugging purposes or quicker training, truncate the number of training examples to this "
"value if set."
},
)
max_val_samples: Optional[int] = field(
default=None,
metadata={
"help": "For debugging purposes or quicker training, truncate the number of validation examples to this "
"value if set."
},
)
max_test_samples: Optional[int] = field(
default=None,
metadata={
"help": "For debugging purposes or quicker training, truncate the number of test examples to this "
"value if set."
},
)
source_lang: Optional[str] = field(default=None, metadata={"help": "Source language id for translation."})
target_lang: Optional[str] = field(default=None, metadata={"help": "Target language id for translation."})
num_beams: Optional[int] = field(
default=None,
metadata={
"help": "Number of beams to use for evaluation. This argument will be passed to ``model.generate``, "
"which is used during ``evaluate`` and ``predict``."
},
)
ignore_pad_token_for_loss: bool = field(
default=True,
metadata={
"help": "Whether to ignore the tokens corresponding to padded labels in the loss computation or not."
},
)
source_prefix: Optional[str] = field(
default=None, metadata={"help": "A prefix to add before every source text (useful for T5 models)."}
)
Use Trainer or pytorch lightning, you say? Well, I do not fine-tune all that often...
*Data-to-text looks like an exciting area.
*I should upload the model fine-tuned on 20,000 typo-correction sentence pairs somewhere. The huggingface model hub, perhaps?
*Recent Advances in Language Model Fine-tuning
https://ruder.io/recent-advances-lm-fine-tuning/
"A practical problem when fine-tuning pre-trained models is that performance can vary significantly from run to run, especially on small datasets ~
In addition, when fine-tuning BERT it is recommended to use a small learning rate and to increase the number of epochs."
*Google Colab Pro is now available in Japan. Whether its GPU memory will be enough seems like a close call.
🇧🇷 🇫🇷 🇩🇪 🇮🇳 🇯🇵 🇹🇭 🇬🇧
Colab Pro is now available in Brazil, France, Germany, India, Japan, Thailand, and the United Kingdom.
Sign up at https://t.co/0z7WKH0F35
— Colaboratory (@GoogleColab) March 24, 2021
Author and source
This article was originally published at https://qiita.com/kzuzuo/items/a4fb6db773fa38b43408. Copyright belongs to the original author.