How long does it take to train on ImageNet?

How long to train VGG19 on ImageNet? : computervision

There are 4 TeslaV100 in my server and I can only use 2 of them, and the other 2 were used by others. Now one epoch will take about 2 hours. Is it normal?

Yes, that is normal, imageNet is very big.

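Ouch... I get it, though. So running 90 epochs on 2 V100s would take 2 × 90 / 24 = 7.5 days. With a single GPU it would be roughly two weeks, assuming near-linear scaling. I believe the dataset hosted on Kaggle is about 166 GB, so it feels hopeless in terms of size, time, and cost alike. If it then failed to converge, or the Top-1 accuracy came out lower than expected, I'd want to cry...

Amazon EC2 P3 Instances | AWS

Instance size	GPUs – Tesla V100	GPU Peer to Peer	GPU memory (GB)	vCPU	Memory (GB)	Network bandwidth	EBS bandwidth	On-demand price/hour
p3.2xlarge	1	N/A	16	8	61	Up to 10 Gbps	1.5 Gbps	3.06 USD
p3.8xlarge	4	NVLink	64	32	244	10 Gbps	7 Gbps	12.24 USD

So with p3.8xlarge (4 V100s), for example, it would take about 3.75 days (90 hours) if throughput scales linearly with GPU count, which comes to about 1,100 USD at the on-demand rate. That is a price that makes me rather nervous. With a 3-year reserved instance it drops to roughly a third of that, somewhere under 400 USD, which is where it finally starts to look feasible... Unless you're doing deep learning quite seriously, that's not an investment you can make...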
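To keep the arithmetic honest, here is a small sketch of the estimate as a script. It takes the 2 hours/epoch on 2 V100s from the Reddit thread and the p3.8xlarge on-demand price from the table. The 90-epoch schedule and the linear scaling with GPU count are my assumptions, and `estimate_training` is just a helper name I made up for this note.

```python
def estimate_training(hours_per_epoch, epochs, gpus_measured, gpus_target, usd_per_hour):
    """Return (hours, days, usd) for a full training run.

    Assumes throughput scales linearly with the number of GPUs,
    which is optimistic in practice.
    """
    hours = hours_per_epoch * epochs * gpus_measured / gpus_target
    return hours, hours / 24, hours * usd_per_hour


# 2 h/epoch measured on 2 V100s; 90 epochs is an assumed schedule.
hours_2gpu, days_2gpu, _ = estimate_training(2.0, 90, 2, 2, 0.0)

# Same workload on p3.8xlarge (4 V100s, 12.24 USD/hour on demand).
hours_4gpu, days_4gpu, usd_4gpu = estimate_training(2.0, 90, 2, 4, 12.24)

print(f"2 GPUs: {hours_2gpu:.0f} h = {days_2gpu:.2f} days")
print(f"4 GPUs: {hours_4gpu:.0f} h = {days_4gpu:.2f} days, about {usd_4gpu:.0f} USD")
```

Under these assumptions it works out to 7.5 days on 2 GPUs, and 90 hours (3.75 days) or roughly 1,100 USD on p3.8xlarge. Real multi-GPU scaling is usually worse than linear, so I treat this as a lower bound.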

Hands-on TensorFlow Tutorial: Train ResNet-50 From Scratch Using the ImageNet Dataset | by James Montantes | Towards Data Science

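This article has a sample of training ResNet-50 in TensorFlow. Accuracy climbs sharply from around epoch 30, but up to that point the improvement is fairly gradual. Even so, do you really get an accuracy of around 0.1 more or less from the start?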

https://github.com/pytorch/examples/tree/main/imagenet

When training with PyTorch, this script looks like a useful reference. You still need the compute environment, though...

Looking at "Am I able to train a model with ImageNet in Google Colab? : deeplearning", I wonder whether Colab Pro+ combined with Google One 2TB would be enough...? But as in "keras - Google Colab is so slow while reading images from Google Drive - Stack Overflow", reading the images from Google Drive may turn out to be very slow...

According to "Subsets of ImageNet. ILSVRC 2012, commonly known as… | by Roland Gao | The Startup | Medium":

It requires more than 150GB of storage, and training a resnet50 on it will take around 215 hours using a T4 GPU on Google Colab, not to mention that Colab limits each session to 12 hours.

So that gives some concrete numbers to work with, which is nice.
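As a rough sketch of what the quoted figures mean in practice, the snippet below counts how many Colab sessions 215 hours would take at 12 hours per session. It assumes every session runs the full 12 hours and that nothing is lost to reloading the data or restoring checkpoints, which is optimistic. `colab_sessions` is a hypothetical helper name.

```python
import math


def colab_sessions(total_hours, session_limit_hours):
    """Minimum number of sessions needed, assuming each one runs to the limit."""
    return math.ceil(total_hours / session_limit_hours)


# Figures from the quoted article: 215 h on a T4, 12 h session limit.
sessions = colab_sessions(215, 12)
total_days = 215 / 24

print(f"{sessions} sessions, about {total_days:.1f} days of pure compute")
```

That is 18 back-to-back sessions, so saving a checkpoint before each session ends and resuming from it would be unavoidable.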
