#807 MPI run with up to 3x H100 GPUs works but fails with an "illegal memory access" error when run with "mpirun -np 4 ./train_gpt2cu"

dm1976-git opened 6 months ago

Running "./train_gpt2cu" and "mpirun -np 2 ./train_gpt2cu" are executing fine without any error, but "mpirun -np 4 ./train_gpt2cu" gives the error "[CUDA ERROR] at file train_gpt2.cu:961: an illegal memory access was encountered" on DGXH100 Server.

The same binary, run with "mpirun -np 4 ./train_gpt2cu", works fine on another server with 4x L40S GPUs.
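
For context on how the four-process case differs from the smaller runs: the MPI launch assigns one CUDA device per rank, so "mpirun -np 4" exercises a fourth GPU (and its NVLink/NCCL paths) that the 1-3 process runs never touch. A minimal sketch of that per-rank device selection, assuming the usual rank-to-device mapping (illustrative only, not the exact llm.c code):

```c
// Illustrative sketch (not the exact llm.c source) of how an MPI-launched
// run typically maps one CUDA device to each rank, so "mpirun -np 4"
// brings a fourth GPU into play that the smaller runs never use.
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, num_devices = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&num_devices);
    cudaSetDevice(rank % num_devices);   // assumed mapping: rank i -> GPU i
    printf("rank %d -> GPU %d of %d\n", rank, rank % num_devices, num_devices);
    MPI_Finalize();
    return 0;
}
```

If that mapping holds here, rank 3 would be the first process to touch GPU 3, which lines up with the first failing process reported in the output below being index 3.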

Output of the failed run on the DGX H100 server:

```
(base) webel@dgx05:~/new-llm.c$ mpirun -np 4 ./train_gpt2cu
+-----------------------+----------------------------------------------------+
| Parameter             | Value                                              |
+-----------------------+----------------------------------------------------+
| train data pattern    | dev/data/tinyshakespeare/tiny_shakespeare_train.bin |
| val data pattern      | dev/data/tinyshakespeare/tiny_shakespeare_val.bin  |
| output log dir        | NULL                                               |
| checkpoint_every      | 0                                                  |
| resume                | 0                                                  |
| micro batch size B    | 4                                                  |
| sequence length T     | 1024                                               |
| total batch size      | 16384                                              |
| LR scheduler          | cosine                                             |
| learning rate (LR)    | 3.000000e-04                                       |
| warmup iterations     | 0                                                  |
| final LR fraction     | 1.000000e+00                                       |
| weight decay          | 0.000000e+00                                       |
| skip update lossz     | 0.000000                                           |
| skip update gradz     | 0.000000                                           |
| max_steps             | -1                                                 |
| val_loss_every        | 20                                                 |
| val_max_steps         | 20                                                 |
| sample_every          | 20                                                 |
| genT                  | 64                                                 |
| overfit_single_batch  | 0                                                  |
| use_master_weights    | enabled                                            |
| gelu_fusion           | 0                                                  |
| recompute             | 1                                                  |
+-----------------------+----------------------------------------------------+
| device                | NVIDIA H100 80GB HBM3                              |
| peak TFlops           | 988.8                                              |
| precision             | BF16                                               |
+-----------------------+----------------------------------------------------+
| weight init method    | gpt2_124M_bf16.bin                                 |
| max_sequence_length T | 1024                                               |
| vocab_size V          | 50257                                              |
| padded_vocab_size Vp  | 50304                                              |
| num_layers L          | 12                                                 |
| num_heads NH          | 12                                                 |
| channels C            | 768                                                |
| num_parameters        | 124475904                                          |
+-----------------------+----------------------------------------------------+
| train_num_batches     | 18                                                 |
| val_num_batches       | 20                                                 |
+-----------------------+----------------------------------------------------+
| run hellaswag         | no                                                 |
+-----------------------+----------------------------------------------------+
| Zero Optimization is disabled                                              |
| num_processes         | 4                                                  |
| zero_stage            | 0                                                  |
+-----------------------+----------------------------------------------------+
num_parameters: 124475904 => bytes: 248951808
allocated 237 MiB for model parameters
batch_size B=4 * seq_len T=1024 * num_processes=4 and total_batch_size=16384 => setting grad_accum_steps=1
allocating 237 MiB for parameter gradients
allocating 1326 MiB for activations
allocating 474 MiB for AdamW optimizer state m
allocating 474 MiB for AdamW optimizer state v
allocating 474 MiB for master copy of params
device memory usage: 9666 MiB / 81090 MiB
memory per sequence: 331 MiB -> estimated maximum batch size: 219
val loss 4.569264
step    1/18 | loss 4.332580 (+nanz)| norm 10.5463 (+nanz)| lr 3.00e-04 | 364.32 ms | 0.9% bf16 MFU | 44971 tok/s
step    2/18 | loss 4.976806 (+nanz)| norm 23.1475 (+nanz)| lr 3.00e-04 | 39.84 ms | 8.4% bf16 MFU | 411246 tok/s
step    3/18 | loss 4.164364 (+nanz)| norm 14.1995 (+nanz)| lr 3.00e-04 | 39.58 ms | 8.4% bf16 MFU | 412607 tok/s
step    4/18 | loss 3.963217 (+nanz)| norm 8.6413 (+nanz)| lr 3.00e-04 | 39.63 ms | 8.4% bf16 MFU | 412902 tok/s
step    5/18 | loss 3.536778 (+nanz)| norm 4.0721 (+nanz)| lr 3.00e-04 | 39.57 ms | 8.4% bf16 MFU | 413221 tok/s
[CUDA ERROR] at file train_gpt2.cu:961: an illegal memory access was encountered

Primary job terminated normally, but 1 process returned a non-zero exit code.
Per user-direction, the job has been aborted.

mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[18346,1],3]
  Exit code: 1

(base) webel@dgx05:~/new-llm.c$
```
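
A note on reading the error: the message comes from the error-checking wrapper in train_gpt2.cu, so line 961 is where the failure is detected (at a later CUDA API call or synchronization), not necessarily the kernel that performed the illegal access. Below is a minimal sketch of that kind of check macro (an assumed pattern, not copied from the repo). Rerunning the failing case under compute-sanitizer, or with the environment variable CUDA_LAUNCH_BLOCKING=1 set, usually attributes the error to the actual kernel.

```c
// Sketch of a cudaCheck-style wrapper (assumed, not the exact llm.c code):
// it reports the file/line where the error surfaces, which for an
// asynchronous illegal access is often later than the kernel that caused it.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

static void cuda_check(cudaError_t error, const char *file, int line) {
    if (error != cudaSuccess) {
        printf("[CUDA ERROR] at file %s:%d:\n%s\n",
               file, line, cudaGetErrorString(error));
        exit(EXIT_FAILURE);
    }
}
#define cudaCheck(err) (cuda_check((err), __FILE__, __LINE__))
```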