Clarification on per_device_train_batch_size in Trainer
System Info
- `transformers` version: 4.52.1
- Platform: Linux-5.15.0-1061-nvidia-x86_64-with-glibc2.35
- Python version: 3.10.16
- Huggingface_hub version: 0.30.2
- Safetensors version: 0.5.2
- Accelerate version: 1.7.0
- Accelerate config:
  - compute_environment: LOCAL_MACHINE
  - distributed_type: FSDP
  - mixed_precision: bf16
  - use_cpu: False
  - debug: False
  - num_processes: 8
  - machine_rank: 0
  - num_machines: 1
  - main_process_ip: 10.3.0.43
  - main_process_port: 5678
  - rdzv_backend: static
  - same_network: True
  - main_training_function: main
  - enable_cpu_affinity: False
  - fsdp_config: {'fsdp_activation_checkpointing': True, 'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_cpu_ram_efficient_loading': True, 'fsdp_offload_params': False, 'fsdp_reshard_after_forward': True, 'fsdp_state_dict_type': 'SHARDED_STATE_DICT', 'fsdp_version': 2}
  - downcast_bf16: no
  - tpu_use_cluster: False
  - tpu_use_sudo: False
  - tpu_env: []
- DeepSpeed version: 0.15.3
- PyTorch version (GPU?): 2.6.0+cu124 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: Yes (FSDP via Accelerate, 8 GPUs on a single node)
- Using GPU in script?: Yes
- GPU type: NVIDIA H100 80GB HBM3
Who can help?
@zach-huggingface @muellerzr @SunMarc, can you please help?
Brief summary:
- Trying to train an LLM with a custom collator and an iterable dataset via Accelerate (FSDP)
- The setup is multi-GPU on a single node (8 GPUs)
- I need to compute the max_steps parameter up front because the dataset is iterable
My understanding was:
- "Per device" means per GPU, so if my per_device_train_batch_size is 64 and I have 8 GPUs, the effective batch size should be 512
- Extending that to the number of tokens processed per step with sequence_len = 2048, the total tokens per step should be 512 * 2048 ≈ 1M (assuming gradient_accumulation_steps = 1); see the arithmetic sketch below
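To spell out the arithmetic I am relying on (values taken from my setup above; this snippet is purely illustrative):

```python
# Step-count arithmetic assumed above (gradient_accumulation_steps = 1).
per_device_bs = 64
num_gpus = 8
seq_len = 2048
total_tokens = 10_000_000

effective_bs = per_device_bs * num_gpus          # 512 sequences per optimizer step
tokens_per_step = effective_bs * seq_len         # 1,048,576 (~1M) tokens per step
expected_steps = total_tokens / tokens_per_step  # ~9.5, i.e. roughly 10 steps
```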
Problem:
- I am training a dummy LLM from scratch on 10M tokens
- Given the setup above, training should finish in roughly 10 steps
- However, it takes exactly 8x more steps, which is only possible if the per-device batch size is actually being spread across all GPUs
Note:
- There is no padding involved, since all data is concatenated to exactly sequence_len, i.e. 2048
- There is no sliding window or chunking in this test; chunk size and stride are both set to sequence_len
- Additionally, I logged the tokens seen by my custom collator at each step:

```
batch_size: 64, max_len: 2048
Input shape: torch.Size([64, 2048])
[Step] Tokens this step: 131072
```
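Note that 131072 = 64 * 2048, i.e. the tokens in one collated batch. A minimal sketch of how the per-rank count could be summed across ranks as a sanity check (this is not part of my script; it assumes torch.distributed has already been initialized by Accelerate):

```python
import torch
import torch.distributed as dist

def global_tokens_this_step(local_tokens: int) -> int:
    """Sum per-rank token counts so the log reflects the whole optimizer step."""
    t = torch.tensor([local_tokens], dtype=torch.long)
    if dist.is_available() and dist.is_initialized():
        t = t.cuda()                              # NCCL requires a CUDA tensor
        dist.all_reduce(t, op=dist.ReduceOp.SUM)
    return int(t.item())

# e.g. logger.info(f"[Step] Global tokens this step: {global_tokens_this_step(131072)}")
```

If per_device_train_batch_size really were per GPU, with 8 ranks each collating its own batch, this global count should come out to 8 * 131072 = 1,048,576 tokens per step.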
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
With the same setup, train any LLM on a single node.
Sharing my dataset and data collator code for reference:
```python
import logging

import torch
from datasets import load_dataset
from torch import Tensor
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import IterableDataset

logger = logging.getLogger(__name__)


class CustomDataset(IterableDataset):
    """
    Custom Dataset class for GPT model.

    Input:
        data_files: dict mapping split name to the list of files, e.g.
            {
                "train": ["file1", "file2", ...],
                "validation": ["file1", "file2", ...],
            }
        split: dataset split (train or validation)
        chunk_len: max sequence length the model can handle
        stride: analogous to window size
        tokenizer: a tiktoken-based tokenizer
    """

    def __init__(
        self,
        data_files: dict,
        split: str,
        chunk_len: int,
        stride: int,
        tokenizer,
    ):
        self.data = load_dataset(
            'json',
            data_files=data_files,
            streaming=True,
            split=split,
        )
        self.chunk_len = chunk_len
        self.stride = stride
        self.tokenizer = tokenizer

    def __iter__(self):
        # Add sequences to a buffer -> fewer padding tokens
        buffer = []
        last_file = None
        for example in self.data:
            current_file = example.get('file_name')
            if current_file is not None and current_file != last_file:
                logger.info(f'Processing file: {current_file}')
                last_file = current_file

            sequence_ids = example['token_ids']

            # Inject BOS and EOS
            buffer.append(self.tokenizer.bos_id)
            buffer.extend(sequence_ids)
            buffer.append(self.tokenizer.eos_id)

            while len(buffer) >= self.chunk_len:
                chunk = buffer[:self.chunk_len]
                buffer = buffer[self.stride:]  # slide the window
                yield {'input_ids': torch.tensor(chunk, dtype=torch.long)}


class PretrainCollator:
    """
    Collator for variable-length pretraining sequences.
    Pads to the batch's max length, builds attention masks,
    and uses `ignore_index` for label padding.
    """

    def __init__(self, tokenizer, ignore_index: int = -100):
        self.tokenizer = tokenizer
        self.ignore_index = ignore_index
        self.total_seen_samples = 0
        self.total_tokens_seen = 0

    def __call__(self, batch: list[dict[str, Tensor]]) -> dict[str, Tensor]:
        self.total_seen_samples += len(batch)
        self.total_tokens_seen += sum(len(item['input_ids']) for item in batch)

        # 1) collect all input-id sequences
        sequences: list[Tensor] = [item['input_ids'] for item in batch]

        # 2) pad inputs (pad with pad_id) and labels (pad with ignore_index)
        inputs_padded = pad_sequence(
            sequences,
            batch_first=True,
            padding_value=self.tokenizer.pad_id,
        )
        labels_padded = pad_sequence(
            sequences,
            batch_first=True,
            padding_value=self.ignore_index,
        )

        # 3) build attention mask (1 for real tokens, 0 for padding)
        attention_mask = (inputs_padded != self.tokenizer.pad_id).long()

        return {
            'input_ids': inputs_padded,
            'attention_mask': attention_mask,
            'labels': labels_padded,
        }
```
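A minimal sketch of how these pieces could be wired into Trainer, with max_steps derived from the token budget (illustrative only; `model`, `tokenizer`, and the file list here are placeholders, not taken from my actual script):

```python
# Illustrative wiring only; `model`, `tokenizer`, and the file list are placeholders.
from transformers import Trainer, TrainingArguments

SEQ_LEN = 2048
PER_DEVICE_BS = 64
NUM_GPUS = 8
GRAD_ACC = 1
TOTAL_TOKENS = 10_000_000

tokens_per_step = PER_DEVICE_BS * NUM_GPUS * GRAD_ACC * SEQ_LEN  # 1,048,576
max_steps = max(1, TOTAL_TOKENS // tokens_per_step)              # 9-10 steps for this budget

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=PER_DEVICE_BS,
    gradient_accumulation_steps=GRAD_ACC,
    max_steps=max_steps,
    bf16=True,
    logging_steps=1,
)

train_dataset = CustomDataset(
    data_files={"train": ["file1.jsonl"]},  # placeholder file list
    split="train",
    chunk_len=SEQ_LEN,
    stride=SEQ_LEN,                         # stride == chunk_len -> no overlap, as in my test
    tokenizer=tokenizer,
)

trainer = Trainer(
    model=model,                            # model construction omitted
    args=args,
    train_dataset=train_dataset,
    data_collator=PretrainCollator(tokenizer),
)
trainer.train()
```

With this configuration I would expect the 10M-token budget to be consumed in about 10 optimizer steps.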
Expected behavior
Training should finish in roughly 10 steps (10M tokens at ~1M tokens per optimizer step).