Clarification on per_device_train_batch_size in Trainer
System Info
- `transformers` version: 4.52.1
- Platform: Linux-5.15.0-1061-nvidia-x86_64-with-glibc2.35
- Python version: 3.10.16
- Huggingface_hub version: 0.30.2
- Safetensors version: 0.5.2
- Accelerate version: 1.7.0
- Accelerate config:
  - compute_environment: LOCAL_MACHINE
  - distributed_type: FSDP
  - mixed_precision: bf16
  - use_cpu: False
  - debug: False
  - num_processes: 8
  - machine_rank: 0
  - num_machines: 1
  - main_process_ip: 10.3.0.43
  - main_process_port: 5678
  - rdzv_backend: static
  - same_network: True
  - main_training_function: main
  - enable_cpu_affinity: False
  - fsdp_config: {'fsdp_activation_checkpointing': True, 'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_cpu_ram_efficient_loading': True, 'fsdp_offload_params': False, 'fsdp_reshard_after_forward': True, 'fsdp_state_dict_type': 'SHARDED_STATE_DICT', 'fsdp_version': 2}
  - downcast_bf16: no
  - tpu_use_cluster: False
  - tpu_use_sudo: False
  - tpu_env: []
- DeepSpeed version: 0.15.3
- PyTorch version (GPU?): 2.6.0+cu124 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: Yes (FSDP via Accelerate, 8 GPUs on a single node)
- Using GPU in script?: Yes
- GPU type: NVIDIA H100 80GB HBM3
Who can help?
@zach-huggingface @muellerzr @SunMarc, can you please help?
Brief summary:
- Trying to train an LLM with a custom collator and an iterable dataset via Accelerate (FSDP)
- The setup is multi-GPU on a single node (8 GPUs)
- I need to compute the max_steps parameter up front because the dataset is iterable
My understanding was:
- "Per device" means per GPU, so if my per_device_train_batch_size is 64 and I have 8 GPUs, the effective batch size should be 512
- Extending that to the number of tokens processed per step with sequence_len = 2048, the total tokens per step should be 512 * 2048 ≈ 1M (assuming gradient_accumulation_steps = 1); see the arithmetic sketch below
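To spell out the arithmetic I am relying on (values taken from my setup above; this snippet is purely illustrative):

```python
# Step-count arithmetic assumed above (gradient_accumulation_steps = 1).
per_device_bs = 64
num_gpus = 8
seq_len = 2048
total_tokens = 10_000_000

effective_bs = per_device_bs * num_gpus          # 512 sequences per optimizer step
tokens_per_step = effective_bs * seq_len         # 1,048,576 (~1M) tokens per step
expected_steps = total_tokens / tokens_per_step  # ~9.5, i.e. roughly 10 steps
```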
Problem:
- I am training a dummy LLM from scratch on 10M tokens
- Given the setup above, training should finish in roughly 10 steps
- However, it takes exactly 8x more steps, which is only possible if the per-device batch size is actually being spread across all GPUs
Note:
- There is no padding involved, since all data is concatenated to exactly sequence_len, i.e. 2048
- There is no sliding window or chunking in this test; chunk size and stride are both set to sequence_len
- Additionally, I logged the tokens seen by my custom collator at each step:

```
batch_size: 64, max_len: 2048
Input shape: torch.Size([64, 2048])
[Step] Tokens this step: 131072
```
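Note that 131072 = 64 * 2048, i.e. the tokens in one collated batch. A minimal sketch of how the per-rank count could be summed across ranks as a sanity check (this is not part of my script; it assumes torch.distributed has already been initialized by Accelerate):

```python
import torch
import torch.distributed as dist

def global_tokens_this_step(local_tokens: int) -> int:
    """Sum per-rank token counts so the log reflects the whole optimizer step."""
    t = torch.tensor([local_tokens], dtype=torch.long)
    if dist.is_available() and dist.is_initialized():
        t = t.cuda()                              # NCCL requires a CUDA tensor
        dist.all_reduce(t, op=dist.ReduceOp.SUM)
    return int(t.item())

# e.g. logger.info(f"[Step] Global tokens this step: {global_tokens_this_step(131072)}")
```

If per_device_train_batch_size really were per GPU, with 8 ranks each collating its own batch, this global count should come out to 8 * 131072 = 1,048,576 tokens per step.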
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
With the same setup, train any LLM on a single node.
Sharing my dataset and data collator code for reference:
```python
import logging

import torch
from datasets import load_dataset
from torch import Tensor
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import IterableDataset

logger = logging.getLogger(__name__)


class CustomDataset(IterableDataset):
    """
    Custom Dataset class for GPT model.

    Input:
        data_files: dict mapping split name to the list of files, e.g.
            {
                "train": ["file1", "file2", ...],
                "validation": ["file1", "file2", ...],
            }
        split: dataset split (train or validation)
        chunk_len: max sequence length the model can handle
        stride: analogous to window size
        tokenizer: a tiktoken-based tokenizer
    """

    def __init__(
        self,
        data_files: dict,
        split: str,
        chunk_len: int,
        stride: int,
        tokenizer,
    ):
        self.data = load_dataset(
            'json',
            data_files=data_files,
            streaming=True,
            split=split,
        )
        self.chunk_len = chunk_len
        self.stride = stride
        self.tokenizer = tokenizer

    def __iter__(self):
        # Add sequences to a buffer -> fewer padding tokens
        buffer = []
        last_file = None
        for example in self.data:
            current_file = example.get('file_name')
            if current_file is not None and current_file != last_file:
                logger.info(f'Processing file: {current_file}')
                last_file = current_file

            sequence_ids = example['token_ids']

            # Inject BOS and EOS
            buffer.append(self.tokenizer.bos_id)
            buffer.extend(sequence_ids)
            buffer.append(self.tokenizer.eos_id)

            while len(buffer) >= self.chunk_len:
                chunk = buffer[:self.chunk_len]
                buffer = buffer[self.stride:]  # slide the window
                yield {'input_ids': torch.tensor(chunk, dtype=torch.long)}


class PretrainCollator:
    """
    Collator for variable-length pretraining sequences.
    Pads to the batch's max length, builds attention masks,
    and uses `ignore_index` for label padding.
    """

    def __init__(self, tokenizer, ignore_index: int = -100):
        self.tokenizer = tokenizer
        self.ignore_index = ignore_index
        self.total_seen_samples = 0
        self.total_tokens_seen = 0

    def __call__(self, batch: list[dict[str, Tensor]]) -> dict[str, Tensor]:
        self.total_seen_samples += len(batch)
        self.total_tokens_seen += sum(len(item['input_ids']) for item in batch)

        # 1) collect all input-id sequences
        sequences: list[Tensor] = [item['input_ids'] for item in batch]

        # 2) pad inputs (pad with pad_id) and labels (pad with ignore_index)
        inputs_padded = pad_sequence(
            sequences,
            batch_first=True,
            padding_value=self.tokenizer.pad_id,
        )
        labels_padded = pad_sequence(
            sequences,
            batch_first=True,
            padding_value=self.ignore_index,
        )

        # 3) build attention mask (1 for real tokens, 0 for padding)
        attention_mask = (inputs_padded != self.tokenizer.pad_id).long()

        return {
            'input_ids': inputs_padded,
            'attention_mask': attention_mask,
            'labels': labels_padded,
        }
```
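A minimal sketch of how these pieces could be wired into Trainer, with max_steps derived from the token budget (illustrative only; `model`, `tokenizer`, and the file list here are placeholders, not taken from my actual script):

```python
# Illustrative wiring only; `model`, `tokenizer`, and the file list are placeholders.
from transformers import Trainer, TrainingArguments

SEQ_LEN = 2048
PER_DEVICE_BS = 64
NUM_GPUS = 8
GRAD_ACC = 1
TOTAL_TOKENS = 10_000_000

tokens_per_step = PER_DEVICE_BS * NUM_GPUS * GRAD_ACC * SEQ_LEN  # 1,048,576
max_steps = max(1, TOTAL_TOKENS // tokens_per_step)              # 9-10 steps for this budget

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=PER_DEVICE_BS,
    gradient_accumulation_steps=GRAD_ACC,
    max_steps=max_steps,
    bf16=True,
    logging_steps=1,
)

train_dataset = CustomDataset(
    data_files={"train": ["file1.jsonl"]},  # placeholder file list
    split="train",
    chunk_len=SEQ_LEN,
    stride=SEQ_LEN,                         # stride == chunk_len -> no overlap, as in my test
    tokenizer=tokenizer,
)

trainer = Trainer(
    model=model,                            # model construction omitted
    args=args,
    train_dataset=train_dataset,
    data_collator=PretrainCollator(tokenizer),
)
trainer.train()
```

With this configuration I would expect the 10M-token budget to be consumed in about 10 optimizer steps.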
Expected behavior
Training should finish in roughly 10 steps (10M tokens at ~1M tokens per optimizer step).