#39593 `gemma-3-1b-it` with `use_cache=True` and `past_key_values` throws `RuntimeError: CUDA error: device-side assert` error
System Info
(dev) nicholas@B306177:chatbot-utils(master)$ transformers env
Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.
- `transformers` version: 4.52.4
- Platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
- Python version: 3.11.12
- Huggingface_hub version: 0.33.0
- Safetensors version: 0.5.3
- Accelerate version: 1.8.0
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (GPU?): 2.7.0+cu126 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?:
- Using GPU in script?:
- GPU type: NVIDIA RTX 6000 Ada Generation
(dev) nicholas@B306177:chatbot-utils(master)$
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I am trying to use `gemma-3-1b-it` with `use_cache=True`. The following snippet, which sets `use_cache=False`, runs without errors.
```python
from transformers.cache_utils import HybridCache
from transformers.models.auto.modeling_auto import AutoModelForCausalLM
from transformers.models.auto.tokenization_auto import AutoTokenizer
from transformers.models.gemma.tokenization_gemma_fast import (
    GemmaTokenizerFast,
)
from transformers.models.gemma3.modeling_gemma3 import Gemma3ForCausalLM
import torch


def stream(
    model: Gemma3ForCausalLM, tokenizer: GemmaTokenizerFast, prompt: str
):
    input_ids = tokenizer.encode(prompt)
    input_ids = torch.tensor(
        input_ids, device=model.device, dtype=torch.long
    ).unsqueeze(0)
    attention_mask = torch.ones_like(
        input_ids, device=model.device, dtype=torch.long
    )
    eos_token_id = [tokenizer.eos_token_id, 106]
    for _ in range(100):
        with torch.no_grad():
            outputs = model.forward(
                input_ids=input_ids,  # type: ignore
                attention_mask=attention_mask,
                use_cache=False,
            )
        logits = outputs.logits
        assert logits is not None
        next_token = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
        token_id = next_token.item()
        if eos_token_id is not None and token_id in eos_token_id:
            break
        print(tokenizer.decode(token_id), end="", flush=True)
        while len(next_token.shape) < len(input_ids.shape):
            next_token = next_token.unsqueeze(0)
        input_ids = torch.concat((input_ids, next_token), dim=-1)
        attention_mask = torch.ones_like(
            input_ids, device=model.device, dtype=torch.long
        )
    print()


stream(model, tokenizer, "How do I add two ints in python?. Give a short answer.")
```

Here is the response:

```text
'''python
a = 10
b = 20
sum = a + b
print(sum)
'''
Output: 30
The code adds the integers `a` and `b` and stores the result in the variable `sum`. Finally, it prints the value of `sum`.
```
However, if I set `use_cache=True` and pass the returned `past_key_values` back into the model, I get the following error.
```python
from transformers.cache_utils import HybridCache
from transformers.models.auto.modeling_auto import AutoModelForCausalLM
from transformers.models.auto.tokenization_auto import AutoTokenizer
from transformers.models.gemma.tokenization_gemma_fast import (
    GemmaTokenizerFast,
)
from transformers.models.gemma3.modeling_gemma3 import Gemma3ForCausalLM
import torch


def stream_with_cache(
    model: Gemma3ForCausalLM, tokenizer: GemmaTokenizerFast, prompt: str
):
    input_ids = tokenizer.encode(prompt)
    input_ids = torch.tensor(
        input_ids, device=model.device, dtype=torch.long
    ).unsqueeze(0)
    attention_mask = torch.ones_like(
        input_ids, device=model.device, dtype=torch.long
    )
    past_key_values = None
    eos_token_id = [tokenizer.eos_token_id, 106]
    for _ in range(100):
        with torch.no_grad():
            outputs = model.forward(
                input_ids=input_ids,  # type: ignore
                attention_mask=attention_mask,
                use_cache=True,
                past_key_values=past_key_values
            )
        logits = outputs.logits
        assert logits is not None
        past_key_values = outputs.past_key_values
        assert isinstance(past_key_values, HybridCache)
        next_token = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
        token_id = next_token.item()
        if eos_token_id is not None and token_id in eos_token_id:
            break
        print(tokenizer.decode(token_id), end="", flush=True)
        while len(next_token.shape) < len(input_ids.shape):
            next_token = next_token.unsqueeze(0)
        input_ids = next_token
        attention_mask = None
    print()


model = AutoModelForCausalLM.from_pretrained("google/gemma-3-1b-it").to("cuda:0")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")
stream_with_cache(model, tokenizer, "Hello how are you")
```

Here is the error:

```text
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:175: operator(): block: [0,0,0], thread: [64,0,0] Assertion `idx >= 0 && idx < self_dim_size && "index_copy_(): index out of bounds"` failed.
[... the same assertion is reported once for every thread from [0,0,0] to [127,0,0] in block [0,0,0], 128 lines in total ...]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 28, in stream_with_cache
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
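For readers unfamiliar with the assertion in the log above: it reports that an `index_copy_` call received an index outside the destination dimension (`idx >= 0 && idx < self_dim_size` failed). This report does not establish where in `transformers` that call happens. As a purely illustrative sketch, the toy class below (`FixedSizeCache` is a hypothetical name and is not the library's `HybridCache`) shows how a preallocated buffer rejects a write at a position beyond its length, which is the general shape of this kind of failure.

```python
class FixedSizeCache:
    """Hypothetical toy buffer with a fixed number of slots.

    This is NOT transformers' HybridCache; it only mimics the bounds check
    quoted in the assertion message.
    """

    def __init__(self, max_len: int) -> None:
        self.max_len = max_len
        self.slots = [None] * max_len

    def index_copy(self, positions: list[int], values: list[float]) -> None:
        # Same condition as the assertion: 0 <= idx < size of the dimension.
        for idx in positions:
            if not (0 <= idx < self.max_len):
                raise IndexError(
                    f"index_copy: index {idx} out of bounds for size {self.max_len}"
                )
        for idx, value in zip(positions, values):
            self.slots[idx] = value


prompt_len = 4
cache = FixedSizeCache(max_len=prompt_len)

# Writing the prompt positions 0..3 fits the buffer.
cache.index_copy([0, 1, 2, 3], [0.1, 0.2, 0.3, 0.4])

# Writing the next position (4) does not fit a buffer of size 4.
try:
    cache.index_copy([prompt_len], [0.5])
    overflowed = False
except IndexError as exc:
    overflowed = True
    message = str(exc)
```

On a GPU the equivalent check fires inside the kernel, which is why the log shows one assertion line per thread instead of a Python exception at the offending call.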
Expected behavior
The expected behavior is that I can pass `past_key_values` back into the model so that the keys and values do not need to be recalculated when using the model autoregressively. I have verified that the above works with "deepseek-ai/deepseek-coder-1.3b-instruct", but it does not work with "google/gemma-3-1b-it".
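To make the expectation concrete, here is a minimal, self-contained sketch of why reusing cached keys and values should give the same outputs as recomputing them. It is a toy single-head attention in plain Python: `project_kv`, its constant weights, and both decode functions are hypothetical stand-ins, not Gemma 3's actual attention (which also uses sliding-window layers and learned projections).

```python
import math


def project_kv(token):
    """Stand-in for a layer's key/value projections (made-up constant weights)."""
    key = [0.5 * x for x in token]
    value = [2.0 * x for x in token]
    return key, value


def attend(query, keys, values):
    """Scaled dot-product attention of one query over all keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the maximum for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)
    ]


def decode_without_cache(tokens):
    """Recompute keys and values for the whole prefix at every step."""
    outputs, projections = [], 0
    for end in range(1, len(tokens) + 1):
        pairs = [project_kv(t) for t in tokens[:end]]
        projections += len(pairs)
        keys = [k for k, _ in pairs]
        values = [v for _, v in pairs]
        outputs.append(attend(tokens[end - 1], keys, values))
    return outputs, projections


def decode_with_cache(tokens):
    """Project only the newest token and reuse the stored keys and values."""
    keys, values, outputs, projections = [], [], [], 0
    for token in tokens:
        key, value = project_kv(token)
        projections += 1
        keys.append(key)
        values.append(value)
        outputs.append(attend(token, keys, values))
    return outputs, projections


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
full_outputs, full_projections = decode_without_cache(tokens)
cached_outputs, cached_projections = decode_with_cache(tokens)
```

With these four tokens the uncached loop performs 10 key/value projections and the cached loop performs 4, while both produce the same per-step outputs. That equivalence is what the report expects from `use_cache=True` with `past_key_values`.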