Please support loading Qwen 2.5 VL from GGUF

#40049

Feature request · opened 6 days ago by ihendley · no assignee

Feature request

The new Qwen Image model uses Qwen 2.5 VL 7B as its text encoder. Given memory constraints, some users may want to load both a quantized transformer and a quantized text encoder for a diffusers QwenImagePipeline, for example:

import torch
from diffusers import QwenImagePipeline, QwenImageTransformer2DModel, GGUFQuantizationConfig
from transformers import AutoModelForCausalLM

transformer = QwenImageTransformer2DModel.from_single_file(
    "https://huggingface.co/QuantStack/Qwen-Image-GGUF/blob/main/Qwen_Image-Q4_K_M.gguf",
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
    config="Qwen/Qwen-Image",
    subfolder="transformer",
)
text_encoder = AutoModelForCausalLM.from_pretrained(
    "unsloth/Qwen2.5-VL-7B-Instruct-GGUF",
    gguf_file="Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf",
    torch_dtype=torch.bfloat16,
)
pipe = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image",
    transformer=transformer,
    text_encoder=text_encoder,
    torch_dtype=torch.bfloat16,
)

However, this currently fails with the error:

ValueError: GGUF model with architecture qwen2vl is not supported yet.
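If it helps with triage: as far as I can tell, the error is raised in transformers' GGUF loading path (modeling_gguf_pytorch_utils.py), which dispatches on the general.architecture metadata key of the GGUF file and fails when that architecture has no registered mapping. Here is a minimal sketch to confirm what that key contains, using the standalone gguf package that transformers itself relies on for GGUF loading; the field-decoding pattern is my assumption about the GGUFReader API:

from gguf import GGUFReader

# Inspect the metadata key the transformers GGUF loader dispatches on.
reader = GGUFReader("Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf")
field = reader.fields["general.architecture"]
# String fields keep their bytes in `parts`, indexed via `data`.
print(field.parts[field.data[0]].tobytes().decode("utf-8"))  # expected: "qwen2vl"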

Motivation

As described above, Qwen 2.5 VL 7B is the text encoder for the new state-of-the-art Qwen Image model. When building a QwenImagePipeline, diffusers will either download and load the full unquantized Qwen 2.5 VL 7B (~15 GB) or accept a transformers model as the text_encoder argument, so being able to pass a GGUF-quantized model here would save a significant amount of memory.
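In the meantime, a bitsandbytes 4-bit load can at least cut the text encoder's GPU memory, though it still downloads the full-precision weights first, so GGUF support would remain valuable. A sketch, assuming the pipeline's text encoder class is Qwen2_5_VLForConditionalGeneration (which is what the Qwen/Qwen-Image repo appears to ship in its text_encoder subfolder):

import torch
from diffusers import QwenImagePipeline
from transformers import BitsAndBytesConfig, Qwen2_5_VLForConditionalGeneration

# Quantize the text encoder to 4-bit on load; this reduces GPU memory
# but still downloads the full bf16 weights first.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
text_encoder = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen-Image",
    subfolder="text_encoder",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)
pipe = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image",
    text_encoder=text_encoder,
    torch_dtype=torch.bfloat16,
)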

Your contribution

With some help getting started and support along the way, I could attempt a PR. However, it might be quicker if someone with more experience takes the lead.
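For anyone picking this up: my (unverified) understanding is that the per-architecture GGUF mappings live in transformers' integrations/ggml.py, so supporting qwen2vl would presumably mean extending those mappings. A hypothetical sketch of the kind of change involved; the dict name and key set below are assumptions modeled on how I understand text-only architectures like qwen2 are registered, and a real PR would also need tensor-name and tokenizer mappings checked against an actual qwen2vl GGUF dump:

from transformers.integrations.ggml import GGUF_CONFIG_MAPPING

# Hypothetical: map GGUF config keys (as written by llama.cpp converters)
# to transformers config attributes for the qwen2vl architecture.
GGUF_CONFIG_MAPPING["qwen2vl"] = {
    "context_length": "max_position_embeddings",
    "block_count": "num_hidden_layers",
    "feed_forward_length": "intermediate_size",
    "embedding_length": "hidden_size",
    "attention.head_count": "num_attention_heads",
    "attention.head_count_kv": "num_key_value_heads",
    "attention.layer_norm_rms_epsilon": "rms_norm_eps",
    "vocab_size": "vocab_size",
}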