[Feature] Support CuteDSL BF16 GEMM kernels

#9631
Fridge003 opened 6 days ago

Checklist

Motivation

Currently, all BF16 GEMM kernels in SGLang are invoked through torch.nn.functional.linear, which dispatches to cuBLAS kernels underneath. On B200, CuteDSL might provide BF16 GEMM kernels that are faster than cuBLAS. One candidate implementation:

https://github.com/Dao-AILab/quack/blob/main/quack/dense_gemm_sm100.py
