Add qd8-qb4w-bf16 scalar kernels#9960

Open
GregoryComer wants to merge 2 commits into google:master from GregoryComer:qd8-qb4w-bf16-scalar

@GregoryComer
Contributor

Add scalar qd8-qb4w-bf16 kernels. The change is mainly an output conversion, so the kernels are very similar to the existing qd8-qb4w-f16/f32 ones. Performance is within ~5% of scalar fp16. The scalar rounding logic is strictly correct for NaNs and infinities and rounds to nearest even, so it is a little expensive. On M4 Max, the f16 kernels beat bf16 slightly because they use native scalar fp16 arithmetic in the default build config.
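For readers unfamiliar with the output conversion mentioned above, here is a minimal sketch of a scalar float32 to bfloat16 conversion that rounds to nearest even and keeps NaNs and infinities intact. It is an illustration only: the function name `f32_to_bf16_rne` is hypothetical, and the code is not taken from this PR's kernels, which may structure the rounding differently.

```c
#include <stdint.h>
#include <string.h>

// Hypothetical helper (not from the PR): convert an IEEE-754 binary32 value
// to bfloat16 bits using round-to-nearest, ties-to-even.
static inline uint16_t f32_to_bf16_rne(float value) {
  uint32_t bits;
  memcpy(&bits, &value, sizeof(bits));  // read the bit pattern without UB

  // NaN: exponent all ones and a non-zero mantissa. Truncate and force the
  // quiet bit so a payload in the low 16 bits cannot turn into an infinity.
  if ((bits & UINT32_C(0x7FFFFFFF)) > UINT32_C(0x7F800000)) {
    return (uint16_t) ((bits >> 16) | UINT16_C(0x0040));
  }

  // Round to nearest even: add 0x7FFF plus the lowest bit that is kept.
  // Exact ties round up only when the kept mantissa is odd. Infinities pass
  // through unchanged, and finite values that overflow round to infinity.
  const uint32_t lsb = (bits >> 16) & UINT32_C(1);
  bits += UINT32_C(0x7FFF) + lsb;
  return (uint16_t) (bits >> 16);
}
```

The extra NaN branch and the tie-breaking add are the kind of per-element work that makes a strictly correct scalar conversion more expensive than a plain 16-bit truncation.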

Benchmarks

M4 Max

| Kernel | BL | 128/128/1024 | 128/4096/1024 | 128/11008/4096 |
|---|---|---|---|---|
| scalar 1x4 bf16 | 32 | 39.3 | 39.4 | 39.6 |
| scalar 1x4 f16 | 32 | 43.1 | 43.0 | 43.4 |
| scalar 1x4 bf16 | 256 | 48.9 | 49.4 | 50.0 |
| scalar 1x4 f16 | 256 | 50.3 | 50.0 | 50.7 |
| scalar 4x4 bf16 | 32 | 28.0 | 28.1 | 28.2 |
| scalar 4x4 f16 | 32 | 29.7 | 29.4 | 29.8 |
| scalar 4x4 bf16 | 256 | 31.8 | 31.8 | 32.0 |
| scalar 4x4 f16 | 256 | 32.0 | 32.1 | 32.1 |

Cortex-X1 (Pixel 7 Pro)

| Kernel | BL | 128/128/1024 | 128/4096/1024 | 128/11008/4096 |
|---|---|---|---|---|
| scalar 1x4 bf16 | 32 | 19.3 | 19.1 | 19.4 |
| scalar 1x4 f16 | 32 | 19.9 | 19.7 | 20.2 |
| scalar 1x4 bf16 | 256 | 25.1 | 25.1 | 25.3 |
| scalar 1x4 f16 | 256 | 23.3 | 23.3 | 23.8 |
| scalar 4x4 bf16 | 32 | 11.4 | 11.3 | 11.3 |
| scalar 4x4 f16 | 32 | 12.0 | 12.0 | 12.0 |
| scalar 4x4 bf16 | 256 | 13.4 | 13.3 | 13.4 |
| scalar 4x4 f16 | 256 | 12.9 | 12.8 | 12.9 |

Broadwell

| Kernel | BL | 128/128/1024 | 128/4096/1024 | 128/11008/4096 |
|---|---|---|---|---|
| scalar 1x4 bf16 | 32 | 3.8 | 3.8 | 3.7 |
| scalar 1x4 f16 | 32 | 3.5 | 3.7 | 3.8 |
| scalar 1x4 bf16 | 256 | 3.8 | 3.7 | 3.8 |
| scalar 1x4 f16 | 256 | 3.8 | 3.9 | 3.9 |
| scalar 4x4 bf16 | 32 | 3.9 | 4.0 | 4.1 |
| scalar 4x4 f16 | 32 | 4.1 | 4.0 | 3.8 |
| scalar 4x4 bf16 | 256 | 4.1 | 4.1 | 4.3 |
| scalar 4x4 f16 | 256 | 4.4 | 4.2 | 4.0 |
| f16 4x8c8 avx2 | 32 | 37.0 | 37.8 | 38.8 |
| f16 4x8c8 avx2 | 256 | 55.7 | 52.3 | 52.0 |

@GregoryComer marked this pull request as ready for review on April 13, 2026 23:06
@GregoryComer force-pushed the qd8-qb4w-bf16-scalar branch from 597b894 to 2b8ee89 on April 14, 2026 21:03
@GregoryComer
Contributor Author

I regenerated the kernel build sources, so CI should pass now. Verified locally on x86 Linux and ARM Mac.

@GregoryComer changed the title from "Qd8 qb4w bf16 scalar" to "Add qd8-qb4w-bf16 scalar kernels" on Apr 16, 2026
@GregoryComer
Contributor Author

@dsharlet Could you retrigger CI when you have time? Not urgent.

@GregoryComer force-pushed the qd8-qb4w-bf16-scalar branch from 2b8ee89 to 032143b on May 8, 2026 22:10