Official PyTorch implementation of "GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance" (ICML 2025)
Updated Apr 13, 2026 - Python
Open quantization tooling for TurboQuant-style low-bit LLM releases, stock GGUF deployment, and Apple Silicon runtime experiments.
[CAAI AIR'24] Minimize Quantization Output Error with Bias Compensation
A high-performance, memory-efficient healthcare framework that deploys fine-tuned Large Language Models (LLMs) on edge devices. Multi-agent system to provide personalized diagnostic reasoning, health education, and dietary planning.
A deeper investigation of the TurboQuant algorithms
Ternary quantization for LLMs: implements balanced ternary (T3_K) weights for 2.63-bit quantization, described by its authors as the first working solution for modern large language models.
Shift-based post-training quantization analysis for LLMs (ShiftQuant paper)
A tool for making GGUF files quickly
LLM quantization project built around `llama.cpp` + `Ollama` + `GGUF`
Production-grade LLM quantization, benchmarking, and edge deployment toolkit. Supports bitsandbytes INT8/INT4, GPTQ (Hessian calibration), AWQ (activation-aware), and GGUF (Q2_K–Q8_0). Four-dimensional benchmarking: perplexity, TPS/TTFT, VRAM profiling, and LLM-as-Judge quality scoring. RTX 5090 Blackwell sm_120 ready.
Paired capability-level GGUF quantization fragility benchmark across Qwen2.5-3B and SmolLM2 1.7B.
PentaNet extends BitNet's ternary quantization to pentanary {-2,-1,0,+1,+2}, improving perplexity by 6.4% at 124M params while preserving zero-multiplier arithmetic.
Local & lightweight LLM inference runtime in C++ with support for GGUF & quantization
GWIQ-Atlas is a brain-atlasing and model-interpretability suite that combines per-layer census, compliance behaviour tracing, SAE features, and quantization analyses for LLMs.
A high-performance inference engine optimized for deploying quantized LLMs on edge devices. Focuses on SIMD optimizations and memory management.
OpenVINO Model Manager — desktop GUI for Intel Arc
Implemented and fine-tuned BERT for a custom sequence classification task, leveraging LoRA adapters for efficient parameter updates and 4-bit quantization to optimize performance and resource utilization.
Implementation of advanced Natural Language Processing architectures and optimization techniques, built from scratch. The projects focus on understanding the internal mechanics of Transformers, LLM efficiency through quantization, and scaling via Mixture-of-Experts (MoE).