I designed a bfloat16/FP8 alternative in a week using LLMs
Posted by k1832 4 hours ago
Comments
Comment by LuxBennu 4 hours ago
Comment by k1832 3 hours ago
Regarding GGUF Q8_0: I haven't benchmarked against it yet. My focus so far has been on proving the hardware thesis (RTL synthesis via SkyWater 130nm) and validating the numerics/convergence via PyTorch QAT.
Bridging this into the ggml/llama.cpp ecosystem to run standard LLM benchmarks is absolutely the next logical step. Getting this to run efficiently in software (simulating the hardware behavior) to compare against Q8_0 is something I'm looking into next.
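For anyone curious what "simulating the hardware behavior in software" looks like in practice, here is a minimal sketch. The actual custom format isn't described in this thread, so this assumes a generic IEEE-style E4M3 layout (4 exponent bits, 3 mantissa bits, bias 7, top exponent reserved, saturating overflow); the function name and parameters are illustrative, not the author's implementation.

```python
import math

def fp8_e4m3_round(x, exp_bits=4, man_bits=3, bias=7):
    """Round x to the nearest value representable in a generic
    E4M3-style FP8 format (a hypothetical stand-in for the custom format)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = abs(x)
    # Unbiased exponent of the power of two at or below a.
    e = math.floor(math.log2(a))
    e_min = 1 - bias                      # smallest normal exponent
    e_max = (1 << exp_bits) - 2 - bias    # largest normal exponent
    if e < e_min:
        # Subnormal range: fixed quantum of 2^(e_min - man_bits).
        q = 2.0 ** (e_min - man_bits)
        return sign * round(a / q) * q
    e = min(e, e_max)
    # Mantissa step size at this exponent.
    q = 2.0 ** (e - man_bits)
    v = round(a / q) * q
    # Saturate at the largest finite value instead of overflowing.
    max_val = (2.0 - 2.0 ** (-man_bits)) * 2.0 ** e_max
    return sign * min(v, max_val)
```

In a QAT training loop this rounding would be applied elementwise in the forward pass (with a straight-through estimator passing gradients through unchanged), and the same function could serve as the reference when checking a ggml dequantization kernel against the hardware's numerics.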
If anyone in the local inference community is interested in exploring this or has pointers on the best way to integrate custom QAT formats into standard benchmarking pipelines, I'm all ears!