I implemented speculative decoding from scratch in pure PyTorch to see how much I could speed up inference on consumer hardware (Intel Core Ultra 5 225H).
It uses an OPT-125M draft model to accelerate an OPT-1.3B target model. I measured a ~2.8x speedup (4.16 tok/s vs. 1.47 tok/s baseline), which comes from the verification step: the target model checks a whole block of drafted tokens in a single forward pass instead of generating them one at a time.
The repo includes a minimal implementation of rejection sampling and a benchmark script.
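For context, the core of the rejection-sampling step looks roughly like this, a minimal sketch of the standard acceptance rule (accept a drafted token x with probability min(1, p(x)/q(x)), else resample from the normalized residual max(0, p − q)); the function name and signature here are illustrative, not the repo's actual API:

```python
import torch

def speculative_step(p: torch.Tensor, q: torch.Tensor, draft_token: int):
    """One rejection-sampling step of speculative decoding.

    p: target-model probabilities over the vocabulary (1-D, sums to 1)
    q: draft-model probabilities over the vocabulary (1-D, sums to 1)
    draft_token: token index sampled from q by the draft model
    Returns (token, accepted).
    """
    # Accept the draft token with probability min(1, p[x] / q[x]).
    r = torch.rand(())
    if r < torch.clamp(p[draft_token] / q[draft_token], max=1.0):
        return draft_token, True
    # On rejection, resample from the residual distribution
    # norm(max(0, p - q)); this correction keeps the overall
    # sample distributed exactly as the target p.
    residual = torch.clamp(p - q, min=0.0)
    residual = residual / residual.sum()
    return int(torch.multinomial(residual, 1)), False
```

When draft and target distributions agree exactly, the ratio is 1 and the draft token is always accepted, which is why a well-matched small draft model yields high acceptance rates and most of the speedup.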