I implemented speculative decoding from scratch in pure PyTorch to see how much I could speed up inference on consumer hardware (Intel Core Ultra 5 225H).
It uses an OPT-125M draft model to accelerate an OPT-1.3B target model. I measured a ~2.8x speedup (4.16 tok/s vs. 1.47 tok/s baseline), which comes from the verification step: the target model checks a whole block of drafted tokens in a single forward pass instead of generating them one at a time.
The repo includes a minimal implementation of rejection sampling and a benchmark script.
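For context, the core of the rejection-sampling step looks roughly like this, a minimal sketch of the standard acceptance rule (accept a drafted token x with probability min(1, p(x)/q(x)), else resample from the normalized residual max(0, p − q)); the function name and signature here are illustrative, not the repo's actual API:

```python
import torch

def speculative_step(p: torch.Tensor, q: torch.Tensor, draft_token: int):
    """One rejection-sampling step of speculative decoding.

    p: target-model probabilities over the vocabulary (1-D, sums to 1)
    q: draft-model probabilities over the vocabulary (1-D, sums to 1)
    draft_token: token index sampled from q by the draft model
    Returns (token, accepted).
    """
    # Accept the draft token with probability min(1, p[x] / q[x]).
    r = torch.rand(())
    if r < torch.clamp(p[draft_token] / q[draft_token], max=1.0):
        return draft_token, True
    # On rejection, resample from the residual distribution
    # norm(max(0, p - q)); this correction keeps the overall
    # sample distributed exactly as the target p.
    residual = torch.clamp(p - q, min=0.0)
    residual = residual / residual.sum()
    return int(torch.multinomial(residual, 1)), False
```

When draft and target distributions agree exactly, the ratio is 1 and the draft token is always accepted, which is why a well-matched small draft model yields high acceptance rates and most of the speedup.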