Also published this month was a theoretical proof showing that, for the same KV cache overhead, MLA consistently offers greater expressive power than GQA. Furthermore, widely used GQA-based pre-trained models (e.g. LLaMA, Qwen, Mixtral) can be converted into MLA-based models.
https://arxiv.org/pdf/2502.07864
It's great to see vLLM getting faster/better for DeepSeek. I tested vLLM vs SGLang a couple weeks ago and SGLang's DeepSeek support was much better/faster (on 2 x p5 H100 nodes). It's great that no one's standing still, I saw this recent AMD article that reported SGLang perf on MI300X has increased by 4X over the past couple weeks: https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR...
(w/ the extra memory V3/R1 fits on a single MI300X or H200 node)
It'll be interesting to see if either project can take advantage/get any benefits from this FlashMLA implementation.
Pretty significant improvements. However, my back-of-the-napkin math suggests that MLA, FlashAttention and similar optimizations provide benefits only when memory access time dominates the compute in the attention implementation? That would be the prefill phase (or TTFT) and training (when batch_size >> 1), but not the decode phase (inference)?
Training and prefill are compute bound. Decode is memory bound. FlashAttention massively increases the arithmetic intensity of naive MHA, such that you can remain compute bound at lower batch sizes during decode.
It depends on the batch size and the accelerator you're running on! Decode is *typically* memory bound unless you can hit high batch sizes (in the hundreds), which is hard during serving due to the contention between batch size and low TTFT.
You've got it backwards. After FlashAttention, it's the decode part that is bound mainly by memory access. With FA, as long as you have a large enough batch size, you can push training/prefill to be compute-bound.
I don't think I got it backwards, I believe what I said is correct - FA does not improve inference time.
From the authors of FlashAttention:
> This [decoding] operation has been optimized with FlashAttention (v1 and v2 recently) in the training case, where the bottleneck is the memory bandwidth to read and write the intermediate results
And then they continue with:
> However, these optimizations don’t apply directly to the inference case, because the bottlenecks are different. For training, FlashAttention parallelizes across the batch size and query length dimensions. During inference, the query length is typically 1 ... With a batch size of 1, FlashAttention will use less than 1% of the GPU!
And then they come up with a different proposal, FlashDecoding, that optimizes for inference time:
> Our new approach Flash-Decoding is based on FlashAttention, and adds a new parallelization dimension: the keys/values sequence length. It combines the benefits of the 2 approaches from above. Like FlashAttention, it stores very little extra data to global memory, however it fully utilizes the GPU even when the batch size is small, as long as the context length is large enough.
Link: https://crfm.stanford.edu/2023/10/12/flashdecoding.html
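For the curious, here is a minimal NumPy sketch of the split-KV idea described in that post: chunk the key/value sequence, compute a locally normalized partial output plus a log-sum-exp per chunk, and merge the partials at the end. The shapes, chunk size and function names are made up for illustration; the real kernel does this per head, fused on the GPU.

```python
import numpy as np

def attention_reference(q, K, V):
    """Plain single-query attention over the full KV cache. q: (d,), K/V: (T, d)."""
    s = K @ q / np.sqrt(q.shape[0])
    p = np.exp(s - s.max())
    return (p / p.sum()) @ V

def flash_decoding_sketch(q, K, V, chunk=128):
    """Split the KV sequence into chunks; each chunk yields a locally normalized
    partial output plus its log-sum-exp, then the partials are merged --
    the extra parallelization dimension Flash-Decoding adds."""
    d = q.shape[0]
    partial_outs, lses = [], []
    for start in range(0, K.shape[0], chunk):
        Kc, Vc = K[start:start + chunk], V[start:start + chunk]
        s = Kc @ q / np.sqrt(d)
        m = s.max()
        p = np.exp(s - m)
        z = p.sum()
        partial_outs.append((p @ Vc) / z)   # softmax output within this chunk only
        lses.append(m + np.log(z))          # log-sum-exp of this chunk's scores
    lses = np.array(lses)
    weights = np.exp(lses - lses.max())
    weights /= weights.sum()                # softmax over the chunk log-sum-exps
    return sum(w * o for w, o in zip(weights, partial_outs))

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.standard_normal((4096, 64))
V = rng.standard_normal((4096, 64))
assert np.allclose(attention_reference(q, K, V), flash_decoding_sketch(q, K, V))
```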
That's correct, because FA can't turn inference from memory-access-bound into compute-bound. But your claim that decoding is compute-bound is plainly wrong.
FA, compared to naive implementation, made training / prefill (i.e. when you can have multiple tokens in the same sequence visible) compute-bound instead of memory-access bound.
So, currently, on MHA/GQA, with Flash Attention, training/prefill is compute-bound, whereas decoding is memory-access-bound.
Before FA, both prefill / decode are bound by memory-access. FA solved the problem of training/prefill. But because kvcache is large, decoding is inherently bound by memory-access.
Our goal is always to make everything compute-bound.
> But your claim on that decoding is compute-bound is plainly wrong.
I did not say anything like that? What I said is that FlashAttention, and arguably MLA, will not make any significant gains in inference time. And this is true.
Also, FWIW there are certainly model shapes that are compute-bound in the decode phase, so saying that decoding is universally, inherently bound by memory access is what is plainly wrong, if I were to use your dictionary.
> MLA, FlashAttention and similar optimizations will provide the benefits only when memory access time dominates
> Those would be [...] not the decode phase
This does sound like you are saying that memory access time does NOT dominate during the decode phase. But it does.
Reading your quotes, it looks like maybe you are talking about GPU utilization issues (i.e. not launching enough threads)? Due to the parallelization strategy of the original FA, it indeed does not even keep the GPU busy if q*bs is too small. But this is not an inherent limitation of FA-style kernels; it can be solved, and people did solve it. Or you simply batch more. Now you can keep the GPUs busy at 100% waiting for memory access, but memory access time still dominates, hence "memory-access-bound". And here comes MLA.
> FWIW there are certainly model shapes that are compute-bound in the decode phase
Yeah. But so far none of the ones I've read about really work ("work" meaning at worst only slightly worse than the alternatives) under the same wall-clock compute budget. Do you have any pointer to a working example, even on smaller 3B-ish models?
> This does sound like you are saying that memory access time does NOT dominate during the decode phase. But it does.
Let's take llama3-8B as an example. The GFLOPs needed for self-attention per layer per token are roughly 0.15 GFLOPs. For simplicity, let's assume that we store all our weights in FP8 precision; then the load memory bandwidth required for the same is 0.05 GB. Store memory bandwidth is negligible. If we expand this further to a 1k-token context, this becomes ~180 GFLOPs and ~0.35 GB per layer per 1k ctx.
Assuming that our HW is H100, is this compute-bound or memory-bound?
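Plugging those figures into rough H100 SXM peak numbers (assumptions: on the order of 2e15 dense FP8 FLOP/s and ~3.35e12 B/s of HBM bandwidth, both marketing peaks that real kernels don't reach) gives a quick roofline-style check; whether the KV cache bytes also belong in the byte count is exactly what the replies below argue about:

```python
# Quick roofline-style check of the napkin numbers above.
# Assumed H100 SXM peaks (marketing numbers, real kernels get less):
PEAK_FLOPS = 2.0e15   # ~dense FP8 FLOP/s
PEAK_BW    = 3.35e12  # ~HBM bytes/s

def which_bound(label, flops, bytes_moved):
    t_compute = flops / PEAK_FLOPS
    t_memory = bytes_moved / PEAK_BW
    verdict = "compute-bound" if t_compute > t_memory else "memory-bound"
    print(f"{label}: compute {t_compute * 1e6:.2f} us, memory {t_memory * 1e6:.2f} us -> {verdict}")

which_bound("per-layer, per-token decode", 0.15e9, 0.05e9)   # figures quoted above
which_bound("per-layer, 1k-token context", 180e9, 0.35e9)    # figures quoted above
```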
You need to load the cached k/v tensors, in addition to the weights. It's going to take me some minutes to find out what's wrong in this napkin math. Will edit or reply to this comment later.
Re-computing everything every time is the worst-case scenario, which is why I included it in the example (1k tokens). In that case the KV cache is obviously set to 0, but it is also obvious that this is a much worse alternative than using the KV cache, which is pretty much the reason we have the KV cache. Therefore the argument about loading the cached tensors doesn't make a difference at all.
> It's going to take me some minutes to find out what's wrong in this napkin math.
... and batching does not help: you batch more requests and get more kvcache to load, so it's still memory-access bound.
MLA made it possible to cache a smaller form of k/v, mitigating the problem (but not completely solving it; on shorter contexts & smaller batches it's still memory-access bound).
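As a toy illustration of what "caching a smaller form of k/v" means, here is a sketch that stores only a shared latent vector per token and re-expands K/V when attending. The dimensions and names are invented (not DeepSeek's), and MLA's decoupled RoPE path is ignored:

```python
import numpy as np

d_model, n_heads, d_head, d_latent = 1024, 16, 64, 128  # toy sizes, not DeepSeek's

rng = np.random.default_rng(0)
W_dkv = rng.standard_normal((d_latent, d_model)) * 0.02          # shared down-projection
W_uk = rng.standard_normal((n_heads, d_head, d_latent)) * 0.02   # per-head up-projection for K
W_uv = rng.standard_normal((n_heads, d_head, d_latent)) * 0.02   # per-head up-projection for V

latent_cache = []  # per token we store only d_latent numbers

def append_token(h):
    # MHA/GQA would append full per-head K and V here (2 * n_heads * d_head values);
    # MLA stores just the shared latent vector and re-expands K/V when attending.
    latent_cache.append(W_dkv @ h)

def attend(q):  # q: (n_heads, d_head), one decode-step query
    C = np.stack(latent_cache)             # (T, d_latent)
    K = np.einsum('hdl,tl->htd', W_uk, C)  # (n_heads, T, d_head), rebuilt on the fly
    V = np.einsum('hdl,tl->htd', W_uv, C)
    s = np.einsum('hd,htd->ht', q, K) / np.sqrt(d_head)
    p = np.exp(s - s.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return np.einsum('ht,htd->hd', p, V)   # (n_heads, d_head)

for _ in range(5):
    append_token(rng.standard_normal(d_model))
out = attend(rng.standard_normal((n_heads, d_head)))
print(out.shape, len(latent_cache[0]))  # (16, 64) 128 -> 128 floats cached vs 2*16*64 = 2048 for full K/V
```

(Real MLA kernels additionally avoid materializing K/V at all by absorbing the up-projections into the query and output projections; the sketch only shows why the cache shrinks.)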
I also just read that paper. But I wonder, even though MLA is strictly more powerful, do you really gain from that in experiments? The paper doesn't really do many experimental comparisons. GQA, on the other hand, should still be faster (no need for an extra linear transformation).
It seems to me that MLA will become the standard from here on out.
If DeepSeek R1 had used standard MHA, they would need 1749KB per token for KV cache storage. This means that once the conversation reaches ~46,000 tokens, the KV cache will have exceeded the entire storage capacity of a single H100.
Using MLA, each token now consumes 125KB. This means you can hit ~640,000 tokens (2x Ulysses) before overflowing.
https://www.nvidia.com/en-us/data-center/h100/
https://verticalserve.medium.com/group-query-attention-58283...
https://paperswithcode.com/method/multi-head-attention
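Taking the per-token figures above at face value (they depend on layer count, head sizes and cache dtype), the capacity arithmetic against an 80 GB H100 is simple:

```python
# Capacity check using the per-token KV cache sizes quoted above (taken as given)
# against the 80 GB of HBM on a single H100, ignoring the space the weights need.
HBM_BYTES = 80e9

for name, kb_per_token in [("MHA", 1749), ("MLA", 125)]:
    max_tokens = HBM_BYTES / (kb_per_token * 1e3)
    print(f"{name}: {kb_per_token} KB/token -> ~{max_tokens:,.0f} tokens before the cache alone fills the card")
```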
I'm confused. Weren't there sanctions against Chinese companies regarding Hopper GPUs? Are they just admitting that they had access to H100s despite the US sanctions?!
Just the H100; the H800 is a region-specific version of the card for China with shitty NVLink bandwidth, which makes it rougher for building big clusters, but DeepSeek was able to mitigate the impact of that by being clever (rumored to have made significant use of PTX assembly instead of just using CUDA; we'll probably find out in the releases this week).
It isn't illegal for Chinese companies to buy H100 cards. It is illegal for US companies to sell them to China. So the "admit" part wouldn't be on China's side.
Eh, the linked repo cites the H800. The H800 comes up in every discussion about DeepSeek, so the "aha! got em!" bit gets kind of boring. And Chinese companies can fully rent all the H100 compute they want.
And for that matter the entire position of "did they just admit" is growing old. Not only do Chinese companies not have to care about US export restrictions, the conspiratorial take that DeepSeek is actually some big scam no longer deserves kid gloves.
I'd be very careful when using that word in this situation. If China wants X, and another country has X, who are you to say they shouldn't trade with each other?
We should forget about the sanctions BS … it damages US industry when there's money to be made, while motivating others to be more self-reliant and build products to compete …
Why does anyone need to be careful using that word? What a bizarre way to try to intimidate someone over speech.
Another country has X because they were expected (in the terms of their purchase) to not sell it to an adversary. So yes they’re supposed to honor that agreement and are not supposed to trade that particular thing X with each other. Not doing so invites sanctions and other consequences. Is it worth the risk just to do business with a dictatorship? Probably not.
If free citizens in the USofA have {X}, and China has sanctioned Germany from having {X}, should the free citizens of the USofA honor the agreement they made with China not to sell to Germany when they acquired {X} from China?
How about if they got {X} from Mexico ( who got it from Agnes .. ) ?
Some purchases come with strict protocols coded into contracts. Try buying an F-35 and selling it to China, for example; see what happens. Others risk you not being able to purchase for yourself anymore, and possible sanctions. The H100 and others are under export control; I'm just not sure if it's an explicit export control or automatic, like what famously made the PowerMac G4 a weapons export. I found a source saying there was an executive order for hardware exceeding 1e26 floating point operations or 1e23 integer operations. In any case, if an item is under export control, that means paperwork, and, if you're eligible to purchase, the paperwork includes you signing what you can and cannot do with the item purchased.
People say “it’s used for money laundering” as if we’re supposed to be on China’s side about restricting people’s ability to move money out of the country over certain amounts
Like, oh, you’re against freedom from a repressive regime? Or, oh, you’re only against it when it’s the American government restricting US citizens’ flow of capital? Like, I’m confused, pick a lane.
Smuggling is normally thought of as hiding something when crossing a border/checkpoint. In this case, it would simply be nvidia violating US sanctions. The goods would have never entered or exited the USA so it's a strange or incorrect use of the word smuggling.
Open sourcing is the runner-up’s way to ensure the current best player doesn’t steal the whole market. The elephant in the room is obviously the cluster size required, it hardly matters for normal people that the weights are free. We needed more efficiency breakthroughs.
It matters a lot, even if you never intend to run it yourself or look at the code.
It means that people can and will provide this service, and 1000's will build on this and make offers that you can use in either a commodity base market, or with a specific niche target.
It means regulatory capture and control will be much, much harder to execute.
It means AI might continue to be a benefit also to you rather than just a way to control, propagandize and exploit you.
> .. it hardly matters for normal people that the weights are free. We needed more efficiency breakthroughs.
That at least allows other companies/research labs to develop competing cutting-edge LLM technology and come up with efficiency breakthroughs. The alternative is for the tech to be hidden inside OpenAI and the FAANGs, or released only as old versions.
This is the minimum bar that I expect very elite programmers to be striving for in the age of AI. DeepSeek should be studied as an example, and this is only the first of many projects from them.
There is an extremely high chance (in fact a 99.9% chance) that an AI did not build this, and the ones who are able to build or adapt projects like this, which go deep into hardware systems, will be the most sought after.
Not the horrendous JS or even TS slop across GitHub that is extremely easy for an AI to generate correctly.
You've got until 2030 to decide. And my advice is to study the codebases of pytorch (backends), DeepSeek, tinygrad and ggml.
Do you feel GenAI coding is substantially different from the lineage of 4GL to 'low code' approaches?
The reason I'm asking is because, despite all promises, all suffered from what Spolsky coined the 'leaky abstraction' problem.
Once something goes wrong, the user is left without recourse in a sea of additional complexity created by the tooling that was meant to spare them from dealing with it in the first place.
My own opinion is that GenAI is different because of (a) its recursive, reflexive potential (you can use the tool itself to help you past the failure) and (b) it shifts the input away from requiring algorithmic/systemic thinking (which may come as a surprise to the audience here, but my experience has taught me that such thinking is alien to, dare I say, the majority of people).
Now don't get me wrong. We have not reached the point where (a)+(b) make it so that you don't need application-layer devs, but we are definitely seeing some progress.
As for going deeper into the stack to "escape" AI, I would venture that is probably a non-starter, as the deeper you go the more constrained the domain is, so your escape strategy relies on AI reasoning making little progress, whereas AI reasoning has always been more successful in smaller, well-defined spaces.
It's an interesting opinion, but I read the exact same opinions about JS developers in 2008 too.
I do agree that if you are "only" a developer, you will have to be in some sort of tightly defined niche, and how long those niches survive is anyone's guess.
What do you mean with "only" developer? Someone who just knows how to code when given a spec but lacking domain knowledge (in this case ai math and hardware optimization) and larger context?
I agree that DeepSeek continues to prove themselves as a great example of engineering but the number of job positions requiring this type of knowledge IME is typically very very low so I am not sure if this would be the right advice to follow. Though I wish it was different.
Yeah, you're hitting the nail on the head. Low-tier coding work can be reduced, and high-end developers can now avoid boilerplate-type coding problems and get back to high-level work like re-engineering complex frameworks.
Yes, this unfortunately does mean a reduction in the less-skilled workforce, but frankly that's, on the whole, a good thing. Does anyone really enjoy writing and testing boilerplate day in, day out for low pay? It's the same as the old white-collar job of pushing paper around until retirement...
I don't find it a reasonable take; it's like saying stackoverflow.com is taking developer jobs by making it easy to code, so we'd better develop a new stackoverflow.com.
Says more about how many low-hanging fruits remain in the "NOOOOO I DON'T WANT TO DOWNLOAD 200MiB PYTORCH, I'D BETTER REINVENT THE WHEEL"-gang's inference stacks.
To be fair, torch didn't try very hard to optimize on CPU either.
FWIW, as someone who "NOOO DOESN'T WANT TO DOWNLOAD 200MB[0] PYTORCH"s, I'm glad for those who make alternative minimal/no-dependency stacks that are based on C/C++, like ggml.
[0] 200MB is actually a very generous number; I tried to download some AI thing via pip3 the other day and it wanted 600MB or so of CUDA stuff. Meanwhile I do not even have an Nvidia GPU.
The wheel of CPU-only PyTorch 2.6.0 for Python 3.12 is ~170MiB in size.
It is indeed pretty silly that that's not the default and you have to go to https://pytorch.org/get-started/locally/ and copy the argument `--index-url https://download.pytorch.org/whl/cpu` to install CPU-only torch. But the alternative would be having the world's scientists wondering why they can't use their GPUs after `pip install torch`, so /shrug.
I heard their inference framework is way lower cost than typical deployment methods. Can this be verified from that open-source project? How does it stack up against vLLM or llama.cpp?
I suspect it's much higher throughput than vLLM, which in turn is much higher throughput than llama.cpp. The MLA kernel they just open-sourced seems to indicate that, although we'll see how it does in third party benchmarks on non-hobbled GPUs vs FlashAttention. They only released the BF16 version — whereas most people, including DeepSeek themselves, serve in FP8 — so it might not be immediately useful to most companies quite yet, although I imagine there'll be FP8 ports soon enough.
I don't think the decision is based on infra, or any technical reasons.
It's more on the service-support side.
How does a 200-person company support 44M iPhone users in China?
Right, I think the PTX use is a bigger deal than it's getting coverage for. This opens the door for other vendors to get their foot in with PTX-to-LLVM-IR translation for existing CUDA kernels.
https://github.com/vllm-project/vllm/releases/tag/v0.7.1
MHA is still faster in the low-QPS regime, apparently.
https://neuralmagic.com/blog/enhancing-deepseek-models-with-...
I am very curious to see how well-optimized DeepSeek's code is compared to leading LLM serving software like vLLM or SGLang.
> FlashAttention ... such that you can remain compute bound at lower batch sizes during decode.
So, which one is it then?
https://jax-ml.github.io/scaling-book/inference/ - good read!
> It's going to take me some minutes to find out what's wrong in this napkin math.
I am sure you will. Please don't be so entitled.
Unrelated: it's always impressed me how Singapore buys 15% of the world's H100s. Really is the AI development capital of the world.
>Achieving up to 3000 GB/s in memory-bound configuration and 580 TFLOPS in computation-bound configuration on H800 SXM5, using CUDA 12.6.
Also, they could have outsourced the computation to a subsidiary company in the US, I suppose.
Capital controls are obsolete in any context.
(Showing my lack of breadth of knowledge in the ecosystem (s))
With the next wave of investment targeting local on-device robotics, I'm way more bullish about local AI than vertical SaaS AI.
https://sakana.ai/ai-cuda-engineer/
"99% written by DeepSeek-R1" according to the author.