I really think there ought to be more discussion of this paper.
copying from my previous comment: A first-generation diffusion model is beating Llama 3 in some areas - a model that has had a huge amount of tuning and improvement work poured into it. And it's from China again!
A whole new "tree" of development has opened up. With so many possibilities - traditional scaling laws, out-loud chain of thought, in-model layer-repeating chain of thought, and now diffusion models - it seems unlikely to me that LLMs are going to hit a wall that the river of technological progress cannot flow around.
I wonder how well they'll work at translation. The paper indicates that they're rather good at poetry.
I'm still reading the paper, but my main question is how slow the model is compared to an LLM of the same size. It seems like, to get the best accuracy, they need to set the number of time steps equal to the number of tokens to be generated. Does that make it comparable in speed to an LLM?
Update: finished the paper, and as I suspected, there's a serious downside in speed and memory consumption. The LLaDA model has to process the entire output sequence on every time step, without anything like a KV cache. On top of that, full quadratic attention runs over the entire output sequence at every time step, which makes it infeasible for sequence lengths beyond a few thousand tokens.
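To make that concrete, here's a rough back-of-the-envelope sketch (my own, not from the paper) of the attention work involved, assuming the baseline LLM decodes autoregressively with a KV cache while the diffusion sampler re-runs full self-attention over all L output positions at each of its T denoising steps:

```python
# Back-of-the-envelope attention-cost comparison (my own sketch, not from the paper).
# Assumes: the baseline LLM decodes autoregressively with a KV cache, while the
# diffusion sampler re-runs full self-attention over all L output positions at
# each of its T denoising steps.

def autoregressive_attention_pairs(seq_len: int) -> int:
    """Query-key pairs over a full decode: step t attends to t cached tokens."""
    return sum(t for t in range(1, seq_len + 1))  # ~ L^2 / 2

def diffusion_attention_pairs(seq_len: int, num_steps: int) -> int:
    """Query-key pairs when every step attends over the whole sequence."""
    return num_steps * seq_len * seq_len  # T * L^2

if __name__ == "__main__":
    L = 2048
    ar = autoregressive_attention_pairs(L)
    # Best-accuracy setting discussed above: num_steps == number of generated tokens
    diff = diffusion_attention_pairs(L, num_steps=L)
    print(f"autoregressive w/ KV cache: {ar:.2e} pairs")
    print(f"diffusion, T = L steps:     {diff:.2e} pairs ({diff / ar:.0f}x more)")
```

With T set equal to the sequence length, that works out to roughly 2T times more attention work per generation under these assumptions, before even counting the memory of holding activations for the full sequence at every step.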
Interesting times.