**Albert Ge Chandan Singh Dinghuai Zhang Letian Peng Yufan Zhuang Ning Shang Li Lyna Zhang Liyuan Liu Jianfeng Gao** (Work in Progress)
First published on Sep 18, 2025 | GitHub: https://github.com/lbertge/longdllm | https://pypi.org/project/longdllm/
<aside>
TL;DR
Diffusion language models (dLLMs) exhibit fundamentally different attention mechanisms from autoregressive LLMs—notably lacking attention sinks—which impedes existing long-context scaling practices. We discuss simple-yet-effective modifications that enable robust context extension to 131k tokens and release the `longdllm` package for future research.
</aside>
Figure 1: (Left): `longdllm` significantly boosts long-context performance on open-source dLLMs, matching closed-source dLLM performance with DiffuCoder. (Right): dLLM attention patterns deviate significantly from those of AR models on passkey retrieval.
You can try out our long-context adaptations for DiffuCoder and LLaDA with `pip install longdllm`. `longdllm` fixes two memory inefficiencies in LLaDA and reduces the memory footprint by 60%, allowing us to process up to 131k input tokens on a single A6000 GPU.
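As a rough sketch of how this might be wired up (the model id is LLaDA's public Hugging Face checkpoint; `apply_long_context` is a hypothetical placeholder for the package's actual entry point, which the `longdllm` README documents):

```python
# pip install longdllm
import torch
from transformers import AutoModel, AutoTokenizer

import longdllm

# Load an open-source dLLM; trust_remote_code is needed because dLLMs ship
# custom modeling code on the Hugging Face Hub.
model_id = "GSAI-ML/LLaDA-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda")

# Hypothetical call (placeholder name, not the confirmed longdllm API):
# apply the long-context RoPE rescale factors and memory fixes to the model.
model = longdllm.apply_long_context(model)
```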
We provide our optimized rescale factors for DiffuCoder here and for LLaDA here, which are based on a modified version of the LongRoPE2 codebase adapted for dLLMs.
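For intuition, LongRoPE2-style extension comes down to dividing each RoPE inverse frequency by a per-dimension rescale factor. A minimal sketch of that operation (variable names are illustrative, not `longdllm` internals):

```python
import torch

def rescale_rope_inv_freq(head_dim: int, base: float, factors: torch.Tensor) -> torch.Tensor:
    """Divide each RoPE inverse frequency by its per-dimension rescale factor.

    `factors` has one entry per rotary dimension pair (head_dim // 2); a factor
    greater than 1 slows that frequency down, stretching its usable context.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    assert factors.shape == inv_freq.shape
    return inv_freq / factors

# Toy example: keep high-frequency dimensions intact, stretch low-frequency ones.
head_dim, base = 128, 10000.0
factors = torch.linspace(1.0, 8.0, head_dim // 2)
extended_inv_freq = rescale_rope_inv_freq(head_dim, base, factors)
```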
Figure 2: Needle-in-a-haystack performance for DiffuCoder-7B-Instruct (left), LLaDA-8B-Instruct (middle), and Mercury dLLM (right). Open-source dLLMs struggle with long-context inputs past 16k tokens.
As Figure 2 shows, open-source dLLMs like LLaDA-Instruct and DiffuCoder-Instruct suffer a >50% accuracy drop when processing inputs beyond 16k tokens on retrieval tasks. This reveals a critical limitation in the long-context performance of diffusion language models.
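For concreteness, a passkey-retrieval prompt of the kind behind these curves can be built roughly as follows (the filler sentences and phrasing are illustrative, not the exact benchmark configuration we used):

```python
import random

def build_passkey_prompt(num_filler_lines: int, passkey: int, seed: int = 0) -> str:
    """Hide a passkey inside repetitive filler text and ask the model to retrieve it."""
    rng = random.Random(seed)
    filler = "The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again."
    lines = [filler] * num_filler_lines
    needle = f"The pass key is {passkey}. Remember it. {passkey} is the pass key."
    lines.insert(rng.randrange(len(lines)), needle)  # random needle depth
    haystack = "\n".join(lines)
    return f"{haystack}\n\nWhat is the pass key? The pass key is"

prompt = build_passkey_prompt(num_filler_lines=2000, passkey=68217)
```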
We observe that dLLMs exhibit more diffuse attention compared to autoregressive models (Figure 1, right). This suggests that existing context extension methods—originally designed for autoregressive (AR) models—may not be well-suited to the bidirectional attention patterns inherent in dLLMs. In this blog post, we'll explore how current context extension methods fail in dLLMs and propose solutions to close this gap.
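One simple way to see this quantitatively is to measure how much attention mass queries place on the first few tokens, where AR models typically park their attention sinks. A sketch, assuming the model returns attention weights via `output_attentions=True` (dLLMs with custom modeling code may expose them differently):

```python
import torch

@torch.no_grad()
def attention_sink_mass(model, input_ids, num_sink_tokens: int = 4) -> float:
    """Average fraction of attention each query places on the first few tokens."""
    out = model(input_ids=input_ids, output_attentions=True)
    attn = out.attentions[-1]                           # (batch, heads, seq, seq)
    sink_share = attn[..., :num_sink_tokens].sum(dim=-1)
    return sink_share.mean().item()                     # high values => sink-like concentration
```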
Figure 3: Visualization of attention scores from the first head of the last layer for Llama3, Qwen2.5-Coder, LLaDA-Instruct, and DiffuCoder-Instruct (from left to right). All attention maps were generated on an identical passkey retrieval task at 32k context length, with each square representing the maximum value within a 400×400 region.
Where do models pay attention at long contexts? In Figure 3, we visualize the post-softmax attention scores of the first attention head of the last layer for various AR and diffusion LLMs performing a passkey retrieval task at 32k context. We observe two striking patterns:
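The block-pooled view in Figure 3 can be reproduced roughly like this, again assuming the model exposes attention weights via `output_attentions=True`; the function and variable names are illustrative:

```python
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

@torch.no_grad()
def plot_blocked_attention(model, input_ids, block: int = 400, head: int = 0):
    """Plot one head of the last layer's attention, max-pooled over block x block regions."""
    out = model(input_ids=input_ids, output_attentions=True)
    attn = out.attentions[-1][0, head]            # (seq_len, seq_len), post-softmax
    pad = (-attn.shape[-1]) % block               # pad so the length divides evenly
    attn = F.pad(attn, (0, pad, 0, pad))
    n = attn.shape[-1] // block
    pooled = attn.reshape(n, block, n, block).amax(dim=(1, 3))  # max per block
    plt.imshow(pooled.float().cpu().numpy(), cmap="viridis")
    plt.xlabel(f"key position (x{block} tokens)")
    plt.ylabel(f"query position (x{block} tokens)")
    plt.show()
```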
Figure 4: Attention map changes before and after applying LongRoPE2 at 131k context. (left): For AR models, LongRoPE2 extends the attention-sink pattern, and the model attends to the correct needle location. (right): For dLLMs, however, LongRoPE2 amplifies noise across the attention map, creating spurious patterns.
In our first attempt to improve dLLM long-context capabilities, we applied LongRoPE2 to modify the RoPE hyperparameters used in dLLMs. Although LongRoPE2 achieves state-of-the-art performance on autoregressive models, we found that it fails to achieve robust length generalization in dLLMs past 100k tokens, as illustrated in Figure 1 (left).