**Albert Ge Chandan Singh Dinghuai Zhang Letian Peng Yufan Zhuang Ning Shang Li Lyna Zhang Liyuan Liu Jianfeng Gao** (Work in Progress)

First published on Sep 18, 2025 | GitHub: https://github.com/lbertge/longdllm | PyPI: https://pypi.org/project/longdllm/

<aside>

TL;DR Diffusion language models (dLLMs) exhibit fundamentally different attention mechanisms from autoregressive LLMs, notably lacking attention sinks, which impedes existing long-context scaling practices. We discuss simple-yet-effective modifications that enable robust context extension to 131k tokens and release the longdllm package for future research.

</aside>

Figure 1: (Left): longdllm significantly boosts long context performance on open-source dLLMs, matching closed-source dLLM performance with DiffuCoder. (Right): dLLM attention patterns deviate significantly from that of AR models on passkey retrieval.

Quick Start

You can try out our long-context adaptations for DiffuCoder and LLaDA with `pip install longdllm`. longdllm fixes two memory inefficiencies in LLaDA and reduces the memory footprint by 60%, allowing us to process up to 131k input tokens on a single A6000 GPU.
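A minimal sketch of the intended workflow is shown below. The LLaDA checkpoint name is the one published on the Hugging Face Hub, but `adapt_for_long_context` is a hypothetical entry point used only for illustration; see the GitHub README for longdllm's actual API.

```python
# Install first: pip install longdllm

import torch
from transformers import AutoModel, AutoTokenizer

# Load an open-source dLLM checkpoint (LLaDA-8B-Instruct shown here).
tokenizer = AutoTokenizer.from_pretrained(
    "GSAI-ML/LLaDA-8B-Instruct", trust_remote_code=True
)
model = AutoModel.from_pretrained(
    "GSAI-ML/LLaDA-8B-Instruct",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Hypothetical entry point: patch the model with longdllm's rescaled RoPE
# factors and memory optimizations so it can ingest up to 131k input tokens.
# Check the longdllm README for the actual API.
from longdllm import adapt_for_long_context  # hypothetical import

model = adapt_for_long_context(model, target_length=131072)
```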

Working Example

We provide our optimized rescale factors for DiffuCoder here and for LLaDA here, which are based on a modified version of the LongRoPE2 codebase adapted for dLLMs.
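To give a sense of what these rescale factors do (an illustrative sketch, not the exact longdllm implementation), LongRoPE2-style context extension divides each RoPE inverse frequency by a searched per-dimension factor, lengthening the rotary period so that positions beyond the pretraining window stay in-distribution:

```python
import torch

def rescaled_rope_inv_freq(
    head_dim: int,
    rope_theta: float,
    rescale_factors: torch.Tensor,  # shape (head_dim // 2,), one factor per rotary dimension
) -> torch.Tensor:
    """Apply LongRoPE2-style per-dimension rescaling to RoPE inverse frequencies.

    Illustrative sketch: longdllm ships searched factors for DiffuCoder and LLaDA;
    here we only show how such factors modify the rotary frequencies.
    """
    # Standard RoPE inverse frequencies: theta^(-2i/d) for i = 0, ..., d/2 - 1
    inv_freq = 1.0 / (rope_theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # Dividing by a factor > 1 lengthens the rotary period of that dimension,
    # which is how LongRoPE2-style methods extend the usable context window.
    return inv_freq / rescale_factors

# Example with placeholder factors: uniform 8x scaling of a 128-dim rotary head.
inv_freq_ext = rescaled_rope_inv_freq(
    head_dim=128, rope_theta=10000.0, rescale_factors=torch.full((64,), 8.0)
)
```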

Long Context Performance of Open dLLMs

Figure 2: Needle-in-a-haystack performance for DiffuCoder-7B-Instruct (left), LLaDA-8B-Instruct (middle), and Mercury dLLM (right). Open-source dLLMs struggle with long-context inputs past 16k tokens.

As Figure 2 shows, open-source dLLMs like LLaDA-Instruct and DiffuCoder-Instruct suffer a >50% accuracy drop on retrieval tasks when processing inputs beyond 16k tokens. This reveals a critical limitation in the long-context performance of diffusion language models.
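For context, a needle-in-a-haystack probe embeds a passkey at a controlled depth inside filler text and asks the model to retrieve it. The sketch below is illustrative; the exact prompt template, filler, and needle wording in our evaluation may differ.

```python
def build_passkey_prompt(context_words: int, depth: float, passkey: str = "73914") -> str:
    """Build a passkey-retrieval prompt of roughly `context_words` words.

    Illustrative sketch of the needle-in-a-haystack setup: filler text is repeated
    to the target length and the needle is inserted at a relative depth in [0, 1].
    """
    filler = "The grass is green. The sky is blue. The sun is yellow. "
    needle = f"The passkey is {passkey}. Remember it. "
    question = "\n\nWhat is the passkey mentioned in the text above?"

    words_per_filler = len(filler.split())
    haystack = (filler * (context_words // words_per_filler + 1)).split()[:context_words]
    insert_at = int(len(haystack) * depth)
    haystack = haystack[:insert_at] + needle.split() + haystack[insert_at:]
    return " ".join(haystack) + question

# Example: a ~32k-word prompt with the needle placed halfway through.
prompt = build_passkey_prompt(context_words=32_000, depth=0.5)
```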

We observe that dLLMs exhibit more diffuse attention than autoregressive (AR) models (Figure 1, right). This suggests that existing context extension methods, originally designed for AR models, may not be well-suited to the bidirectional attention patterns inherent in dLLMs. In this blog post, we explore how current context extension methods fail on dLLMs and propose solutions to close this gap.

Diagnosing Long Context Attention in dLLMs

Figure 3: Visualization of attention scores from the first head of the last layer for Llama3, Qwen2.5-Coder, LLaDA-Instruct, and DiffuCoder-Instruct (from left to right). All attention maps were generated using an identical passkey needle retrieval task at 32k context length, with each square representing the maximum value within a 400×400 region.

Where do these models attend at long contexts? In Figure 3, we visualize the post-softmax attention scores from the first head of the last layer of various AR and diffusion LLMs performing a passkey retrieval task at 32k context. We observe two striking patterns:

  1. Similar to Figure 1, dLLMs produce more dispersed attention scores than AR models. For LLaDA, we observe numerous off-diagonal attention patterns, indicating local correlations in attention behavior; for DiffuCoder, attention is spread among the beginning of the sequence, the needle position, and the end. We also do not see a prominent attention-sink pattern of the kind clearly visible in AR models.
  2. dLLMs can, by default, attend to the relevant needle positions, even at lengths far beyond their pretraining length. For both LLaDA and DiffuCoder, a significant amount of attention weight is placed on the needle positions, unlike in their AR counterparts.
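The block-pooled maps in Figure 3 can be reproduced with a simple max-pooling pass over the full attention matrix. Below is a minimal sketch, assuming the forward pass was run with `output_attentions=True` (eager attention) so per-layer attention weights are available; the 400×400 block size matches the pooling described in the caption.

```python
import torch
import torch.nn.functional as F

def block_max_pool(attn: torch.Tensor, block: int = 400) -> torch.Tensor:
    """Downsample a (seq_len, seq_len) attention map by taking the max over each block x block tile."""
    n = attn.shape[-1]
    pad = (block - n % block) % block
    attn = F.pad(attn, (0, pad, 0, pad))                          # pad to a multiple of the block size
    tiles = attn.unfold(0, block, block).unfold(1, block, block)  # (n_blocks, n_blocks, block, block)
    return tiles.amax(dim=(-1, -2))

# Usage sketch (names are illustrative): `outputs.attentions` is a tuple of
# (batch, heads, seq, seq) tensors, one per layer, when output_attentions=True.
#   attn_map = outputs.attentions[-1][0, 0]        # last layer, first head
#   pooled = block_max_pool(attn_map.float().cpu())
#   import matplotlib.pyplot as plt
#   plt.imshow((pooled + 1e-9).log(), cmap="viridis"); plt.colorbar(); plt.show()
```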

Advancing dLLM Long Context Performance

Figure 4: Attention map changes before and after applying LongRoPE2 at 131k context. (Left): For AR models, LongRoPE2 extends the attention-sink pattern and attends to the correct needle location. (Right): For dLLMs, however, LongRoPE2 amplifies noise across the attention map, creating spurious patterns.

In our first attempt to improve dLLM long-context capabilities, we applied LongRoPE2 to modify the RoPE hyperparameters of these models. Although LongRoPE2 achieves state-of-the-art context extension on AR models, we found that it fails to deliver robust length generalization in dLLMs past 100k tokens, as illustrated in Figure 1 (left).