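Recent progress toward artificial general intelligence (AGI) has been driven largely by the rise of autoregressive large language models (LLMs) and of diffusion models for image and video generation. These models show remarkable ability to understand and generate across many modalities, reaching performance levels that were previously hard to imagine. Their unprecedented scale (large parameter counts, massive datasets, heavy training investment, and substantial inference-time compute) has pushed AI to new heights, giving it broad general knowledge and a deep understanding of language and the real world.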
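This hybrid strategy lets the model capture long-range dependencies across blocks autoregressively while accelerating generation through parallel diffusion within each block. The design also supports flexible output lengths and is compatible with the KV cache widely used in AR models. Notably, recent masked diffusion language models adopt a similar block-based semi-autoregressive decoding strategy and can be viewed as instances of hybrid AR-diffusion modeling.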
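As a rough illustration of the block-wise scheme described above, the following Python sketch generates blocks left to right and fills each block by iterative parallel unmasking. Everything in it is an assumption made for illustration: `toy_denoiser` is a hypothetical stand-in that returns random tokens and confidences, `MASK` is a hypothetical mask id, and the confidence-based commit rule is one common choice rather than the method of any specific model.

```python
import random

MASK = -1  # hypothetical mask token id (assumption, not from the source)


def toy_denoiser(context, block, rng):
    """Hypothetical stand-in for a diffusion model.

    Returns {position: (token, confidence)} for every masked position.
    A real model would condition on `context` and the partially filled block.
    """
    return {
        i: (rng.randrange(100), rng.random())
        for i, tok in enumerate(block)
        if tok == MASK
    }


def generate_blockwise(prompt, num_blocks, block_size, steps_per_block, seed=0):
    """Semi-autoregressive decoding: AR across blocks, parallel within a block."""
    rng = random.Random(seed)
    # Finished tokens; in a real model their key/value states could be cached.
    context = list(prompt)
    # Number of positions committed per denoising step (ceiling division).
    per_step = -(-block_size // steps_per_block)
    for _ in range(num_blocks):
        block = [MASK] * block_size
        while MASK in block:
            preds = toy_denoiser(context, block, rng)
            # Commit the most confident predictions of this step in parallel.
            best = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:per_step]
            for i in best:
                block[i] = preds[i][0]
        # The finished block becomes fixed context for the next block.
        context.extend(block)
    return context


if __name__ == "__main__":
    print(generate_blockwise([1, 2, 3], num_blocks=2, block_size=4, steps_per_block=2))
```

Because the number of blocks is a loop bound rather than a fixed canvas size, the output length can vary, and only the current block is ever revised, which is what makes caching of earlier blocks possible.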
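Beyond block-level methods that combine AR and diffusion at the sequence level, hybridization can also occur at the architecture level: one part of the network (typically an encoder) diffuses the whole sequence at once into an intermediate representation, and an autoregressive decoder then generates the final sequence. LADIDA takes a slightly different approach: it diffuses at the document level, while an AR decoder decodes sentence by sentence. SpecDiff proposes a collaborative speculative decoding framework in which a lightweight diffusion model drafts candidate outputs and a large AR model verifies and finalizes them, correcting where necessary. TiDAR proposes a sequence-level hybrid architecture that integrates diffusion-based parallel drafting with autoregressive sampling in a single forward pass through structured causal-bidirectional attention; it combines the efficiency of diffusion models with the quality of AR decoding, reporting up to 5× higher throughput while maintaining AR-level performance. SDLM introduces the "next sequence prediction (NSP)" paradigm, which unifies next-token and next-block prediction to support adaptive-length generation; by retrofitting a pretrained autoregressive model with parallel block training and confidence-based dynamic decoding, SDLM achieves efficient diffusion-style within-block generation while remaining KV-cache compatible.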
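The draft-then-verify idea behind speculative decoding can be sketched as follows. This is a minimal illustration under stated assumptions, not the SpecDiff implementation: `toy_draft` and `toy_verify` are hypothetical stand-ins for the diffusion drafter and the AR verifier, verification uses simple greedy matching rather than speculative sampling, and the verifier is called once per position for clarity, whereas a real AR model would typically score all draft positions in one forward pass.

```python
def speculative_step(prefix, draft_fn, verify_fn, k):
    """One draft-then-verify round; returns the tokens kept in this round."""
    draft = draft_fn(prefix, k)  # k candidate tokens proposed at once
    accepted = []
    for tok in draft:
        target = verify_fn(prefix + accepted)  # verifier's greedy next token
        if tok == target:
            accepted.append(tok)  # draft token confirmed
        else:
            accepted.append(target)  # correct the first mismatch, then stop
            break
    return accepted


def speculative_generate(prompt, draft_fn, verify_fn, k, max_new):
    """Repeat draft-then-verify rounds until max_new tokens are produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft_fn, verify_fn, k))
    return out[: len(prompt) + max_new]


# Hypothetical toy models (assumptions for illustration only).
def toy_verify(seq):
    """Verifier: the next token is the previous token plus one, modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft(prefix, k):
    """Drafter: agrees with the verifier except at the third drafted position."""
    toks = [(prefix[-1] + 1 + i) % 10 for i in range(k)]
    if k > 2:
        toks[2] = (toks[2] + 5) % 10  # deliberately wrong guess
    return toks


if __name__ == "__main__":
    print(speculative_generate([0], toy_draft, toy_verify, k=4, max_new=6))
```

With greedy verification the result matches what the verifier alone would produce token by token; the benefit comes from confirming several drafted tokens per round instead of one.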
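The surveyed DLMs range from under 1B to 8B parameters. For comparison, we also report the performance of representative AR models of similar size. Performance figures are taken mainly from the original papers; where the source paper gives no result, we refer to follow-up work that reports comparable evaluations. Our findings indicate that DLMs are broadly on par with AR models of the same size. On general language understanding benchmarks such as PIQA and HellaSwag, models such as LLaDA perform slightly below or on par with AR models such as LLaMA2 and Qwen2.5. However, DLMs are stronger on math and science benchmarks (including GSM8K, GPQA, and MATH), where models such as LLaDA and Dream consistently outperform AR counterparts of the same size. On multimodal tasks, models such as MMaDA and LLaDA-V often surpass AR-based multimodal models, highlighting the potential of DLMs for unified and cross-modal reasoning. DLMs are also competitive on code generation: notably, DiffuCoder achieves competitive HumanEval performance among open-source models, showing the potential of DLMs in structured, logic-intensive domains. In addition, closed-source DLMs such as Gemini Diffusion and Mercury achieve state-of-the-art results among all DLMs and rival top AR models such as GPT-4o.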
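Figure 6. Performance comparison on eight benchmarks: Overall-GenEval, MME, GQA, HellaSwag, PIQA, HumanEval, GSM8K, and MMMU. In each subplot, the horizontal axis shows model size (number of parameters) and the vertical axis shows the score on the corresponding benchmark; higher scores indicate better performance. Model type is distinguished by color: blue for AR language models and orange for DLMs.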
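Given that most current DLMs are trained with relatively limited data and compute, these results suggest that DLMs have strong potential as an alternative to AR models in many real-world applications. Recent scaling studies further indicate that DLMs tend to outperform AR models in data-constrained, multi-epoch training regimes, most likely because their any-order denoising objective reuses limited data more effectively.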
[1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
[2] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
[3] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “Palm: Scaling language modeling with pathways,” Journal of Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023.
[4] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
[5] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023.
[6] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language models,” arXiv preprint arXiv:2303.18223, vol. 1, no. 2, 2023.
[7] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi et al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” arXiv preprint arXiv:2501.12948, 2025.
[8] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695.
[9] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al., “Photorealistic text-to-image diffusion models with deep language understanding,” Advances in neural information processing systems, vol. 35, pp. 36 479–36 494, 2022.
[10] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach, “Sdxl: Improving latent diffusion models for high-resolution image synthesis,” in The Twelfth International Conference on Learning Representations.
[11] P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel et al., “Scaling rectified flow transformers for high-resolution image synthesis,” in Forty-first international conference on machine learning, 2024.
[12] T. Brooks, B. Peebles, C. Holmes, W. DePue, Y. Guo, L. Jing, D. Schnurr, J. Taylor, T. Luhman, E. Luhman et al., “Video generation models as world simulators,” OpenAI Blog, vol. 1, p. 8, 2024.
[13] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving language understanding by generative pre-training,” 2018.
[14] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
[15] G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023.
[16] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan et al., “Deepseek-v3 technical report,” arXiv preprint arXiv:2412.19437, 2024.
[17] P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” Advances in neural information processing systems, vol. 34, pp. 8780–8794, 2021.
[18] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020.
[19] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020.
[20] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” in International Conference on Learning Representations.
[21] X. Liu, C. Gong et al., “Flow straight and fast: Learning to generate and transfer data with rectified flow,” in The Eleventh International Conference on Learning Representations.
[22] X. Li, J. Thickstun, I. Gulrajani, P. S. Liang, and T. B. Hashimoto, “Diffusion-lm improves controllable text generation,” Advances in neural information processing systems, vol. 35, pp. 4328–4343, 2022.
[23] R. Strudel, C. Tallec, F. Altché, Y. Du, Y. Ganin, A. Mensch, W. Grathwohl, N. Savinov, S. Dieleman, L. Sifre et al., “Self-conditioned embedding diffusion for text generation,” arXiv preprint arXiv:2211.04236, 2022.
[24] J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. Van Den Berg, “Structured denoising diffusion models in discrete state-spaces,” Advances in neural information processing systems, vol. 34, pp. 17 981–17 993, 2021.
[25] Z. He, T. Sun, Q. Tang, K. Wang, X. Huang, and X. Qiu, “Diffusionbert: Improving generative masked language models with diffusion models,” in The 61st Annual Meeting Of The Association For Computational Linguistics, 2023.
[26] J. Ye, Z. Xie, L. Zheng, J. Gao, Z. Wu, X. Jiang, Z. Li, and L. Kong, “Dream 7b,” 2025. [Online]. Available: https://hkunlp.github.io/blog/2025/dream
[27] S. Gong, S. Agarwal, Y. Zhang, J. Ye, L. Zheng, M. Li, C. An, P. Zhao, W. Bi, J. Han et al., “Scaling diffusion language models via adaptation from autoregressive models,” in The Thirteenth International Conference on Learning Representations.
[28] S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. Zhou, Y. Lin, J.-R. Wen, and C. Li, “Large language diffusion models,” arXiv preprint arXiv:2502.09992, 2025.
[29] Z. You, S. Nie, X. Zhang, J. Hu, J. Zhou, Z. Lu, J.-R. Wen, and C. Li, “Llada-v: Large language diffusion models with visual instruction tuning,” arXiv preprint arXiv:2505.16933, 2025.
[30] R. Yu, X. Ma, and X. Wang, “Dimple: Discrete diffusion multimodal large language model with parallel decoding,” arXiv preprint arXiv:2505.16990, 2025.
[31] L. Yang, Y. Tian, B. Li, X. Zhang, K. Shen, Y. Tong, and M. Wang, “Mmada: Multimodal large diffusion language models,” arXiv preprint arXiv:2505.15809, 2025.
[32] I. Labs, S. Khanna, S. Kharbanda, S. Li, H. Varma, E. Wang, S. Birnbaum, Z. Luo, Y. Miraoui, A. Palrecha et al., “Mercury: Ultra-fast language models based on diffusion,” arXiv preprint arXiv:2506.17298, 2025.
[34] Y. Song, Z. Zhang, C. Luo, P. Gao, F. Xia, H. Luo, Z. Li, Y. Yang, H. Yu, X. Qu et al., “Seed diffusion: A large-scale diffusion language model with high-speed inference,” arXiv preprint arXiv:2508.02193, 2025.
[35] M. Xu, T. Geffner, K. Kreis, W. Nie, Y. Xu, J. Leskovec, S. Ermon, and A. Vahdat, “Energy-based diffusion language models for text generation,” arXiv preprint arXiv:2410.21357, 2024.
[36] J. Deschenaux and C. Gulcehre, “Beyond autoregression: Fast llms via self-distillation through time,” in The Thirteenth International Conference on Learning Representations.
[37] K. Han, K. Kenealy, A. Barua, N. Fiedel, and N. Constant, “Transfer learning for text diffusion models,” arXiv preprint arXiv:2401.17181, 2024.
[38] S. S. Sahoo, J. Deschenaux, A. Gokaslan, G. Wang, J. Chiu, and V. Kuleshov, “The diffusion duality,” arXiv preprint arXiv:2506.10892, 2025.
[39] Y. Zhang, S. He, D. Levine, L. Zhao, D. Zhang, S. A. Rizvi, E. Zappala, R. Ying, and D. van Dijk, “Non-markovian discrete diffusion with causal language models,” arXiv preprint arXiv:2502.09767, 2025.
[40] M. Dang, J. Han, M. Xu, K. Xu, A. Srivastava, and S. Ermon, “Inference-time scaling of diffusion language models with particle gibbs sampling,” arXiv preprint arXiv:2507.08390, 2025.
[41] L. Rout, C. Caramanis, and S. Shakkottai, “Anchored diffusion language model,” arXiv preprint arXiv:2505.18456, 2025.
[42] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu et al., “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,” arXiv preprint arXiv:2402.03300, 2024.
[43] S. Zhao, D. Gupta, Q. Zheng, and A. Grover, “d1: Scaling reasoning in diffusion large language models via reinforcement learning,” arXiv preprint arXiv:2504.12216, 2025.
[44] T. Chen, S. Zhang, and M. Zhou, “Dlm-one: Diffusion language models for one-step sequence generation,” arXiv e-prints, pp. arXiv–2506, 2025.
[45] C. Wu, H. Zhang, S. Xue, Z. Liu, S. Diao, L. Zhu, P. Luo, S. Han, and E. Xie, “Fast-dllm: Training-free acceleration of diffusion llm by enabling kv cache and parallel decoding,” arXiv preprint arXiv:2505.22618, 2025.
[46] D. Israel, G. V. d. Broeck, and A. Grover, “Accelerating diffusion llms via adaptive parallel decoding,” arXiv preprint arXiv:2506.00413, 2025.
[47] G. Wang, Y. Schiff, S. S. Sahoo, and V. Kuleshov, “Remasking discrete diffusion models with inference-time scaling,” in ICLR 2025 Workshop on Deep Generative Model in Machine Learning: Theory, Principle and Efficacy.
[48] Z. Liu, Y. Yang, Y. Zhang, J. Chen, C. Zou, Q. Wei, S. Wang, and L. Zhang, “dllm-cache: Accelerating diffusion large language models with adaptive caching,” arXiv preprint arXiv:2506.06295, 2025.
[49] X. Ma, R. Yu, G. Fang, and X. Wang, “dkv-cache: The cache for diffusion language models,” arXiv preprint arXiv:2505.15781, 2025.
[50] G. Liu, Z. Feng, Y. Gao, Z. Yang, X. Liang, J. Bao, X. He, S. Cui, Z. Li, and Z. Hu, “Composable text controls in latent space with odes,” arXiv preprint arXiv:2208.00638, 2022.
[51] S. Gong, M. Li, J. Feng, Z. Wu, and L. Kong, “Diffuseq: Sequence to sequence text generation with diffusion models,” in The Eleventh International Conference on Learning Representations.
[52] S. Dieleman, L. Sartran, A. Roshannai, N. Savinov, Y. Ganin, P. H. Richemond, A. Doucet, R. Strudel, C. Dyer, C. Durkan et al., “Continuous diffusion for categorical data,” arXiv preprint arXiv:2211.15089, 2022.
[53] Z. Gao, J. Guo, X. Tan, Y. Zhu, F. Zhang, J. Bian, and L. Xu, “Empowering diffusion models on the embedding space for text generation,” arXiv preprint arXiv:2212.09412, 2022.
[54] J. Lovelace, V. Kishore, C. Wan, E. Shekhtman, and K. Q. Weinberger, “Latent diffusion for language generation,” Advances in Neural Information Processing Systems, vol. 36, pp. 56 998–57 025, 2023.
[55] Z. Lin, Y. Gong, Y. Shen, T. Wu, Z. Fan, C. Lin, N. Duan, and W. Chen, “Text generation with diffusion language models: A pre-training approach with continuous paragraph denoise,” in International Conference on Machine Learning. PMLR, 2023, pp. 21 051–21 064.
[56] R. Wang, J. Li, and P. Li, “Infodiffusion: Information entropy aware diffusion process for non-autoregressive text generation,” in Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 13 757–13 770.
[57] G. Liu, Y. Wang, Z. Feng, Q. Wu, L. Tang, Y. Gao, Z. Li, S. Cui, J. Mcauley, Z. Yang et al., “Unified generation, reconstruction, and representation: Generalized diffusion with adaptive latent encoding-decoding,” in International Conference on Machine Learning. PMLR, 2024, pp. 31 964–31 993.
[58] A. Shabalin, V. Meshchaninov, and D. Vetrov, “Smoothie: Smoothing diffusion on token embeddings for text generation,” arXiv preprint arXiv:2505.18853, 2025.
[59] R. K. Mahabadi, H. Ivison, J. Tae, J. Henderson, I. Beltagy, M. E. Peters, and A. Cohan, “Tess: Text-to-text self-conditioned simplex diffusion,” in Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 2347–2361.
[60] J. Tae, H. Ivison, S. Kumar, and A. Cohan, “Tess 2: A large-scale generalist diffusion language model,” arXiv preprint arXiv:2502.13917, 2025.
[61] P. Yu, S. Xie, X. Ma, B. Jia, B. Pang, R. Gao, Y. Zhu, S.-C. Zhu, and Y. N. Wu, “Latent diffusion energy-based model for interpretable text modelling,” in International Conference on Machine Learning. PMLR, 2022, pp. 25 702–25 720.
[62] Y. Chen, C. Liang, H. Sui, R. Guo, C. Cheng, J. You, and G. Liu, “Langflow: Continuous diffusion rivals discrete in language modeling,” arXiv preprint arXiv:2604.11748, 2026.
[63] K. Hu, L. Qiu, Y. Lu, H. Zhao, T. Li, Y. Kim, J. Andreas, and K. He, “Elf: Embedded language flows,” arXiv preprint arXiv:2605.10938, 2026.
[64] S. Zhuang, Y. Ai, J. Han, X. Li, H. Huang, X. Yue, X. Hu, K. Xu, Y. Wang, and H. Chen, “Bitlm: Unlocking multi-token language generation with bitwise continuous diffusion,” arXiv preprint arXiv:2605.11577, 2026.
[65] C. Lee, J. Yoo, M. Agarwal, S. Shah, J. Huang, A. Raghunathan, S. Hong, N. M. Boffi, and J. Kim, “Flow map language models: One-step language modeling via continuous denoising,” arXiv preprint arXiv:2602.16813, 2026.
[66] H. Guo, Q. Zhao, Y. Zhao, S. Nie, R. Zhu, Q. Guo, F. Wang, T. Yang, H. Zhao, G. Wei et al., “Continuous latent diffusion language model,” arXiv preprint arXiv:2605.06548, 2026.
[67] L. Zheng, J. Yuan, L. Yu, and L. Kong, “A reparameterized discrete diffusion model for text generation,” in First Conference on Language Modeling.
[68] J. Shi, K. Han, Z. Wang, A. Doucet, and M. Titsias, “Simplified and generalized masked diffusion for discrete data,” Advances in neural information processing systems, vol. 37, pp. 103 131–103 167, 2024.
[69] S. Sahoo, M. Arriola, Y. Schiff, A. Gokaslan, E. Marroquin, J. Chiu, A. Rush, and V. Kuleshov, “Simple and effective masked diffusion language models,” Advances in Neural Information Processing Systems, vol. 37, pp. 130 136–130 184, 2024.
[70] J. Ye, Z. Zheng, Y. Bao, L. Qian, and Q. Gu, “Diffusion language models can perform many tasks with scaling and instruction-finetuning,” arXiv preprint arXiv:2308.12219, 2023.
[71] K. Zhou, Y. Li, W. X. Zhao, and J.-R. Wen, “Diffusion-nat: Self-prompting discrete diffusion for non-autoregressive text generation,” in Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 1438–1451.
[72] I. Gulrajani and T. B. Hashimoto, “Likelihood-based diffusion language models,” Advances in Neural Information Processing Systems, vol. 36, pp. 16 693–16 715, 2023.
[73] A. Lou, C. Meng, and S. Ermon, “Discrete diffusion modeling by estimating the ratios of the data distribution,” in International Conference on Machine Learning. PMLR, 2024, pp. 32 819–32 848.
[74] J. Ou, S. Nie, K. Xue, F. Zhu, J. Sun, Z. Li, and C. Li, “Your absorbing discrete diffusion secretly models the conditional distributions of clean data,” in The Thirteenth International Conference on Learning Representations, 2024.
[75] I. Gat, T. Remez, N. Shaul, F. Kreuk, R. T. Chen, G. Synnaeve, Y. Adi, and Y. Lipman, “Discrete flow matching,” Advances in Neural Information Processing Systems, vol. 37, pp. 133 345–133 385, 2024.
[76] S. Liu, J. Nam, A. Campbell, H. Stark, Y. Xu, T. Jaakkola, and R. Gomez-Bombarelli, “Think while you generate: Discrete diffusion with planned denoising,” in The Thirteenth International Conference on Learning Representations, 2024.
[77] J. Ye, J. Gao, S. Gong, L. Zheng, X. Jiang, Z. Li, and L. Kong, “Beyond autoregression: Discrete diffusion for complex reasoning and planning,” arXiv preprint arXiv:2410.14157, 2024.
[78] D. von Rütte, J. Fluri, Y. Ding, A. Orvieto, B. Schölkopf, and T. Hofmann, “Generalized interpolating discrete diffusion,” in Forty-second International Conference on Machine Learning, 2025.
[79] X. Liu, Z. Liu, Z. Huang, Q. Guo, Z. He, and X. Qiu, “Longllada: Unlocking long context capabilities in diffusion llms,” arXiv preprint arXiv:2506.14429, 2025.
[80] F. Zhu, Z. You, Y. Xing, Z. Huang, L. Liu, Y. Zhuang, G. Lu, K. Wang, X. Wang, L. Wei et al., “Llada-moe: A sparse moe diffusion language model,” arXiv preprint arXiv:2509.24389, 2025.
[81] T. Bie, M. Cao, K. Chen, L. Du, M. Gong, Z. Gong, Y. Gu, J. Hu, Z. Huang, Z. Lan et al., “Llada2.0: Scaling up diffusion language models to 100b,” arXiv preprint arXiv:2512.15745, 2025.
[82] X. Han, S. Kumar, and Y. Tsvetkov, “Ssd-lm: Semi-autoregressive simplex-based diffusion language model for text generation and modular control,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 11 575–11 596.
[83] T. Wu, Z. Fan, X. Liu, H.-T. Zheng, Y. Gong, J. Jiao, J. Li, J. Guo, N. Duan, W. Chen et al., “Ar-diffusion: Auto-regressive diffusion model for text generation,” Advances in Neural Information Processing Systems, vol. 36, pp. 39 957–39 974, 2023.
[84] M. Arriola, A. Gokaslan, J. T. Chiu, Z. Yang, Z. Qi, J. Han, S. S. Sahoo, and V. Kuleshov, “Block diffusion: Interpolating between autoregressive and diffusion language models,” in The Thirteenth International Conference on Learning Representations.
[85] C. Huang and H. Tang, “Ctrldiff: Boosting large diffusion language models with dynamic block prediction and controllable generation,” arXiv preprint arXiv:2505.14455, 2025.
[86] J. K. Christopher, B. R. Bartoldson, T. Ben-Nun, M. Cardei, B. Kailkhura, and F. Fioretto, “Speculative diffusion decoding: Accelerating language generation through diffusion,” in Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025, pp. 12 042–12 059.
[87] S. Cheng, Y. Bian, D. Liu, L. Zhang, Q. Yao, Z. Tian, W. Wang, Q. Guo, K. Chen, B. Qi et al., “Sdar: A synergistic diffusion-autoregression paradigm for scalable sequence generation,” arXiv preprint arXiv:2510.06303, 2025.
[88] J. Liu, X. Dong, Z. Ye, R. Mehta, Y. Fu, V. Singh, J. Kautz, C. Zhang, and P. Molchanov, “Tidar: Think in diffusion, talk in autoregression,” arXiv preprint arXiv:2511.08923, 2025.
[89] Y. Liu, Y. Cao, H. Li, G. Luo, Z. Chen, W. Wang, X. Liang, B. Qi, L. Wu, C. Tian et al., “Sequential diffusion language models,” arXiv preprint arXiv:2509.24007, 2025.
[90] Y. Tian, Y. Liang, S. Zhang, Y. Shu, G. Yang, W. He, S. Fang, T. Guo, K. Han, C. Xu et al., “From next-token to next-block: A principled adaptation path for diffusion llms,” arXiv preprint arXiv:2512.06776, 2025.
[91] Y. Fu, L. Whalen, Z. Ye, X. Dong, S. Diao, J. Liu, C. Wu, H. Zhang, E. Xie, S. Han et al., “Efficient-dlm: From autoregressive to diffusion language models, and beyond in speed,” arXiv preprint arXiv:2512.14067, 2025.
[92] Y. Yu, Y. Jian, J. Wang, Z. Zhou, D. Zhuang, X. Fang, S. Yanamandra, X. Wu, Q. Wu, S. L. Song et al., “Introspective diffusion language models,” arXiv preprint arXiv:2604.11035, 2026.
[93] J.-N. Li, J. Guan, W. Wu, and C. Li, “Refusion: A diffusion large language model with parallel autoregressive decoding,” arXiv preprint arXiv:2512.13586, 2025.
[94] J. Ruan, B. Li, Y. Yin, P. Huang, X. Chen, J. Wang, X. Cai, T. Xiao, and J. Zhu, “Causal autoregressive diffusion language model,” arXiv preprint arXiv:2601.22031, 2026.
[95] Z. Li, H. Li, Y. Shi, A. B. Farimani, Y. Kluger, L. Yang, and P. Wang, “Dual diffusion for unified image generation and understanding,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 2779–2790.
[96] Q. Shi, J. Bai, Z. Zhao, W. Chai, K. Yu, J. Wu, S. Song, Y. Tong, X. Li, X. Li et al., “Muddit: Liberating generation beyond text-to-image with a unified discrete diffusion model,” arXiv preprint arXiv:2505.23606, 2025.
[97] J. Ye, S. Gong, L. Chen, L. Zheng, J. Gao, H. Shi, C. Wu, X. Jiang, Z. Li, W. Bi et al., “Diffusion of thought: Chain-of-thought reasoning in diffusion language models,” Advances in Neural Information Processing Systems, vol. 37, pp. 105 345–105 374, 2024.
[98] Z. Huang, Z. Chen, Z. Wang, T. Li, and G.-J. Qi, “Reinforcing the diffusion chain of lateral thought with diffusion language models,” arXiv preprint arXiv:2505.10446, 2025.
[99] O. Zekri and N. Boullé, “Fine-tuning discrete diffusion models with policy gradient methods,” arXiv preprint arXiv:2502.01384, 2025.
[100] S. Gong, R. Zhang, H. Zheng, J. Gu, N. Jaitly, L. Kong, and Y. Zhang, “Diffucoder: Understanding and improving masked diffusion models for code generation,” arXiv preprint arXiv:2506.20639, 2025.
[101] X. Tang, R. Dolga, S. Yoon, and I. Bogunovic, “wd1: Weighted policy optimization for reasoning in diffusion language models,” arXiv preprint arXiv:2507.08838, 2025.
[102] S. Zhao, M. Liu, J. Huang, M. Liu, C. Wang, B. Liu, Y. Tian, G. Pang, S. Bell, A. Grover et al., “Inpainting-guided policy optimization for diffusion large language models,” arXiv preprint arXiv:2509.10396, 2025.
[103] C. Wang, P. Rashidinejad, D. Su, S. Jiang, S. Wang, S. Zhao, C. Zhou, S. Z. Shen, F. Chen, T. Jaakkola et al., “Spg: Sandwiched policy gradient for masked diffusion language models,” arXiv preprint arXiv:2510.09541, 2025.
[104] S. Xie, L. Kong, X. Song, X. Dong, G. Chen, E. P. Xing, and K. Zhang, “Step-aware policy optimization for reasoning in diffusion large language models,” arXiv preprint arXiv:2510.01544, 2025.
[105] N. Lin, J. Zhang, L. Hou, and J. Li, “Boundary-guided policy optimization for memory-efficient rl of diffusion large language models,” arXiv preprint arXiv:2510.11683, 2025.
[106] Z. Ni, S. Wang, Y. Yue, T. Yu, W. Zhao, Y. Hua, T. Chen, J. Song, C. Yu, B. Zheng et al., “The flexibility trap: Why arbitrary order limits reasoning potential in diffusion language models,” arXiv preprint arXiv:2601.15165, 2026.
[107] F. Zhu, R. Wang, S. Nie, X. Zhang, C. Wu, J. Hu, J. Zhou, J. Chen, Y. Lin, J.-R. Wen et al., “Llada 1.5: Variance-reduced preference optimization for large language diffusion models,” arXiv preprint arXiv:2505.19223, 2025.
[108] Q. Wei, Y. Zhang, Z. Liu, D. Liu, and L. Zhang, “Accelerating diffusion large language models with slowfast: The three golden principles,” arXiv preprint arXiv:2506.10848, 2025.
[109] W. Bao, Z. Chen, D. Xu, and Y. Shang, “Learning to parallel: Accelerating diffusion large language models via learnable parallel decoding,” arXiv preprint arXiv:2509.25188, 2025.
[110] Z. Chen, G. Fang, X. Ma, R. Yu, and X. Wang, “dparallel: Learnable parallel decoding for dllms,” arXiv preprint arXiv:2509.26488, 2025.
[111] J. Chen, Y. Liang, and Z. Liu, “Dflash: Block diffusion for flash speculative decoding,” arXiv preprint arXiv:2602.06036, 2026.
[112] Z. Chen, G. Fang, X. Ma, R. Yu, and X. Wang, “Dmax: Aggressive parallel decoding for dllms,” arXiv preprint arXiv:2604.08302, 2026.
[113] P. Li, D. Muhtar, T. Chen, L. Yin, and S. Liu, “Why diffusion language models struggle with truly parallel (non-autoregressive) decoding?” arXiv preprint arXiv:2602.23225, 2026.
[114] P. Li, S. Yan, J. Tsai, R. Zhang, R. An, Z. Guo, and X. Gao, “Adaptive classifier-free guidance via dynamic low-confidence masking,” arXiv preprint arXiv:2505.20199, 2025.
[115] Z. Hu, J. Meng, Y. Akhauri, M. S. Abdelfattah, J.-s. Seo, Z. Zhang, and U. Gupta, “Accelerating diffusion language model inference via efficient kv caching and guided diffusion,” arXiv preprint arXiv:2505.21467, 2025.
[116] T. Suresh, D. Banerjee, S. Ugare, S. Misailovic, and G. Singh, “Dingo: Constrained inference for diffusion llms,” in ICML 2025 Workshop on Reliable and Responsible Foundation Models.
[117] Q. Nguyen-Tri, M. Ranjan, and Z. Shen, “Attention is all you need for kv cache in diffusion llms,” arXiv preprint arXiv:2510.14973, 2025.
[118] Y. Jiang, Y. Cai, X. Luo, J. Fu, J. Wang, C. Liu, and X. Yang, “d²cache: Accelerating diffusion-based llms via dual adaptive caching,” arXiv preprint arXiv:2509.23094, 2025.
[119] X. Ma, G. Fang, and X. Wang, “Deepcache: Accelerating diffusion models for free,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 15 762–15 772.
[120] P. Chen, M. Shen, P. Ye, J. Cao, C. Tu, C.-S. Bouganis, Y. Zhao, and T. Chen, “∆-dit: A training-free acceleration method tailored for diffusion transformers,” arXiv preprint arXiv:2406.01125, 2024.
[121] X. Ma, G. Fang, M. Bi Mi, and X. Wang, “Learning-to-cache: Accelerating diffusion transformer via layer caching,” Advances in Neural Information Processing Systems, vol. 37, pp. 133 282–133 304, 2024.
[122] Z. Lv, C. Si, J. Song, Z. Yang, Y. Qiao, Z. Liu, and K.-Y. K. Wong, “Fastercache: Training-free video diffusion model acceleration with high quality,” in The Thirteenth International Conference on Learning Representations.
[123] S. Hayakawa, Y. Takida, M. Imaizumi, H. Wakaki, and Y. Mitsufuji, “Distillation of discrete diffusion through dimensional correlations,” in Forty-second International Conference on Machine Learning.
[124] T. Salimans and J. Ho, “Progressive distillation for fast sampling of diffusion models,” in International Conference on Learning Representations.
[125] Y.-Y. Qian, J. Su, L. Hu, P. Zhang, Z. Deng, P. Zhao, and H. Zhang, “d3llm: Ultra-fast diffusion llm using pseudo-trajectory distillation,” arXiv preprint arXiv:2601.07568, 2026.
[126] A. Myrzakhan, T. Li, B. Guo, S. Tang, and Z. Shen, “Sink-aware pruning for diffusion language models,” arXiv preprint arXiv:2602.17664, 2026.
[127] S. Li, K. Kallidromitis, H. Bansal, A. Gokul, Y. Kato, K. Kozuka, J. Kuen, Z. Lin, K.-W. Chang, and A. Grover, “Lavida: A large diffusion language model for multimodal understanding,” arXiv preprint arXiv:2505.16839, 2025.
[128] J. Wang, Y. Lai, A. Li, S. Zhang, J. Sun, N. Kang, C. Wu, Z. Li, and P. Luo, “Fudoki: Discrete flow-based unified understanding and generation via kinetic-optimal velocities,” arXiv preprint arXiv:2505.20147, 2025.
[129] A. Swerdlow, M. Prabhudesai, S. Gandhi, D. Pathak, and K. Fragkiadaki, “Unified multimodal discrete diffusion,” arXiv preprint arXiv:2503.20853, 2025.
[130] Y. Xin, Q. Qin, S. Luo, K. Zhu, J. Yan, Y. Tai, J. Lei, Y. Cao, K. Wang, Y. Wang et al., “Lumina-dimoo: An omni diffusion large language model for multi-modal generation and understanding,” arXiv preprint arXiv:2510.06308, 2025.
[131] S. Li, J. Gu, K. Liu, Z. Lin, Z. Wei, A. Grover, and J. Kuen, “Lavida-o: Elastic large masked diffusion models for unified multimodal understanding and generation,” arXiv preprint arXiv:2509.19244, 2025.
[132] Y. Tian, L. Yang, J. Yang, A. Wang, Y. Tian, J. Zheng, H. Wang, Z. Teng, Z. Wang, Y. Wang et al., “Mmada-parallel: Multimodal large diffusion language models for thinking-aware editing and generation,” arXiv preprint arXiv:2511.09611, 2025.
[133] L. Zeng, J. Yao, B. Liao, H. Tao, W. Liu, and X. Wang, “Diffusionvl: Translating any autoregressive models into diffusion vision language models,” arXiv preprint arXiv:2512.15713, 2025.
[134] C. Wu, S. Lan, Y. Fu, S. Gao, J. Wang, J. Yu, J. M. Alvarez, P. Molchanov, P. Luo, S. Han et al., “Fast-dvlm: Efficient block-diffusion vlm via direct conversion from autoregressive vlm,” arXiv preprint arXiv:2604.06832, 2026.
[135] Z. He, T. Chen, K. Wang, Z. Qin, Y. Shao, C. Gan, S. Li, Z. Wu, and W. Lin, “Vidlada: Bidirectional diffusion large language models for efficient video understanding,” arXiv preprint arXiv:2601.17868, 2026.
[136] S. Yuan, W. Yuan, H. Yin, and T. He, “Roic-dm: Robust text inference and classification via diffusion model,” arXiv preprint arXiv:2401.03514, 2024.
[137] Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang, “Diffusionner: Boundary diffusion for named entity recognition,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 3875–3890.
[138] X. Yang, Z. Qiao, and Y. Zhou, “Ipad: Iterative, parallel, and diffusion-based network for scene text recognition,” International Journal of Computer Vision, pp. 1–21, 2025.
[139] S. Liu, J. Zhou, Q. Zhu, Q. Chen, Q. Bai, J. Xiao, and L. He, “Let’s rectify step by step: Improving aspect-based sentiment analysis with diffusion models,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024, pp. 10 324–10 335.
[140] H. Zhang, X. Liu, and J. Zhang, “Diffusum: Generation enhanced extractive summarization with diffusion,” in Findings of the Association for Computational Linguistics: ACL 2023, 2023, pp. 13 089–13 100.
[141] X. Dong, W. Li, Y. Le, Z. Jiang, J. Zhong, and Z. Wang, “Termdiffusum: a term-guided diffusion model for extractive summarization of legal documents,” in Proceedings of the 31st international conference on computational linguistics, 2025, pp. 3222–3235.
[142] Y. Luo, Q. Zhou, and F. Zhou, “Enhancing phrase representation by information bottleneck guided text diffusion process for keyphrase extraction,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024, pp. 6036–6047.
[143] J. Zhao, C. Xu, and B. Jiang, “Iped: An implicit perspective for relational triple extraction based on diffusion model,” in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024, pp. 2080–2092.
[144] C. H. Lee, H. Kim, J. Yeom, and S. Yoon, “Editext: Controllable coarse-to-fine text editing with diffusion language models,” arXiv preprint arXiv:2502.19765, 2025.
[145] G. Bi, L. Shen, Y. Cao, M. Chen, Y. Xie, Z. Lin, and X. He, “Diffusemp: A diffusion model-based framework with multi-grained control for empathetic response generation,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 2812–2831.
[146] G. Floto, M. M. A. Pour, P. Farinneya, Z. Tang, A. Pesaranghader, M. Bharadwaj, and S. Sanner, “Diffudetox: A mixed diffusion model for text detoxification,” in Findings of the Association for Computational Linguistics: ACL 2023, 2023, pp. 7566–7574.
[147] Z. Horvitz, A. Patel, C. Callison-Burch, Z. Yu, and K. McKeown, “Paraguide: Guided diffusion paraphrasers for plug-and-play textual style transfer,” in Proceedings of the AAAI conference on artificial intelligence, vol. 38, no. 16, 2024, pp. 18 216–18 224.
[148] Y. Zhang, J. Gu, Z. Wu, S. Zhai, J. Susskind, and N. Jaitly, “Planner: Generating diversified paragraph via latent language diffusion model,” Advances in Neural Information Processing Systems, vol. 36, pp. 80 178–80 190, 2023.
[149] J. Liu, P. Cheng, J. Dai, and J. Liu, “Diffucom: A novel diffusion model for comment generation,” Knowledge-Based Systems, vol. 281, p. 111069, 2023.
[150] J. Xiang, Z. Liu, H. Liu, Y. Bai, J. Cheng, and W. Chen, “Diffusiondialog: A diffusion model for diverse dialog generation with latent space,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024, pp. 4912–4921.
[151] W. Zou, Z. Zhuang, X. Geng, S. Huang, J. Liu, and J. Chen, “Improved paraphrase generation via controllable latent diffusion,” arXiv preprint arXiv:2404.08938, 2024.
[152] Z. Hu, C. Liu, Y. Feng, A. T. Luu, and B. Hooi, “Poetrydiffusion: Towards joint semantic and metrical manipulation in poetry generation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 16, 2024, pp. 18 279–18 288.
[153] L. Chen, A. Feng, B. Yang, and Z. Li, “Xdlm: Cross-lingual diffusion language model for machine translation,” arXiv preprint arXiv:2307.13560, 2023.
[154] S. Qiao, X. Liu, and S.-H. Na, “Diffusionret: Diffusion-enhanced generative retriever using constrained decoding,” in Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 9515–9529.
[155] K. Yan, M. Liu, Y. Liu, R. Fu, Z. Wen, J. Tao, and X. Liu, “Debunk and infer: Multimodal fake news detection via diffusion-generated evidence and llm reasoning,” arXiv preprint arXiv:2506.21557, 2025.
[156] O. Luxembourg, H. Permuter, and E. Nachmani, “Plan for speed – dilated scheduling for masked diffusion language models,” arXiv preprint arXiv:2506.19037, 2025.
[157] C. Fan, W. Heng, B. Li, S. Liu, Y. Song, J. Su, X. Qu, K. Shen, and W. Wei, “Stable-diffcoder: Pushing the frontier of code diffusion large language model,” arXiv preprint arXiv:2601.15892, 2026.
[158] H. Bai, L. Kong, X. Chen, J. Wang, Z. Tao, and H. Wang, “Dice: Diffusion large language models excel at generating cuda kernels,” arXiv preprint arXiv:2602.11715, 2026.
[159] Y. Xiong, K. Li, J. Chen, H. Zhang, D. Lin, Y. Che, and W. Hu, “Text-guided multi-property molecular optimization with a diffusion language model,” arXiv preprint arXiv:2410.13597, 2024.
[160] H. Gong, Q. Liu, S. Wu, and L. Wang, “Text-guided molecule generation with diffusion language model,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 1, 2024, pp. 109–117.
[161] S. Goel, V. Thoutam, E. M. Marroquin, A. Gokaslan, A. Firouzbakht, S. Vincoff, V. Kuleshov, H. T. Kratochvil, and P. Chatterjee, “Memdlm: De novo membrane protein design with masked discrete diffusion protein language models,” in NeurIPS 2024 Workshop on AI for New Drug Modalities.
[162] X. Wang, Z. Zheng, D. Xue, S. Huang, Q. Gu et al., “Diffusion language models are versatile protein learners,” in Forty-first International Conference on Machine Learning.
[163] J. Yin, C. Zha, W. He, C. Xu, and X. Gao, “Cfp-gen: Combinatorial functional protein generation via diffusion language models,” in Forty-second International Conference on Machine Learning.
[164] C. Wang, M. Uehara, Y. He, A. Wang, A. Lal, T. Jaakkola, S. Levine, A. Regev, T. Biancalani et al., “Fine-tuning discrete diffusion models via reward optimization with applications to dna and protein design,” in The Thirteenth International Conference on Learning Representations.
[165] B. Ni, D. L. Kaplan, and M. J. Buehler, “Forcegen: End-to-end de novo protein generation based on nonlinear mechanical unfolding responses using a language diffusion model,” Science Advances, vol. 10, no. 6, p. eadl4000, 2024.
[166] L. Hallee, N. Rafailidis, D. B. Bichara, and J. P. Gleghorn, “Diffusion sequence models for enhanced protein representation and generation,” arXiv preprint arXiv:2506.08293, 2025.
[167] X. Wang, Z. Zheng, F. Ye, D. Xue, S. Huang, and Q. Gu, “Dplm-2: A multimodal diffusion protein language model,” arXiv preprint arXiv:2410.13782, 2024.
[168] Y. Wen, H. Li, K. Gu, Y. Zhao, T. Wang, and X. Sun, “Llada-vla: Vision language diffusion action models,” arXiv preprint arXiv:2509.06932, 2025.
[169] J. Wen, M. Zhu, J. Liu, Z. Liu, Y. Yang, L. Zhang, S. Zhang, Y. Zhu, and Y. Xu, “dvla: Diffusion vision-language-action model with multimodal chain-of-thought,” arXiv preprint arXiv:2509.25681, 2025.
[170] J. Chen, W. Song, P. Ding, Z. Zhou, H. Zhao, F. Tang, D. Wang, and H. Li, “Unified diffusion vla: Vision-language-action model via joint discrete denoising diffusion process,” arXiv preprint arXiv:2511.01718, 2025.
[171] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in International conference on machine learning. PMLR, 2015, pp. 2256–2265.
[172] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pretraining of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186.
[173] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
[174] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language representations,” in International Conference on Learning Representations.
[175] P. He, X. Liu, J. Gao, and W. Chen, “Deberta: Decoding-enhanced bert with disentangled attention,” in International Conference on Learning Representations.
[176] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, “Transformer-xl: Attentive language models beyond a fixed-length context,” arXiv preprint arXiv:1901.02860, 2019.
[177] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin et al., “Opt: Open pre-trained transformer language models,” arXiv preprint arXiv:2205.01068, 2022.
[178] F. Gloeckle, B. Y. Idrissi, B. Roziere, D. Lopez-Paz, and G. Synnaeve, “Better & faster large language models via multi-token prediction,” in Forty-first International Conference on Machine Learning.
[179] R. Chen, W. Chai, Z. Yang, X. Zhang, J. T. Zhou, T. Quek, S. Poria, and Z. Liu, “Diffpo: Diffusion-styled preference optimization for efficient inference-time alignment of large language models,” arXiv preprint arXiv:2503.04240, 2025.
[180] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” Advances in neural information processing systems, vol. 27, 2014.
[181] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of machine learning research, vol. 21, no. 140, pp. 1–67, 2020.
[182] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 7871–7880.
[183] H. Yuan, Z. Yuan, C. Tan, F. Huang, and S. Huang, “Seqdiffuseq: Text diffusion with encoder-decoder transformers,” arXiv preprint arXiv:2212.10325, 2022.
[184] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, “Xlnet: Generalized autoregressive pretraining for language understanding,” Advances in neural information processing systems, vol. 32, 2019.
[185] T. Chen, R. Zhang, and G. Hinton, “Analog bits: Generating discrete data using diffusion models with self-conditioning,” arXiv preprint arXiv:2208.04202, 2022.
[186] Q. Team, “Qwen2.5: A party of foundation models,” September 2024. [Online]. Available: https://qwenlm.github.io/blog/qwen2.5/
[187] G. He, S. Nie, F. Zhu, Y. Zhao, T. Bai, R. Yan, J. Fu, C. Li, and B. Yuan, “Ultrallada: Scaling the context length to 128k for diffusion large language models,” arXiv preprint arXiv:2510.10481, 2025.
[188] X. Zhu, G. Karadzhov, C. Whitehouse, and A. Vlachos, “Segment-level diffusion: A framework for controllable long-form generation with diffusion language models,” arXiv preprint arXiv:2412.11333, 2024.
[189] Z. Ye, E. M. Yang, and P. Blunsom, “Latent diffusion for document generation with sequential decoding,” in NeurIPS 2023 Workshop on Diffusion Models, 2023. [Online]. Available: https://neurips.cc/virtual/2023/74876
[190] E. Cetin, T. Zhao, and Y. Tang, “Large language models to diffusion finetuning,” arXiv preprint arXiv:2501.15781, 2025.
[191] J. Bai, T. Ye, W. Chow, E. Song, Q.-G. Chen, X. Li, Z. Dong, L. Zhu, and S. Yan, “Meissonic: Revitalizing masked generative transformers for efficient high-resolution text-to-image synthesis,” in The Thirteenth International Conference on Learning Representations, 2024.
[192] J. Ni, Q. Liu, C. Du, L. Dou, H. Yan, Z. Wang, T. Pang, and M. Q. Shieh, “Training optimal large diffusion language models,” arXiv preprint arXiv:2510.03280, 2025.
[193] J. Ni, Q. Liu, L. Dou, C. Du, Z. Wang, H. Yan, T. Pang, and M. Q. Shieh, “Diffusion language models are super data learners,” arXiv preprint arXiv:2511.03276, 2025.
[194] M. Asada and M. Miwa, “Addressing the training-inference discrepancy in discrete diffusion for text generation,” in Proceedings of the 31st International Conference on Computational Linguistics, 2025, pp. 7156–7164.
[195] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[196] A. Sauer, D. Lorenz, A. Blattmann, and R. Rombach, “Adversarial diffusion distillation,” in European Conference on Computer Vision. Springer, 2024, pp. 87–103.
[197] A. Sauer, F. Boesel, T. Dockhorn, A. Blattmann, P. Esser, and R. Rombach, “Fast high-resolution image synthesis with latent adversarial diffusion distillation,” in SIGGRAPH Asia 2024 Conference Papers, 2024, pp. 1–11.
[198] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” Advances in neural information processing systems, vol. 36, pp. 34 892–34 916, 2023.
[199] F. Li, R. Zhang, H. Zhang, Y. Zhang, B. Li, W. Li, Z. Ma, and C. Li, “Llava-interleave: Tackling multi-image, video, and 3d in large multimodal models,” in The Thirteenth International Conference on Learning Representations.
[200] J. Guo, T. Zheng, Y. Bai, B. Li, Y. Wang, K. Zhu, Y. Li, G. Neubig, W. Chen, and X. Yue, “Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale,” arXiv preprint arXiv:2412.05237, 2024.
[201] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.
[202] P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge et al., “Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,” arXiv preprint arXiv:2409.12191, 2024.
[203] J. Xie, W. Mao, Z. Bai, D. J. Zhang, W. Wang, K. Q. Lin, Y. Gu, Z. Chen, Z. Yang, and M. Z. Shou, “Show-o: One single transformer to unify multimodal understanding and generation,” in The Thirteenth International Conference on Learning Representations.
[204] S. Kou, J. Jin, Z. Liu, C. Liu, Y. Ma, J. Jia, Q. Chen, P. Jiang, and Z. Deng, “Orthus: Autoregressive interleaved image-text generation with modality-specific heads,” arXiv preprint arXiv:2412.00127, 2024.
[205] C. Wu, X. Chen, Z. Wu, Y. Ma, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, C. Ruan et al., “Janus: Decoupling visual encoding for unified multimodal understanding and generation,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 12 966–12 977.
[206] S. Patil, W. Berman, R. Rombach, and P. von Platen, “amused: An open muse reproduction,” arXiv preprint arXiv:2401.01808, 2024.
[207] Y. Wang, Z. Li, Y. Zang, Y. Zhou, J. Bu, C. Wang, Q. Lu, C. Jin, and J. Wang, “Pref-grpo: Pairwise preference reward-based grpo for stable text-to-image reinforcement learning,” arXiv preprint arXiv:2508.20751, 2025.
[208] Y. Bisk, R. Zellers, J. Gao, Y. Choi et al., “Piqa: Reasoning about physical commonsense in natural language,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 05, 2020, pp. 7432–7439.
[209] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi, “Hellaswag: Can a machine really finish your sentence?” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 4791–4800.
[210] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021.
[211] D. Ghosh, H. Hajishirzi, and L. Schmidt, “Geneval: An object-focused framework for evaluating text-to-image alignment,” Advances in Neural Information Processing Systems, vol. 36, pp. 52 132–52 152, 2023.
[212] C. Fu, P. Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin, J. Yang, X. Zheng, K. Li, X. Sun et al., “Mme: A comprehensive evaluation benchmark for multimodal large language models,” arXiv preprint arXiv:2306.13394, 2023.
[213] X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun et al., “Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9556–9567.
[214] D. A. Hudson and C. D. Manning, “Gqa: A new dataset for real-world visual reasoning and compositional question answering,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 6700–6709.
[215] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano et al., “Training verifiers to solve math word problems,” arXiv preprint arXiv:2110.14168, 2021.
[217] D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman, “Gpqa: A graduate-level google-proof q&a benchmark,” in First Conference on Language Modeling, 2024.
[218] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt, “Measuring mathematical problem solving with the math dataset,” in Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
[219] Y. Lyu, T. Luo, J. Shi, T. C. Hollon, and H. Lee, “Fine-grained text style transfer with diffusion-based language models,” arXiv preprint arXiv:2305.19512, 2023.
[220] Y. Demirag, D. Liu, and J. Niehues, “Benchmarking diffusion models for machine translation,” in Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, 2024, pp. 313–324.
[221] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz et al., “Transformers: State-of-the-art natural language processing,” in Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, 2020, pp. 38–45.
[222] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with pagedattention,” in Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
[223] Z. Wu, L. Zheng, Z. Xie, J. Ye, J. Gao, Y. Feng, Z. Li, V. W., G. Zhou, and L. Kong, “Dreamon: Diffusion language models for code infilling beyond fixed-size canvas,” 2025. [Online]. Available: https://hkunlp.github.io/blog/2025/dreamon
[224] Y. Yang, C. Wang, S. Wang, Z. Wen, B. Qi, H. Xu, and L. Zhang, “Diffusion llm with native variable generation lengths: Let [eos] lead the way,” arXiv preprint arXiv:2510.24605, 2025.
[225] J. Li, X. Dong, Y. Zang, Y. Cao, J. Wang, and D. Lin, “Beyond fixed: Training-free variable-length denoising for diffusion large language models,” arXiv preprint arXiv:2508.00819, 2025.
[226] X. Chen, S. Huang, C. Guo, C. Wei, Y. He, J. Zhang, H. Li, Y. Chen et al., “Dpad: Efficient diffusion language models with suffix dropout,” arXiv preprint arXiv:2508.14148, 2025.
[227] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025.
[228] K. Team, Y. Bai, Y. Bao, G. Chen, J. Chen, N. Chen, R. Chen, Y. Chen, Y. Chen, Y. Chen et al., “Kimi k2: Open agentic intelligence,” arXiv preprint arXiv:2507.20534, 2025.