{"id":31047,"date":"2026-05-11T19:53:30","date_gmt":"2026-05-11T11:53:30","guid":{"rendered":"http:\/\/139.9.1.231\/?p=31047"},"modified":"2026-05-11T19:53:33","modified_gmt":"2026-05-11T11:53:33","slug":"llm-asr-stream-20222026","status":"publish","type":"post","link":"http:\/\/139.9.1.231\/index.php\/2026\/05\/11\/llm-asr-stream-20222026\/","title":{"rendered":"A Panorama of Streaming LLM-ASR Model Optimization Papers (2022\u20132026)"},"content":{"rendered":"\n<p>Original author: \u8d3e\u5f66<\/p>\n\n\n\n\n\n<p>Time span: 2022.01\u20132026.04. A total of <strong>17<\/strong> representative papers are covered, in chronological order. Each entry includes an overview, architecture, key innovations, training data, experimental results, a sharp take, and a rating. \u2b50\u2b50 = milestone paper; \u2b50 = worth a close read.<\/p>\n\n\n\n<h2><strong>2022\u20132023: The Foundation Period (How LLMs Took Over ASR)<\/strong><\/h2>\n\n\n\n<p>The core question of this period was: &#8220;<strong>Can LLMs be applied to speech recognition?<\/strong>&#8221; Researchers had only just begun bringing models such as Whisper and LLaMA into ASR; streaming was still a secondary concern, and the main goal was to verify feasibility.<\/p>\n\n\n\n<hr class=\"wp-block-separator\"\/>\n\n\n\n<h3><strong>1. 
Prompting Large Language Models with Speech Recognition Abilities \u2b50<\/strong><\/h3>\n\n\n\n<p><strong>arXiv ID<\/strong>: 2307.11795<\/p>\n\n\n\n<p><strong>Release date<\/strong>: 2023-07-21<\/p>\n\n\n\n<p><strong>Publication status<\/strong>: ICASSP 2024<\/p>\n\n\n\n<p><strong>Affiliation<\/strong>: Meta AI<\/p>\n\n\n\n<p><em><strong>Paper link<\/strong>: <a href=\"https:\/\/arxiv.org\/abs\/2307.11795\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/arxiv.org\/abs\/2307.11795<\/a><\/em><\/p>\n\n\n\n<p><strong>\ud83d\udccc Overview<\/strong><\/p>\n\n\n\n<p>One of the earliest papers to systematically verify that the GPT-style paradigm of &#8220;attaching a small audio encoder directly in front of a frozen LLM to do ASR&#8221; is feasible. It prepends the Conformer encoder output, as prefix embeddings, to the text tokens of LLaMA-7B, and tests multilingual ASR capability as well as whether multilingual recognition can still be learned when the LLM is frozen.<\/p>\n\n\n\n<p><strong>\ud83d\udd27 Architecture sketch<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Audio \u2192 Conformer Encoder \u2192 Prefix Embedding<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Text Token Embedding \u2192 LLaMA-7B (can be frozen) \u2192 Transcript output<\/code><\/pre>\n\n\n\n<p><strong>\ud83d\udca1 Key innovations<\/strong><\/p>\n\n\n\n<ul><li>Established the GPT-style baseline paradigm of &#8220;Speech Prompt + LLM&#8221;<\/li><li>Verified: with the LLM frozen and only the encoder trained, the approach still works, with no need for the LLM to take part in ASR 
training<\/li><li>Multilingual recognition capability is retained even with large-stride striding (~1 s)<\/li><\/ul>\n\n\n\n<p><strong>\ud83d\udcca Training data &amp; results<\/strong><\/p>\n\n\n\n<ul><li>Data: Multilingual LibriSpeech (MLS, 44.5k h, multilingual)<\/li><li>MLS English WER 4.3%; the multilingual model beats the monolingual baseline by 18%<\/li><\/ul>\n\n\n\n<p><strong>\u2620\ufe0f Sharp take<\/strong><\/p>\n\n\n\n<p>This paper's historical value is as one of the &#8220;first validators&#8221; rather than as an &#8220;innovator&#8221;. Everyone was thinking about hooking an encoder to an LLM for ASR; this was simply among the first to build it and write it up clearly. The lack of streaming support is a serious flaw: the GPT-style approach must encode the entire utterance before feeding the LLM, so it is unusable in real-time settings. The paper itself reads like an engineering report, and the ablations are fairly coarse. Still, as a founding work of this direction it is required background.<\/p>\n\n\n\n<p><strong>\u2b50 Rating<\/strong>: 6\/10<\/p>\n\n\n\n<h3><strong>2. 
Chunked Attention-based Encoder-Decoder for Streaming Speech Recognition \u2b50<\/strong><\/h3>\n\n\n\n<p><strong>arXiv ID<\/strong>: 2309.08436<\/p>\n\n\n\n<p><strong>Release date<\/strong>: 2023-09-15<\/p>\n\n\n\n<p><strong>Publication status<\/strong>: ICASSP 2024<\/p>\n\n\n\n<p><strong>Affiliation<\/strong>: RWTH Aachen \/ Google<\/p>\n\n\n\n<p><strong>Paper link<\/strong>: <a href=\"https:\/\/arxiv.org\/abs\/2309.08436\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/arxiv.org\/abs\/2309.08436<\/a><\/p>\n\n\n\n<p><strong>\ud83d\udccc Overview<\/strong><\/p>\n\n\n\n<p>Converts an <strong>AED (Attention Encoder-Decoder)<\/strong> model into a <strong>chunk-wise streaming model<\/strong>, using a special <strong>End-of-Chunk (EOC)<\/strong> symbol in place of the conventional EOS symbol to drive the transition between chunks. Theoretical analysis shows that <strong>Chunked-AED is equivalent to a chunk-level Transducer (RNN-T)<\/strong>. The paper also studies practical deployment issues such as long-form generalization, beam size, and length normalization.<\/p>\n\n\n\n<p><strong><em>PS: RNN-T consists of three major building blocks (an encoder, a prediction network, and a joint network):<\/em><\/strong><\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img src=\"https:\/\/3.bp.blogspot.com\/-Z_W8MILyUsk\/XIauDigQ93I\/AAAAAAAAD5E\/0LsZWC3mWaIDrzSN0I4QCGGBWBzg5XNYgCEwYBhgL\/s400\/image2.png\" alt=\"RNN-T Architecture\"\/><\/figure><\/div>\n\n\n\n<p><strong>\ud83d\udd27 Architecture sketch<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Audio stream \u2192 Chunk-aware Encoder (limits the visible range of future frames)<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193<br>&nbsp; &nbsp; &nbsp; &nbsp;Chunk-wise 
Decoder (EOC token drives chunk transitions)<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Streaming transcript output (chunk-by-chunk)<\/code><\/pre>\n\n\n\n<p><strong>\ud83d\udca1 Key innovations<\/strong><\/p>\n\n\n\n<ul><li>Streaming conversion of AED: the EOC token replaces EOS so that the decoder can generate chunk-wise<\/li><li>Theoretical proof that Chunked-AED \u2248 Chunk-level Transducer, unifying the two model families<\/li><li>Long-form generalization: training on concatenated short utterances, with no dedicated long-form data required<\/li><\/ul>\n\n\n\n<p><strong>\ud83d\udcca Training data &amp; results<\/strong><\/p>\n\n\n\n<ul><li>Data: LibriSpeech (960 h) + TED-LIUM-v2<\/li><li>Streaming WER of 2.7% on LibriSpeech test-clean, with a very small gap to the non-streaming model<\/li><\/ul>\n\n\n\n<p><strong>\u2620\ufe0f Sharp take<\/strong><\/p>\n\n\n\n<p>This paper is underrated. It spells out the theoretical relationship between AED and Transducer, and many later streaming LLM-ASR designs are variants of this idea. But it does not itself introduce an LLM: it is a &#8220;streaming AED optimization&#8221; paper and, strictly speaking, not in the same lane as &#8220;LLM-ASR&#8221;. CHAT (2602.24245) can be read directly as its LLM-era sequel.<\/p>\n\n\n\n<p><strong>\u2b50 Rating<\/strong>: 7\/10<\/p>\n\n\n\n<hr class=\"wp-block-separator\"\/>\n\n\n\n<h3><strong>3. 
Smoothed Label Distillation for Decoder-Only ASR (SLD)<\/strong><\/h3>\n\n\n\n<p><strong>arXiv ID<\/strong>: 2311.04534<\/p>\n\n\n\n<p><strong>Release date<\/strong>: 2023-11-08<\/p>\n\n\n\n<p><strong>Publication status<\/strong>: ICASSP 2024<\/p>\n\n\n\n<p><strong>Affiliation<\/strong>: Alibaba DAMO Academy<\/p>\n\n\n\n<p><strong>Paper link<\/strong>: <a href=\"https:\/\/arxiv.org\/abs\/2311.04534\">https:\/\/arxiv.org\/abs\/2311.04534<\/a><\/p>\n\n\n\n<p><strong>Code link<\/strong>: https:\/\/github.com\/alibaba-damo-academy\/SpokenNLP<\/p>\n\n\n\n<p><strong>\ud83d\udccc Overview<\/strong><\/p>\n\n\n\n<p>Studies how to handle the training loss on discrete speech tokens when a decoder-only Transformer (GPT-style) is used for ASR. <strong>It finds that applying CE loss directly to the audio tokens is unstable, and proposes Smoothed Label Distillation (SLD), which models the audio tokens autoregressively using KL divergence with smoothed labels<\/strong>.<\/p>\n\n\n\n<p><strong>\ud83d\udd27 Architecture sketch<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Audio \u2192 Discretization (HuBERT\/EnCodec, etc.) \u2192 Discrete audio tokens<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Decoder-Only Transformer (GPT-style)<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;\u2193<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Audio token prediction &nbsp; &nbsp; 
&nbsp; &nbsp;Text token prediction<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; (SLD: KL divergence + smoothed labels) (standard CE loss)<\/code><\/pre>\n\n\n\n<p><strong>\ud83d\udca1 Key innovations<\/strong><\/p>\n\n\n\n<ul><li>Points out that neither Loss Masking (ignoring the loss on audio tokens) nor plain CE is optimal<\/li><li>SLD: <strong>KL divergence + smoothed labels, so that the model learns the autoregressive dependence among audio tokens<\/strong><\/li><li>Offers guidance on optimizing the training objective for discrete-token ASR paradigms such as SpeechGPT<\/li><\/ul>\n\n\n\n<p><strong>\ud83d\udcca Training data &amp; results<\/strong><\/p>\n\n\n\n<ul><li>Data: LibriSpeech (960 h)<\/li><li>Outperforms the Loss Masking strategy, with consistent gains across several speech discretization methods<\/li><\/ul>\n\n\n\n<p><strong>\u2620\ufe0f Sharp take<\/strong><\/p>\n\n\n\n<p>This is a small but polished piece of work that &#8220;finds a real problem and solves it&#8221;. At the time nobody had carefully studied how the training loss for discrete-token ASR should be designed; this paper did. But the accuracy ceiling of discrete-token ASR is inherently lower than that of continuous features, and SLD improves the &#8220;training recipe&#8221;, not the &#8220;architectural ceiling&#8221;. Streaming is not addressed; this is foundational training research for decoder-only ASR.<\/p>\n\n\n\n<p><strong>\u2b50 Rating<\/strong>: 
6\/10<\/p>\n\n\n\n<hr class=\"wp-block-separator\"\/>\n\n\n\n<h2><strong>\u258c2024: The Breakout Period (Streaming Frameworks, Multitask, Productionization)<\/strong><\/h2>\n\n\n\n<p>2024 was the year streaming LLM-ASR truly took off. BESTOW established the read-write policy framework, Transducer-Llama gave the best LLM integration scheme under RNN-T, and Seed-ASR showed where the real limits of industrial LLM-ASR lie.<\/p>\n\n\n\n<hr class=\"wp-block-separator\"\/>\n\n\n\n<h3><strong>4. BESTOW: Efficient and Streamable Speech Language Model \u2b50\u2b50<\/strong><\/h3>\n\n\n\n<p><strong>arXiv ID<\/strong>: 2406.19954<\/p>\n\n\n\n<p><strong>Release date<\/strong>: 2024-06-28<\/p>\n\n\n\n<p><strong>Publication status<\/strong>: Interspeech 2024 \/ NeurIPS 2024 Workshop<\/p>\n\n\n\n<p><strong>Affiliation<\/strong>: NVIDIA<\/p>\n\n\n\n<p><strong>Paper link<\/strong>: <strong><a href=\"https:\/\/arxiv.org\/abs\/2406.19954\">https:\/\/arxiv.org\/abs\/2406.19954<\/a><\/strong><\/p>\n\n\n\n<p><strong>Code link<\/strong>: https:\/\/github.com\/NVIDIA\/NeMo (includes the BESTOW implementation)<\/p>\n\n\n\n<p><strong>\ud83d\udccc Overview<\/strong><\/p>\n\n\n\n<p>Proposes the BESTOW architecture, which combines the strengths of GPT-style (pre-concatenated audio embeddings) and T5-style (layer-wise cross-attention) designs. The core idea is to replace audio prefix concatenation with <strong>cross-attention using text queries and audio keys\/values<\/strong>, which keeps efficiency high and supports streaming naturally. It recasts streaming SpeechLLM as a read-write policy 
problem, unifying the offline and streaming research frameworks.<\/p>\n\n\n\n<p><strong>\ud83d\udd27 Architecture sketch<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Audio stream \u2192 Streaming Speech Encoder \u2192 Audio features (Key\/Value)<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193<br>Text Prompt \u2192 Cross-Attention in each LLM layer (text as Query)<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;read-write policy network<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;(decides when to emit a token and when to keep reading)<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Streaming multitask output (ASR\/AST\/SQA)<\/code><\/pre>\n\n\n\n<p><strong>\ud83d\udca1 Key innovations<\/strong><\/p>\n\n\n\n<ul><li>The first open-source SpeechLLM to support both streaming and multitask operation (ASR\/AST\/SQA)<\/li><li>Casts the streaming problem as a read-write policy, borrowing mature research from simultaneous translation<\/li><li>Text-query-driven audio cross-attention, more efficient than GPT-style prefix concatenation<\/li><li>87k hours of data, with training completing within one day<\/li><\/ul>\n\n\n\n<p><strong>\ud83d\udcca 
Training data &amp; results<\/strong><\/p>\n\n\n\n<ul><li>Data: 87,000 hours of multilingual speech (public + private)<\/li><li>Multitask SOTA on ASR, AST, and SQA; LibriSpeech test-clean WER 1.9%<\/li><\/ul>\n\n\n\n<p><strong>\u2620\ufe0f Sharp take<\/strong><\/p>\n\n\n\n<p>The single most worthwhile streaming LLM-ASR paper of 2024, bar none. It clearly defines the problem space of &#8220;streaming SpeechLLM&#8221; as a read-write policy, and delivers the first multitask streaming solution that actually runs and can be open-sourced. But 87k hours of data is beyond what an ordinary team can reproduce, and there is no detailed latency analysis of streaming performance (the paper only says it &#8220;supports streaming&#8221; without giving concrete latency numbers). Required reading for researchers; engineers should mind the data barrier.<\/p>\n\n\n\n<p><strong>\u2b50 Rating<\/strong>: 8\/10<\/p>\n\n\n\n<hr class=\"wp-block-separator\"\/>\n\n\n\n<h3><strong>5. 
Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition<\/strong><\/h3>\n\n\n\n<p><strong>arXiv ID<\/strong>: 2407.04675<\/p>\n\n\n\n<p><strong>Release date<\/strong>: 2024-07-05<\/p>\n\n\n\n<p><strong>Publication status<\/strong>: ICASSP 2025<\/p>\n\n\n\n<p><strong>Affiliation<\/strong>: ByteDance \/ Seed Team<\/p>\n\n\n\n<p><strong>Paper link<\/strong>: <a href=\"https:\/\/arxiv.org\/abs\/2407.04675\">https:\/\/arxiv.org\/abs\/2407.04675<\/a><\/p>\n\n\n\n<p><strong>\ud83d\udccc Overview<\/strong><\/p>\n\n\n\n<p>A system report on the large-scale industrial LLM-ASR system from ByteDance's Seed team. It deeply integrates an LLM with a speech encoder and supports context-aware recognition (hotwords, scene prompts), multiple dialects, and noise robustness. Training is staged: pretraining to bridge the modality gap, SFT for alignment, and RLHF to improve quality. <strong>Note: the paper itself describes an offline system and does not cover streaming architecture design or streaming experiments.<\/strong><\/p>\n\n\n\n<p><strong>\ud83d\udd27 Architecture sketch<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Audio \u2192 Large-scale Speech Encoder (Conformer or similar)<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193 Adapter<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;LLM (Decoder-Only)<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u251c\u2500\u2500 Pretraining: bridge the modality gap<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 
\u251c\u2500\u2500 SFT: task alignment<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2514\u2500\u2500 RLHF: improves recognition quality and robustness<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193<br>&nbsp; &nbsp; Context Prompt (hotwords\/domain\/dialect info) \u2192 injected into the LLM input<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Offline transcript output (the paper does not cover streaming inference)<\/code><\/pre>\n\n\n\n<p><strong>\ud83d\udca1 Key innovations<\/strong><\/p>\n\n\n\n<ul><li>Industrial-grade end-to-end LLM-ASR: a complete pipeline from pretraining to RLHF<\/li><li>Context awareness: hotwords and domain information are injected via the prompt, with no retraining<\/li><li>First systematic application of RLHF to improving ASR quality<\/li><li>Large-scale validation of multi-dialect and noise robustness (the paper does not cover streaming)<\/li><\/ul>\n\n\n\n<p><strong>\ud83d\udcca Training data &amp; results<\/strong><\/p>\n\n\n\n<ul><li>Data: hundreds of thousands of hours of Chinese-English bilingual speech (ByteDance internal; scale not fully disclosed)<\/li><li>SOTA on internal multi-scenario benchmarks; both Mandarin CER and English WER beat Whisper-v3<\/li><\/ul>\n\n\n\n<p><strong>\u2620\ufe0f Sharp take<\/strong><\/p>\n\n\n\n<p>ByteDance's first full showing in LLM-ASR, with solid engineering depth. Context prompt 
injection is especially useful in product settings: recognition-quality problems in meetings and vertical domains are essentially a matter of &#8220;the model does not know these words&#8221;, and prompting is the most cost-effective fix. However, the reward design for RLHF in ASR is not disclosed in enough detail. The paper itself describes an offline system and contains nothing on streaming; it is included in this survey as an important industrial offline LLM-ASR reference baseline.<\/p>\n\n\n\n<p><strong>\u2b50 Rating<\/strong>: 7\/10<\/p>\n\n\n\n<hr class=\"wp-block-separator\"\/>\n\n\n\n<h3><strong>6. Transducer-Llama: Integrating LLMs into Streamable Transducer-based ASR \u2b50<\/strong><\/h3>\n\n\n\n<p><strong>arXiv ID<\/strong>: 2412.16464<\/p>\n\n\n\n<p><strong>Release date<\/strong>: 2024-12-21<\/p>\n\n\n\n<p><strong>Publication status<\/strong>: ICASSP 2025<\/p>\n\n\n\n<p><strong>Affiliation<\/strong>: Meta AI<\/p>\n\n\n\n<p><strong>Paper link<\/strong>: <a href=\"https:\/\/arxiv.org\/abs\/2412.16464\">https:\/\/arxiv.org\/abs\/2412.16464<\/a><\/p>\n\n\n\n<p><strong>\ud83d\udccc Overview<\/strong><\/p>\n\n\n\n<p>Integrates an LLM into the Factorized Transducer (FT) framework, naturally inheriting the streaming capability of RNN-T. It proposes a &#8220;weak-to-strong LM swap&#8221; strategy: first train the RNN-T with a weak LM, then replace it with a strong LLM predictor, and complete the integration by fine-tuning with MWER loss. It also introduces vocabulary adaptation to mitigate the data sparsity caused by the LLM's large vocabulary.<\/p>\n\n\n\n<p><strong>\ud83d\udd27 
Architecture sketch<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Audio stream \u2192 Streaming Conformer\/Emformer Encoder<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Factorized Transducer<br>&nbsp; &nbsp; \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510<br>&nbsp; &nbsp; \u2502 &nbsp; Blank Predictor (lightweight net) &nbsp; &nbsp; \u2502<br>&nbsp; &nbsp; \u2502 &nbsp; Non-Blank Predictor (LLM) &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2502\u2190 weak\u2192strong swap<br>&nbsp; &nbsp; \u2502 &nbsp; Joint Network (sigmoid\/softmax mix) &nbsp; \u2502<br>&nbsp; &nbsp; \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193 MWER fine-tuning<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Streaming transcript output<\/code><\/pre>\n\n\n\n<p><strong>\ud83d\udca1 Key innovations<\/strong><\/p>\n\n\n\n<ul><li>&#8220;Weak-to-strong LM swap&#8221;: train the RNN-T with a weak LM first, then swap in the LLM, sidestepping the optimization pitfalls of joint training<\/li><li>Vocabulary adaptation: maps the LLM's large vocabulary onto the speech system's vocabulary, lowering training cost<\/li><li>MWER loss tunes the LLM integration end to end<\/li><\/ul>\n\n\n\n<p><strong>\ud83d\udcca Training data &amp; results<\/strong><\/p>\n\n\n\n<ul><li>Data: LibriSpeech (960 h, English) + multilingual MLS (en 44.7k h, fr 1.1k 
h, it 0.2k h, nl 1.6k h)<\/li><li>WER reduced by 17% relative to the FT baseline and by 32% relative to the RNN-T baseline (LibriSpeech)<\/li><\/ul>\n\n\n\n<p><strong>\u2620\ufe0f Sharp take<\/strong><\/p>\n\n\n\n<p>This paper has the highest methodological value. The &#8220;weak-to-strong swap&#8221; goes straight at the core reason joint RNN-T + LLM training performs poorly: during RNN-T loss training, a strong LM lets the encoder slack off and lean on the language prior instead of properly learning acoustic information, and only after the swap can MWER truly unlock the LLM's capability. The vocabulary adaptation trick is also pragmatic and directly usable in engineering. But Meta's data resources (44.7k hours of English) are out of reach for an ordinary team, and generalization to Chinese and other language families remains in doubt.<\/p>\n\n\n\n<p><strong>\u2b50 Rating<\/strong>: 8\/10<\/p>\n\n\n\n<h3><strong>7. 
Multi-token Prediction for Faster Speech LLaMA Decoding<\/strong><\/h3>\n\n\n\n<p><strong>arXiv ID<\/strong>: 2409.12116<\/p>\n\n\n\n<p><strong>Release date<\/strong>: 2024-09<\/p>\n\n\n\n<p><strong>Publication status<\/strong>: Interspeech 2024 Workshop<\/p>\n\n\n\n<p><strong>Affiliation<\/strong>: JHU \/ Meta AI<\/p>\n\n\n\n<p><strong>Paper link<\/strong>: <strong><a href=\"https:\/\/arxiv.org\/abs\/2409.12116\">https:\/\/arxiv.org\/abs\/2409.12116<\/a><\/strong><\/p>\n\n\n\n<p><strong>\ud83d\udccc Overview<\/strong><\/p>\n\n\n\n<p>Targets the slow inference of decoder-only LLM-ASR by introducing multi-token prediction: each decoding step predicts several future tokens at once. It exploits a special property of ASR, namely that audio conditioning makes inter-token dependence weaker than in pure language modeling, so multi-token predictions are accepted at a higher rate.<\/p>\n\n\n\n<p><strong>\ud83d\udd27 Architecture sketch<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Audio \u2192 Encoder \u2192 Embedding\n&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193\n&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Decoder-Only LLM\n&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193\n&nbsp; &nbsp; &nbsp;Predict K future tokens per step (parallel decoding heads)\n&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193\n&nbsp; &nbsp; &nbsp; &nbsp;Verify: accept \u2192 advance K steps; reject \u2192 roll back<\/code><\/pre>\n\n\n\n<p><strong>\ud83d\udca1 Key innovations<\/strong><\/p>\n\n\n\n<ul><li>Applies multi-token prediction to speed up LLM-ASR decoding<\/li><li>Exploits the ASR 
task property that audio conditioning weakens strong inter-token dependence, which keeps the acceptance rate high<\/li><li>~3.2x decoding speedup on LibriSpeech with no WER loss<\/li><\/ul>\n\n\n\n<p><strong>\ud83d\udcca Training data &amp; results<\/strong><\/p>\n\n\n\n<ul><li>Data: LibriSpeech (960 h)<\/li><li>3.2x decoding speedup, WER unchanged<\/li><\/ul>\n\n\n\n<p><strong>\u2620\ufe0f Sharp take<\/strong><\/p>\n\n\n\n<p>Close in direction to the later SpecASR, but released earlier and with a simpler, more direct idea. The multi-token prediction is not specifically designed around ASR properties; it is more like a direct port of a precursor to speculative decoding from NLP. SpecASR later did this more systematically, and this paper's engineering value has been surpassed. Its contribution is being &#8220;the first to think of and implement this direction on LLM-ASR&#8221;.<\/p>\n\n\n\n<p><strong>\u2b50 Rating<\/strong>: 6\/10<\/p>\n\n\n\n<p>PS: StepFun's StepAudio 2.5 ASR. According to the vendor's own description, the model's core breakthrough is combining speed with accuracy: it claims to be the first to bring LLM inference-acceleration techniques to speech recognition and, built on a deeply integrated <strong>ASR+MTP-5<\/strong> architecture, reports a measured inference speedup of <strong>400%<\/strong>, a latency reduction of <strong>60%<\/strong>, and peak inference throughput of <strong>500 
tokens\/s<\/strong>, with inference cost cut by <strong>80%<\/strong>. Conventional speech recognition models are constrained by <strong>autoregressive generation<\/strong> and must <strong>emit tokens one at a time<\/strong>, like a typist hitting one key after another. StepAudio 2.5 ASR ports the same <strong>MTP (multi-token prediction) technique used in Step 3.5 Flash to speech recognition, letting the model predict several candidate tokens at once and confirm them quickly through parallel verification.<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" width=\"807\" height=\"529\" src=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2026\/05\/image.png\" alt=\"\" class=\"wp-image-31068\" srcset=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2026\/05\/image.png 807w, http:\/\/139.9.1.231\/wp-content\/uploads\/2026\/05\/image-300x197.png 300w, http:\/\/139.9.1.231\/wp-content\/uploads\/2026\/05\/image-768x503.png 768w\" sizes=\"(max-width: 807px) 100vw, 807px\" \/><figcaption>StepAudio 2.5 ASR MTP-5<\/figcaption><\/figure>\n\n\n\n<hr class=\"wp-block-separator\"\/>\n\n\n\n<h2><strong>\u258c2025: The Maturity Period (Inference Acceleration, On-Device Deployment, Multitask Fusion)<\/strong><\/h2>\n\n\n\n<p>By 2025 streaming LLM-ASR had matured, and the core question became <strong>how to be faster, cheaper, and more versatile<\/strong>. Inference acceleration, on-device deployment, and multitask joint modeling became the three main threads.<\/p>\n\n\n\n<h3><strong>8. 
MFLA: Monotonic Finite Look-ahead Attention for Streaming Speech Recognition \u2b50<\/strong><\/h3>\n\n\n\n<p><strong>arXiv ID<\/strong>&nbsp;: 2506.03722<\/p>\n\n\n\n<p><strong>Release date<\/strong>&nbsp;: 2025-06-04<\/p>\n\n\n\n<p><strong>Publication status<\/strong>&nbsp;: Interspeech 2025<\/p>\n\n\n\n<p><strong>Affiliation<\/strong>&nbsp;: Honor Device Co. \/ Shanghai Jiao Tong University<\/p>\n\n\n\n<p><strong>Paper link<\/strong>&nbsp;: <a href=\"https:\/\/arxiv.org\/abs\/2506.03722\">https:\/\/arxiv.org\/abs\/2506.03722<\/a><\/p>\n\n\n\n<p><strong>\ud83d\udccc Overview<\/strong><\/p>\n\n\n\n<p>Proposes the Streaming-Whisper framework: streaming recognition is obtained by LoRA fine-tuning Whisper, with no training from scratch. The core idea is to bring the CIF (Continuous Integrate-and-Fire) mechanism into LLM-ASR so the model learns a soft alignment from audio frames to tokens by itself, and to use MFLA (Monotonic Finite Look-ahead Attention) so that each decoder token sees <strong>unlimited left context + limited right context<\/strong> during decoding. This replaces conventional fixed-chunk segmentation and mitigates the chunk-boundary truncation problem at its root.<\/p>\n\n\n\n<p><strong>\ud83d\udd27 Architecture Sketch<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Audio stream \u2192 Whisper Encoder (MoChA chunk self-attention, chunk size sampled uniformly from &#091;32,128])\n&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193 hidden states H\n&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;CIF Predictor (two linear layers + ReLU)\n&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;\u251c\u2500\u2500 predicts per-frame weight
\u03b1; accumulation fires token boundaries (MRE loss)\n&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;\u2514\u2500\u2500 at inference, tracks decoding progress and prevents boundary hallucination\n&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193 dynamic segment alignment\n&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Decoder (Whisper Decoder + MFLA)\n&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;\u251c\u2500\u2500 each token sees: unlimited left context + limited right context (look-ahead span ~ Poisson(\u03bb=3))\n&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;\u251c\u2500\u2500 training: hybrid-attention (full-attention mixed with MFLA)\n&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;\u2514\u2500\u2500 inference: wait-k decoding (wait-3 by default)\n&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193\n&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Streaming transcript output (buffer state can be carried over to cut repeated computation)\n<strong>SpeechLLM extension:\nAudio \u2192 Whisper-Large-V3 Encoder \u2192 Adapter (2-layer cross-attention) \u2192 Qwen2.5-3B-Instruct \u2192 streaming transcript<\/strong><\/code><\/pre>\n\n\n\n<p><strong>\ud83d\udca1 Key Innovations<\/strong><\/p>\n\n\n\n<ul><li><strong>CIF-driven soft alignment<\/strong>&nbsp;: a CIF predictor estimates frame-level token weights to build a quasi-monotonic alignment, replacing hard fixed-chunk cuts and mitigating boundary truncation<\/li><li><strong>MFLA<\/strong>&nbsp;: an attention mechanism with limited right context; each token dynamically decides how many right-side audio frames to look at, enabling a prefix-to-prefix
training paradigm<\/li><li><strong>wait-k + buffer-state carry-over<\/strong>&nbsp;: the wait-3\u2020 scheme keeps state in the decoder buffer, cutting redundant computation by 60.86% relative to the Local Agreement baseline, at 1.41s latency<\/li><li><strong>Unified offline\/online framework<\/strong>&nbsp;: as look-ahead span\u2192\u221e the model reduces to an offline system, so a single model supports both modes<\/li><li><strong>SpeechLLM extension<\/strong>&nbsp;: with Qwen2.5-3B attached, online-decoding WER is only 0.98% higher than offline<\/li><\/ul>\n\n\n\n<p><strong>\ud83d\udcca Training Data &amp; Results<\/strong><\/p>\n\n\n\n<ul><li>Data: WenetSpeech4TTS Premium + LibriSpeech + MLS + VoxPopuli, covering Chinese, English, German, and Spanish<\/li><li>Whisper-Large-V3-Turbo: offline WER 5.63%, online WER 7.17% (1s chunk, wait-3), a gap of 1.54 points<\/li><li>Latency comparison (vs Local Agreement baseline, DAL=1.65s): wait-3 DAL=1.41s (-14.5%), wait-1 DAL=0.93s (-43.6%)<\/li><li>SpeechLLM online WER: WenetSpeech4TTS Premium 3.41%, LibriSpeech test-clean 2.38%<\/li><\/ul>\n\n\n\n<p><strong>\u2620\ufe0f Blunt Take<\/strong><\/p>\n\n\n\n<p>Pairing CIF with limited-right-context attention is the right call and smarter than fixed-chunk cutting: the model learns the alignment itself instead of cutting by the clock. The buffer-state carry-over in wait-3\u2020 pushes FLOPs down to 12.77G (vs 37.56G for the baseline), which is very practical in engineering terms. The paper itself concedes two core limitations: the CIF predictor
is too simple (just two linear layers), so its frame-level weight estimates are biased; and LoRA fine-tuning only partly adapts the encoder to streaming, so the online vs offline WER gap (1.54 points) remains significant. The more fundamental problem is that CIF tracks &#8220;which token number is due&#8221; rather than true semantic or prosodic boundaries. Pauses, stress, and breaths are invisible to the predictor, so it is only somewhat better than arbitrary fixed-chunk cuts rather than a full solution to the boundary problem. The SpeechLLM extension is evaluated only on LibriSpeech + WenetSpeech4TTS, which covers few scenarios. Overall, this work gets the right idea working but is not finished; upgrading the predictor and adapting the encoder for streaming are the obvious next steps.<\/p>\n\n\n\n<p><strong>\u2b50 Rating<\/strong>&nbsp;: 7\/10<\/p>\n\n\n\n<hr class=\"wp-block-separator\"\/>\n\n\n\n<h3><strong>9.
SpecASR: Accelerating LLM-based ASR via Speculative Decoding \u2b50<\/strong><\/h3>\n\n\n\n<p><strong>arXiv ID<\/strong>&nbsp;: 2507.18181<\/p>\n\n\n\n<p><strong>Release date<\/strong>&nbsp;: 2025-07-24<\/p>\n\n\n\n<p><strong>Publication status<\/strong>&nbsp;: DAC 2025<\/p>\n\n\n\n<p><strong>Affiliation<\/strong>&nbsp;: Xiamen University \/ multi-university collaboration<\/p>\n\n\n\n<p><strong>Paper link<\/strong>&nbsp;: <a href=\"https:\/\/arxiv.org\/abs\/2507.18181\">https:\/\/arxiv.org\/abs\/2507.18181<\/a><\/p>\n\n\n\n<p><strong>\ud83d\udccc Overview<\/strong><\/p>\n\n\n\n<p>A speculative decoding framework tailored to LLM-ASR. Core observation: ASR decoding is audio-conditioned, so the outputs of the small and large models agree at a very high rate. It proposes adaptive draft-sequence generation (dynamically adjusting draft length), a draft-sequence reuse strategy (reducing draft-model latency), and a two-step sparse token-tree generation algorithm.<\/p>\n\n\n\n<p><strong>\ud83d\udd27 Architecture Sketch<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Audio \u2192 small Draft LLM-ASR (quickly generates a candidate token tree)<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193 adaptive length control<br>&nbsp; &nbsp; &nbsp; &nbsp;large Target LLM-ASR (verifies the token tree in parallel)<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u251c\u2500\u2500 audio conditioning ensures a high acceptance rate<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2514\u2500\u2500 sparse token tree reduces draft overhead<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Accelerated streaming transcript output<\/code><\/pre>\n\n\n\n<p><strong>\ud83d\udca1 Key Innovations<\/strong><\/p>\n\n\n\n<ul><li>ASR-specific speculative decoding: audio conditioning is exploited to keep draft\/target agreement high<\/li><li>Adaptive draft length: the draft length is tuned dynamically to balance verification overhead against acceptance rate<\/li><li>Two-step sparse token tree: reduces redundant computation in the draft model<\/li><\/ul>\n\n\n\n<p><strong>\ud83d\udcca Training Data &amp; Results<\/strong><\/p>\n\n\n\n<ul><li>Data: LibriSpeech + several public English benchmarks (evaluation sets)<\/li><li>3.04x\u20133.79x speedup (vs the autoregressive baseline), 1.25x\u20131.84x (vs standard speculative decoding), with zero accuracy loss<\/li><\/ul>\n\n\n\n<p><strong>\u2620\ufe0f Blunt Take<\/strong><\/p>\n\n\n\n<p>Speculative decoding is already mature for LLM inference acceleration, so porting it to LLM-ASR is a natural step, but the paper adds enough ASR-specific design. The 3.04x\u20133.79x speedup is a real end-to-end number, not a theoretical upper bound. The precondition is that you already have an LLM-ASR system and can afford to run a large and a small LLM at the same time. It helps little in resource-constrained settings, and the disclosure of draft-model selection and training strategy is not detailed enough.<\/p>\n\n\n\n<p><strong>\u2b50
Rating<\/strong>&nbsp;: 8\/10<\/p>\n\n\n\n<h3><strong>10. WhisperKit: On-device Real-time ASR with Billion-Scale Transformers \u2b50<\/strong><\/h3>\n\n\n\n<p><strong>arXiv ID<\/strong>&nbsp;: 2507.10860<\/p>\n\n\n\n<p><strong>Release date<\/strong>&nbsp;: 2025-07-14<\/p>\n\n\n\n<p><strong>Publication status<\/strong>&nbsp;: ICML 2025 On-Device Learning Workshop<\/p>\n\n\n\n<p><strong>Affiliation<\/strong>&nbsp;: Argmax<\/p>\n\n\n\n<p><strong>Paper link<\/strong>&nbsp;: <a href=\"https:\/\/arxiv.org\/abs\/2507.10860\">https:\/\/arxiv.org\/abs\/2507.10860<\/a><\/p>\n\n\n\n<p><strong>Code link<\/strong>&nbsp;: https:\/\/github.com\/argmaxinc\/WhisperKit<\/p>\n\n\n\n<p><strong>\ud83d\udccc Overview<\/strong><\/p>\n\n\n\n<p>An inference-optimization system for real-time Whisper ASR aimed at on-device deployment. Running locally on Apple devices, it matches or even beats the accuracy of the cloud services gpt-4o-transcribe and Deepgram nova-3, with latency as low as 0.46s and a WER of only 2.2%. The core contributions are block-diagonal-mask self-distillation, aggressive Apple ANE optimization, and quantized compression.<\/p>\n\n\n\n<p><strong>\ud83d\udd27 Architecture Sketch<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Original Whisper Large v3 Turbo<br>&nbsp; &nbsp; \u2193 block-diagonal-mask self-distillation (d750: 15s block)<br>Streaming Audio Encoder (block-diagonal self-attention, silence cache)<br>&nbsp; &nbsp; \u2193<br>Text Decoder + LocalAgreement streaming confirmation policy<br>&nbsp; &nbsp; \u2193 quantization (1.6GB \u2192 0.6GB)<br>Apple Neural Engine (ANE) native accelerated deployment<br>&nbsp; &nbsp; \u2193<br>On-device real-time transcription at
0.46s latency<\/code><\/pre>\n\n\n\n<p><strong>\ud83d\udca1 Key Innovations<\/strong><\/p>\n\n\n\n<ul><li>Block-diagonal-mask self-distillation: native support for streaming Whisper inference; the silence cache cuts wasted forward passes<\/li><li>Quantization from 1.6GB to 0.6GB with WER loss &lt;1%<\/li><li>Near-peak hardware utilization on the Apple ANE; on-device results beat cloud baselines<\/li><\/ul>\n\n\n\n<p><strong>\ud83d\udcca Training Data &amp; Results<\/strong><\/p>\n\n\n\n<ul><li>Data: CommonVoice 17 (fine-tuning on 5 languages); evaluated on LibriSpeech + earnings22<\/li><li>WER 2.2%, latency 0.46s; beats gpt-4o-transcribe and Deepgram nova-3<\/li><\/ul>\n\n\n\n<p><strong>\u2620\ufe0f Blunt Take<\/strong><\/p>\n\n\n\n<p>The closest thing here to a pure engineering paper, with thorough ablations behind every step: a genuine engineering achievement. Still, it is an engineering-optimization paper rather than an algorithmic one: the block-diagonal mask is taken from prior work by Liu et al., and LocalAgreement is an existing method too. Its value lies in &#8220;pushing an existing stack to the limit on the Apple ANE&#8221;. If you are not deploying on Apple devices, a quick skim is enough.<\/p>\n\n\n\n<p><strong>\u2b50 Rating<\/strong>&nbsp;: 7\/10<\/p>\n\n\n\n<h3><strong>11.
Train Short, Infer Long: Speech-LLM Enables Zero-Shot Streamable Joint ASR and Diarization (JEDIS-LLM) \u2b50<\/strong><\/h3>\n\n\n\n<p><strong>arXiv ID<\/strong>&nbsp;: 2511.16046<\/p>\n\n\n\n<p><strong>Release date<\/strong>&nbsp;: 2025-11-20<\/p>\n\n\n\n<p><strong>Publication status<\/strong>&nbsp;: ICASSP 2026<\/p>\n\n\n\n<p><strong>Affiliation<\/strong>&nbsp;: Microsoft \/ UCLA<\/p>\n\n\n\n<p><strong>Paper link<\/strong>&nbsp;: <a href=\"https:\/\/arxiv.org\/abs\/2511.16046\">https:\/\/arxiv.org\/abs\/2511.16046<\/a><\/p>\n\n\n\n<p><strong>\ud83d\udccc Overview<\/strong><\/p>\n\n\n\n<p>JEDIS-LLM is an end-to-end Speech-LLM that supports joint streaming speaker diarization + ASR. The model is trained only on short audio (\u226420s) yet generalizes zero-shot to streaming inference on long audio of arbitrary length. A Speaker Prompt Cache (SPC) mechanism propagates speaker consistency across chunks and also supports pre-enrolled speaker profiles.<\/p>\n\n\n\n<p><strong>\ud83d\udd27 Architecture Sketch<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Audio stream \u2192 streaming Speech Encoder<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;\u251c\u2500\u2500 Spk-Decoder (Word-level Speaker Supervision)<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;\u2514\u2500\u2500 Projector<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; LLM (LoRA-adapted)<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193<br>&nbsp; &nbsp; Streaming chunk inference:<br>&nbsp; &nbsp; Speaker Prompt Cache (SPC)<br>&nbsp; &nbsp;
\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510<br>&nbsp; &nbsp; \u2502 One representative audio clip per speaker &nbsp;\u2502<br>&nbsp; &nbsp; \u2502 Passed across chunks, updated in real time \u2502<br>&nbsp; &nbsp; \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Speaker-attributed transcript (\"who said what\")<\/code><\/pre>\n\n\n\n<p><strong>\ud83d\udca1 Key Innovations<\/strong><\/p>\n\n\n\n<ul><li>The first Speech-LLM for zero-shot streaming joint ASR + speaker diarization on long audio<\/li><li>SPC: builds on the LLM's autoregressive KV-cache mechanism to keep speakers consistent across chunks without post-hoc global clustering<\/li><li>Word-level Speaker Supervision: word-level speaker labels strengthen the encoder's speaker discrimination<\/li><li>Trained only on short audio (\u226420s), with zero-shot generalization to audio of arbitrary length<\/li><\/ul>\n\n\n\n<p><strong>\ud83d\udcca Training Data &amp; Results<\/strong><\/p>\n\n\n\n<ul><li>Data: in-house multi-speaker data (short audio, \u226420s); evaluated on the standard CALLHOME \/ AMI benchmarks<\/li><li>Beats Sortformer and Meta-Cat (short-audio setting); beats
DiarizationLM (long-audio setting)<\/li><\/ul>\n\n\n\n<p><strong>\u2620\ufe0f Blunt Take<\/strong><\/p>\n\n\n\n<p>This paper tackles a real and thorny problem: streaming multi-speaker transcription of long audio. The SPC design is elegant: the LLM's autoregressive KV-cache mechanism extends naturally to propagating speaker consistency across chunks, with no post-hoc global clustering and no retraining. &#8220;Train only on \u226420s clips yet generalize zero-shot to long audio&#8221; would be very valuable if it reproduces. However, the evaluation sets (CALLHOME, AMI) are not the newest or hardest benchmarks, and the comparison with DiarizationLM looks like home-field advantage (the latter is a cascade system). Ablations on chunk size and SPC update frequency are still thin.<\/p>\n\n\n\n<p><strong>\u2b50 Rating<\/strong>&nbsp;: 8\/10<\/p>\n\n\n\n<h3><strong>12.
Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing (Whisper-LLaDA)<\/strong><\/h3>\n\n\n\n<p><strong>arXiv ID<\/strong>&nbsp;: 2509.16622<\/p>\n\n\n\n<p><strong>Release date<\/strong>&nbsp;: 2025-09-20<\/p>\n\n\n\n<p><strong>Publication status<\/strong>&nbsp;: ICASSP 2026<\/p>\n\n\n\n<p><strong>Affiliation<\/strong>&nbsp;: IDIAP Research Institute \/ multi-university collaboration<\/p>\n\n\n\n<p><strong>Paper link<\/strong>&nbsp;: https:\/\/arxiv.org\/abs\/2509.16622<\/p>\n\n\n\n<p><strong>Code link<\/strong>&nbsp;: https:\/\/github.com\/liuzhan22\/Diffusion-ASR<\/p>\n\n\n\n<p><strong>\ud83d\udccc Overview<\/strong><\/p>\n\n\n\n<p>Brings a diffusion LLM (LLaDA-8B) into ASR to explore a non-autoregressive decoding path. It first serves as an external deliberation module over Whisper-LLaMA transcripts, using LLaDA's bidirectional attention + denoising ability to correct transcription errors. It then evaluates LLaDA as a standalone ASR decoder, where diffusion decoding is faster than autoregressive decoding but slightly less accurate.<\/p>\n\n\n\n<p><strong>\ud83d\udd27 Architecture Sketch<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Audio \u2192 Whisper-Large-v3 Encoder (frozen)<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193<br>&nbsp; &nbsp; &nbsp; &nbsp; Q-Former (44 trainable queries, 0.33s window)<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193 Projection<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; LLaDA-8B-Instruct (LoRA fine-tuned)<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u251c\u2500\u2500 Mode 1: Deliberation (corrects the initial Whisper-LLaMA transcript)<br>&nbsp; &nbsp; &nbsp; &nbsp;
&nbsp; &nbsp; \u2502 &nbsp; \u251c\u2500\u2500 random mask strategy<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2502 &nbsp; \u251c\u2500\u2500 lowest-confidence mask strategy<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2502 &nbsp; \u2514\u2500\u2500 semi-autoregressive strategy<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2514\u2500\u2500 Mode 2: standalone ASR decoder (diffusion decoding \/ semi-autoregressive decoding)<\/code><\/pre>\n\n\n\n<p><strong>\ud83d\udca1 Key Innovations<\/strong><\/p>\n\n\n\n<ul><li>The first systematic evaluation of a diffusion LLM on the ASR task<\/li><li>Audio-conditioned embeddings are the key: a text-only LLaDA (no acoustic features) is ineffective for deliberation<\/li><li>Semi-autoregressive decoding strategy: balances the speed and accuracy of diffusion decoding<\/li><\/ul>\n\n\n\n<p><strong>\ud83d\udcca Training Data &amp; Results<\/strong><\/p>\n\n\n\n<ul><li>Data: LibriSpeech (960h, English)<\/li><li>Best cascaded WER: test-clean 2.25% \/ test-other 4.94% (-12.3% relative vs the Whisper-LLaMA baseline)<\/li><li>Standalone diffusion decoding: faster than AR, but slightly less accurate<\/li><\/ul>\n\n\n\n<p><strong>\u2620\ufe0f Blunt Take<\/strong><\/p>\n\n\n\n<p>An exploratory paper with an honest attitude: it states plainly that &#8220;a diffusion LLM doing ASR is less accurate than autoregressive decoding, but faster&#8221;, without dressing up the results. The core insight is valuable: audio-conditioned embeddings are a necessary condition for the diffusion LLM to work. But the experiments run only on
LibriSpeech (960h of English audiobooks, on the easy side), so they say nothing about robustness to noise, accents, or real conversational speech. &#8220;Faster but not good enough&#8221; has limited appeal for production. The paper is better positioned as &#8220;a technical report that validates feasibility&#8221;.<\/p>\n\n\n\n<p><strong>\u2b50 Rating<\/strong>&nbsp;: 7\/10<\/p>\n\n\n\n<h2><strong>\u258c2026 Q1 Continued Evolution: Unified Architectures, Production Deployment, Full Duplex<\/strong><\/h2>\n\n\n\n<h3><strong>13. Streaming Speech Recognition with Decoder-Only LLMs and Latency Optimization (MoCha-ASR) \u2b50<\/strong><\/h3>\n\n\n\n<p><strong>arXiv ID<\/strong>&nbsp;: 2601.22779<\/p>\n\n\n\n<p><strong>Release date<\/strong>&nbsp;: 2026-01-30<\/p>\n\n\n\n<p><strong>Publication status<\/strong>&nbsp;: ICASSP 2026<\/p>\n\n\n\n<p><strong>Affiliation<\/strong>&nbsp;: Hefei University of Technology \/ multiple universities<\/p>\n\n\n\n<p><strong>Paper link<\/strong>&nbsp;: <a href=\"https:\/\/arxiv.org\/abs\/2601.22779\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/arxiv.org\/abs\/2601.22779<\/a><\/p>\n\n\n\n<p><strong>\ud83d\udccc Overview<\/strong><\/p>\n\n\n\n<p>Proposes combining a read\/write policy network with MoChA (Monotonic Chunkwise Attention) so that a decoder-only LLM supports streaming ASR. It introduces a minimum-latency training objective (minLT loss) that lowers token-generation latency by 62.5%, needs no CTC forced alignment, and is optimizable end to end.<\/p>\n\n\n\n<p><strong>\ud83d\udd27 Architecture Sketch<\/strong><\/p>\n\n\n\n<pre
class=\"wp-block-code\"><code>Audio stream \u2192 streaming Conformer Encoder (context-sensitive chunking)<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193 LoRA fine-tuning<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;MoChA Policy Network (decides read\/write)<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u251c\u2500\u2500 read: keep receiving audio frames<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2514\u2500\u2500 write: trigger the LLM to generate the next token<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Qwen2.5-1.5B (Decoder-Only LLM)<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Interleaved audio\/text token input<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193<br>&nbsp; &nbsp; minLT loss constrains alignment boundaries \u2192 latency down 62.5%<\/code><\/pre>\n\n\n\n<p><strong>\ud83d\udca1 Key Innovations<\/strong><\/p>\n\n\n\n<ul><li>End-to-end streaming LLM-ASR with no CTC forced alignment<\/li><li>The minLT (Minimum Latency Training) loss constrains alignment boundaries and substantially compresses generation latency<\/li><li>Streaming and non-streaming models share parameters; joint training lowers development cost<\/li><\/ul>\n\n\n\n<p><strong>\ud83d\udcca Training Data &amp; Results<\/strong><\/p>\n\n\n\n<ul><li>Data: AISHELL-1 (165h) + AISHELL-2 (1000h) + in-house multi-domain data<\/li><li>AISHELL-1 CER 5.1% \/ AISHELL-2 CER 5.5%, better than all streaming baselines; token-generation latency down 62.5%<\/li><\/ul>\n\n\n\n<p><strong>\u2620\ufe0f
Blunt Take<\/strong><\/p>\n\n\n\n<p>Solid, grounded work. Others build streaming LLM-ASR either by bolting on CTC alignment or by hard chunking with wait-k; this paper actually uses MoChA for adaptive segmentation and trains end to end. The 62.5% latency reduction from the minLT loss has real engineering value. But the experiments run only on Chinese datasets (AISHELL-1\/2), and the BESTOW baseline is the authors' own reproduction, which raises a suspicion of selective comparison. MoChA itself is not new; the core contribution is attaching it to an LLM, which is valuable but not a breakthrough.<\/p>\n\n\n\n<p><strong>\u2b50 Rating<\/strong>&nbsp;: 8\/10<\/p>\n\n\n\n<h3><strong>14. Chunk-wise Attention Transducers (CHAT) for Fast and Accurate Streaming Speech-to-Text<\/strong><\/h3>\n\n\n\n<p><strong>arXiv ID<\/strong>&nbsp;: 2602.24245<\/p>\n\n\n\n<p><strong>Release date<\/strong>&nbsp;: 2026-02-27 (submitted at the end of 2025)<\/p>\n\n\n\n<p><strong>Publication status<\/strong>&nbsp;: ICASSP 2026<\/p>\n\n\n\n<p><strong>Affiliation<\/strong>&nbsp;: Apple \/ Google<\/p>\n\n\n\n<p><strong>Paper link<\/strong>&nbsp;: <a href=\"https:\/\/arxiv.org\/abs\/2602.24245\">https:\/\/arxiv.org\/abs\/2602.24245<\/a><\/p>\n\n\n\n<p><strong>\ud83d\udccc Overview<\/strong><\/p>\n\n\n\n<p>Proposes CHAT, which replaces RNN-T's frame-by-frame additive joiner with a cross-attention joiner that operates within each chunk. It keeps RNN-T's
streaming ability while adding flexibility in modeling local alignment, and needs no alignment timestamps. The gain is especially large for speech translation (ST).<\/p>\n\n\n\n<p><strong>\ud83d\udd27 Architecture Sketch<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Audio stream \u2192 streaming FastConformer Encoder (chunk-aware)<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193 output in fixed chunks<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;CHAT Joiner (replaces the original RNN-T joiner)<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;\u2502 Predictor output (text history) \u2192 Query &nbsp;\u2502<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;\u2502 Encoder chunk output (audio) \u2192 Key\/Value \u2502<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;\u2502 &nbsp; \u2193 cross-attention (within chunk) &nbsp; &nbsp; &nbsp; \u2502<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;\u2502 &nbsp; \u2193 + Predictor residual + ReLU &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;\u2502<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;\u2502 &nbsp; \u2193 \u2192 distribution over vocabulary &nbsp; &nbsp; &nbsp; \u2502<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193 blank
\u2192 next chunk; non-blank \u2192 emit token<\/code><\/pre>\n\n\n\n<p><strong>\ud83d\udca1 Key Innovations<\/strong><\/p>\n\n\n\n<ul><li>The within-chunk cross-attention joiner relaxes RNN-T's strictly monotonic alignment constraint<\/li><li>Trains without timestamp information; the change is minimal but the gains are robust<\/li><li>Especially large gain for speech translation (ST): +18% BLEU<\/li><\/ul>\n\n\n\n<p><strong>\ud83d\udcca Training Data &amp; Results<\/strong><\/p>\n\n\n\n<ul><li>Data: NeMo multilingual data; speech translation: MuST-C v2<\/li><li>ASR WER -6.3%; ST BLEU +18.0%; training memory -46.2%; training speed 1.36x; inference speed 1.69x<\/li><\/ul>\n\n\n\n<p><strong>\u2620\ufe0f Blunt Take<\/strong><\/p>\n\n\n\n<p>Incremental but solid. Within-chunk cross-attention has long existed in AED frameworks; moving it into the Transducer joiner has engineering value but limited novelty. The experiments stay inside the NeMo framework with no head-to-head comparison against LLM-ASR systems, so it is unclear whether the approach remains competitive in the latest LLM-based pipelines. The speech translation (ST) gain (+18% BLEU) is the more impressive result: RNN-T's strict monotonicity is a real handicap for translation, and this paper addresses it effectively.<\/p>\n\n\n\n<p><strong>\u2b50 Rating<\/strong>&nbsp;: 7\/10<\/p>\n\n\n\n<h3><strong>15.
Uni-ASR: Unified LLM-Based Architecture for Non-Streaming and Streaming ASR<\/strong><\/h3>\n\n\n\n<p><strong>arXiv ID<\/strong>&nbsp;: 2603.11123<\/p>\n\n\n\n<p><strong>Release date<\/strong>&nbsp;: 2026-03-11<\/p>\n\n\n\n<p><strong>Publication status<\/strong>&nbsp;: Submitted to Interspeech 2026<\/p>\n\n\n\n<p><strong>Institution<\/strong>&nbsp;: iFLYTEK (\u79d1\u5927\u8baf\u98de) \/ multiple universities<\/p>\n\n\n\n<p><strong>Paper link<\/strong>&nbsp;: https:\/\/arxiv.org\/abs\/2603.11123<\/p>\n\n\n\n<p><strong>\ud83d\udccc Summary<\/strong><\/p>\n\n\n\n<p>Proposes Uni-ASR, a unified LLM framework that supports both non-streaming and streaming speech recognition and switches between the two modes without any architectural change. It jointly trains three paradigms (NS\/SS\/CS) and adds a latest-token fallback decoding strategy, improving streaming accuracy without adding latency.<\/p>\n\n\n\n<p><strong>\ud83d\udd27 Architecture sketch<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Audio \u2192 FireRedASR Conformer Encoder (full + dynamic chunk attention)<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193 Adapter<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Qwen3-1.7B (Decoder-Only LLM)<br><br>Training: NS \/ SS \/ CS paradigms sampled 1:1:1<br>&nbsp; &nbsp; \u251c\u2500\u2500 NS: non-streaming, full-sequence input<br>&nbsp; &nbsp; \u251c\u2500\u2500 SS: streaming, chunks cut by forced alignment, speech-text interleaved<br>&nbsp; &nbsp; \u2514\u2500\u2500 CS: context-aware streaming, last input token set to &lt;pad&gt;, learns cross-chunk re-decoding<br><br>Inference:<br>&nbsp; &nbsp; 
Streaming: KV cache reused incrementally across chunks<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;latest-token fallback (the last token is confirmed only after the next chunk)<br>&nbsp; &nbsp; Non-streaming: direct full-sequence decoding<\/code><\/pre>\n\n\n\n<p><strong>\ud83d\udca1 Key innovations<\/strong><\/p>\n\n\n\n<ul><li>A single model unifies streaming and non-streaming, trained jointly on the three paradigms at 1:1:1<\/li><li>The context-aware streaming (CS) training paradigm removes the training-inference mismatch<\/li><li>Latest-token fallback decoding: the boundary token is confirmed one chunk later, with no extra latency measured<\/li><\/ul>\n\n\n\n<p><strong>\ud83d\udcca Training data &amp; results<\/strong><\/p>\n\n\n\n<ul><li>Data: a Chinese-English bilingual mix of WeNetSpeech (10000h+) + AISHELL + LibriSpeech + GigaSpeech + internal data<\/li><li>Streaming: AISHELL-1 CER 2.15% \/ LibriSpeech test-clean WER 2.44% (1000ms chunk)<\/li><li>Outperforms Speech ReaLLM, SpeechLLM-XL, and MoCha-ASR<\/li><\/ul>\n\n\n\n<p><strong>\u2620\ufe0f Sharp take<\/strong><\/p>\n\n\n\n<p>A representative of the &#8220;big and all-inclusive&#8221; route, with careful engineering. But at heart it is a careful combination of existing techniques: interleaved speech-text (borrowed from CosyVoice2), the hold-n strategy (existing), and KV cache reuse (existing). The fallback decoding idea is small and practical, but not a major innovation.<strong> Qwen3-ASR-1.7B posts better numbers in the paper's own streaming benchmark, but Qwen3 
gets its streaming by repeatedly re-running non-streaming decoding, at a compute cost roughly an order of magnitude higher; Uni-ASR's failure to report computational cost on a fair footing is a flaw.<\/strong><\/p>\n\n\n\n<p><strong>\u2b50 Score<\/strong>&nbsp;: 7\/10<\/p>\n\n\n\n<p><a href=\"https:\/\/mp.weixin.qq.com\/s\/rSk0WBc4VjW0dkqBspKofA\">https:\/\/mp.weixin.qq.com\/s\/rSk0WBc4VjW0dkqBspKofA<\/a><\/p>\n\n\n\n<h3><strong>16. NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR \u2b50<\/strong><\/h3>\n\n\n\n<p><strong>arXiv ID<\/strong>&nbsp;: 2604.18105<\/p>\n\n\n\n<p><strong>Release date<\/strong>&nbsp;: 2026-04-20<\/p>\n\n\n\n<p><strong>Institution<\/strong>&nbsp;: NIO (\u851a\u6765\u6c7d\u8f66)<\/p>\n\n\n\n<p><strong>Paper link<\/strong>&nbsp;: https:\/\/arxiv.org\/abs\/2604.18105<\/p>\n\n\n\n<p><strong>\ud83d\udccc Summary<\/strong><\/p>\n\n\n\n<p>An LLM-ASR framework built for production deployment that systematically tackles three pain points: keeping the model lightweight, suppressing hallucination, and customizing hotwords. It narrows the modality gap with phoneme-level encoder pretraining, introduces Iterative Asynchronous SFT (IA-SFT) to prevent representation drift, designs an ASR-specific RL stage to raise recognition quality, and uses phoneme RAG to support hotword customization at the scale of millions of entries.<\/p>\n\n\n\n<p><strong>\ud83d\udd27 Architecture sketch<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Audio \u2192 600M Conformer Encoder (phoneme CTC pretraining, CKA drift monitoring)<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u251c\u2500\u2500 Streaming: dynamic-chunk 
mechanism (built in during pretraining)<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2514\u2500\u2500 phoneme CTC head \u2192 phoneme hypotheses<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;MLP Adapter (4x downsampling, 160ms\/token)<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2193<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Qwen3-1.7B (LLM decoder)<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; \u2191<br>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;Phoneme RAG: phoneme hypotheses \u2192 hotword database lookup (&lt;1ms) \u2192 prompt injection<br><br>Training pipeline:<br>&nbsp; &nbsp; Stage1: Encoder pretraining (phoneme CTC, CR-CTC)<br>&nbsp; &nbsp; Stage2: Alignment (train the Adapter only, freeze the rest)<br>&nbsp; &nbsp; Stage3: IA-SFT (asynchronous and parallel, CKA monitors encoder stability)<br>&nbsp; &nbsp; Stage4+5: Late Joint SFT + Context SFT + ASR-RL<\/code><\/pre>\n\n\n\n<p><strong>\ud83d\udca1 Key innovations<\/strong><\/p>\n\n\n\n<ul><li>Phoneme-level encoder pretraining: low-entropy representations narrow the modality gap and support streaming natively<\/li><li>IA-SFT: asynchronous SFT starts as early as the alignment stage, with CKA monitoring to prevent representation drift<\/li><li>ASR-RL: reinforcement learning designed specifically for ASR, further improving recognition quality and robustness to hallucination<\/li><li>Phoneme RAG: customization over a million hotwords, with retrieval latency under 1ms<\/li><\/ul>\n\n\n\n<p><strong>\ud83d\udcca Training data &amp; results<\/strong><\/p>\n\n\n\n<ul><li>Data: 25 benchmarks (15 
public + 10 internal); large-scale internal Chinese-English bilingual data<\/li><li>With 2.3B parameters it reaches SOTA on several public benchmarks and leads by a wide margin on internal entity-intensive scenarios<\/li><\/ul>\n\n\n\n<p><strong>\u2620\ufe0f Sharp take<\/strong><\/p>\n\n\n\n<p>An industrial paper rooted in NIO's in-vehicle use case, with real engineering substance. Phoneme-level encoder pretraining, IA-SFT against drift, ASR-RL, and million-hotword RAG: each module answers a real production pain point. Using CKA to monitor encoder representation drift on the fly is a careful touch. But the core data is not public, so academic reproducibility is nil; the &#8220;SOTA on 25 benchmarks&#8221; claim needs discounting, since the wins come mainly from internal entity-dense scenarios; and streaming support is &#8220;optimized&#8221; rather than &#8220;redesigned&#8221;, with sparse technical detail.<\/p>\n\n\n\n<p><strong>\u2b50 Score<\/strong>&nbsp;: 8\/10<\/p>\n\n\n\n<h3><strong>17. 
UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction<\/strong><\/h3>\n\n\n\n<p><strong>arXiv ID<\/strong>&nbsp;: 2604.19221<\/p>\n\n\n\n<p><strong>Release date<\/strong>&nbsp;: 2026-04-21<\/p>\n\n\n\n<p><strong>Institution<\/strong>&nbsp;: NIO (\u851a\u6765\u6c7d\u8f66)<\/p>\n\n\n\n<p><strong>Paper link<\/strong>&nbsp;: https:\/\/arxiv.org\/abs\/2604.19221<\/p>\n\n\n\n<p><strong>\ud83d\udccc Summary<\/strong><\/p>\n\n\n\n<p>Proposes the first unified audio front-end LLM (UAF) for full-duplex speech systems. It casts VAD, turn-taking detection (TD), speaker recognition (SR), ASR, QA, and other front-end tasks as a single autoregressive sequence prediction problem, taking fixed 600ms streaming audio chunks as input and emitting control-state tokens that drive the system state machine.<\/p>\n\n\n\n<p><strong>\ud83d\udd27 Architecture sketch<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Audio stream (fixed 600ms chunks)<br>&nbsp; &nbsp; \u2193<br>Audio encoder \u2192 feature extraction<br>&nbsp; &nbsp; \u2193<br>LLM (autoregressive)<br>&nbsp; &nbsp; \u251c\u2500\u2500 semantic tokens (transcript)<br>&nbsp; &nbsp; \u2514\u2500\u2500 control tokens (VAD state \/ speaker change \/ barge-in signal \/ QA trigger)<br>&nbsp; &nbsp; \u2193<br>Full-duplex system state machine (driven by the control tokens)<\/code><\/pre>\n\n\n\n<p><strong>\ud83d\udca1 Key innovations<\/strong><\/p>\n\n\n\n<ul><li>First LLM approach to unify full-duplex front-end tasks (VAD + TD + SR + ASR + 
QA)<\/li><li>600ms chunk-level streaming input, covering real-time control cases such as barge-in detection<\/li><li>Control tokens and semantic tokens are generated jointly and autoregressively, cutting end-to-end system latency<\/li><\/ul>\n\n\n\n<p><strong>\ud83d\udcca Training data &amp; results<\/strong><\/p>\n\n\n\n<ul><li>Data: internal full-duplex system data (scale undisclosed)<\/li><li>Full-duplex response latency and barge-in detection accuracy improve markedly (exact figures not fully disclosed)<\/li><\/ul>\n\n\n\n<p><strong>\u2620\ufe0f Sharp take<\/strong><\/p>\n\n\n\n<p>The direction is right: full-duplex speech systems are the hottest area right now, and folding VAD, turn-taking detection, speaker recognition, and ASR into one LLM is the least painful option for real deployments. Running barge-in detection on 600ms chunks keeps latency within an acceptable range. But the paper is thin on information: key performance figures are vague (&#8220;markedly improved&#8221; with no concrete numbers), the training data is entirely opaque, and there is no side-by-side comparison with full-duplex systems such as Moshi or Mini-Omni2. For now it reads more like a system description report than a rigorous research paper. The direction is worth watching; the paper itself is not worth tracking closely.<\/p>\n\n\n\n<p><strong>\u2b50 Score<\/strong>&nbsp;: 7\/10<\/p>\n\n\n\n<h2><strong>Overview comparison table (17 
papers)<\/strong><\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th>#<\/th><th>Paper \/ System<\/th><th>Year<\/th><th>Institution<\/th><th>Core method<\/th><th>Key innovation<\/th><th>Data scale<\/th><th>Streaming support<\/th><th>Main result<\/th><th>Score<\/th><\/tr><\/thead><tbody><tr><td>1<\/td><td>Prompting LLMs with Speech (2307.11795)<\/td><td>2023<\/td><td>Meta AI<\/td><td>GPT-style: Conformer prefix + LLaMA-7B<\/td><td>Among the first to validate the Speech+LLM paradigm; a frozen LLM can still learn multilingual ASR<\/td><td>MLS 44.5k h<\/td><td>\u274c<\/td><td>WER 4.3% (MLS en)<\/td><td>6\/10<\/td><\/tr><tr><td>2<\/td><td>Chunked AED Streaming (2309.08436)<\/td><td>2023<\/td><td>RWTH\/Google<\/td><td>EOC-token-driven chunk-wise AED<\/td><td>Theoretical unification of AED \u2248 Transducer; generalizes to long audio<\/td><td>LibriSpeech 960h<\/td><td>\u2705 chunk<\/td><td>WER 2.7% (test-clean)<\/td><td>7\/10<\/td><\/tr><tr><td>3<\/td><td>SLD Decoder-Only ASR (2311.04534)<\/td><td>2023<\/td><td>Alibaba DAMO<\/td><td>Discrete tokens + KL-divergence SLD training loss<\/td><td>Improves the training objective for autoregressive modeling of audio tokens<\/td><td>LibriSpeech 960h<\/td><td>\u274c<\/td><td>Outperforms Loss Masking<\/td><td>6\/10<\/td><\/tr><tr><td>4<\/td><td>BESTOW (2406.19954)<\/td><td>2024<\/td><td>NVIDIA<\/td><td>text query cross-attention + read-write policy<\/td><td>First open-source multitask streaming SpeechLLM<\/td><td>87k h multilingual<\/td><td>\u2705 adaptive<\/td><td>WER 1.9% (LibriSpeech clean)<\/td><td>8\/10<\/td><\/tr><tr><td>5<\/td><td>Seed-ASR (2407.04675)<\/td><td>2024<\/td><td>ByteDance<\/td><td>Pretraining\u2192SFT\u2192RLHF + context prompt<\/td><td>RLHF applied to ASR; prompt 
injection of hotwords\/domain<\/td><td>Hundreds of thousands of h, Chinese-English<\/td><td>\u274c offline<\/td><td>SOTA across internal scenarios<\/td><td>7\/10<\/td><\/tr><tr><td>6<\/td><td>Transducer-Llama (2412.16464)<\/td><td>2024<\/td><td>Meta AI<\/td><td>Factorized Transducer + weak-to-strong LM swap<\/td><td>The swap sidesteps the pitfalls of joint RNN-T+LLM training<\/td><td>MLS 44.7k h multilingual<\/td><td>\u2705 Transducer<\/td><td>WER -32% vs RNN-T<\/td><td>8\/10<\/td><\/tr><tr><td>7<\/td><td>Multi-token Prediction (2409.12116)<\/td><td>2024<\/td><td>JHU \/ Meta<\/td><td>Predicts multiple future tokens per step<\/td><td>ASR conditioning keeps the multi-token acceptance rate high<\/td><td>LibriSpeech 960h<\/td><td>\u2705<\/td><td>3.2x speedup, no WER loss<\/td><td>6\/10<\/td><\/tr><tr><td>8<\/td><td>MFLA (2506.03722)<\/td><td>2025<\/td><td>Honor \/ SJTU<\/td><td>CIF predictor + MFLA limited right context<\/td><td>CIF soft alignment replaces fixed chunks; unifies offline and online<\/td><td>WenetSpeech4TTS + LibriSpeech + MLS<\/td><td>\u2705 wait-k<\/td><td>Online WER 7.17%; latency -14.5%<\/td><td>7\/10<\/td><\/tr><tr><td>9<\/td><td>SpecASR (2507.18181)<\/td><td>2025<\/td><td>Xiamen University \/ multiple universities<\/td><td>Draft+Target LLM speculative decoding<\/td><td>Adaptive draft length; sparse token tree<\/td><td>Public benchmarks<\/td><td>\u2705<\/td><td>3.04x\u20133.79x speedup, zero accuracy loss<\/td><td>8\/10<\/td><\/tr><tr><td>10<\/td><td>WhisperKit (2507.10860)<\/td><td>2025<\/td><td>Argmax<\/td><td>Block-diagonal mask self-distillation + ANE quantization<\/td><td>Native on-device streaming; silence caching; 1.6G\u21920.6G<\/td><td>CommonVoice 17<\/td><td>\u2705 0.46s<\/td><td>WER 2.2%, beating cloud 
GPT-4o<\/td><td>7\/10<\/td><\/tr><tr><td>11<\/td><td>JEDIS-LLM (2511.16046)<\/td><td>2025<\/td><td>Alibaba<\/td><td>SPC + Word-level Speaker Supervision<\/td><td>First zero-shot streaming joint ASR + speaker diarization for long audio<\/td><td>Short audio \u226420s<\/td><td>\u2705 chunk<\/td><td>Outperforms Sortformer\/DiarizationLM<\/td><td>8\/10<\/td><\/tr><tr><td>12<\/td><td>Whisper-LLaDA (2509.16622)<\/td><td>2025<\/td><td>IDIAP \/ multiple universities<\/td><td>Whisper encoder + LLaDA-8B diffusion decoding<\/td><td>First validation of a diffusion LLM for ASR; audio conditioning is key<\/td><td>LibriSpeech 960h<\/td><td>\u274c<\/td><td>Cascade WER 2.25%\/4.94%; diffusion is faster but slightly less accurate<\/td><td>7\/10<\/td><\/tr><tr><td>13<\/td><td>MoCha-ASR (2601.22779)<\/td><td>2026<\/td><td>Hefei University of Technology \/ multiple universities<\/td><td>MoChA policy network + Qwen2.5 + minLT loss<\/td><td>End-to-end streaming LLM-ASR without CTC alignment<\/td><td>AISHELL-1\/2 + internal<\/td><td>\u2705 adaptive<\/td><td>AISHELL-1 CER 5.1%; latency -62.5%<\/td><td>8\/10<\/td><\/tr><tr><td>14<\/td><td>CHAT (2602.24245)<\/td><td>2026<\/td><td>Apple \/ Google<\/td><td>Within-chunk cross-attention joiner<\/td><td>Relaxes RNN-T's strict monotonicity constraint; large ST gain<\/td><td>NeMo multilingual<\/td><td>\u2705 chunk<\/td><td>WER -6.3%; BLEU +18%; inference 1.69x<\/td><td>7\/10<\/td><\/tr><tr><td>15<\/td><td>Uni-ASR (2603.11123)<\/td><td>2026<\/td><td>iFLYTEK (\u79d1\u5927\u8baf\u98de) \/ multiple universities<\/td><td>Joint NS\/SS\/CS training + fallback decoding<\/td><td>One model unifies streaming and non-streaming<\/td><td>WeNetSpeech 10k h+<\/td><td>\u2705 multiple chunk sizes<\/td><td>AISHELL-1 CER 2.15% (1s 
chunk)<\/td><td>7\/10<\/td><\/tr><tr><td>16<\/td><td>NIM4-ASR (2604.18105)<\/td><td>2026<\/td><td>NIO (\u851a\u6765)<\/td><td>phoneme CTC pretraining + IA-SFT + RL + RAG<\/td><td>Million-hotword RAG under 1ms; IA-SFT guards against drift<\/td><td>25 benchmarks + large-scale internal<\/td><td>\u2705 chunk<\/td><td>2.3B, SOTA on multiple benchmarks<\/td><td>8\/10<\/td><\/tr><tr><td>17<\/td><td>UAF (2604.19221)<\/td><td>2026<\/td><td>NIO (\u851a\u6765)<\/td><td>600ms-chunk LLM + multitask unification<\/td><td>First full-duplex front-end LLM; control tokens drive the state machine<\/td><td>Internal full-duplex data<\/td><td>\u2705 600ms<\/td><td>Improved full-duplex latency and barge-in accuracy (no concrete figures disclosed)<\/td><td>7\/10<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\"\/>\n\n\n\n<h2><strong>Trends and technical lineage<\/strong><\/h2>\n\n\n\n<p><strong>Three main lines of evolution<\/strong><\/p>\n\n\n\n<p><strong>\u2460 Decoding framework evolution<\/strong>&nbsp;: GPT-style prefix (2023) \u2192 read-write policy, BESTOW (2024) \u2192 MoChA adaptive, MoCha-ASR (2026) \u2192 unified NS\/SS\/CS, Uni-ASR (2026)<\/p>\n\n\n\n<p><strong>\u2461 Efficiency engineering<\/strong>&nbsp;: Multi-token prediction (2024) \u2192 speculative decoding, SpecASR (2025) \u2192 aggressive on-device ANE optimization, WhisperKit (2025) \u2192 hotword phoneme RAG, NIM4-ASR (2026)<\/p>\n\n\n\n<p><strong>\u2462 Multitask fusion<\/strong>&nbsp;: plain ASR (2023) \u2192 context-aware, Seed-ASR (2024) \u2192 joint speaker diarization, JEDIS-LLM (2025) \u2192 unified full-duplex front end, 
UAF (2026)<\/p>\n\n\n\n<p><strong>Milestones<\/strong><\/p>\n\n\n\n<ul><li><strong>2023<\/strong>&nbsp;: The LLM-ASR paradigm is established (Speech Prompt + LLM); streaming is still a blank<\/li><li><strong>2024<\/strong>&nbsp;: BESTOW establishes the read-write policy framework, Transducer-Llama delivers the strongest RNN-T solution, and Seed-ASR reaches industrial deployment<\/li><li><strong>2025<\/strong>&nbsp;: Inference acceleration takes off (SpecASR 3x+), on-device deployment matures (WhisperKit 0.46s), and multitask fusion arrives (JEDIS-LLM)<\/li><li><strong>2026<\/strong>&nbsp;: Unified architecture (Uni-ASR), full-featured production system (NIM4-ASR), full-duplex front end (UAF)<\/li><\/ul>\n","protected":false},"excerpt":{"rendered":"<p>\u539f\u521b\uff1a\u8d3e\u5f66 \u65f6\u95f4\u8303\u56f4\uff1a2022.01\u20132026.04\uff0c\u5171\u6536\u5f5517 \u7bc7&nbsp;\u4ee3\u8868\u6027\u8bba\u6587\uff0c\u6309\u65f6\u95f4\u987a\u5e8f\u6392\u5217\u3002\u6bcf &hellip; <a href=\"http:\/\/139.9.1.231\/index.php\/2026\/05\/11\/llm-asr-stream-20222026\/\" class=\"more-link\">\u7ee7\u7eed\u9605\u8bfb<span class=\"screen-reader-text\">\u6d41\u5f0f LLM-ASR 
\u6a21\u578b\u4f18\u5316\u8bba\u6587\u5168\u666f\uff082022\u20132026\uff09<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[4,9,38,34],"tags":[],"_links":{"self":[{"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/posts\/31047"}],"collection":[{"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/comments?post=31047"}],"version-history":[{"count":27,"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/posts\/31047\/revisions"}],"predecessor-version":[{"id":31075,"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/posts\/31047\/revisions\/31075"}],"wp:attachment":[{"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/media?parent=31047"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/categories?post=31047"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/tags?post=31047"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}