{"id":24679,"date":"2025-02-11T15:41:41","date_gmt":"2025-02-11T07:41:41","guid":{"rendered":"http:\/\/139.9.1.231\/?p=24679"},"modified":"2025-02-11T15:41:42","modified_gmt":"2025-02-11T07:41:42","slug":"wetextprocessing","status":"publish","type":"post","link":"http:\/\/139.9.1.231\/index.php\/2025\/02\/11\/wetextprocessing\/","title":{"rendered":"WeTextProcessing: Text [Inverse] Normalization"},"content":{"rendered":"\n<p><strong><em>GitHub: <a href=\"https:\/\/github.com\/wenet-e2e\/WeTextProcessing\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/github.com\/wenet-e2e\/WeTextProcessing<\/a><\/em><\/strong><\/p>\n\n\n\n<p><strong><em>Excerpted from: <a href=\"https:\/\/mp.weixin.qq.com\/s\/q_11lck78qcjylHCi6wVsQ\" target=\"_blank\" rel=\"noreferrer noopener\">https:\/\/mp.weixin.qq.com\/s\/q_11lck78qcjylHCi6wVsQ<\/a><\/em><\/strong><\/p>\n\n\n\n<p><strong><em>FunASR repositories:<\/em><\/strong><\/p>\n\n\n\n<ul><li><a href=\"https:\/\/www.modelscope.cn\/models\/thuduj12\/fst_itn_zh\">https:\/\/www.modelscope.cn\/models\/thuduj12\/fst_itn_zh<\/a><\/li><li><a href=\"https:\/\/github.com\/duj12\/WeTextProcessing\/tree\/master\">https:\/\/github.com\/duj12\/WeTextProcessing\/tree\/master<\/a><\/li><\/ul>\n\n\n\n<h2><strong>Motivation<\/strong><\/h2>\n\n\n\n<p><strong>Text Normalization (TN) and Inverse Text 
Normalization (ITN)<\/strong> are indispensable parts of a complete voice interaction system. The former is widely used in <strong>the front-end processing of speech synthesis systems<\/strong>, while the latter affects how subtitles look when <strong>recognized text is displayed on screen<\/strong> by a speech recognition system.<\/p>\n\n\n\n<p>The TN \/ ITN systems widely studied in academia today fall into three main types:<\/p>\n\n\n\n<ul><li><strong>Grammar-rule-based WFST<\/strong> [1]: such a system consists of a large number of language-specific grammars. Its advantage is that it is accurate and controllable, and bugs can be fixed quickly; its drawback is that it is not robust enough on text that is prone to ambiguity.<\/li><li>Neural end-to-end models [2]: when building such a model, the challenge shifts from writing more precise grammar rules to annotating and collecting data with broader coverage. A major drawback of end-to-end models is that they can produce unrecoverable errors, where the converted text may be grammatically plausible yet far from the meaning of the original text. In addition, <strong>bad cases 
are also not as quick to fix as with rules.<\/strong><\/li><li><strong>Hybrid systems that combine grammar rules and neural networks<\/strong> [3]: in a hybrid framework, the system falls back to the neural network only when no matching grammar rule is found. This approach balances the strengths and weaknesses of rules and NNs fairly well, but it demands more computing resources.<\/li><\/ul>\n\n\n\n<p>Given the trade-offs among these three kinds of systems, <strong>WeTextProcessing<\/strong> chooses to <strong>implement the grammar-rule-based WFST approach<\/strong>. Among open-source TN \/ ITN projects worldwide, the most widely used is currently <strong>Sparrowhawk [4]<\/strong>, a C++ framework released by Google. Its shortcoming is that it is only a rule execution engine; Google has not open-sourced the grammar rules for the languages involved. In addition, Sparrowhawk's implementation depends on many third-party open-source libraries (including OpenFst, Thrax, re2, and protobuf), which keeps the overall framework from being simple and lightweight. Another fairly mature project is <strong>nemo_text_processing [5]<\/strong>, open-sourced by NVIDIA, which still uses Sparrowhawk 
as its deployment tool in production. Unlike Google, this project also open-sources grammar rules for several languages such as English, German, and Russian. In the area of Chinese TN \/ ITN rules, third-party individual developers such as Jiayu have open-sourced a customized Chinese TN \/ ITN rule library, <strong>chinese_text_normalization [6]<\/strong>.<\/p>\n\n\n\n<p>Standing on the shoulders of these excellent open-source projects, WeTextProcessing follows the principles of ease of use and Production First &amp; Production Ready to design and implement an open-source, easy-to-use TN \/ ITN tool specifically for Chinese. It includes a complete set of Chinese TN \/ ITN grammar rules, and it also provides a Python package that can be installed with a single pip install, as well as a C++ rule processing engine that is lighter overall and has fewer dependencies than Sparrowhawk (only OpenFst in production).<\/p>\n\n\n\n<h2><strong>Quick Start<\/strong><\/h2>\n\n\n\n<p>One-command install, and six lines of code take care of text processing!<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code><em># install<\/em><\/code>\n<code>pip install WeTextProcessing<\/code>\n\n# tn usage\n>>> from tn.chinese.normalizer import Normalizer\n>>> normalizer = Normalizer()\n>>> normalizer.normalize(\"2.5\u5e73\u65b9\u7535\u7ebf\")\n\n# itn usage\n>>> from itn.chinese.inverse_normalizer import InverseNormalizer\n>>> invnormalizer = 
InverseNormalizer()\n>>> invnormalizer.normalize(\"\u4e8c\u70b9\u4e94\u5e73\u65b9\u7535\u7ebf\")<\/pre>\n\n\n\n<h2><strong>Technical Details<\/strong><\/h2>\n\n\n\n<p>Both the TN and ITN pipelines consist of three parts: <strong>Tagger, Reorder, and Verbalizer<\/strong>. The Tagger parses the input text into structured information. Reorder adjusts the order of that structured information. Finally, the Verbalizer joins the reordered structured information back into text.<\/p>\n\n\n\n<h3><strong>TN Pipeline<\/strong><\/h3>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" width=\"978\" height=\"904\" src=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2025\/02\/image-36.png\" alt=\"\" class=\"wp-image-24695\" srcset=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2025\/02\/image-36.png 978w, http:\/\/139.9.1.231\/wp-content\/uploads\/2025\/02\/image-36-300x277.png 300w, http:\/\/139.9.1.231\/wp-content\/uploads\/2025\/02\/image-36-768x710.png 768w\" sizes=\"(max-width: 978px) 100vw, 978px\" \/><\/figure>\n\n\n\n<h3><strong>ITN Pipeline<\/strong><\/h3>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" width=\"996\" height=\"901\" src=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2025\/02\/image-37.png\" alt=\"\" class=\"wp-image-24697\" srcset=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2025\/02\/image-37.png 996w, http:\/\/139.9.1.231\/wp-content\/uploads\/2025\/02\/image-37-300x271.png 300w, http:\/\/139.9.1.231\/wp-content\/uploads\/2025\/02\/image-37-768x695.png 768w\" sizes=\"(max-width: 996px) 100vw, 996px\" \/><\/figure>\n\n\n\n<h3><strong>Grammar Rule Design<\/strong><\/h3>\n\n\n\n<p>WeTextProcessing uses <strong>pynini 
[7]<\/strong> to write and compile rule grammars. A rule grammar can transform one string into another. A rule grammar can usually be represented as a WFST, and pynini uses OpenFst under the hood to implement WFST-related functionality. An example rule grammar written with pynini is shown in the figure below:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" width=\"829\" height=\"825\" src=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2025\/02\/image-38.png\" alt=\"\" class=\"wp-image-24700\" srcset=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2025\/02\/image-38.png 829w, http:\/\/139.9.1.231\/wp-content\/uploads\/2025\/02\/image-38-300x300.png 300w, http:\/\/139.9.1.231\/wp-content\/uploads\/2025\/02\/image-38-150x150.png 150w, http:\/\/139.9.1.231\/wp-content\/uploads\/2025\/02\/image-38-768x764.png 768w, http:\/\/139.9.1.231\/wp-content\/uploads\/2025\/02\/image-38-120x120.png 120w\" sizes=\"(max-width: 829px) 100vw, 829px\" \/><\/figure>\n\n\n\n<ul><li>In <strong>digits = zero | digit<\/strong>, the | operator denotes the union operation in WFST theory;<\/li><li><strong>cross(&#8216;\u5341&#8217;, &#8216;1&#8217;)<\/strong> denotes an arc in WFST theory whose input is \u201c\u5341\u201d and whose output is \u201c1\u201d; if the WFST 
traverses this arc when moving from one state to another, the system has matched \u201c\u5341\u201d and converted it to \u201c1\u201d;<\/li><li><strong>delete(&#8216;\u5341&#8217;)<\/strong> denotes an arc whose input is \u201c\u5341\u201d and whose output is empty, so \u201c\u5341\u201d is deleted when the arc is traversed;<\/li><li>In <strong>digit + delete(&#8216;\u5341&#8217;)<\/strong>, the + denotes the concat operation in WFST theory, which joins two FSTs;<\/li><li><strong>accep(&#8216;\u5146&#8217;)<\/strong> denotes an arc whose input and output are both \u201c\u5146\u201d, in which case the WFST is equivalent to an FSA;<\/li><li><strong>addzero**2<\/strong> and <strong>addzero**3<\/strong> denote repeating addzero two and three times, respectively;<\/li><li><strong>digits.ques<\/strong> and <strong>digits.plus<\/strong> denote repeating digits zero or one time and one or more times, respectively.<\/li><\/ul>\n\n\n\n<p>There are some further grammar features, such as those in the figure below:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" width=\"978\" height=\"442\" src=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2025\/02\/image-39.png\" alt=\"\" class=\"wp-image-24702\" srcset=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2025\/02\/image-39.png 978w, http:\/\/139.9.1.231\/wp-content\/uploads\/2025\/02\/image-39-300x136.png 300w, http:\/\/139.9.1.231\/wp-content\/uploads\/2025\/02\/image-39-768x347.png 768w\" sizes=\"(max-width: 978px) 100vw, 978px\" \/><\/figure>\n\n\n\n<ul><li><strong>add_weight(Char().tagger, 
100)<\/strong> assigns a weight (path length) of 100 to the Char().tagger path. When several paths can match the current input, the shortest path is taken as the final result. For example, \u201c\u4e00\u70b9\u96f6\u4e94\u5206\u201d is ultimately converted by ITN to \u201c1:05\u201d rather than \u201c1.05\u5206\u201d.<\/li><li><strong>insert(&#8216; &#8217;)<\/strong> denotes an arc whose input and output are \u201c\u201d and \u201c \u201d, respectively, so a space is forcibly inserted when the arc is traversed.<\/li><li>In <strong>processor @ tagger.optimize()<\/strong>, the @ denotes composing two FSTs, and optimize() applies epsilon-removal, determinization, and minimization to tagger [8].<\/li><li><strong>&#8216;[EOS]&#8217;<\/strong> denotes the end of the matched string, as in a regular expression; likewise &#8216;[BOS]&#8217;, which is not shown here, denotes the beginning [9].<\/li><\/ul>\n\n\n\n<p>For more detailed explanations, see the pynini documentation [7]. All WFSTs built in this article use the 
tropical semiring, the default in OpenFst, as their type. We made this choice because this type is efficient for shortest-path operations on a lattice: the weight of a path is computed simply by summing the weights of all arcs along it.<\/p>\n\n\n\n<h2><strong>Advanced Usage<\/strong><\/h2>\n\n\n\n<h3>How to Quickly Fix Bad Cases<\/h3>\n\n\n\n<p>When we hit a bad case, we first need to determine what type it belongs to: a date, a time, a fraction, and so on. Was it left unconverted, or was it converted into some other type? Then we fix it in the corresponding rules, which may require changing code or changing a tsv file.<\/p>\n\n\n\n<p>For example, if the ITN system wrongly converts \u201c\u4e09\u5fc3\u4e8c\u610f\u201d into \u201c3\u5fc32\u610f\u201d, there are two possible fixes:<\/p>\n\n\n\n<ol><li>Add the relevant mapping to whitelist.tsv so that the word is left unconverted<\/li><li>Set enable_standalone_number to False, in which case the system does not convert numbers that carry no unit<\/li><\/ol>\n\n\n\n<p>It is worth noting that most WeTextProcessing failures are long-tail problems caused by contextual ambiguity or special cases. For example, \u201c\u4e09\u70b9\u4e94\u5206\u201d can be the time \u201c3:05\u201d 
or the measure \u201c3.5 \u5206\u201d denoting an athlete's score. Taking more context into account when writing grammars can mitigate this to some extent; for example, if \u201c\u4e09\u70b9\u4e94\u5206\u201d is preceded by the word \u201c\u5f97\u5230\u201d (to get), it is detected as an athlete's score. Of course, this kind of patching cannot cover every situation. For this reason, designing a system that covers 100% of scenarios would inevitably make the number of grammars grow exponentially. Other common failures come from incomplete definitions. For example, if the measure abbreviation conversion from \u201c\u5343\u74e6\u65f6\u201d to \u201ckwh\u201d is not predefined, the system cannot convert \u201c\u4e24\u767e\u5343\u74e6\u65f6\u201d into \u201c200kwh\u201d. This problem is relatively easy to solve: just add the required conversion rule to the existing measure class.<\/p>\n\n\n\n<h3>Production Deployment<\/h3>\n\n\n\n<p>Users who want to customize the rules themselves can obtain their own rule files as follows and deploy them to different environments.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>git clone https:\/\/github.com\/wenet-e2e\/WeTextProcessing.git<\/code>\n<code>cd WeTextProcessing<\/code>\n<code><em># `overwrite_cache` will rebuild all 
rules according to<\/em><\/code>\n<code><em>#   your modifications on tn\/chinese\/rules\/xx.py (itn\/chinese\/rules\/xx.py).<\/em><\/code>\n<code><em>#   After rebuild, you can find new far files at `$PWD\/tn` and `$PWD\/itn`.<\/em><\/code>\n<code>python normalize.py --text \"2.5\u5e73\u65b9\u7535\u7ebf\" --overwrite_cache<\/code>\n<code>python inverse_normalize.py --text \"\u4e8c\u70b9\u4e94\u5e73\u65b9\u7535\u7ebf\" --overwrite_cache<\/code><\/pre>\n\n\n\n<p>Use your own rules with the package already installed via pip:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code><em># tn usage<\/em><\/code>\n<code>>>> from tn.chinese.normalizer import Normalizer<\/code>\n<code>>>> normalizer = Normalizer(cache_dir=\"PATH_TO_GIT_CLONED_WETEXTPROCESSING\/tn\")<\/code>\n<code>>>> normalizer.normalize(\"2.5\u5e73\u65b9\u7535\u7ebf\")<\/code>\n<code><em># itn usage<\/em><\/code>\n<code>>>> from itn.chinese.inverse_normalizer import InverseNormalizer<\/code>\n<code>>>> invnormalizer = InverseNormalizer(cache_dir=\"PATH_TO_GIT_CLONED_WETEXTPROCESSING\/itn\")<\/code>\n<code>>>> invnormalizer.normalize(\"\u4e8c\u70b9\u4e94\u5e73\u65b9\u7535\u7ebf\")<\/code><\/pre>\n\n\n\n<p>Use your own rules in C++:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><code>cmake -B build -S runtime -DCMAKE_BUILD_TYPE=Release<\/code>\n<code>cmake <em>--build build<\/em><\/code>\n<code><em># tn usage<\/em><\/code>\n<code>.\/build\/bin\/processor_main <em>--far PATH_TO_GIT_CLONED_WETEXTPROCESSING\/tn\/zh_tn_normalizer.far --text \"2.5\u5e73\u65b9\u7535\u7ebf\"<\/em><\/code>\n<code><em># itn usage<\/em><\/code>\n<code>.\/build\/bin\/processor_main <em>--far PATH_TO_GIT_CLONED_WETEXTPROCESSING\/itn\/zh_itn_normalizer.far --text \"\u4e8c\u70b9\u4e94\u5e73\u65b9\u7535\u7ebf\"<\/em><\/code><\/pre>\n\n\n\n<h2><strong>Summary and Outlook<\/strong><\/h2>\n\n\n\n<p>Going forward, work on WeTextProcessing 
will focus on patching rules for corner cases. Compared with writing rules, designing a sound test set is harder, because real production always turns up countless corner cases. Although WeTextProcessing provides simple unit tests and example tests, their coverage still falls short of 100%. One of the key directions for WeTextProcessing is therefore to be deployed in more and more real online environments, learn from the errors that surface there, and analyze case by case the gaps that may exist in the current rules and close them.<\/p>\n\n\n\n<h2><strong>References<\/strong><\/h2>\n\n\n\n<p>[1] Peter Ebden and Richard Sproat, \u201cThe kestrel TTS text normalization system,\u201d Nat. Lang. Eng., vol. 21, no. 3, pp. 333\u2013353, 2015.<\/p>\n\n\n\n<p>[2] Courtney Mansfield, Ming Sun, Yuzong Liu, Ankur Gandhe, and Bj\u00f6rn Hoffmeister, \u201cNeural text normalization with subword units,\u201d in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 2 (Industry Papers), Anastassia Loukina, Michelle Morales, and Rohit Kumar, Eds. 2019, pp. 190\u2013196, Association for Computational Linguistics.<\/p>\n\n\n\n<p>[3] Richard Sproat and Navdeep Jaitly, \u201cAn RNN model of text normalization,\u201d in Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20-24, 2017, Francisco Lacerda, Ed. 
2017, pp. 754\u2013758, ISCA.<\/p>\n\n\n\n<p>[4] Peter Ebden and Richard Sproat, \u201cSparrowhawk,\u201d 2022, https:\/\/github.com\/google\/sparrowhawk.<\/p>\n\n\n\n<p>[5] Yang Zhang, \u201cnemo_text_processing,\u201d 2022, https:\/\/github.com\/NVIDIA\/NeMo\/tree\/main\/nemo_text_processing.<\/p>\n\n\n\n<p>[6] Jiayu Du, \u201cchinese_text_normalization,\u201d 2022, https:\/\/github.com\/speechio\/chinese_text_normalization.<\/p>\n\n\n\n<p>[7] K. Gorman. 2016. Pynini: A Python library for weighted finite-state grammar compilation. In <em>Proceedings of the ACL Workshop on Statistical NLP and Weighted Automata<\/em>, pages 75-80.<\/p>\n\n\n\n<p>[8] https:\/\/www.opengrm.org\/twiki\/bin\/view\/GRM\/PyniniOptimizeDoc<\/p>\n\n\n\n<p>[9] https:\/\/www.openfst.org\/twiki\/bin\/view\/GRM\/ThraxQuickTour<\/p>\n","protected":false},"excerpt":{"rendered":"<p>GitHub: https:\/\/github.com\/wenet-e2e\/WeTextProcessing Excerpted from &hellip; <a href=\"http:\/\/139.9.1.231\/index.php\/2025\/02\/11\/wetextprocessing\/\" class=\"more-link\">Continue reading<span 
class=\"screen-reader-text\">WeTextProcessing: Text [Inverse] Normalization<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[],"_links":{"self":[{"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/posts\/24679"}],"collection":[{"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/comments?post=24679"}],"version-history":[{"count":24,"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/posts\/24679\/revisions"}],"predecessor-version":[{"id":24707,"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/posts\/24679\/revisions\/24707"}],"wp:attachment":[{"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/media?parent=24679"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/categories?post=24679"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/tags?post=24679"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}