In a Transformer decoder, the first layer after input preprocessing is the token embedding matrix 𝐄, which maps integer-valued tokens to dense embeddings; given a vocabulary of t tokens and embeddings of size m, 𝐄 is a t×m matrix whose i-th row gives the embedding of the i-th token. Another embedding matrix 𝐄′ appears in the final softmax layer, where it computes the logits for every token at each position; it is an m×t matrix that is multiplied with the model's m-dimensional output to obtain a t-dimensional vector of logits, one per token. In the PaLM architecture these two matrices share variables, so one is the transpose of the other, i.e. 𝐄′ = 𝐄⊺.
The rest of the decoder architecture is completely agnostic to the number of tokens being modeled. Therefore, only one small modification is needed to turn a text-only model into one that models both text and audio: the embedding matrix 𝐄 is enlarged to size (t+a)×m, where a is the number of audio tokens (the size of 𝐄′ = 𝐄⊺ changes accordingly).
To leverage the pretrained text model, the existing checkpoint is modified by appending a new rows to the embedding matrix 𝐄. As an implementation detail, the first t tokens (from 0 to t) correspond to SentencePiece text tokens, while the next a tokens (from t to t+a) represent audio tokens. Although the text embeddings of the pretrained model can be reused, the new audio embeddings are freshly initialized and must be trained. The authors found it necessary to train all model parameters rather than keeping the previous weights frozen, using a mixture of speech and text tasks.
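A minimal PyTorch sketch of this vocabulary extension (the function name extend_token_embedding and all shapes are illustrative, not the actual PaLM/AudioPaLM code): the pretrained t×m embedding is copied into the first t rows, the a audio rows are freshly initialized, and the output projection reuses the transposed weight so that 𝐄′ = 𝐄⊺.

import torch
import torch.nn as nn

def extend_token_embedding(pretrained: nn.Embedding, num_audio_tokens: int) -> nn.Embedding:
    """Append freshly initialized audio rows to a pretrained text embedding."""
    t, m = pretrained.weight.shape                      # t text tokens, m-dim embeddings
    extended = nn.Embedding(t + num_audio_tokens, m)
    with torch.no_grad():
        extended.weight[:t] = pretrained.weight         # reuse text rows 0 .. t-1
        nn.init.normal_(extended.weight[t:], std=0.02)  # new audio rows t .. t+a-1
    return extended

emb = extend_token_embedding(nn.Embedding(32_000, 4096), num_audio_tokens=1024)
hidden = torch.randn(2, 8, 4096)       # m-dimensional decoder output
logits = hidden @ emb.weight.T         # tied softmax: one logit per text or audio token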
Semantic language models: autoregressive models over semantic tokens, usually paired with an external unit vocoder to recover the waveform, e.g. SpeechGPT. These models can handle speech-content tasks, but the audio they produce is of mediocre quality, and they cannot handle paralinguistic tasks such as voice (timbre) conversion.
Acoustic language models: speech LMs over acoustic tokens, e.g. VALL-E. These models produce high-quality audio and do well on tasks such as zero-shot TTS, but they suffer from content inaccuracy.
Hierarchical speech language models: cascades of a semantic language model and an acoustic language model, which can produce both accurate content and good audio quality, e.g. AudioPaLM. However, these models involve too many modeling stages, are fairly complex, and require two kinds of tokenizers; moreover, there is considerable information redundancy between semantic tokens and acoustic tokens, which adds unnecessary modeling difficulty.
[
{
"prefix": "You are an AI assistant whose name is SpeechGPT.\n- SpeechGPT is a intrinsic cross-modal conversational language model that is developed by Fudan University. SpeechGPT can understand and communicate fluently with human through speech or text chosen by the user.\n- It can perceive cross-modal inputs and generate cross-modal outputs.\n",
"plain_text": "[Human]: Try to speak out this sentence, please. This is input: The alchemist rode in front, with the falcon on his shoulder.<eoh> [SpeechGPT]: <sosp><661><588><604><157><596><499><596><106><596><189><63><189><665><991><162><202><393><946><327><905><907><597><660><351><557><794><788><59><754><12><977><877><333><873><835><67><940><118><686><613><169><72><644><553><535><935><101><741><384><173><894><787><380><787><196><555><721><944><250><56><812><222><915><143><390><479><330><435><647><246><650><816><325><506><686><208><613><417><755><193><411><452><111><735><6><735><63><665><644><991><535><271><333><196><918><29><202><393><946><734><390><479><330><776><167><761><907><597><660><351><557><794><75><788><15><366><896><627><168><654><659><177><183><609><710><187><493><361><470><821><59><56><198><912><742><840><431><531><76><668><576><803><791><380><660><325><801><549><366><377><164><309><584><605><193><71><39><eosp><eoa> "
},
]
[
{
"prefix": "You are an AI assistant whose name is SpeechGPT.\n- SpeechGPT is a intrinsic cross-modal conversational language model that is developed by Fudan University. SpeechGPT can understand and communicate fluently with human through speech or text chosen by the user.\n- It can perceive cross-modal inputs and generate cross-modal outputs.\n",
"plain_text": "[Human]: <sosp><661><987><511><732><951><997><111><982><189><63><665><991><535><101><741><173><945><944><503><641><124><565><734><870><290><978><833><238><761><907><430><901><185><403><557><244><583><788><663><969><896><627><143><515><663><969><660><691><251><412><260><41><740><677><253><380><382><268><506><876><417><755><16><819><80><651><80><651><80><987><588><eosp><eoh>. [SpeechGPT]: What is a bad term for poop?; [ta] A bad term for poop is excrement. It is usually used as a polite way to refer to fecal waste.; [ua] <sosp><497><63><264><644><710><823><565><577><154><331><384><173><945><29><244><326><583><728><576><663><969><896><627><143><38><515><663><24><382><251><676><412><260><41><740><677><253><382><268><876><233><878><609><389><771><865><641><124><878><609><423><384><879><487><219><522><589><337><126><119><663><748><12><671><877><377><385><902><819><619><842><419><997><829><111><666><42><277><63><665><644><389><771><685><437><641><124><258><436><139><340><11><59><518><56><948><86><258><436><139><340><347><376><940><118><944><878><173><641><124><362><734><179><961><931><878><609><423><384><879><219><522><866><337><243><935><101><741><822><89><194><630><86><555><105><79><868><220><156><824><998><870><390><422><330><776><663><969><523><105><79><799><220><357><390><479><422><330><776><485><165><86><501><119><716><205><521><787><935><101><741><89><194><664><835><67><940><118><613><417><755><902><415><772><497><eosp><eoa>."
},
]
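The plain_text fields above interleave text with discrete speech units wrapped in <sosp> ... <eosp>. A small helper like the following (hypothetical, not part of the SpeechGPT codebase) recovers the integer unit IDs for vocoding:

import re

def extract_units(plain_text: str) -> list[int]:
    """Return the discrete unit IDs enclosed in <sosp>...<eosp>."""
    units = []
    for span in re.findall(r"<sosp>(.*?)<eosp>", plain_text, flags=re.S):
        units.extend(int(u) for u in re.findall(r"<(\d+)>", span))
    return units

sample = "[SpeechGPT]: <sosp><661><588><604><eosp><eoa>"
print(extract_units(sample))  # -> [661, 588, 604]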
Step 3, data construction (Instruction Formatting): for a discrete unit sequence U and its associated transcription T, we decide with probability p whether it will be used to construct an ASR task or a TTS task. We then randomly pick a description D from the task descriptions generated in step 2, giving a triple (D, U, T). The sample is filled into the template "[Human]: {D}. This is input: {U}<eoh>. [SpeechGPT]: {T}<eos>." To support multi-turn dialogue, the assembled instructions are concatenated in multi-turn dialogue form, up to the model's maximum input length.
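A rough sketch of this formatting step (the function name, the tiny description pools, and the value of p are illustrative; the template follows the description above, with the roles of U and T swapped between ASR and TTS):

import random

ASR_DESCRIPTIONS = ["Can you transcribe the speech into a written format?"]  # sampled from Appendix B
TTS_DESCRIPTIONS = ["Can you please read this sentence out loud?"]

def format_sample(units: str, transcript: str, p_asr: float = 0.5) -> str:
    """Build a single-turn ASR or TTS instruction from a (U, T) pair."""
    if random.random() < p_asr:          # ASR: speech units in, text out
        desc, inp, out = random.choice(ASR_DESCRIPTIONS), units, transcript
    else:                                # TTS: text in, speech units out
        desc, inp, out = random.choice(TTS_DESCRIPTIONS), transcript, units
    return f"[Human]: {desc} This is input: {inp}<eoh>. [SpeechGPT]: {out}<eos>."

print(format_sample("<sosp><661><588><eosp>", "hello world"))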
LLM: Meta AI's LLaMA (Touvron et al., 2023) is adopted as the large language model. LLaMA consists of an embedding layer, multiple Transformer blocks, and an LM head, with total parameter counts ranging from 7B to 65B. Trained on an extensive corpus of 1.0 trillion tokens, LLaMA is competitive with the much larger 175B GPT-3 across a variety of NLP benchmarks.
Unit Vocoder: a multi-speaker unit HiFi-GAN is trained to decode speech signals from the discrete representations. The HiFi-GAN architecture consists of a generator 𝐆 and multiple discriminators 𝐃. The generator embeds the discrete representations with a lookup table (LUT), and the embedded sequence is upsampled by a stack of blocks, each composed of transposed convolutions and residual blocks with dilated layers. A speaker embedding is concatenated to every frame of the upsampled sequence. The discriminator comprises a multi-period discriminator (MPD) and a multi-scale discriminator (MSD).
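A condensed sketch of the generator's input stage (channel sizes, kernel sizes, and upsampling factors are placeholders, not the actual unit HiFi-GAN configuration): the units pass through a LUT embedding, the speaker embedding is broadcast and concatenated to every frame, and transposed convolutions upsample toward the waveform rate.

import torch
import torch.nn as nn

class UnitVocoderStub(nn.Module):
    def __init__(self, num_units=1000, unit_dim=128, spk_dim=256):
        super().__init__()
        self.lut = nn.Embedding(num_units, unit_dim)  # LUT over discrete units
        self.upsample = nn.Sequential(                # two illustrative upsampling stages
            nn.ConvTranspose1d(unit_dim + spk_dim, 256, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.1),
            nn.ConvTranspose1d(256, 128, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.1),
            nn.Conv1d(128, 1, kernel_size=7, padding=3),
            nn.Tanh(),                                # waveform head
        )

    def forward(self, units, spk_emb):
        x = self.lut(units).transpose(1, 2)                      # (B, unit_dim, T)
        spk = spk_emb.unsqueeze(-1).expand(-1, -1, x.size(-1))   # speaker embedding per frame
        return self.upsample(torch.cat([x, spk], dim=1))         # (B, 1, T * 64)

wav = UnitVocoderStub()(torch.randint(0, 1000, (1, 50)), torch.randn(1, 256))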
ASR:
You are asked to come up with a set of 100 diverse task instructions about automatic speech
recognition, which is about recognizing speech.
Here are the requirements:
1. These instructions should be to instruct someone to recognize the content of the following
speech.
2. Try not to repeat the verb for each instruction to maximize diversity.
3. The language used for instruction also should be diverse. For example, you should
combine questions with imperative instructions.
4. The type of instructions should be diverse.
5. The instructions should be in English.
6. The instructions should be 1 to 2 sentences long. Either an imperative sentence or a
question is permitted.
List of 100 tasks:
TTS:
You are asked to come up with a set of 100 diverse task instructions about text to speech,
which is about converting text into speech.
Here are the requirements:
1. These instructions should be to instruct someone to read the following text aloud.
2. Try not to repeat the verb for each instruction to maximize diversity.
3. The language used for instruction also should be diverse. For example, you should
combine questions with imperative instructions.
4. The type of instructions should be diverse.
5. The instructions should be in English.
6. The instructions should be 1 to 2 sentences long. Either an imperative sentence or a
question is permitted.
List of 100 tasks:
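Task descriptions like those in Appendix B can be bootstrapped by sending the prompts above to an instruction-following LLM. A hedged sketch using the OpenAI Python SDK (the model name and the truncated prompt are placeholders; the paper does not specify these details):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

asr_prompt = (
    "You are asked to come up with a set of 100 diverse task instructions about "
    "automatic speech recognition, which is about recognizing speech.\n"
    "Here are the requirements:\n"
    "...\n"  # the six requirements listed above
    "List of 100 tasks:"
)

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder model name
    messages=[{"role": "user", "content": asr_prompt}],
)
descriptions = resp.choices[0].message.content.splitlines()
print(descriptions[:5])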
Appendix B: Examples of Task Descriptions (a few examples generated by GPT)
ASR:
Begin by converting the spoken words into written text.
Can you transcribe the speech into a written format?
Focus on translating the audible content into text.
Transcribe the speech by carefully listening to it.
Would you kindly write down the content of the speech?
Analyze the speech and create a written transcription.
Engage with the speech to produce a text-based version.
Can you document the speech in written form?
Transform the spoken words into text accurately.
How about putting the speech’s content into writing?
TTS:
Can you please read this sentence out loud?
Recite the following words as if you were speaking normally.
Project your voice to clearly articulate this statement.
Would you mind speaking these words as naturally as possible?
Whisper the given sentence softly.
Enunciate each word in this sentence with precision.
How would you express this sentence in a conversational tone?
Could you please relay the message below verbally?
Emphasize the key points while reading the sentence.
Sing the text provided in a melodic voice.
Appendix C: Chain-of-Modality Instruction Templates
Speech Instruction-Speech Response:
[Human]: This is a speech instruction: {SpeechI}. And your response should be speech.
You can do it step by step. You can first transcribe the instruction and get the text Instruction.
Then you can think about the instruction and get the text response. Last, you should speak the
response aloud <eoh>. [SpeechGPT]: [tq] {TextI}; [ta] {TextR}; [ua] {SpeechR}<eoa>.
Speech Instruction-Text Response:
[Human]: This is a speech instruction: {SpeechI}. And your response should be text. You
can do it step by step. You can first transcribe the instruction and get the text instruction.
Then you can think about the instruction and get the text response. <eoh>. [SpeechGPT]:
[tq] {TextI}; [ta] {TextR}<eoa>.
Text Instruction-Speech Response:
[Human]: This is a text instruction: {TextI}. And your response should be speech. You can
do it step by step. You can think about the instruction and get the text response. Then you
should speak the response aloud <eoh>. [SpeechGPT]: [ta] {TextR}; [ua] {SpeechR}<eoa>.
Text Instruction-Text Response:
[Human]: This is a text instruction: {TextI}. And your response should be text. You can
think about the instruction and get the text response. <eoh>. [SpeechGPT]: [ta] {TextR}<eoa>.
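For illustration, a tiny helper that fills the first template (speech instruction → speech response); the function name and placeholder values are made up, not from the paper:

COM_TEMPLATE = (
    "[Human]: This is a speech instruction: {SpeechI}. And your response should be speech. "
    "You can do it step by step. You can first transcribe the instruction and get the text instruction. "
    "Then you can think about the instruction and get the text response. Last, you should speak the "
    "response aloud <eoh>. [SpeechGPT]: [tq] {TextI}; [ta] {TextR}; [ua] {SpeechR}<eoa>."
)

def build_com_sample(speech_i: str, text_i: str, text_r: str, speech_r: str) -> str:
    """Fill the Speech Instruction -> Speech Response chain-of-modality template."""
    return COM_TEMPLATE.format(SpeechI=speech_i, TextI=text_i, TextR=text_r, SpeechR=speech_r)

print(build_com_sample("<sosp><661><588><eosp>", "What is TTS?",
                       "TTS converts text into speech.", "<sosp><101><202><eosp>"))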
GGUF is a quantization format from the llama.cpp ecosystem, a C++ re-implementation of LLM inference that supports many LLMs such as the LLaMA series and Falcon. It lets users run an LLM on the CPU while offloading some of its layers to the GPU for a speedup. Although CPU inference is usually slower than GPU inference, it is an excellent format for people running models on CPUs or Apple devices. Especially now that smaller yet more capable models such as Mistral 7B are appearing, GGUF may well become a common format. It offers different quantization levels from 2-bit to 8-bit precision: we can take the original LLaMA model, convert it to the GGUF format, and finally quantize the GGUF file to a lower precision.
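As a hedged illustration of running such a quantized file, the sketch below assumes the third-party llama-cpp-python bindings and a locally downloaded GGUF model; the path, layer count, and prompt are placeholders:

from llama_cpp import Llama

# Load a GGUF model on the CPU and offload some of its layers to the GPU.
llm = Llama(
    model_path="./mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=2048,        # context window
    n_gpu_layers=20,   # layers offloaded to the GPU; 0 = pure CPU
)

out = llm("Q: What does GGUF offer over full-precision weights?\nA:", max_tokens=64)
print(out["choices"][0]["text"])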
Image & Text: we use image-text pairs from LAION-2B (Schuhmann et al., 2022), LAION-COCO (lai, 2022b), LAION-Aesthetics (lai, 2022a), and JourneyDB (Pan et al., 2023). LAION-2B provides images paired with noisy alt-text from the web, while LAION-COCO is a 600M subset of it captioned by BLIP. We refine these datasets by filtering on text quality, image aspect ratio, CLIP score, and so on, yielding a high-quality corpus of 300 million pairs. To improve overall image-generation fidelity, we supplement our data with the high-quality LAION-Aesthetics subset and JourneyDB, a synthetic dataset from Midjourney.
Speech & Text: we collect several large-scale English automatic speech recognition (ASR) datasets: Gigaspeech (Chen et al., 2021), Common Voice (Ardila et al., 2020), and Multilingual LibriSpeech (MLS) (Pratap et al., 2020). Together they form a 57,000-hour speech-text pair corpus covering a wide variety of speakers, domains, and recording conditions.
import torch
import torchaudio
import ChatTTS

# Load the pretrained ChatTTS models once before inference.
chat = ChatTTS.Chat()
chat.load(compile=False)  # set compile=True for better performance

texts = ["PUT YOUR 1st TEXT HERE", "PUT YOUR 2nd TEXT HERE"]

###################################
# Sample a speaker from Gaussian.
rand_spk = chat.sample_random_speaker()
print(rand_spk)  # save it for later timbre recovery

params_infer_code = ChatTTS.Chat.InferCodeParams(
    spk_emb=rand_spk,   # add sampled speaker
    temperature=.3,     # using custom temperature
    top_P=0.7,          # top P decode
    top_K=20,           # top K decode
)
###################################
# For sentence level manual control.
# use oral_(0-9), laugh_(0-2), break_(0-7)
# to generate special token in text to synthesize.
params_refine_text = ChatTTS.Chat.RefineTextParams(
    prompt='[oral_2][laugh_0][break_6]',
)
wavs = chat.infer(
    texts,
    params_refine_text=params_refine_text,
    params_infer_code=params_infer_code,
)
###################################
# For word level manual control.
text = 'What is [uv_break]your favorite english food?[laugh][lbreak]'
wavs = chat.infer(
    text,
    skip_refine_text=True,
    params_refine_text=params_refine_text,
    params_infer_code=params_infer_code,
)
torchaudio.save("output2.wav", torch.from_numpy(wavs[0]), 24000)
S and E denote the start and end of the sequence, respectively, and T is the "turn of speech" token. 𝐯 is a speaker embedding vector extracted from the speech X with a pre-trained voice-print model. The text encodings Ȳ = {𝐲̄_u}_{u∈[1:U]} are obtained by passing the text through a Byte Pair Encoding (BPE) tokenizer and a text encoder: