Similarly, several other projects have been developed jointly by companies and universities. Specifically, this is the case for Stable Diffusion (Runway, Stability AI, and LMU Munich), Soundify (Runway and Carnegie Mellon University), and DreamFusion (Google and UC Berkeley).
INTRODUCTION
In recent years, AI-generated content (AIGC) has attracted widespread attention well beyond the computer science community, and society at large has become interested in the content-generation products built by large technology companies [3], such as ChatGPT [4] and DALL-E 2 [5]. AIGC refers to content produced by advanced generative AI (GAI) techniques rather than created by human authors; it can automatically produce large amounts of content in a short time. For example, ChatGPT, a language model developed by OpenAI for building conversational AI systems, can effectively understand human language input and respond to it in a meaningful way. DALL-E 2, another state-of-the-art GAI model developed by OpenAI, can create unique, high-quality images from a text description within minutes, such as "An astronaut riding a horse in a photorealistic style", as shown in the figure below. Because of these impressive results, many people believe AIGC marks a new era of AI that will have a major impact around the world.
Technically, AIGC refers to using GAI algorithms to generate content that satisfies a given human instruction, where the instruction helps teach and guide the model to complete the task. This generation process typically involves two steps: extracting intent information from the human instruction, and generating content according to the extracted intent. However, as previous studies [6, 7] have shown, a GAI paradigm consisting of these two steps is not entirely new; for text-to-image applications, for instance, earlier models such as StackGAN were already doing this kind of work.
DreamFusion is a text-to-3D model developed by Google Research that uses a pretrained 2D text-to-image diffusion model to perform text-to-3D synthesis [24]. In particular, DreamFusion replaces the earlier CLIP-based technique with a loss distilled from a 2D diffusion model. Specifically, the diffusion model can be used as a loss within a general continuous optimization problem to generate samples. Crucially, sampling in parameter space is much harder than sampling in pixel space: whereas other approaches focus on sampling pixels directly, DreamFusion optimizes the parameters of a 3D model so that its renderings from random viewpoints look like good images. To make this possible, the model uses a differentiable generator.
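To make the "diffusion model as a loss" idea concrete, below is a minimal, self-contained PyTorch sketch of score-distillation-style optimization: a frozen network scores noisy renderings, and its residual is used as a gradient signal for the parameters of a differentiable generator. The renderer, the frozen "denoiser", the noise level, and all sizes are toy placeholders chosen for illustration; this is not DreamFusion's NeRF renderer or its pretrained diffusion model.

import torch
import torch.nn as nn

class ToyDifferentiableRenderer(nn.Module):
    """Stands in for a NeRF-style renderer: learnable parameters -> image."""
    def __init__(self, image_size=32):
        super().__init__()
        # The "3D scene" is just a small learnable code decoded into an image.
        self.scene = nn.Parameter(torch.randn(8))
        self.decode = nn.Linear(8, 3 * image_size * image_size)
        self.image_size = image_size

    def forward(self):
        img = torch.tanh(self.decode(self.scene))
        return img.view(1, 3, self.image_size, self.image_size)

# A frozen toy network standing in for the pretrained 2D diffusion model.
frozen_denoiser = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))
for p in frozen_denoiser.parameters():
    p.requires_grad_(False)

renderer = ToyDifferentiableRenderer()
opt = torch.optim.Adam(renderer.parameters(), lr=1e-2)

for step in range(100):
    rendered = renderer()               # "render the scene from some viewpoint"
    noise = torch.randn_like(rendered)
    noisy = rendered + 0.5 * noise      # perturb the rendering with noise
    pred = frozen_denoiser(noisy)       # frozen model's prediction on the noisy image
    # Score-distillation-style surrogate: the (detached) residual acts as the
    # gradient direction applied to the rendering, which backpropagates only
    # into the renderer's parameters, not the frozen model.
    loss = ((pred - noise).detach() * rendered).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()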
This model, developed by Google Research, can perform realistic video synthesis given a sequence of text prompts [34]. Most interestingly, the model's API can be accessed from GitHub. In particular, Phenaki is the first model that can generate videos from open-domain, time-variable prompts. To address the data problem, it is trained jointly on a large image-text dataset and a smaller number of video-text examples, which yields generalization beyond what is available in video datasets alone. This is mainly because image-text datasets contain billions of examples, whereas text-video datasets are much smaller. A further limitation comes from the computational cost of handling videos of variable length.
A collaborative language model developed by Meta AI Research, trained on edit histories so that it covers the entire writing process [29]. It is based on four steps: Plan, Edit, Explain, and Repeat. These steps are repeated until the text reaches a satisfactory state that needs no further updates. The model makes it possible to decompose the task of writing a paper into many easier subtasks. In addition, it allows a human to intervene at any time and steer the model in any direction.
(Figure 14. The training steps of ChatGPT, combining supervised learning with reinforcement learning.)
A model developed by Meta AI to help people who cannot communicate through speech, typing, or gestures [11]. Previous techniques relied on invasive brain-recording technologies that require neurosurgical intervention. This model instead attempts to decode speech directly from non-invasive brain recordings, which would provide a safer, more scalable solution that could benefit many more people. The challenges of the proposed approach come from the noise and variability in each person's brain, as well as from where the sensors are placed. A deep learning model is trained with contrastive learning and used to maximally align the non-invasive brain recordings with speech. A self-supervised learning model called wav2vec 2.0 is used to identify the complex representations of speech in the brains of volunteers listening to audiobooks. The two non-invasive techniques used to measure neuronal activity are electroencephalography (EEG) and magnetoencephalography (MEG).
The training data come from four open-source datasets, representing 150 hours of recordings of 169 volunteers listening to audiobooks. The EEG and MEG recordings, which come from individual brains, are fed into a "brain" model consisting of a standard deep convolutional network with residual connections. The overall system thus has both a speech model for the sound and a brain model for the MEG data. The results show that several components of the algorithm benefit decoding performance, and the analysis shows that the algorithm improves as the amount of EEG and MEG data increases. This research demonstrates that self-supervised-trained AI can decode perceived speech despite the noise and variability in the data. Its biggest limitation is that it focuses on speech perception, whereas the ultimate goal is to extend the work to speech production.
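To illustrate the contrastive-alignment idea, here is a minimal PyTorch sketch that trains two small encoders so that matching brain-recording/speech pairs score higher than mismatched ones within a batch (an InfoNCE-style objective). The encoder architectures, feature dimensions, temperature, and random batch are placeholders, not the actual Meta AI model or wav2vec 2.0 features.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy encoders: one for brain-recording windows, one for speech features.
brain_encoder = nn.Sequential(nn.Linear(273, 256), nn.GELU(), nn.Linear(256, 128))
speech_encoder = nn.Sequential(nn.Linear(512, 256), nn.GELU(), nn.Linear(256, 128))
opt = torch.optim.Adam(
    list(brain_encoder.parameters()) + list(speech_encoder.parameters()), lr=1e-4)

def contrastive_step(brain_batch, speech_batch, temperature=0.07):
    # Project both modalities into a shared space and L2-normalize.
    b = F.normalize(brain_encoder(brain_batch), dim=-1)
    s = F.normalize(speech_encoder(speech_batch), dim=-1)
    logits = b @ s.t() / temperature      # similarity of every brain/speech pair
    targets = torch.arange(len(b))        # the i-th recording matches the i-th segment
    loss = F.cross_entropy(logits, targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Fake batch: 16 recording windows (273 placeholder channels) paired with
# 16 speech feature vectors (512 placeholder dimensions).
print(contrastive_step(torch.randn(16, 273), torch.randn(16, 512)))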
Galactica is a new large model for automatically organizing science, developed by Meta AI and Papers with Code. A key advantage of the model is that it can be trained for multiple epochs without overfitting, with both upstream and downstream performance improving through the use of repeated tokens. Dataset design is critical to the approach: all data are processed into a common markdown format to blend knowledge from different sources. Citations are handled with a dedicated token, which allows researchers to predict a citation given any input context. The model's ability to predict citations improves with scale, and the model also becomes better at modeling the distribution of citations. In addition, the model can perform multimodal tasks involving SMILES chemical formulas and protein sequences. Specifically, Galactica uses a Transformer architecture in a decoder-only setup, with GeLU activations for all model sizes.
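As a concrete reference point for this family of architectures, here is a minimal PyTorch sketch of a decoder-only Transformer language model with GeLU activations and a causal mask. The vocabulary size, width, and depth are arbitrary placeholders; this is an illustrative sketch, not the Galactica implementation.

import torch
import torch.nn as nn

class TinyDecoderOnlyLM(nn.Module):
    def __init__(self, vocab_size=1000, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            activation="gelu", batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # Causal mask so each position only attends to earlier tokens,
        # which is what makes this a decoder-only (autoregressive) model.
        seq_len = tokens.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.blocks(self.embed(tokens), mask=mask)
        return self.lm_head(h)

model = TinyDecoderOnlyLM()
logits = model(torch.randint(0, 1000, (2, 16)))   # (batch=2, seq=16, vocab)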
Another major improvement is distributed training. In traditional machine learning, training is typically performed on a single machine with a single processor. This works for small datasets and models, but it becomes impractical for large datasets and complex models. In distributed training, the workload is split across multiple processors or machines, allowing models to be trained much faster. Several companies have also released frameworks that simplify distributed training for deep learning stacks [53-55]. These frameworks provide tools and APIs that let developers easily distribute their training workloads across multiple processors or machines without having to manage the underlying infrastructure.
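As one concrete example of what such frameworks automate, here is a minimal sketch of data-parallel training using PyTorch's built-in DistributedDataParallel. The model, data, and hyperparameters are placeholders, and the launch details (e.g. via torchrun) are assumptions about the setup rather than a description of the frameworks cited above.

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train():
    # Assumes the script was launched with `torchrun`, which sets RANK,
    # WORLD_SIZE, and MASTER_ADDR/MASTER_PORT in the environment.
    dist.init_process_group(backend="gloo")
    model = DDP(nn.Linear(10, 1))          # gradients are synchronized automatically
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    for step in range(100):
        # In a real job, each rank would load its own shard of the dataset.
        x, y = torch.randn(32, 10), torch.randn(32, 1)
        loss = loss_fn(model(x), y)
        opt.zero_grad()
        loss.backward()                    # the all-reduce of gradients happens here
        opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    train()   # e.g. `torchrun --nproc_per_node=4 this_script.py`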
Text-to-image generation refers to generating an image that corresponds to a given instruction. The models commonly used for image generation likewise follow an encoder-decoder architecture, in which the encoder focuses on learning the language information and the decoder focuses on using that learned information to constrain image synthesis. In general, recent work can be divided into two categories: GAN-based methods and diffusion-based methods. (Before diffusion models, most work in this area was based on GANs and VAEs, and the GAN literature in particular is vast.)
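A minimal sketch of this encoder-decoder pattern is shown below: a toy text encoder summarizes the instruction into a conditioning vector, and a toy convolutional decoder maps that vector to pixels. Both networks and all sizes are illustrative stand-ins, not StackGAN or any specific published model.

import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        _, h = self.rnn(self.embed(token_ids))
        return h[-1]                       # (batch, hidden_dim) summary of the prompt

class ImageDecoder(nn.Module):
    def __init__(self, cond_dim=256):
        super().__init__()
        self.fc = nn.Linear(cond_dim, 128 * 8 * 8)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, cond):
        x = self.fc(cond).view(-1, 128, 8, 8)
        return self.deconv(x)              # (batch, 3, 32, 32) generated image

text_encoder, image_decoder = TextEncoder(), ImageDecoder()
tokens = torch.randint(0, 5000, (2, 12))   # a batch of 2 tokenized prompts
images = image_decoder(text_encoder(tokens))   # condition image synthesis on the text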
In recent years, a tremendous amount of progress has been made in the field of 3D Machine Learning, an interdisciplinary field that fuses computer vision, computer graphics, and machine learning. This repo is derived from my study notes and will be used as a place for triaging new research papers.
I’ll use the following icons to differentiate 3D representations:
📷 Multi-view Images
👾 Volumetric
🎲 Point Cloud
💎 Polygonal Mesh
💊 Primitive-based
To find related papers and their relationships, check out Connected Papers, which provides a neat way to visualize the academic field in a graph representation.
Get Involved
To contribute to this Repo, you may add content through pull requests or open an issue to let me know.
⭐ ⭐ ⭐ ⭐ ⭐ ⭐ ⭐ ⭐ ⭐ ⭐ ⭐ ⭐ We have also created a Slack workspace for people around the globe to ask questions, share knowledge and facilitate collaborations. Together, I’m sure we can advance this field as a collaborative effort. Join the community with this link. ⭐ ⭐ ⭐ ⭐ ⭐ ⭐ ⭐ ⭐ ⭐ ⭐ ⭐ ⭐
[ECCV 2020] DTVNet: Dynamic Time-lapse Video Generation via Single Still Image [paper][code]
[SIGGRAPH Asia 2019] Animating Landscape: Self-Supervised Learning of Decoupled Motion and Appearance for Single-Image Video Synthesis [paper][code][project page]
[CVPR 2018] Learning to Generate Time-lapse Videos Using Multi-stage Dynamic Generative Adversarial Networks [paper][code][project page]
Some Other Papers
Some other interesting papers on novel view synthesis or cinemagraphs.
[arXiv 2022] Make-A-Video: Text-to-Video Generation without Text-Video Data [paper][project page]
[ECCV 2022] SinNeRF: Training Neural Radiance Fields on Complex Scenes from a Single Image [paper][code][project page] 🚕
[CVPR 2022] Look Outside the Room: Synthesizing A Consistent Long-Term 3D Scene Video from A Single Image [paper][code][project page]
Torchaudio is a library for audio and signal processing with PyTorch. It provides I/O, signal and data processing functions, datasets, model implementations and application components.
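For example, a typical torchaudio workflow looks like the sketch below: load a file, resample it, and compute a mel spectrogram. The file path is a placeholder.

import torchaudio
import torchaudio.transforms as T

# Load a waveform; returns a (channels, samples) tensor and the sample rate.
waveform, sample_rate = torchaudio.load("example.wav")

# Resample to 16 kHz and compute a mel spectrogram.
resampled = T.Resample(orig_freq=sample_rate, new_freq=16000)(waveform)
mel = T.MelSpectrogram(sample_rate=16000)(resampled)

print(waveform.shape, mel.shape)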
Recent breakthroughs in generative modeling of images have been predicated on the availability of high-quality and large-scale datasets such as MNIST, CIFAR and ImageNet. We recognized the need for an audio dataset that was as approachable as those in the image domain.
Audio signals found in the wild contain multi-scale dependencies that prove particularly difficult to model, leading many previous efforts at data-driven audio synthesis to focus on more constrained domains such as texture synthesis or training small parametric models.
We encourage the broader community to use NSynth as a benchmark and entry point into audio machine learning. We also view NSynth as a building block for future datasets and envision a high-quality multi-note dataset for tasks like generation and transcription that involve learning complex language-like dependencies.
Description
NSynth is an audio dataset containing 305,979 musical notes, each with a unique pitch, timbre, and envelope. For 1,006 instruments from commercial sample libraries, we generated four-second, monophonic 16 kHz audio snippets, referred to as notes, by ranging over every pitch of a standard MIDI piano (21-108) as well as five different velocities (25, 50, 75, 100, 127). The note was held for the first three seconds and allowed to decay for the final second.
Some instruments are not capable of producing all 88 pitches in this range, resulting in an average of 65.4 pitches per instrument. Furthermore, the commercial sample packs occasionally contain duplicate sounds across multiple velocities, leaving an average of 4.75 unique velocities per pitch.
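As a small illustration of how the notes are parameterized, the sketch below enumerates the pitch/velocity grid described above. The instrument identifier and the zero-padded formatting are illustrative assumptions about note_str, not a specification of the dataset.

PITCHES = range(21, 109)               # standard 88-key MIDI piano range, 21-108
VELOCITIES = [25, 50, 75, 100, 127]    # the five sampled velocities

notes = [
    f"{instrument}-{pitch:03d}-{velocity:03d}"
    for instrument in ["keyboard_acoustic_000"]   # hypothetical instrument_str
    for pitch in PITCHES
    for velocity in VELOCITIES
]
print(len(notes))   # 88 pitches * 5 velocities = 440 candidate notes per instrument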
We also annotated each of the notes with three additional pieces of information based on a combination of human evaluation and heuristic algorithms:
Source: The method of sound production for the note’s instrument. This can be one of acoustic or electronic for instruments that were recorded from acoustic or electronic instruments, respectively, or synthetic for synthesized instruments. See their frequencies below.
Family: The high-level family of which the note’s instrument is a member. Each instrument is a member of exactly one family. See the complete list and their frequencies below.
Qualities: Sonic qualities of the note. See the quality descriptions and their co-occurrences below. Each note is annotated with zero or more qualities.
Format
Files
The NSynth dataset can be downloaded in two formats:
Train [tfrecord | json/wav]: A training set with 289,205 examples. Instruments do not overlap with valid or test.
Valid [tfrecord | json/wav]: A validation set with 12,678 examples. Instruments do not overlap with train.
Test [tfrecord | json/wav]: A test set with 4,096 examples. Instruments do not overlap with train.
Below we detail how the note features are encoded in the Example protocol buffers and JSON files.
Example Features
Each Example contains the following features.
| Feature | Type | Description |
| --- | --- | --- |
| note | int64 | A unique integer identifier for the note. |
| note_str | bytes | A unique string identifier for the note in the format <instrument_str>-<pitch>-<velocity>. |
| instrument | int64 | A unique, sequential identifier for the instrument the note was synthesized from. |
| instrument_str | bytes | A unique string identifier for the instrument this note was synthesized from in the format <instrument_family_str>-<instrument_production_str>-<instrument_name>. |
| pitch | int64 | The 0-based MIDI pitch in the range [0, 127]. |
| velocity | int64 | The 0-based MIDI velocity in the range [0, 127]. |
| sample_rate | int64 | The samples per second for the audio feature. |
| audio* | [float] | A list of audio samples represented as floating point values in the range [-1, 1]. |
| qualities | [int64] | A binary vector representing which sonic qualities are present in this note. |
| qualities_str | [bytes] | A list of IDs of which qualities are present in this note, selected from the sonic qualities list. |
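As a rough sketch of how these features might be read from the json/wav download, the snippet below assumes an examples.json file keyed by note_str plus one wav file per note; that on-disk layout is an assumption made for illustration, not something guaranteed by the description above.

import json

with open("nsynth-test/examples.json") as f:
    examples = json.load(f)

# Print a few notes with the features documented in the table above.
for note_str, features in list(examples.items())[:3]:
    print(note_str,
          "pitch:", features["pitch"],
          "velocity:", features["velocity"],
          "qualities:", features["qualities_str"])
    # The raw audio would then live in e.g. f"nsynth-test/audio/{note_str}.wav".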
import time
import os

def long_time_task():
    # A placeholder task: report the current PID, sleep, then compute something.
    print('Current process: {}'.format(os.getpid()))
    time.sleep(2)
    print("Result: {}".format(8 ** 20))

if __name__ == "__main__":
    print('Current parent process: {}'.format(os.getpid()))
    start = time.time()
    # Run the task twice sequentially; total time is roughly 2 * 2 seconds.
    for i in range(2):
        long_time_task()
    end = time.time()
    print("Elapsed {} seconds".format(end - start))
from multiprocessing import Pool, cpu_count
import os
import time

def long_time_task(i):
    # Task i: report the worker PID, sleep, then compute something.
    print('Child process: {} - task {}'.format(os.getpid(), i))
    time.sleep(2)
    print("Result: {}".format(8 ** 20))

if __name__ == '__main__':
    print("Number of CPU cores: {}".format(cpu_count()))
    print('Current parent process: {}'.format(os.getpid()))
    start = time.time()
    p = Pool(4)
    # Submit 5 tasks to a pool of 4 workers; the 5th runs once a worker frees up.
    for i in range(5):
        p.apply_async(long_time_task, args=(i,))
    print('Waiting for all child processes to finish.')
    p.close()    # no more tasks will be submitted
    p.join()     # block until all submitted tasks complete
    end = time.time()
    print("Total elapsed {} seconds".format(end - start))
from multiprocessing import Process, Queue
import os, time, random

# Code executed by the writer process:
def write(q):
    print('Process to write: {}'.format(os.getpid()))
    for value in ['A', 'B', 'C']:
        print('Put %s to queue...' % value)
        q.put(value)
        time.sleep(random.random())

# Code executed by the reader process:
def read(q):
    print('Process to read: {}'.format(os.getpid()))
    while True:
        value = q.get(True)   # block until an item is available
        print('Get %s from queue.' % value)

if __name__ == '__main__':
    # The parent process creates the Queue and passes it to both children:
    q = Queue()
    pw = Process(target=write, args=(q,))
    pr = Process(target=read, args=(q,))
    # Start the writer process pw:
    pw.start()
    # Start the reader process pr:
    pr.start()
    # Wait for pw to finish:
    pw.join()
    # pr runs an infinite loop and cannot be joined; terminate it instead:
    pr.terminate()
The output is as follows:
Process to write: 3036
Put A to queue...
Process to read:9408
Get A from queue.
Put B to queue...
Get B from queue.
Put C to queue...
Get C from queue.
import threading
import time

def long_time_task(i):
    # Task i: report the current thread name, sleep, then compute something.
    print('Current child thread: {} - task {}'.format(threading.current_thread().name, i))
    time.sleep(2)
    print("Result: {}".format(8 ** 20))

if __name__ == '__main__':
    start = time.time()
    print('This is the main thread: {}'.format(threading.current_thread().name))
    thread_list = []
    for i in range(1, 3):
        t = threading.Thread(target=long_time_task, args=(i,))
        thread_list.append(t)
    for t in thread_list:
        t.start()
    for t in thread_list:
        t.join()   # wait for both worker threads to finish
    end = time.time()
    print("Total elapsed {} seconds".format(end - start))
import threading
import time

def long_time_task():
    # Report the current thread name, sleep, then compute something.
    print('Current child thread: {}'.format(threading.current_thread().name))
    time.sleep(2)
    print("Result: {}".format(8 ** 20))

if __name__ == '__main__':
    start = time.time()
    print('This is the main thread: {}'.format(threading.current_thread().name))
    for i in range(5):
        t = threading.Thread(target=long_time_task, args=())
        t.daemon = True   # daemon threads are killed when the main thread exits
        t.start()
    end = time.time()
    print("Total elapsed {} seconds".format(end - start))