MAE: Transformer Model Pre-training

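Suppose we want to recognize different kinds of chairs in images and then recommend purchase links to users. One possible approach is to identify 100 common chair types, photograph each from 1,000 different angles, and train a classification model on the collected images. Although this chair dataset may be larger than Fashion-MNIST, it still has fewer than one tenth as many examples as ImageNet. As a result, a complex model suited to ImageNet may overfit on the chair dataset. And because the data are limited, the accuracy of the trained model may still fall short of practical requirements.

An obvious fix is to collect more data. However, collecting and labeling data takes a great deal of time and money. For example, researchers spent millions of dollars in research funding to collect the ImageNet dataset. Data collection is much cheaper today, but the cost is still not negligible.

Another solution is transfer learning: transferring the knowledge learned on a source dataset to a target dataset. For example, although most ImageNet images have nothing to do with chairs, a model trained on ImageNet can extract fairly general image features that help identify edges, textures, shapes, and object composition. Such features may be just as useful for recognizing chairs.

This section introduces a common transfer learning technique: fine-tuning. As shown in Figure 9.1, fine-tuning consists of the following 4 steps.

1. Pre-train a neural network model, the source model, on the source dataset (for example ImageNet).
2. Create a new neural network model, the target model. It copies the whole design and all parameters of the source model except the output layer. We assume that these parameters contain the knowledge learned on the source dataset and that this knowledge also applies to the target dataset. We also assume that the output layer of the source model is tied closely to the labels of the source dataset, so it is not used in the target model.
3. Add to the target model an output layer whose output size is the number of classes in the target dataset, and initialize its parameters randomly.
4. Train the target model on the target dataset (for example the chair dataset). The output layer is trained from scratch, while the parameters of all other layers are fine-tuned starting from the parameters of the source model.

When the target dataset is much smaller than the source dataset, fine-tuning helps improve the model's generalization ability.

Fine-tuning in code: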

import torch
from torchvision import models

pretrained_net = models.resnet18(pretrained=True)
pretrained_net.load_state_dict(torch.load('/home/kesci/input/resnet185352/resnet18-5c106cde.pth'))

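Next we print the member variable fc of the source model. It is a fully connected layer that maps the output of ResNet's final global average pooling layer to the 1000 classes of ImageNet.

print(pretrained_net.fc)

Output: Linear(in_features=512, out_features=1000, bias=True)

The number of outputs of pretrained_net therefore equals the number of classes in the source dataset (ImageNet), 1000. We need to replace the final fc layer with one whose output size is the number of classes we need: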

from torch import nn

pretrained_net.fc = nn.Linear(512, 2)
print(pretrained_net.fc)

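The fc layer of pretrained_net is now randomly initialized, while all other layers keep their pre-trained parameters. Because those parameters were pre-trained on the very large ImageNet dataset, they are already good, so a small learning rate is usually enough to fine-tune them. The randomly initialized parameters in fc usually need a larger learning rate because they are trained from scratch. PyTorch makes it easy to set different learning parameters for different parts of a model; in the code below, the learning rate of fc is 10 times that of the pre-trained part.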

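from torch import optim

# ids of the parameters in the new output layer
output_params = list(map(id, pretrained_net.fc.parameters()))
# all remaining (pre-trained) parameters
feature_params = filter(lambda p: id(p) not in output_params, pretrained_net.parameters())

lr = 0.01
# pre-trained layers use lr; the freshly initialized fc layer uses lr * 10
optimizer = optim.SGD([{'params': feature_params},
                       {'params': pretrained_net.fc.parameters(), 'lr': lr * 10}],
                       lr=lr, weight_decay=0.001)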
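
To check that per-group learning rates behave as described, here is a minimal sketch on a tiny stand-in network. The two-layer `net` and its layer names `backbone` and `fc` are assumptions for illustration only; they replace the real ResNet-18 so that the example runs without downloading weights.

```python
import torch
from torch import nn, optim

# Hypothetical stand-in for a pre-trained network: a "backbone" plus a new head
net = nn.Sequential()
net.add_module("backbone", nn.Linear(8, 4))
net.add_module("fc", nn.Linear(4, 2))

# Split parameters into "new head" and "everything else" by object identity
head_ids = {id(p) for p in net.fc.parameters()}
backbone_params = [p for p in net.parameters() if id(p) not in head_ids]

lr = 0.01
optimizer = optim.SGD(
    [{"params": backbone_params},                      # falls back to the default lr
     {"params": net.fc.parameters(), "lr": lr * 10}],  # overrides lr for the head
    lr=lr, weight_decay=0.001)

# Inspect what the optimizer actually stored for each group
group_lrs = [g["lr"] for g in optimizer.param_groups]
group_sizes = [len(g["params"]) for g in optimizer.param_groups]
print(group_lrs, group_sizes)
```

Options given inside a group dictionary override the optimizer-level defaults, so the first group inherits `lr` and `weight_decay` while the second only changes `lr`.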

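Note: MAE provides two ways of adapting the pre-trained model:

Linear probing: freeze the transformer parameters and train only the Linear layer for CIFAR10.
Fine-tuning: keep training the transformer parameters and train the CIFAR10 Linear layer at the same time.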
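
The difference between the two modes comes down to which parameters receive gradients. The sketch below illustrates this with a tiny stand-in; `encoder`, `head`, and the helper `trainable` are hypothetical names for illustration, not the actual MAE code.

```python
import torch
from torch import nn

# Hypothetical tiny stand-ins for the pre-trained encoder and the CIFAR10 head
encoder = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
head = nn.Linear(16, 10)

def trainable(module):
    """Count the parameters of `module` that will receive gradients."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Linear probing: freeze the encoder and optimize only the head
for p in encoder.parameters():
    p.requires_grad_(False)
probe_optimizer = torch.optim.SGD(head.parameters(), lr=0.1)
linear_probe_trainable = trainable(encoder) + trainable(head)

# Fine-tuning: unfreeze the encoder and optimize everything
for p in encoder.parameters():
    p.requires_grad_(True)
finetune_optimizer = torch.optim.SGD(
    list(encoder.parameters()) + list(head.parameters()), lr=0.01)
finetune_trainable = trainable(encoder) + trainable(head)

print(linear_probe_trainable, finetune_trainable)
```

The learning rates here are arbitrary placeholders; only the frozen versus trainable split is the point of the example.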

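The paper runs ablation experiments over different settings of each MAE component, and these reveal more of MAE's properties. The first is the masking ratio. As the paper's masking-ratio figure shows, the best setting is a 75% masking ratio, at which both linear probing and fine-tuning perform best. This is much higher than in earlier work; BEiT, for example, uses a masking ratio of 40%. Linear probing and fine-tuning also behave differently: linear probing accuracy rises steadily with the masking ratio until it peaks and then drops, whereas fine-tuning is much less sensitive to the masking ratio and performs well anywhere in the 40%~80% range.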
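
To make the masking ratio concrete, here is a minimal sketch of per-sample random masking: shuffle the patch indices with random noise and keep the first 25%. The function name `random_masking` and the tensor sizes are assumptions for illustration; it follows the general idea of MAE-style masking and is not presented as the paper's exact code.

```python
import torch

def random_masking(x, mask_ratio=0.75):
    """Keep a random subset of patches per sample.

    x: (N, L, D) patch embeddings. Returns the kept patches and a binary
    mask of shape (N, L) where 1 marks a masked (removed) patch.
    """
    N, L, D = x.shape
    len_keep = int(L * (1 - mask_ratio))

    noise = torch.rand(N, L)                   # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)  # random permutation per sample
    ids_keep = ids_shuffle[:, :len_keep]       # indices of the visible patches

    x_kept = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(N, L)
    mask.scatter_(1, ids_keep, 0.0)            # 0 = visible, 1 = masked
    return x_kept, mask

x = torch.randn(2, 196, 8)                     # e.g. 14 x 14 patches, toy dim 8
x_kept, mask = random_masking(x, mask_ratio=0.75)
print(x_kept.shape, mask.sum(dim=1))
```

With 196 patches and a 75% ratio, only 49 patches reach the encoder, which is where much of MAE's pre-training speed-up comes from.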

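Understanding torch.meshgrid()

I have recently seen this function in the code of many papers (YOLOv3 and the recently popular Swin Transformer), so here is a note on how to use it: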

https://pytorch.org/docs/stable/generated/torch.meshgrid.html

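Description:

torch.meshgrid() generates a grid and can be used to build coordinates.

Input:

Two 1-D tensors of the same dtype (the case covered in these notes).

Output:

Two tensors. With indexing='ij', each has as many rows as the first input has elements and as many columns as the second input has elements.

Notes:

1) An error is raised if the two input tensors have different dtypes or have more than one dimension (0-D scalars are treated as 1-D tensors of size 1).

2) The first output tensor is filled with the elements of the first input, with all elements in a given row identical; the second output tensor is filled with the elements of the second input, with all elements in a given column identical.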

>>> x = torch.tensor([1, 2, 3])
>>> y = torch.tensor([4, 5, 6])

Observe the element-wise pairings across the grid, (1, 4),
(1, 5), ..., (3, 6). This is the same thing as the
cartesian product.
>>> grid_x, grid_y = torch.meshgrid(x, y, indexing='ij')
>>> grid_x
tensor([[1, 1, 1],
        [2, 2, 2],
        [3, 3, 3]])
>>> grid_y
tensor([[4, 5, 6],
        [4, 5, 6],
        [4, 5, 6]])

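# [1]
import torch
a = torch.tensor([1, 2, 3, 4])
print(a)
b = torch.tensor([4, 5, 6])
print(b)
# indexing='ij' (PyTorch >= 1.10) states the row/column layout explicitly
# and avoids the deprecation warning raised when it is omitted
x, y = torch.meshgrid(a, b, indexing='ij')
print(x)
print(y)
 
Output: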
tensor([1, 2, 3, 4])
tensor([4, 5, 6])
tensor([[1, 1, 1],
        [2, 2, 2],
        [3, 3, 3],
        [4, 4, 4]])
tensor([[4, 5, 6],
        [4, 5, 6],
        [4, 5, 6],
        [4, 5, 6]])
 
 
 
# [2]
import torch
a = torch.tensor([1, 2, 3, 4, 5, 6])
print(a)
b = torch.tensor([7, 8, 9, 10])
print(b)
# indexing='ij' (PyTorch >= 1.10), as in example [1]
x, y = torch.meshgrid(a, b, indexing='ij')
print(x)
print(y)
 
Output:
tensor([1, 2, 3, 4, 5, 6])
tensor([ 7,  8,  9, 10])
tensor([[1, 1, 1, 1],
        [2, 2, 2, 2],
        [3, 3, 3, 3],
        [4, 4, 4, 4],
        [5, 5, 5, 5],
        [6, 6, 6, 6]])
tensor([[ 7,  8,  9, 10],
        [ 7,  8,  9, 10],
        [ 7,  8,  9, 10],
        [ 7,  8,  9, 10],
        [ 7,  8,  9, 10],
        [ 7,  8,  9, 10]])
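
A common use of the two grids is to stack them into per-position (row, column) coordinates, in the spirit of how detection and window-attention code builds coordinate tables. The sketch below assumes PyTorch 1.10 or later for the `indexing` argument; the names `H`, `W`, `coords`, and `flat` are illustrative only.

```python
import torch

H, W = 2, 3
# ys holds the row index of every cell, xs the column index
ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")

# Stack into one (H, W, 2) tensor of (row, col) pairs
coords = torch.stack([ys, xs], dim=-1)

# Flatten to a list of H*W coordinate pairs in row-major order
flat = coords.reshape(-1, 2).tolist()
print(flat)
```

The flattened list enumerates the Cartesian product of the two index ranges, which is the same pairing that the documentation example describes.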

model.modules(), model.named_modules(), model.children(), model.named_children(), model.parameters(), model.named_parameters(), and model.state_dict() in PyTorch

This note uses one example to examine what the following model instance methods return: model.modules(), model.named_modules(), model.children(), model.named_children(), model.parameters(), model.named_parameters(), and model.state_dict(). The example is as follows:

import torch 
import torch.nn as nn 

class Net(nn.Module):

    def __init__(self, num_class=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=6, kernel_size=3),
            nn.BatchNorm2d(6),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(in_channels=6, out_channels=9, kernel_size=3),
            nn.BatchNorm2d(9),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )

        self.classifier = nn.Sequential(
            nn.Linear(9*8*8, 128),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(128, num_class)
        )

    def forward(self, x):
        output = self.features(x)
        output = output.view(output.size()[0], -1)
        output = self.classifier(output)
    
        return output

model = Net()

The code above defines a network with two convolutional layers and two fully connected layers. Note that, from the outside in, this Net has 3 levels:

Net:

----features:

------------Conv2d
------------BatchNorm2d
------------ReLU
------------MaxPool2d
------------Conv2d
------------BatchNorm2d
------------ReLU
------------MaxPool2d

----classifier:

------------Linear
------------ReLU
------------Dropout
------------Linear

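Net itself is a subclass of nn.Module. It contains features and classifier, two nn.Module instances built from Sequential containers, and each of those in turn contains several layers that are also nn.Module instances. Hence there are 3 levels from the outside in.
Let us now see what each of these instance methods returns.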

In [7]: model.named_modules()                                                                                                       
Out[7]: <generator object Module.named_modules at 0x7f5db88f3840>

In [8]: model.modules()                                                         
Out[8]: <generator object Module.modules at 0x7f5db3f53c00>

In [9]: model.children()                                                        
Out[9]: <generator object Module.children at 0x7f5db3f53408>

In [10]: model.named_children()                                                 
Out[10]: <generator object Module.named_children at 0x7f5db80305e8>

In [11]: model.parameters()                                                     
Out[11]: <generator object Module.parameters at 0x7f5db3f534f8>

In [12]: model.named_parameters()                                               
Out[12]: <generator object Module.named_parameters at 0x7f5d42da7570>

In [13]: model.state_dict()                                                     
Out[13]: 
OrderedDict([('features.0.weight', tensor([[[[ 0.1200, -0.1627, -0.0841],
                        [-0.1369, -0.1525,  0.0541],
                        [ 0.1203,  0.0564,  0.0908]],
                      ……

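As shown, model.state_dict() returns a dictionary (an OrderedDict), while every other method returns a generator, which is an iterable. We use list comprehensions to pull the values out for a closer look: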

In [14]: model_modules = [x for x in model.modules()]                                                                                

In [15]: model_named_modules = [x for x in model.named_modules()]        

In [16]: model_children = [x for x in model.children()]                                                                              

In [17]: model_named_children = [x for x in model.named_children()]                                                                  

In [18]: model_parameters = [x for x in model.parameters()]                                                                          

In [19]: model_named_parameters = [x for x in model.named_parameters()]

1. model.modules()

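model.modules() iterates over every module in the model, that is, every nn.Module instance. In this example, Net, features, classifier, and the nn.xxx layers (convolution, pooling, ReLU, Linear, BN, Dropout, and so on) are all nn.Module instances, so model.modules() visits all of them. Here is the list model_modules: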

In [20]: model_modules                                                                                                               
Out[20]: 
[Net(
   (features): Sequential(
     (0): Conv2d(3, 6, kernel_size=(3, 3), stride=(1, 1))
     (1): BatchNorm2d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     (2): ReLU(inplace)
     (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
     (4): Conv2d(6, 9, kernel_size=(3, 3), stride=(1, 1))
     (5): BatchNorm2d(9, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
     (6): ReLU(inplace)
     (7): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
   )
   (classifier): Sequential(
     (0): Linear(in_features=576, out_features=128, bias=True)
     (1): ReLU(inplace)
     (2): Dropout(p=0.5)
     (3): Linear(in_features=128, out_features=10, bias=True)
   )
 ), 
Sequential(
   (0): Conv2d(3, 6, kernel_size=(3, 3), stride=(1, 1))
   (1): BatchNorm2d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (2): ReLU(inplace)
   (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
   (4): Conv2d(6, 9, kernel_size=(3, 3), stride=(1, 1))
   (5): BatchNorm2d(9, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (6): ReLU(inplace)
   (7): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
 ), 
Conv2d(3, 6, kernel_size=(3, 3), stride=(1, 1)), 
BatchNorm2d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True), 
ReLU(inplace), 
MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False), 
Conv2d(6, 9, kernel_size=(3, 3), stride=(1, 1)), 
BatchNorm2d(9, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True), 
ReLU(inplace), 
MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False), 
Sequential(
   (0): Linear(in_features=576, out_features=128, bias=True)
   (1): ReLU(inplace)
   (2): Dropout(p=0.5)
   (3): Linear(in_features=128, out_features=10, bias=True)
 ), 
Linear(in_features=576, out_features=128, bias=True), 
ReLU(inplace), 
Dropout(p=0.5), 
Linear(in_features=128, out_features=10, bias=True)]

In [21]: len(model_modules)                                                                                                          
Out[21]: 15

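The list model_modules has 15 elements: first the whole Net, then the features container followed by each of its 8 layers, then the classifier container followed by each of its 4 layers. In other words, model.modules() recursively traverses every module in the model, starting with the model itself.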

2. model.named_modules()

As the name suggests, this is model.modules() with names. model.named_modules() returns each module together with its name:

In [28]: len(model_named_modules)                                                                                                    
Out[28]: 15

In [29]: model_named_modules                                                                                                         
Out[29]: 
[('', Net(
    (features): Sequential(
      (0): Conv2d(3, 6, kernel_size=(3, 3), stride=(1, 1))
      (1): BatchNorm2d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU(inplace)
      (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (4): Conv2d(6, 9, kernel_size=(3, 3), stride=(1, 1))
      (5): BatchNorm2d(9, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (6): ReLU(inplace)
      (7): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    )
    (classifier): Sequential(
      (0): Linear(in_features=576, out_features=128, bias=True)
      (1): ReLU(inplace)
      (2): Dropout(p=0.5)
      (3): Linear(in_features=128, out_features=10, bias=True)
    )
  )), 
('features', Sequential(
    (0): Conv2d(3, 6, kernel_size=(3, 3), stride=(1, 1))
    (1): BatchNorm2d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace)
    (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (4): Conv2d(6, 9, kernel_size=(3, 3), stride=(1, 1))
    (5): BatchNorm2d(9, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): ReLU(inplace)
    (7): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )), 
('features.0', Conv2d(3, 6, kernel_size=(3, 3), stride=(1, 1))), 
('features.1', BatchNorm2d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)), ('features.2', ReLU(inplace)), 
('features.3', MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)), 
('features.4', Conv2d(6, 9, kernel_size=(3, 3), stride=(1, 1))), 
('features.5', BatchNorm2d(9, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)), ('features.6', ReLU(inplace)), 
('features.7', MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)), 
('classifier',
  Sequential(
    (0): Linear(in_features=576, out_features=128, bias=True)
    (1): ReLU(inplace)
    (2): Dropout(p=0.5)
    (3): Linear(in_features=128, out_features=10, bias=True)
  )), 
('classifier.0', Linear(in_features=576, out_features=128, bias=True)), 
('classifier.1', ReLU(inplace)), 
('classifier.2', Dropout(p=0.5)), 
('classifier.3', Linear(in_features=128, out_features=10, bias=True))]

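model.named_modules() also yields 15 elements, but each now comes with a name. Apart from features and classifier, which were named in the model definition, the names are generated automatically by PyTorch: the parent name plus the index inside the Sequential container (for example features.0), with the empty string '' for the top-level model. The benefit of getting names along with modules is that specific layers can be selected and modified by name during iteration. If every layer was named in the model definition, for example conv1, conv2, ... for the convolutional layers, we can write:

for name, layer in model.named_modules():
    if 'conv' in name:
        pass  # process the layer here

Without names, isinstance() does the same job:

for layer in model.modules():
    if isinstance(layer, nn.Conv2d):
        pass  # process the layer here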

3. model.children()

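If we split Net into levels from the outside in, features and classifier are the children of Net; Conv2d, ReLU, BatchNorm2d, and MaxPool2d are the children of features; and Linear, Dropout, and ReLU are the children of classifier. model.modules() above traverses not only the model's children but also their children, recursively, down to every module.
model.children(), by contrast, only traverses the immediate children of the model, here features and classifier.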

In [22]: len(model_children)                                                                                                         
Out[22]: 2

In [22]: model_children                                                                                                              
Out[22]: 
[Sequential(
   (0): Conv2d(3, 6, kernel_size=(3, 3), stride=(1, 1))
   (1): BatchNorm2d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (2): ReLU(inplace)
   (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
   (4): Conv2d(6, 9, kernel_size=(3, 3), stride=(1, 1))
   (5): BatchNorm2d(9, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (6): ReLU(inplace)
   (7): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
 ), 
Sequential(
   (0): Linear(in_features=576, out_features=128, bias=True)
   (1): ReLU(inplace)
   (2): Dropout(p=0.5)
   (3): Linear(in_features=128, out_features=10, bias=True)
 )]

It yields only two elements: features and classifier.

4. model.named_children()

model.named_children() is model.children() with names: besides iterating over the model's immediate children, it also returns each child's name:

In [23]: len(model_named_children)                                                                                                   
Out[23]: 2

In [24]: model_named_children                                                                                                        
Out[24]: 
[('features', Sequential(
    (0): Conv2d(3, 6, kernel_size=(3, 3), stride=(1, 1))
    (1): BatchNorm2d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace)
    (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (4): Conv2d(6, 9, kernel_size=(3, 3), stride=(1, 1))
    (5): BatchNorm2d(9, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): ReLU(inplace)
    (7): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )), 
('classifier', Sequential(
    (0): Linear(in_features=576, out_features=128, bias=True)
    (1): ReLU(inplace)
    (2): Dropout(p=0.5)
    (3): Linear(in_features=128, out_features=10, bias=True)
  ))]

Compared with model.children() above, model.named_children() also returns the names of the two children: features and classifier.

5. model.parameters()

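Iterates over all parameters of the model. Here there are 12: a weight and a bias for each of the 2 Conv2d, 2 BatchNorm2d, and 2 Linear layers.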

In [30]: len(model_parameters)                                                                                                       
Out[30]: 12

In [31]: model_parameters                                                                                                            
Out[31]: 
[Parameter containing:
 tensor([[[[ 0.1200, -0.1627, -0.0841],
           [-0.1369, -0.1525,  0.0541],
           [ 0.1203,  0.0564,  0.0908]],
           ……
          [[-0.1587,  0.0735, -0.0066],
           [ 0.0210,  0.0257, -0.0838],
           [-0.1797,  0.0675,  0.1282]]]], requires_grad=True),
 Parameter containing:
 tensor([-0.1251,  0.1673,  0.1241, -0.1876,  0.0683,  0.0346],
        requires_grad=True),
 Parameter containing:
 tensor([0.0072, 0.0272, 0.8620, 0.0633, 0.9411, 0.2971], requires_grad=True),
 Parameter containing:
 tensor([0., 0., 0., 0., 0., 0.], requires_grad=True),
 Parameter containing:
 tensor([[[[ 0.0632, -0.1078, -0.0800],
           [-0.0488,  0.0167,  0.0473],
           [-0.0743,  0.0469, -0.1214]],
           …… 
          [[-0.1067, -0.0851,  0.0498],
           [-0.0695,  0.0380, -0.0289],
           [-0.0700,  0.0969, -0.0557]]]], requires_grad=True),
 Parameter containing:
 tensor([-0.0608,  0.0154,  0.0231,  0.0886, -0.0577,  0.0658, -0.1135, -0.0221,
          0.0991], requires_grad=True),
 Parameter containing:
 tensor([0.2514, 0.1924, 0.9139, 0.8075, 0.6851, 0.4522, 0.5963, 0.8135, 0.4010],
        requires_grad=True),
 Parameter containing:
 tensor([0., 0., 0., 0., 0., 0., 0., 0., 0.], requires_grad=True),
 Parameter containing:
 tensor([[ 0.0223,  0.0079, -0.0332,  ..., -0.0394,  0.0291,  0.0068],
         [ 0.0037, -0.0079,  0.0011,  ..., -0.0277, -0.0273,  0.0009],
         [ 0.0150, -0.0110,  0.0319,  ..., -0.0110, -0.0072, -0.0333],
         ...,
         [-0.0274, -0.0296, -0.0156,  ...,  0.0359, -0.0303, -0.0114],
         [ 0.0222,  0.0243, -0.0115,  ...,  0.0369, -0.0347,  0.0291],
         [ 0.0045,  0.0156,  0.0281,  ..., -0.0348, -0.0370, -0.0152]],
        requires_grad=True),
 Parameter containing:
 tensor([ 0.0072, -0.0399, -0.0138,  0.0062, -0.0099, -0.0006, -0.0142, -0.0337,
          ……
         -0.0370, -0.0121, -0.0348, -0.0200, -0.0285,  0.0367,  0.0050, -0.0166],
        requires_grad=True),
 Parameter containing:
 tensor([[-0.0130,  0.0301,  0.0721,  ..., -0.0634,  0.0325, -0.0830],
         [-0.0086, -0.0374, -0.0281,  ..., -0.0543,  0.0105,  0.0822],
         [-0.0305,  0.0047, -0.0090,  ...,  0.0370, -0.0187,  0.0824],
         ...,
         [ 0.0529, -0.0236,  0.0219,  ...,  0.0250,  0.0620, -0.0446],
         [ 0.0077, -0.0576,  0.0600,  ..., -0.0412, -0.0290,  0.0103],
         [ 0.0375, -0.0147,  0.0622,  ...,  0.0350,  0.0179,  0.0667]],
        requires_grad=True),
 Parameter containing:
 tensor([-0.0709, -0.0675, -0.0492,  0.0694,  0.0390, -0.0861, -0.0427, -0.0638,
         -0.0123,  0.0845], requires_grad=True)]

6. model.named_parameters()

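As with the earlier pairs of methods, this one iterates over the parameters together with their names. Each name ends in .weight or .bias, which distinguishes weights from biases:

In [32]: len(model_named_parameters)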
Out[32]: 12

In [33]: model_named_parameters                                                                                                      
Out[33]: 
[('features.0.weight', Parameter containing:
  tensor([[[[ 0.1200, -0.1627, -0.0841],
            [-0.1369, -0.1525,  0.0541],
            [ 0.1203,  0.0564,  0.0908]],
           ……
           [[-0.1587,  0.0735, -0.0066],
            [ 0.0210,  0.0257, -0.0838],
            [-0.1797,  0.0675,  0.1282]]]], requires_grad=True)),
 ('features.0.bias', Parameter containing:
  tensor([-0.1251,  0.1673,  0.1241, -0.1876,  0.0683,  0.0346],
         requires_grad=True)),
 ('features.1.weight', Parameter containing:
  tensor([0.0072, 0.0272, 0.8620, 0.0633, 0.9411, 0.2971], requires_grad=True)),
 ('features.1.bias', Parameter containing:
  tensor([0., 0., 0., 0., 0., 0.], requires_grad=True)),
 ('features.4.weight', Parameter containing:
  tensor([[[[ 0.0632, -0.1078, -0.0800],
            [-0.0488,  0.0167,  0.0473],
            [-0.0743,  0.0469, -0.1214]],
           ……
           [[-0.1067, -0.0851,  0.0498],
            [-0.0695,  0.0380, -0.0289],
            [-0.0700,  0.0969, -0.0557]]]], requires_grad=True)),
 ('features.4.bias', Parameter containing:
  tensor([-0.0608,  0.0154,  0.0231,  0.0886, -0.0577,  0.0658, -0.1135, -0.0221,
           0.0991], requires_grad=True)),
 ('features.5.weight', Parameter containing:
  tensor([0.2514, 0.1924, 0.9139, 0.8075, 0.6851, 0.4522, 0.5963, 0.8135, 0.4010],
         requires_grad=True)),
 ('features.5.bias', Parameter containing:
  tensor([0., 0., 0., 0., 0., 0., 0., 0., 0.], requires_grad=True)),
 ('classifier.0.weight', Parameter containing:
  tensor([[ 0.0223,  0.0079, -0.0332,  ..., -0.0394,  0.0291,  0.0068],
          ……
          [ 0.0045,  0.0156,  0.0281,  ..., -0.0348, -0.0370, -0.0152]],
         requires_grad=True)),
 ('classifier.0.bias', Parameter containing:
  tensor([ 0.0072, -0.0399, -0.0138,  0.0062, -0.0099, -0.0006, -0.0142, -0.0337,
           ……
          -0.0370, -0.0121, -0.0348, -0.0200, -0.0285,  0.0367,  0.0050, -0.0166],
         requires_grad=True)),
 ('classifier.3.weight', Parameter containing:
  tensor([[-0.0130,  0.0301,  0.0721,  ..., -0.0634,  0.0325, -0.0830],
          [-0.0086, -0.0374, -0.0281,  ..., -0.0543,  0.0105,  0.0822],
          [-0.0305,  0.0047, -0.0090,  ...,  0.0370, -0.0187,  0.0824],
          ...,
          [ 0.0529, -0.0236,  0.0219,  ...,  0.0250,  0.0620, -0.0446],
          [ 0.0077, -0.0576,  0.0600,  ..., -0.0412, -0.0290,  0.0103],
          [ 0.0375, -0.0147,  0.0622,  ...,  0.0350,  0.0179,  0.0667]],
         requires_grad=True)),
 ('classifier.3.bias', Parameter containing:
  tensor([-0.0709, -0.0675, -0.0492,  0.0694,  0.0390, -0.0861, -0.0427, -0.0638,
          -0.0123,  0.0845], requires_grad=True))]

7. model.state_dict()

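model.state_dict() returns the model's state as a dictionary (an OrderedDict) that maps names to tensors. Unlike the methods above it is not a generator, so no iteration is needed to materialize it. Besides the parameters it also contains persistent buffers such as the BatchNorm running statistics. Its tensors share storage with the model, so modifying them in place changes the parameters of the corresponding layers, which is especially convenient for parameter pruning.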
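
The relationship between named_parameters() and state_dict() can be checked on a small model. This is a minimal sketch using a two-layer `nn.Sequential` rather than the Net above; the variable names are illustrative assumptions.

```python
import torch
from torch import nn

model = nn.Sequential(nn.Conv2d(3, 6, kernel_size=3), nn.BatchNorm2d(6))

# Names of learnable parameters versus names of everything in the state dict
param_names = [name for name, _ in model.named_parameters()]
state_names = list(model.state_dict().keys())
buffer_only = [n for n in state_names if n not in param_names]
print(param_names)
print(buffer_only)

# state_dict tensors share storage with the model, so an in-place edit
# of a state_dict entry is visible in the layer itself
model.state_dict()["0.bias"].zero_()
bias_is_zero = bool((model[0].bias == 0).all())
print(bias_is_zero)
```

The extra state_dict entries are the BatchNorm buffers. Note that assigning a new tensor to a key of the returned dictionary does not change the model; only in-place edits or load_state_dict() do.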

ESPCN: an image super-resolution method

Paper: https://arxiv.org/abs/1609.05158

Code: https://github.com/leftthomas/ESPCN

Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network

ESPCN is a real-time image super-resolution method based on convolutional neural networks, proposed in a paper published at CVPR 2016.

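The main contribution of the paper is a new sub-pixel convolutional layer. Earlier methods usually upsample the input first so that it has the same size as the high-resolution image and then feed it to the network. The convolutions therefore run at the higher resolution, which is less efficient than convolving the low-resolution image. ESPCN instead computes the convolutions directly on the low-resolution image and produces the high-resolution image only at the end.

To go from resolution n to rn, ESPCN produces r*r channels and then applies the sub-pixel convolution step to form the high-resolution image. For example, with 9 channels (r = 3), the shuffle rearranges the elements at the same spatial position across the 9 channels into one 3*3 block of the output image. Although this transform is called sub-pixel convolution, it involves no convolution at all; it is purely a rearrangement.

With sub-pixel convolution, the interpolation function that maps the low-resolution image to the high-resolution one is contained implicitly in the preceding convolutional layers and can be learned automatically. The image size changes only in the last layer; all earlier convolutions run on the low-resolution image, which is why the method is efficient.

ESPCN uses tanh instead of ReLU as the activation function, and mean squared error as the loss.

PyTorch already provides sub-pixel convolution: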

nn.PixelShuffle(upscale_factor)

Taking a 4-D input (N, C, H, W) as an example, PixelShuffle rearranges a Tensor of shape $(*, C \times r^2, H, W)$ into a Tensor of shape $(*, C, rH, rW)$.
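
The shape change and the element rearrangement can be verified directly. This is a minimal sketch with r = 3 and a single output channel; the index formula in the comment is the mapping used by `nn.PixelShuffle`, and the concrete indices `h, w, i, j` are arbitrary picks.

```python
import torch
from torch import nn

r = 3
# One sample with r*r = 9 channels of size 2 x 2, filled with distinct values
x = torch.arange(1 * r * r * 2 * 2, dtype=torch.float32).reshape(1, r * r, 2, 2)
y = nn.PixelShuffle(upscale_factor=r)(x)
print(tuple(x.shape), "->", tuple(y.shape))

# Mapping: out[n, c, h*r + i, w*r + j] == in[n, c*r*r + i*r + j, h, w]
h, w, i, j = 1, 0, 2, 1
same = bool(y[0, 0, h * r + i, w * r + j] == x[0, i * r + j, h, w])
print(same)
```

Each low-resolution position (h, w) thus expands into one r x r block of the output, filled from the r*r channels at that position.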

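Upsample:

Upsamples multi-channel 1-D (temporal), 2-D (spatial), or 3-D (volumetric) data.

For volumetric input (3-D, such as voxelized point-cloud data), the input Tensor is 5-D: minibatch x channels x depth x height x width
For spatial input (2-D, such as jpg or png images), the input Tensor is 4-D: minibatch x channels x height x width
For temporal input (1-D, such as vector data), the input Tensor is 3-D: minibatch x channels x width

The supported algorithms include nearest-neighbor, linear, bilinear, and trilinear interpolation, which upsample 3-D, 4-D, and 5-D input Tensors respectively (nearest-neighbor works for all three).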
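
A short sketch of which interpolation mode goes with which input rank, using `nn.Upsample` with a scale factor of 2. The tensor sizes are arbitrary and chosen only to make the shapes easy to read.

```python
import torch
from torch import nn

x1 = torch.zeros(1, 1, 4)        # temporal:   N x C x W
x2 = torch.zeros(1, 1, 4, 4)     # spatial:    N x C x H x W
x3 = torch.zeros(1, 1, 4, 4, 4)  # volumetric: N x C x D x H x W

y1 = nn.Upsample(scale_factor=2, mode="linear", align_corners=False)(x1)
y2 = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)(x2)
y3 = nn.Upsample(scale_factor=2, mode="trilinear", align_corners=False)(x3)
print(tuple(y1.shape), tuple(y2.shape), tuple(y3.shape))

# Nearest-neighbor simply repeats each value scale_factor times per axis
small = torch.tensor([[[[1., 2.], [3., 4.]]]])
yn = nn.Upsample(scale_factor=2, mode="nearest")(small)
print(yn[0, 0])
```

Only the trailing spatial dimensions are scaled; the batch and channel dimensions are left unchanged.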

RepVGG: Making VGG-style ConvNets Great Again

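Paper: https://arxiv.org/abs/2101.03697
Official code (PyTorch implementation): https://github.com/DingXiaoH/RepVGG

For me, the most useful contribution of this paper is structural re-parameterization:

At inference time, the three parallel branches are merged into a single branch while the output stays unchanged.

Structural re-parameterization has two main steps. The first fuses each Conv2d operator with its BN operator and converts the BN-only branch into a Conv2d operator. The second merges the 3x3 convolutions of all branches into a single convolutional layer.

1. Conv2d and BN. This fusion is very common: at inference time both convolution and BN are linear (affine) operations, so they can be merged.

Assume the input feature map is as shown in the figure below, with 2 input channels, and that two convolution kernels are used (the figure only draws the parameters of the first kernel).

Next, compute the first element of channel 1 of the output feature map, that is, the value obtained when kernel 1 convolves the red-boxed region of the input feature map (the input feature map is padded so that the output keeps the same height and width). The other positions are computed the same way and are not shown here.

The feature map produced by the convolution is then fed to the BN layer. Again computing the first element of channel 1 of the output feature map, the inference-time BN formula gives the result shown in the figure below.

Code

Conv2d+BN fusion experiment (PyTorch)
The small experiment below is adapted from the authors' source code. It first creates a module containing a convolution and a BN layer, then fuses the convolution weights with the BN parameters using the conversion formula above and loads the result into a new convolution module, fused_conv. Finally it creates a random Tensor (f1) and feeds it to both module and fused_conv; comparing the two outputs shows that they match.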

from collections import OrderedDict

import numpy as np
import torch
import torch.nn as nn


def main():
    torch.random.manual_seed(0)

    f1 = torch.randn(1, 2, 3, 3)

    module = nn.Sequential(OrderedDict(
        conv=nn.Conv2d(in_channels=2, out_channels=2, kernel_size=3, stride=1, padding=1, bias=False),
        bn=nn.BatchNorm2d(num_features=2)
    ))

    module.eval()

    with torch.no_grad():
        output1 = module(f1)
        print(output1)

    # fuse conv + bn
    kernel = module.conv.weight 
    running_mean = module.bn.running_mean
    running_var = module.bn.running_var
    gamma = module.bn.weight
    beta = module.bn.bias
    eps = module.bn.eps
    std = (running_var + eps).sqrt()
    t = (gamma / std).reshape(-1, 1, 1, 1)  # [ch] -> [ch, 1, 1, 1]
    kernel = kernel * t
    bias = beta - running_mean * gamma / std
    fused_conv = nn.Conv2d(in_channels=2, out_channels=2, kernel_size=3, stride=1, padding=1, bias=True)
    fused_conv.load_state_dict(OrderedDict(weight=kernel, bias=bias))

    with torch.no_grad():
        output2 = fused_conv(f1)
        print(output2)

    np.testing.assert_allclose(output1.numpy(), output2.numpy(), rtol=1e-03, atol=1e-05)
    print("convert module has been tested, and the result looks good!")


if __name__ == '__main__':
    main()

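RepVGG uses conv+BN pairs heavily, and merging layers reduces the layer count and speeds up inference. The derivation below covers a convolution with weight $W$ and bias $b$, where $\mu$ and $\sigma^2$ are the BN running mean and variance, $\gamma$ and $\beta$ are the BN scale and shift, and $\epsilon$ is the BN stability constant:

$$\mathrm{BN}(\mathrm{Conv}(x)) = \gamma \cdot \frac{(W * x + b) - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta = \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}}\, W * x + \left(\beta + \frac{\gamma\,(b - \mu)}{\sqrt{\sigma^2 + \epsilon}}\right)$$

This is itself just a convolutional layer whose weights absorb the BN parameters. Let:

$$W' = \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}}\, W, \qquad b' = \beta + \frac{\gamma\,(b - \mu)}{\sqrt{\sigma^2 + \epsilon}}$$

The fused result is then:

$$\mathrm{BN}(\mathrm{Conv}(x)) = W' * x + b'$$

The corresponding fusion code is shown below (the convolution there has no bias, so $b = 0$):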

def _fuse_bn_tensor(self, branch):
        if branch is None:
            return 0, 0
        if isinstance(branch, nn.Sequential):
            kernel = branch.conv.weight
            running_mean = branch.bn.running_mean
            running_var = branch.bn.running_var
            gamma = branch.bn.weight
            beta = branch.bn.bias
            eps = branch.bn.eps
        else:
            ...
        std = (running_var + eps).sqrt()
        t = (gamma / std).reshape(-1, 1, 1, 1)
        return kernel * t, beta - running_mean * gamma / std

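2. How to merge the different branches:

The authors first convert the convolution kernels of all branches to 3*3:

2.1 Converting a 1×1 convolution into a 3×3 convolution
This step is simple. As the figure below shows for one kernel of the 1×1 convolutional layer, we only need to pad the original weight with a ring of zeros to obtain a 3×3 kernel. To keep the height and width of the output feature map unchanged, padding must now be set to 1 (it was 0 for the 1×1 kernel). The convolution and BN layers are then fused as described in step 1 above.

2.2 Converting BN into a 3×3 convolution
The BN-only branch has no convolutional layer, so we first construct one ourselves. As shown in the figure below, we build a 3×3 convolutional layer that performs only the identity mapping, so its output feature map equals its input. With a convolutional layer in place, the convolution and BN layers can again be fused as described in step 1 above.

2.3 Merging the branches
The sections above showed how to convert each branch into a single 3×3 convolutional layer. The remaining step is to turn the multiple branches into one single-path 3×3 convolutional layer.

Merging is simple as well: just add up the parameters of the three convolutional layers. This works because convolution is linear in its kernel, so the sum of the branch outputs equals the output of one convolution with the summed kernels and summed biases. The detailed derivation is omitted here; it is worth working through by hand if it is unfamiliar.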
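
The branch-merging claim can be checked numerically. The sketch below assumes BN has already been fused away and biases are zero, so each branch is a plain kernel; the names `k3`, `k1`, `k1_as_3`, and `k_id` are hypothetical and this is not the RepVGG source code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C = 2                                 # same number of input and output channels
x = torch.randn(1, C, 5, 5)
k3 = torch.randn(C, C, 3, 3)          # 3x3 branch
k1 = torch.randn(C, C, 1, 1)          # 1x1 branch

# 2.1: pad the 1x1 kernel with a ring of zeros to get a 3x3 kernel
k1_as_3 = F.pad(k1, [1, 1, 1, 1])

# 2.2: identity branch as a 3x3 kernel with a 1 at the center of channel c -> c
k_id = torch.zeros(C, C, 3, 3)
for c in range(C):
    k_id[c, c, 1, 1] = 1.0

# Three parallel branches versus one merged 3x3 convolution
branches = F.conv2d(x, k3, padding=1) + F.conv2d(x, k1) + x
merged = F.conv2d(x, k3 + k1_as_3 + k_id, padding=1)

max_diff = (branches - merged).abs().max().item()
print(max_diff)
```

The difference is at the level of floating-point rounding, which is what lets the three training-time branches collapse into one inference-time convolution.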

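Overall, the goal of this paper is Simple is Fast, Memory-economical, Flexible, and it proposes many ideas toward that goal. It has been quite inspiring for my current work, especially the final parts on merging the network and on quantization. My next step is to study PyTorch quantization-aware training, QAT (torch.quantization.prepare_qat).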


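Common tools in IPython (Jupyter)

IPython is an interactive Python shell that is much more convenient than the default Python shell: it supports tab completion of variables, automatic indentation, and bash shell commands, and it ships with many useful built-in features and functions. Learning IPython lets us use Python more efficiently. It is also one of the best platforms for scientific computing and interactive visualization in Python.

IPython provides two main components:

1. A powerful interactive Python shell
2. A Jupyter kernel for use with Jupyter notebooks (formerly IPython Notebook)

The main features of IPython are:

1. Running the IPython console
2. Using IPython as a system shell
3. Using the input history (history)
4. Tab completion
5. Running scripts with the %run command
6. Quick timing with the %timeit command
7. Quick debugging with the %pdb command
8. Interactive computing with pylab
9. Using IPython Notebook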

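Tab completion

While typing an expression in the shell, pressing Tab lists every variable (object, function, and so on) in the current namespace that matches the characters typed so far.

Introspection

Putting a question mark ? before or after a variable displays general information about that object. This is called object introspection.

If the object is a function or an instance method, its docstring is displayed as well.

Using the command history

In the IPython shell, the up and down keys step through earlier commands. The hist command (or history) lists all previous inputs. (Strictly speaking the command is %hist, which is also a magic command.)

Running scripts with `%run`

In an IPython session, any file can be run as a Python program with %run: type %run followed by the path to the Python file.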

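Quickly timing code with `%timeit`

In an interactive session, the %timeit magic command quickly measures how long a piece of code takes to run. The same statement is executed many times in a loop, and the average over the runs is reported as its running time. The -n option controls how many times the statement is executed in a single loop, and the -r option controls how many times the loop is repeated.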
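
Outside IPython, the standard-library `timeit` module exposes the same two knobs. The sketch below is a rough analogue of `%timeit -n 1000 -r 5 sum(range(100))`; the statement being timed is an arbitrary example.

```python
import timeit

# number=1000 plays the role of -n (executions per loop),
# repeat=5 plays the role of -r (how many loops are timed)
timings = timeit.repeat(stmt="sum(range(100))", number=1000, repeat=5)

# Each entry is the total time of one loop; divide to get time per execution
per_call = [t / 1000 for t in timings]
print("mean per call: %.3e s" % (sum(per_call) / len(per_call)))
print("best per call: %.3e s" % min(per_call))
```

Reporting both the mean and the minimum is useful because the minimum is less affected by other processes running on the machine.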

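Quick debugging with `%debug`

IPython ships with a powerful debugger. Whenever the console raises an exception, the %debug magic command starts the debugger at the point of the exception. In debug mode you can access all local variables and the whole stack trace. Use u and d to move up and down the stack and q to quit the debugger. Typing ? in the debugger lists all available commands.

Using the system shell in IPython

System shell commands can be run directly in IPython, and their output can be captured as a list of Python strings. To do this, prefix the shell command with an exclamation mark !. For example, on my Windows system I can ping Baidu directly from IPython.

Key topic: the `display` module

Official documentation https://ipython.readthedocs.io/en/stable/api/generated/IPython.display.html

If you have ever wanted to show images, video, audio, or HTML in Jupyter and did not know how, the IPython.display module solves that.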

1. Audio

from IPython.display import Audio,display
sound_file = '../taobao427.mp3'
display(Audio(sound_file))

2. IPython.display.Image



from IPython.display import display, Image

path = "1.jpg"

display( Image( filename = path) )

3. Playing a video

from IPython.display import clear_output, display
from PIL import Image
import time
import cv2
import os

def show_video(video_path: str, small: int = 2):
    """Play a video inline by displaying its frames one by one, downscaled by `small`."""
    if not os.path.exists(video_path):
        print("Video file does not exist")
        return
    video = cv2.VideoCapture(video_path)
    current_time = 0
    try:
        while True:
            clear_output(wait=True)
            ret, frame = video.read()
            if not ret:
                break
            lines, columns, _ = frame.shape
            ######### do img preprocess ##########

            # Draw a rectangle:
            # cv2.rectangle(frame, (500, 300), (800, 400), (0, 0, 255), 5, 1, 0)
            # Flip vertically:
            # frame = cv2.flip(frame, 0)

            ######################################

            if current_time == 0:
                current_time = time.time()
            else:
                last_time = current_time
                current_time = time.time()
                fps = 1. / (current_time - last_time)
                text = "FPS: %d" % int(fps)
                cv2.putText(frame, text, (0, 100), cv2.FONT_HERSHEY_TRIPLEX, 3.65, (255, 0, 0), 2)

            # OpenCV decodes frames as BGR while PIL expects RGB, so convert every frame
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frame = cv2.resize(frame, (int(columns / small), int(lines / small)))

            img = Image.fromarray(frame)

            display(img)
            # Throttle the frame rate
            time.sleep(0.02)
    except KeyboardInterrupt:
        pass  # stop playback when the cell is interrupted
    finally:
        video.release()

4. HTML (video)

# ########## display
from IPython.display import display, HTML

html_str = '''
<video controls width=\"500\" height=\"500\" src=\"{}\">animation</video>
'''.format("./dataset/vid****8726.mp4")
print(html_str)
display(HTML(html_str))

5. Embedding a reference web page or paper with an iframe

from IPython.display import IFrame
IFrame(src='https://www.baidu.com/', width=800, height=500)

from IPython.display import HTML
HTML("""

Example Domain

This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.

More information...

""")

6. Inserting an iframe tag

from IPython.display import HTML
HTML('')

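Linux shell study notes

I studied the shell briefly before, but since I rarely used it I have forgotten most of it, so this is a refresher.

GitHub: https://github.com/wangdoc/bash-tutorial

Read online: https://wangdoc.com/bash/

To learn Bash, we first need to understand what a shell is. The word shell originally means an outer casing, in contrast to the kernel: it is the layer around the kernel, the interface through which the user talks to the kernel.

More specifically, the word shell has several meanings.

First, a shell is a program that provides an environment for interacting with the user. This environment has only a command prompt where the user types commands on the keyboard, so it is also called the command line interface (CLI). The shell receives the command typed by the user, passes it to the operating system for execution, and returns the result to the user. In this book, unless stated otherwise, shell means the command line environment.

Second, a shell is a command interpreter that interprets the commands the user types. It supports variables, conditionals, loops, and other syntax, so users can write small programs with shell commands, called scripts. These scripts are interpreted by the shell rather than compiled.

Finally, a shell is a toolbox that provides many small tools for conveniently using the operating system.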

1. List the available shells

cat /etc/shells

2. Change a user's login shell

usermod -s /bin/bash username
or
chsh -s /bin/sh   # ch = change, sh = shell: change the login shell

root@Administrator:~# chsh -s /bin/sh
root@Administrator:~# chsh
Changing the login shell for root
Enter the new value, or press ENTER for the default
        Login Shell [/bin/sh]: /bin/bash

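4. Input and output redirection

Common forms of output redirection:

command > file   # redirect standard output (fd 1) to file
command 1> file  # same as above, with the file descriptor written out
command 2> file  # redirect standard error (fd 2) to file
command &> file  # redirect standard output (fd 1) and standard error (fd 2) to file together

Using >> instead of > appends to file rather than overwriting it.

&> is shorthand for redirecting standard output and standard error together.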
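
The same redirections can be mirrored from Python, which may help when a script rather than an interactive shell launches the command. This is a minimal sketch under stated assumptions: a child Python process stands in for `command`, and `out.txt`, `err.txt`, and `both.txt` are hypothetical file names created in a temporary directory.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Child process standing in for `command`: it writes one line to
# stdout (fd 1) and one line to stderr (fd 2).
CHILD = [sys.executable, "-c",
         "import sys; print('to stdout'); print('to stderr', file=sys.stderr)"]

workdir = Path(tempfile.mkdtemp())

# command > out.txt 2> err.txt : fd 1 and fd 2 go to separate files
with open(workdir / "out.txt", "w") as out, open(workdir / "err.txt", "w") as err:
    subprocess.run(CHILD, stdout=out, stderr=err, check=True)

# command &> both.txt : both streams go to the same file
with open(workdir / "both.txt", "w") as both:
    subprocess.run(CHILD, stdout=both, stderr=subprocess.STDOUT, check=True)

# command >> out.txt : append to the file instead of overwriting it
with open(workdir / "out.txt", "a") as out:
    subprocess.run(CHILD, stdout=out, stderr=subprocess.DEVNULL, check=True)

out_text = (workdir / "out.txt").read_text()
err_text = (workdir / "err.txt").read_text()
both_text = (workdir / "both.txt").read_text()
print(out_text, err_text, both_text, sep="---\n")
```

After the three runs, `out.txt` holds the stdout line twice (one write, one append), `err.txt` holds only the stderr line, and `both.txt` holds both lines; the order inside `both.txt` depends on buffering.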

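Input redirection

As with output redirection, the contents of a file can be redirected to standard input (fd 0).

command < file   # redirect standard input from file
command << xxx   # here document

The <<xxx form is called a here document. xxx is an arbitrary string that serves as a delimiter: it marks where the here document starts, you then type as many lines as you like at the terminal, and entering xxx again on a line of its own marks the end of the input.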

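# here document
command <<label
> content
> content
> ...
> label

In the session below, <<abc opens the here document, the three 123 lines are its content, and the closing abc ends it; head -v then labels what it read as standard input.

$ head -v <<abc
> 123
> 123
> 123
> abc
==> standard input <==
123
123
123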
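
Seen from the receiving program, a here document is just literal text arriving on standard input. A minimal sketch of the same idea from Python, assuming a child Python process stands in for `command` (the child, which numbers the lines it reads, is made up for illustration):

```python
import subprocess
import sys

# Child process standing in for `command`: it numbers each line read from stdin.
CHILD = [sys.executable, "-c",
         "import sys\nfor i, line in enumerate(sys.stdin, 1): print(i, line.rstrip())"]

# The lines that would be typed between the opening and closing delimiter.
here_doc = "123\n123\n123\n"

# `input=` plays the role of the here document: the text becomes the child's stdin.
result = subprocess.run(CHILD, input=here_doc, capture_output=True, text=True, check=True)
numbered = result.stdout.splitlines()
print(numbered)
```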

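5. Pipes (|)

A | joins two commands: the shell connects the output of the process on the left to the input of the process on the right through a pipe, which lets the two processes communicate.

Pipes let you chain several commands, with each command's output fed to the next command as its input.

ls | grep '\.txt$'

The pattern is quoted so that the shell does not expand it as a glob before grep sees it, and it is written as a regular expression because that is what grep matches against.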
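
What the shell does for a pipe can be sketched from Python by connecting one process's stdout to the next one's stdin. This is only an illustration: two child Python processes stand in for a command that lists file names and for a filter that keeps the names ending in .txt, and the file names are hypothetical.

```python
import subprocess
import sys

# Producer standing in for a listing command: prints a few hypothetical file names.
producer = subprocess.Popen(
    [sys.executable, "-c", "print('a.txt'); print('b.md'); print('c.txt')"],
    stdout=subprocess.PIPE,
)

# Consumer standing in for the filter: keeps the lines that end in .txt.
consumer = subprocess.Popen(
    [sys.executable, "-c",
     "import sys\nfor line in sys.stdin:\n    if line.rstrip().endswith('.txt'): print(line.rstrip())"],
    stdin=producer.stdout,      # the pipe: producer's stdout feeds consumer's stdin
    stdout=subprocess.PIPE,
    text=True,
)

producer.stdout.close()         # the parent no longer needs its copy of the pipe's read end
output, _ = consumer.communicate()
producer.wait()

matches = output.splitlines()
print(matches)
```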

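6. Keyboard shortcuts

Bash has many keyboard shortcuts that make it much quicker to work with. The most common ones are listed below; the tutorial's chapter on line editing (《行操作》) has the full list.

Ctrl + L: clear the screen and move the current line to the top.
Ctrl + C: abort the command that is currently running.
Shift + PageUp: scroll up.
Shift + PageDown: scroll down.
Ctrl + U: delete from the cursor to the start of the line.
Ctrl + K: delete from the cursor to the end of the line.
Ctrl + W: delete the word before the cursor.
Ctrl + D: close the shell session.
↑, ↓: browse the history of executed commands.

Besides these shortcuts, Bash offers automatic completion. When a command is half typed, press Tab and Bash completes the rest. For example, type tou, press Tab, and Bash fills in ch.

Bash completes paths as well as commands. When a long path is needed, type the beginning and press Tab to complete the rest. If there are several candidates, press Tab twice and Bash lists them all for you to choose from.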

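Cursor movement

Readline provides shortcuts for moving the cursor quickly.

Ctrl + a: move to the start of the line.
Ctrl + b: move one character toward the start of the line, same as the left arrow.
Ctrl + e: move to the end of the line.
Ctrl + f: move one character toward the end of the line, same as the right arrow.
Alt + f: move to the end of the current word.
Alt + b: move to the start of the current word.

Clearing the screen

Ctrl + l clears the screen, that is, it moves the current line to the first line of the screen; it has the same effect as the clear command.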

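Directory stack

cd -

Bash remembers directories the user has visited. By default it remembers only the previous directory, and cd - returns to it.

pushd, popd

To remember more than one directory, use the pushd and popd commands, which operate on the directory stack.

pushd is used like cd and enters the given directory.

$ pushd dirname

The command above enters the directory dirname and puts it on the stack.

The first time pushd is used, the current directory is put on the stack first, and the target directory is then put on top of it. Every later pushd puts the directory being entered on top of the stack.

Without arguments, popd removes the top entry of the stack and enters the directory that becomes the new top (the former second entry).

Here is an example.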

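# currently in the home directory; nothing has been pushed yet
$ pwd
/home/me

# enters /home/me/foo
# the stack is now /home/me/foo /home/me
$ pushd ~/foo

# enters /etc
# the stack is now /etc /home/me/foo /home/me
$ pushd /etc

# enters /home/me/foo
# the stack is now /home/me/foo /home/me
$ popd

# enters /home/me
# the stack is now /home/me
$ popd

# nothing left to pop: Bash reports "directory stack empty" and the directory does not change
$ popd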
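
The stack behaviour in this example can be replayed with a toy model. This is a sketch only, not how Bash implements it: the class `DirStack` is hypothetical, it covers only `pushd dir` and argument-less `popd`, and it assumes the current directory is always the top entry, which is how the `dirs` builtin prints the stack.

```python
class DirStack:
    """Toy model of Bash's directory stack; index 0 is the top (the current directory)."""

    def __init__(self, cwd):
        self.stack = [cwd]

    def pushd(self, path):
        # Enter `path` and put it on top of the stack.
        self.stack.insert(0, path)
        return list(self.stack)

    def popd(self):
        # Drop the top entry and move to the new top; fail if there is nowhere to return to.
        if len(self.stack) < 2:
            raise IndexError("directory stack empty")
        self.stack.pop(0)
        return list(self.stack)

    @property
    def cwd(self):
        return self.stack[0]


dirs = DirStack("/home/me")
after_push_foo = dirs.pushd("/home/me/foo")
after_push_etc = dirs.pushd("/etc")
after_first_pop = dirs.popd()
after_second_pop = dirs.popd()

for state in (after_push_foo, after_push_etc, after_first_pop, after_second_pop):
    print(" ".join(state))
```

A third `dirs.popd()` raises `IndexError`, mirroring the failing last `popd` in the shell session.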

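Flowformer: Linearizing Transformers with Conservation Flows (a task-universal backbone: Transformers with linear complexity)

[Overview] Transformers have flourished in recent years, but the quadratic complexity of attention holds them back on long sequences and large models. Starting from network flow theory, the machine learning group at the School of Software, Tsinghua University, proposes Flowformer, a task-universal backbone with linear complexity that performs well on five kinds of tasks: long sequences, vision, natural language, time series, and reinforcement learning.

Task universality is one of the core goals of foundation-model research and a necessary step on the way from deep learning to more advanced intelligence.
In recent years, thanks to the general relation-modeling ability of the attention mechanism, Transformers have done well in many fields and are increasingly seen as a universal architecture. However, the cost of standard attention grows quadratically with sequence length, which severely limits its use for long-sequence modeling and large models.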
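
To make the quadratic-versus-linear contrast concrete, here is a toy sketch of the generic reordering idea behind linear attention. It is not Flowformer's Flow-Attention: softmax and normalisation are left out on purpose, and the matrices `Q`, `K`, `V` and the sizes `n`, `d` are made-up values. Without the softmax, (Q Kᵀ) V and Q (Kᵀ V) give the same result, but the second order never builds the n × n score matrix.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap the rows and columns of a matrix."""
    return [list(col) for col in zip(*m)]


n, d = 6, 2   # toy sequence length and feature size
Q = [[(i + j) % 3 + 1.0 for j in range(d)] for i in range(n)]
K = [[(i * j) % 2 + 1.0 for j in range(d)] for i in range(n)]
V = [[float(i - j) for j in range(d)] for i in range(n)]

# Quadratic order: the n x n matrix Q K^T is built first.
quadratic = matmul(matmul(Q, transpose(K)), V)
# Linear order: the d x d summary K^T V is built first, then applied to every query.
linear = matmul(Q, matmul(transpose(K), V))

# Multiply-add counts for the two orders.
quadratic_cost = n * n * d + n * n * d   # (Q K^T), then (...) V
linear_cost = d * n * d + n * d * d      # (K^T V), then Q (...)

print(quadratic == linear, quadratic_cost, linear_cost)
```

The first cost grows with n², the second with n, which is the gap that linear-complexity attention variants exploit.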

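To address this, the Tsinghua team studied the problem in depth and proposed Flowformer, a task-universal backbone that brings the complexity down to linear while keeping the generality of the standard Transformer. The paper was accepted at ICML 2022.

Authors: Haixu Wu, Jialong Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long

Paper: https://arxiv.org/abs/2202.06258

Code: https://github.com/thuml/Flowformer

Compared with the standard Transformer, Flowformer has the following properties:

Linear complexity, so it can handle input sequences thousands of tokens long;
No new inductive bias, so it keeps the general modeling ability of the original attention mechanism;
Task-universal, with strong results on five kinds of tasks: long sequences, vision, natural language, time series, and reinforcement learning.

The paper examines the quadratic complexity of attention and brings the conservation principle of network flows into the design. This naturally introduces competition into the attention computation and effectively avoids trivial attention.

The resulting task-universal backbone, Flowformer, achieves linear complexity while performing well on the same five kinds of tasks.

Flowformer looks promising for long-sequence applications such as protein structure prediction and long-document understanding. Its design principle of adding no special inductive bias is also a useful pointer for research on general-purpose foundation architectures.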

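Vision MLP series: MLP-Mixer: An all-MLP Architecture for Vision

MLP-Mixer is another attempt by the ViT team, this time at a pure-MLP architecture. If MLP-Mixer were to become the mainstream architecture in CV, the field's history would read MLP -> CNN -> Transformer -> MLP. Are we heading back to where we started? (Take the attention out of a Transformer and what remains is an MLP.)

The paper proposes a vision architecture built purely from MLPs.

The input image is split into patches; a per-patch fully-connected layer turns each patch into a feature embedding; the embeddings pass through N Mixer layers; a final fully-connected layer does the classification.

A Mixer layer contains two kinds of MLP: channel-mixing and token-mixing. The channel-mixing MLP lets different channels exchange information; the token-mixing MLP lets different spatial positions (tokens) exchange information. The two kinds are interleaved so that information is mixed along both input dimensions. Each MLP consists of two fully-connected layers with a GELU in between.

As the architecture figure shows, MLP-Mixer first cuts the image into many small square patches, each of side patch_size. This step is implemented with a convolution whose kernel size and stride both equal patch_size. The patch sizes given in the paper are powers of 2.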
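
The patch-splitting step can be sketched in plain Python. This is an illustration only: the helper `split_into_patches` and the toy 4×4 image are hypothetical, and the learned projection that the convolution also applies to each patch is left out.

```python
def split_into_patches(image, patch_size):
    """Split a square single-channel image (list of rows) into flattened, non-overlapping patches."""
    size = len(image)
    assert size % patch_size == 0, "image size must be divisible by patch size"
    patches = []
    # Walk the patch grid row by row, left to right (row-major order).
    for top in range(0, size, patch_size):
        for left in range(0, size, patch_size):
            patch = [image[top + r][left + c]
                     for r in range(patch_size) for c in range(patch_size)]
            patches.append(patch)
    return patches


image = [[r * 4 + c for c in range(4)] for r in range(4)]   # toy 4x4 image with values 0..15
patches = split_into_patches(image, patch_size=2)
print(len(patches), patches)

# Patch count for a 224x224 image cut into 16x16 patches.
num_patches_224 = (224 // 16) ** 2
print(num_patches_224)
```

A convolution with kernel size and stride both equal to patch_size visits exactly these windows, one output position per patch, which is why it can stand in for "split, flatten, project".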
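Instead of the traditional ReLU activation, the network uses GELU.

After the image is cut into patches, each patch is flattened into a one-dimensional vector, as in the figure:

Each patch is then projected, as shown in the figure below:

In this way the image becomes one large matrix that can be fed to the Mixer layers.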

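Each MLP is a perceptron with two fully-connected layers. W1 and W2 are the weights of the two fully-connected layers in token_mixer, and W3 and W4 are those in channel_mixer; σ denotes the GELU activation. The formula is then simple: the input X goes through layer normalization, is multiplied by W1, passes through the activation, is multiplied by W2, and X is added back. The second formula follows the same pattern.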
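
Written out with the symbols used above, the two Mixer-layer formulas from the MLP-Mixer paper are as follows. Here X is assumed to be the S × C matrix of S patch embeddings with C channels each, the subscript (*, i) picks out column i, and (j, *) picks out row j:

```latex
\begin{aligned}
U_{*,i} &= X_{*,i} + W_2\,\sigma\!\left(W_1\,\mathrm{LayerNorm}(X)_{*,i}\right), \quad i = 1,\dots,C \\
Y_{j,*} &= U_{j,*} + W_4\,\sigma\!\left(W_3\,\mathrm{LayerNorm}(U)_{j,*}\right), \quad j = 1,\dots,S
\end{aligned}
```

The first line is the token-mixing step, applied to each channel (column) separately; the second is the channel-mixing step, applied to each patch (row) separately.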
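The matrix produced by the embedding step goes through LayerNorm and is then transposed (T denotes the transpose) before entering MLP1. MLP1 is the paper's token_mixer, which models the relations between patches (spatial positions); its weights are shared across all channels. After that the matrix is transposed back, goes through LayerNorm again, and enters the channel_mixer, which models the relations between channels. The Mixer layer also uses ResNet-style skip connections; for what they do, see [ResNet: principles and reimplementation]. Seen this way, doesn't it look a lot like a convolution?
As the figure above shows, a Mixer layer has the same input and output dimensions, and it uses MLPs to relate positions to positions and channels to channels.
That is the whole MLP-Mixer architecture.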

The tricky part of the implementation is the transpose; we use Rearrange from einops for it.

Transposing with Rearrange

Rearrange('b n d -> b d n')  # [batch_size, num_patch, dim] -> [batch_size, dim, num_patch]
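
What the pattern 'b n d -> b d n' does can be checked without any library: it swaps the last two axes of every sample in the batch. The helper `rearrange_bnd_to_bdn` below is a hypothetical pure-Python stand-in working on nested lists, not part of einops; on a PyTorch tensor, `x.transpose(1, 2)` performs the same swap.

```python
def rearrange_bnd_to_bdn(x):
    """Pure-Python equivalent of the pattern 'b n d -> b d n' for nested lists."""
    # For every sample, turn its rows (patches) into columns (channels).
    return [[list(col) for col in zip(*sample)] for sample in x]


# One sample (b=1) with n=2 patches and d=3 channels.
x = [[[1, 2, 3],
      [4, 5, 6]]]

y = rearrange_bnd_to_bdn(x)          # shape becomes (1, 3, 2)
restored = rearrange_bnd_to_bdn(y)   # swapping twice gives the input back

print(y)
print(restored == x)
```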

# Define the multilayer perceptron
import torch
import numpy as np
from torch import nn
from einops.layers.torch import Rearrange
from torchsummary import summary
import torch.nn.functional as F


class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim, dropout=0.):
        super().__init__()
        # dim -> hidden_dim -> dim, so the input and output sizes of FeedForward match
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            # activation
            nn.GELU(),
            # dropout against overfitting
            nn.Dropout(dropout),
            # project back to dim
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        x = self.net(x)
        return x


class MixerBlock(nn.Module):
    def __init__(self, dim, num_patch, token_dim, channel_dim, dropout=0.):
        super().__init__()
        # token mixing: the MLP runs along the patch axis
        self.token_mixer = nn.Sequential(
            nn.LayerNorm(dim),
            Rearrange('b n d -> b d n'),   # [batch_size, num_patch, dim] -> [batch_size, dim, num_patch]
            FeedForward(num_patch, token_dim, dropout),
            Rearrange('b d n -> b n d')    # [batch_size, dim, num_patch] -> [batch_size, num_patch, dim]
        )
        # channel mixing: the MLP runs along the channel axis
        self.channel_mixer = nn.Sequential(
            nn.LayerNorm(dim),
            FeedForward(dim, channel_dim, dropout)
        )

    def forward(self, x):
        # each sub-block is wrapped in a skip connection
        x = x + self.token_mixer(x)
        x = x + self.channel_mixer(x)
        return x

class MLPMixer(nn.Module):
    def __init__(self, in_channels, dim, num_classes, patch_size, image_size, depth, token_dim, channel_dim, dropout=0.):
        super().__init__()
        assert image_size % patch_size == 0
        self.num_patches = (image_size // patch_size) ** 2
        # patch embedding: a convolution cuts the image into patches and projects each one to dim
        self.to_embedding = nn.Sequential(
            nn.Conv2d(in_channels=in_channels, out_channels=dim, kernel_size=patch_size, stride=patch_size),
            Rearrange('b c h w -> b (h w) c')
        )
        # depth is the number of Mixer layers the input passes through
        self.mixer_blocks = nn.ModuleList([])
        for _ in range(depth):
            self.mixer_blocks.append(MixerBlock(dim, self.num_patches, token_dim, channel_dim, dropout))
        self.layer_normal = nn.LayerNorm(dim)

        self.mlp_head = nn.Sequential(
            nn.Linear(dim, num_classes)
        )

    def forward(self, x):
        x = self.to_embedding(x)
        for mixer_block in self.mixer_blocks:
            x = mixer_block(x)
        x = self.layer_normal(x)
        # global average pooling over the patch axis
        x = x.mean(dim=1)
        x = self.mlp_head(x)
        return x
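
To see how the pieces fit together, it can help to trace the tensor shapes through the forward pass. The sketch below only does the shape bookkeeping in plain Python and does not run the model; the helper `mixer_shapes` and the hyperparameter values (224×224 RGB input, 16×16 patches, dim=512, 1000 classes) are assumptions chosen for illustration.

```python
def mixer_shapes(batch, in_channels, image_size, patch_size, dim, num_classes):
    """Trace tensor shapes through the forward pass of the MLPMixer class above."""
    assert image_size % patch_size == 0
    side = image_size // patch_size      # patches per image side
    num_patches = side ** 2
    return [
        ("input", (batch, in_channels, image_size, image_size)),
        ("conv patch embedding", (batch, dim, side, side)),
        ("rearrange to tokens", (batch, num_patches, dim)),
        ("mixer blocks", (batch, num_patches, dim)),   # every block keeps the shape
        ("mean over patches", (batch, dim)),
        ("classifier head", (batch, num_classes)),
    ]


shapes = mixer_shapes(batch=1, in_channels=3, image_size=224,
                      patch_size=16, dim=512, num_classes=1000)
for name, shape in shapes:
    print(f"{name:24s} {shape}")
```

Because each Mixer block maps (batch, num_patches, dim) to the same shape, the depth can be changed freely without touching the embedding or the head.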

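MLP-Mixer replaces ViT's Transformer blocks with Mixer MLPs, which narrows the freedom in feature extraction, and it neatly alternates between exchanging information across patches and within a patch. Judging by the results, a pure MLP seems viable too, and it drops the Transformer's complicated structure for something simpler. I'm rather looking forward to seeing how ViT and MLP-Mixer face off; it feels like the big groups keep digging in one place after another, and now they have dug up the long-buried MLP again.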

Patches Are All You Need?

The ConvMixer network

Paper: https://openreview.net/pdf?id=TVHS5Y4dNvM
GitHub: https://github.com/tmp-iclr/convmixer

ConvMixer is now integrated into the timm framework itself. You can see the PR here.

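ConvMixer was proposed to find out whether the strong results of ViT-style models come from splitting the image into patches or from the attention in the network. Taking inspiration from ViT and MLP-Mixer, the authors designed ConvMixer around depthwise separable convolutions. It outperforms some ViT variants, MLP-Mixer, and ResNet. The paper itself does not chase speed or top accuracy.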

Network structure in detail:

1. Patch embedding

The patch embedding here is in fact implemented with a single convolution layer:

nn.Conv2d(3,dim,kernel_size=patch_size,stride=patch_size)

where kernel_size is the patch size.

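2. GELU activation (Gaussian Error Linear Unit)

Many recent models use this function, for example recent Transformer models such as BERT. GELU brings the idea of stochastic regularization into the activation: it is a probabilistic description of a neuron's input, which is intuitively more natural, and the GELU paper reports better experimental results than both ReLU and ELU.

GELU can be seen as a blend of dropout, zoneout, and ReLU. It multiplies the input by a mask of 0s and 1s, and the mask is drawn at random with a probability that depends on the input. With input x and mask m, m follows a Bernoulli distribution with parameter $\Phi(x)$, where $\Phi(x) = P(X \le x)$ and $X$ follows the standard normal distribution. This choice is made because neuron inputs tend to be normally distributed; as x decreases, the input has a higher probability of being dropped, so the activation depends stochastically on the input.

As you can see, this is a combination of certain functions (such as the hyperbolic tangent, tanh) and approximation constants. There is not much more to say about it. What is interesting is the graph of the function:

As the graph shows, for x greater than 0 the output is roughly x, except on the interval from x=0 to x=1, where the output is noticeably smaller than x.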
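
The behaviour described above can be checked numerically. A minimal sketch using only the standard library: `gelu` is the deterministic form x · Φ(x) (the expected value of the random mask times the input), and `gelu_tanh` is the tanh approximation with the constants given in the GELU paper; both function names are chosen here for illustration.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


for x in (-2.0, -0.5, 0.0, 0.5, 1.0, 3.0):
    print(f"x={x:+.1f}  gelu={gelu(x):+.4f}  tanh approx={gelu_tanh(x):+.4f}")
```

For large positive x the output approaches x, between 0 and 1 it stays visibly below x, and for negative x it is a small negative number that shrinks back toward 0 as x becomes more negative.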

Advantages:

Appears to be the current best choice in NLP, and does especially well in Transformer models;
Helps avoid the vanishing gradient problem.

3. ConvMixerLayer

class ConvMixerLayer(nn.Module):
    def __init__(self, dim, kernel_size=9):
        super().__init__()
        # depthwise convolution (groups=dim); forward() wraps it in a residual connection
        self.Resnet = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=kernel_size, groups=dim, padding='same'),
            nn.GELU(),
            nn.BatchNorm2d(dim)
        )
        # pointwise (1x1) convolution
        self.Conv_1x1 = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
            nn.BatchNorm2d(dim)
        )

    def forward(self, x):
        x = x + self.Resnet(x)
        x = self.Conv_1x1(x)
        return x

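The ConvMixer layer uses a depthwise separable convolution: a depthwise convolution followed by a pointwise convolution, each with a GELU activation.
The paper calls the part drawn in red in its figure "channel wise mixing" and the part in blue "spatial mixing".
The paper finds that the larger the kernel of the depthwise convolution, the better the model performs, which is why it uses a 9×9 kernel.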
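
A quick parameter count shows why a large kernel is affordable in the depthwise part. This is a back-of-the-envelope sketch: `conv_params` is a hypothetical helper that counts the weights and biases of a 2D convolution, and the width dim=256 is an assumed value; only the kernel size 9 comes from the code above.

```python
def conv_params(in_ch, out_ch, kernel_size, groups=1, bias=True):
    """Number of learnable parameters in a 2D convolution layer."""
    weights = out_ch * (in_ch // groups) * kernel_size * kernel_size
    return weights + (out_ch if bias else 0)


dim, k = 256, 9   # assumed width; 9 is the kernel size used in ConvMixerLayer

depthwise = conv_params(dim, dim, k, groups=dim)   # one k x k filter per channel
pointwise = conv_params(dim, dim, 1)               # 1x1 convolution that mixes channels
standard = conv_params(dim, dim, k)                # ordinary k x k convolution, for comparison

print(depthwise, pointwise, depthwise + pointwise, standard)
```

With these numbers the depthwise and pointwise pair has roughly 1/60 of the parameters of a standard 9×9 convolution of the same width, so growing the kernel mostly affects the cheap depthwise term.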

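The paper also concludes that ViT's strong results come from the patch embedding (splitting the image into patches).
The authors argue that the patch embedding alone performs all of the network's downsampling: it lowers the resolution and enlarges the receptive field, which makes it easier to pick up spatial information from far away, and that is why the model performs well.

Notes: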