{"id":5716,"date":"2022-09-30T06:44:00","date_gmt":"2022-09-29T22:44:00","guid":{"rendered":"http:\/\/139.9.1.231\/?p=5716"},"modified":"2022-09-30T17:01:57","modified_gmt":"2022-09-30T09:01:57","slug":"vit-transformer-for-image","status":"publish","type":"post","link":"http:\/\/139.9.1.231\/index.php\/2022\/09\/30\/vit-transformer-for-image\/","title":{"rendered":"VIT"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"331\" src=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-179-1024x331.png\" alt=\"\" class=\"wp-image-5718\" srcset=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-179-1024x331.png 1024w, http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-179-300x97.png 300w, http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-179-768x248.png 768w, http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-179-1536x496.png 1536w, http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-179.png 1643w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Dosovitskiy et al. An image is worth 16\u00d716 words: transformers for image recognition at scale. 
In ICLR, 2021<\/p>\n\n\n\n<p>step1: split the image into patches<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"375\" src=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-180-1024x375.png\" alt=\"\" class=\"wp-image-5721\" srcset=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-180-1024x375.png 1024w, http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-180-300x110.png 300w, http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-180-768x281.png 768w, http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-180-1536x563.png 1536w, http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-180.png 1990w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>step2: vectorization, flattening the nine patches into nine vectors<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"503\" src=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-181-1024x503.png\" alt=\"\" class=\"wp-image-5723\" srcset=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-181-1024x503.png 1024w, http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-181-300x147.png 300w, http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-181-768x377.png 768w, http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-181.png 1158w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"has-bright-blue-background-color has-background\">step3: linear transformation of the vectors (linear embedding layer)<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" width=\"925\" height=\"358\" src=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-182.png\" alt=\"\" class=\"wp-image-5726\" srcset=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-182.png 925w, 
http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-182-300x116.png 300w, http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-182-768x297.png 768w\" sizes=\"(max-width: 925px) 100vw, 925px\" \/><\/figure>\n\n\n\n<p>step4: add the positional encoding to z:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" width=\"975\" height=\"565\" src=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-183.png\" alt=\"\" class=\"wp-image-5727\" srcset=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-183.png 975w, http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-183-300x174.png 300w, http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-183-768x445.png 768w\" sizes=\"(max-width: 975px) 100vw, 975px\" \/><\/figure>\n\n\n\n<p>step5: add a cls vector:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"600\" src=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-184-1024x600.png\" alt=\"\" class=\"wp-image-5729\" srcset=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-184-1024x600.png 1024w, http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-184-300x176.png 300w, http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-184-768x450.png 768w, http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-184.png 1126w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>step6: use only the output of the cls token<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" width=\"1005\" height=\"641\" src=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-185.png\" alt=\"\" class=\"wp-image-5732\" srcset=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-185.png 1005w, http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-185-300x191.png 300w, 
http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-185-768x490.png 768w\" sizes=\"(max-width: 1005px) 100vw, 1005px\" \/><\/figure>\n\n\n\n<p>Following the flowchart above, a ViT block can be broken down into the following steps:<\/p>\n\n\n\n<p>(1) patch embedding: for example, with a 224&#215;224 input image split into fixed-size 16&#215;16 patches, each image yields (224&#215;224)\/(16&#215;16)=196 patches, so the input sequence length is <strong>196<\/strong>. Each flattened patch has dimension 16x16x3=<strong>768<\/strong>, and the linear projection layer has shape 768xN (N=768), so after the linear projection the input is still 196&#215;768: 196 tokens in total, each of dimension 768. A special cls token is also added, so the final shape is <strong>197&#215;768<\/strong>. At this point, patch embedding has turned a vision problem into a sequence modeling problem.<\/p>\n\n\n\n<p>(2) positional encoding (standard learnable 1D position 
embeddings): ViT also needs positional encoding. The positional encoding can be thought of as a table with one row per position in the input sequence (197 rows here, including the cls token); each row is a vector whose dimension matches the input embedding dimension (768). Note that the positional encoding is applied by sum, not concat. After the positional information is added, the shape is still <strong>197&#215;768<\/strong><\/p>\n\n\n\n<p>(3) LN\/multi-head attention\/LN: the LN output is still 197&#215;768. In multi-head self-attention, the input is first projected to q, k, and v. With a single head, q, k, and v are each 197&#215;768; with 12 heads (768\/12=64), each head's q, k, and v are 197&#215;64, giving 12 sets of q, k, v. The outputs of the 12 heads are then concatenated into a 197&#215;768 output, which passes through another LN, so the shape is still <strong>197&#215;768<\/strong><\/p>\n\n\n\n<p>(4) MLP: expands the dimension and then shrinks it back; 197&#215;768 is expanded to 197&#215;3072 and then reduced to <strong>197&#215;768<\/strong><\/p>\n\n\n\n<p>After one block the shape is the same as the input, 197&#215;768, so multiple blocks can be stacked. Finally, the output corresponding to the special cls token, z<sub>L<\/sub><sup>0<\/sup>, is taken as the final output of the encoder and serves as the final image 
representation (an alternative is to omit the cls token and average the outputs of all tokens), as in equation (4) in the figure below, followed by an MLP for image classification.<\/p>\n\n\n\n<p>ViT requires pre-training followed by fine-tuning<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"530\" src=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-186-1024x530.png\" alt=\"\" class=\"wp-image-5734\" srcset=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-186-1024x530.png 1024w, http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-186-300x155.png 300w, http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-186-768x397.png 768w, http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-186.png 1096w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"1024\" height=\"531\" src=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-187-1024x531.png\" alt=\"\" class=\"wp-image-5735\" srcset=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-187-1024x531.png 1024w, http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-187-300x156.png 300w, http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-187-768x399.png 768w, http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-187.png 1031w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>\u2022 Pretrain the model on Dataset A, fine-tune the model on Dataset B,<br>and evaluate the model on Dataset B.<br>\u2022 Pretrained on ImageNet (small), ViT is slightly worse than ResNet.<br>\u2022 Pretrained on ImageNet-21K (medium), ViT is comparable to ResNet.<br>\u2022 Pretrained on JFT (large), ViT is slightly better than ResNet.<\/p>\n\n\n\n<p>Results:<\/p>\n\n\n\n<figure class=\"wp-block-image 
size-large\"><img loading=\"lazy\" width=\"1024\" height=\"339\" src=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-188-1024x339.png\" alt=\"\" class=\"wp-image-5737\" srcset=\"http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-188-1024x339.png 1024w, http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-188-300x99.png 300w, http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-188-768x254.png 768w, http:\/\/139.9.1.231\/wp-content\/uploads\/2022\/08\/image-188.png 1125w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>Dosovitskiy et al. An image is worth 16\u00d716 words: trans &hellip; <a href=\"http:\/\/139.9.1.231\/index.php\/2022\/09\/30\/vit-transformer-for-image\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> VIT<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[21],"tags":[],"_links":{"self":[{"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/posts\/5716"}],"collection":[{"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/comments?post=5716"}],"version-history":[{"count":16,"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/posts\/5716\/revisions"}],"predecessor-version":[{"id":8692,"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/posts\/5716\/revisions\/8692"}],"wp:attachment":[{"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/media?parent=5716"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/categories?post=5716"},{"taxon
omy":"post_tag","embeddable":true,"href":"http:\/\/139.9.1.231\/index.php\/wp-json\/wp\/v2\/tags?post=5716"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}