ResNet

Deep Residual Learning for Image Recognition

Paper PDF

Deep Residual Learning for Image Recognition ์ด๋ž€ ์ œ๋ชฉ์˜ ๋…ผ๋ฌธ์œผ๋กœ ์šฐ๋ฆฌ๊ฐ€ ๋งŽ์ด ๋“ค์–ด๋ณธ โ€˜resnetโ€™์— ๋Œ€ํ•ด ๋‚˜์˜จ ๋…ผ๋ฌธ์ด๋‹ค. 2014๋…„์— vgg๋…ผ๋ฌธ์ด ๋‚˜์˜ค๊ณ  ๋ฐ”๋กœ ๋‹ค์Œ ํ•ด์— ๋‚˜์˜จ ๋…ผ๋ฌธ์œผ๋กœ, ์‚ฌ์‹ค์ƒ vgg์˜ ๊ตฌ์กฐ์— residual mapping์ด๋ผ๋Š” ์•„์ด๋””์–ด๋งŒ์„ ์ถ”๊ฐ€ํ•˜๊ณ ๋„ imagenet classification์—์„œ ๋ˆˆ์— ๋„๋Š” ์ ์ˆ˜ ํ–ฅ์ƒ์„ ๋ณด์˜€๋‹ค. resnet์ด๋ผ๋Š” ๋…ผ๋ฌธ์€ ์•„์ด๋””์–ด ์ž์ฒด๊ฐ€ ์‰ฝ๋‹ค๋Š” ์ , ๊ทธ๋ฆฌ๊ณ  ๊ทธ ์•„์ด๋””์–ด๊ฐ€ ๋‹น์‹œ์˜ ๋ชจ๋ธ์˜ ๊นŠ์ด์™€ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์ด ์„ ํ˜•์ ์ธ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ์ด๋ฃจ์ง€ ๋ชปํ•˜๋Š” ๋ฌธ์ œ(degradation problem)๋ฅผ ์‰ฌ์šด ๋…ผ๋ฆฌ๋กœ ํ•ด๊ฒฐํ•œ๋‹ค๋Š” ์ ์—์„œ ์ข‹์€ ๋…ผ๋ฌธ์œผ๋กœ ์ง€๊ธˆ๊นŒ์ง€ ํ‰๊ฐ€๋œ๋‹ค. ์šฐ์„ ์€ resent์—์„œ ํ•ด๊ฒฐํ•˜๊ณ ์ž ํ•˜๋Š” ๋ฌธ์ œ๋ถ€ํ„ฐ ์‚ดํŽด๋ณด์ž.

๋ฌธ์ œ ์„ค์ • problem set-up#

vgg๋ฅผ ํ†ตํ•ด์„œ image classification์ด๋ผ๋Š” ๊ณผ์ œ์—์„œ ๋งŽ์€ breakthrough๊ฐ€ ์žˆ์—ˆ๋‹ค. cnn layer๋ฅผ ๊นŠ์ด ์Œ“์Œ์œผ๋กœ์จ low/mid/high ์ˆ˜์ค€์˜ feature๋“ค์„ ํ†ตํ•ฉํ•˜๊ณ , ๋” ๊นŠ์ด ์Œ“์„์ˆ˜๋ก ๋” ๋งŽ์€ feature๋“ค์„ data์—์„œ ์ถ”์ถœํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋œ ๊ฒƒ์ด๋‹ค. ํ•˜์ง€๋งŒ ๋ฌด์กฐ๊ฑด ๊นŠ์ด๋ฅผ ๋งŽ์ด ์Œ“๋Š”๋‹ค๊ณ  ๋˜์—ˆ๋˜ ๊ฒƒ์€ ์•„๋‹ˆ๋‹ค. ๋ฐ‘์˜ 2๊ฐœ์˜ ๋ฌธ์ œ๊ฐ€ ๊ทธ๊ฒƒ์ด๋‹ค.

1. Vanishing/exploding gradients problem

Is learning better networks as easy as stacking more layers?

โ€”bro. of course not!

์ฒซ๋ฒˆ์งธ ๋ฌธ์ œ๋Š” gradient๊ฐ€ ์†Œ์‹ค๋˜๊ฑฐ๋‚˜ ํญ๋ฐœํ•ด๋ฒ„๋ฆฌ๋Š” ๋ฌธ์ œ๋‹ค. layer๊ฐ€ ๋ช‡ ์ธต ๋˜์ง€ ์•Š๋Š” shallowํ•œ network์—์„œ๋Š” ์ด๋Ÿฌํ•œ ๋ฌธ์ œ๊ฐ€ ๋‚˜ํƒ€๋‚˜์ง€ ์•Š๊ฑฐ๋‚˜ ๊ฑฑ์ •์ด ํ•„์š”์—†์„ ์ •๋„ ์ด์ง€๋งŒ, network๊ฐ€ ๊นŠ์–ด์ง€๋ฉด gradient(๊ฒฝ์‚ฌ๋„)๊ฐ€ too small or big for training to work effectivelyํ•˜๊ฒŒ ๋˜๊ณ  ์ด ๋ฌธ์ œ๊ฐ€ vanishing exploding gradient ๋ฌธ์ œ๋‹ค. sigmoid ํ•จ์ˆ˜๋ฅผ ์ƒ๊ฐํ•˜๋ฉด ๋ฌธ์ œ์— ๋Œ€ํ•ด ์ด์• ํ•˜๊ธฐ ์‰ฝ๋‹ค.

when n hidden layers use an activation like the sigmoid function, n small derivatives are multiplied together. Thus, the gradient decreases exponentially as we propagate down to the initial layers.

chain rule์— ๋”ฐ๋ผ ๊ฐ layer์ƒ์—์„œ์˜ derivative๋Š” ๋„คํŠธ์›Œํฌ๋ฅผ ๋”ฐ๋ผ์„œ ๊ณฑํ•ด์ง€๊ณ , ๋ฐฉํ–ฅ์„ฑ์€ ๋๋‹จ์—์„œ ๋งจ ์ฒ˜์Œ layer๋กœ ํ–ฅํ•˜๊ฒŒ ๋œ๋‹ค. ๋’ค์—์„œ๋ถ€ํ„ฐ ์•ž์œผ๋กœ ํ–ฅํ•˜๋Š” back propagation์—์„œ sigmoid๋ฅผ non-linear activaiton function์œผ๋กœ ์‚ฌ์šฉํ•˜๋ฉด, ์Œ์˜ x๊ฐ’(input)๋“ค์€ ์ „๋ถ€ 0์— ํ•œ์—†์ด ๊ฐ€๊นŒ์›Œ์ง€๊ธฐ ๋•Œ๋ฌธ์— ํ™œ์„ฑํ™”๊ฐ€ ์ž˜๋˜์ง€ ์•Š๊ณ , ๊ณฑ์…ˆ์ด ์ง„ํ–‰๋จ์— ๋”ฐ๋ผ ์•„์ฃผ์•„์ฃผ ์ž‘์•„์ง€๊ฒŒ๋œ๋‹ค. ์ด๋Š” ๊ณง ๋งจ ์•ž๊นŒ์ง€ ์˜ค๋ฉด gradient๊ฐ€ ์‚ฌ๋ผ์ง„ ๊ฒƒ ์ฒ˜๋Ÿผ, ๊ทธ๋ฆฌ๊ณ  ํ™œ์„ฑํ™” ์—ญํ• ์„ ์ œ๋Œ€๋กœ ํ•˜์ง€ ๋ชปํ•˜๋Š” ํšจ๊ณผ๋ฅผ ๋‚˜ํƒ€๋‚ธ๋‹ค.

ํ•˜์ง€๋งŒ ์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋Š” ๋…ผ๋ฌธ์ƒ์—์„œ๋Š” ๋งŽ์ด ํ•ด๊ฒฐ๋˜์—ˆ๋‹ค๊ณ  ๋งํ•œ๋‹ค. ํ•™์Šต ์ž์ฒด๊ฐ€ ์•ˆ๋˜๋Š” ๋ฌธ์ œ์ด๊ณ  gradient๋ฅผ ์‚ด๋ฆฌ๋Š” ๊ฒƒ์ด ๋ฌธ์ œ์ž„์œผ๋กœ nomarlized initialization, intermediate normalization layers ์ด ๋‘ ๋ฐฉ๋ฒ•์—์„œ ํ•ด๊ฒฐ๋˜์—ˆ๋‹ค๊ณ  ๋ณธ๋‹ค. ๋…ผ๋ฌธ์—์„œ ์ฃผ๋กœ ๋‹ค๋ฃจ๊ณ ์žํ•˜๊ณ  ํ•ด๊ฒฐํ•˜๊ณ  ์‹ถ์€ ๋ฌธ์ œ๋Š” 2๋ฒˆ์จฐ ๋ฌธ์ œ์ด๋‹ค.

2. Degradation problem

๊นŠ์€ network์ด ์ˆ˜๋ ด์„ ์‹œ์ž‘ํ•œ๋‹ค๊ณ ํ•ด๋„, degradation problem(์„ฑ๋Šฅ ์ €ํ•˜ ๋ฌธ์ œ)๊ฐ€ ๋‚˜ํƒ€๋‚  ์ˆ˜ ์žˆ๋‹ค๊ณ  ๋งํ•œ๋‹ค. ์ด ๋ฌธ์ œ๋Š” gradient vanishing/exploding ๋ฌธ์ œ๋ณด๋‹ค ์ข€ ๋” ๋„“์€ ๋ฒ”์œ„์˜ ๋ฌธ์ œ์ด๋‹ค. ์ด ๋ฌธ์ œ์˜ ์ƒํ™ฉ์—์„œ network๋Š” ํ•™์Šต๋„ ๋˜๊ณ , gradient๋„ ์‚ด์•„์žˆ๊ณ , accuracy score๊ฐ€ ์ƒ์Šน์€ ํ•˜๋Š”๋ฐ, ์˜คํžˆ๋ ค depth๊ฐ€ ๋‚ฎ์€ network๋ณด๋‹ค depth๋ฅผ ๋†’์ธ network๊ฐ€ ์ •ํ™•๋„ ๋“ฑ์˜ ํ‰๊ฐ€์ง€ํ‘œ์—์„œ ๋” ๋†’์•„์•ผ ํ•˜๋Š”๋ฐ ๊ทธ๋ ‡์ง€ ๋ชปํ•˜๋Š” ํ˜„์ƒ์„ ๋งํ•œ๋‹ค.

../../_images/resnet_1.png

Fig. 7 The 56-layer network shows a higher error rate than the 20-layer network; the deeper model performs worse.

deeper is better๋ฅผ ํ•˜๋‚˜์˜ ์š”์†Œ๋งŒ ๋„ฃ์œผ๋ฉด ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ๊ฒƒ, ์ฆ‰ degradation problem์„ ํ•ด๊ฒฐํ•˜๋Š” ๊ฒƒ - ์ด ๋…ผ๋ฌธ์—์„œ๋Š” Redisual mapping์ด๋‹ค.

Residual mapping, Identity mapping

../../_images/resnet_2.png

Fig. 8 The paper's Fig. 2, redrawn my own way in Obsidian

๊ธฐ์กด์˜ vgg์—์„œ์˜ mapping block์„ \(H(\text{x})\)์ด๋ผ๊ณ  ํ•œ๋‹ค๋ฉด, ์ด๋Ÿฐ block์ด 18๊ฐœ์ •๋„ ์ด์–ด์ ธ ๋ถ™์–ด์žˆ๋Š” ํ˜•ํƒœ์˜€๋‹ค. block ๋‚ด๋ถ€์—๋Š” cnn layer + relu layer + cnn layer + relu layer ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๋‹ค.

resent์—์„œ๋Š” ์ด๋Ÿฌํ•œ ๊ตฌ์กฐ์—์„œ block๋งˆ๋‹ค์˜ input(x) \(\to\) output(\(H(\text{x})\)=y) ๊ด€๊ณ„๋ฅผ ๋ถ„ํ•ดํ•œ๋‹ค. input(x) + residual(F(x)) \(\to\) output(\(H(\text{x})\)=y). ๊ฒฐ๊ตญ ํ•˜๋‚˜์˜ ๋ธ”๋Ÿญ ์ƒ์—์„œ ํ•™์Šตํ•ด์•ผํ•˜๋Š” ๋ถ€๋ถ„์€ \(H(\text{x}) - x = F(\text{x})\)๊ฐ€ ๋˜๋Š” ๊ฒƒ์ด๊ณ  ์ด๊ฒƒ์ด Residual์ด ๋˜๋Š” ๊ฒƒ์ด๋‹ค. ๊ทธ๋ฆฌ๊ณ  input์€ y=f(x)=x ์ฒ˜๋Ÿผ input๊ฐ’์ด output๊ณผ ๊ฐ™์€ ๊ฒƒ ์ฒ˜๋Ÿผ mapping๋˜๋Š” ๋ถ€๋ถ„์ž„์œผ๋กœ identity mapping์ด๋ผ๊ณ  ๋ถˆ๋ฆฐ๋‹ค.

\[\begin{split} \begin{gather}
\tag{residual mapping in fig2} F = W_2\sigma(W_1\text{x}) \\
\tag{a building block} \text{y} = W_2\sigma(W_1\text{x}) + \text{x} \\
\tag{Equ 1} \text{y} = F(\text{x},\{ W_i \}) + \text{x} \\
\tag{Equ 2} \text{y} = F(\text{x},\{ W_i \}) + W_s\text{x}
\end{gather} \end{split}\]
  • F: the residual function

  • if F were a single layer, this could reduce to y = W_1 x + x

  • F(x, {W_i}) denotes multiple convolutional layers

๊ธฐ์กด์˜ output์—๋‹ค input์„ ๋”ํ•ด์ฃผ๋Š” +์˜ ๊ฐœ๋…์œผ๋กœ ์ดํ•ดํ•  ์ˆ˜ ๋„ ์žˆ๊ณ , ๊ธฐ์กด์˜ mapping์„ ํ•ด์ฒดํ•˜๋Š” -์˜ ๊ฐœ๋…์œผ๋กœ ์ƒ๊ฐํ•ด๋ณผ ์ˆ˜ ๋„ ์žˆ๋‹ค. -์˜ ๊ฐœ๋…์œผ๋กœ ์ ‘๊ทผํ•œ๋‹ค๋ฉด ๊ธฐ์กด์— optimize ํ•ด ์ฃผ์–ด์•ผํ•  ๋ถ€๋ถ„์ด ์ค„์–ด๋“ ๋‹ค๋Š” ๊ด€์ ์œผ๋กœ ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ์„ ๊ฒƒ์ด๊ณ , ์ด๊ฒƒ์ด ๋…ผ๋ฌธ์—์„œ ๊ฐ€์ •ํ•˜๊ณ  ์ ‘๊ทผํ•œ ์ง€์ ์ด๋‹ค. identity mapping(x)๊ฐ€ ์ด๋ฏธ optimalํ•˜๊ฒŒ mapping์„ ์ง„ํ–‰ํ•ด์™”๋‹ค๋ฉด ๋‚จ์€ residual mapping(F(x))๋งŒ 0์— ๊ฐ€๊น๊ฒŒ ๋งŒ๋“ค๋ฉด ๋œ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ๊ทธ๋Ÿผ H(x)๊ฐ€ ๊ฒฐ๊ณผ์ ์œผ๋กœ optimalํ•ด์งˆ ๊ฒƒ์ด๊ณ  output์€ x๋กœ ๋˜์–ด์„œ ๋‹ค์Œ block์˜ input์ด ๋  ๊ฒƒ์ด๋‹ค.

The degradation problem suggests that the solvers might have difficulties in approximating identity mappings by multiple nonlinear layers.

์—ฌ๊ธฐ์„œ solver๋ž€ optimization algorithm, back-propagation algorithm์„ ๋งํ•œ๋‹ค. gradient๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ ๊ฐ€์ค‘์น˜๋ฅผ ์—…๋ฐ์ดํŠธํ•˜์—ฌ ์†์‹ค ํ•จ์ˆ˜๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ๊ณผ์ •์„ ๋งํ•œ๋‹ค. ์ฆ‰ ๊ธฐ์กด์˜ identity mapping์ด ์—†๋˜ network์—์„œ์˜ ์ตœ์ ํ™” ๊ณผ์ •์—์„œ์˜ degradation problem์€ ๋น„์„ ํ˜• ๋ ˆ์ด์–ด๋ฅผ ์—ฌ๋Ÿฌ ๊ฐœ ํ†ต๊ณผํ•˜๋ฉด์„œ ์ž…๋ ฅ๊ณผ ์ถœ๋ ฅ์ด ๊ฐ™์€(identity mapping)์˜ ๊ฒฝ์šฐ์— ๋Œ€ํ•œ ๋Œ€์ฒ˜๊ฐ€ ์–ด๋ ค์›Œ ์ง„๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์œ„์˜ residual mapping๊ณผ identity mapping์„ shortcut connection์œผ๋กœ ๊ตฌํ˜„ํ•จ์œผ๋กœ์จ ๊นŠ์–ด์ง€๋Š” network์—์„œ์˜ gradient ํ๋ฆ„์„ ๋ณด์กดํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค๊ณ  ๋งํ•œ๋‹ค.

๋ฌผ๋ก  identity mapping์ด optimalํ•  ๊ฒฝ์šฐ๋Š” ์‹ค์ œ ํ•™์Šต๊ณผ์ •์—์„œ๋Š” ์ด๋ฃจ์–ด ์ง€์ง€ ์•Š์„ ์ˆ˜ ์žˆ๋‹ค๊ณ  ๋ง๋‹ค. ํ•˜์ง€๋งŒ ๊ทธ๋Ÿผ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  ๊ธฐ์กด์˜ input๊ฐ’์„ ์ฐธ์กฐํ•˜๋Š” ๊ฒƒ ๋งŒ์œผ๋กœ๋„ ํ•™์Šต์— ๋„์›€์ด ๋œ๋‹ค๊ณ  ๋…ผ๋ฌธ์—์„œ๋Š” ๋งํ•œ๋‹ค. ๋‹ค์Œ ๋ธ”๋ก์˜ residual mapping์ด ์ด์ „ ๋ธ”๋ก์˜ identity mapping์„ ์ฐธ์กฐํ•˜๋Š” ๊ฒƒ ๋งŒ์œผ๋กœ๋„ ํ•™์Šต์˜ ์š”๋™(?)์ด ์ ์–ด์ง„๋‹ค๊ณ  ๋งํ•œ๋‹ค.

Shortcut connection == identity mapping?

shortcut connection์€ ์ž…๋ ฅ๊ฐ’์„ ๋’ค๋กœ ๋„˜๊ฒจ์„œ ๋”ํ•ด์ค€๋‹ค. ์ด๊ฒƒ์—๋„ ์ข…๋ฅ˜๊ฐ€ ์žˆ๊ณ  identity mapping์€ 1๋ฒˆ์œผ๋กœ ๊ทธ ์ข…๋ฅ˜์ค‘์— ํ•˜๋‚˜๋กœ ๋ณผ ์ˆ˜ ์žˆ์Œ์œผ๋กœ ์ •ํ™•ํžˆ๋Š” ์ฐจ์ด๊ฐ€ ์กด์žฌํ•œ๋‹ค. input๊ณผ output์˜ dimension์ด ๋‹ฌ๋ผ์ง€๋ฉด ๊ณ ๋ คํ•ด์•ผํ•  ๊ฒƒ์ด ๋งŽ์•„์ง„๋‹ค.

  1. Identity Shortcut Connection (the key one): the output of an earlier layer is added directly to the input of the current layer.

  2. Projection Shortcut Connection: the output of an earlier layer is linearly transformed (projected) to match the size or dimension of the current layer's input before being added. Used when the dimensions differ; the linear transformation keeps the dimensions consistent while preserving gradient flow.

  3. Dimension Matching Shortcut Connection: when the earlier layer's output and the current layer's input differ in dimension, an extra operation is performed to match them. Also used when dimensions differ; matching the dimensions preserves gradient flow and minimizes information loss.

  4. Skip Connection: the output of an earlier layer is passed straight through as the input of a later layer. The Identity Shortcut Connection can be seen as one kind of Skip Connection. By hopping over several layers of the network, skip connections shorten the gradient path, mitigating the vanishing gradient problem and reducing information loss.
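A minimal sketch contrasting kinds 1 and 2 above (my toy shapes, not torchvision's code). The projection is the \(W_s\) of Equ 2, implemented in practice as a strided 1x1 convolution:

import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)

# 1. identity shortcut: shapes already match, parameter-free addition
f_out = torch.randn(1, 64, 56, 56)    # stand-in for F(x)
y = f_out + x

# 2. projection shortcut: F doubles the channels and halves the spatial
#    size, so W_s (a strided 1x1 conv) must do the same before the add
w_s = nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False)
f_out2 = torch.randn(1, 128, 28, 28)  # stand-in for F(x) in the new shape
y2 = f_out2 + w_s(x)
print(y.shape, y2.shape)              # (1, 64, 56, 56) and (1, 128, 28, 28)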

์ด์ „์˜ ์—ฐ๊ตฌ๋“ค์—์„œ shortcut connection์€ โ€˜highway networksโ€™์—์„œ gating function์œผ๋กœ ์ด์šฉ๋˜์—ˆ๋‹ค๊ณ  ํ•œ๋‹ค. ์ด gates๋“ค์€ data-dependentํ•˜๊ณ  parameter๊ฐ€ ์žˆ์—ˆ์œผ๋ฉฐ ๋‹ซํž ์ˆ˜ ์žˆ์—ˆ๋‹ค๊ณ  ํ•œ๋‹ค. ํ•˜์ง€๋งŒ resnet์—์„œ์˜ shortcut connection์€ parameter-free, never closed๋ผ๊ณ  ํ•œ๋‹ค.

Code

# https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py
from functools import partial
from typing import Any, Callable, List, Optional, Type, Union

import torch
import torch.nn as nn
from torch import Tensor


def conv3x3(in_planes: int,
            out_planes: int,
            stride: int = 1,
            groups: int = 1,
            dilation: int = 1) -> nn.Conv2d:
    """3x3 convolution with padding"""
    return nn.Conv2d(in_planes, out_planes,
                     kernel_size=3,
                     stride=stride,
                     padding=dilation,
                     groups=groups,
                     bias=False,
                     dilation=dilation,
                     )

def conv1x1(in_planes: int,
            out_planes: int,
            stride: int = 1,) -> nn.Conv2d:
    """1x1 convolution"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False)

class BasicBlock(nn.Module):
    expansion: int = 1

    def __init__(
        self,
        inplanes: int,
        planes: int,
        stride: int = 1,
        downsample: Optional[nn.Module] = None,
        groups: int = 1,
        base_width: int = 64,
        dilation: int = 1,
        norm_layer: Optional[Callable[..., nn.Module]] = None,
    ) -> None:
        super().__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        if groups != 1 or base_width != 64:
            raise ValueError("BasicBlock only supports groups=1 and base_width=64")
        if dilation > 1:
            raise NotImplementedError("Dilation > 1 not supported in BasicBlock")
        # Both self.conv1 and self.downsample layers downsample the input when stride != 1
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = norm_layer(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = norm_layer(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x: Tensor) -> Tensor:
        identity = x

        # residual branch F(x): conv -> bn -> relu -> conv -> bn
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        # match the shortcut's dimensions when needed (projection shortcut)
        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity  # identity mapping: y = F(x) + x
        out = self.relu(out)
        return out

Implementation

../../_images/resnet_3.png

Fig. 9 vgg19 (19.6B FLOPs) vs. plain 34 layers (3.6B FLOPs) vs. 34 layers with shortcuts (3.6B FLOPs). The dotted lines mark where the dimensions increase; these spots are matched with the Equ 2 \(W_s\) 1x1 convolutions.

in training

  • The image is resized with its shorter side randomly sampled in [256, 480] for scale augmentation. A 224×224 crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted.

  • standard color augmentation

  • conv \(\to\) Batch Normalization \(\to\) non-linear activation f

    • plain ๋„คํŠธ์›Œํฌ์—์„œ๋„ ์‚ฌ์šฉ๋จ์œผ๋กœ์จ ์‹คํ—˜์ž์ฒด๊ฐ€ gradient vanishing problem ๋ณด๋‹ค๋Š” degradation problem์— ์ง‘์ค‘ํ•˜๋„๋ก ํ•จ.

    • ensures forward propagated signals to have non-zero variances.

    • ์ž…๋ ฅ ๋ฐ์ดํ„ฐ์˜ ๋ถ„์‚ฐ์„ ์กฐ์ •ํ•˜์—ฌ gradient์˜ ํฌ๊ธฐ๋ฅผ ์•ˆ์ •ํ™”ํ•˜์—ฌ gradient vanishing problem์„ ์™„ํ™”ํ•˜๋ฉฐ, ์ž‘์€ ๋ณ€ํ™”์—๋Š” ๋œ ๋ฏผ๊ฐํ•œ ๊ฐ•๊ฒ…ํ•œ ๋ชจ๋ธ์„ ๋งŒ๋“ค๊ณ , ๋น ๋ฅด๊ฒŒ ์ˆ˜๋ ดํ•˜๋„๋ก ๋•๋Š” ์—ญํ• ์„ ํ•œ๋‹ค.

    • ๋˜ํ•œ backward propagated gradients(์—ญ์ „ํŒŒ๋œ ์†์‹คํ•จ์ˆ˜ ์ตœ์†Œํ™” ๊ฐ€์ค‘์น˜ ๋ฏธ๋ถ„๊ฐ’)๊ฐ€ ๊ฑด๊ฐ•ํ•˜๊ณ  ์ ์ ˆํ•œ ํฌ๊ธฐ๋ฅผ ์œ ์ง€ํ•˜๋Š” ๋ฐ ๋„์›€๋œ๋‹ค.

  • weight initialization

  • SGD with mini-batch 256 size

  • learning rate 0.1, divided by 10 when the error plateaus

  • 60 x 10^4 iter

  • 0.0001 weight decay, momentum 0.9

  • no dropout
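Put together, the recipe above maps almost one-to-one onto a PyTorch optimizer. A sketch under my own assumptions (ReduceLROnPlateau approximates "divide the lr by 10 when the error plateaus", and ResNet/BasicBlock are the classes defined in the Code sections below):

import torch

model = ResNet(BasicBlock, [2, 2, 2, 2])
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1,             # initial learning rate 0.1
                            momentum=0.9,
                            weight_decay=1e-4)  # 0.0001 weight decay
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1)          # lr /= 10 on plateau
# train with mini-batches of 256 and, after each evaluation,
# call scheduler.step(val_error)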

in testing

  • standard 10-crop testing

  • fully convolutional form

  • average scores at multiple scales {224, 256, 384, 480, 640}

../../_images/resnet_4.png

  • (A) zero-padding shortcuts are used for increasing dimensions, and all shortcuts are parameter-free

  • (B) projection shortcuts are used for increasing dimensions, and other shortcuts are identity

  • (C) all shortcuts are projections

์œ„์—์„œ ๋ณด์ด๋Š” ABC๋Š” shortcut connection์„ ์–ด๋–ป๊ฒŒ ๊ตฌ์„ฑํ–ˆ๋Š”์ง€๊ฐ€ ๋‹ค๋ฅด๊ณ , c๋กœ ๊ฐˆ์ˆ˜๋ก ์„ฑ๋Šฅ์€ ๋‚˜์•„์กŒ์ง€๋งŒ, B๋งŒ์œผ๋กœ๋„ ์œ ์˜๋ฏธํ•œ ๋ฌธ์ œ ํ•ด๊ฒฐ์„ ๋ณด์ž„์œผ๋กœ ๋ชจ๋“  shortcut์ด projection shortcut์ด ๋  ํ•„์š”๋Š” ์—†๋‹ค๊ณ  ๋งํ•œ๋‹ค. ํ•ด๋‹น ๋…ผ๋ฌธ ์ดํ›„์— ๋‚˜์˜จ fishnet์ด๋ผ๋Š” ๋…ผ๋ฌธ์—์„œ๋Š” C์˜ ๋ฐฉ์‹์„ ์ ๊ทน์ฑ„์šฉํ•ด์„œ resnet๋ณด๋‹ค ์„ฑ๋Šฅ์„ ๋†’์˜€๋‹ค.

Deeper Bottleneck Architectures

../../_images/resnet_bottleneck.png

Fig. 10 Bottleneck design for deeper network architectures

../../_images/resnet_50.png

Fig. 11 The architecture of ResNet-50-vd. (a) Stem block; (b) Stage1-Block1; (c) Stage1-Block2; (d) FC-Block.

18, 34 layers์—์„œ๋Š” 3x3,3x3๋กœ 2๊ฐœ์˜ conv layer๋“ค์„ ์Œ“์•„์„œ ๋งŒ๋“ค์—ˆ์—ˆ๋‹ค๋ฉด, ๋” ๋‚˜์•„๊ฐ€์„œ 50,101,152 layers๋ฅผ ์œ„ํ•ด์„œ 1x1,3x3,1x1 ๋ฅผ ํ•˜๋‚˜์˜ ๋ธ”๋ก์œผ๋กœ ์‚ฌ์šฉํ•œ๋‹ค. ์ด๊ฒƒ์„ bottleneck block ์ด๋ผ๊ณ  ๋ถ€๋ฅธ๋‹ค.

The first 1x1 filter layer (a linear projection conv) is used to reduce or expand the input dimension (to match dimensions). This cuts the computational cost and makes it possible to extract features with fewer filters, though some information can be lost in the process.

๋‘๋ฒˆ์งธ 3x3 filter layer๋Š” bottleneck ์—ญํ• ์„ ์‹ค์งˆ์ ์œผ๋กœ ํ•˜๋Š” ๊ณต๊ฐ„์œผ๋กœ ์ฐจ์›์ด ์ค„์–ด๋“ ๋‹ค. ์ฐจ์›์ด ์ค„์–ด๋“ ๋‹ค๋Š” ๊ฒƒ์€ feature ํŠน์ง•์„ ์ถ”์ถœํ•˜๋Š” ๊ฒƒ์ด ์ ์–ด์ง„๋‹ค๋Š” ๊ฒƒ์ด๋ฉฐ, ๊ฐ€์ค‘์น˜๊ฐ€ ํฐ feature์— ์ง‘์ค‘ํ•˜๊ฒŒ ๋œ๋‹ค. ์ ์ฐจ์ ์œผ๋กœ output size๊ฐ€ 112x112 \(\to\) 56x56 \(\to\) 28x28 \(\to\) 14x14 \(\to\) 7x7 ๋กœ ์ค„์–ด๋“ค๋ฉด์„œ cnn์˜ ๊ธฐ๋ณธ์ ์ธ ์—ญํ• (๊ณต๊ฐ„์ ์ธ ํŠน์ง• ํ•™์Šต)์— ์ถฉ์‹คํ•˜๊ฒŒ ๋œ๋‹ค.

๋งˆ์ง€๋ง‰ 1x1 filter layer๋Š” ์ค„์–ด๋“  ์ฐจ์›์„ ๋‹ค์‹œ ๋Š˜๋ ค์ฃผ๋ฉด์„œ ์ฐจ์›์„ ๋ณด์กดํ•œ๋‹ค.

resnet50์˜ ๊ตฌ์กฐ๋„๋ฅผ ์ฐพ์€ ๊ฒƒ์ธ๋ฐ ๋…ผ๋ฌธ์—์„œ๋Š” stage๊ตฌ๋ถ„์ด ์—†์—ˆ๋Š”๋ฐ ์ด๊ฒƒ์€ ๊ตฌ๋ถ„์„ ํ•ด์„œ batch norm๋“ฑ์„ ๋‹ค๋ฅด๊ฒŒ ์‚ฌ์šฉํ•˜๊ณ  ์žˆ๋Š” ๊ฒƒ์œผ๋กœ ํ‘œํ˜„๋˜๊ณ  ์žˆ๋‹ค.

Code

class Bottleneck(nn.Module):
    expansion: int = 4
    
    def __init__(
        self,
        inplanes: int,
        planes: int,
        stride: int = 1,
        downsample: Optional[nn.Module] = None,
        groups: int = 1,
        base_width: int = 64,
        dilation: int = 1,
        norm_layer: Optional[Callable[..., nn.Module]] = None,
        ) -> None:
        super().__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        width = int(planes * (base_width / 64.0)) * groups
        
        self.conv1 = conv1x1(inplanes, width)
        self.bn1 = norm_layer(width)
        self.conv2 = conv3x3(width, width, stride, groups, dilation)
        self.bn2 = norm_layer(width)
        self.conv3 = conv1x1(width, planes * self.expansion)
        self.bn3 = norm_layer(planes * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride
        
    def forward(self, x: Tensor) -> Tensor:
        identity = x
        
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        
        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)
        
        out = self.conv3(out)
        out = self.bn3(out)
        
        if self.downsample is not None:
            identity = self.downsample(x)
            
        out += identity
        out = self.relu(out)
        return out


class ResNet(nn.Module):
    def __init__(
        self,
        block: Type[Union[BasicBlock, Bottleneck]],
        layers: List[int],
        num_classes: int = 1000,
        zero_init_residual: bool = False,
        groups: int = 1,
        width_per_group: int = 64,
        replace_stride_with_dilation: Optional[List[bool]] = None,
        norm_layer: Optional[Callable[..., nn.Module]] = None,
    ) -> None:
        super().__init__()
        # _log_api_usage_once(self)
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        self._norm_layer = norm_layer

        self.inplanes = 64
        self.dilation = 1
        if replace_stride_with_dilation is None:
            # each element in the tuple indicates if we should replace
            # the 2x2 stride with a dilated convolution instead
            replace_stride_with_dilation = [False, False, False]
        if len(replace_stride_with_dilation) != 3:
            raise ValueError(
                "replace_stride_with_dilation should be None "
                f"or a 3-element tuple, got {replace_stride_with_dilation}"
            )
        self.groups = groups
        self.base_width = width_per_group
        self.conv1 = nn.Conv2d(3, self.inplanes, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = norm_layer(self.inplanes)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2, dilate=replace_stride_with_dilation[0])
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2, dilate=replace_stride_with_dilation[1])
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2, dilate=replace_stride_with_dilation[2])
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
            elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

        # Zero-initialize the last BN in each residual branch,
        # so that the residual branch starts with zeros, and each residual block behaves like an identity.
        # This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677
        if zero_init_residual:
            for m in self.modules():
                if isinstance(m, Bottleneck) and m.bn3.weight is not None:
                    nn.init.constant_(m.bn3.weight, 0)  # type: ignore[arg-type]
                elif isinstance(m, BasicBlock) and m.bn2.weight is not None:
                    nn.init.constant_(m.bn2.weight, 0)  # type: ignore[arg-type]

    def _make_layer(
        self,
        block: Type[Union[BasicBlock, Bottleneck]],
        planes: int,
        blocks: int,
        stride: int = 1,
        dilate: bool = False,
    ) -> nn.Sequential:
        norm_layer = self._norm_layer
        downsample = None
        previous_dilation = self.dilation
        if dilate:
            self.dilation *= stride
            stride = 1
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                conv1x1(self.inplanes, planes * block.expansion, stride),
                norm_layer(planes * block.expansion),
            )

        layers = []
        layers.append(
            block(
                self.inplanes, planes, stride, downsample, self.groups, self.base_width, previous_dilation, norm_layer
            )
        )
        self.inplanes = planes * block.expansion
        for _ in range(1, blocks):
            layers.append(
                block(
                    self.inplanes,
                    planes,
                    groups=self.groups,
                    base_width=self.base_width,
                    dilation=self.dilation,
                    norm_layer=norm_layer,
                )
            )

        return nn.Sequential(*layers)

    def _forward_impl(self, x: Tensor) -> Tensor:
        # See note [TorchScript super()]
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)

        return x

    def forward(self, x: Tensor) -> Tensor:
        return self._forward_impl(x)

temp = ResNet(BasicBlock, [2, 2, 2, 2])
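The layers list is what selects the named variant; the configurations below come from the paper's Table 1 (torchvision builds them the same way):

resnet18 = ResNet(BasicBlock, [2, 2, 2, 2])
resnet34 = ResNet(BasicBlock, [3, 4, 6, 3])
resnet50 = ResNet(Bottleneck, [3, 4, 6, 3])
resnet101 = ResNet(Bottleneck, [3, 4, 23, 3])
resnet152 = ResNet(Bottleneck, [3, 8, 36, 3])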

torchinfo

from torchinfo import summary
summary(temp, input_size=(128,3,224,224))
==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
ResNet                                   [128, 1000]               --
โ”œโ”€Conv2d: 1-1                            [128, 64, 112, 112]       9,408
โ”œโ”€BatchNorm2d: 1-2                       [128, 64, 112, 112]       128
โ”œโ”€ReLU: 1-3                              [128, 64, 112, 112]       --
โ”œโ”€MaxPool2d: 1-4                         [128, 64, 56, 56]         --
โ”œโ”€Sequential: 1-5                        [128, 64, 56, 56]         --
โ”‚    โ””โ”€BasicBlock: 2-1                   [128, 64, 56, 56]         --
โ”‚    โ”‚    โ””โ”€Conv2d: 3-1                  [128, 64, 56, 56]         36,864
โ”‚    โ”‚    โ””โ”€BatchNorm2d: 3-2             [128, 64, 56, 56]         128
โ”‚    โ”‚    โ””โ”€ReLU: 3-3                    [128, 64, 56, 56]         --
โ”‚    โ”‚    โ””โ”€Conv2d: 3-4                  [128, 64, 56, 56]         36,864
โ”‚    โ”‚    โ””โ”€BatchNorm2d: 3-5             [128, 64, 56, 56]         128
โ”‚    โ”‚    โ””โ”€ReLU: 3-6                    [128, 64, 56, 56]         --
โ”‚    โ””โ”€BasicBlock: 2-2                   [128, 64, 56, 56]         --
โ”‚    โ”‚    โ””โ”€Conv2d: 3-7                  [128, 64, 56, 56]         36,864
โ”‚    โ”‚    โ””โ”€BatchNorm2d: 3-8             [128, 64, 56, 56]         128
โ”‚    โ”‚    โ””โ”€ReLU: 3-9                    [128, 64, 56, 56]         --
โ”‚    โ”‚    โ””โ”€Conv2d: 3-10                 [128, 64, 56, 56]         36,864
โ”‚    โ”‚    โ””โ”€BatchNorm2d: 3-11            [128, 64, 56, 56]         128
โ”‚    โ”‚    โ””โ”€ReLU: 3-12                   [128, 64, 56, 56]         --
โ”œโ”€Sequential: 1-6                        [128, 128, 28, 28]        --
โ”‚    โ””โ”€BasicBlock: 2-3                   [128, 128, 28, 28]        --
โ”‚    โ”‚    โ””โ”€Conv2d: 3-13                 [128, 128, 28, 28]        73,728
โ”‚    โ”‚    โ””โ”€BatchNorm2d: 3-14            [128, 128, 28, 28]        256
โ”‚    โ”‚    โ””โ”€ReLU: 3-15                   [128, 128, 28, 28]        --
โ”‚    โ”‚    โ””โ”€Conv2d: 3-16                 [128, 128, 28, 28]        147,456
โ”‚    โ”‚    โ””โ”€BatchNorm2d: 3-17            [128, 128, 28, 28]        256
โ”‚    โ”‚    โ””โ”€Sequential: 3-18             [128, 128, 28, 28]        8,448
โ”‚    โ”‚    โ””โ”€ReLU: 3-19                   [128, 128, 28, 28]        --
โ”‚    โ””โ”€BasicBlock: 2-4                   [128, 128, 28, 28]        --
โ”‚    โ”‚    โ””โ”€Conv2d: 3-20                 [128, 128, 28, 28]        147,456
โ”‚    โ”‚    โ””โ”€BatchNorm2d: 3-21            [128, 128, 28, 28]        256
โ”‚    โ”‚    โ””โ”€ReLU: 3-22                   [128, 128, 28, 28]        --
โ”‚    โ”‚    โ””โ”€Conv2d: 3-23                 [128, 128, 28, 28]        147,456
โ”‚    โ”‚    โ””โ”€BatchNorm2d: 3-24            [128, 128, 28, 28]        256
โ”‚    โ”‚    โ””โ”€ReLU: 3-25                   [128, 128, 28, 28]        --
โ”œโ”€Sequential: 1-7                        [128, 256, 14, 14]        --
โ”‚    โ””โ”€BasicBlock: 2-5                   [128, 256, 14, 14]        --
โ”‚    โ”‚    โ””โ”€Conv2d: 3-26                 [128, 256, 14, 14]        294,912
โ”‚    โ”‚    โ””โ”€BatchNorm2d: 3-27            [128, 256, 14, 14]        512
โ”‚    โ”‚    โ””โ”€ReLU: 3-28                   [128, 256, 14, 14]        --
โ”‚    โ”‚    โ””โ”€Conv2d: 3-29                 [128, 256, 14, 14]        589,824
โ”‚    โ”‚    โ””โ”€BatchNorm2d: 3-30            [128, 256, 14, 14]        512
โ”‚    โ”‚    โ””โ”€Sequential: 3-31             [128, 256, 14, 14]        33,280
โ”‚    โ”‚    โ””โ”€ReLU: 3-32                   [128, 256, 14, 14]        --
โ”‚    โ””โ”€BasicBlock: 2-6                   [128, 256, 14, 14]        --
โ”‚    โ”‚    โ””โ”€Conv2d: 3-33                 [128, 256, 14, 14]        589,824
โ”‚    โ”‚    โ””โ”€BatchNorm2d: 3-34            [128, 256, 14, 14]        512
โ”‚    โ”‚    โ””โ”€ReLU: 3-35                   [128, 256, 14, 14]        --
โ”‚    โ”‚    โ””โ”€Conv2d: 3-36                 [128, 256, 14, 14]        589,824
โ”‚    โ”‚    โ””โ”€BatchNorm2d: 3-37            [128, 256, 14, 14]        512
โ”‚    โ”‚    โ””โ”€ReLU: 3-38                   [128, 256, 14, 14]        --
โ”œโ”€Sequential: 1-8                        [128, 512, 7, 7]          --
โ”‚    โ””โ”€BasicBlock: 2-7                   [128, 512, 7, 7]          --
โ”‚    โ”‚    โ””โ”€Conv2d: 3-39                 [128, 512, 7, 7]          1,179,648
โ”‚    โ”‚    โ””โ”€BatchNorm2d: 3-40            [128, 512, 7, 7]          1,024
โ”‚    โ”‚    โ””โ”€ReLU: 3-41                   [128, 512, 7, 7]          --
โ”‚    โ”‚    โ””โ”€Conv2d: 3-42                 [128, 512, 7, 7]          2,359,296
โ”‚    โ”‚    โ””โ”€BatchNorm2d: 3-43            [128, 512, 7, 7]          1,024
โ”‚    โ”‚    โ””โ”€Sequential: 3-44             [128, 512, 7, 7]          132,096
โ”‚    โ”‚    โ””โ”€ReLU: 3-45                   [128, 512, 7, 7]          --
โ”‚    โ””โ”€BasicBlock: 2-8                   [128, 512, 7, 7]          --
โ”‚    โ”‚    โ””โ”€Conv2d: 3-46                 [128, 512, 7, 7]          2,359,296
โ”‚    โ”‚    โ””โ”€BatchNorm2d: 3-47            [128, 512, 7, 7]          1,024
โ”‚    โ”‚    โ””โ”€ReLU: 3-48                   [128, 512, 7, 7]          --
โ”‚    โ”‚    โ””โ”€Conv2d: 3-49                 [128, 512, 7, 7]          2,359,296
โ”‚    โ”‚    โ””โ”€BatchNorm2d: 3-50            [128, 512, 7, 7]          1,024
โ”‚    โ”‚    โ””โ”€ReLU: 3-51                   [128, 512, 7, 7]          --
โ”œโ”€AdaptiveAvgPool2d: 1-9                 [128, 512, 1, 1]          --
โ”œโ”€Linear: 1-10                           [128, 1000]               513,000
==========================================================================================
Total params: 11,689,512
Trainable params: 11,689,512
Non-trainable params: 0
Total mult-adds (Units.GIGABYTES): 232.20
==========================================================================================
Input size (MB): 77.07
Forward/backward pass size (MB): 5087.67
Params size (MB): 46.76
Estimated Total Size (MB): 5211.49
==========================================================================================

torchinfo๋Š” ์š”์ฆ˜์€ torchsummary, torchsummaryX๊ฐ€ update๋ฅผ ํ•˜์ง€์•Š๋Š” ์ƒํ™ฉ์—์„œ ์ข‹์€ ๋Œ€์•ˆ์ด๋‹ค. ๋ฌผ๋ก  ํ•„์š” memory๋ฅผ ๊ณ„์‚ฐํ•˜๋Š”๋ฐ ์‹œ๊ฐ„์ด ๊ฝค๋‚˜ ๊ฑธ๋ฆฌ๋Š” ๊ฒƒ ๊ฐ™์ง€๋งŒ, GPU ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๊ณ ๋ คํ•ด์„œ ์–ผ๋งˆ๋‚˜์˜ batch size๋ฅผ ๋ฏธ๋ฆฌ ์ƒ๊ฐํ•ด๋ณด๋Š”๋ฐ ์ข‹์€ tool๋กœ ๋ณด์ธ๋‹ค.

Netron

# import torch.onnx
# params = temp.state_dict()
# dummy_data = torch.empty(1,3,224,224,dtype=torch.float32)
# torch.onnx.export(temp, dummy_data,'onnx_test.onnx')
================ Diagnostic Run torch.onnx.export version 2.0.0 ================
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================
../../_images/onnx_test.svg

Fig. 12 onnx๋กœ ๊ทธ๋ ค๋ณด๊ณ  svg๋กœ ์ €์žฅํ•œ resnet18.#

์—ฌ๊ธฐ์„œ W๋Š” weight๋ฅผ ๋‚˜ํƒ€๋‚ด๊ณ  ์ˆœ์œผ๋กœ ํ‘œ์‹œ๋œ๋‹ค. ๊ทธ๋ฆฌ๊ณ  B๋Š” bias์ด๋‹ค.

Result

ILSVRC'15

../../_images/resnet_imagenet.png

Fig. 13 โ€œImageNet Large Scale Visual Recognition Challengeโ€ ILSVRC๋Š” 2010๋…„๋ถ€ํ„ฐ 2017๋…„๊นŒ์ง€ ๋งค๋…„ ๊ฐœ์ตœ๋œ ์ด๋ฏธ์ง€ ์ธ์‹ ๋Œ€ํšŒ#

  1. 2011 XRCE

  2. 2012 AlexNet (8 conv, 3 FC)

  3. 2013 ZFNet (a bit deeper than AlexNet, with smaller filters)

  4. 2014 GoogleNet (inception modules applying several filters in parallel) - VGG (simple, deeper than AlexNet)

  5. 2015 ResNet

  6. 2016 GoogleNet-v4

  7. 2017 SENet (squeeze-and-excitation modules emphasizing inter-channel dependencies)
