Linear Regression & Gradient Descent

Hypothesis Function

Hypothesis Definition

To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function \(h : X \mapsto Y\) so that \(h(x)\) is a “good” predictor for the corresponding value of \(y\). For historical reasons, this function \(h\) is called a hypothesis.

A hypothesis is a function that captures the relationship between the inputs (features, x) and the output (target/label, y). The more inputs we consider, the better the chance of producing a meaningful hypothesis, but the amount of computation grows exponentially with them, so in practice we cannot use everything. We therefore select the features that look meaningful and build the hypothesis from those. To start, let us set up a hypothesis as a linear function of one independent variable (x) and one dependent variable (y).

์˜†์— ๋ณด์ด๋Š” ๊ฒƒ์ด ์ˆ˜์‹ ์ •์˜์— ํ•„์š”ํ•œ ์ˆ˜ํ•™์  ๊ธฐํ˜ธ์˜ ์ •์˜๋“ค์ด๋‹ค.

\[ h_{\theta}(x) = \theta_{0} + \theta_{1}x \]

Given the data, we want to find the \(\theta_{0}\) and \(\theta_{1}\) for which the difference between \(h_{\theta}(x)\) and the actual target value \(y\), i.e. \(error = h_{\theta}(x) - y\), is smallest. The function that measures this quantity (variously called the error, loss, cost, or residual) and that we then minimize is called the cost function.
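As a minimal sketch, the hypothesis and its per-sample error can be written directly in Python; the function names here are illustrative, not part of any library.

```python
def h(theta0, theta1, x):
    """Linear hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def error(theta0, theta1, x, y):
    """Difference between the prediction and the target, h_theta(x) - y."""
    return h(theta0, theta1, x) - y

# Example: with theta0 = 1 and theta1 = 2, the point (x=3, y=7) is fit exactly.
print(error(1.0, 2.0, 3.0, 7.0))  # 0.0
```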

Cost Function

์œ„์—์„œ ๋ณด์•˜๋“ฏ์ด ์šฐ๋ฆฌ๋Š” ๊ฐ€์„คํ•จ์ˆ˜ \(h_{\theta}(x)\)์™€ ์‹ค์ œ๊ฐ’ y์™€์˜ ๋น„์šฉ(์ž”์ฐจ error loss)๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ๋ฐฉ๋ฒ•์œผ๋กœ ๋จธ์‹ ์„ ํ•™์Šต์‹œํ‚ค๋Š” ๊ฒƒ์ด ๋ชฉํ‘œ์ด๋‹ค. ๊ฒฐ๊ตญ ์ด cost function์— \(x\) input์„ ๋„ฃ์—ˆ์„ ๋•Œ ๋‚˜์˜ค๋Š” ๊ฐ’ \(J(\theta)\)์„ ์ตœ์†Œํ™”ํ•˜๋Š” ๊ฒƒ์ด learning algorithm(์—ฌ๊ธฐ์„œ๋Š” supervised learning, linear regression)์˜ ์„ฑ๋Šฅ์„ ๋†’์ด๋Š” ๊ฒƒ์ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

๊ธฐ๋ณธ์ ์ธ cost function์€ LSE(least squared error)์ด๋‹ค. error์— ์ œ๊ณฑ์„ ํ•ด์ฃผ๊ณ  ๊ทธ๊ฒƒ์˜ ํ•ฉ์„ ๋”ํ•ด์ค˜์„œ ์ตœ์†Œ๊ฐ’์„ ์ฐพ๋Š” ๋ฐฉ์‹์ด๋‹ค. ์—ฌ๊ธฐ์„œ 1/m์„ ํ•ด์ฃผ๋ฉด mean squared error(MSE)๊ฐ€ ๋œ๋‹ค.

\[\begin{split} \begin{align} \begin{split} J(\theta_0, \theta_1) &= \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^2\\ &= \frac{1}{2m}\sum_{i=1}^{m}(\theta_{0} + \theta_{1}x^{(i)} - y^{(i)})^2\\ \end{split}\\ Goal = \min\limits_{\theta_{0}, \theta_{1}} J(\theta_0, \theta_1) \end{align} \end{split}\]
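The cost \(J(\theta_0, \theta_1)\) above can be sketched with only the standard library; the variable names and sample data below are illustrative assumptions.

```python
def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = (1/2m) * sum of squared errors over the training set."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # these points lie exactly on the line y = 2x

print(cost(0.0, 2.0, xs, ys))  # 0.0 -- the perfect fit has zero cost
print(cost(0.0, 0.0, xs, ys))  # positive -- any other line costs more
```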

์™œ m์ด ์•„๋‹ˆ๋ผ 2m์ธ๊ฐ€?

ํ‰๊ท ์ด๋ผ๋ฉด m ๋ฐ์ดํ„ฐ์˜ ๊ฐฏ์ˆ˜๋Œ€๋กœ ๋‚˜๋ˆ ์ค˜์•ผํ•˜๋Š”๊ฒŒ ์•„๋‹Œ๊ฐ€ ํ•˜๋Š” ์˜๋ฌธ์ด ๋“ค ์ˆ˜ ์žˆ๋Š”๋ฐ, ๋’ค์— ๋ฏธ๋ถ„์„ ํ•˜๊ฒŒ ๋˜๋ฉด ^2๊ฐ€ ์•ž์œผ๋กœ ํŠ€์–ด๋‚˜์™€ 1/2์™€ ์ƒ์‡„๋˜์–ด 1์ด ๋˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. \(\theta\)๋ฅผ ์–ด๋–ค ์ƒ์ˆ˜๋กœ ๋‚˜๋ˆ ๋„ ์ƒ๊ด€์—†๊ธฐ ๋•Œ๋ฌธ์— ๊ฐ€๋Šฅํ•˜๋‹ค.

In fact, one can also find the \(\theta\)s directly, without explicitly working out the gradient of the cost function, by simply applying the update w = w + learning_rate * error * x (where error = y - h(x)) sample by sample.

Gradient Descent

Why use gradient descent?

  1. Optimizing the cost function by updating the parameters after every single data sample takes a long time. As the dataset grows, updating the parameters iteratively on mini-batches of a few examples each can be more useful for very large datasets, both in computation and in optimization time.

  2. It can be used even when no closed-form solution exists.

  3. It can also be used for nonlinear functions, or when derivatives and roots are hard to compute analytically.

Gradient descent is an iterative method that finds the minimum of a function using its first derivative. It is also called steepest descent, because it resembles climbing down a mountain one step at a time when you cannot see ahead and can only tell which way is up and which is down. Just as the goal of the descent is the foot of the mountain, the goal here is the minimum of the cost function, and the point where the cost function is minimized gives the optimal parameters \(\theta\) of the hypothesis function.

Algorithm 1 (Gradient Descent Algorithm)

  • start with some \({\theta}_0\), \({\theta}_1\)

  • keep changing \({\theta}_0\), \({\theta}_1\) to reduce \(J({\theta}_0, {\theta}_1)\) until we hopefully end up at a minimum

repeat until convergence {
\({\theta}_j := {\theta}_j - {\alpha}\frac{\partial}{\partial{\theta}_j} J({\theta}_0, {\theta}_1)\quad\text{(simultaneously for } j=0 \text{ and } j=1\text{)}\)
}

์œ„์˜ gradient descent ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์šฐ๋ฆฌ์˜ cost function์— ์ ์šฉํ•ด๋ณด์ž.

\[ \frac \partial {\partial{\theta}_j} J({\theta}_0, {\theta}_1) = \frac \partial {\partial{\theta}_j} \bigg[\frac{1}{2m}\sum_{i=1}^{m}(\theta_{0} + \theta_{1}x^{(i)} - y^{(i)})^2\bigg] \]
\[ j = 0 : \frac \partial {\partial{\theta}_0} J({\theta}_0, {\theta}_1) = \frac{1}{m}\sum_{i=1}^{m}(\theta_{0} + \theta_{1}x^{(i)} - y^{(i)}) \]
\[ j = 1 : \frac \partial {\partial{\theta}_1} J({\theta}_0, {\theta}_1) = \frac{1}{m}\sum_{i=1}^{m}(\theta_{0} + \theta_{1}x^{(i)} - y^{(i)})x^{(i)} \]

Rewriting in terms of \(h_\theta\):

\[ {\theta}_0 := {\theta}_0 - {\alpha}\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)}) \]
\[ {\theta}_1 := {\theta}_1 - {\alpha}\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x^{(i)} \]
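The two update rules can be sketched as batch gradient descent in plain Python; the data (points on y = 1 + 2x), learning rate, and iteration count below are illustrative assumptions.

```python
def gradient_descent(xs, ys, alpha=0.1, iters=1000):
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        errs = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]  # h(x) - y
        grad0 = sum(errs) / m                                     # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errs, xs)) / m          # dJ/dtheta1
        # simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

theta0, theta1 = gradient_descent([1.0, 2.0, 3.0], [3.0, 5.0, 7.0])
print(theta0, theta1)  # converges toward theta0 = 1, theta1 = 2 (y = 1 + 2x)
```

Note the simultaneous update: both gradients are computed from the current \(\theta\)s before either parameter is changed.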