Linear Regression & Gradient Descent

Hypothesis Function

Hypothesis Definition

To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function \(h : X \mapsto Y\) so that \(h(x)\) is a “good” predictor for the corresponding value of \(y\). For historical reasons, this function \(h\) is called a hypothesis.

A hypothesis is a function that captures the relationship between the inputs (features, x) and the output (target/label, y). The more inputs we consider, the better the chance of producing a meaningful hypothesis, but the amount of computation grows exponentially with them, so in practice we cannot use everything. We therefore select the features that look meaningful and build the hypothesis from those. To start, let us set up a hypothesis as a linear function of one independent variable (x) and one dependent variable (y).

์˜†์— ๋ณด์ด๋Š” ๊ฒƒ์ด ์ˆ˜์‹ ์ •์˜์— ํ•„์š”ํ•œ ์ˆ˜ํ•™์  ๊ธฐํ˜ธ์˜ ์ •์˜๋“ค์ด๋‹ค.

\[ h_{\theta}(x) = \theta_{0} + \theta_{1}x \]

Given the data, we want to find the \(\theta_{0}\) and \(\theta_{1}\) for which the difference between \(h_{\theta}(x)\) and the actual target value \(y\), i.e. \(error = h_{\theta}(x) - y\), is smallest. The function that measures this quantity (variously called the error, loss, cost, or residual) and that we then minimize is called the cost function.
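As a minimal sketch, the hypothesis and its per-sample error can be written directly in Python; the function names here are illustrative, not part of any library.

```python
def h(theta0, theta1, x):
    """Linear hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def error(theta0, theta1, x, y):
    """Difference between the prediction and the target, h_theta(x) - y."""
    return h(theta0, theta1, x) - y

# Example: with theta0 = 1 and theta1 = 2, the point (x=3, y=7) is fit exactly.
print(error(1.0, 2.0, 3.0, 7.0))  # 0.0
```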

Cost Function

์œ„์—์„œ ๋ณด์•˜๋“ฏ์ด ์šฐ๋ฆฌ๋Š” ๊ฐ€์„คํ•จ์ˆ˜ \(h_{\theta}(x)\)์™€ ์‹ค์ œ๊ฐ’ y์™€์˜ ๋น„์šฉ(์ž”์ฐจ error loss)๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ๋ฐฉ๋ฒ•์œผ๋กœ ๋จธ์‹ ์„ ํ•™์Šต์‹œํ‚ค๋Š” ๊ฒƒ์ด ๋ชฉํ‘œ์ด๋‹ค. ๊ฒฐ๊ตญ ์ด cost function์— \(x\) input์„ ๋„ฃ์—ˆ์„ ๋•Œ ๋‚˜์˜ค๋Š” ๊ฐ’ \(J(\theta)\)์„ ์ตœ์†Œํ™”ํ•˜๋Š” ๊ฒƒ์ด learning algorithm(์—ฌ๊ธฐ์„œ๋Š” supervised learning, linear regression)์˜ ์„ฑ๋Šฅ์„ ๋†’์ด๋Š” ๊ฒƒ์ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

๊ธฐ๋ณธ์ ์ธ cost function์€ LSE(least squared error)์ด๋‹ค. error์— ์ œ๊ณฑ์„ ํ•ด์ฃผ๊ณ  ๊ทธ๊ฒƒ์˜ ํ•ฉ์„ ๋”ํ•ด์ค˜์„œ ์ตœ์†Œ๊ฐ’์„ ์ฐพ๋Š” ๋ฐฉ์‹์ด๋‹ค. ์—ฌ๊ธฐ์„œ 1/m์„ ํ•ด์ฃผ๋ฉด mean squared error(MSE)๊ฐ€ ๋œ๋‹ค.

\[\begin{split} \begin{align} \begin{split} J(\theta_0, \theta_1) &= \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})^2\\ &= \frac{1}{2m}\sum_{i=1}^{m}(\theta_{0} + \theta_{1}x^{(i)} - y^{(i)})^2\\ \end{split}\\ Goal = \min\limits_{\theta_{0}, \theta_{1}} J(\theta_0, \theta_1) \end{align} \end{split}\]
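The cost \(J(\theta_0, \theta_1)\) above can be sketched with only the standard library; the variable names and sample data below are illustrative assumptions.

```python
def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = (1/2m) * sum of squared errors over the training set."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # these points lie exactly on the line y = 2x

print(cost(0.0, 2.0, xs, ys))  # 0.0 -- the perfect fit has zero cost
print(cost(0.0, 0.0, xs, ys))  # positive -- any other line costs more
```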

์™œ m์ด ์•„๋‹ˆ๋ผ 2m์ธ๊ฐ€?

ํ‰๊ท ์ด๋ผ๋ฉด m ๋ฐ์ดํ„ฐ์˜ ๊ฐฏ์ˆ˜๋Œ€๋กœ ๋‚˜๋ˆ ์ค˜์•ผํ•˜๋Š”๊ฒŒ ์•„๋‹Œ๊ฐ€ ํ•˜๋Š” ์˜๋ฌธ์ด ๋“ค ์ˆ˜ ์žˆ๋Š”๋ฐ, ๋’ค์— ๋ฏธ๋ถ„์„ ํ•˜๊ฒŒ ๋˜๋ฉด ^2๊ฐ€ ์•ž์œผ๋กœ ํŠ€์–ด๋‚˜์™€ 1/2์™€ ์ƒ์‡„๋˜์–ด 1์ด ๋˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. \(\theta\)๋ฅผ ์–ด๋–ค ์ƒ์ˆ˜๋กœ ๋‚˜๋ˆ ๋„ ์ƒ๊ด€์—†๊ธฐ ๋•Œ๋ฌธ์— ๊ฐ€๋Šฅํ•˜๋‹ค.

In fact, one can also find the \(\theta\)s directly, without explicitly working out the gradient of the cost function, by simply applying the update w = w + learning_rate * error * x (where error = y - h(x)) sample by sample.

Gradient Descent

Why use gradient descent?

  1. Optimizing the cost function by updating the parameters after every single data sample takes a long time. As the dataset grows, updating the parameters iteratively on mini-batches of a few examples each can be more useful for very large datasets, both in computation and in optimization time.

  2. It can be used even when no closed-form solution exists.

  3. It can also be used for nonlinear functions, or when derivatives and roots are hard to compute analytically.

Gradient descent is an iterative method that finds the minimum of a function using its first derivative. It is also called steepest descent, because it resembles climbing down a mountain one step at a time when you cannot see ahead and can only tell which way is up and which is down. Just as the goal of the descent is the foot of the mountain, the goal here is the minimum of the cost function, and the point where the cost function is minimized gives the optimal parameters \(\theta\) of the hypothesis function.

Algorithm 1 (Gradient Descent Algorithm)

  • start with some \({\theta}_0\), \({\theta}_1\)

  • keep changing \({\theta}_0\), \({\theta}_1\) to reduce \(J({\theta}_0, {\theta}_1)\) until we hopefully end up at a minimum

repeat until convergence {
\({\theta}_j := {\theta}_j - {\alpha}\frac{\partial}{\partial{\theta}_j} J({\theta}_0, {\theta}_1)\quad\text{(simultaneously for } j=0 \text{ and } j=1\text{)}\)
}

์œ„์˜ gradient descent ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์šฐ๋ฆฌ์˜ cost function์— ์ ์šฉํ•ด๋ณด์ž.

\[ \frac \partial {\partial{\theta}_j} J({\theta}_0, {\theta}_1) = \frac \partial {\partial{\theta}_j} \bigg[\frac{1}{2m}\sum_{i=1}^{m}(\theta_{0} + \theta_{1}x^{(i)} - y^{(i)})^2\bigg] \]
\[ j = 0 : \frac \partial {\partial{\theta}_0} J({\theta}_0, {\theta}_1) = \frac{1}{m}\sum_{i=1}^{m}(\theta_{0} + \theta_{1}x^{(i)} - y^{(i)}) \]
\[ j = 1 : \frac \partial {\partial{\theta}_1} J({\theta}_0, {\theta}_1) = \frac{1}{m}\sum_{i=1}^{m}(\theta_{0} + \theta_{1}x^{(i)} - y^{(i)})x^{(i)} \]

Rewriting in terms of \(h_\theta\):

\[ {\theta}_0 := {\theta}_0 - {\alpha}\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)}) \]
\[ {\theta}_1 := {\theta}_1 - {\alpha}\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x^{(i)} \]
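The two update rules can be sketched as batch gradient descent in plain Python; the data (points on y = 1 + 2x), learning rate, and iteration count below are illustrative assumptions.

```python
def gradient_descent(xs, ys, alpha=0.1, iters=1000):
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        errs = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]  # h(x) - y
        grad0 = sum(errs) / m                                     # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errs, xs)) / m          # dJ/dtheta1
        # simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

theta0, theta1 = gradient_descent([1.0, 2.0, 3.0], [3.0, 5.0, 7.0])
print(theta0, theta1)  # converges toward theta0 = 1, theta1 = 2 (y = 1 + 2x)
```

Note the simultaneous update: both gradients are computed from the current \(\theta\)s before either parameter is changed.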