practice - torch sklearn numpy#

sklearn and numpy for linear regression and gradient descent

Let's implement linear regression using data from the Kaggle House Prices - Advanced Regression Techniques competition.

์šฐ๋ฆฌ์˜ SalesPrice๊ฐ€ ๊ตฌํ•˜๊ธฐ๋ฅผ ์›ํ•˜๋Š” y์ด๊ณ  ์ด๊ฒƒ์€ ์—ฐ์†์ ์ธ(continuous)ํ•œ value์ด๊ธฐ ๋•Œ๋ฌธ์— linear regression์„ ์‚ฌ์šฉํ•˜๋Š” ๊ณผ์ œ๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค. GriLivArea(Above grade(ground) living area square feet)์€ cs229์—์„œ ๋งํ•˜๋Š” size(feet^2)์™€ ๊ฐ€์žฅ ์œ ์‚ฌํ•œ column์ด๋ผ๊ณ  ์ƒ๊ฐ๋˜์–ด์„œ ๋ฝ‘์•˜๋‹ค. ๋‹จ์ˆœํ•˜๊ฒŒ scatter plot์„ ํ•ด๋ด๋„ ์‚ฌ์ด๋“œ๋กœ ๋งŽ์ด ๋น ์ง„ ๋ช‡ outlier๋“ค์„ ์ œ์™ธํ•˜๋ฉด ์–ด๋Š ์ •๋„์˜ linear ๊ด€๊ณ„๋ฅผ ๋ณผ ์ˆ˜ ์žˆ์„ ๊ฑฐ๋ผ๊ณ  ์ƒ๊ฐ๋œ๋‹ค.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from IPython.display import display, Markdown

train = pd.read_csv('./files/train.csv')
train
[DataFrame preview truncated: 1460 rows ร— 81 columns, from Id, MSSubClass, MSZoning, LotFrontage, ... through SaleType, SaleCondition, SalePrice]

train[['SalePrice','GrLivArea']].plot.scatter(x='GrLivArea', y='SalePrice')
<Axes: xlabel='GrLivArea', ylabel='SalePrice'>
[scatter plot: GrLivArea (x) vs SalePrice (y)]

standardization#

\[\begin{equation*} z = \frac{x - \mu}{\sigma} \end{equation*}\]
  • \(\mu\)๋Š” ํ‰๊ท , \(\sigma\)๋Š” ํ‘œ์ค€ํŽธ์ฐจ

  • \(z\)๋Š” ํ‘œ์ค€ํ™”๋œ ๊ฐ’์œผ๋กœ, ํ‰๊ท ์œผ๋กœ๋ถ€ํ„ฐ ์–ผ๋งˆ๋‚˜ ๋–จ์–ด์ ธ ์žˆ์œผ๋ฉฐ, ๊ทธ ๊ฑฐ๋ฆฌ๋ฅผ ํ‘œ์ค€ํŽธ์ฐจ์˜ ๋ช‡ ๋ฐฐ์ˆ˜๋งŒํผ ๋–จ์–ด์ ธ ์žˆ๋Š”์ง€๋ฅผ ๋‚˜ํƒ€๋‚ธ๋‹ค.

  • ๋ฐ์ดํ„ฐ๋ฅผ ํ‰๊ท ์ด 0์ด๊ณ , ํ‘œ์ค€ํŽธ์ฐจ๊ฐ€ 1์ธ ๊ฐ’์œผ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๊ฒƒ.

  • ๋ฐ์ดํ„ฐ์˜ ๋ฒ”์œ„๋ฅผ ์ผ์ •ํ•˜๊ฒŒ ์กฐ์ •ํ•˜๊ณ , ๋‹ค์–‘ํ•œ ์Šค์ผ€์ผ์„ ๊ฐ€์ง„ ๋ณ€์ˆ˜๋“ค๊ฐ„ ๋น„๊ต ๊ฐ€๋Šฅํ•˜๋„๋ก ๋งŒ๋“ฆ.

  • outlier์— ์˜ํ–ฅ์„ ๋ฐ›์„ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ๋ฐ์ดํ„ฐ์˜ ๋ถ„ํฌ์— ๋”ฐ๋ผ (๊ฐ€์šฐ์‹œ์•ˆ normal distribution์ด ์•„๋‹ ๊ฒฝ์šฐ) ๋‹ค๋ฅธ scaler๋ฅผ ์‚ฌ์šฉํ•ด์•ผ ํ•  ์ˆ˜ ์žˆ๋‹ค.

  • ์ž…๋ ฅ๋ณ€์ˆ˜ X๋ฅผ standardization ํ•˜์ง€ ์•Š๊ณ  ํ•™์Šตํ•  ๊ฒฝ์šฐ์— ๊ฐ€์ค‘์น˜์˜ ๊ฐ’์ด ์ œ๋Œ€๋กœ ํ•™์Šต๋˜์ง€ ์•Š์„ ์ˆ˜ ์žˆ๋‹ค. ์Šค์ผ€์ผ์ด ๋‹ค๋ฅด๊ธฐ ๋•Œ๋ฌธ์—.

X = train['GrLivArea'].values.reshape(-1,1)
y = train['SalePrice'].values.reshape(-1,1)
X = (X - X.mean()) / X.std()
y = (y - y.mean()) / y.std()
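The manual z-score above can equally be done with scikit-learn's `StandardScaler`, which uses the same population standard deviation (`ddof=0`) as `ndarray.std()`. A minimal sketch on synthetic data, since the Kaggle CSV is not bundled here:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for train['GrLivArea'] (the real CSV is assumed unavailable)
rng = np.random.default_rng(0)
X_raw = rng.normal(1500, 500, size=(100, 1))

# manual z-score, matching the formula above (numpy std defaults to ddof=0)
X_manual = (X_raw - X_raw.mean()) / X_raw.std()

# StandardScaler equivalent
X_scaled = StandardScaler().fit_transform(X_raw)

print(np.allclose(X_manual, X_scaled))  # True: the two methods agree
```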

Sklearn linear regression#

lr = LinearRegression()
lr.fit(X, y)
y_pred = lr.predict(X)

plt.scatter(X, y)
plt.plot(X, y_pred, color='red')
plt.show()

result_str =  r"$h(\Theta)$ = {:.2f}x + {:.2f}".format(lr.coef_[0][0], lr.intercept_[0])
display(Markdown(result_str))
[scatter plot of standardized data with the fitted regression line in red]

$h(\Theta)$ = 0.71x + 0.00
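The slope 0.71 is no accident: once both X and y are standardized, the OLS intercept is 0 and the slope equals the Pearson correlation coefficient (slope = cov(x, y) / var(x), and both variances are 1). A quick check on synthetic data, since the CSV is assumed unavailable here:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic stand-in for the GrLivArea / SalePrice pair
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.7 * x + rng.normal(scale=0.7, size=200)

# standardize both, as in the section above
X = ((x - x.mean()) / x.std()).reshape(-1, 1)
Y = ((y - y.mean()) / y.std()).reshape(-1, 1)

lr = LinearRegression().fit(X, Y)

# slope should match the Pearson r; intercept should be ~0
r = np.corrcoef(X.ravel(), Y.ravel())[0, 1]
print(f"slope={lr.coef_[0][0]:.4f}  pearson r={r:.4f}  intercept={lr.intercept_[0]:.1e}")
```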

Numpy implementation#

learning_rate = 1e-1  # renamed from lr to avoid shadowing the LinearRegression instance above
n_epochs = 5000
a = np.random.randn(1)  # intercept
b = np.random.randn(1)  # slope

for epoch in range(n_epochs):
    y_hat = a + b * X
    error = y - y_hat
    loss = (error**2).mean()  # MSE

    # gradients of the MSE with respect to a and b
    a_grad = -2 * error.mean()
    b_grad = -2 * (X * error).mean()

    a = a - learning_rate * a_grad
    b = b - learning_rate * b_grad
    
result_str =  r"$h(\Theta)$ = {:.2f}x + {:.2f}".format(b[0], a[0])
display(Markdown(result_str))

$h(\Theta)$ = 0.71x + 0.00
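The page title mentions torch, but no torch code appears above; a sketch of the same gradient descent loop using torch autograd, on synthetic standardized data (assuming PyTorch is installed), where the hand-derived `a_grad` / `b_grad` are replaced by `loss.backward()`:

```python
import numpy as np
import torch

# synthetic standardized data standing in for GrLivArea / SalePrice
rng = np.random.default_rng(0)
x_np = rng.normal(size=(200, 1))
y_np = 0.7 * x_np + rng.normal(scale=0.7, size=(200, 1))
x_np = (x_np - x_np.mean()) / x_np.std()
y_np = (y_np - y_np.mean()) / y_np.std()

X_t = torch.tensor(x_np, dtype=torch.float32)
y_t = torch.tensor(y_np, dtype=torch.float32)

# same model y_hat = a + b * X, but gradients come from autograd
a = torch.randn(1, requires_grad=True)
b = torch.randn(1, requires_grad=True)
learning_rate = 1e-1

for epoch in range(1000):
    y_hat = a + b * X_t
    loss = ((y_t - y_hat) ** 2).mean()   # MSE, as in the numpy loop
    loss.backward()                      # autograd fills a.grad and b.grad
    with torch.no_grad():                # plain SGD step, outside the graph
        a -= learning_rate * a.grad
        b -= learning_rate * b.grad
        a.grad.zero_()                   # grads accumulate unless cleared
        b.grad.zero_()

print(f"{b.item():.2f}x + {a.item():.2f}")
```

The `torch.no_grad()` block and the `zero_()` calls are the two details that most often trip people up when porting a manual-gradient loop to autograd.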

manim test#

# from manim import *
# from manim import config; config.media_embed=True
# %%manim -v WARNING  --progress_bar None -r 400,200 --format=gif --disable_caching HelloManim

# class HelloManim(Scene):
#     def construct(self):
#         self.camera.background_color = "#ece6e2"
#         banner_large = ManimBanner(dark_theme=False).scale(0.7)
#         self.play(banner_large.create())
#         self.play(banner_large.expand())