# 2.1: Linear Regression Using SGD

08 Sep 2017Linear relationships are some of the simplest possible connections between two variables — yet they show up surprisingly often in nature. If the mass of an object doubles, then the force of gravity on it doubles as well. If the current through a copper wire is halved, the voltage is halved as well.

Linear relationships are also relatively easy to see and characterize in data. Anyone can draw a line through points on a scatterplot. But how do we train a program to do the same?

### Data

We’re going to tackle a fairly standard machine learning problem: predicting
**house prices** from their area/square footage.

While the footage a house spans is important to its price (mansions sell for far more than apartments), size isn’t everything. Two houses with the same square footage could have very different prices due to school zoning, amenities, and other factors. Thus, our model of house pricing will likely be highly predictive, but imperfect.

The housing dataset we use is a 200-house subset of data from the Kaggle housing competition. It is hosted in a CSV file on the course github.

We load the CSV in a new Python file using Pandas, then create a scatterplot using Seaborn.

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
url = "https://raw.githubusercontent.com/nikcheerla/deeplearningschool/master/examples/data/housing.csv"
data = pd.read_csv(url)
area = data["Square Feet (Millions)"]
price = data["Price ($, Millions)"]
sns.jointplot(area, price);
plt.show()
```

### Modeling

Now, we develop a linear model to predict house price from house area. We define:

Thus, the price of a house is equal to the area of the house, multiplied by a parameter $m$. For some ideal value of $m$, the resultant line should fit the data points well. We choose $m=1$ as a reasonable initial starting point, and we can find the best value of $m$ through learning.

This linear prediction is represented in a simple `predict()`

function:

```
M = 1
# Takes in M + numpy array of areas, returns price predictions
def predict(M, input_areas):
return M*input_areas
```

### Goal Evaluation

Our goal is to accurately predict housing prices. One way to measure how far
away our predictions are from the actual values is using the **mean-squared error**
or MSE: the mean of the squared differences between our predictions and
the actual values.

We can write a function `evaluate()`

that computes this mean-squared error,
comparing the predicted price for a given value of $M$ with the actual price.

```
# Evaluates MSE of model y = Mx
def evaluate(M):
price_predicted = predict(M, area)
MSE = ((price - price_predicted)**2).mean()
return MSE
print (evaluate(M=1)) # Error is 0.3258 when M=1
```

### Learning

We could find the optimal value of $m$ by trying all possible values (uniform
search), like we did in Section 1.2. But we want the answers found to be
*accurate* and *fast* — and uniform search necessarily sacrifices one of these
constraints. After all, even testing all possible values of $m$ from 0–1, with
precision to the fourth decimal place, involves trying **10,000** values!

Clearly, we need to try a different approach.

#### Gradient Descent

Imagine that you could visualize the **loss function** or **loss landscape** for
this problem; i.e, the MSE error for every possible model configuration. It
might look something like this:

What you’d note is that, in general, the loss landscape tends to be “smooth” and
decreasing until you get to the **global cost minimum**, which represents the
optimal model with least error. There are some “bumps” or local minima that
break the pattern, but these tend to be shallow and surpassable. If you **rolled
a ball** down the “hill” that represents the error, the ball would probably end
up at the global minimum. Not for certain. But probably.

This inspires an algorithm called **gradient descent**, in which, like a ball
rolling down a hill, we update the model parameters to change in the **steepest
downhill direction** of the error. This works whether there is only one
parameter (like in this example, $m$) or when there are many parameters.

If you’ve taken calculus, you likely recognize what the “steepest downhill
direction” is — mathematically, it’s equivalent to the **negative of the
derivative** of the loss or error. (The negative of the *gradient* when we’re
talking about multivariable parameter spaces). Shifting the model to go in the
steepest downhill direction would be the equivalent of **subtracting** the
negative derivative of the loss, times some constant. Thus, we can formalize
gradient descent for this problem as an update rule:

#### Stochastic Gradient Descent (SGD)

Most machine learning/deep learning applications use a variant of gradient
descent called stochastic gradient descent (SGD), in which instead of updating
parameters based on the derivative of the *dataset* on each step, you update
based on the derivative of a randomly chosen *sample*. The update rule for SGD
is similar to gradient descent, but involves random sampling at each iteration:

Not only is this much less computationally taxing, research has shown that the randomness involved in SGD allows it to converge and overcome local minima faster.

#### Implementation

To use stochastic gradient descent in our code, we first have to compute the derivative of the loss function with respect to a random sample. The math is shown below:

Now, we write a `learn()`

function that chooses a random housing sample and
modifies $m$ in the direction of that sample, returning the new (modified) value
of $m$. We use a relatively small learning rate of $0.005$ to ensure convergence.

```
def learn(M):
j = random.randint(0, len(area) - 1) # choosing a random sample
deriv = 2*(M*area[j] - price[j])*M # derivative calculation
M = M - 0.005*deriv # SGD update step
return M
```

Lastly, we create a training loop that applies `learn()`

$2000$ times and shows
the convergence to the best value of $m$. Every 100 iterations, we print out the
MSE loss, which should ideally decrease as the model becomes better at fitting
the data.

```
for i in range(0, 2000):
M = learn(M)
if i % 100 == 0:
print ("Loss: ", evaluate(M), "(M =", M, ")")
```

### Putting It All Together

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
url = "https://raw.githubusercontent.com/nikcheerla/deeplearningschool/master/examples/data/housing.csv"
data = pd.read_csv(url)
area = data["Square Feet (Millions)"]
price = data["Price ($, Millions)"]
sns.jointplot(area, price);
plt.show()
M = 1
# Takes in M + numpy array of areas, returns price predictions
def predict(M, input_areas):
return M*input_areas
# Evaluates MSE of model y = Mx
def evaluate(M):
price_predicted = predict(M, area)
MSE = ((price - price_predicted)**2).mean()
return MSE
print (evaluate(M=1)) # Error is 0.3258 when M=1
def learn(M):
j = random.randint(0, len(area) - 1) # random sample
deriv = 2*(M*area[j] - price[j])*M # derivative calc
M = M - 0.005*deriv # SGD update
return M
for i in range(0, 2000):
M = learn(M)
if i % 100 == 0:
print ("Loss: ", evaluate(M), "(M =", M, ")")
```

We run the program, training it to predict house price given house area. It
seems to converge on a final slope of around **24**, no matter what the initial
guess/value of M was.

We compare our predicted value of $m$ to the true line-of-best-fit (linear
regression can also be solved analytically to find an mathematically optimal
solution). The line-of-best-fit also has an intercept term, so it doesn’t
exactly match our solution, but it also achieves a slope of around **24**.

Success! We managed to develop a linear model that accurately predicts house prices using Stochastic Gradient Descent.