Wednesday, July 23, 2014

The Linear Regression Model y = a + bx

This is part of Mike's Big Data, Data Mining, and Analytics Tutorial 

Linear regression of the form \( y = a + b x \) is the typical "go to" regression method. It is taught in many basic statistics courses and in non-statistical mathematics courses as well. There are a large number of problems for which a linear model of this form gives a correct answer, and a large number more for which it gives an acceptable one. In future posts, we'll look at other models that may fit other data better.

By definition, the line \( y = a + b x \) is a straight line with the following characteristics:

  • The \( y \) axis intercept (the model evaluated at \( x = 0 \), in mathematical notation \( y = a + bx |_{x=0} \)) is equal to \( a \).
  • The \( x \) axis intercept (the model evaluated at \( y = 0 \) and solved for \( x \), in mathematical notation \( y = a + bx |_{y=0} \)) is equal to \( \frac{-a}{b} \).
  • The slope of the line is equal to \( b \). This can be shown either with the typical "rise over run" argument or with the first derivative (the two amount to the same thing, but I recognize that some readers may not have a calculus background).
    • Let's look at the "rise over run" argument first. To keep the math easy, suppose we want the change in \( y \) (call it \( \sigma \)) when we make some arbitrary change in \( x \) (say we add \( \delta \)): $$ y = a + b x $$ $$ y + \sigma = a + b(x + \delta) $$ Now let's look at the change: $$ y + \sigma - y = a + b (x + \delta) - (a + bx) $$ Simplifying, we get $$ \sigma = b \delta $$ so the "rise over run" is $$ \frac{\sigma}{\delta} = b $$ To take this a step further, if \( \delta = 1 \), the change in \( y \) is exactly \( b \) (see the quick numerical check after this list).
    • Going back to basic first-semester calculus, the same result follows from the first derivative (for non-calculus readers, the derivative measures the slope of a curve at each point): $$ \frac{d}{dx} \left( a + bx \right) = b $$
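
To see the rise-over-run fact numerically, here is a minimal Python sketch (the coefficients \( a = 3 \), \( b = 2 \) and the test point are purely illustrative):

```python
# Rise-over-run check for y = a + b*x: sigma/delta always equals b
a, b = 3.0, 2.0           # illustrative coefficients

def line(x):
    return a + b * x

x, delta = 5.0, 0.25      # arbitrary point and arbitrary step
sigma = line(x + delta) - line(x)
print(sigma / delta)      # 2.0 == b, regardless of x and delta
```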

How do I calculate \( a \) and \( b \) for the line \( y =  a + b x \)?
 

Let's get into the calculation of the \( a \) and \( b \) values for the \( y = a + b x \) model. We'll need to set up a matrix \( A \) with the relevant transformations of our input data; we'll get to that in a minute. First, let's answer the question "How do I find the line between 2 points in the \( (x,y) \) plane?"

For a moment, let's consider two points from our data: \( (x_1, y_1) \) and \( (x_2, y_2) \). Using just these two points, we can calculate \( a \) and \( b \) directly. First, let's define a couple of equations:

$$ y_1 = a + b x_1 $$
$$ y_2 = a + b x_2 $$

With a little bit of reorganization, let's solve for \( b \) first by subtracting the first equation from the second:

$$ y_2 - y_1  = a + b x_2 - (a + b x_1) $$

\( a \) cancels out and the right side simplifies to  \( b x_2 - b x_1 = b ( x_2 - x_1 ) \). Solving for \( b \):

$$ b = \frac{y_2 - y_1}{x_2 - x_1} $$

Either equation can be used to solve for \( a \): using the first, \( a = y_1 - b x_1 \); using the second, \( a = y_2 - b x_2 \). Now, let's consider a matrix solution to the same problem. We'll set up matrices to solve the equation
 $$ A z = B $$

Here, let's define \( z \) and \( B \). \( z \) contains our unknowns, namely \( a \) and \( b \); \( B \) contains our \( y \) values, namely \( y_1 \) and \( y_2 \):

$$ z = \begin{pmatrix} a \\ b \end{pmatrix} \quad \quad B = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} $$

Let's take a little extra time to talk about \( A \). Each column in \( A \) has to be a function of the data \( x_i \). Let's go back to our original equations and rewrite them slightly:

$$ y_1 = a + b x_1 \iff y_1 = a \mathbf{x_1^0} + b x_1^{\mathbf{1}} $$
$$ y_2 = a + b x_2 \iff y_2 = a \mathbf{x_2^0} + b x_2^{\mathbf{1}} $$

Any nonzero number raised to the zero power is equal to 1, and anything raised to the first power is equal to itself. Let's put our rewritten equations into their equivalent matrix format:

$$ A = \begin{pmatrix} x_1^0 & x_1^1 \\ x_2^0 & x_2^1 \end{pmatrix} \iff \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \end{pmatrix} $$

Our resulting matrix \( A \) contains all of the data. The first column contains the data raised to the 0 power and the second column contains the data raised to the first power. Let's write our system of equations out in matrix form:

$$ A z = B $$
$$ \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} $$

This is a 2x2 system, so we can use the formula for the inverse of a 2x2 matrix from my post here: http://mikemstech.blogspot.com/2014/07/inverse-of-2x2-matrix.html. We'll use the fact that

$$(A^{-1}) A z = (A^{-1}) B $$
$$ z = (A^{-1}) B $$

Calculating \( A^{-1} B \) yields

$$ \begin{pmatrix} \frac{y_1 x_2 - y_2 x_1}{ x_2 - x_1 } \\  \frac{y_2 - y_1}{x_2-x_1} \end{pmatrix} $$

Namely, \( a = \frac{y_1 x_2 - y_2 x_1}{ x_2 - x_1 } \) and \( b = \frac{y_2 - y_1}{x_2-x_1} \) for our two-point example. A bit of algebra shows that this \( a \) is equivalent to the expressions \( y_1 - b x_1 \) and \( y_2 - b x_2 \) above (\( b \) is identical with either approach).
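
To make the two-point case concrete, here is a minimal Python/NumPy sketch (the points \( (1, 5) \) and \( (2, 7) \) are purely illustrative) that solves \( A z = B \) and cross-checks the closed-form answers:

```python
import numpy as np

# Two sample points (x1, y1) = (1, 5) and (x2, y2) = (2, 7)
x = np.array([1.0, 2.0])
y = np.array([5.0, 7.0])

# First column is x^0 (all ones), second column is x^1
A = np.column_stack([np.ones_like(x), x])

# Solve A z = B exactly for z = (a, b)
a, b = np.linalg.solve(A, y)
print(a, b)                                         # 3.0 2.0

# Closed-form two-point answers derived above
print((y[0] * x[1] - y[1] * x[0]) / (x[1] - x[0]))  # a = 3.0
print((y[1] - y[0]) / (x[1] - x[0]))                # b = 2.0
```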

How do I find \( a \) and \( b \) with more than two points?

We used the two-point example as a conceptual introduction to how the matrices are set up; now let's consider the case with more than 2 points. We set up our system of equations using the least squares approach (minimizing the total sum of squared error of the fitted model), which leads to the so-called normal equations:

$$ A^T A z = A^T B $$

In this case,
$$  A = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \quad z = \begin{pmatrix} a \\ b \end{pmatrix} \quad B = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} $$
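
In practice, you rarely form these matrices by hand: a library least-squares routine solves the same minimization (usually via a more numerically stable factorization than the explicit normal equations). A minimal Python/NumPy sketch with illustrative data:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])    # illustrative data
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# n-by-2 design matrix: a column of ones and a column of x values
A = np.column_stack([np.ones_like(x), x])

# lstsq minimizes ||A z - B||^2, the total sum of squared error
z, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)
a, b = z
print(a, b)
```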

Now for the calculation of \( A^T A \) and \( A^T B \):

$$ A^T A = \begin{pmatrix} \sum \limits _{i = 1}^n 1 & \sum \limits _{i=1}^n x_i \\ \sum \limits _{i=1}^n x_i & \sum \limits _{i=1}^n x_i^2 \end{pmatrix} = \begin{pmatrix} n & \sum \limits _{i=1}^n x_i \\ \sum \limits _{i=1}^n x_i & \sum \limits _{i=1}^n x_i^2 \end{pmatrix} \quad \quad A^T B = \begin{pmatrix} \sum \limits _{i=1}^n y_i \\ \sum \limits _{i=1}^n x_i \cdot y_i \end{pmatrix} $$
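
A quick sketch (with illustrative data) confirming that the matrix products reduce to these sums:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])    # illustrative data
y = np.array([5.0, 7.0, 9.0])

A = np.column_stack([np.ones_like(x), x])

print(A.T @ A)   # [[n, sum x], [sum x, sum x^2]] = [[3, 6], [6, 14]]
print(A.T @ y)   # [sum y, sum x*y] = [21, 46]
```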


This is again a 2x2 system, so we can use the same 2x2 inverse formula from my post here: http://mikemstech.blogspot.com/2014/07/inverse-of-2x2-matrix.html. Again, we'll use the following:

$$ (A^TA)^{-1} A^TA z = (A^TA)^{-1} A^T B $$
$$ z = (A^TA)^{-1} A^T B $$

Finding the inverse of \( A^T A \) yields

$$ (A^T A)^{-1} = \frac { 1 } { n \sum \limits _{i=1}^n x_i^2 - \left ( \sum \limits_{i=1}^n x_i \right )^2 } \begin{pmatrix} \sum \limits_{i=1}^n x_i^2 & -1 \cdot \sum \limits _{i=1}^n x_i \\ -1 \cdot \sum \limits _{i=1}^n x_i & n \end{pmatrix}$$
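
As a quick numerical check (illustrative data again), this formula matches a direct matrix inverse:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])                  # illustrative data
n, sx, sxx = len(x), x.sum(), (x * x).sum()

det = n * sxx - sx ** 2                        # the determinant above
formula_inv = np.array([[sxx, -sx], [-sx, n]]) / det

A = np.column_stack([np.ones_like(x), x])
print(formula_inv)                             # [[ 2.33..., -1. ], [-1., 0.5]]
print(np.linalg.inv(A.T @ A))                  # same matrix
```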

Calculating \( (A^T A)^{-1} A^T B \) yields

$$ (A^T A)^{-1} A^T B = \begin{pmatrix} \frac{\sum \limits_{i=1}^n x_i^2 \sum \limits_{i=1}^n y_i -  \sum \limits_{i=1}^n x_i \sum \limits_{i=1}^n x_i y_i}{n \sum \limits _{i=1}^n x_i^2 - \left ( \sum \limits_{i=1}^n x_i \right )^2 }  \\ \frac{n \sum \limits_{i=1}^n x_i y_i  -  \sum \limits_{i=1}^n y_i \sum \limits_{i=1}^n x_i }{n \sum \limits _{i=1}^n x_i^2 - \left ( \sum \limits_{i=1}^n x_i \right )^2 } \end{pmatrix} $$

So, for the regression model \( y = a + bx \)
$$ a = \frac{\sum \limits_{i=1}^n x_i^2 \sum \limits_{i=1}^n y_i -  \sum \limits_{i=1}^n x_i \sum \limits_{i=1}^n x_i y_i}{n \sum \limits _{i=1}^n x_i^2 - \left ( \sum \limits_{i=1}^n x_i \right )^2 } \quad \quad  b = \frac{n \sum \limits_{i=1}^n x_i y_i  -  \sum \limits_{i=1}^n y_i \sum \limits_{i=1}^n x_i }{n \sum \limits _{i=1}^n x_i^2 - \left ( \sum \limits_{i=1}^n x_i \right )^2 }   $$
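
These closed-form sums translate directly into code. Below is a minimal Python sketch; fit_line is just a hypothetical helper name, and np.polyfit serves only as an independent cross-check (note that it returns coefficients from the highest power down, i.e. \( [b, a] \)):

```python
import numpy as np

def fit_line(x, y):
    """Least-squares a and b for y = a + b*x via the summation formulas."""
    n = len(x)
    sx, sy = x.sum(), y.sum()
    sxx, sxy = (x * x).sum(), (x * y).sum()
    denom = n * sxx - sx ** 2
    a = (sxx * sy - sx * sxy) / denom
    b = (n * sxy - sy * sx) / denom
    return a, b

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # illustrative data
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

print(fit_line(x, y))
print(np.polyfit(x, y, 1))                # [b, a], highest power first
```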

Example: Calculate a Regression Line for 3 Collinear Points

Problem statement: Calculate a line in the form \( y = a + b x \) that goes through the points \( (1,5),(2,7),(3,9) \).

 

We derived the formula above, so now we need to focus on calculation.

$$ a = \frac{\sum \limits_{i=1}^n x_i^2 \sum \limits_{i=1}^n y_i -  \sum \limits_{i=1}^n x_i \sum \limits_{i=1}^n x_i y_i}{n \sum \limits _{i=1}^n x_i^2 - \left ( \sum \limits_{i=1}^n x_i \right )^2 } \quad \quad  b = \frac{n \sum \limits_{i=1}^n x_i y_i  -  \sum \limits_{i=1}^n y_i \sum \limits_{i=1}^n x_i }{n \sum \limits _{i=1}^n x_i^2 - \left ( \sum \limits_{i=1}^n x_i \right )^2 } $$

If calculating by hand, the easiest way is to organize the calculations in a table.

| Point | \( x_i \) | \( y_i \) | \( x_i^2 \) | \( x_i y_i \) |
|---|---|---|---|---|
| \( (1,5) \) | 1 | 5 | 1 | 5 |
| \( (2,7) \) | 2 | 7 | 4 | 14 |
| \( (3,9) \) | 3 | 9 | 9 | 27 |

Summing the columns: \( \sum \limits_{i=1}^3 x_i = 1 + 2 + 3 = 6 \), \( \sum \limits_{i=1}^3 y_i = 5 + 7 + 9 = 21 \), \( \sum \limits_{i=1}^3 x_i^2 = 1 + 4 + 9 = 14 \), and \( \sum \limits_{i=1}^3 x_i y_i = 5 + 14 + 27 = 46 \), with \( n = 3 \).

Now, for the calculation of \( a \):
$$ a = \frac{\sum \limits_{i=1}^n x_i^2 \sum \limits_{i=1}^n y_i -  \sum \limits_{i=1}^n x_i \sum \limits_{i=1}^n x_i y_i}{n \sum \limits _{i=1}^n x_i^2 - \left ( \sum \limits_{i=1}^n x_i \right )^2 } = \frac{ 14 \cdot 21 - 6 \cdot 46 }{ 3 \cdot 14 - 6^2 } = \frac{18}{6} = 3 $$

Now, for the calculation of \( b \):

$$ b = \frac{n \sum \limits_{i=1}^n x_i y_i  -  \sum \limits_{i=1}^n y_i \sum \limits_{i=1}^n x_i }{n \sum \limits _{i=1}^n x_i^2 - \left ( \sum \limits_{i=1}^n x_i \right )^2 } = \frac{3 \cdot 46 - 21 \cdot 6}{3 \cdot 14 - 6^2 } = \frac{12}{6} = 2 $$

The resulting solution for \( y = a + b x \) that fits these three points is \( y = 3 + 2 x \).
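
As a quick sanity check of the hand calculation (NumPy's polyfit returns coefficients from the highest power down):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([5.0, 7.0, 9.0])
print(np.polyfit(x, y, 1))   # [2. 3.] -> b = 2, a = 3, matching the hand calculation
```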

Back to Mike's Big Data, Data Mining, and Analytics Tutorial 
