
Analysis of variance approach to regression

 

We divide the total variability in the observed data into two parts: one coming from the errors, the other coming from the predictor.

ANOVA decomposition

The following decomposition 
 
\( \large Y_i - \overline{Y}\)   = \( \large (\widehat{Y_i} - \overline{Y}) \)  +  \( \large (Y_i - \widehat{Y_i} )\) ,  \(   i=1,2,...,n.   \)
 

expresses the deviation of the observed response from the mean response as the sum of the deviation of the fitted value from the mean and the residual.

 

Squaring both sides and summing over \(i\), the cross-product term vanishes (by the normal equations of least squares), and we have:

 

\( \large \sum_{i=1}^n (Y_i - \overline{Y})^2 = \sum_{i=1}^n (\widehat{Y_i} -\overline{Y})^2 + \sum_{i=1}^n (Y_i - \widehat{Y_i})^2 \)    (1)

or

\( \large SSTO = SSR +SSE\)

where \(SSTO = \sum_{i=1}^n (Y_i - \overline{Y})^2, \) \(SSR = \sum_{i=1}^n (\widehat{Y_i} -\overline{Y})^2 \) and \(SSE = \sum_{i=1}^n (Y_i - \widehat{Y_i})^2. \) Equation (1) is referred to as the ANOVA decomposition of the variation in the response. Note that

   \( \large SSR = b_1^2 \sum_{i=1}^n (X_i - \overline{X})^2 .\) 
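The algebra behind (1) and the expression for SSR can be sketched as follows. This assumes \(b_0\) and \(b_1\) are the least-squares estimates, and it writes \(e_i = Y_i - \widehat{Y_i}\) for the residuals (a shorthand introduced here, not used elsewhere on this page). Expanding the square of the decomposition and summing gives

$$ \sum_{i=1}^n (Y_i - \overline{Y})^2 = \sum_{i=1}^n (\widehat{Y_i} - \overline{Y})^2 + \sum_{i=1}^n e_i^2 + 2 \sum_{i=1}^n (\widehat{Y_i} - \overline{Y}) e_i . $$

Since \(\widehat{Y_i} = b_0 + b_1 X_i\) and \(\overline{Y} = b_0 + b_1 \overline{X},\) we have \(\widehat{Y_i} - \overline{Y} = b_1 (X_i - \overline{X}),\) so the cross-product term is

$$ 2 b_1 \sum_{i=1}^n (X_i - \overline{X}) e_i = 2 b_1 \left( \sum_{i=1}^n X_i e_i - \overline{X} \sum_{i=1}^n e_i \right) = 0, $$

because the normal equations give \(\sum_{i=1}^n e_i = 0\) and \(\sum_{i=1}^n X_i e_i = 0.\) The same identity \(\widehat{Y_i} - \overline{Y} = b_1 (X_i - \overline{X})\) yields \(SSR = b_1^2 \sum_{i=1}^n (X_i - \overline{X})^2\) directly.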

Degrees of freedom

The degrees of freedom of different terms in the decomposition (1) are  

d.f.( SSTO ) = n - 1,        d.f.( SSR ) = 1,        d.f.( SSE ) = n - 2.

So, d.f.( SSTO ) = d.f.( SSR ) + d.f.( SSE ). 

 

Expected value and distribution

\( \large E ( SSE ) = ( n - 2)  \sigma^2, \) and \(\large E ( SSR ) = \sigma^2 + \beta_1^2 \sum_{i=1}^n (X_i - \overline{X})^2. \) Also, under the normal regression model, and under \( \large H_0 : \beta_1 = 0, \)

\( \large SSR \sim \sigma^2 \chi_1^2,        SSE \sim \sigma^2 \chi_{n-2}^2, \)

and these two are independent. 

 

Mean squares

\( \large MSE = \frac{SSE}{d.f.(SSE)} = \frac{SSE}{n-2},        MSR = \frac{SSR}{d.f.(SSR)} = \frac{SSR}{1}. \)

 

Also, \( \large E ( MSE ) = \sigma^2 , E ( MSR ) = \sigma^2 + \beta_1^2 \sum_{i=1}^n (X_i - \overline{X})^2. \)
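The expected mean squares can be checked numerically. The following is a minimal simulation sketch, not part of the original text: the design points, the parameter values, the number of replications, and the function names `mean_squares` and `simulate` are all illustrative assumptions. It generates data from the normal regression model many times and averages MSR and MSE, which should land near the theoretical values above.

```python
import random


def mean_squares(x, y):
    """Return (MSR, MSE) for the least-squares line fitted to (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ssto = sum((yi - y_bar) ** 2 for yi in y)
    b1 = sxy / sxx
    ssr = b1 ** 2 * sxx          # SSR = b1^2 * sum (X_i - X_bar)^2
    sse = ssto - ssr             # from SSTO = SSR + SSE
    return ssr / 1, sse / (n - 2)


def simulate(beta0, beta1, sigma, x, reps=2000, seed=0):
    """Average MSR and MSE over `reps` simulated data sets (hypothetical setup)."""
    rng = random.Random(seed)
    total_msr = 0.0
    total_mse = 0.0
    for _ in range(reps):
        # Y_i = beta0 + beta1 * X_i + error, with normal errors of s.d. sigma
        y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
        msr, mse = mean_squares(x, y)
        total_msr += msr
        total_mse += mse
    return total_msr / reps, total_mse / reps


if __name__ == "__main__":
    x = [float(i) for i in range(20)]            # assumed design points
    x_bar = sum(x) / len(x)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sigma = 2.0
    for beta1 in (0.0, 0.5):
        avg_msr, avg_mse = simulate(1.0, beta1, sigma, x)
        print("beta1 =", beta1)
        print("  average MSE:", avg_mse, " theory:", sigma ** 2)
        print("  average MSR:", avg_msr, " theory:", sigma ** 2 + beta1 ** 2 * sxx)
```

With \(\beta_1 = 0\) both averages should be close to \(\sigma^2\); with \(\beta_1 \neq 0\) only the average MSR is inflated, which is what motivates comparing MSR with MSE.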

 

F ratio

For testing \( \large H_0 : \beta_1 = 0 \) versus \(\large  H_1 : \beta_1 \neq 0, \) the following test statistic, called the F ratio, can be used:

\( \large F^* = \frac{MSR}{MSE}. \)

 

The reason is that the ratio of expected mean squares is \( \frac{E(MSR)}{E(MSE)} = 1 + \frac{ \beta_1^2 \sum_{i=1}^n (X_i - \overline{X})^2 }{\sigma^2}, \) so \(F^*\) tends to be near 1 when \(\beta_1 = 0\) and larger than 1 otherwise. So, a significantly large value of \(F^*\) provides evidence against \(H_0\) and for \(H_1.\)

 

Under \(H_0,\) \(F^*\) has the \(F\) distribution with paired degrees of freedom (d.f.( SSR ), d.f.( SSE )) = (1, n - 2 ) (written \(F^* \sim F_{1, n - 2}\)). Thus,

 

the test rejects \(H_0\) at level of significance \(\alpha\) if \(F^* > F( 1 - \alpha; 1, n - 2 ), \)

 

where \(F( 1 - \alpha; 1, n - 2 ) \) is the \( (1 - \alpha ) \) quantile of the \(F_{1, n - 2} \) distribution.

Relation between F-test and t-test

One can check that \(\large F^* =  ( t^* )^2, \) where \( \large t^* = \frac{b_1}{s ( b_1 )} \) is the test statistic for the t-test of \(H_0 : \beta_1 = 0 \) versus \(H_1 : \beta_1 \neq 0. \) So, the F-test is equivalent to the two-sided t-test in this case.
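A short verification of this identity, assuming the usual estimated variance of the slope, \(s^2 ( b_1 ) = \frac{MSE}{\sum_{i=1}^n (X_i - \overline{X})^2}\) (a standard formula that is not stated on this page), and using \(SSR = b_1^2 \sum_{i=1}^n (X_i - \overline{X})^2\):

$$ F^* = \frac{MSR}{MSE} = \frac{b_1^2 \sum_{i=1}^n (X_i - \overline{X})^2}{MSE} = \frac{b_1^2}{MSE / \sum_{i=1}^n (X_i - \overline{X})^2} = \left( \frac{b_1}{s ( b_1 )} \right)^2 = ( t^* )^2 . $$

Correspondingly, the critical values satisfy \(F( 1 - \alpha; 1, n - 2 ) = t( 1 - \alpha/2; n - 2 )^2,\) so the two tests reject for exactly the same data.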

 

ANOVA table

The ANOVA table summarizes the quantities used in testing \(H_0 : \beta_1 = 0 \) against \(H_1 : \beta_1 \neq 0.\) It is of the form:

 

Source | df | SS | MS | F*
Regression | d.f.(SSR) = 1 | SSR | MSR | \(\frac{MSR}{MSE} \)
Error | d.f.(SSE) = n - 2 | SSE | MSE |
Total | d.f.(SSTO) = n - 1 | SSTO | |
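As an illustration, the entries of the table can be computed from raw data along the following lines. This is a sketch only: the function name `anova_table`, the dictionary layout, and the toy data in the usage lines are assumptions made for this example, not part of the original page.

```python
def anova_table(x, y):
    """Compute the ANOVA table entries for a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Least-squares estimates of slope and intercept
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]

    # Sums of squares, each computed from its definition
    ssto = sum((yi - y_bar) ** 2 for yi in y)
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

    # Mean squares and F ratio
    msr = ssr / 1
    mse = sse / (n - 2)
    return {
        "Regression": {"df": 1, "SS": ssr, "MS": msr, "F": msr / mse},
        "Error": {"df": n - 2, "SS": sse, "MS": mse},
        "Total": {"df": n - 1, "SS": ssto},
    }


if __name__ == "__main__":
    # Toy data, made up purely for illustration
    table = anova_table([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
    for source, row in table.items():
        print(source, row)
```

Computing SSR and SSE separately from their definitions, rather than obtaining one by subtraction, makes it possible to confirm numerically that SSTO = SSR + SSE.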

Example: housing price data

We consider a data set on housing prices. Here Y = selling price of a house (in units of $1000), and X = size of the house (in units of 100 square feet). The summary statistics are given below:

$$ \large n = 19,     \overline{X}  = 15.719,     \overline{Y} = 75.211, $$

\( \large \sum_i ( X_i - \overline{X} )^2 = 40.805,    \sum_i ( Y_i - \overline{Y} )^2 = 556.078,    \sum_i ( X_i - \overline{X} ) ( Y_i - \overline{Y} ) = 120.001. \)

(Example) - Estimates of \(\beta_1 \) and \(\beta_0\)

\(\large b_1 = \frac{\sum_i ( X_i - \overline{X} ) ( Y_i - \overline{Y} ) }{\sum_i ( X_i - \overline{X} )^2} = \frac{120.001}{40.805} = 2.941. \)       

             and

\( \large b_0 = \overline{Y} - b_1 \overline{X} = 75.211 - (2.941)(15.719) = 28.981. \)         

(Example) - MSE

The degrees of freedom (d.f.) of SSE are \(\large n - 2 = 17, \) and \(\large SSE = \sum_i (Y_i - \overline{Y} )^2 - b_1^2 \sum_i ( X_i - \overline{X} )^2 = 203.17.\) So,

\( \large MSE = \frac{SSE}{n - 2} = \frac{203.17}{17} = 11.95. \)

             Also, SSTO = 556.08 and SSR = SSTO - SSE = 352.91, MSR = SSR/1 = 352.91.

\(F^* = \frac{MSR}{MSE} = 29.529 = (t^* )^2,\) where \(t^* = \frac{b_1}{s ( b_1 )} = \frac{2.941}{0.5412} = 5.434.\) Also, F( 0.95; 1, 17 ) = 4.45 and t( 0.975; 17 ) = 2.11. Since \(F^* = 29.529 > 4.45\) (equivalently, \(|t^*| = 5.434 > 2.11\)), we reject \(H_0 : \beta_1 = 0 \) at the 5% level of significance. The ANOVA table is given below.

 

Source | df | SS | MS | F*
Regression | 1 | 352.91 | 352.91 | 29.529
Error | 17 | 203.17 | 11.95 |
Total | 18 | 556.08 | |
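The arithmetic of this example can be reproduced from the summary statistics alone, roughly as follows. The variable names are illustrative, the critical value 4.45 is taken from the text rather than computed, and the standard error uses the standard formula \(s ( b_1 ) = \sqrt{MSE / \sum_i ( X_i - \overline{X} )^2},\) which is an assumption in the sense that this page does not state it.

```python
import math

# Summary statistics from the housing price example
n = 19
sxx = 40.805     # sum of (X_i - X_bar)^2
syy = 556.078    # sum of (Y_i - Y_bar)^2, i.e. SSTO
sxy = 120.001    # sum of (X_i - X_bar)(Y_i - Y_bar)

b1 = sxy / sxx               # slope estimate
ssto = syy
ssr = b1 ** 2 * sxx          # SSR = b1^2 * sum (X_i - X_bar)^2
sse = ssto - ssr             # SSE = SSTO - SSR
msr = ssr / 1
mse = sse / (n - 2)
f_star = msr / mse           # F ratio

s_b1 = math.sqrt(mse / sxx)  # standard error of b1 (standard formula, assumed)
t_star = b1 / s_b1

f_crit = 4.45                # F(0.95; 1, 17), value quoted in the text
reject_h0 = f_star > f_crit

print("b1 =", round(b1, 3))
print("SSR =", round(ssr, 2), " SSE =", round(sse, 2), " MSE =", round(mse, 2))
print("F* =", round(f_star, 3), " t* =", round(t_star, 3))
print("reject H0 at the 5% level:", reject_h0)
```

Small differences in the last digit relative to the values above can arise because the text rounds intermediate quantities.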

Contributors

  • Valerie Regalia
  • Debashis Paul


Creative Commons License UC Davis GeoWiki by University of California, Davis is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. Permissions beyond the scope of this license may be available at copyright@ucdavis.edu. Terms of Use