STAT 2223-002 Project
Due at 11:59 pm on May 7.
Submit your project through Canvas before the due date.
1 The Project
This project concerns a problem of interest when buying a used car. The calling price of a used
car varies with the year of production, the mileage, the specific kind of car
(brand and body type, e.g. Honda Civic Sedan), …, and some random factors.
The purpose of this project is to examine the relationship between the mean calling
price E(y) (the price asked for by the owner) of a specific kind of car and the following
predictors:
1. X1 (quantitative): The number of years since production; e.g. if a car was produced
in 2010, then X1 = 2017 − 2010 = 7.
2. X2 (quantitative): The original price of the car when it was brand new.
3. X3 (quantitative): The current mileage of the car.
4. X4 (qualitative): Title (clean vs. not clean).
Choose a specific type of car (for example: Honda Civic sedan, Chevrolet Malibu,
etc.), and collect your sample data from http://charlotte.craigslist.org/ or
https://www.kaggle.com/orgesleka/used-cars-database with sample size n ≥ 30
(you can decide how many observations to include, but that number must be greater
than or equal to 30). Make sure your data contains the above quantitative and qualitative
variables.
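As a rough illustration of how the four predictors and the response might be stored in R, the sketch below builds a small data frame. The column names (model_year, new_price, mileage, title, price) are hypothetical choices, and the three rows are made-up placeholders, not real listings; your own data must come from one of the sources above.

```r
# Hypothetical layout for the collected sample. The three rows are
# made-up placeholders, NOT real listings.
cars <- data.frame(
  model_year = c(2010, 2013, 2008),               # year of production
  new_price  = c(18000, 19500, 17000),            # X2: price when brand new
  mileage    = c(95000, 60000, 140000),           # X3: current mileage
  title      = c("clean", "clean", "not clean"),  # X4: title status
  price      = c(6500, 9800, 3900)                # Y: calling price
)

# X1: years since production, following the example X1 = 2017 - 2010 = 7.
reference_year <- 2017
cars$years <- reference_year - cars$model_year

# Store the qualitative variable as a factor so that lm() builds the
# dummy variable automatically ("not clean" is the baseline level here).
cars$title <- factor(cars$title, levels = c("not clean", "clean"))

# The project requires n >= 30; the placeholder rows do not satisfy this.
nrow(cars) >= 30
```

Keeping the title as a factor rather than a hand-coded 0/1 column is one option; either coding yields the same fitted model.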
The objectives of this project are as follows:
1. Hypothesize a model for the calling price and the predictors (if necessary, consider
interaction effects).
2. Run a variable selection procedure to choose the most important x's (stepwise regression,
all-possible-regressions selection procedure).
3. For the x's selected in step 2, fit the regression model you proposed in step 1. Conduct
t-tests on the important β's; compare adjusted R² values; compare 2s values.
4. Propose and fit other candidate models. Determine the best model for E(y) using the
nested-model F-test (hint: use the anova() function in R for the nested F-test).
5. Based on the best model you selected in step 4, perform residual analysis to check the
assumptions on ε (whether or not the ε's are independent draws from N(0, σ²)). For the
normality assumption, use both Q-Q plots and residual plots (code will be provided in …).
6. (Optional, bonus) Remedy your model if you detect a violation of the assumptions
on ε, and redo steps 1, 2, 3, 4 and 5.
7. Assess the adequacy of the best model by checking that the global F-test is significant, the
adjusted R² is high, and the 2s value is small.
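The objectives above can be sketched in base R roughly as follows. This is only an outline under stated assumptions: the data are simulated stand-ins (not real listings), the variable names are hypothetical, the interaction term years:mileage is just one possible choice, and step() selects by AIC, which is one form of stepwise selection and may differ from the procedure covered in class.

```r
# Simulated stand-in data (NOT real listings), n = 40.
set.seed(1)
n <- 40
years   <- sample(1:15, n, replace = TRUE)                        # X1
new_pr  <- round(rnorm(n, 20000, 1500))                           # X2
mileage <- pmax(0, round(12000 * years + rnorm(n, 0, 8000)))      # X3
title   <- factor(rep(c("clean", "not clean"), times = c(32, 8))) # X4
price   <- 0.6 * new_pr - 600 * years - 0.03 * mileage -
  1500 * (title == "not clean") + rnorm(n, 0, 800)                # Y
cars <- data.frame(price, years, new_pr, mileage, title)

# Step 1: a hypothesized model, here with one interaction term.
full <- lm(price ~ years + new_pr + mileage + title + years:mileage,
           data = cars)

# Step 2: stepwise variable selection (AIC-based) starting from `full`.
selected <- step(full, direction = "both", trace = 0)

# Step 3: t-tests on the betas, adjusted R^2, and 2s for the selected model.
s <- summary(selected)
s$coefficients            # estimates, standard errors, t values, p-values
s$adj.r.squared           # adjusted R^2
two_s <- 2 * s$sigma      # 2s, twice the residual standard error

# Step 4: nested-model F-test, main-effects model versus the full model.
reduced <- lm(price ~ years + new_pr + mileage + title, data = cars)
nested  <- anova(reduced, full)
nested

# Step 5: residual analysis, residuals vs fitted values and a normal Q-Q plot.
res <- residuals(selected)
plot(fitted(selected), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
qqnorm(res)
qqline(res)

# Step 7: global F-test p-value for the selected model.
f <- s$fstatistic
global_p <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
global_p
```

A small p-value in the second row of the anova() table suggests the extra term in the larger model is worth keeping; otherwise the reduced model is preferred.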
2 Format of Your Work
Your work should be clear and easy to understand. Use the following format:
1. Statement of The Problem: You need to state your research question here.
That is, tell us what your study is about and your purpose of the study (around
2. The Data: You need to specify how you collected the data and summarize your
sample data using the methods we learned in descriptive statistics in Chapter 1.
(1) The following table must be included. (30pts)

Obs   X1 (Years)   X2 (New Price)   X3 (Mileage)   X4 (Title)   Y (Calling Price)
···   ···          ···              ···            ···          ···

(2) Scatter plots (30pts): X1 versus Y; X2 versus Y; X3 versus Y.
Histogram (10pts): the histogram of the calling price Y.
3. The Models: Specify the hypothesized models you want to apply. In this part,
you are expected to finish the first four objectives stated above. Hint: the first
model you fit might not be an ideal model. You might need to improve
your model by selecting variables, changing the order of your model, considering
interaction effects, etc. You need to compare all the models you fitted and explain
why your chosen model is the best by checking the nested-model F-test, conducting t-tests on
the important β's, comparing adjusted R² values, and comparing 2s values. (150pts)
4. Assumption check: In this part, complete item 5 of the objectives stated above
and write down your conclusion. (60pts)
5. Model Remedy (optional): In this part, apply a transformation to y or x so that
the model assumptions are satisfied, and write down the new model and conclusion. (30pts)
6. Model adequacy: In this part, you are expected to finish the last objective stated above.
7. Conclusion: Give a brief summary of your study. (40pts)
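For parts 2 and 5 of this format, the descriptive plots and one possible remedy could look roughly like the sketch below. It again uses simulated stand-in data (not real listings) with hypothetical variable names, and the log transformation of y is only one candidate remedy; whether it helps depends on your own residual plots.

```r
# Simulated stand-in data (NOT real listings), n = 40.
set.seed(2)
n <- 40
years   <- sample(1:15, n, replace = TRUE)                    # X1
new_pr  <- round(rnorm(n, 20000, 1500))                       # X2
mileage <- pmax(0, round(12000 * years + rnorm(n, 0, 8000)))  # X3
price   <- new_pr * exp(-0.12 * years + rnorm(n, 0, 0.15))    # Y
cars <- data.frame(price, years, new_pr, mileage)

# Part 2: numerical summary, three scatter plots, and a histogram of Y.
summary(cars)
op <- par(mfrow = c(2, 2))
plot(cars$years, cars$price, xlab = "X1 (Years)", ylab = "Y (Calling Price)")
plot(cars$new_pr, cars$price, xlab = "X2 (New Price)", ylab = "Y (Calling Price)")
plot(cars$mileage, cars$price, xlab = "X3 (Mileage)", ylab = "Y (Calling Price)")
hist(cars$price, main = "Histogram of Y", xlab = "Y (Calling Price)")
par(op)

# Part 5: one candidate remedy, modelling log(y) instead of y.
fit_raw <- lm(price ~ years + new_pr + mileage, data = cars)
fit_log <- lm(log(price) ~ years + new_pr + mileage, data = cars)

# Compare the normal Q-Q plots of the residuals before and after.
op <- par(mfrow = c(1, 2))
qqnorm(residuals(fit_raw), main = "Q-Q: price")
qqline(residuals(fit_raw))
qqnorm(residuals(fit_log), main = "Q-Q: log(price)")
qqline(residuals(fit_log))
par(op)
```

Note that s and adjusted R² from fit_raw and fit_log are on different scales of y, so they should not be compared directly across the two models.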
This project is composed of 7 parts (see 2 Format of Your Work). Part 5 is optional,
and you will gain a 30pts bonus if you do a good job there. The total is 400pts
+ 30pts bonus.
All the analysis should be done using the software R. You need to work independently.
Helping each other understand the concepts is fine, but you should collect your data
and write your code by yourself. A project report that does not include the analysis
and R code can earn a maximum of 200 pts. If you are contemplating an ethical failure,
please read the code of student academic integrity: https://legal.uncc.edu/policies/up-407,
so you can plan for the consequences.
4 Submit your project
Submit your work through Canvas before the due time, 11:59 pm on May 7, 2018. If you
miss the project, your grade will automatically be an F. I highly suggest you upload
your report to Canvas at least 1 hour before the deadline to prevent submission failure.
If you are not able to submit through Canvas, email me your report as soon as possible.
If you email me your project report more than 24 hours after the deadline, your project will not be
graded. Your work should be a PDF file that contains your analysis and R code.