# STAT 2223-002 Project

Spring 2018
Due at 11:59 pm on May 7.
Submit your project through Canvas before the due date.
1 The Project
This project concerns a practical problem: buying a used car. The calling price of a used
car varies with the year of production, the mileage, the specific kind of car
(brand and body type, e.g. Honda Civic Sedan), …, as well as some random factors.
The purpose of this project is to examine the relationship between the mean calling
price E(y) (the price asked for by the owner) of a specific kind of car and the following
independent variables:
1. X1 (quantitative): The number of years since production; e.g., if a car was produced
in 2010, then X1 = 2017 − 2010 = 7.
2. X2 (quantitative): The original price of the car when it was brand new.
3. X3 (quantitative): The current mileage of the car.
4. X4 (qualitative): Title (clean vs. not clean).
Choose a specific type of car (for example, Honda Civic sedan, Chevrolet Malibu,
etc.) and collect your sample data from http://charlotte.craigslist.org/ or
https://www.kaggle.com/orgesleka/used-cars-database with sample size n ≥ 30
(you can decide how many observations to include, but that number must be greater
than or equal to 30). Make sure your data contain the above quantitative and qualitative
variables.
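As a rough illustration of the data layout described above, the following R sketch builds a small data frame with the four predictors and the response. Every value is simulated purely as a placeholder (it is not real listing data), and the column names (`x1`, `x2`, `x3`, `x4`, `y`), the price and mileage ranges, and the reference year 2017 (taken from the X1 example above) are assumptions made for the example.

```r
# Placeholder data only: simulated values standing in for listings you collect yourself.
set.seed(1)
n <- 30                                          # the project requires n >= 30
year <- sample(2005:2016, n, replace = TRUE)     # hypothetical production years

cars <- data.frame(
  x1 = 2017 - year,                              # X1: years since production
  x2 = round(runif(n, 18000, 24000)),            # X2: original (new) price
  x3 = round(runif(n, 20000, 150000)),           # X3: current mileage
  x4 = factor(sample(c("clean", "not clean"), n, replace = TRUE),
              levels = c("clean", "not clean"))  # X4: title status, stored as a factor
)

# Hypothetical calling price, generated only so the example has a response column.
cars$y <- round(0.8 * cars$x2 - 900 * cars$x1 - 0.03 * cars$x3 -
                1500 * (cars$x4 == "not clean") + rnorm(n, sd = 800))

stopifnot(nrow(cars) >= 30)                      # sample-size requirement
str(cars)                                        # shows the type of each column
summary(cars)                                    # quick descriptive summary
```

Storing X4 as a factor lets `lm()` create the dummy variable for the title status automatically.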
The objectives of this project are as follows:
1. Hypothesize a model relating the calling price to the predictors (if necessary, consider
interaction effects).
2. Run a variable selection procedure to choose the most important x’s (stepwise regression,
all-possible-regressions selection procedure).
3. For the x’s selected in step 2, fit the regression model you proposed in step 1. Conduct
t-tests on the important β’s; compare 2s values.
4. Propose and fit other candidate models. Determine the best model for E(y) by checking
the nested-model F-test (hint: use the anova() function in R for the nested F-test).
5. Based on the best model you selected in step 4, perform residual analysis to check
the assumptions on ε (whether or not the ε’s are independently distributed as N(0, σ²)).
(Hint: for the normality assumption, use both a Q-Q plot and residual plots (code will be
provided in later chapters).)
6. (Optional, Bonus) Remedy your model if you detect a violation of the assumptions
on ε, and redo steps 1, 2, 3, 4 and 5.
7. Assess the adequacy of the best model by checking that the global F-test is significant,
the adjusted R² is high, and the 2s value is small.
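Objectives 1 to 4 above can be sketched in base R roughly as follows. This is only a minimal illustration under stated assumptions: the data are simulated placeholders, the column names are hypothetical, the x1:x3 interaction is an arbitrary choice of candidate model, and `step()` selects by AIC, which may differ from the stepwise procedure covered in class.

```r
# Simulated placeholder data (not real listings); column names are assumptions.
set.seed(2)
n <- 40
cars <- data.frame(
  x1 = sample(1:12, n, replace = TRUE),
  x2 = runif(n, 18000, 24000),
  x3 = runif(n, 20000, 150000),
  x4 = factor(sample(c("clean", "not clean"), n, replace = TRUE))
)
cars$y <- 0.8 * cars$x2 - 900 * cars$x1 - 0.03 * cars$x3 -
  1500 * (cars$x4 == "not clean") + rnorm(n, sd = 800)

# Objective 1: a hypothesized first-order model with all four predictors.
full <- lm(y ~ x1 + x2 + x3 + x4, data = cars)

# Objective 2: stepwise selection (AIC-based in base R).
selected <- step(full, direction = "both", trace = 0)

# Objective 3: t-tests on the individual coefficients, and the 2s value,
# where s is the residual standard error of the fitted model.
print(summary(selected)$coefficients)
two_s <- 2 * summary(selected)$sigma

# Objective 4: nested-model F-test comparing the first-order model with
# a candidate that adds an x1:x3 interaction term.
with_interaction <- lm(y ~ x1 + x2 + x3 + x4 + x1:x3, data = cars)
nested_f <- anova(full, with_interaction)
print(nested_f)
```

A small p-value in the second row of the `anova()` table would suggest that the extra term improves on the reduced model; otherwise the simpler model is usually preferred.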
2 Format of Your Work
Your work should be clear and easy to understand, and should follow this format:
1. Statement of the Problem: State your research question here (100 words). (40pts)
2. The Data: Specify how you collected the data, and summarize your
sample data using the descriptive statistics methods we learned in Chapter 1.
(1) The following table must be included. (30pts)
(2) Scatter plots (30pts): X1 versus Y; X2 versus Y; X3 versus Y.
Histogram (10pts): the histogram of the calling price Y.
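
| Obs | X1 (Years) | X2 (New Price) | X3 (Mileage) | X4 (Title) | Y (Calling Price) |
|-----|------------|----------------|--------------|------------|-------------------|
| 1   |            |                |              |            |                   |
| 2   |            |                |              |            |                   |
| 3   |            |                |              |            |                   |
| ··· | ···        | ···            | ···          | ···        | ···               |
| n   |            |                |              |            |                   |
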
3. The Models: Specify the hypothesized models you want to apply. In this part,
you are expected to complete the first four objectives stated above. Hint: The first
model you fit might not be ideal; you might need to improve it by selecting
variables, changing the order of the model, considering interaction effects, etc.
You need to compare all the models you fitted and explain why your chosen model is the
best by checking the nested-model F-test, the t-tests on the important β’s, and the
2s values. (150pts)
4. Assumption Check: In this part, complete objective 5 stated above.
5. Model Remedy (optional): In this part, apply a transformation to y or x so that
the model assumptions are satisfied, and write down the new model and your conclusion. (30pts)
6. Model Adequacy: In this part, you are expected to complete the last objective
stated above. (40pts)
7. Conclusion: Give a brief summary of your study. (40pts)
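The descriptive plots, assumption check, remedy and adequacy parts of the format above might look roughly like the following in base R. This is a hedged sketch, not a template: the data are simulated placeholders, the column names and the two-predictor "best" model are hypothetical, and the log transformation is just one common remedy among several.

```r
# Simulated placeholder data (not real listings); column names are assumptions.
set.seed(3)
n <- 40
cars <- data.frame(
  x1 = sample(1:12, n, replace = TRUE),
  x3 = runif(n, 20000, 150000)
)
cars$y <- 20000 - 900 * cars$x1 - 0.03 * cars$x3 + rnorm(n, sd = 800)

best <- lm(y ~ x1 + x3, data = cars)   # stands in for your chosen best model
res <- resid(best)

pdf(NULL)   # discards the plots in a batch run; drop this line to view them

# The Data: scatter plot of a predictor against Y, and a histogram of Y.
plot(cars$x1, cars$y, xlab = "X1 (Years)", ylab = "Y (Calling Price)")
hist(cars$y, main = "Calling price", xlab = "Y")

# Assumption check: residuals vs. fitted values, and a normal Q-Q plot.
plot(fitted(best), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
qqnorm(res)
qqline(res)
invisible(dev.off())

# Model remedy (optional): refit with log(y) as the response.
remedied <- lm(log(y) ~ x1 + x3, data = cars)

# Model adequacy: global F-test p-value, adjusted R-squared, and 2s.
s <- summary(best)
f <- s$fstatistic
global_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
adj_r2 <- s$adj.r.squared
two_s <- 2 * s$sigma
print(c(global_p = global_p, adj_r2 = adj_r2, two_s = two_s))
```

If a transformation is used, the residual plots should be redrawn for the transformed model before drawing conclusions.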
3 Others
This project is composed of 7 parts (see 2 Format of Your Work). Part 5 is optional,
and you will gain a 30-pt bonus if you do a good job there. The total is 400pts
+ 30pts bonus.
All analysis must be done in R. You need to work independently: helping each other
understand the concepts is fine, but you should collect your data and write your code
by yourself. A project report that does not include the analysis and R code can earn
a maximum of 200 pts. If you are contemplating an ethical failure, please read the Code
of Student Academic Integrity, https://legal.uncc.edu/policies/up-407, so you can
plan for the consequences.