13 Dummy Variables Part 1
This chapter serves as a bridge between working with continuous explanatory variables and binary explanatory variables, often called indicator variables or dummy variables. A dummy variable is a variable that can have only two values, 0 and 1. A value of 1 indicates that the condition is true, while a value of 0 indicates that the condition is false. The “condition” indicated by the variable can be based on qualitative information (e.g., being a particular race, ethnicity, gender, or nationality), quantitative information (e.g., being over the age of 65, living in a house with more than 5 people), or some mix of the two (e.g., having a particular degree instead of years of education). As an example, if we have data on individuals from different US States, we might have a dummy variable WI
, where WI=1
indicates the person is from Wisconsin and WI=0
indicates they are NOT from Wisconsin. It is always important to understand the data to understand what the “NOT” condition means. For example, if the data includes people from Illinois and Wisconsin, WI=0
indicates the person is from Illinois. If the data contains people from all 50 states, than WI=0
indicates they are from any state other than Wisconsin.
This chapter uses the data in HousePriceDummies.csv
with price of the house in dollars (price
), size of the hose in square feet (size
), number of bathrooms (baths
), and number of bedrooms (beds
). This chapter serves as a bridge between working with continuous explanatory variables and dummy variables because you will construct a dummy variable from a variable you have previously used as a continuous variable (the number of bathrooms).
From this chapter you will learn about intercept dummies that allow the average \(y\) value to differ by group by the same amount for all values of the other explanatory variables (this should sound familiar from week 1 with houses that have a garage). Intercept dummies are the most common use of dummy variables. You’ll create an intercept dummy from a numerical variable (baths
) that only has two possible values, so while you’re working with a dummy variable, you’re actually just working with regression models you should already understand. This hopefully helps you better understand how to work with all dummy variables, including those based on qualitative information, not just quantitative information like the number of bathrooms. You will also learn about slope dummies that allow the slope to vary by group (slope dummies are actually just a dummy variable multiplied by a continuous variable).
Wherever you see qCnt()
in the RMD file you’ll see a number in the HTML output. qCnt()
is a counter (defined in the code chunk above) that is putting what are essentially question numbers in the HTML output. This makes it easier to discuss with others because we can refer to the different questions by number.
As you go through the file, you need to do something (add code, short answers) for each question everywhere it says YOUR ANSWER HERE. This is what I will be checking for when I check off your work. We will start out working on this together in class, although you maybe need to finish on your own.
Let’s jump in…
13.1 Data
## Load Data
<- read.csv("https://raw.githubusercontent.com/LhostEcon/380data/main/HousePriceDummies.csv") mydata
Summary statistics
stargazer(mydata, type = "html",summary.stat = c("mean","sd", "min", "p25", "median", "p75", "max"))
Statistic | Mean | St. Dev. | Min | Pctl(25) | Median | Pctl(75) | Max |
price | 348,436.100 | 94,150.230 | 109,904 | 285,144 | 347,862 | 408,484.2 | 611,257 |
size | 1,938.569 | 493.908 | 877 | 1,536.8 | 1,910.5 | 2,323.8 | 2,965 |
beds | 2.630 | 0.999 | 1 | 2 | 3 | 3 | 4 |
baths | 1.528 | 0.500 | 1 | 1 | 2 | 2 | 2 |
13.2 Size only
1) In this first code chunk, first estimate the simple linear regression model of price on size and store it in modelS
(that’s “model” and a capital letter “S” for “Size”). Then display modelS
using pander. Next, add a variable yHatS
to the mydata
data frame with the predicted price from modelS
. Yes, R has a fitted()
function to do this for you, but I want you to calculate it manually using:
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 size \]
Finally, create a scatter plot of the data using the number of bathrooms (baths
) as a factor for the color (we need factor()
so it treats baths
as distinct integers, 1 and 2, instead of a continuous variable that could have values like 1.234). We also include yHatS
as a scatterplot and as a line (also using yHatS
). Make sure you understand why all of the yHatS
points are on the yHatS
line. Throughout this chapter we’ll use an “x” symbol (ggplot’s shape=4
) to display the data and dots (i.e., filled-in circles, ggplot’s shape=19
, which is also it’s default) to display predicted prices (i.e., yHatS
).
I filled this first one in for you (feel free to copy/paste and use this as a basis for what you do).
<- lm(price~size,data=mydata)
modelS pander(summary(modelS))
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 49174 | 15291 | 3.216 | 0.001502 |
size | 154.4 | 7.645 | 20.19 | 1.826e-51 |
Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
---|---|---|---|
216 | 55363 | 0.6558 | 0.6542 |
$yHatS <- coef(modelS)["(Intercept)"] + coef(modelS)["size"] * mydata$size
mydataggplot(mydata) +
geom_point(aes(y=price,x=size,col=factor(baths)),shape=4) +
geom_point(aes(y=yHatS,x=size,col=factor(baths))) +
geom_line(aes(y=yHatS,x=size),col="black")
Note how we used coef(modelS)["(Intercept)"]
to get the value of the intercept and coef(modelS)["size"]
to get the value of the coefficient on size
. We do NOT look at the output and type the value. Hard-coding the value (i.e., typing it) leads to errors, both from typos and from not updating values if the data or model change (thus changing the coefficients). Make sure to do this throughout anywhere you need to use the values of regression coefficients.
2) A big part of our focus in this chapter is the regression lines, so let’s be more explicit about plotting the line. The line we plotted above used geom_line(aes(y=yHatS,x=size))
. We could plot this same line using geom_smooth
, but later we’re going to plot lines that don’t work easily with geom_smooth
. Instead, we’re going to use use geom_abline()
to plot a line using its intercept and slope. Recall that yHatS
is:
\[
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 size
\]
so we need to use \(\hat{\beta}_0=\) 49173.68 for geom_abline
’s intercept
argument and \(\hat{\beta}_1=\) 154.37 for geom_abline
’s slope
argument. Note that when you use coef(modelS)["(Intercept)"]
and coef(modelS)["size"]
for the intercept and slope arguments of geom_abline
, you don’t include the format()
function, or it’s arguments (digits=2,nsmall=2
), which were just used to display the values in this paragraph in a nicely-formatted way.
We’ll also expand the axes limits so we can see the y intercepts; to do this, we’ll include:
scale_x_continuous(expand = c(0, 0),limits = c(0, max(mydata$size)*1.02),breaks = seq(0,max(mydata$size)*1.02,500))
and
scale_y_continuous(expand = c(0, 0),limits = c(0, max(mydata$price)*1.02), breaks = seq(0,max(mydata$price)*1.02,50000))
ggplot(mydata) +
scale_x_continuous(expand = c(0, 0),limits = c(0, max(mydata$size)*1.02),breaks = seq(0,max(mydata$size)*1.02,500)) +
scale_y_continuous(expand = c(0, 0),limits = c(0, max(mydata$price)*1.02), breaks = seq(0,max(mydata$price)*1.02,50000)) +
geom_point(aes(y=price,x=size,col=factor(baths)),shape=4) +
geom_point(aes(y=yHatS,x=size,col=factor(baths))) +
geom_abline(aes(intercept = coef(modelS)["(Intercept)"], slope = coef(modelS)["size"]))
3) The slope of yHatS
is the effect of size (of an additional \(ft^2\)) on the predicted price from the model that only controls for size. What do you think will happen to the effect of size on the predicted price when we also control for baths?
The effect of size on the predicted price when we also control for baths will decrease.
13.3 Number of bathrooms and size
4) Now add the number of bathrooms (baths
) as a variable to the regression (in addition to size) and store the model as modelSB
(“model” with “S” for size and “B” for baths). Display the output using pander. Add a variable yHatSB
to mydata
with the predicted prices from this model. Remember that for this model (the “SB” model that includes both size
and baths
), predicted prices are given by:
\[
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 size + \hat{\beta}_2 baths
\]
Hint: Start by copy/pasting the code above where you estimated modelS
and added yHatS
to mydata
. However, make sure to change the R object references when you use them, e.g., change modelS
to modelSB
. If you don’t, you’ll either have the wrong values (e.g., use the coefficient on size
from the wrong model), or get an error (e.g., try to use the coefficient on baths
from modelS
that doesn’t include baths
). Also make sure to change variable names in the expression calculating yHat (e.g., make sure the size
coefficient is multiplied by the size
variable, and the baths
coefficient is multiplied by the baths
variable…otherwise you might not notice here, but later your graph might not work).
<- lm(price ~ size + baths, data=mydata)
modelSB pander(summary(modelSB))
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 20004 | 15384 | 1.3 | 0.1949 |
size | 136.3 | 7.944 | 17.16 | 5.014e-42 |
baths | 42005 | 7841 | 5.357 | 0.0000002187 |
Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
---|---|---|---|
216 | 52094 | 0.6967 | 0.6938 |
$yHatSB <- coef(modelSB)["(Intercept)"] + coef(modelSB)["size"] * mydata$size + coef(modelSB)["baths"] * mydata$baths mydata
5) Now let’s add yHatSB
to the graph as a scatterplot. Copy the last graph you made above, remove the geom_point()
of yHatS
, and add a geom_point()
of yHatSB
. Make the color “orange” for all the yHatSB
points (i.e., the new geom_point()
should be this: geom_point(data=mydata,aes(y=yHatSB,x=size),col="orange")
). By “copy the last graph you made”, I do mean that you should leave the geom_abline
that plots the modelS
regression line from the previous graph (don’t change it to modelSB
), and leave the axes starting at (0,0).
ggplot(mydata) +
scale_x_continuous(expand = c(0, 0),limits = c(0, max(mydata$size)*1.02),breaks = seq(0,max(mydata$size)*1.02,500)) +
scale_y_continuous(expand = c(0, 0),limits = c(0, max(mydata$price)*1.02), breaks = seq(0,max(mydata$price)*1.02,50000)) +
geom_point(aes(y=price,x=size,col=factor(baths)),shape=4) +
geom_point(aes(y=yHatSB,x=size), col = "orange") +
geom_abline(aes(intercept = coef(modelS)["(Intercept)"], slope = coef(modelS)["size"]))
6) It looks like there are two upward-sloping parallel rows of yHatSB
predicted prices. What accounts for the general upward slope of the yHatSB
predicted prices? Why are there two rows (try looking at a count()
of baths
to help you answer this part of the question)?
%>% count(baths) mydata
## baths n
## 1 1 102
## 2 2 114
There are two “levels” of baths, 1 or 2. Each level or category of bath has a different intercept. It’s most likely that having two baths will mean a higher house price than just one, so it’s intercept is higher, although the slope is the same because size is still affecting price at the same rate.
7) Look at the graph you just made. Notice that the yHatS
line doesn’t go straight through the middle of the two rows of yHatSB
predicted prices. What is the slope of the yHatS line (the answer is one of the coefficients estimated above)? What is the slope of the two rows of yHatSB predicted prices (the answer is a coefficient estimated above)? Which slope is steeper? What accounts for the difference between these different slopes?
The slope of the yHatS line is 154.4. The slope for the two rows of yHatSB is 136.3. This is because adding baths as an explanatory variable reduces the magnitude of the effect that size has on price, as baths were previously an unobservable variable.
8) How far apart vertically in the y direction (the price direction) are the two rows of yHatSB
predicted prices? Why?
42005 (dollars). This is the effect that number of bathrooms has on price. Adding one more bathroom will increase the price by $42005, making the top line (representing two bathrooms), 42005 higher than the bottom line (representing one bathroom).
9) Now you’re going to create a new version of the graph above, except all the points on the lower row of orange dots are going to be red (and connected by a red line) and all the points on the upper row of orange dots are going to be blue (and connected by a blue line). Copy the code from the previous graph as the basis for making the graph. But before creating the new graph, you’ll need to create a few new variables that you can use to replace parts of the original graph. Here’s what you need to do:
Using only the coefficients from modelSB
and the size variable (and simple arithmetic), generate a variable name yHatSB1
that when you plot it, replaces the lower row of yHatSB
predicted price points. Make sure that these points are only created for observations with 1 bathroom and are NA
for other observation (I’d use ifelse()
for baths==1
). In other words, do something like this:
mydata$yHatSB1 <- ifelse(mydata$baths == 1, coef(modelSB)["(Intercept)"] + coef(modelSB)["baths"]*1 + coef(modelSB)["size"]*mydata$size, NA)
Add these yHatSB1
points to the graph as geom_point()
and make them dots red. Also add a geom_abline()
that goes through this row of dots and make it a red line. To do this, think about what part(s) of the yHatSB1 line are the intercept and what part(s) are the slope on a graph that has size on the x axis.
Do the same thing for the upper row. In other words, also using only the coefficients from modelSB
and the size variable (and simple arithmetic), generate a variable name yHatSB2
that when you plot it, replaces the upper row of yHatSB
predicted price points. Make sure that these points are only created for observations with 2 bathrooms and are NA
for other observation (use a similar ifelse()
but for baths==2
and calculating the correct yHat values for 2 bathrooms). Add these to the graph as geom_point()
and make these dots blue. Also add a geom_abline()
that goes through this row of dots and make this line blue.
In addition, remove the orange yHatSB
points you added before (because you’ve replaced them with red points and blue points).
In mine, I also labeled the y-intercepts of the three lines (the black line that connects the yHatS
points, the red line that connects the yHatSB1
points, and the blue line that connects the yHatSB2
points). To do this, you modify the breaks
argument of scale_y_continuous()
. Try it if you have time, but don’t waste much time trying to figure this out. Either way, you should understand how the intercepts correspond with coefficients from the models.
$yHatSB1 <- ifelse(mydata$baths == 1, coef(modelSB)["(Intercept)"] + coef(modelSB)["baths"]*1 + coef(modelSB)["size"]*mydata$size, NA)
mydata$yHatSB2 <- ifelse(mydata$baths == 2, coef(modelSB)["(Intercept)"] + coef(modelSB)["baths"]*2 + coef(modelSB)["size"]*mydata$size, NA)
mydata
ggplot(mydata) +
scale_x_continuous(expand = c(0, 0),limits = c(0, max(mydata$size)*1.02),breaks = seq(0,max(mydata$size)*1.02,500)) +
scale_y_continuous(expand = c(0, 0),limits = c(0, max(mydata$price)*1.02), breaks = seq(0,max(mydata$price)*1.02,50000)) +
geom_point(aes(y=price,x=size,col=factor(baths)),shape=4) +
geom_point(aes(y=yHatSB1,x=size), col = "red") +
geom_abline(aes(intercept = coef(modelSB)["(Intercept)"] + coef(modelSB)["baths"]*1, slope = coef(modelSB)["size"]), color = "red") +
geom_point(aes(y=yHatSB2,x=size), col = "blue") +
geom_abline(aes(intercept = coef(modelSB)["(Intercept)"] + coef(modelSB)["baths"]*2, slope = coef(modelSB)["size"]), color = "blue") +
geom_abline(aes(intercept = coef(modelS)["(Intercept)"], slope = coef(modelS)["size"]))
## Warning: Removed 114 rows containing missing values (`geom_point()`).
## Warning: Removed 102 rows containing missing values (`geom_point()`).
10) Using ifelse()
, create two dummy variables, baths1
and baths2
, and add them to mydata
. The variable baths1
equals 1 for houses with 1 bathroom and equals 0 otherwise. The variable baths2
equals 1 for houses with 2 bathroom and equals 0 otherwise. Make sure to look at the data after creating the variables to make sure you did it correctly (e.g., use head()
)! Calculate the mean of baths1
and baths2
. What does the mean of baths1
tell us? What about the mean of baths2
?
$baths1 <- ifelse(mydata$baths == 1, 1, 0)
mydata$baths2 <- ifelse(mydata$baths == 2, 1, 0)
mydata
head(mydata)
## price size beds baths yHatS yHatSB yHatSB1 yHatSB2 baths1 baths2
## 1 427923 2211 3 2 390491.9 405408.3 NA 405408.3 0 1
## 2 270778 1296 2 2 249240.8 280679.1 NA 280679.1 0 1
## 3 329174 1525 2 2 284592.2 311895.4 NA 311895.4 0 1
## 4 537281 2913 4 2 498861.6 501102.2 NA 501102.2 0 1
## 5 275112 903 1 1 188572.3 185102.1 185102.1 NA 1 0
## 6 390832 2302 4 2 404539.8 417813.1 NA 417813.1 0 1
mean(mydata$baths1)
## [1] 0.4722222
mean(mydata$baths2)
## [1] 0.5277778
The means tell us what percentage of the houses in our data set have one bathroom or two. 47.22% have one bathroom and 52.78% have two bathrooms.
11) Try estimating a regression (with price
as the y variable) that includes size
, baths1
, and baths2
. Call it model12
. Display the output using pander, but also display coef(model12)
. Why is part of coef(model12)
NA? Why? Hint: which of the 4 MLR assumptions is violated?
<- lm(price ~ size + baths1 + baths2, data=mydata)
model12 pander(summary(model12))
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 104013 | 17658 | 5.89 | 0.00000001487 |
size | 136.3 | 7.944 | 17.16 | 5.014e-42 |
baths1 | -42005 | 7841 | -5.357 | 0.0000002187 |
Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
---|---|---|---|
216 | 52094 | 0.6967 | 0.6938 |
coef(model12)
## (Intercept) size baths1 baths2
## 104013.3696 136.3161 -42004.7326 NA
Knowing there is a 0 in baths1 tells us something about baths2. This violates the colinearity assumption.
12) Since we cannot include both baths1
and baths2
in the regression, lets try again without baths1
. Estimate a model (name it modelInterceptDummy
) that includes size
and baths2
, but leave out baths1
. Display the results in a stargazer table next to the results of modelSB
. What is the interpretation of \(\hat{\beta}_0\), \(\hat{\beta}_1\), and \(\hat{\beta}_2\) in modelInterceptDummy
)?
How do these coefficients compare/correspond with coefficients from modelSB
? Specifically, how does the effect of size compare? What coefficient(s) from each model give you the intercepts for one and two bathroom houses (i.e., the average price when size is 0 for houses with 1 bathroom and for houses with 2 bathrooms)?
<- lm(price ~ size + baths2, data=mydata) modelInterceptDummy
library(stargazer)
stargazer(modelInterceptDummy, modelSB,
type = "html",
report=('vc*p'),
keep.stat = c("n","rsq","adj.rsq"),
notes = "<em>*p<0.1;**p<0.05;***p<0.01</em>",
notes.append = FALSE)
Dependent variable: | ||
price | ||
(1) | (2) | |
size | 136.316*** | 136.316*** |
p = 0.000 | p = 0.000 | |
baths2 | 42,004.730*** | |
p = 0.00000 | ||
baths | 42,004.730*** | |
p = 0.00000 | ||
Constant | 62,008.640*** | 20,003.900 |
p = 0.00004 | p = 0.195 | |
Observations | 216 | 216 |
R2 | 0.697 | 0.697 |
Adjusted R2 | 0.694 | 0.694 |
Note: | *p<0.1;**p<0.05;***p<0.01 |
In modelInterceptDummy
, \(\hat{\beta}_0\) is the average estimated price for a house that has 0 square feet and one bath. It is statistically significant at the 1% level.
\(\hat{\beta}_1\) is the average estimated increase in price for every square foot increase in size, if the number of bathrooms is held constant. It is also statistically significant at the 1% level.
\(\hat{\beta}_2\) is the difference in price for a house with one bathroom and one with two, when size is held constant. It is statistically significant at the 1% level.
The models are very similar, in that the effect of size does not change. The difference is the way they express the effect of baths. In modelSB
, \(\hat{\beta}_0\) would be the average expected price of a house with 0 square feet and 0 bathrooms. \(\hat{\beta}_2\) is not the difference in price for a house with one bathroom and one with two, but is the average expected increase in price for every additional bathroom a house has.
For modelSB
, \(\hat{\beta}_0\) + \(\hat{\beta}_2\) gives you the average price for a house with size 0 and 1 bathroom. \(\hat{\beta}_0\) + \(\hat{\beta}_2 *2\) gives you the average price for a house with size 0 and 2 bathrooms. In contrast, for modelInterceptDummy
, \(\hat{\beta}_0\) gives you the average price for a house with size 0 and 1 bathroom. \(\hat{\beta}_0\) + \(\hat{\beta}_2\) gives you the average price for a house with size 0 and 2 bathrooms.
13) Create the same graph you created above with the red and blue lines, except modify the geom_abline()
layers that use coefficients from modelSB
so that they use modelInterceptDummy
instead. Leave everything else as it is in the previous graph (e.g., leave the black line geom_abline()
that uses modelS
, leave the geom_point()
using yHatSB1
and yHatSB2
). The graph itself should look identical (the two models are identical because the only possible values of baths
are 1 and 2). Make sure that your red line (using geom_abline()
based on modelInterceptDummy
coefficients) actually goes through the red points (the geom_point()
based on yHatSB1
) and make sure that your blue line (using geom_abline()
based on modelInterceptDummy
coefficients) actually goes through the blue points (the geom_point()
based on yHatSB2
).
ggplot(mydata) +
scale_x_continuous(expand = c(0, 0),limits = c(0, max(mydata$size)*1.02),breaks = seq(0,max(mydata$size)*1.02,500)) +
scale_y_continuous(expand = c(0, 0),limits = c(0, max(mydata$price)*1.02), breaks = seq(0,max(mydata$price)*1.02,50000)) +
geom_point(aes(y=price,x=size,col=factor(baths)),shape=4) +
geom_point(aes(y=yHatSB1,x=size), col = "red") +
geom_abline(aes(intercept = coef(modelInterceptDummy)["(Intercept)"], slope = coef(modelInterceptDummy)["size"]), color = "red") +
geom_point(aes(y=yHatSB2,x=size), col = "blue") +
geom_abline(aes(intercept = coef(modelInterceptDummy)["(Intercept)"] + coef(modelInterceptDummy)["baths2"]*1, slope = coef(modelInterceptDummy)["size"]), color = "blue") +
geom_abline(aes(intercept = coef(modelS)["(Intercept)"], slope = coef(modelS)["size"]))
## Warning: Removed 114 rows containing missing values (`geom_point()`).
## Warning: Removed 102 rows containing missing values (`geom_point()`).
13.4 Slope dummy
14) Think about what model would allow for the slope (with respect to size) to be different for 1 and 2 bathroom houses (but for the intercept to be the same). Write out the equation for this model the way equations were written out using latex code above (e.g., using \(\beta_1\) etc).
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1size + \hat{\beta}_2 baths2 * size \]
Create any new variables you need to create, and then estimate the model and call it modelSlopeDummy
.
<- mydata %>%
mydata mutate(bathsize = baths2 * size)
<- lm(price ~ size + bathsize, data = mydata)
modelSlopeDummy
pander(modelSlopeDummy)
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 82311 | 15618 | 5.27 | 0.0000003332 |
size | 124.7 | 9.032 | 13.81 | 2.094e-31 |
bathsize | 21.57 | 3.984 | 5.415 | 0.0000001647 |
Display the results of modelS
, modelInterceptDummy
, and modelSlopeDummy
side-by-side using stargazer.
stargazer(modelS, modelInterceptDummy, modelSlopeDummy,
type = "html",
report=('vc*p'),
keep.stat = c("n","rsq","adj.rsq"),
notes = "<em>*p<0.1;**p<0.05;***p<0.01</em>",
notes.append = FALSE)
Dependent variable: | |||
price | |||
(1) | (2) | (3) | |
size | 154.373*** | 136.316*** | 124.733*** |
p = 0.000 | p = 0.000 | p = 0.000 | |
baths2 | 42,004.730*** | ||
p = 0.00000 | |||
bathsize | 21.572*** | ||
p = 0.00000 | |||
Constant | 49,173.680*** | 62,008.640*** | 82,310.930*** |
p = 0.002 | p = 0.00004 | p = 0.00000 | |
Observations | 216 | 216 | 216 |
R2 | 0.656 | 0.697 | 0.697 |
Adjusted R2 | 0.654 | 0.694 | 0.695 |
Note: | *p<0.1;**p<0.05;***p<0.01 |
Create a graph similar to what you did above, except using this new model. Start with the previous graph and make the following changes:
Remove the black line based on
modelS
.Remove the red
geom_point()
based onyHatSB1
and replace it with redgeom_point()
based onmodelSlopeDummy
(I suggest creating ayHatSlopeDummy1
similar to how you createdyHatSB1
).Remove the blue
geom_point()
based onyHatSB2
and replace it with bluegeom_point()
based onmodelSlopeDummy
(I suggest creating ayHatSlopeDummy2
similar to how you createdyHatSB2
).Remove the red
geom_abline()
based onmodelInterceptDummy
and replace it with a redgeom_abline()
based on the coefficients frommodelSlopeDummy
.Remove the blue
geom_abline()
based onmodelInterceptDummy
and replace it with a redgeom_abline()
based on the coefficients frommodelSlopeDummy
.
NOTE: you rarely want to estimate a model with a slope dummy unless you also have the corresponding intercept dummy (see the next question for that model), but I’m having you do so now because it’s your first one and you need to learn how to do so.
$yHatSlopeDummy1 <- ifelse(mydata$baths == 1, coef(modelSlopeDummy)["(Intercept)"] + coef(modelSlopeDummy)["size"]*mydata$size, NA)
mydata$yHatSlopeDummy2 <- ifelse(mydata$baths == 2, coef(modelSlopeDummy)["(Intercept)"] + coef(modelSlopeDummy)["size"]*mydata$size + coef(modelSlopeDummy)["bathsize"]*mydata$bathsize, NA)
mydata
ggplot(mydata) +
scale_x_continuous(expand = c(0, 0),limits = c(0, max(mydata$size)*1.02),breaks = seq(0,max(mydata$size)*1.02,500)) +
scale_y_continuous(expand = c(0, 0),limits = c(0, max(mydata$price)*1.02), breaks = seq(0,max(mydata$price)*1.02,50000)) +
geom_point(aes(y=price,x=size,col=factor(baths)),shape=4) +
geom_point(aes(y=yHatSlopeDummy1,x=size), col = "red") +
geom_abline(aes(intercept = coef(modelSlopeDummy)["(Intercept)"], slope = coef(modelSlopeDummy)["size"]), color = "red") +
geom_point(aes(y=yHatSlopeDummy2,x=size), col = "blue") +
geom_abline(aes(intercept = coef(modelSlopeDummy)["(Intercept)"], slope = coef(modelSlopeDummy)["size"] + coef(modelSlopeDummy)["bathsize"]), color = "blue")
## Warning: Removed 114 rows containing missing values (`geom_point()`).
## Warning: Removed 102 rows containing missing values (`geom_point()`).
13.5 Intercept and slope dummies
15) Estimate a model that allows for both the intercept and the slope (with respect to size) to be different for 1 and 2 bathroom houses.
Write out the equation for this model the way equations were written out using latex code above (e.g., using \(\beta_1\), etc).
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1size + \hat{\beta}_2 baths2 + \hat{\beta}_3baths2 * size \]
Create any new variables you need to create, if you need to create any, and then estimate the model and call it modelSlopeAndInterceptDummies
.
<- lm(price ~ size + baths2 + bathsize, data = mydata)
modelSlopeAndInterceptDummies
pander(modelSlopeAndInterceptDummies)
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 75489 | 21120 | 3.574 | 0.0004349 |
size | 128.5 | 11.92 | 10.77 | 7.231e-22 |
baths2 | 15120 | 31443 | 0.4809 | 0.6311 |
bathsize | 14.12 | 16 | 0.8829 | 0.3783 |
Display the results of the four models (modelS
, modelInterceptDummy
, modelSlopeDummy
, and modelSlopeAndInterceptDummies
) side-by-side using stargazer. For this part though, do it below the graph instead of here.
Create a graph of this model by following the same steps you followed above to create the graph of the slope dummy model.
$yHatInterceptSlopeDummy1 <- ifelse(mydata$baths == 1, coef(modelSlopeAndInterceptDummies)["(Intercept)"] + coef(modelSlopeAndInterceptDummies)["size"]*mydata$size, NA)
mydata
$yHatInterceptSlopeDummy2 <- ifelse(mydata$baths == 2, coef(modelSlopeAndInterceptDummies)["(Intercept)"] + coef(modelSlopeAndInterceptDummies)["baths2"]*mydata$baths2 + coef(modelSlopeAndInterceptDummies)["size"]*mydata$size + coef(modelSlopeAndInterceptDummies)["bathsize"]*mydata$bathsize, NA)
mydata
ggplot(mydata) +
scale_x_continuous(expand = c(0, 0),limits = c(0, max(mydata$size)*1.02),breaks = seq(0,max(mydata$size)*1.02,500)) +
scale_y_continuous(expand = c(0, 0),limits = c(0, max(mydata$price)*1.02), breaks = seq(0,max(mydata$price)*1.02,50000)) +
geom_point(aes(y=price,x=size,col=factor(baths)),shape=4) +
geom_point(aes(y=yHatInterceptSlopeDummy1,x=size), col = "red") +
geom_abline(aes(intercept = coef(modelSlopeAndInterceptDummies)["(Intercept)"], slope = coef(modelSlopeAndInterceptDummies)["size"]), color = "red") +
geom_point(aes(y=yHatInterceptSlopeDummy2, x=size), col = "blue") +
geom_abline(aes(intercept = coef(modelSlopeAndInterceptDummies)["(Intercept)"] + coef(modelSlopeAndInterceptDummies)["baths2"], slope = coef(modelSlopeAndInterceptDummies)["size"] + coef(modelSlopeAndInterceptDummies)["bathsize"]), color = "blue")
## Warning: Removed 114 rows containing missing values (`geom_point()`).
## Warning: Removed 102 rows containing missing values (`geom_point()`).
16) Display the stargazer table comparing the four models (modelS
, modelInterceptDummy
, modelSlopeDummy
, and modelSlopeAndInterceptDummies
).
stargazer(modelS, modelInterceptDummy, modelSlopeDummy, modelSlopeAndInterceptDummies,
type = "html",
report=('vc*p'),
keep.stat = c("n","rsq","adj.rsq"),
notes = "<em>*p<0.1;**p<0.05;***p<0.01</em>",
notes.append = FALSE)
Dependent variable: | ||||
price | ||||
(1) | (2) | (3) | (4) | |
size | 154.373*** | 136.316*** | 124.733*** | 128.467*** |
p = 0.000 | p = 0.000 | p = 0.000 | p = 0.000 | |
baths2 | 42,004.730*** | 15,120.010 | ||
p = 0.00000 | p = 0.632 | |||
bathsize | 21.572*** | 14.123 | ||
p = 0.00000 | p = 0.379 | |||
Constant | 49,173.680*** | 62,008.640*** | 82,310.930*** | 75,489.050*** |
p = 0.002 | p = 0.00004 | p = 0.00000 | p = 0.0005 | |
Observations | 216 | 216 | 216 | 216 |
R2 | 0.656 | 0.697 | 0.697 | 0.698 |
Adjusted R2 | 0.654 | 0.694 | 0.695 | 0.694 |
Note: | *p<0.1;**p<0.05;***p<0.01 |
Using the estimated coefficients from the four models, write out the equation for the predicted prices for each of the four models, followed by the conditional expectations for each of the four models for one bathroom houses, followed by the conditional expectations for each of the four models for one bathroom houses. I filled in each of these for the first model (the one with size only) for you so you can see what I’m asking you to do. You can copy/paste what I did and then modify it for the subsequent models (modelInterceptDummy
, modelSlopeDummy
, and modelSlopeAndInterceptDummies
). Round all coefficients to 0 decimal places they way I did for modelS
.
Predicted prices from the four models: \[ \begin{aligned} \widehat{price} &= 49174 + 154 \cdot size \\ \widehat{price} &= 62009 + 42005 \cdot baths2 + 136 \cdot size \\ \widehat{price} &= 82311 + 22 \cdot baths2 \cdot size + 125 \cdot size \\ \widehat{price} &= 75489 + 15120 \cdot baths2 + 14 \cdot baths2 \cdot size + 128 \cdot size \end{aligned} \]
Expected prices for 1 bathroom (not-two bathroom) houses \[ \begin{aligned} E(price|size,baths2=0) &= 49174 + 154 \cdot size \\ E(price|size,baths2=0) &= 62009 + 136 \cdot size \\ E(price|size,baths2=0) &= 82311 + 125 \cdot size \\ E(price|size,baths2=0) &= 75489 + 128 \cdot size \end{aligned} \]
Expected prices for two bathroom houses
\[ \begin{aligned} E(price|size,baths2=1) &= 49174 + 154 \cdot size \\ E(price|size,baths2=1) &= 62009 + 42005+ 136 \cdot size \\ E(price|size,baths2=1) &= 82311 + 22 \cdot size + 125 \cdot size \\ E(price|size,baths2=1) &= 75489 + 15120 + 14 \cdot size + 128 \cdot size \end{aligned} \]
17) What do you notice about the intercepts and the slopes? Think about what variation each model allows and what restrictions it imposes. Why are the intercepts furthest out for the model with the intercept dummy (modelInterceptDummy
), at one point in the middle for the model with the slope dummy, and in between for the model with both the intercept and slope dummies? How does that relate to the estimated slopes? How does that relate to the model that only includes size? I’d start by comparing the graphs of the models.
The slope model forces the intercepts to be the same; for a house of size 0, the price is the same for both bathrooms, but it allows for the number of bathrooms to affect the relationship between size and price. The intercept model forces the slopes to be the same; the effect size has on price does not depend on the number of bathrooms. It does allow for the intercept to be different; for a house of size 0, the price is different for a one bathroom house compared to a two bathroom house. The intercept and slope model allows for both intercept and slope to be different.
13.6 Models with the number of bedrooms
18) To help you gain additional intuition for what’s going on in linear regressions, try estimating a model that includes size and the number of bedrooms (beds
). Make a plot that includes yHat
predicted values. Color the yHat
points based on the number of bedrooms (use 1 bedroom = red, 2 bedroom = orange, 3 bedroom = green, 4 bedroom = blue….why? because people are used to seeing rainbow order, but yellow is hard to see so we skipped it…you can do this by adding scale_color_manual(values=c("red", "orange", "green","blue")
). Why are the yHat
points arranged in rows? How many rows are there? Why? Add geom_abline()
s that connect the rows of dots (using the same colors as the yHat
points).
Note that this is 4 separate geom_abline()
s. It makes for a lot of lines of code. However, remember that once you do one of them, the other three are easily obtained by copy/pasting the first and changing the number of bedrooms and the corresponding color.
<- lm(price ~ size + beds, data = mydata)
modelbeds pander(modelbeds)
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 69942 | 17390 | 4.022 | 0.00008013 |
size | 113.8 | 18.4 | 6.184 | 0.000000003146 |
beds | 22009 | 9103 | 2.418 | 0.01645 |
$yHatbeds1 <- ifelse(mydata$beds == 1,coef(modelbeds)["(Intercept)"] + coef(modelbeds)["size"]*mydata$size + coef(modelbeds)["beds"]*1, NA)
mydata
$yHatbeds2 <- ifelse(mydata$beds == 2,coef(modelbeds)["(Intercept)"] + coef(modelbeds)["size"]*mydata$size + coef(modelbeds)["beds"]*2, NA)
mydata
$yHatbeds3 <- ifelse(mydata$beds == 3,coef(modelbeds)["(Intercept)"] + coef(modelbeds)["size"]*mydata$size + coef(modelbeds)["beds"]*3, NA)
mydata
$yHatbeds4 <- ifelse(mydata$beds == 4,coef(modelbeds)["(Intercept)"] + coef(modelbeds)["size"]*mydata$size + coef(modelbeds)["beds"]*4, NA)
mydata
ggplot(mydata) +
scale_x_continuous(expand = c(0, 0),limits = c(0, max(mydata$size)*1.02),breaks = seq(0,max(mydata$size)*1.02,500)) +
scale_y_continuous(expand = c(0, 0),limits = c(0, max(mydata$price)*1.02), breaks = seq(0,max(mydata$price)*1.02,50000)) +
geom_point(aes(y=yHatbeds1,x=size), col = "red") +
geom_point(aes(y=yHatbeds2,x=size), col = "orange") +
geom_point(aes(y=yHatbeds3,x=size), col = "green") +
geom_point(aes(y=yHatbeds4,x=size), col = "blue") +
geom_abline(aes(intercept = coef(modelbeds)["beds"]*1 + coef(modelbeds)["(Intercept)"], slope = coef(modelbeds)["size"]), color = "red") +
geom_abline(aes(intercept = coef(modelbeds)["beds"]*2 + coef(modelbeds)["(Intercept)"], slope = coef(modelbeds)["size"]), color = "orange") +
geom_abline(aes(intercept = coef(modelbeds)["beds"]*3 + coef(modelbeds)["(Intercept)"], slope = coef(modelbeds)["size"]), color = "green") +
geom_abline(aes(intercept = coef(modelbeds)["beds"]*4 + coef(modelbeds)["(Intercept)"], slope = coef(modelbeds)["size"]), color = "blue") +
geom_point(aes(y=price,x=size, color = factor(beds)), shape=4, alpha = .4) +
scale_color_manual(values=c("red", "orange", "green","blue"))
## Warning: Removed 184 rows containing missing values (`geom_point()`).
## Warning: Removed 150 rows containing missing values (`geom_point()`).
## Warning: Removed 148 rows containing missing values (`geom_point()`).
## Warning: Removed 166 rows containing missing values (`geom_point()`).
There are 4 lines for each amount of bedrooms in the data set. The lines are arranged in rows because we are only allowing for different intercepts between different amounts of bedrooms, not different slopes. We are also forcing the difference in intercept between 1 and 2 rooms, 2 and 3, and 3 and 4 rooms to be the same.
19) Now try changing the previous model to include factor(beds)
instead of beds
(alternatively, create a dummy variable for 2 bedroom houses, 3 bedroom houses, and 4 bedroom houses, and include these three dummy variables in the model). Make a plot that includes yHat
predicted values. Color the yHat
points based on the number of bedrooms (using the same colors as above). Why are the yHat
points arranged in rows? How many rows are there? Why? Can you add geom_abline()
s that connect the rows of dots? How do the rows of dots (and the geom_abline()
s that connect them) compare with the previous model? Is the effect of going from 1 bedroom to 2, or 2 to 3, or 3 to 4 different in this model than in the first?
$beds1 <- ifelse(mydata$beds == 1, 1, 0)
mydata$beds2 <- ifelse(mydata$beds == 2, 1, 0)
mydata$beds3 <- ifelse(mydata$beds == 3, 1, 0)
mydata$beds4 <- ifelse(mydata$beds == 4, 1, 0)
mydata
<- lm(price ~ size + beds2 + beds3 + beds4, data=mydata)
modelbeddummy pander(modelbeddummy)
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 98518 | 25537 | 3.858 | 0.000152 |
size | 111.7 | 18.75 | 5.956 | 0.00000001066 |
beds2 | 17174 | 13691 | 1.254 | 0.2111 |
beds3 | 39851 | 19369 | 2.057 | 0.04087 |
beds4 | 67649 | 27898 | 2.425 | 0.01616 |
$yhatbeddummy1 <- ifelse(mydata$beds == 1, coef(modelbeddummy)["(Intercept)"] + coef(modelbeddummy)["size"]*mydata$size, NA)
mydata$yhatbeddummy2 <- ifelse(mydata$beds == 2, coef(modelbeddummy)["(Intercept)"] + coef(modelbeddummy)["beds2"]*mydata$beds2 + coef(modelbeddummy)["size"]*mydata$size, NA)
mydata$yhatbeddummy3 <- ifelse(mydata$beds == 3, coef(modelbeddummy)["(Intercept)"] + coef(modelbeddummy)["beds3"]*mydata$beds3 + coef(modelbeddummy)["size"]*mydata$size, NA)
mydata$yhatbeddummy4 <- ifelse(mydata$beds == 4, coef(modelbeddummy)["(Intercept)"] + coef(modelbeddummy)["beds4"]*mydata$beds4 + coef(modelbeddummy)["size"]*mydata$size, NA)
mydata
ggplot(mydata) +
scale_x_continuous(expand = c(0, 0),limits = c(0, max(mydata$size)*1.02),breaks = seq(0,max(mydata$size)*1.02,500)) +
scale_y_continuous(expand = c(0, 0),limits = c(0, max(mydata$price)*1.02), breaks = seq(0,max(mydata$price)*1.02,50000)) +
geom_point(aes(y=yhatbeddummy1,x=size), col = "red") +
geom_point(aes(y=yhatbeddummy2,x=size), col = "orange") +
geom_point(aes(y=yhatbeddummy3,x=size), col = "green") +
geom_point(aes(y=yhatbeddummy4,x=size), col = "blue") +
geom_abline(aes(intercept = coef(modelbeddummy)["(Intercept)"], slope = coef(modelbeddummy)["size"]), color = "red") +
geom_abline(aes(intercept = coef(modelbeddummy)["beds2"] + coef(modelbeddummy)["(Intercept)"], slope = coef(modelbeddummy)["size"]), color = "orange") +
geom_abline(aes(intercept = coef(modelbeddummy)["beds3"] + coef(modelbeddummy)["(Intercept)"], slope = coef(modelbeddummy)["size"]), color = "green") +
geom_abline(aes(intercept = coef(modelbeddummy)["beds4"] + coef(modelbeddummy)["(Intercept)"], slope = coef(modelbeddummy)["size"]), color = "blue") +
geom_point(aes(y=price,x=size, color = factor(beds)), shape=4, alpha = .4) +
scale_color_manual(values=c("red", "orange", "green","blue"))
## Warning: Removed 184 rows containing missing values (`geom_point()`).
## Warning: Removed 150 rows containing missing values (`geom_point()`).
## Warning: Removed 148 rows containing missing values (`geom_point()`).
## Warning: Removed 166 rows containing missing values (`geom_point()`).
This model is similar to the previous in that all lines have the same slope and are parallel because we are not allowing for the amount of bedrooms to affect the effect of size. We also have the same amount of rows as before, one for each number of rooms. However, this model is different because it allows for variation in the differences from 1 to 2, 2 to 3, and 3 to 4 bedrooms. For example, the change in intercept between 1 (red line) bedroom and 2 (orange line) bedrooms is smaller than 2 bedrooms to 3 (green line) bedrooms. As we increase bedrooms, the effect that number of bedrooms has on price increases. (This is all when size is held constant.)
20) Now let’s look at models with the number of bathrooms in addition to size and the number of bedrooms. For this model, include bedrooms as beds
. Try to answer the same questions as the first model with bedrooms. Making a plot that includes yHat
predicted values. Color the yHat
points based on the number of bedrooms (using the same colors as before). Why are the yHat
points arranged in rows? How many rows are there? Why? Add geom_abline()
s that connect the rows of dots (using the same colors).
Note that this is now 8 separate geom_abline()
s, but, as above, this isn’t hard if you copy/paste and just change what needs to be modified.
<- lm(price ~ size + beds + baths, data = mydata)
modelbedsbaths pander(modelbedsbaths)
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 38729 | 17445 | 2.22 | 0.02747 |
size | 101.8 | 17.5 | 5.815 | 0.00000002211 |
beds | 19000 | 8603 | 2.209 | 0.02828 |
baths | 40858 | 7788 | 5.246 | 0.0000003751 |
$yHatbeds1baths1 <- ifelse(mydata$beds == 1 & mydata$baths == 1, coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["size"]*mydata$size + coef(modelbedsbaths)["beds"]*1 + coef(modelbedsbaths)["baths"]*1, NA)
mydata
$yHatbeds2baths1 <- ifelse(mydata$beds == 2 & mydata$baths == 1, coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["size"]*mydata$size + coef(modelbedsbaths)["beds"]*2 + coef(modelbedsbaths)["baths"]*1, NA)
mydata
$yHatbeds3baths1 <- ifelse(mydata$beds == 3 & mydata$baths == 1, coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["size"]*mydata$size + coef(modelbedsbaths)["beds"]*3 + coef(modelbedsbaths)["baths"]*1, NA)
mydata
$yHatbeds4baths1 <- ifelse(mydata$beds == 1 & mydata$baths == 1, coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["size"]*mydata$size + coef(modelbedsbaths)["beds"]*4 + coef(modelbedsbaths)["baths"]*1, NA)
mydata
$yHatbeds1baths2 <- ifelse(mydata$beds == 1 & mydata$baths == 2, coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["size"]*mydata$size + coef(modelbedsbaths)["beds"]*1 + coef(modelbedsbaths)["baths"]*2, NA)
mydata
$yHatbeds2baths2 <- ifelse(mydata$beds == 2 & mydata$baths == 2, coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["size"]*mydata$size + coef(modelbedsbaths)["beds"]*2 + coef(modelbedsbaths)["baths"]*2, NA)
mydata
$yHatbeds3baths2 <- ifelse(mydata$beds == 3 & mydata$baths == 2, coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["size"]*mydata$size + coef(modelbedsbaths)["beds"]*3 + coef(modelbedsbaths)["baths"]*2, NA)
mydata
$yHatbeds4baths2 <- ifelse(mydata$beds == 4 & mydata$baths == 2, coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["size"]*mydata$size + coef(modelbedsbaths)["beds"]*4 + coef(modelbedsbaths)["baths"]*2, NA) mydata
ggplot(mydata) +
scale_x_continuous(expand = c(0, 0),limits = c(0, max(mydata$size)*1.02),breaks = seq(0,max(mydata$size)*1.02,500)) +
scale_y_continuous(expand = c(0, 0),limits = c(0, max(mydata$price)*1.02), breaks = seq(0,max(mydata$price)*1.02,50000)) +
geom_point(aes(y=yHatbeds1baths1,x=size), col = "red") +
geom_point(aes(y=yHatbeds2baths1,x=size), col = "orange") +
geom_point(aes(y=yHatbeds3baths1,x=size), col = "green") +
geom_point(aes(y=yHatbeds4baths1,x=size), col = "blue") +
geom_point(aes(y=yHatbeds1baths2,x=size), col = "pink") +
geom_point(aes(y=yHatbeds2baths2,x=size), col = "purple") +
geom_point(aes(y=yHatbeds3baths2,x=size), col = "brown") +
geom_point(aes(y=yHatbeds4baths2,x=size), col = "black") +
geom_abline(aes(intercept = coef(modelbedsbaths)["beds"]*1 + coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["baths"]*1, slope = coef(modelbedsbaths)["size"]), color = "red") +
geom_abline(aes(intercept = coef(modelbedsbaths)["beds"]*2 + coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["baths"]*1, slope = coef(modelbedsbaths)["size"]), color = "orange") +
geom_abline(aes(intercept = coef(modelbedsbaths)["beds"]*3 + coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["baths"]*1, slope = coef(modelbedsbaths)["size"]), color = "green") +
geom_abline(aes(intercept = coef(modelbedsbaths)["beds"]*4 + coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["baths"]*1, slope = coef(modelbedsbaths)["size"]), color = "blue") +
geom_abline(aes(intercept = coef(modelbedsbaths)["beds"]*1 + coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["baths"]*2, slope = coef(modelbedsbaths)["size"]), color = "pink") +
geom_abline(aes(intercept = coef(modelbedsbaths)["beds"]*2 + coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["baths"]*2, slope = coef(modelbedsbaths)["size"]), color = "purple") +
geom_abline(aes(intercept = coef(modelbedsbaths)["beds"]*3 + coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["baths"]*2, slope = coef(modelbedsbaths)["size"]), color = "brown") +
geom_abline(aes(intercept = coef(modelbedsbaths)["beds"]*4 + coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["baths"]*2, slope = coef(modelbedsbaths)["size"]), color = "black") +
geom_point(aes(y=price,x=size, color = factor(beds), shape = factor(baths)), alpha = .4) +
scale_color_manual(values=c("red", "orange", "green","blue"))
## Warning: Removed 189 rows containing missing values (`geom_point()`).
## Warning: Removed 178 rows containing missing values (`geom_point()`).
## Warning: Removed 189 rows containing missing values (`geom_point()`).
## Removed 189 rows containing missing values (`geom_point()`).
## Warning: Removed 211 rows containing missing values (`geom_point()`).
## Warning: Removed 188 rows containing missing values (`geom_point()`).
## Warning: Removed 175 rows containing missing values (`geom_point()`).
## Warning: Removed 176 rows containing missing values (`geom_point()`).
There are 2 rows of 4. One row (red, orange, green, blue) represents houses that all have one bathroom, but varying amounts of bedrooms. Each line corresponds with 1, 2, 3, or 4, bedrooms. Like in the first model in this section, the spacing between the lines is the same. The second row (pink, purple, brown, black) represents houses that all have two bathrooms, but varying amounts of bedrooms. The spacing between the line is the same as the first row, but all of the lines are shifted up by about 40,000, since that is the effect of having 2 bathrooms. For example, a house with one bathroom and one bedroom costs about 40,000 less than a house with two bathrooms and one bedroom. (This is all when size is held constant.)
21) Finally, try adding baths
to the model with factor(beds)
and make a similar graph (with the yHat
points colored by the number of bedrooms, and the corresponding geom_abline()
s). If you understood the previous two questions, you should have no problem understanding this question too. If you didn’t, make sure you understand those models first before trying to wrap your head around this one. How has the spacing between the lines changed? Explain the spacing between the same colored lines (i.e., the same number of bedrooms but 2 bathrooms instead of 1), and the different colored lines (i.e., the number of bedrooms).
$beds1baths1 <- ifelse(mydata$beds == 1 & mydata$baths == 1, 1, 0)
mydata$beds2baths1 <- ifelse(mydata$beds == 2 & mydata$baths == 1, 1, 0)
mydata$beds3baths1 <- ifelse(mydata$beds == 3 & mydata$baths == 1, 1, 0)
mydata$beds4baths1 <- ifelse(mydata$beds == 4 & mydata$baths == 1, 1, 0)
mydata$beds1baths2 <- ifelse(mydata$beds == 1 & mydata$baths == 2, 1, 0)
mydata$beds2baths2 <- ifelse(mydata$beds == 2 & mydata$baths == 2, 1, 0)
mydata$beds3baths2 <- ifelse(mydata$beds == 3 & mydata$baths == 2, 1, 0)
mydata$beds4baths2 <- ifelse(mydata$beds == 4 & mydata$baths == 2, 1, 0)
mydata
<- lm(price ~ size + beds2baths1 + beds3baths1 + beds4baths1 + beds1baths2 + beds2baths2 + beds3baths2 + beds4baths2, data=mydata)
modelbedbathdummy pander(modelbedbathdummy)
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 119831 | 24686 | 4.854 | 0.000002378 |
size | 97.17 | 17.99 | 5.401 | 0.0000001808 |
beds2baths1 | -1500 | 14548 | -0.1031 | 0.918 |
beds3baths1 | 20226 | 19061 | 1.061 | 0.2899 |
beds4baths1 | 47024 | 30517 | 1.541 | 0.1249 |
beds1baths2 | -19583 | 24977 | -0.784 | 0.4339 |
beds2baths2 | 47840 | 15314 | 3.124 | 0.00204 |
beds3baths2 | 67408 | 20432 | 3.299 | 0.001142 |
beds4baths2 | 93105 | 27220 | 3.42 | 0.0007531 |
$yhatbedbathdummy1 <- ifelse(mydata$beds == 1 & mydata$baths == 1, coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["size"]*mydata$size, NA)
mydata
$yhatbedbathdummy2 <- ifelse(mydata$beds == 2 & mydata$baths == 1, coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["beds2baths1"]*mydata$beds2baths1 + coef(modelbedbathdummy)["size"]*mydata$size, NA)
mydata
$yhatbedbathdummy3 <- ifelse(mydata$beds == 3 & mydata$baths == 1, coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["beds3baths1"]*mydata$beds3baths1 + coef(modelbedbathdummy)["size"]*mydata$size, NA)
mydata
$yhatbedbathdummy4 <- ifelse(mydata$beds == 4 & mydata$baths == 1, coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["beds4baths1"]*mydata$beds4baths1 + coef(modelbedbathdummy)["size"]*mydata$size, NA)
mydata
$yhatbedbathdummy5 <- ifelse(mydata$beds == 1 & mydata$baths == 2, coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["beds1baths2"]*mydata$beds1baths2 + coef(modelbedbathdummy)["size"]*mydata$size, NA)
mydata
$yhatbedbathdummy6 <- ifelse(mydata$beds == 2 & mydata$baths == 2, coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["beds2baths2"]*mydata$beds2baths2 + coef(modelbedbathdummy)["size"]*mydata$size, NA)
mydata
$yhatbedbathdummy7 <- ifelse(mydata$beds == 3 & mydata$baths == 2, coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["beds3baths2"]*mydata$beds3baths2 + coef(modelbedbathdummy)["size"]*mydata$size, NA)
mydata
$yhatbedbathdummy8 <- ifelse(mydata$beds == 4 & mydata$baths == 2, coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["beds4baths2"]*mydata$beds4baths2 + coef(modelbedbathdummy)["size"]*mydata$size, NA)
mydata
ggplot(mydata) +
scale_x_continuous(expand = c(0, 0),limits = c(0, max(mydata$size)*1.02),breaks = seq(0,max(mydata$size)*1.02,500)) +
scale_y_continuous(expand = c(0, 0),limits = c(0, max(mydata$price)*1.02), breaks = seq(0,max(mydata$price)*1.02,50000)) +
geom_point(aes(y=yhatbedbathdummy1,x=size), col = "red") +
geom_point(aes(y=yhatbedbathdummy2,x=size), col = "orange") +
geom_point(aes(y=yhatbedbathdummy3,x=size), col = "green") +
geom_point(aes(y=yhatbedbathdummy4,x=size), col = "blue") +
geom_point(aes(y=yhatbedbathdummy5,x=size), col = "pink") +
geom_point(aes(y=yhatbedbathdummy6,x=size), col = "purple") +
geom_point(aes(y=yhatbedbathdummy7,x=size), col = "brown") +
geom_point(aes(y=yhatbedbathdummy8,x=size), col = "black") +
geom_abline(aes(intercept = coef(modelbedbathdummy)["(Intercept)"], slope = coef(modelbedbathdummy)["size"]), color = "red") +
geom_abline(aes(intercept = coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["beds2baths1"], slope = coef(modelbedbathdummy)["size"]), color = "orange") +
geom_abline(aes(intercept = coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["beds3baths1"], slope = coef(modelbedbathdummy)["size"]), color = "green") +
geom_abline(aes(intercept = coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["beds4baths1"], slope = coef(modelbedbathdummy)["size"]), color = "blue") +
geom_abline(aes(intercept = coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["beds1baths2"], slope = coef(modelbedbathdummy)["size"]), color = "pink") +
geom_abline(aes(intercept = coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["beds2baths2"], slope = coef(modelbedbathdummy)["size"]), color = "purple") +
geom_abline(aes(intercept = coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["beds3baths2"], slope = coef(modelbedbathdummy)["size"]), color = "brown") +
geom_abline(aes(intercept = coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["beds4baths2"], slope = coef(modelbedbathdummy)["size"]), color = "black") +
geom_point(aes(y=price,x=size, color = factor(beds), shape = factor(baths)), alpha = .4) +
scale_color_manual(values=c("red", "orange", "green","blue"))
## Warning: Removed 189 rows containing missing values (`geom_point()`).
## Warning: Removed 178 rows containing missing values (`geom_point()`).
## Warning: Removed 189 rows containing missing values (`geom_point()`).
## Warning: Removed 206 rows containing missing values (`geom_point()`).
## Warning: Removed 211 rows containing missing values (`geom_point()`).
## Warning: Removed 188 rows containing missing values (`geom_point()`).
## Warning: Removed 175 rows containing missing values (`geom_point()`).
## Warning: Removed 176 rows containing missing values (`geom_point()`).
This model allows for variation in the spacing of the lines. If you hold bathrooms constant, the difference between 1 and 2 bedrooms is not the same as 2 and 3 bedrooms. Likewise, if you have a one bedroom house, the difference between 1 and 2 bathrooms will be different if you have a two bedroom house. For example, the difference between a house with 3 bedrooms and 1 bath and 3 bedrooms and 2 baths is 47, 182, but the difference between a house with 4 bedrooms and 1 bath and 4 bedrooms and 2 baths is 46, 081. (This is all when size is held constant.) (This is all when size is held constant.)