13 Dummy Variables Part 1

This chapter serves as a bridge between working with continuous explanatory variables and binary explanatory variables, often called indicator variables or dummy variables. A dummy variable is a variable that can have only two values, 0 and 1. A value of 1 indicates that the condition is true, while a value of 0 indicates that the condition is false. The “condition” indicated by the variable can be based on qualitative information (e.g., being a particular race, ethnicity, gender, or nationality), quantitative information (e.g., being over the age of 65, living in a house with more than 5 people), or some mix of the two (e.g., having a particular degree instead of years of education). As an example, if we have data on individuals from different US States, we might have a dummy variable WI, where WI=1 indicates the person is from Wisconsin and WI=0 indicates they are NOT from Wisconsin. It is always important to understand the data to understand what the “NOT” condition means. For example, if the data includes people from Illinois and Wisconsin, WI=0 indicates the person is from Illinois. If the data contains people from all 50 states, than WI=0 indicates they are from any state other than Wisconsin.

This chapter uses the data in HousePriceDummies.csv with price of the house in dollars (price), size of the hose in square feet (size), number of bathrooms (baths), and number of bedrooms (beds). This chapter serves as a bridge between working with continuous explanatory variables and dummy variables because you will construct a dummy variable from a variable you have previously used as a continuous variable (the number of bathrooms).

From this chapter you will learn about intercept dummies that allow the average \(y\) value to differ by group by the same amount for all values of the other explanatory variables (this should sound familiar from week 1 with houses that have a garage). Intercept dummies are the most common use of dummy variables. You’ll create an intercept dummy from a numerical variable (baths) that only has two possible values, so while you’re working with a dummy variable, you’re actually just working with regression models you should already understand. This hopefully helps you better understand how to work with all dummy variables, including those based on qualitative information, not just quantitative information like the number of bathrooms. You will also learn about slope dummies that allow the slope to vary by group (slope dummies are actually just a dummy variable multiplied by a continuous variable).

Wherever you see qCnt() in the RMD file you’ll see a number in the HTML output. qCnt() is a counter (defined in the code chunk above) that is putting what are essentially question numbers in the HTML output. This makes it easier to discuss with others because we can refer to the different questions by number.

As you go through the file, you need to do something (add code, short answers) for each question everywhere it says YOUR ANSWER HERE. This is what I will be checking for when I check off your work. We will start out working on this together in class, although you maybe need to finish on your own.

Let’s jump in…

13.1 Data

## Load Data
mydata <- read.csv("https://raw.githubusercontent.com/LhostEcon/380data/main/HousePriceDummies.csv")

Summary statistics

stargazer(mydata, type = "html",summary.stat = c("mean","sd", "min", "p25", "median", "p75", "max"))
Statistic Mean St. Dev. Min Pctl(25) Median Pctl(75) Max
price 348,436.100 94,150.230 109,904 285,144 347,862 408,484.2 611,257
size 1,938.569 493.908 877 1,536.8 1,910.5 2,323.8 2,965
beds 2.630 0.999 1 2 3 3 4
baths 1.528 0.500 1 1 2 2 2

13.2 Size only

1) In this first code chunk, first estimate the simple linear regression model of price on size and store it in modelS (that’s “model” and a capital letter “S” for “Size”). Then display modelS using pander. Next, add a variable yHatS to the mydatadata frame with the predicted price from modelS. Yes, R has a fitted() function to do this for you, but I want you to calculate it manually using:

\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 size \]

Finally, create a scatter plot of the data using the number of bathrooms (baths) as a factor for the color (we need factor() so it treats baths as distinct integers, 1 and 2, instead of a continuous variable that could have values like 1.234). We also include yHatS as a scatterplot and as a line (also using yHatS). Make sure you understand why all of the yHatS points are on the yHatS line. Throughout this chapter we’ll use an “x” symbol (ggplot’s shape=4) to display the data and dots (i.e., filled-in circles, ggplot’s shape=19, which is also it’s default) to display predicted prices (i.e., yHatS).

I filled this first one in for you (feel free to copy/paste and use this as a basis for what you do).

modelS <- lm(price~size,data=mydata)
pander(summary(modelS))
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 49174 15291 3.216 0.001502
size 154.4 7.645 20.19 1.826e-51
Fitting linear model: price ~ size
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
216 55363 0.6558 0.6542
mydata$yHatS <- coef(modelS)["(Intercept)"] + coef(modelS)["size"] * mydata$size
ggplot(mydata) + 
  geom_point(aes(y=price,x=size,col=factor(baths)),shape=4) +
  geom_point(aes(y=yHatS,x=size,col=factor(baths))) +
  geom_line(aes(y=yHatS,x=size),col="black")

Note how we used coef(modelS)["(Intercept)"] to get the value of the intercept and coef(modelS)["size"] to get the value of the coefficient on size. We do NOT look at the output and type the value. Hard-coding the value (i.e., typing it) leads to errors, both from typos and from not updating values if the data or model change (thus changing the coefficients). Make sure to do this throughout anywhere you need to use the values of regression coefficients.

2) A big part of our focus in this chapter is the regression lines, so let’s be more explicit about plotting the line. The line we plotted above used geom_line(aes(y=yHatS,x=size)). We could plot this same line using geom_smooth, but later we’re going to plot lines that don’t work easily with geom_smooth. Instead, we’re going to use use geom_abline() to plot a line using its intercept and slope. Recall that yHatS is: \[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 size \] so we need to use \(\hat{\beta}_0=\) 49173.68 for geom_abline’s intercept argument and \(\hat{\beta}_1=\) 154.37 for geom_abline’s slope argument. Note that when you use coef(modelS)["(Intercept)"] and coef(modelS)["size"] for the intercept and slope arguments of geom_abline, you don’t include the format() function, or it’s arguments (digits=2,nsmall=2), which were just used to display the values in this paragraph in a nicely-formatted way.

We’ll also expand the axes limits so we can see the y intercepts; to do this, we’ll include:

scale_x_continuous(expand = c(0, 0),limits = c(0, max(mydata$size)*1.02),breaks = seq(0,max(mydata$size)*1.02,500))

and

scale_y_continuous(expand = c(0, 0),limits = c(0, max(mydata$price)*1.02), breaks = seq(0,max(mydata$price)*1.02,50000))


ggplot(mydata) + 
  scale_x_continuous(expand = c(0, 0),limits = c(0, max(mydata$size)*1.02),breaks = seq(0,max(mydata$size)*1.02,500)) + 
  scale_y_continuous(expand = c(0, 0),limits = c(0, max(mydata$price)*1.02), breaks = seq(0,max(mydata$price)*1.02,50000)) +
  geom_point(aes(y=price,x=size,col=factor(baths)),shape=4) +
  geom_point(aes(y=yHatS,x=size,col=factor(baths))) +
  geom_abline(aes(intercept =  coef(modelS)["(Intercept)"], slope = coef(modelS)["size"]))


3) The slope of yHatS is the effect of size (of an additional \(ft^2\)) on the predicted price from the model that only controls for size. What do you think will happen to the effect of size on the predicted price when we also control for baths?


The effect of size on the predicted price when we also control for baths will decrease.


13.3 Number of bathrooms and size

4) Now add the number of bathrooms (baths) as a variable to the regression (in addition to size) and store the model as modelSB (“model” with “S” for size and “B” for baths). Display the output using pander. Add a variable yHatSB to mydata with the predicted prices from this model. Remember that for this model (the “SB” model that includes both size and baths), predicted prices are given by: \[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 size + \hat{\beta}_2 baths \]

Hint: Start by copy/pasting the code above where you estimated modelS and added yHatS to mydata. However, make sure to change the R object references when you use them, e.g., change modelS to modelSB. If you don’t, you’ll either have the wrong values (e.g., use the coefficient on size from the wrong model), or get an error (e.g., try to use the coefficient on baths from modelS that doesn’t include baths). Also make sure to change variable names in the expression calculating yHat (e.g., make sure the size coefficient is multiplied by the size variable, and the baths coefficient is multiplied by the baths variable…otherwise you might not notice here, but later your graph might not work).


modelSB <- lm(price ~ size + baths, data=mydata)
pander(summary(modelSB))
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 20004 15384 1.3 0.1949
size 136.3 7.944 17.16 5.014e-42
baths 42005 7841 5.357 0.0000002187
Fitting linear model: price ~ size + baths
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
216 52094 0.6967 0.6938
mydata$yHatSB <- coef(modelSB)["(Intercept)"] + coef(modelSB)["size"] * mydata$size + coef(modelSB)["baths"] * mydata$baths

5) Now let’s add yHatSB to the graph as a scatterplot. Copy the last graph you made above, remove the geom_point() of yHatS, and add a geom_point() of yHatSB. Make the color “orange” for all the yHatSB points (i.e., the new geom_point() should be this: geom_point(data=mydata,aes(y=yHatSB,x=size),col="orange")). By “copy the last graph you made”, I do mean that you should leave the geom_abline that plots the modelS regression line from the previous graph (don’t change it to modelSB), and leave the axes starting at (0,0).


ggplot(mydata) + 
  scale_x_continuous(expand = c(0, 0),limits = c(0, max(mydata$size)*1.02),breaks = seq(0,max(mydata$size)*1.02,500)) + 
  scale_y_continuous(expand = c(0, 0),limits = c(0, max(mydata$price)*1.02), breaks = seq(0,max(mydata$price)*1.02,50000)) +
  geom_point(aes(y=price,x=size,col=factor(baths)),shape=4) +
  geom_point(aes(y=yHatSB,x=size), col = "orange") +
  geom_abline(aes(intercept =  coef(modelS)["(Intercept)"], slope = coef(modelS)["size"]))


6) It looks like there are two upward-sloping parallel rows of yHatSB predicted prices. What accounts for the general upward slope of the yHatSB predicted prices? Why are there two rows (try looking at a count() of baths to help you answer this part of the question)?


mydata %>% count(baths)
##   baths   n
## 1     1 102
## 2     2 114

There are two “levels” of baths, 1 or 2. Each level or category of bath has a different intercept. It’s most likely that having two baths will mean a higher house price than just one, so it’s intercept is higher, although the slope is the same because size is still affecting price at the same rate.


7) Look at the graph you just made. Notice that the yHatS line doesn’t go straight through the middle of the two rows of yHatSB predicted prices. What is the slope of the yHatS line (the answer is one of the coefficients estimated above)? What is the slope of the two rows of yHatSB predicted prices (the answer is a coefficient estimated above)? Which slope is steeper? What accounts for the difference between these different slopes?


The slope of the yHatS line is 154.4. The slope for the two rows of yHatSB is 136.3. This is because adding baths as an explanatory variable reduces the magnitude of the effect that size has on price, as baths were previously an unobservable variable.


8) How far apart vertically in the y direction (the price direction) are the two rows of yHatSB predicted prices? Why?


42005 (dollars). This is the effect that number of bathrooms has on price. Adding one more bathroom will increase the price by $42005, making the top line (representing two bathrooms), 42005 higher than the bottom line (representing one bathroom).


9) Now you’re going to create a new version of the graph above, except all the points on the lower row of orange dots are going to be red (and connected by a red line) and all the points on the upper row of orange dots are going to be blue (and connected by a blue line). Copy the code from the previous graph as the basis for making the graph. But before creating the new graph, you’ll need to create a few new variables that you can use to replace parts of the original graph. Here’s what you need to do:

Using only the coefficients from modelSB and the size variable (and simple arithmetic), generate a variable name yHatSB1 that when you plot it, replaces the lower row of yHatSB predicted price points. Make sure that these points are only created for observations with 1 bathroom and are NA for other observation (I’d use ifelse() for baths==1). In other words, do something like this: mydata$yHatSB1 <- ifelse(mydata$baths == 1, coef(modelSB)["(Intercept)"] + coef(modelSB)["baths"]*1 + coef(modelSB)["size"]*mydata$size, NA)

Add these yHatSB1 points to the graph as geom_point() and make them dots red. Also add a geom_abline() that goes through this row of dots and make it a red line. To do this, think about what part(s) of the yHatSB1 line are the intercept and what part(s) are the slope on a graph that has size on the x axis.

Do the same thing for the upper row. In other words, also using only the coefficients from modelSB and the size variable (and simple arithmetic), generate a variable name yHatSB2 that when you plot it, replaces the upper row of yHatSB predicted price points. Make sure that these points are only created for observations with 2 bathrooms and are NA for other observation (use a similar ifelse() but for baths==2 and calculating the correct yHat values for 2 bathrooms). Add these to the graph as geom_point() and make these dots blue. Also add a geom_abline() that goes through this row of dots and make this line blue.

In addition, remove the orange yHatSB points you added before (because you’ve replaced them with red points and blue points).

In mine, I also labeled the y-intercepts of the three lines (the black line that connects the yHatS points, the red line that connects the yHatSB1 points, and the blue line that connects the yHatSB2 points). To do this, you modify the breaks argument of scale_y_continuous(). Try it if you have time, but don’t waste much time trying to figure this out. Either way, you should understand how the intercepts correspond with coefficients from the models.


mydata$yHatSB1 <- ifelse(mydata$baths == 1, coef(modelSB)["(Intercept)"] + coef(modelSB)["baths"]*1 + coef(modelSB)["size"]*mydata$size, NA)
mydata$yHatSB2 <- ifelse(mydata$baths == 2, coef(modelSB)["(Intercept)"] + coef(modelSB)["baths"]*2 + coef(modelSB)["size"]*mydata$size, NA)

ggplot(mydata) + 
  scale_x_continuous(expand = c(0, 0),limits = c(0, max(mydata$size)*1.02),breaks = seq(0,max(mydata$size)*1.02,500)) + 
  scale_y_continuous(expand = c(0, 0),limits = c(0, max(mydata$price)*1.02), breaks = seq(0,max(mydata$price)*1.02,50000)) +
  geom_point(aes(y=price,x=size,col=factor(baths)),shape=4) +
  geom_point(aes(y=yHatSB1,x=size), col = "red") +
  geom_abline(aes(intercept = coef(modelSB)["(Intercept)"] + coef(modelSB)["baths"]*1, slope = coef(modelSB)["size"]), color = "red") +
  geom_point(aes(y=yHatSB2,x=size), col = "blue") +
  geom_abline(aes(intercept =  coef(modelSB)["(Intercept)"] + coef(modelSB)["baths"]*2, slope = coef(modelSB)["size"]), color = "blue") +
    geom_abline(aes(intercept =  coef(modelS)["(Intercept)"], slope = coef(modelS)["size"]))
## Warning: Removed 114 rows containing missing values (`geom_point()`).
## Warning: Removed 102 rows containing missing values (`geom_point()`).


10) Using ifelse(), create two dummy variables, baths1 and baths2, and add them to mydata. The variable baths1 equals 1 for houses with 1 bathroom and equals 0 otherwise. The variable baths2 equals 1 for houses with 2 bathroom and equals 0 otherwise. Make sure to look at the data after creating the variables to make sure you did it correctly (e.g., use head())! Calculate the mean of baths1 and baths2. What does the mean of baths1 tell us? What about the mean of baths2?


mydata$baths1 <- ifelse(mydata$baths == 1, 1, 0)
mydata$baths2 <- ifelse(mydata$baths == 2, 1, 0)

head(mydata)
##    price size beds baths    yHatS   yHatSB  yHatSB1  yHatSB2 baths1 baths2
## 1 427923 2211    3     2 390491.9 405408.3       NA 405408.3      0      1
## 2 270778 1296    2     2 249240.8 280679.1       NA 280679.1      0      1
## 3 329174 1525    2     2 284592.2 311895.4       NA 311895.4      0      1
## 4 537281 2913    4     2 498861.6 501102.2       NA 501102.2      0      1
## 5 275112  903    1     1 188572.3 185102.1 185102.1       NA      1      0
## 6 390832 2302    4     2 404539.8 417813.1       NA 417813.1      0      1
mean(mydata$baths1)
## [1] 0.4722222
mean(mydata$baths2)
## [1] 0.5277778

The means tell us what percentage of the houses in our data set have one bathroom or two. 47.22% have one bathroom and 52.78% have two bathrooms.


11) Try estimating a regression (with price as the y variable) that includes size, baths1, and baths2. Call it model12. Display the output using pander, but also display coef(model12). Why is part of coef(model12) NA? Why? Hint: which of the 4 MLR assumptions is violated?


model12 <- lm(price ~ size + baths1 + baths2, data=mydata)
pander(summary(model12))
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 104013 17658 5.89 0.00000001487
size 136.3 7.944 17.16 5.014e-42
baths1 -42005 7841 -5.357 0.0000002187
Fitting linear model: price ~ size + baths1 + baths2
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
216 52094 0.6967 0.6938
coef(model12)
## (Intercept)        size      baths1      baths2 
## 104013.3696    136.3161 -42004.7326          NA

Knowing there is a 0 in baths1 tells us something about baths2. This violates the colinearity assumption.


12) Since we cannot include both baths1 and baths2 in the regression, lets try again without baths1. Estimate a model (name it modelInterceptDummy) that includes size and baths2, but leave out baths1. Display the results in a stargazer table next to the results of modelSB. What is the interpretation of \(\hat{\beta}_0\), \(\hat{\beta}_1\), and \(\hat{\beta}_2\) in modelInterceptDummy)?

How do these coefficients compare/correspond with coefficients from modelSB? Specifically, how does the effect of size compare? What coefficient(s) from each model give you the intercepts for one and two bathroom houses (i.e., the average price when size is 0 for houses with 1 bathroom and for houses with 2 bathrooms)?


modelInterceptDummy <- lm(price ~ size + baths2, data=mydata)
library(stargazer)
stargazer(modelInterceptDummy, modelSB,
          type = "html", 
          report=('vc*p'),
          keep.stat = c("n","rsq","adj.rsq"), 
          notes = "<em>&#42;p&lt;0.1;&#42;&#42;p&lt;0.05;&#42;&#42;&#42;p&lt;0.01</em>", 
          notes.append = FALSE)
Dependent variable:
price
(1) (2)
size 136.316*** 136.316***
p = 0.000 p = 0.000
baths2 42,004.730***
p = 0.00000
baths 42,004.730***
p = 0.00000
Constant 62,008.640*** 20,003.900
p = 0.00004 p = 0.195
Observations 216 216
R2 0.697 0.697
Adjusted R2 0.694 0.694
Note: *p<0.1;**p<0.05;***p<0.01

In modelInterceptDummy, \(\hat{\beta}_0\) is the average estimated price for a house that has 0 square feet and one bath. It is statistically significant at the 1% level. \(\hat{\beta}_1\) is the average estimated increase in price for every square foot increase in size, if the number of bathrooms is held constant. It is also statistically significant at the 1% level. \(\hat{\beta}_2\) is the difference in price for a house with one bathroom and one with two, when size is held constant. It is statistically significant at the 1% level.

The models are very similar, in that the effect of size does not change. The difference is the way they express the effect of baths. In modelSB, \(\hat{\beta}_0\) would be the average expected price of a house with 0 square feet and 0 bathrooms. \(\hat{\beta}_2\) is not the difference in price for a house with one bathroom and one with two, but is the average expected increase in price for every additional bathroom a house has.

For modelSB, \(\hat{\beta}_0\) + \(\hat{\beta}_2\) gives you the average price for a house with size 0 and 1 bathroom. \(\hat{\beta}_0\) + \(\hat{\beta}_2 *2\) gives you the average price for a house with size 0 and 2 bathrooms. In contrast, for modelInterceptDummy, \(\hat{\beta}_0\) gives you the average price for a house with size 0 and 1 bathroom. \(\hat{\beta}_0\) + \(\hat{\beta}_2\) gives you the average price for a house with size 0 and 2 bathrooms.


13) Create the same graph you created above with the red and blue lines, except modify the geom_abline() layers that use coefficients from modelSB so that they use modelInterceptDummy instead. Leave everything else as it is in the previous graph (e.g., leave the black line geom_abline() that uses modelS, leave the geom_point() using yHatSB1 and yHatSB2). The graph itself should look identical (the two models are identical because the only possible values of baths are 1 and 2). Make sure that your red line (using geom_abline() based on modelInterceptDummy coefficients) actually goes through the red points (the geom_point() based on yHatSB1) and make sure that your blue line (using geom_abline() based on modelInterceptDummy coefficients) actually goes through the blue points (the geom_point() based on yHatSB2).


ggplot(mydata) + 
  scale_x_continuous(expand = c(0, 0),limits = c(0, max(mydata$size)*1.02),breaks = seq(0,max(mydata$size)*1.02,500)) + 
  scale_y_continuous(expand = c(0, 0),limits = c(0, max(mydata$price)*1.02), breaks = seq(0,max(mydata$price)*1.02,50000)) +
  geom_point(aes(y=price,x=size,col=factor(baths)),shape=4) +
  geom_point(aes(y=yHatSB1,x=size), col = "red") +
  geom_abline(aes(intercept = coef(modelInterceptDummy)["(Intercept)"], slope = coef(modelInterceptDummy)["size"]), color = "red") +
  geom_point(aes(y=yHatSB2,x=size), col = "blue") +
  geom_abline(aes(intercept =  coef(modelInterceptDummy)["(Intercept)"] + coef(modelInterceptDummy)["baths2"]*1, slope = coef(modelInterceptDummy)["size"]), color = "blue") +
    geom_abline(aes(intercept =  coef(modelS)["(Intercept)"], slope = coef(modelS)["size"]))
## Warning: Removed 114 rows containing missing values (`geom_point()`).
## Warning: Removed 102 rows containing missing values (`geom_point()`).


13.4 Slope dummy

14) Think about what model would allow for the slope (with respect to size) to be different for 1 and 2 bathroom houses (but for the intercept to be the same). Write out the equation for this model the way equations were written out using latex code above (e.g., using \(\beta_1\) etc).


\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1size + \hat{\beta}_2 baths2 * size \]


Create any new variables you need to create, and then estimate the model and call it modelSlopeDummy.


mydata <- mydata %>%
  mutate(bathsize = baths2 * size)

modelSlopeDummy <- lm(price ~ size + bathsize, data = mydata)

pander(modelSlopeDummy)
Fitting linear model: price ~ size + bathsize
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 82311 15618 5.27 0.0000003332
size 124.7 9.032 13.81 2.094e-31
bathsize 21.57 3.984 5.415 0.0000001647

Display the results of modelS, modelInterceptDummy, and modelSlopeDummy side-by-side using stargazer.


stargazer(modelS, modelInterceptDummy, modelSlopeDummy,
          type = "html", 
          report=('vc*p'),
          keep.stat = c("n","rsq","adj.rsq"), 
          notes = "<em>&#42;p&lt;0.1;&#42;&#42;p&lt;0.05;&#42;&#42;&#42;p&lt;0.01</em>", 
          notes.append = FALSE)
Dependent variable:
price
(1) (2) (3)
size 154.373*** 136.316*** 124.733***
p = 0.000 p = 0.000 p = 0.000
baths2 42,004.730***
p = 0.00000
bathsize 21.572***
p = 0.00000
Constant 49,173.680*** 62,008.640*** 82,310.930***
p = 0.002 p = 0.00004 p = 0.00000
Observations 216 216 216
R2 0.656 0.697 0.697
Adjusted R2 0.654 0.694 0.695
Note: *p<0.1;**p<0.05;***p<0.01

Create a graph similar to what you did above, except using this new model. Start with the previous graph and make the following changes:

  1. Remove the black line based on modelS.

  2. Remove the red geom_point() based on yHatSB1 and replace it with red geom_point() based on modelSlopeDummy (I suggest creating a yHatSlopeDummy1 similar to how you created yHatSB1).

  3. Remove the blue geom_point() based on yHatSB2 and replace it with blue geom_point() based on modelSlopeDummy (I suggest creating a yHatSlopeDummy2 similar to how you created yHatSB2).

  4. Remove the red geom_abline() based on modelInterceptDummy and replace it with a red geom_abline() based on the coefficients from modelSlopeDummy.

  5. Remove the blue geom_abline() based on modelInterceptDummy and replace it with a red geom_abline() based on the coefficients from modelSlopeDummy.

NOTE: you rarely want to estimate a model with a slope dummy unless you also have the corresponding intercept dummy (see the next question for that model), but I’m having you do so now because it’s your first one and you need to learn how to do so.


mydata$yHatSlopeDummy1 <- ifelse(mydata$baths == 1, coef(modelSlopeDummy)["(Intercept)"] + coef(modelSlopeDummy)["size"]*mydata$size, NA)
mydata$yHatSlopeDummy2 <- ifelse(mydata$baths == 2, coef(modelSlopeDummy)["(Intercept)"] + coef(modelSlopeDummy)["size"]*mydata$size + coef(modelSlopeDummy)["bathsize"]*mydata$bathsize, NA)

ggplot(mydata) + 
  scale_x_continuous(expand = c(0, 0),limits = c(0, max(mydata$size)*1.02),breaks = seq(0,max(mydata$size)*1.02,500)) + 
  scale_y_continuous(expand = c(0, 0),limits = c(0, max(mydata$price)*1.02), breaks = seq(0,max(mydata$price)*1.02,50000)) +
  geom_point(aes(y=price,x=size,col=factor(baths)),shape=4) +
  geom_point(aes(y=yHatSlopeDummy1,x=size), col = "red") +
  geom_abline(aes(intercept = coef(modelSlopeDummy)["(Intercept)"], slope = coef(modelSlopeDummy)["size"]), color = "red") +
  geom_point(aes(y=yHatSlopeDummy2,x=size), col = "blue") +
geom_abline(aes(intercept = coef(modelSlopeDummy)["(Intercept)"], slope = coef(modelSlopeDummy)["size"] + coef(modelSlopeDummy)["bathsize"]), color = "blue")
## Warning: Removed 114 rows containing missing values (`geom_point()`).
## Warning: Removed 102 rows containing missing values (`geom_point()`).


13.5 Intercept and slope dummies

15) Estimate a model that allows for both the intercept and the slope (with respect to size) to be different for 1 and 2 bathroom houses.

Write out the equation for this model the way equations were written out using latex code above (e.g., using \(\beta_1\), etc).


\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1size + \hat{\beta}_2 baths2 + \hat{\beta}_3baths2 * size \]


Create any new variables you need to create, if you need to create any, and then estimate the model and call it modelSlopeAndInterceptDummies.


modelSlopeAndInterceptDummies <- lm(price ~ size + baths2 + bathsize, data = mydata)

pander(modelSlopeAndInterceptDummies)
Fitting linear model: price ~ size + baths2 + bathsize
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 75489 21120 3.574 0.0004349
size 128.5 11.92 10.77 7.231e-22
baths2 15120 31443 0.4809 0.6311
bathsize 14.12 16 0.8829 0.3783

Display the results of the four models (modelS, modelInterceptDummy, modelSlopeDummy, and modelSlopeAndInterceptDummies) side-by-side using stargazer. For this part though, do it below the graph instead of here.

Create a graph of this model by following the same steps you followed above to create the graph of the slope dummy model.


mydata$yHatInterceptSlopeDummy1 <- ifelse(mydata$baths == 1, coef(modelSlopeAndInterceptDummies)["(Intercept)"] + coef(modelSlopeAndInterceptDummies)["size"]*mydata$size, NA)

mydata$yHatInterceptSlopeDummy2 <- ifelse(mydata$baths == 2, coef(modelSlopeAndInterceptDummies)["(Intercept)"] + coef(modelSlopeAndInterceptDummies)["baths2"]*mydata$baths2 + coef(modelSlopeAndInterceptDummies)["size"]*mydata$size + coef(modelSlopeAndInterceptDummies)["bathsize"]*mydata$bathsize, NA)

ggplot(mydata) + 
  scale_x_continuous(expand = c(0, 0),limits = c(0, max(mydata$size)*1.02),breaks = seq(0,max(mydata$size)*1.02,500)) + 
  scale_y_continuous(expand = c(0, 0),limits = c(0, max(mydata$price)*1.02), breaks = seq(0,max(mydata$price)*1.02,50000)) +
  geom_point(aes(y=price,x=size,col=factor(baths)),shape=4) +
  geom_point(aes(y=yHatInterceptSlopeDummy1,x=size), col = "red") +
  geom_abline(aes(intercept = coef(modelSlopeAndInterceptDummies)["(Intercept)"], slope = coef(modelSlopeAndInterceptDummies)["size"]), color = "red") +
  geom_point(aes(y=yHatInterceptSlopeDummy2, x=size), col = "blue") +
geom_abline(aes(intercept = coef(modelSlopeAndInterceptDummies)["(Intercept)"] + coef(modelSlopeAndInterceptDummies)["baths2"], slope = coef(modelSlopeAndInterceptDummies)["size"] + coef(modelSlopeAndInterceptDummies)["bathsize"]), color = "blue")
## Warning: Removed 114 rows containing missing values (`geom_point()`).
## Warning: Removed 102 rows containing missing values (`geom_point()`).


16) Display the stargazer table comparing the four models (modelS, modelInterceptDummy, modelSlopeDummy, and modelSlopeAndInterceptDummies).


stargazer(modelS, modelInterceptDummy, modelSlopeDummy, modelSlopeAndInterceptDummies,
          type = "html", 
          report=('vc*p'),
          keep.stat = c("n","rsq","adj.rsq"), 
          notes = "<em>&#42;p&lt;0.1;&#42;&#42;p&lt;0.05;&#42;&#42;&#42;p&lt;0.01</em>", 
          notes.append = FALSE)
Dependent variable:
price
(1) (2) (3) (4)
size 154.373*** 136.316*** 124.733*** 128.467***
p = 0.000 p = 0.000 p = 0.000 p = 0.000
baths2 42,004.730*** 15,120.010
p = 0.00000 p = 0.632
bathsize 21.572*** 14.123
p = 0.00000 p = 0.379
Constant 49,173.680*** 62,008.640*** 82,310.930*** 75,489.050***
p = 0.002 p = 0.00004 p = 0.00000 p = 0.0005
Observations 216 216 216 216
R2 0.656 0.697 0.697 0.698
Adjusted R2 0.654 0.694 0.695 0.694
Note: *p<0.1;**p<0.05;***p<0.01

Using the estimated coefficients from the four models, write out the equation for the predicted prices for each of the four models, followed by the conditional expectations for each of the four models for one bathroom houses, followed by the conditional expectations for each of the four models for one bathroom houses. I filled in each of these for the first model (the one with size only) for you so you can see what I’m asking you to do. You can copy/paste what I did and then modify it for the subsequent models (modelInterceptDummy, modelSlopeDummy, and modelSlopeAndInterceptDummies). Round all coefficients to 0 decimal places they way I did for modelS.


Predicted prices from the four models: \[ \begin{aligned} \widehat{price} &= 49174 + 154 \cdot size \\ \widehat{price} &= 62009 + 42005 \cdot baths2 + 136 \cdot size \\ \widehat{price} &= 82311 + 22 \cdot baths2 \cdot size + 125 \cdot size \\ \widehat{price} &= 75489 + 15120 \cdot baths2 + 14 \cdot baths2 \cdot size + 128 \cdot size \end{aligned} \]

Expected prices for 1 bathroom (not-two bathroom) houses \[ \begin{aligned} E(price|size,baths2=0) &= 49174 + 154 \cdot size \\ E(price|size,baths2=0) &= 62009 + 136 \cdot size \\ E(price|size,baths2=0) &= 82311 + 125 \cdot size \\ E(price|size,baths2=0) &= 75489 + 128 \cdot size \end{aligned} \]

Expected prices for two bathroom houses

\[ \begin{aligned} E(price|size,baths2=1) &= 49174 + 154 \cdot size \\ E(price|size,baths2=1) &= 62009 + 42005+ 136 \cdot size \\ E(price|size,baths2=1) &= 82311 + 22 \cdot size + 125 \cdot size \\ E(price|size,baths2=1) &= 75489 + 15120 + 14 \cdot size + 128 \cdot size \end{aligned} \]


17) What do you notice about the intercepts and the slopes? Think about what variation each model allows and what restrictions it imposes. Why are the intercepts furthest out for the model with the intercept dummy (modelInterceptDummy), at one point in the middle for the model with the slope dummy, and in between for the model with both the intercept and slope dummies? How does that relate to the estimated slopes? How does that relate to the model that only includes size? I’d start by comparing the graphs of the models.


The slope model forces the intercepts to be the same; for a house of size 0, the price is the same for both bathrooms, but it allows for the number of bathrooms to affect the relationship between size and price. The intercept model forces the slopes to be the same; the effect size has on price does not depend on the number of bathrooms. It does allow for the intercept to be different; for a house of size 0, the price is different for a one bathroom house compared to a two bathroom house. The intercept and slope model allows for both intercept and slope to be different.


13.6 Models with the number of bedrooms

18) To help you gain additional intuition for what’s going on in linear regressions, try estimating a model that includes size and the number of bedrooms (beds). Make a plot that includes yHat predicted values. Color the yHat points based on the number of bedrooms (use 1 bedroom = red, 2 bedroom = orange, 3 bedroom = green, 4 bedroom = blue….why? because people are used to seeing rainbow order, but yellow is hard to see so we skipped it…you can do this by adding scale_color_manual(values=c("red", "orange", "green","blue")). Why are the yHat points arranged in rows? How many rows are there? Why? Add geom_abline()s that connect the rows of dots (using the same colors as the yHat points).

Note that this is 4 separate geom_abline()s. It makes for a lot of lines of code. However, remember that once you do one of them, the other three are easily obtained by copy/pasting the first and changing the number of bedrooms and the corresponding color.


modelbeds <- lm(price ~ size + beds, data = mydata)
pander(modelbeds)
Fitting linear model: price ~ size + beds
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 69942 17390 4.022 0.00008013
size 113.8 18.4 6.184 0.000000003146
beds 22009 9103 2.418 0.01645
mydata$yHatbeds1 <- ifelse(mydata$beds == 1,coef(modelbeds)["(Intercept)"] + coef(modelbeds)["size"]*mydata$size + coef(modelbeds)["beds"]*1, NA)

mydata$yHatbeds2 <- ifelse(mydata$beds == 2,coef(modelbeds)["(Intercept)"] + coef(modelbeds)["size"]*mydata$size + coef(modelbeds)["beds"]*2, NA)

mydata$yHatbeds3 <- ifelse(mydata$beds == 3,coef(modelbeds)["(Intercept)"] + coef(modelbeds)["size"]*mydata$size + coef(modelbeds)["beds"]*3, NA)

mydata$yHatbeds4 <- ifelse(mydata$beds == 4,coef(modelbeds)["(Intercept)"] + coef(modelbeds)["size"]*mydata$size + coef(modelbeds)["beds"]*4, NA)


ggplot(mydata) + 
  scale_x_continuous(expand = c(0, 0),limits = c(0, max(mydata$size)*1.02),breaks = seq(0,max(mydata$size)*1.02,500)) + 
  scale_y_continuous(expand = c(0, 0),limits = c(0, max(mydata$price)*1.02), breaks = seq(0,max(mydata$price)*1.02,50000)) +
geom_point(aes(y=yHatbeds1,x=size), col = "red") +
geom_point(aes(y=yHatbeds2,x=size), col = "orange") +
geom_point(aes(y=yHatbeds3,x=size), col = "green") +
geom_point(aes(y=yHatbeds4,x=size), col = "blue") +


geom_abline(aes(intercept = coef(modelbeds)["beds"]*1 + coef(modelbeds)["(Intercept)"], slope = coef(modelbeds)["size"]), color = "red") +
geom_abline(aes(intercept = coef(modelbeds)["beds"]*2 + coef(modelbeds)["(Intercept)"], slope = coef(modelbeds)["size"]), color = "orange") +
geom_abline(aes(intercept = coef(modelbeds)["beds"]*3 + coef(modelbeds)["(Intercept)"], slope = coef(modelbeds)["size"]), color = "green") +
geom_abline(aes(intercept = coef(modelbeds)["beds"]*4 + coef(modelbeds)["(Intercept)"], slope = coef(modelbeds)["size"]), color = "blue") +
  
geom_point(aes(y=price,x=size, color = factor(beds)), shape=4, alpha = .4) +
scale_color_manual(values=c("red", "orange", "green","blue"))
## Warning: Removed 184 rows containing missing values (`geom_point()`).
## Warning: Removed 150 rows containing missing values (`geom_point()`).
## Warning: Removed 148 rows containing missing values (`geom_point()`).
## Warning: Removed 166 rows containing missing values (`geom_point()`).

There are 4 lines for each amount of bedrooms in the data set. The lines are arranged in rows because we are only allowing for different intercepts between different amounts of bedrooms, not different slopes. We are also forcing the difference in intercept between 1 and 2 rooms, 2 and 3, and 3 and 4 rooms to be the same.


19) Now try changing the previous model to include factor(beds) instead of beds (alternatively, create a dummy variable for 2 bedroom houses, 3 bedroom houses, and 4 bedroom houses, and include these three dummy variables in the model). Make a plot that includes yHat predicted values. Color the yHat points based on the number of bedrooms (using the same colors as above). Why are the yHat points arranged in rows? How many rows are there? Why? Can you add geom_abline()s that connect the rows of dots? How do the rows of dots (and the geom_abline()s that connect them) compare with the previous model? Is the effect of going from 1 bedroom to 2, or 2 to 3, or 3 to 4 different in this model than in the first?


mydata$beds1 <- ifelse(mydata$beds == 1, 1, 0)
mydata$beds2 <- ifelse(mydata$beds == 2, 1, 0)
mydata$beds3 <- ifelse(mydata$beds == 3, 1, 0)
mydata$beds4 <- ifelse(mydata$beds == 4, 1, 0)

modelbeddummy <- lm(price ~ size + beds2 + beds3 + beds4, data=mydata)
pander(modelbeddummy)
Fitting linear model: price ~ size + beds2 + beds3 + beds4
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 98518 25537 3.858 0.000152
size 111.7 18.75 5.956 0.00000001066
beds2 17174 13691 1.254 0.2111
beds3 39851 19369 2.057 0.04087
beds4 67649 27898 2.425 0.01616
mydata$yhatbeddummy1 <- ifelse(mydata$beds == 1, coef(modelbeddummy)["(Intercept)"] + coef(modelbeddummy)["size"]*mydata$size, NA)
mydata$yhatbeddummy2 <- ifelse(mydata$beds == 2, coef(modelbeddummy)["(Intercept)"] + coef(modelbeddummy)["beds2"]*mydata$beds2 + coef(modelbeddummy)["size"]*mydata$size, NA) 
mydata$yhatbeddummy3 <- ifelse(mydata$beds == 3, coef(modelbeddummy)["(Intercept)"] + coef(modelbeddummy)["beds3"]*mydata$beds3 + coef(modelbeddummy)["size"]*mydata$size, NA) 
mydata$yhatbeddummy4 <- ifelse(mydata$beds == 4, coef(modelbeddummy)["(Intercept)"] + coef(modelbeddummy)["beds4"]*mydata$beds4 + coef(modelbeddummy)["size"]*mydata$size, NA)

ggplot(mydata) + 
  scale_x_continuous(expand = c(0, 0),limits = c(0, max(mydata$size)*1.02),breaks = seq(0,max(mydata$size)*1.02,500)) + 
  scale_y_continuous(expand = c(0, 0),limits = c(0, max(mydata$price)*1.02), breaks = seq(0,max(mydata$price)*1.02,50000)) +
  
geom_point(aes(y=yhatbeddummy1,x=size), col = "red") +
geom_point(aes(y=yhatbeddummy2,x=size), col = "orange") +
geom_point(aes(y=yhatbeddummy3,x=size), col = "green") +
geom_point(aes(y=yhatbeddummy4,x=size), col = "blue") +
  
geom_abline(aes(intercept = coef(modelbeddummy)["(Intercept)"], slope = coef(modelbeddummy)["size"]), color = "red") +
geom_abline(aes(intercept = coef(modelbeddummy)["beds2"] + coef(modelbeddummy)["(Intercept)"], slope = coef(modelbeddummy)["size"]), color = "orange") +
geom_abline(aes(intercept = coef(modelbeddummy)["beds3"] + coef(modelbeddummy)["(Intercept)"], slope = coef(modelbeddummy)["size"]), color = "green") +
geom_abline(aes(intercept = coef(modelbeddummy)["beds4"] + coef(modelbeddummy)["(Intercept)"], slope = coef(modelbeddummy)["size"]), color = "blue") +
  
geom_point(aes(y=price,x=size, color = factor(beds)), shape=4, alpha = .4) +
scale_color_manual(values=c("red", "orange", "green","blue"))
## Warning: Removed 184 rows containing missing values (`geom_point()`).
## Warning: Removed 150 rows containing missing values (`geom_point()`).
## Warning: Removed 148 rows containing missing values (`geom_point()`).
## Warning: Removed 166 rows containing missing values (`geom_point()`).

This model is similar to the previous in that all lines have the same slope and are parallel because we are not allowing for the amount of bedrooms to affect the effect of size. We also have the same amount of rows as before, one for each number of rooms. However, this model is different because it allows for variation in the differences from 1 to 2, 2 to 3, and 3 to 4 bedrooms. For example, the change in intercept between 1 (red line) bedroom and 2 (orange line) bedrooms is smaller than 2 bedrooms to 3 (green line) bedrooms. As we increase bedrooms, the effect that number of bedrooms has on price increases. (This is all when size is held constant.)


20) Now let’s look at models with the number of bathrooms in addition to size and the number of bedrooms. For this model, include bedrooms as beds. Try to answer the same questions as the first model with bedrooms. Making a plot that includes yHat predicted values. Color the yHat points based on the number of bedrooms (using the same colors as before). Why are the yHat points arranged in rows? How many rows are there? Why? Add geom_abline()s that connect the rows of dots (using the same colors).

Note that this is now 8 separate geom_abline()s, but, as above, this isn’t hard if you copy/paste and just change what needs to be modified.


modelbedsbaths <- lm(price ~ size + beds + baths, data = mydata)
pander(modelbedsbaths)
Fitting linear model: price ~ size + beds + baths
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 38729 17445 2.22 0.02747
size 101.8 17.5 5.815 0.00000002211
beds 19000 8603 2.209 0.02828
baths 40858 7788 5.246 0.0000003751
mydata$yHatbeds1baths1 <- ifelse(mydata$beds == 1 & mydata$baths == 1, coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["size"]*mydata$size + coef(modelbedsbaths)["beds"]*1 + coef(modelbedsbaths)["baths"]*1, NA)

mydata$yHatbeds2baths1 <- ifelse(mydata$beds == 2 & mydata$baths == 1, coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["size"]*mydata$size + coef(modelbedsbaths)["beds"]*2 + coef(modelbedsbaths)["baths"]*1, NA)

mydata$yHatbeds3baths1 <- ifelse(mydata$beds == 3 & mydata$baths == 1, coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["size"]*mydata$size + coef(modelbedsbaths)["beds"]*3 + coef(modelbedsbaths)["baths"]*1, NA)

mydata$yHatbeds4baths1 <- ifelse(mydata$beds == 1 & mydata$baths == 1, coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["size"]*mydata$size + coef(modelbedsbaths)["beds"]*4 + coef(modelbedsbaths)["baths"]*1, NA)

mydata$yHatbeds1baths2 <- ifelse(mydata$beds == 1 & mydata$baths == 2, coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["size"]*mydata$size + coef(modelbedsbaths)["beds"]*1 + coef(modelbedsbaths)["baths"]*2, NA)

mydata$yHatbeds2baths2 <- ifelse(mydata$beds == 2 & mydata$baths == 2, coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["size"]*mydata$size + coef(modelbedsbaths)["beds"]*2 + coef(modelbedsbaths)["baths"]*2, NA)

mydata$yHatbeds3baths2 <- ifelse(mydata$beds == 3 & mydata$baths == 2, coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["size"]*mydata$size + coef(modelbedsbaths)["beds"]*3 + coef(modelbedsbaths)["baths"]*2, NA)

mydata$yHatbeds4baths2 <- ifelse(mydata$beds == 4 & mydata$baths == 2, coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["size"]*mydata$size + coef(modelbedsbaths)["beds"]*4 + coef(modelbedsbaths)["baths"]*2, NA)
ggplot(mydata) + 
  scale_x_continuous(expand = c(0, 0),limits = c(0, max(mydata$size)*1.02),breaks = seq(0,max(mydata$size)*1.02,500)) + 
  scale_y_continuous(expand = c(0, 0),limits = c(0, max(mydata$price)*1.02), breaks = seq(0,max(mydata$price)*1.02,50000)) +
geom_point(aes(y=yHatbeds1baths1,x=size), col = "red") +
geom_point(aes(y=yHatbeds2baths1,x=size), col = "orange") +
geom_point(aes(y=yHatbeds3baths1,x=size), col = "green") +
geom_point(aes(y=yHatbeds4baths1,x=size), col = "blue") +
geom_point(aes(y=yHatbeds1baths2,x=size), col = "pink") +
geom_point(aes(y=yHatbeds2baths2,x=size), col = "purple") +
geom_point(aes(y=yHatbeds3baths2,x=size), col = "brown") +
geom_point(aes(y=yHatbeds4baths2,x=size), col = "black") +
  
geom_abline(aes(intercept = coef(modelbedsbaths)["beds"]*1 + coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["baths"]*1, slope = coef(modelbedsbaths)["size"]), color = "red") +
geom_abline(aes(intercept = coef(modelbedsbaths)["beds"]*2 + coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["baths"]*1, slope = coef(modelbedsbaths)["size"]), color = "orange") +
geom_abline(aes(intercept = coef(modelbedsbaths)["beds"]*3 + coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["baths"]*1, slope = coef(modelbedsbaths)["size"]), color = "green") +
geom_abline(aes(intercept = coef(modelbedsbaths)["beds"]*4 + coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["baths"]*1, slope = coef(modelbedsbaths)["size"]), color = "blue") +
geom_abline(aes(intercept = coef(modelbedsbaths)["beds"]*1 + coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["baths"]*2, slope = coef(modelbedsbaths)["size"]), color = "pink") +
geom_abline(aes(intercept = coef(modelbedsbaths)["beds"]*2 + coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["baths"]*2, slope = coef(modelbedsbaths)["size"]), color = "purple") +
geom_abline(aes(intercept = coef(modelbedsbaths)["beds"]*3 + coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["baths"]*2, slope = coef(modelbedsbaths)["size"]), color = "brown") +
geom_abline(aes(intercept = coef(modelbedsbaths)["beds"]*4 + coef(modelbedsbaths)["(Intercept)"] + coef(modelbedsbaths)["baths"]*2, slope = coef(modelbedsbaths)["size"]), color = "black") +
  
geom_point(aes(y=price,x=size, color = factor(beds), shape = factor(baths)), alpha = .4) +
scale_color_manual(values=c("red", "orange", "green","blue"))
## Warning: Removed 189 rows containing missing values (`geom_point()`).
## Warning: Removed 178 rows containing missing values (`geom_point()`).
## Warning: Removed 189 rows containing missing values (`geom_point()`).
## Removed 189 rows containing missing values (`geom_point()`).
## Warning: Removed 211 rows containing missing values (`geom_point()`).
## Warning: Removed 188 rows containing missing values (`geom_point()`).
## Warning: Removed 175 rows containing missing values (`geom_point()`).
## Warning: Removed 176 rows containing missing values (`geom_point()`).

There are 2 rows of 4. One row (red, orange, green, blue) represents houses that all have one bathroom, but varying amounts of bedrooms. Each line corresponds with 1, 2, 3, or 4, bedrooms. Like in the first model in this section, the spacing between the lines is the same. The second row (pink, purple, brown, black) represents houses that all have two bathrooms, but varying amounts of bedrooms. The spacing between the line is the same as the first row, but all of the lines are shifted up by about 40,000, since that is the effect of having 2 bathrooms. For example, a house with one bathroom and one bedroom costs about 40,000 less than a house with two bathrooms and one bedroom. (This is all when size is held constant.)


21) Finally, try adding baths to the model with factor(beds) and make a similar graph (with the yHat points colored by the number of bedrooms, and the corresponding geom_abline()s). If you understood the previous two questions, you should have no problem understanding this question too. If you didn’t, make sure you understand those models first before trying to wrap your head around this one. How has the spacing between the lines changed? Explain the spacing between the same colored lines (i.e., the same number of bedrooms but 2 bathrooms instead of 1), and the different colored lines (i.e., the number of bedrooms).


mydata$beds1baths1 <- ifelse(mydata$beds == 1 & mydata$baths == 1, 1, 0)
mydata$beds2baths1 <- ifelse(mydata$beds == 2 & mydata$baths == 1, 1, 0)
mydata$beds3baths1 <- ifelse(mydata$beds == 3 & mydata$baths == 1, 1, 0)
mydata$beds4baths1 <- ifelse(mydata$beds == 4 & mydata$baths == 1, 1, 0)
mydata$beds1baths2 <- ifelse(mydata$beds == 1 & mydata$baths == 2, 1, 0)
mydata$beds2baths2 <- ifelse(mydata$beds == 2 & mydata$baths == 2, 1, 0)
mydata$beds3baths2 <- ifelse(mydata$beds == 3 & mydata$baths == 2, 1, 0)
mydata$beds4baths2 <- ifelse(mydata$beds == 4 & mydata$baths == 2, 1, 0)

modelbedbathdummy <- lm(price ~ size + beds2baths1 + beds3baths1 + beds4baths1 + beds1baths2 + beds2baths2 + beds3baths2 + beds4baths2, data=mydata)
pander(modelbedbathdummy)
Fitting linear model: price ~ size + beds2baths1 + beds3baths1 + beds4baths1 + beds1baths2 + beds2baths2 + beds3baths2 + beds4baths2
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 119831 24686 4.854 0.000002378
size 97.17 17.99 5.401 0.0000001808
beds2baths1 -1500 14548 -0.1031 0.918
beds3baths1 20226 19061 1.061 0.2899
beds4baths1 47024 30517 1.541 0.1249
beds1baths2 -19583 24977 -0.784 0.4339
beds2baths2 47840 15314 3.124 0.00204
beds3baths2 67408 20432 3.299 0.001142
beds4baths2 93105 27220 3.42 0.0007531
mydata$yhatbedbathdummy1 <- ifelse(mydata$beds == 1 & mydata$baths == 1, coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["size"]*mydata$size, NA)

mydata$yhatbedbathdummy2 <- ifelse(mydata$beds == 2 & mydata$baths == 1, coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["beds2baths1"]*mydata$beds2baths1 + coef(modelbedbathdummy)["size"]*mydata$size, NA)

mydata$yhatbedbathdummy3 <- ifelse(mydata$beds == 3 & mydata$baths == 1, coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["beds3baths1"]*mydata$beds3baths1 + coef(modelbedbathdummy)["size"]*mydata$size, NA) 

mydata$yhatbedbathdummy4 <- ifelse(mydata$beds == 4 & mydata$baths == 1, coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["beds4baths1"]*mydata$beds4baths1 + coef(modelbedbathdummy)["size"]*mydata$size, NA) 

mydata$yhatbedbathdummy5 <- ifelse(mydata$beds == 1 & mydata$baths == 2, coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["beds1baths2"]*mydata$beds1baths2 + coef(modelbedbathdummy)["size"]*mydata$size, NA) 

mydata$yhatbedbathdummy6 <- ifelse(mydata$beds == 2 & mydata$baths == 2, coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["beds2baths2"]*mydata$beds2baths2 + coef(modelbedbathdummy)["size"]*mydata$size, NA) 

mydata$yhatbedbathdummy7 <- ifelse(mydata$beds == 3 & mydata$baths == 2, coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["beds3baths2"]*mydata$beds3baths2 + coef(modelbedbathdummy)["size"]*mydata$size, NA) 

mydata$yhatbedbathdummy8 <- ifelse(mydata$beds == 4 & mydata$baths == 2, coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["beds4baths2"]*mydata$beds4baths2 + coef(modelbedbathdummy)["size"]*mydata$size, NA) 

ggplot(mydata) + 
  scale_x_continuous(expand = c(0, 0),limits = c(0, max(mydata$size)*1.02),breaks = seq(0,max(mydata$size)*1.02,500)) + 
  scale_y_continuous(expand = c(0, 0),limits = c(0, max(mydata$price)*1.02), breaks = seq(0,max(mydata$price)*1.02,50000)) +
  
geom_point(aes(y=yhatbedbathdummy1,x=size), col = "red") +
geom_point(aes(y=yhatbedbathdummy2,x=size), col = "orange") +
geom_point(aes(y=yhatbedbathdummy3,x=size), col = "green") +
geom_point(aes(y=yhatbedbathdummy4,x=size), col = "blue") +
geom_point(aes(y=yhatbedbathdummy5,x=size), col = "pink") +
geom_point(aes(y=yhatbedbathdummy6,x=size), col = "purple") +
geom_point(aes(y=yhatbedbathdummy7,x=size), col = "brown") +
geom_point(aes(y=yhatbedbathdummy8,x=size), col = "black") +
  
geom_abline(aes(intercept = coef(modelbedbathdummy)["(Intercept)"], slope = coef(modelbedbathdummy)["size"]), color = "red") +
geom_abline(aes(intercept = coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["beds2baths1"], slope = coef(modelbedbathdummy)["size"]), color = "orange") +
geom_abline(aes(intercept = coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["beds3baths1"], slope = coef(modelbedbathdummy)["size"]), color = "green") +
geom_abline(aes(intercept = coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["beds4baths1"], slope = coef(modelbedbathdummy)["size"]), color = "blue") +
geom_abline(aes(intercept = coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["beds1baths2"], slope = coef(modelbedbathdummy)["size"]), color = "pink") +
geom_abline(aes(intercept = coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["beds2baths2"], slope = coef(modelbedbathdummy)["size"]), color = "purple") +
geom_abline(aes(intercept = coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["beds3baths2"], slope = coef(modelbedbathdummy)["size"]), color = "brown") +
geom_abline(aes(intercept = coef(modelbedbathdummy)["(Intercept)"] + coef(modelbedbathdummy)["beds4baths2"], slope = coef(modelbedbathdummy)["size"]), color = "black") +

geom_point(aes(y=price,x=size, color = factor(beds), shape = factor(baths)), alpha = .4) +
scale_color_manual(values=c("red", "orange", "green","blue"))  
## Warning: Removed 189 rows containing missing values (`geom_point()`).
## Warning: Removed 178 rows containing missing values (`geom_point()`).
## Warning: Removed 189 rows containing missing values (`geom_point()`).
## Warning: Removed 206 rows containing missing values (`geom_point()`).
## Warning: Removed 211 rows containing missing values (`geom_point()`).
## Warning: Removed 188 rows containing missing values (`geom_point()`).
## Warning: Removed 175 rows containing missing values (`geom_point()`).
## Warning: Removed 176 rows containing missing values (`geom_point()`).

This model allows for variation in the spacing of the lines. If you hold bathrooms constant, the difference between 1 and 2 bedrooms is not the same as 2 and 3 bedrooms. Likewise, if you have a one bedroom house, the difference between 1 and 2 bathrooms will be different if you have a two bedroom house. For example, the difference between a house with 3 bedrooms and 1 bath and 3 bedrooms and 2 baths is 47, 182, but the difference between a house with 4 bedrooms and 1 bath and 4 bedrooms and 2 baths is 46, 081. (This is all when size is held constant.) (This is all when size is held constant.)