1.1 Introduction

The purpose of our project is for developing a hedonic price model to predict the home prices in San Francisco.This project focus on San Francisco’s housing prices is because our customer Zillow needs a more accurate model to predict house prices. Our group developed a regression model that accounts for a wide variety in relevant variables. The difficulty of this project is that the fluctuation of house prices is a complex phenomenon due to the interrelationship of many different factors. So, it is hard to find the best model. In addition, in the process of modeling, predictions models that are accurate in other regions may not work in San Francisco which requires our group to explore variables unique to San Francisco that significantly affect housing prices.The overall strategy of this project is selecting various relevant variables related to economic, social, transportation, safety, housing and environmental characteristics around each house or establishing spatial feartures that helping to explain the fluctuation of the house prices.

Our result is the model that we have developed explains roughly 80% of the variation in home prices in San Francisco (R2 of .787) within a 18% range of accuracy for each price that we predict (MAPE of .18). Our model shows a regularly distributed residual, both numerically and geographically, meaning that our model is generalizable–equally able to predict sales price in one neighborhood versus another. All of these factors lead us to believe that our model will be useful for Zillow.

1.2 Setup & Data Wrangling

The following packages will be use in the project. The map theme and plot theme is used to standardize the cartography on all maps.The palette is used for map coloring. The qBr and q5 are used for converting a column in to quintiles. It is used for mapping.

library(corrplot)
library(viridis)
library(stargazer)
library(tigris)
library(tidyverse)
library(sf)
library(spdep)
library(data.table)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
library(png)
library(broom)

mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2)
  )
}

plotTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 14)
  )
}


palette5 <- c("#0000FF","#00FF00","#FF0000","#000000","#FFFF00")
palette1 <- c("#999999", "#E69F00", "#56B4E9")

qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]],0),
                          c(.01,.2,.4,.6,.8), na.rm=T))
  } else if (rnd == FALSE | rnd == F) {
    as.character(formatC(quantile(df[[variable]]), digits = 3),
                          c(.01,.2,.4,.6,.8), na.rm=T)
  }
}

q5 <- function(variable) {as.factor(ntile(variable, 5))}

1.2.1 Gathering the data

First, the base map of San Francisco and the basic data set are loaded

baseMap <-  st_read("https://data.sfgov.org/api/geospatial/p5b7-5n3h?method=export&format=GeoJSON") %>%
  st_transform(2227)

student <- 
  st_read("/Users/Anyang/Desktop/midtermData_studentVersion/midterm_data_sf_studentVersion.shp") %>%
  st_transform(st_crs(baseMap))

We load the data of neighborhood1 and neighborhood2

The neighborhood1 has 37 districts the neighborhood2 has 117 districts in San Francisco. These data sets will be used when join the independent variables and dependent variable.

Here we use two kind of Classification of neighborhood becasue when we join some data following, some independent variables (such as fog) need more detailed classification to be joined.

sf_neighborhood <- 
  st_read("https://data.sfgov.org/api/geospatial/iacs-ws63?method=export&format=GeoJSON")%>%
  st_transform(2227)

sf_neighborhood2 <- 
  st_read("https://data.sfgov.org/api/geospatial/pty2-tcw4?method=export&format=GeoJSON")%>%
  st_transform(2227)

Then we use the st_join to join the student and sf_neighborhood1 and sf_neighborhood2

student1 = st_join(student,sf_neighborhood2[,-1],join= st_intersects )
student2 = st_join(student1,sf_neighborhood,join = st_intersects)

We use the data collected from the https://sfplanning.org/resource/san-francisco-neighborhoods-socio-economic-profiles-2010-2014 and cleaning it into CSV files. The orginal data is classifed by neighbourhood,and our group merge it with the basic data set.

##fill the names of neighborhoods that are missing.
which(is.na(student2$neighborho)==TRUE)
student2[210,]$name <- "Bayview"
student2[7207,]$name <- "Crocker Amazon"
student2[7207,]$neighborho <- "Crocker Amazon"
student2[7466,]$name <- "Oceanview"
student2[7466,]$neighborho <- "Ocean View"
student2[7691,]$name <- "Crocker Amazon"
student2[7691,]$neighborho <- "Crocker Amazon"
student2[7798,]$name <- "Crocker Amazon"
student2[7798,]$neighborho <- "Crocker Amazon"
student2[8762,]$name <- "Crocker Amazon"
student2[8762,]$neighborho <- "Crocker Amazon"

which(student2$Baths>15)
student2[6024,]$Baths <- median(student2[which(student2$name=="Crocker Amazon"),]$Baths)
student2[7479,]$Baths <- median(student2[which(student2$name=="Oceanview"),]$Baths)

which(student2$Beds>15)
student2[380,]$Beds <- median(student2[which(student2$name=="Pacific Heights"),]$Beds)

which(student2$Rooms==0)
which(student2$Rooms>30)
student2[6177,]$Rooms <- median(student2[which(student2$name=="Mission"),]$Rooms)

which(student2$Stories>60)
student2[1097,]$Stories <- median(student2[which(student2$name=="Western Addition"),]$Stories)
student2[1120,]$Stories <- median(student2[which(student2$name=="Buena Vista"),]$Stories)
student2[8574,]$Stories <- median(student2[which(student2$name=="Ingleside"),]$Stories)
student2 = student2[-which(student2$PropArea==0),]

demo1 <-  read_csv("C:/Users/Anyang/Desktop/demo2.csv")
demo2 <- demo1 %>%
  select(`Graduate/Professional Degree`,`Average Household Size`,`Managerial Professional`,`Fair Market Rent`,`No Bedroom`,`Foreign Born`,`Employment Rate`,`Some College/Associate Degree` ,`College Degree`,Asian,`20 Units or more`, `Households with Children, % of Total`,`Owner occupied`,`Black/African American`,`Vacant Units`,nhood,`Elementary school access`)

student2 = merge(student2, demo2, by.x="neighborho",by.y="nhood")

We use the fog data

fog <-  read_csv("C:/Users/Anyang/Desktop/fog.csv")
student2 = merge(student2, fog, by.x="name",by.y="nhood")

We use the business data from Open Data SF

business <-  read_csv("C:/Users/Anyang/Desktop/businessn.csv")
student2 = merge(student2, business, by.x="neighborho",by.y="nhood")

2.1 Feature Engineering

Our team calculated the number of internal spatial features by building a buffer, and we selected some features we supposed that have a significant impact on house prices.

We creat the count number of transit stops of public transportation in a specific buffer

transittop <- 
  st_read('https://data.sfgov.org/resource/i28k-bkz6.geojson')%>%
  st_transform(2227)

transittop$latitude <- as.numeric(as.character(transittop$latitude))
transittop$longitude <- as.numeric(as.character(transittop$longitude))

transittop.sf <- 
  transittop %>%
  st_as_sf(coords=c('latitude','longitude'),crs=4326,agr='constant')%>%
  st_transform(2227)

transittopfilter <- 
  transittop %>%
  filter(latitude >37.7,longitude < -122)

transittopfilter_sf <- 
  transittopfilter %>%
  dplyr::select(latitude,longitude) %>%
  na.omit() %>%
  st_as_sf(coords=c('latitude','longitude'),crs=4326,agr='constant')%>%
  st_transform(2227)%>%
  distinct()
  

student2$transittop.Buffer =
  student2 %>%
  st_buffer(10000) %>%
  aggregate(mutate(transittopfilter_sf , transittop.Buffer = 1),., sum) %>%
  pull(transittop.Buffer)

We creat the count number of crime in a specific buffer

crime <- 
  st_read('https://data.sfgov.org/resource/wg3w-h783.geojson')%>%
  st_transform(2227)

crime$latitude <- as.numeric(as.character(crime$latitude))
crime$longitude <- as.numeric(as.character(crime$longitude))

crime.sf <- 
  crime %>%
  st_as_sf(coords=c('latitude','longitude'),crs=4326,agr='constant')%>%
  st_transform(2227)

crimefilter <- 
  crime %>%
  filter(latitude >37.7,longitude < -122)

crimefilter_sf <- 
  crimefilter %>%
  dplyr::select(latitude,longitude) %>%
  na.omit() %>%
  st_as_sf(coords=c('latitude','longitude'),crs=4326,agr='constant')%>%
  st_transform(2227)%>%
  distinct()


student2$crime.Buffer =
  student2 %>%
  st_buffer(10000) %>%
  aggregate(mutate(crimefilter_sf , crime.Buffer = 1),., sum) %>%
  pull(crime.Buffer)

We creat the count number of hospital in a specific buffer

hospital <- 
  st_read('C:/Users/Anyang/Desktop/Hospitals/Hospitals_BayArea.shp')%>%
  st_transform(2227)

hospital_sf <- 
  crimefilter %>%
  dplyr::select(geometry) %>%
  na.omit() %>%
  st_as_sf(crs=4326,agr='constant')%>%
  st_transform(2227)%>%
  distinct()

student2$hospital.Buffer =
  student2 %>%
  st_buffer(5000) %>%
  aggregate(mutate(hospital_sf , hospital.Buffer = 1),., sum) %>%
  pull(hospital.Buffer)

We creat the the measures of average nearest distance from each home sale price observation to its 5 nearest crime neighbor.

nn_function <- function(measureFrom,measureTo,k) {
  measureFrom_Matrix <-
    as.matrix(measureFrom)
  measureTo_Matrix <-
    as.matrix(measureTo)
  nn <-   
    get.knnx(measureTo, measureFrom, k)$nn.dist
    output <-
      as.data.frame(nn) %>%
      rownames_to_column(var = "thisPoint") %>%
      gather(points, point_distance, V1:ncol(.)) %>%
      arrange(as.numeric(thisPoint)) %>%
      group_by(thisPoint) %>%
      summarize(pointDistance = mean(point_distance)) %>%
      arrange(as.numeric(thisPoint)) %>% 
      dplyr::select(-thisPoint) %>%
      pull()
  
  return(output)  
}

schools <- 
  st_read('https://data.sfgov.org/resource/tpp3-epx2.geojson')%>%
  st_transform(2227)

student2$school_dis =
  nn_function(st_coordinates(student2), st_coordinates(schools), 5)

The lag price/lag crime buffer count/lag transit stop count/lag business

coords <- st_coordinates(student2) 
neighborList <- knn2nb(knearneigh(coords, 5))
spatialWeights <- nb2listw(neighborList, style="W")
student2$lagPrice <- lag.listw(spatialWeights, student2$SalePrice)
student2$lagcrime <- lag.listw(spatialWeights, student2$crime.Buffer)
student2$lagtransittop.Buffer <- lag.listw(spatialWeights, student2$transittop.Buffer)
student2$lagbiz <- lag.listw(spatialWeights, student2$num_business)

2.2 Data Collection

Our team has selected and compiled data from a number of variables that we believe can predict the price of a home. The resulting data is the most useful data for model building. We divide it into the following categories.

  • internal characteristics
  • Property Area
  • Lot Area
  • Built Year
  • Beds
  • Beths
  • Average Household Size..

  • amenities/public services
  • transit stops of public transportation
  • schools
  • hospitals
  • crimes
  • business..

  • Spatial Structure
  • Spatial lag of price/transit stops/cirme
  • neighborhood..

  • Social Factors
  • Employment Rate
  • Foreign Born
  • college degree
  • Asian
  • Black/African American..

2.3 Data preprocessing

We reclassify for built year.

student2$BuiltYear=as.numeric(student2$BuiltYear)
student2$SaleYr=as.factor(student2$SaleYr)
quantile(student2$BuiltYear)
BuiltYr=student2$BuiltYear
BuiltYr[10078]
for(i in 1:10078)
{
  if(BuiltYr[i]<=36)
  {
    BuiltYr[i]=1
  }
  else if((BuiltYr[i]>36)&(BuiltYr[i]<=52))
  {
    BuiltYr[i]=2
  }
  else if((BuiltYr[i]>52)&(BuiltYr[i]<=71))
  {
    BuiltYr[i]=3
  }
  else if((BuiltYr[i]>71)&(BuiltYr[i]<=139))
  {
    BuiltYr[i]=4
  }
  
}
student2$BuiltYr=as.factor(BuiltYr) 

Cleaning the Data

#set a new row called college
student2$college<- student2$`Some College/Associate Degree`+student2$`College Degree`
#log transformed this features
student2$lnBeds = log(student2$Beds+1)
student2$lnSalePrice = log(student2$SalePrice)
student2$lnPropArea = log(student2$PropArea)
student2$lnRooms = log(student2$Rooms+1)

2.3.1 Summary Statistics

Following is the table of summary statistics using the Stargazer package.

stargazer(as.data.frame(student2), type="text", title = "Summary Statistics")
## 
## Summary Statistics
## =============================================================================================================================
## Statistic                              N        Mean       St. Dev.        Min        Pctl(25)      Pctl(75)         Max     
## -----------------------------------------------------------------------------------------------------------------------------
## X                                    10,078 5,998,962.000  8,956.204  5,980,569.000 5,992,443.000 6,005,293.000 6,020,464.000
## Y                                    10,078 2,100,286.000  8,463.156  2,086,052.000 2,094,039.000 2,105,302.000 2,121,818.000
## long                                 10,078    37.747        0.023       37.708        37.730        37.761        37.806    
## lat                                  10,078   -122.446       0.031      -122.511      -122.469      -122.425      -122.371   
## SalePrice                            10,078 1,064,620.000 735,206.300       0         642,602.5    1,327,752.0    4,750,003  
## LotArea                              10,078  247,080.900  136,698.800       0          187,500       300,000      1,890,500  
## PropArea                             10,078   1,647.934     774.378        187          1,159         1,978        24,308    
## BuiltYear                            10,078    56.825       26.155          1            36            71            139     
## Stories                              10,078     1.346        0.676          0             1             2            10      
## Units                                10,078     1.000        0.000          1             1             1             1      
## Rooms                                10,078     6.217        2.125          0             5             7            26      
## Beds                                 10,078     1.702        1.700          0             0             3            10      
## Baths                                10,078     1.801        0.961          0             1             2             8      
## holdOut                              10,078     0.069        0.254          0             0             0             1      
## Graduate/Professional Degree         10,078     0.191        0.104        0.030         0.090         0.280         0.420    
## Average Household Size               10,078     2.724        0.620        1.600         2.200         3.400         3.600    
## Managerial Professional              10,078     0.274        0.123        0.032         0.159         0.374         0.532    
## Fair Market Rent                     10,078   2,550.188     542.181       1,800         2,010         3,000         4,330    
## No Bedroom                           10,078     0.048        0.050        0.010         0.020         0.060         0.630    
## Foreign Born                         10,078     0.360        0.118        0.140         0.280         0.430         0.700    
## Employment Rate                      10,078     0.550        0.076        0.419         0.508         0.583         0.739    
## Some College/Associate Degree        10,078     0.223        0.044        0.100         0.200         0.250         0.300    
## College Degree                       10,078     0.300        0.086        0.160         0.210         0.350         0.520    
## Asian                                10,078     0.350        0.171        0.080         0.160         0.510         0.800    
## 20 Units or more                     10,078     0.090        0.144        0.000         0.020         0.080         0.940    
## Households with Children, % of Total 10,078     0.252        0.092        0.060         0.180         0.300         0.400    
## Owner occupied                       10,078     0.514        0.173        0.030         0.400         0.620         0.790    
## Black/African American               10,078     0.058        0.076        0.000         0.010         0.060         0.310    
## Vacant Units                         10,078     0.061        0.021        0.020         0.050         0.070         0.220    
## Elementary school access             10,078    33.663       16.566          6            21            52            64      
## Avg_hr_fog                           10,078     9.439        1.355        6.000         8.670        10.320        12.000    
## num_business                         10,078   2,723.511    1,498.333       166          1,796         3,284        15,618    
## transittop.Buffer                    10,078    209.964      68.728         49            151           266           397     
## crime.Buffer                         10,078    101.189      63.841         15            54            132           341     
## hospital.Buffer                      10,078    26.448       19.542          4            14            32            155     
## school_dis                           10,078   1,578.461     481.432      286.606      1,226.702     1,901.334     4,208.625  
## lagPrice                             10,078 1,067,666.000 540,723.400    100,000      674,026.7    1,316,001.0    4,283,002  
## lagcrime                             10,078    101.179      63.743         15            54           131.4          332     
## lagtransittop.Buffer                 10,078    209.970      68.587         51           151.4         266.2          385     
## lagbiz                               10,078   2,718.748    1,455.416       166         1,857.7        3,284         8,760    
## college                              10,078     0.523        0.057        0.270         0.480         0.560         0.640    
## lnBeds                               10,078     0.763        0.704        0.000         0.000         1.386         2.398    
## lnSalePrice                          10,078   -Inf.000                  -Inf.000       13.373        14.099        15.374    
## lnPropArea                           10,078     7.323        0.405        5.231         7.055         7.590        10.099    
## lnRooms                              10,078     1.921        0.376        0.000         1.792         2.079         3.296    
## -----------------------------------------------------------------------------------------------------------------------------

2.3.2 correlation matrix

The correlation matrix of selected variables can help us to find the correlation between two variables.

1

1

2.3.3 Correlation Scatterplots

The scatterplots of home price correlated with 4 different variable. We select 4 interesting variables to make scatterlots of home price correlation.

The first is the built year of the houses. The second is the count of hospital in the 2 mile buffer. The third is the distance to the school of each house. The forth is the count of transit stops of public transportation in the 2 mile buffer.From the result, it is notable to indicate that the increase of built year and hospitals in the buffer will decrease the sale price, but this effect is not signifcant.The increase of school distance will signifcant decrease the sale price. The increase of the transit stops will leads the increase the sale price and the change is more obvious. It is possible that the school distance and transit stops have a higher correlation than the other two features.

as.data.frame(student2) %>% 
  dplyr::select(SalePrice, transittop.Buffer, school_dis, hospital.Buffer,BuiltYear) %>%
  filter(SalePrice <= 1000000 & SalePrice > 0,BuiltYear > 10) %>%
  gather(Variable, Value, -SalePrice) %>% 
   ggplot(aes(Value, SalePrice)) +
     geom_point() +
     geom_smooth(method = "lm", se=F, colour = "#FA7800") +
     facet_wrap(~Variable, ncol = 2, scales = "free") +
     labs(title = "Price as a function of continuous variables") +
     theme(axis.text.x = element_blank()) +
     plotTheme()

2.3.4 Map of sale price

From that the dependent variable SalePrice can be plot on the map.Here is the Distribution Map of the Sales Price. From this map, it can be seen that high value houses located in the center and the north of the San Francisco.The west and the south-west houses have a lower value. The south-east seems has the lowest value.

ggplot() +
  geom_sf(data=st_union(baseMap),fill='grey80') +
  geom_sf(data = student, aes(color = q5(SalePrice)), show.legend = "point", size = .75) + 
  scale_colour_manual(values = palette5,
                      labels=qBr(student,"SalePrice"),
                      name="Quintile\nBreaks") +
  labs(title="Sales Price in San Francisco") +
  mapTheme()

2.3.5 Maps of most Independent Variables

The map of density of crime

From the map of density, it is notable to indicate that the north east area has the most number of crime. Relating to the house price distribution, there is less sample in that area.

ggplot() + 
  geom_sf(data = baseMap, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(crime)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "#25CB10", high = "#FA7800", name = "Density") +
  scale_alpha(range = c(0.00, 0.50), guide = FALSE) +
  labs(title = "Density of Crime, San Francisco") +
  mapTheme()

The map of transit stops with the distribution of price

palette4 <- c("#FA7800","#C48C04","#8FA108","#5AB60C","#25CB10")
ggplot() + 
  geom_sf(data = baseMap, fill = "grey40") +
  geom_sf(data = student2, aes(colour = q5(SalePrice)), show.legend = "point", size = .75) +
  geom_sf(data = transittop.sf, colour = "black", size = .75) +
  scale_colour_manual(values = palette4,
                      labels=qBr(student2,"SalePrice"),
                      name="Quintile\nBreaks") +
  labs(title = "Transit stops and home sale prices, San Francisco", subtitle = "transit stop in black") +
  mapTheme()

The density map of hospitals

ggplot() + 
  geom_sf(data = baseMap, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(schools)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "#25CB10", high = "#FA7800", name = "Density") +
  scale_alpha(range = c(0.00, 0.50), guide = FALSE) +
  labs(title = "Density of Hospitals, San Francisco") +
  mapTheme()

3.1 Methods

Our group used an ordinary least square linear regression to predict the housing price. We used a wide range of relevant dependent variables (see 2.3.1 for a complete list.) After gathering all the dependent variables, we divided the dataset into 3 groups: prediction group, training group and test group. We conducted a linear regression on Sale price against all those variables on the training group, and then used this model to predict the sale price for the test group. To evaluate the accuracy of our model, we then calculated the mean absolute percent error (MAPE). To improve the model, we redid the regression based on a different set of variables and recalculate the MAPE. We used trial and error until we reached the model with the lowest MAPE. Once we found our best model, we regressed again using the variables on both training and test group together, using this model to predict for the prices in the prediction group.

student4 <- student2[-which(student2$holdOut==1),]
student4.use <- student2[which(student2$holdOut==1),]

inTrain <- createDataPartition(
              y=paste(student4$PropClassC,student4$SaleYr),
              p =.8, list = FALSE)
              
training <- student4[inTrain,] 
test <- student4[-inTrain,]

mean(training$SalePrice)
mean(test$SalePrice)

training$Index1 = training$lnPropArea * training$`Fair Market Rent`
test$Index1 = test$lnPropArea * test$`Fair Market Rent`
training$Index2 = training$lnPropArea * training$`Households with Children, % of Total`
test$Index2 = test$lnPropArea * test$`Households with Children, % of Total`
training$Index3 = training$lnPropArea * training$transittop.Buffer
test$Index3 = test$lnPropArea * test$transittop.Buffer
training$Index4 = training$`Fair Market Rent` * training$transittop.Buffer
test$Index4 = test$`Fair Market Rent` * test$transittop.Buffer
training$Index5 = training$lnPropArea * training$`Graduate/Professional Degree`
test$Index5 = test$lnPropArea * test$`Graduate/Professional Degree`

reg <- lm(lnSalePrice ~ .,data =training %>%
            st_set_geometry(NULL) %>%
            dplyr::select(lnSalePrice,LotArea,SaleYr,lnPropArea,PropClassC, hospital.Buffer, crime.Buffer,lnRooms,BuiltYr,Stories,Baths,lnBeds,`Graduate/Professional Degree`,`Average Household Size`,`Managerial Professional`,`Elementary school access`,`Fair Market Rent`,num_business,`No Bedroom`,`Foreign Born`,`Employment Rate`,college,Asian,`20 Units or more`, `Households with Children, % of Total`,`Owner occupied`,`Black/African American`,`Vacant Units`,Index1,Index2,Index3,Index4,Index5))

summary(reg)

3.2 Results

3.2.1 The in-sample model results

kable(tidy(reg))
term estimate std.error statistic p.value
(Intercept) 11.6142249 0.5790895 20.0560092 0.0000000
LotArea 0.0000003 0.0000000 11.9374140 0.0000000
SaleYr13 0.1908435 0.0079468 24.0152361 0.0000000
SaleYr14 0.3315673 0.0081944 40.4627639 0.0000000
SaleYr15 0.4677250 0.0083283 56.1611655 0.0000000
lnPropArea 0.2709918 0.0742470 3.6498676 0.0002642
PropClassCD 0.6289958 0.1770816 3.5520107 0.0003847
PropClassCDA 0.6074296 0.1901391 3.1946586 0.0014058
PropClassCF 0.5267360 0.1820146 2.8939220 0.0038156
PropClassCLZ 0.3167473 0.1873998 1.6902224 0.0910271
PropClassCOZ -0.2377586 0.3099122 -0.7671806 0.4429984
PropClassCTH 0.4514064 0.1903710 2.3711933 0.0177560
PropClassCTIC -0.3913447 0.1815146 -2.1559955 0.0311158
PropClassCZ 0.4766883 0.1774002 2.6870792 0.0072239
PropClassCZBM -0.2015081 0.1959883 -1.0281637 0.3039061
PropClassCZEU 2.3485593 0.2527058 9.2936507 0.0000000
hospital.Buffer -0.0013778 0.0004249 -3.2428949 0.0011884
crime.Buffer 0.0012018 0.0002162 5.5579919 0.0000000
lnRooms -0.0155356 0.0107287 -1.4480478 0.1476456
BuiltYr2 0.0041174 0.0090805 0.4534366 0.6502476
BuiltYr3 0.0084638 0.0094917 0.8917003 0.3725823
BuiltYr4 -0.0156838 0.0093130 -1.6840870 0.0922066
Stories 0.0272989 0.0051665 5.2838003 0.0000001
Baths 0.0099762 0.0047336 2.1075164 0.0351061
lnBeds -0.0017185 0.0050671 -0.3391453 0.7345098
Graduate/Professional Degree -2.6173437 0.7085223 -3.6940880 0.0002223
Average Household Size -0.2765524 0.0261216 -10.5871142 0.0000000
Managerial Professional -0.7304696 0.1499385 -4.8717935 0.0000011
Elementary school access -0.0101962 0.0005472 -18.6331141 0.0000000
Fair Market Rent -0.0004357 0.0001569 -2.7773310 0.0054944
num_business 0.0000102 0.0000033 3.1394240 0.0016994
No Bedroom -0.8380124 0.1841074 -4.5517584 0.0000054
Foreign Born -1.0838755 0.1660528 -6.5272946 0.0000000
Employment Rate -0.5517922 0.2011114 -2.7437146 0.0060893
college 0.6701368 0.1468570 4.5631939 0.0000051
Asian 0.8095830 0.0837671 9.6646920 0.0000000
20 Units or more -0.2840930 0.0801059 -3.5464674 0.0003928
Households with Children, % of Total 3.8110599 0.8807094 4.3272616 0.0000153
Owner occupied 0.1693753 0.0503530 3.3637604 0.0007727
Black/African American -1.9492244 0.0988415 -19.7207169 0.0000000
Vacant Units 1.5176525 0.2693955 5.6335490 0.0000000
Index1 0.0000787 0.0000203 3.8713414 0.0001092
Index2 -0.6173940 0.1205043 -5.1234189 0.0000003
Index3 0.0003232 0.0000511 6.3182582 0.0000000
Index4 -0.0000006 0.0000002 -3.8274479 0.0001305
Index5 0.4531083 0.0951579 4.7616464 0.0000020

3.2.2 The R squared, mean absolute error and MAPE

test1 <-
  test %>%
  mutate(SalePrice.Predict = exp(predict(reg, test)),
         SalePrice.Error = SalePrice - SalePrice.Predict,
         SalePrice.AbsError = abs(SalePrice - SalePrice.Predict),
         SalePrice.APE = (abs(SalePrice - SalePrice.Predict)) / SalePrice)

st_set_geometry(test1, NULL) %>%    # the Mean Absolute Erroe
  summarize(mean(SalePrice.AbsError, na.rm = T)) %>%
  pull()

st_set_geometry(test1, NULL) %>%    # the Mean Absolute Percent Error
  summarize(mean(SalePrice.APE, na.rm = T) * 100) %>%
  round(2) %>%
  paste("%")

MAE = st_set_geometry(test1, NULL) %>%    # the Mean Absolute Error
  summarize(mean(SalePrice.AbsError, na.rm = T)) %>%
  pull()

print(MAE)

MAPE = st_set_geometry(test1, NULL) %>%    # the Mean Absolute Percent Error
  summarize(mean(SalePrice.APE, na.rm = T) * 100) %>%
  round(2) %>%
  paste("%")
print(MAPE)
summary(reg)$r.squared
df = data.frame("r.squared"=summary(reg)$r.squared,"Mean Absolute Error"= MAE, "Mean Absolute Percent Error"=MAPE)
kable(df)  %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
r.squared Mean.Absolute.Error Mean.Absolute.Percent.Error
0.7865802 210338.9 18.44 %

3.2.3 Cross-validation test

fitControl <- trainControl(method = "cv", number = 100)

student2$Index1 = student2$lnPropArea * student2$`Fair Market Rent`
student2$Index2 = student2$lnPropArea * student2$`Households with Children, % of Total`
student2$Index3 = student2$lnPropArea * student2$transittop.Buffer
student2$Index4 = student2$`Fair Market Rent` * student2$transittop.Buffer
student2$Index5 = student2$lnPropArea * student2$`Graduate/Professional Degree`

set.seed(825)

reg.cv <- train(SalePrice ~ ., data = as.data.frame(student2) %>% 
                                               dplyr::select(SalePrice,LotArea,SaleYr,lnPropArea,PropClassC, hospital.Buffer, crime.Buffer,lnRooms,BuiltYr,Stories,Baths,lnBeds,`Graduate/Professional Degree`,`Average Household Size`,`Managerial Professional`,`Elementary school access`,`Fair Market Rent`,num_business,`No Bedroom`,`Foreign Born`,`Employment Rate`,college,Asian,`20 Units or more`, `Households with Children, % of Total`,`Owner occupied`,`Black/African American`,`Vacant Units`,Index1,Index2,Index3,Index4,Index5,lagPrice), 
                 method = "lm", trControl = fitControl, na.action = na.pass)


A=ggplot(as.data.frame(reg.cv$resample), aes(MAE)) + 
  geom_histogram(bins = 50, colour="white", fill = "#FA7800") +
  labs(title="Distribution of MAE", subtitle = "k-fold cross validation; k = 100",
       x="Mean Absolute Error", y="Count")+
  plotTheme()

B=A+geom_vline(xintercept=mean(MAE), linetype="dashed", 
                color = "red", size=2)
mean(MAE)
plot(B)

The mean MAE iS 213393

3.2.4 The predicted prices as a function of observed prices

test2 <- 
  test1%>%
  filter(SalePrice <= 4000000 & SalePrice > 0,SalePrice.Predict<4000000) 

ggplot(test2, aes(SalePrice.Predict, SalePrice)) + 
  geom_point() +
  stat_smooth(data=test2, aes(SalePrice, SalePrice), 
             method = "lm", se = FALSE, size = 1, colour="#FA7800") + 
  stat_smooth(data=test2, aes(SalePrice, SalePrice.Predict), 
              method = "lm", se = FALSE, size = 1, colour="#25CB10") +
  labs(title="Predicted sale price as a function of\nobserved price",
       subtitle="Orange line represents a perfect prediction; Green line represents prediction") +
  theme(plot.title = element_text(size = 18,colour = "black"))

3.2.5 The residuals and sale price for the test set

grid.arrange(
  ggplot() +
    geom_sf(data = baseMap, fill = "grey40") +
    geom_sf(data = test1, aes(colour = q5(SalePrice)), show.legend = "point", size = 1) +
    scale_colour_manual(values = palette4,
                     labels=qBr(test1,"SalePrice"),
                     name="Quintile\nBreaks") +
    labs(title="Test set sale prices") +
    mapTheme(),
  
  ggplot() +
    geom_sf(data = baseMap, fill = "grey40") +
    geom_sf(data = test1, aes(colour = q5(SalePrice.AbsError)), show.legend = "point", size = 1) +
    scale_colour_manual(values = palette4,
                     labels=qBr(test1,"SalePrice.AbsError"),
                     name="Quintile\nBreaks") +
    labs(title="Sale price errors on the test set") +
    mapTheme(),
  ncol = 2)

3.2.6 The Moran’s I test

coords.test <- 
  filter(test1, !is.na(SalePrice.Error)) %>% 
  st_coordinates() 

neighborList.test <- knn2nb(knearneigh(coords.test, 5))

spatialWeights.test <- nb2listw(neighborList.test, style="W")

test1 %>% 
  filter(!is.na(SalePrice.Error)) %>% 
  mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.Error)) %>%
  ggplot(aes(lagPriceError, SalePrice.Error)) +
    geom_point(colour = "#FA7800") +
    geom_smooth(method = "lm", se = FALSE, colour = "#25CB10") +
    labs(title = "Error as a function of the spatial lag of price",
         subtitle = "As a measure of spatial autocorrelation; lag of 5 nearest neighbors\nRegression line in green",
         caption = "Public Policy Analytics, Figure 6.6",
         x = "Spatial lag of errors (Mean error of 5 nearest neighbors)",
         y = "Sale Price") +
    plotTheme()

moranTest <- moran.mc(filter(test1, !is.na(SalePrice.Error))$SalePrice.Error, 
                      spatialWeights.test , nsim = 999)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in red",
       x="Moran's I",
       y="Count",
       caption="Public Policy Analytics, Figure 6.8") +
  plotTheme()

3.2.7 predicted values for the entire dataset

3.2.8 The map of mean absolute percentage error (MAPE) by neighborhood

inTrain <- createDataPartition(
             y=paste(student4$PropClassC,student4$SaleYr),
              p =.8, list = FALSE)
sf.training <- student4[inTrain,] 
sf.test <- student4[-inTrain,]  

reg2 <- lm(lnSalePrice ~ .,data =sf.training %>%
            st_set_geometry(NULL) %>%
            dplyr::select(lnSalePrice,LotArea,SaleYr,lnPropArea,PropClassC, hospital.Buffer, crime.Buffer,lnRooms,BuiltYr,Stories,Baths,lnBeds,`Graduate/Professional Degree`,`Average Household Size`,`Managerial Professional`,`Elementary school access`,`Fair Market Rent`,num_business,`No Bedroom`,`Foreign Born`,`Employment Rate`,college,Asian,`20 Units or more`, `Households with Children, % of Total`,`Owner occupied`,`Black/African American`,`Vacant Units`))

sf.test <-
  sf.test %>%
  mutate(Regression = "Baseline Regression",
    SalePrice.Predict = exp(predict(reg2, sf.test)),
         SalePrice.Error = SalePrice - SalePrice.Predict,
         SalePrice.AbsError = abs(SalePrice - SalePrice.Predict),
         SalePrice.APE = (abs(SalePrice - SalePrice.Predict)) / SalePrice,
         SalePrice = exp(lnSalePrice))%>%
  filter(SalePrice < 5000000)

left_join(
  st_set_geometry(sf.test, NULL) %>%
    group_by(name) %>%
    summarize(meanPrice = mean(SalePrice, na.rm = T)),

  mutate(sf.test, predict.fe = predict(lm(SalePrice ~ name, data = sf.test), 
                                           sf.test)) %>%
  st_set_geometry(NULL) %>%
  group_by(name) %>%
  summarize(meanPrediction = mean(predict.fe))) %>%
  kable() %>% kable_styling()

reg3 <- lm(lnSalePrice ~ .,data =sf.training %>%
            st_set_geometry(NULL) %>%
            dplyr::select(name,lnSalePrice,LotArea,SaleYr,lnPropArea,PropClassC, hospital.Buffer, crime.Buffer,lnRooms,BuiltYr,Stories,Baths,lnBeds,`Graduate/Professional Degree`,`Average Household Size`,`Managerial Professional`,`Elementary school access`,`Fair Market Rent`,num_business,`No Bedroom`,`Foreign Born`,`Employment Rate`,college,Asian,`20 Units or more`, `Households with Children, % of Total`,`Owner occupied`,`Black/African American`,`Vacant Units`))

sf.test.nhood <-
  sf.test %>%
  mutate(Regression = "Neighborhood Effects",
         SalePrice.Predict = exp(predict(reg3, sf.test)),
         SalePrice.Error = SalePrice - SalePrice.Predict,
         SalePrice.AbsError = abs(SalePrice - SalePrice.Predict),
         SalePrice.APE = (abs(SalePrice - SalePrice.Predict)) / SalePrice,
         SalePrice = exp(lnSalePrice))%>%
  filter(SalePrice < 5000000)

bothRegressions <- 
  rbind(
    dplyr::select(sf.test, c(starts_with("SalePrice."), 
                               Regression, SalePrice, name)) %>% 
      na.omit() %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.Error)),
    dplyr::select(sf.test.nhood, c(starts_with("SalePrice."), 
                                     Regression, SalePrice, name)) %>% 
      na.omit() %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.Error)))

bothRegressions %>%
  gather(Variable, Value, -Regression, -name, -geometry) %>%
  filter(Variable != "SalePrice" & Variable != "SalePrice.Predict") %>%
  group_by(Regression, Variable) %>%
    summarize(meanValue = mean(Value, na.rm = T)) %>%
  st_set_geometry(NULL) %>%
  spread(Variable, meanValue) %>%
  kable() %>%
    kable_styling("striped", full_width = F) %>%
    row_spec(1, color = "black", background = "#25CB10") %>%
    row_spec(2, color = "black", background = "#FA7800")

bothRegressions %>%
  ggplot() +
    geom_sf(data = student2, fill = "grey40") +
    geom_sf(aes(colour = q5(SalePrice.AbsError)), show.legend = "point", size = 1) +
    facet_wrap(~Regression) +
    scale_colour_manual(values = palette4,
                       labels=qBr(bothRegressions,"SalePrice.AbsError"),
                       name="Quintile\nBreaks") +
    labs(title="Test set errors") +
    mapTheme()

bothRegressions.nhoods <-
  bothRegressions %>%
    group_by(Regression, name) %>%
    summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T)) %>%
    st_set_geometry(NULL) %>%
    left_join(sf_neighborhood2) %>%
    st_sf()

 ggplot() + 
  geom_sf(data = bothRegressions.nhoods, aes(fill = q5(mean.MAPE))) +
  geom_sf(data = bothRegressions, colour = "black", size = .5) +
  facet_wrap(~Regression) +
  scale_fill_manual(values = palette4,
                    labels = qBr(bothRegressions.nhoods, "mean.MAPE", rnd = F),
                    name="Quintile\nBreaks") +
  labs(title = "Mean test set MAPE by neighborhood") +
  mapTheme()

3.2.9 The scatterplot plot of MAPE by neighborhood

bothRegressions.nhoods2 <-
  bothRegressions %>%
    group_by(Regression, name) %>%
    summarize(mean.price = mean(SalePrice, na.rm = T)) %>%
    st_set_geometry(NULL) %>%
    left_join(bothRegressions.nhoods) %>%
    st_sf()


bothRegressions.nhoods2 %>%
  ggplot(aes( mean.price, mean.MAPE)) +
    geom_point(colour = "#FA7800") +
    geom_smooth(method = "lm", se = FALSE, colour = "#25CB10") +
    facet_wrap(~Regression) +
    labs(title = "MAPE as a Function of Average Sales Price by neighborhood",
         caption = "Public Policy Analytics, Figure 6.6",
         x = "Mean Price by neighbourhood",
         y = "Mean MAPE by neighborhood") +
    plotTheme()

3.2.10 Race and Income group of the city

census_api_key("d82854b2521c584efb039f747f0343702a6211eb",overwrite = TRUE,install = TRUE)
readRenviron("~/.Renviron")

tracts17 <- 
get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E","B06011_001"), 
        year = 2017, state=06, county=075, geometry=T) %>%
st_transform(2227)  %>%
dplyr::select(variable, estimate, GEOID) %>%
spread(variable, estimate) %>%
rename(TotalPop = B01001_001,
       NumberWhites = B01001A_001,
       Median_Income = B06011_001) %>%
mutate(percentWhite = NumberWhites / TotalPop,
       raceContext = ifelse(percentWhite > .5, "Majority White", "Majority Non-White"),
       incomeContext = ifelse(Median_Income > 32322, "High Income", "Low Income"))

grid.arrange(
  ggplot() + 
    geom_sf(data = na.omit(tracts17), aes(fill = raceContext)) +
    scale_fill_manual(values = c("#25CB10", "#FA7800"),
                      name="Race Context") +
    labs(title = "Race Context") +
    mapTheme() + theme(legend.position="bottom"), 
  
  ggplot() + 
    geom_sf(data = na.omit(tracts17), aes(fill = incomeContext)) +
    scale_fill_manual(values = c("#25CB10", "#FA7800"),
                      name="Income Context") +
    labs(title = "Income Context") +
    mapTheme() + theme(legend.position="bottom"), ncol = 2)

st_join(bothRegressions, tracts17) %>% 
  group_by(Regression, raceContext) %>%
  summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T)) %>%
  st_set_geometry(NULL) %>%
  spread(raceContext, mean.MAPE) %>%
  kable(caption = "Mean Absolute Error of test set sales by neighborhood racial context") %>%
    kable_styling("striped", full_width = F) %>%
    row_spec(1, color = "black", background = "#25CB10") %>%
    row_spec(2, color = "black", background = "#FA7800") 
Mean Absolute Error of test set sales by neighborhood racial context
Regression Majority Non-White Majority White
Baseline Regression 0.1910693 0.2080595
Neighborhood Effects 0.1708258 0.1921813
st_join(bothRegressions, tracts17) %>% 
  filter(!is.na(incomeContext)) %>%
  group_by(Regression, incomeContext) %>%
  summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T)) %>%
  st_set_geometry(NULL) %>%
  spread(incomeContext, mean.MAPE) %>%
  kable(caption = "Mean Absolute Error of test set sales by neighborhood income context") %>%
    kable_styling("striped", full_width = F) %>%
    row_spec(1, color = "black", background = "#25CB10") %>%
    row_spec(2, color = "black", background = "#FA7800") 
Mean Absolute Error of test set sales by neighborhood income context
Regression High Income Low Income
Baseline Regression 0.1952467 0.2076065
Neighborhood Effects 0.1794249 0.1825406

Our model can be promoted to some extent. It can assess low-income and high-income, home sales prices for most non-white communities and white communities with little error. Our MAPE fluctuation is low when measuring different communities, which is good for promoting our model to San Francisco.

4.1 Discussion

Our model of house price predictionis effective in terms of results from cross validation. The more interesting variables are built year of the houses, count of hospitals in a 2 mile buffer, distance to school of each house and count of transit stops of public transportation in a 2 mile buffer. Our model is capable of explaining 78.7% of the variation in sale prices in San Francisco. The most important predictors are lot area of the house, sale year of the house, property of the house, average household size in the neighborhood, elementary school access in the neighborhood, percentage of Asian in the neighborhoos and percentage of Black/African American in the neighborhood. In the cross validation, our model had a Mean Absolute Error of 292377.6 and a MAPE of 471227.5, which is acceptable with OLS regression and limited features. According to the map, there is spatial variation in house prices across neighborhoods and this could be attributed to factors such as some neighborhoods are richer than other neighborhoods and some neighborhoods has more white people than other neighborhoods. Our model is predicting better in downtown San Francisco and predicting not very well in uptown San Francisco. This might be because house prices in uptown San Francisco are generally higher and have more variability. We need to look for more features in future to further account for the prediction errors.

5.1 Conclusion

I would recommend this model to Zillow even though we understand Zillow may need a model with higher predictive power. However, we have a very good dataset with features of neighborhood which could be used in other algothrithms as well. If we were able to improve the model, we would also include a spatial lag factor to account for the spatial pattern in sale prices.