The purpose of our project is to develop a hedonic price model to predict home prices in San Francisco. This project focuses on San Francisco's housing prices because our client, Zillow, needs a more accurate model to predict house prices. Our group developed a regression model that accounts for a wide variety of relevant variables. The difficulty of this project is that the fluctuation of house prices is a complex phenomenon driven by the interrelationship of many different factors, so it is hard to find the best model. In addition, prediction models that are accurate in other regions may not work in San Francisco, which required our group to explore variables unique to San Francisco that significantly affect housing prices. The overall strategy of this project is to select relevant variables describing the economic, social, transportation, safety, housing, and environmental characteristics around each house, and to engineer spatial features that help explain the variation in house prices.
The model that we have developed explains roughly 80% of the variation in home prices in San Francisco (R2 of 0.787), within an 18% range of accuracy for each price that we predict (MAPE of 0.18). Our model shows evenly distributed residuals, both numerically and geographically, meaning that it is generalizable: it is equally able to predict sale prices in one neighborhood versus another. All of these factors lead us to believe that our model will be useful for Zillow.
The following packages are used in the project. The mapTheme and plotTheme functions standardize the cartography of all maps, the palettes set the map colors, and the qBr and q5 functions convert a column into quintiles for mapping.
library(corrplot)
library(viridis)
library(stargazer)
library(tigris)
library(tidyverse)
library(sf)
library(spdep)
library(data.table)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
library(png)
library(broom)
mapTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),
axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2)
)
}
plotTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle = element_text(face="italic"),
plot.caption = element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),
panel.grid.major = element_line("grey80", size = 0.1),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
strip.background = element_rect(fill = "grey80", color = "white"),
strip.text = element_text(size=12),
axis.title = element_text(size=12),
axis.text = element_text(size=10),
plot.background = element_blank(),
legend.background = element_blank(),
legend.title = element_text(colour = "black", face = "italic"),
legend.text = element_text(colour = "black", face = "italic"),
strip.text.x = element_text(size = 14)
)
}
palette5 <- c("#0000FF","#00FF00","#FF0000","#000000","#FFFF00")
palette1 <- c("#999999", "#E69F00", "#56B4E9")
qBr <- function(df, variable, rnd) {
if (missing(rnd)) {
as.character(quantile(round(df[[variable]],0),
c(.01,.2,.4,.6,.8), na.rm=T))
} else if (rnd == FALSE | rnd == F) {
as.character(formatC(quantile(df[[variable]],
c(.01,.2,.4,.6,.8), na.rm=T), digits = 3))
}
}
q5 <- function(variable) {as.factor(ntile(variable, 5))}
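As a quick sanity check of q5 (the input values here are illustrative):
# q5 bins a numeric vector into quintiles 1 through 5
q5(c(1, 10, 100, 1000, 10000))
## [1] 1 2 3 4 5
## Levels: 1 2 3 4 5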
First, the base map of San Francisco and the base data set are loaded.
baseMap <- st_read("https://data.sfgov.org/api/geospatial/p5b7-5n3h?method=export&format=GeoJSON") %>%
st_transform(2227)
student <-
st_read("/Users/Anyang/Desktop/midtermData_studentVersion/midterm_data_sf_studentVersion.shp") %>%
st_transform(st_crs(baseMap))
We load two neighborhood data sets: the first has 37 districts and the second has 117 districts in San Francisco. These data sets are used when joining the independent and dependent variables. We use two classifications of neighborhoods because some of the independent variables joined later (such as fog) require the more detailed classification.
sf_neighborhood <-
st_read("https://data.sfgov.org/api/geospatial/iacs-ws63?method=export&format=GeoJSON")%>%
st_transform(2227)
sf_neighborhood2 <-
st_read("https://data.sfgov.org/api/geospatial/pty2-tcw4?method=export&format=GeoJSON")%>%
st_transform(2227)
Then we use st_join to spatially join the student data to sf_neighborhood2 and sf_neighborhood.
student1 = st_join(student,sf_neighborhood2[,-1],join= st_intersects )
student2 = st_join(student1,sf_neighborhood,join = st_intersects)
We use data collected from https://sfplanning.org/resource/san-francisco-neighborhoods-socio-economic-profiles-2010-2014 and clean it into CSV files. The original data is classified by neighborhood, and our group merges it with the base data set.
## fill in the names of neighborhoods that are missing
which(is.na(student2$neighborho)==TRUE)
student2[210,]$name <- "Bayview"
student2[7207,]$name <- "Crocker Amazon"
student2[7207,]$neighborho <- "Crocker Amazon"
student2[7466,]$name <- "Oceanview"
student2[7466,]$neighborho <- "Ocean View"
student2[7691,]$name <- "Crocker Amazon"
student2[7691,]$neighborho <- "Crocker Amazon"
student2[7798,]$name <- "Crocker Amazon"
student2[7798,]$neighborho <- "Crocker Amazon"
student2[8762,]$name <- "Crocker Amazon"
student2[8762,]$neighborho <- "Crocker Amazon"
# replace implausible outlier values with the median of the home's neighborhood
which(student2$Baths>15)
student2[6024,]$Baths <- median(student2[which(student2$name=="Crocker Amazon"),]$Baths)
student2[7479,]$Baths <- median(student2[which(student2$name=="Oceanview"),]$Baths)
which(student2$Beds>15)
student2[380,]$Beds <- median(student2[which(student2$name=="Pacific Heights"),]$Beds)
which(student2$Rooms==0)
which(student2$Rooms>30)
student2[6177,]$Rooms <- median(student2[which(student2$name=="Mission"),]$Rooms)
which(student2$Stories>60)
student2[1097,]$Stories <- median(student2[which(student2$name=="Western Addition"),]$Stories)
student2[1120,]$Stories <- median(student2[which(student2$name=="Buena Vista"),]$Stories)
student2[8574,]$Stories <- median(student2[which(student2$name=="Ingleside"),]$Stories)
# drop homes recorded with zero property area
student2 = student2[-which(student2$PropArea==0),]
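The record-by-record fixes above are brittle if the data changes. A more general sketch of the same rule, using a hypothetical helper that is not part of our final script, would replace every flagged value with the median among non-flagged homes in the same neighborhood:
# Hypothetical helper: replace values flagged by `bad` with the median of
# that column among non-flagged homes in the observation's neighborhood
impute_by_nhood <- function(df, col, bad) {
flagged <- which(bad(df[[col]]))
for (i in flagged) {
ok <- df$name == df$name[i] & !bad(df[[col]])
df[[col]][i] <- median(df[[col]][ok], na.rm = TRUE)
}
df
}
# e.g. student2 <- impute_by_nhood(student2, "Baths", function(x) x > 15)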
demo1 <- read_csv("C:/Users/Anyang/Desktop/demo2.csv")
demo2 <- demo1 %>%
select(`Graduate/Professional Degree`,`Average Household Size`,`Managerial Professional`,`Fair Market Rent`,`No Bedroom`,`Foreign Born`,`Employment Rate`,`Some College/Associate Degree` ,`College Degree`,Asian,`20 Units or more`, `Households with Children, % of Total`,`Owner occupied`,`Black/African American`,`Vacant Units`,nhood,`Elementary school access`)
student2 = merge(student2, demo2, by.x="neighborho",by.y="nhood")
Next, we load the fog data and merge it by neighborhood.
fog <- read_csv("C:/Users/Anyang/Desktop/fog.csv")
student2 = merge(student2, fog, by.x="name",by.y="nhood")
We also load the business data from Open Data SF and merge it by neighborhood.
business <- read_csv("C:/Users/Anyang/Desktop/businessn.csv")
student2 = merge(student2, business, by.x="neighborho",by.y="nhood")
Our team counted spatial features around each home by building buffers, selecting features that we believe have a significant impact on house prices.
First, we count the number of public transit stops within a 10,000-foot buffer of each home.
transittop <-
st_read('https://data.sfgov.org/resource/i28k-bkz6.geojson')%>%
st_transform(2227)
transittop$latitude <- as.numeric(as.character(transittop$latitude))
transittop$longitude <- as.numeric(as.character(transittop$longitude))
transittop.sf <-
transittop %>%
st_as_sf(coords=c('latitude','longitude'),crs=4326,agr='constant')%>%
st_transform(2227)
transittopfilter <-
transittop %>%
filter(latitude >37.7,longitude < -122)
transittopfilter_sf <-
transittopfilter %>%
dplyr::select(latitude,longitude) %>%
na.omit() %>%
st_as_sf(coords=c('latitude','longitude'),crs=4326,agr='constant')%>%
st_transform(2227)%>%
distinct()
student2$transittop.Buffer =
student2 %>%
st_buffer(10000) %>%
aggregate(mutate(transittopfilter_sf , transittop.Buffer = 1),., sum) %>%
pull(transittop.Buffer)
Next, we count the number of reported crimes within a 10,000-foot buffer of each home.
crime <-
st_read('https://data.sfgov.org/resource/wg3w-h783.geojson')%>%
st_transform(2227)
crime$latitude <- as.numeric(as.character(crime$latitude))
crime$longitude <- as.numeric(as.character(crime$longitude))
crime.sf <-
crime %>%
st_as_sf(coords=c('latitude','longitude'),crs=4326,agr='constant')%>%
st_transform(2227)
crimefilter <-
crime %>%
filter(latitude >37.7,longitude < -122)
crimefilter_sf <-
crimefilter %>%
dplyr::select(latitude,longitude) %>%
na.omit() %>%
st_as_sf(coords=c('latitude','longitude'),crs=4326,agr='constant')%>%
st_transform(2227)%>%
distinct()
student2$crime.Buffer =
student2 %>%
st_buffer(10000) %>%
aggregate(mutate(crimefilter_sf , crime.Buffer = 1),., sum) %>%
pull(crime.Buffer)
We also count the number of hospitals within a 5,000-foot buffer of each home.
hospital <-
st_read('C:/Users/Anyang/Desktop/Hospitals/Hospitals_BayArea.shp')%>%
st_transform(2227)
hospital_sf <-
hospital %>%
dplyr::select(geometry) %>%
na.omit() %>%
st_transform(2227)%>%
distinct()
student2$hospital.Buffer =
student2 %>%
st_buffer(5000) %>%
aggregate(mutate(hospital_sf , hospital.Buffer = 1),., sum) %>%
pull(hospital.Buffer)
We also create a measure of the average distance from each home sale observation to its 5 nearest schools, using a nearest-neighbor helper function.
nn_function <- function(measureFrom,measureTo,k) {
measureFrom_Matrix <-
as.matrix(measureFrom)
measureTo_Matrix <-
as.matrix(measureTo)
nn <-
get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist
output <-
as.data.frame(nn) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_distance, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
group_by(thisPoint) %>%
summarize(pointDistance = mean(point_distance)) %>%
arrange(as.numeric(thisPoint)) %>%
dplyr::select(-thisPoint) %>%
pull()
return(output)
}
schools <-
st_read('https://data.sfgov.org/resource/tpp3-epx2.geojson')%>%
st_transform(2227)
student2$school_dis =
nn_function(st_coordinates(student2), st_coordinates(schools), 5)
Finally, we compute spatial lags: for each home, the mean among its 5 nearest neighbors of the sale price, the crime buffer count, the transit stop buffer count, and the business count.
coords <- st_coordinates(student2)
neighborList <- knn2nb(knearneigh(coords, 5))
spatialWeights <- nb2listw(neighborList, style="W")
student2$lagPrice <- lag.listw(spatialWeights, student2$SalePrice)
student2$lagcrime <- lag.listw(spatialWeights, student2$crime.Buffer)
student2$lagtransittop.Buffer <- lag.listw(spatialWeights, student2$transittop.Buffer)
student2$lagbiz <- lag.listw(spatialWeights, student2$num_business)
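To make the spatial lag concrete, here is a toy sketch with synthetic points and values (all names are ours): with row-standardized "W" weights, lag.listw returns, for each point, the mean of its neighbors' values.
# Toy illustration: each element of the result is the mean of toy_v
# over that point's 3 nearest neighbors
set.seed(1)
toy_xy <- cbind(runif(10), runif(10))
toy_nb <- knn2nb(knearneigh(toy_xy, 3))
toy_w <- nb2listw(toy_nb, style = "W")
toy_v <- rnorm(10)
lag.listw(toy_w, toy_v)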
Our team selected and compiled a number of variables that we believe can predict the price of a home; the resulting data set is what we use for model building. We divide the variables into the following categories (representative examples):
- Average Household Size
- business
- neighborhood
- Black/African American
We reclassify built year into four quartile-based categories.
student2$BuiltYear=as.numeric(student2$BuiltYear)
student2$SaleYr=as.factor(student2$SaleYr)
quantile(student2$BuiltYear)
# Recode BuiltYear into four classes using the quartile breaks printed above
student2$BuiltYr <- cut(student2$BuiltYear, breaks = c(0, 36, 52, 71, 139), labels = c(1, 2, 3, 4))
Cleaning the Data
# create a new column, college, combining two educational attainment shares
student2$college<- student2$`Some College/Associate Degree`+student2$`College Degree`
# log-transform these skewed features
student2$lnBeds = log(student2$Beds+1)
student2$lnSalePrice = log(student2$SalePrice)
student2$lnPropArea = log(student2$PropArea)
student2$lnRooms = log(student2$Rooms+1)
The following table of summary statistics is produced with the stargazer package.
stargazer(as.data.frame(student2), type="text", title = "Summary Statistics")
##
## Summary Statistics
## =============================================================================================================================
## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
## -----------------------------------------------------------------------------------------------------------------------------
## X 10,078 5,998,962.000 8,956.204 5,980,569.000 5,992,443.000 6,005,293.000 6,020,464.000
## Y 10,078 2,100,286.000 8,463.156 2,086,052.000 2,094,039.000 2,105,302.000 2,121,818.000
## long 10,078 37.747 0.023 37.708 37.730 37.761 37.806
## lat 10,078 -122.446 0.031 -122.511 -122.469 -122.425 -122.371
## SalePrice 10,078 1,064,620.000 735,206.300 0 642,602.5 1,327,752.0 4,750,003
## LotArea 10,078 247,080.900 136,698.800 0 187,500 300,000 1,890,500
## PropArea 10,078 1,647.934 774.378 187 1,159 1,978 24,308
## BuiltYear 10,078 56.825 26.155 1 36 71 139
## Stories 10,078 1.346 0.676 0 1 2 10
## Units 10,078 1.000 0.000 1 1 1 1
## Rooms 10,078 6.217 2.125 0 5 7 26
## Beds 10,078 1.702 1.700 0 0 3 10
## Baths 10,078 1.801 0.961 0 1 2 8
## holdOut 10,078 0.069 0.254 0 0 0 1
## Graduate/Professional Degree 10,078 0.191 0.104 0.030 0.090 0.280 0.420
## Average Household Size 10,078 2.724 0.620 1.600 2.200 3.400 3.600
## Managerial Professional 10,078 0.274 0.123 0.032 0.159 0.374 0.532
## Fair Market Rent 10,078 2,550.188 542.181 1,800 2,010 3,000 4,330
## No Bedroom 10,078 0.048 0.050 0.010 0.020 0.060 0.630
## Foreign Born 10,078 0.360 0.118 0.140 0.280 0.430 0.700
## Employment Rate 10,078 0.550 0.076 0.419 0.508 0.583 0.739
## Some College/Associate Degree 10,078 0.223 0.044 0.100 0.200 0.250 0.300
## College Degree 10,078 0.300 0.086 0.160 0.210 0.350 0.520
## Asian 10,078 0.350 0.171 0.080 0.160 0.510 0.800
## 20 Units or more 10,078 0.090 0.144 0.000 0.020 0.080 0.940
## Households with Children, % of Total 10,078 0.252 0.092 0.060 0.180 0.300 0.400
## Owner occupied 10,078 0.514 0.173 0.030 0.400 0.620 0.790
## Black/African American 10,078 0.058 0.076 0.000 0.010 0.060 0.310
## Vacant Units 10,078 0.061 0.021 0.020 0.050 0.070 0.220
## Elementary school access 10,078 33.663 16.566 6 21 52 64
## Avg_hr_fog 10,078 9.439 1.355 6.000 8.670 10.320 12.000
## num_business 10,078 2,723.511 1,498.333 166 1,796 3,284 15,618
## transittop.Buffer 10,078 209.964 68.728 49 151 266 397
## crime.Buffer 10,078 101.189 63.841 15 54 132 341
## hospital.Buffer 10,078 26.448 19.542 4 14 32 155
## school_dis 10,078 1,578.461 481.432 286.606 1,226.702 1,901.334 4,208.625
## lagPrice 10,078 1,067,666.000 540,723.400 100,000 674,026.7 1,316,001.0 4,283,002
## lagcrime 10,078 101.179 63.743 15 54 131.4 332
## lagtransittop.Buffer 10,078 209.970 68.587 51 151.4 266.2 385
## lagbiz 10,078 2,718.748 1,455.416 166 1,857.7 3,284 8,760
## college 10,078 0.523 0.057 0.270 0.480 0.560 0.640
## lnBeds 10,078 0.763 0.704 0.000 0.000 1.386 2.398
## lnSalePrice 10,078 -Inf.000 -Inf.000 13.373 14.099 15.374
## lnPropArea 10,078 7.323 0.405 5.231 7.055 7.590 10.099
## lnRooms 10,078 1.921 0.376 0.000 1.792 2.079 3.296
## -----------------------------------------------------------------------------------------------------------------------------
The correlation matrix of selected variables helps us examine the pairwise correlations between variables.
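The chunk that produced the matrix is not shown here; below is a minimal sketch, assuming an illustrative subset of the numeric variables rather than our exact selection.
# Correlation matrix of a representative subset of numeric variables
corvars <- as.data.frame(student2) %>%
dplyr::select(SalePrice, PropArea, Rooms, Beds, Baths,
crime.Buffer, transittop.Buffer, school_dis)
corrplot(cor(corvars, use = "pairwise.complete.obs"), method = "color")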
We select four interesting variables and plot scatterplots of home price against each: the built year of the house, the count of hospitals within its buffer, the distance from each house to its nearest schools, and the count of public transit stops within its buffer. The results suggest that sale price decreases slightly as built year and the hospital count increase, but these effects are not significant. Sale price decreases significantly as school distance increases, and increases markedly with the transit stop count. It is likely that school distance and transit stops are more strongly correlated with price than the other two features.
as.data.frame(student2) %>%
dplyr::select(SalePrice, transittop.Buffer, school_dis, hospital.Buffer,BuiltYear) %>%
filter(SalePrice <= 1000000 & SalePrice > 0,BuiltYear > 10) %>%
gather(Variable, Value, -SalePrice) %>%
ggplot(aes(Value, SalePrice)) +
geom_point() +
geom_smooth(method = "lm", se=F, colour = "#FA7800") +
facet_wrap(~Variable, ncol = 2, scales = "free") +
labs(title = "Price as a function of continuous variables") +
theme(axis.text.x = element_blank()) +
plotTheme()
The dependent variable, SalePrice, can now be plotted on a map. Below is the distribution map of sale prices. It shows that high-value houses are located in the center and the north of San Francisco, houses in the west and south-west have lower values, and the south-east appears to have the lowest values.
ggplot() +
geom_sf(data=st_union(baseMap),fill='grey80') +
geom_sf(data = student, aes(color = q5(SalePrice)), show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(student,"SalePrice"),
name="Quintile\nBreaks") +
labs(title="Sales Price in San Francisco") +
mapTheme()
The map of crime density
From the density map, it is notable that the north-east area has the highest concentration of crime. Relating this to the house price distribution, there are fewer sales observations in that area.
ggplot() +
geom_sf(data = baseMap, fill = "grey40") +
stat_density2d(data = data.frame(st_coordinates(crime)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 40, geom = 'polygon') +
scale_fill_gradient(low = "#25CB10", high = "#FA7800", name = "Density") +
scale_alpha(range = c(0.00, 0.50), guide = FALSE) +
labs(title = "Density of Crime, San Francisco") +
mapTheme()
The next map overlays transit stops on the distribution of sale prices.
palette4 <- c("#FA7800","#C48C04","#8FA108","#5AB60C","#25CB10")
ggplot() +
geom_sf(data = baseMap, fill = "grey40") +
geom_sf(data = student2, aes(colour = q5(SalePrice)), show.legend = "point", size = .75) +
geom_sf(data = transittop.sf, colour = "black", size = .75) +
scale_colour_manual(values = palette4,
labels=qBr(student2,"SalePrice"),
name="Quintile\nBreaks") +
labs(title = "Transit stops and home sale prices, San Francisco", subtitle = "transit stop in black") +
mapTheme()
The density map of hospitals
ggplot() +
geom_sf(data = baseMap, fill = "grey40") +
stat_density2d(data = data.frame(st_coordinates(hospital)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 40, geom = 'polygon') +
scale_fill_gradient(low = "#25CB10", high = "#FA7800", name = "Density") +
scale_alpha(range = c(0.00, 0.50), guide = FALSE) +
labs(title = "Density of Hospitals, San Francisco") +
mapTheme()
Our group used an ordinary least squares (OLS) linear regression to predict housing prices. We used a wide range of relevant independent variables (see 2.3.1 for a complete list). After gathering all the independent variables, we divided the data set into three groups: a prediction (hold-out) group, a training group, and a test group. We regressed sale price against those variables on the training group, then used the fitted model to predict sale prices for the test group. To evaluate the accuracy of our model, we calculated the mean absolute percent error (MAPE). To improve the model, we re-ran the regression with different sets of variables and recalculated the MAPE, iterating by trial and error until we reached the model with the lowest MAPE. Once we found our best model, we refit it on the training and test groups together and used it to predict prices for the prediction group.
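For reference, the metric described above can be written as a one-line helper (the function name is ours, not part of the original script):
# Hypothetical helper: mean absolute percent error, excluding zero or
# missing observed prices
mape <- function(observed, predicted) {
ok <- !is.na(observed) & !is.na(predicted) & observed > 0
mean(abs(observed[ok] - predicted[ok]) / observed[ok])
}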
student4 <- student2[-which(student2$holdOut==1),]
student4.use <- student2[which(student2$holdOut==1),]
inTrain <- createDataPartition(
y=paste(student4$PropClassC,student4$SaleYr),
p =.8, list = FALSE)
training <- student4[inTrain,]
test <- student4[-inTrain,]
mean(training$SalePrice)
mean(test$SalePrice)
training$Index1 = training$lnPropArea * training$`Fair Market Rent`
test$Index1 = test$lnPropArea * test$`Fair Market Rent`
training$Index2 = training$lnPropArea * training$`Households with Children, % of Total`
test$Index2 = test$lnPropArea * test$`Households with Children, % of Total`
training$Index3 = training$lnPropArea * training$transittop.Buffer
test$Index3 = test$lnPropArea * test$transittop.Buffer
training$Index4 = training$`Fair Market Rent` * training$transittop.Buffer
test$Index4 = test$`Fair Market Rent` * test$transittop.Buffer
training$Index5 = training$lnPropArea * training$`Graduate/Professional Degree`
test$Index5 = test$lnPropArea * test$`Graduate/Professional Degree`
reg <- lm(lnSalePrice ~ .,data =training %>%
st_set_geometry(NULL) %>%
dplyr::select(lnSalePrice,LotArea,SaleYr,lnPropArea,PropClassC, hospital.Buffer, crime.Buffer,lnRooms,BuiltYr,Stories,Baths,lnBeds,`Graduate/Professional Degree`,`Average Household Size`,`Managerial Professional`,`Elementary school access`,`Fair Market Rent`,num_business,`No Bedroom`,`Foreign Born`,`Employment Rate`,college,Asian,`20 Units or more`, `Households with Children, % of Total`,`Owner occupied`,`Black/African American`,`Vacant Units`,Index1,Index2,Index3,Index4,Index5))
summary(reg)
kable(tidy(reg))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 11.6142249 | 0.5790895 | 20.0560092 | 0.0000000 |
LotArea | 0.0000003 | 0.0000000 | 11.9374140 | 0.0000000 |
SaleYr13 | 0.1908435 | 0.0079468 | 24.0152361 | 0.0000000 |
SaleYr14 | 0.3315673 | 0.0081944 | 40.4627639 | 0.0000000 |
SaleYr15 | 0.4677250 | 0.0083283 | 56.1611655 | 0.0000000 |
lnPropArea | 0.2709918 | 0.0742470 | 3.6498676 | 0.0002642 |
PropClassCD | 0.6289958 | 0.1770816 | 3.5520107 | 0.0003847 |
PropClassCDA | 0.6074296 | 0.1901391 | 3.1946586 | 0.0014058 |
PropClassCF | 0.5267360 | 0.1820146 | 2.8939220 | 0.0038156 |
PropClassCLZ | 0.3167473 | 0.1873998 | 1.6902224 | 0.0910271 |
PropClassCOZ | -0.2377586 | 0.3099122 | -0.7671806 | 0.4429984 |
PropClassCTH | 0.4514064 | 0.1903710 | 2.3711933 | 0.0177560 |
PropClassCTIC | -0.3913447 | 0.1815146 | -2.1559955 | 0.0311158 |
PropClassCZ | 0.4766883 | 0.1774002 | 2.6870792 | 0.0072239 |
PropClassCZBM | -0.2015081 | 0.1959883 | -1.0281637 | 0.3039061 |
PropClassCZEU | 2.3485593 | 0.2527058 | 9.2936507 | 0.0000000 |
hospital.Buffer | -0.0013778 | 0.0004249 | -3.2428949 | 0.0011884 |
crime.Buffer | 0.0012018 | 0.0002162 | 5.5579919 | 0.0000000 |
lnRooms | -0.0155356 | 0.0107287 | -1.4480478 | 0.1476456 |
BuiltYr2 | 0.0041174 | 0.0090805 | 0.4534366 | 0.6502476 |
BuiltYr3 | 0.0084638 | 0.0094917 | 0.8917003 | 0.3725823 |
BuiltYr4 | -0.0156838 | 0.0093130 | -1.6840870 | 0.0922066 |
Stories | 0.0272989 | 0.0051665 | 5.2838003 | 0.0000001 |
Baths | 0.0099762 | 0.0047336 | 2.1075164 | 0.0351061 |
lnBeds | -0.0017185 | 0.0050671 | -0.3391453 | 0.7345098 |
Graduate/Professional Degree | -2.6173437 | 0.7085223 | -3.6940880 | 0.0002223 |
Average Household Size | -0.2765524 | 0.0261216 | -10.5871142 | 0.0000000 |
Managerial Professional | -0.7304696 | 0.1499385 | -4.8717935 | 0.0000011 |
Elementary school access | -0.0101962 | 0.0005472 | -18.6331141 | 0.0000000 |
Fair Market Rent | -0.0004357 | 0.0001569 | -2.7773310 | 0.0054944 |
num_business | 0.0000102 | 0.0000033 | 3.1394240 | 0.0016994 |
No Bedroom | -0.8380124 | 0.1841074 | -4.5517584 | 0.0000054 |
Foreign Born | -1.0838755 | 0.1660528 | -6.5272946 | 0.0000000 |
Employment Rate | -0.5517922 | 0.2011114 | -2.7437146 | 0.0060893 |
college | 0.6701368 | 0.1468570 | 4.5631939 | 0.0000051 |
Asian | 0.8095830 | 0.0837671 | 9.6646920 | 0.0000000 |
20 Units or more | -0.2840930 | 0.0801059 | -3.5464674 | 0.0003928 |
Households with Children, % of Total | 3.8110599 | 0.8807094 | 4.3272616 | 0.0000153 |
Owner occupied | 0.1693753 | 0.0503530 | 3.3637604 | 0.0007727 |
Black/African American | -1.9492244 | 0.0988415 | -19.7207169 | 0.0000000 |
Vacant Units | 1.5176525 | 0.2693955 | 5.6335490 | 0.0000000 |
Index1 | 0.0000787 | 0.0000203 | 3.8713414 | 0.0001092 |
Index2 | -0.6173940 | 0.1205043 | -5.1234189 | 0.0000003 |
Index3 | 0.0003232 | 0.0000511 | 6.3182582 | 0.0000000 |
Index4 | -0.0000006 | 0.0000002 | -3.8274479 | 0.0001305 |
Index5 | 0.4531083 | 0.0951579 | 4.7616464 | 0.0000020 |
test1 <-
test %>%
mutate(SalePrice.Predict = exp(predict(reg, test)),
SalePrice.Error = SalePrice - SalePrice.Predict,
SalePrice.AbsError = abs(SalePrice - SalePrice.Predict),
SalePrice.APE = (abs(SalePrice - SalePrice.Predict)) / SalePrice)
MAE = st_set_geometry(test1, NULL) %>% # the Mean Absolute Error
summarize(mean(SalePrice.AbsError, na.rm = T)) %>%
pull()
print(MAE)
MAPE = st_set_geometry(test1, NULL) %>% # the Mean Absolute Percent Error
summarize(mean(SalePrice.APE, na.rm = T) * 100) %>%
round(2) %>%
paste("%")
print(MAPE)
summary(reg)$r.squared
df = data.frame("r.squared"=summary(reg)$r.squared,"Mean Absolute Error"= MAE, "Mean Absolute Percent Error"=MAPE)
kable(df) %>%
kable_styling(bootstrap_options = "striped", full_width = F)
r.squared | Mean.Absolute.Error | Mean.Absolute.Percent.Error |
---|---|---|
0.7865802 | 210338.9 | 18.44 % |
fitControl <- trainControl(method = "cv", number = 100)
student2$Index1 = student2$lnPropArea * student2$`Fair Market Rent`
student2$Index2 = student2$lnPropArea * student2$`Households with Children, % of Total`
student2$Index3 = student2$lnPropArea * student2$transittop.Buffer
student2$Index4 = student2$`Fair Market Rent` * student2$transittop.Buffer
student2$Index5 = student2$lnPropArea * student2$`Graduate/Professional Degree`
set.seed(825)
reg.cv <- train(SalePrice ~ ., data = as.data.frame(student2) %>%
dplyr::select(SalePrice,LotArea,SaleYr,lnPropArea,PropClassC, hospital.Buffer, crime.Buffer,lnRooms,BuiltYr,Stories,Baths,lnBeds,`Graduate/Professional Degree`,`Average Household Size`,`Managerial Professional`,`Elementary school access`,`Fair Market Rent`,num_business,`No Bedroom`,`Foreign Born`,`Employment Rate`,college,Asian,`20 Units or more`, `Households with Children, % of Total`,`Owner occupied`,`Black/African American`,`Vacant Units`,Index1,Index2,Index3,Index4,Index5,lagPrice),
method = "lm", trControl = fitControl, na.action = na.pass)
A=ggplot(as.data.frame(reg.cv$resample), aes(MAE)) +
geom_histogram(bins = 50, colour="white", fill = "#FA7800") +
labs(title="Distribution of MAE", subtitle = "k-fold cross validation; k = 100",
x="Mean Absolute Error", y="Count")+
plotTheme()
B=A+geom_vline(xintercept=mean(MAE), linetype="dashed",
color = "red", size=2)
mean(MAE)
plot(B)
The mean MAE is 213,393.
test2 <-
test1%>%
filter(SalePrice <= 4000000 & SalePrice > 0,SalePrice.Predict<4000000)
ggplot(test2, aes(SalePrice.Predict, SalePrice)) +
geom_point() +
stat_smooth(data=test2, aes(SalePrice, SalePrice),
method = "lm", se = FALSE, size = 1, colour="#FA7800") +
stat_smooth(data=test2, aes(SalePrice, SalePrice.Predict),
method = "lm", se = FALSE, size = 1, colour="#25CB10") +
labs(title="Predicted sale price as a function of\nobserved price",
subtitle="Orange line represents a perfect prediction; Green line represents prediction") +
theme(plot.title = element_text(size = 18,colour = "black"))
grid.arrange(
ggplot() +
geom_sf(data = baseMap, fill = "grey40") +
geom_sf(data = test1, aes(colour = q5(SalePrice)), show.legend = "point", size = 1) +
scale_colour_manual(values = palette4,
labels=qBr(test1,"SalePrice"),
name="Quintile\nBreaks") +
labs(title="Test set sale prices") +
mapTheme(),
ggplot() +
geom_sf(data = baseMap, fill = "grey40") +
geom_sf(data = test1, aes(colour = q5(SalePrice.AbsError)), show.legend = "point", size = 1) +
scale_colour_manual(values = palette4,
labels=qBr(test1,"SalePrice.AbsError"),
name="Quintile\nBreaks") +
labs(title="Sale price errors on the test set") +
mapTheme(),
ncol = 2)
coords.test <-
filter(test1, !is.na(SalePrice.Error)) %>%
st_coordinates()
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(neighborList.test, style="W")
test1 %>%
filter(!is.na(SalePrice.Error)) %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.Error)) %>%
ggplot(aes(lagPriceError, SalePrice.Error)) +
geom_point(colour = "#FA7800") +
geom_smooth(method = "lm", se = FALSE, colour = "#25CB10") +
labs(title = "Error as a function of the spatial lag of price",
subtitle = "As a measure of spatial autocorrelation; lag of 5 nearest neighbors\nRegression line in green",
caption = "Public Policy Analytics, Figure 6.6",
x = "Spatial lag of errors (Mean error of 5 nearest neighbors)",
y = "Sale Price") +
plotTheme()
moranTest <- moran.mc(filter(test1, !is.na(SalePrice.Error))$SalePrice.Error,
spatialWeights.test , nsim = 999)
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
geom_histogram(binwidth = 0.01) +
geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
scale_x_continuous(limits = c(-1, 1)) +
labs(title="Observed and permuted Moran's I",
subtitle= "Observed Moran's I in red",
x="Moran's I",
y="Count",
caption="Public Policy Analytics, Figure 6.8") +
plotTheme()
inTrain <- createDataPartition(
y=paste(student4$PropClassC,student4$SaleYr),
p =.8, list = FALSE)
sf.training <- student4[inTrain,]
sf.test <- student4[-inTrain,]
reg2 <- lm(lnSalePrice ~ .,data =sf.training %>%
st_set_geometry(NULL) %>%
dplyr::select(lnSalePrice,LotArea,SaleYr,lnPropArea,PropClassC, hospital.Buffer, crime.Buffer,lnRooms,BuiltYr,Stories,Baths,lnBeds,`Graduate/Professional Degree`,`Average Household Size`,`Managerial Professional`,`Elementary school access`,`Fair Market Rent`,num_business,`No Bedroom`,`Foreign Born`,`Employment Rate`,college,Asian,`20 Units or more`, `Households with Children, % of Total`,`Owner occupied`,`Black/African American`,`Vacant Units`))
sf.test <-
sf.test %>%
mutate(Regression = "Baseline Regression",
SalePrice.Predict = exp(predict(reg2, sf.test)),
SalePrice.Error = SalePrice - SalePrice.Predict,
SalePrice.AbsError = abs(SalePrice - SalePrice.Predict),
SalePrice.APE = (abs(SalePrice - SalePrice.Predict)) / SalePrice,
SalePrice = exp(lnSalePrice))%>%
filter(SalePrice < 5000000)
left_join(
st_set_geometry(sf.test, NULL) %>%
group_by(name) %>%
summarize(meanPrice = mean(SalePrice, na.rm = T)),
mutate(sf.test, predict.fe = predict(lm(SalePrice ~ name, data = sf.test),
sf.test)) %>%
st_set_geometry(NULL) %>%
group_by(name) %>%
summarize(meanPrediction = mean(predict.fe))) %>%
kable() %>% kable_styling()
reg3 <- lm(lnSalePrice ~ .,data =sf.training %>%
st_set_geometry(NULL) %>%
dplyr::select(name,lnSalePrice,LotArea,SaleYr,lnPropArea,PropClassC, hospital.Buffer, crime.Buffer,lnRooms,BuiltYr,Stories,Baths,lnBeds,`Graduate/Professional Degree`,`Average Household Size`,`Managerial Professional`,`Elementary school access`,`Fair Market Rent`,num_business,`No Bedroom`,`Foreign Born`,`Employment Rate`,college,Asian,`20 Units or more`, `Households with Children, % of Total`,`Owner occupied`,`Black/African American`,`Vacant Units`))
sf.test.nhood <-
sf.test %>%
mutate(Regression = "Neighborhood Effects",
SalePrice.Predict = exp(predict(reg3, sf.test)),
SalePrice.Error = SalePrice - SalePrice.Predict,
SalePrice.AbsError = abs(SalePrice - SalePrice.Predict),
SalePrice.APE = (abs(SalePrice - SalePrice.Predict)) / SalePrice,
SalePrice = exp(lnSalePrice))%>%
filter(SalePrice < 5000000)
bothRegressions <-
rbind(
dplyr::select(sf.test, c(starts_with("SalePrice."),
Regression, SalePrice, name)) %>%
na.omit() %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.Error)),
dplyr::select(sf.test.nhood, c(starts_with("SalePrice."),
Regression, SalePrice, name)) %>%
na.omit() %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.Error)))
bothRegressions %>%
gather(Variable, Value, -Regression, -name, -geometry) %>%
filter(Variable != "SalePrice" & Variable != "SalePrice.Predict") %>%
group_by(Regression, Variable) %>%
summarize(meanValue = mean(Value, na.rm = T)) %>%
st_set_geometry(NULL) %>%
spread(Variable, meanValue) %>%
kable() %>%
kable_styling("striped", full_width = F) %>%
row_spec(1, color = "black", background = "#25CB10") %>%
row_spec(2, color = "black", background = "#FA7800")
bothRegressions %>%
ggplot() +
geom_sf(data = baseMap, fill = "grey40") +
geom_sf(aes(colour = q5(SalePrice.AbsError)), show.legend = "point", size = 1) +
facet_wrap(~Regression) +
scale_colour_manual(values = palette4,
labels=qBr(bothRegressions,"SalePrice.AbsError"),
name="Quintile\nBreaks") +
labs(title="Test set errors") +
mapTheme()
bothRegressions.nhoods <-
bothRegressions %>%
group_by(Regression, name) %>%
summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T)) %>%
st_set_geometry(NULL) %>%
left_join(sf_neighborhood2) %>%
st_sf()
ggplot() +
geom_sf(data = bothRegressions.nhoods, aes(fill = q5(mean.MAPE))) +
geom_sf(data = bothRegressions, colour = "black", size = .5) +
facet_wrap(~Regression) +
scale_fill_manual(values = palette4,
labels = qBr(bothRegressions.nhoods, "mean.MAPE", rnd = F),
name="Quintile\nBreaks") +
labs(title = "Mean test set MAPE by neighborhood") +
mapTheme()
bothRegressions.nhoods2 <-
bothRegressions %>%
group_by(Regression, name) %>%
summarize(mean.price = mean(SalePrice, na.rm = T)) %>%
st_set_geometry(NULL) %>%
left_join(bothRegressions.nhoods) %>%
st_sf()
bothRegressions.nhoods2 %>%
ggplot(aes( mean.price, mean.MAPE)) +
geom_point(colour = "#FA7800") +
geom_smooth(method = "lm", se = FALSE, colour = "#25CB10") +
facet_wrap(~Regression) +
labs(title = "MAPE as a Function of Average Sales Price by neighborhood",
caption = "Public Policy Analytics, Figure 6.6",
x = "Mean Price by neighbourhood",
y = "Mean MAPE by neighborhood") +
plotTheme()
census_api_key("d82854b2521c584efb039f747f0343702a6211eb",overwrite = TRUE,install = TRUE)
readRenviron("~/.Renviron")
tracts17 <-
get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E","B06011_001"),
year = 2017, state=06, county=075, geometry=T) %>%
st_transform(2227) %>%
dplyr::select(variable, estimate, GEOID) %>%
spread(variable, estimate) %>%
rename(TotalPop = B01001_001,
NumberWhites = B01001A_001,
Median_Income = B06011_001) %>%
mutate(percentWhite = NumberWhites / TotalPop,
raceContext = ifelse(percentWhite > .5, "Majority White", "Majority Non-White"),
incomeContext = ifelse(Median_Income > 32322, "High Income", "Low Income"))
grid.arrange(
ggplot() +
geom_sf(data = na.omit(tracts17), aes(fill = raceContext)) +
scale_fill_manual(values = c("#25CB10", "#FA7800"),
name="Race Context") +
labs(title = "Race Context") +
mapTheme() + theme(legend.position="bottom"),
ggplot() +
geom_sf(data = na.omit(tracts17), aes(fill = incomeContext)) +
scale_fill_manual(values = c("#25CB10", "#FA7800"),
name="Income Context") +
labs(title = "Income Context") +
mapTheme() + theme(legend.position="bottom"), ncol = 2)
st_join(bothRegressions, tracts17) %>%
group_by(Regression, raceContext) %>%
summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T)) %>%
st_set_geometry(NULL) %>%
spread(raceContext, mean.MAPE) %>%
kable(caption = "Mean Absolute Error of test set sales by neighborhood racial context") %>%
kable_styling("striped", full_width = F) %>%
row_spec(1, color = "black", background = "#25CB10") %>%
row_spec(2, color = "black", background = "#FA7800")
Regression | Majority Non-White | Majority White |
---|---|---|
Baseline Regression | 0.1910693 | 0.2080595 |
Neighborhood Effects | 0.1708258 | 0.1921813 |
st_join(bothRegressions, tracts17) %>%
filter(!is.na(incomeContext)) %>%
group_by(Regression, incomeContext) %>%
summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T)) %>%
st_set_geometry(NULL) %>%
spread(incomeContext, mean.MAPE) %>%
kable(caption = "Mean Absolute Error of test set sales by neighborhood income context") %>%
kable_styling("striped", full_width = F) %>%
row_spec(1, color = "black", background = "#25CB10") %>%
row_spec(2, color = "black", background = "#FA7800")
Regression | High Income | Low Income |
---|---|---|
Baseline Regression | 0.1952467 | 0.2076065 |
Neighborhood Effects | 0.1794249 | 0.1825406 |
Our model generalizes reasonably well. It assesses home sale prices with little error for low-income and high-income areas and for most majority non-white and majority white communities. The MAPE fluctuates little across different communities, which supports applying our model across San Francisco.
Our house price prediction model is effective in terms of the cross-validation results. The more interesting variables are the built year of the house, the count of hospitals within its buffer, the distance to the nearest schools, and the count of public transit stops within its buffer. Our model is capable of explaining 78.7% of the variation in sale prices in San Francisco. The most important predictors are the lot area of the house, the sale year, the property class, average household size in the neighborhood, elementary school access in the neighborhood, and the percentages of Asian and Black/African American residents in the neighborhood. In cross-validation, our model had a Mean Absolute Error of 292,377.6 and an RMSE of 471,227.5, which is acceptable for an OLS regression with limited features. According to the maps, there is spatial variation in house prices across neighborhoods, which could be attributed to factors such as some neighborhoods being wealthier, or having more white residents, than others. Our model predicts better in downtown San Francisco and less well in uptown San Francisco, possibly because house prices in uptown San Francisco are generally higher and more variable. In the future we need to look for more features to further account for the prediction errors.
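For reference, the cross-validated error metrics quoted above can be recovered from caret's per-fold resample table; a minimal sketch:
# Average the per-fold RMSE and MAE from the 100-fold cross-validation
colMeans(reg.cv$resample[, c("RMSE", "MAE")], na.rm = TRUE)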
We would recommend this model to Zillow, even though we understand Zillow may need a model with higher predictive power. We have assembled a strong data set of neighborhood features that could be used with other algorithms as well. If we were to improve the model further, we would also include a spatial lag term to account for the spatial pattern in sale prices, as sketched below.
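As a sketch of that extension, assuming the lagPrice column computed earlier and an illustrative (not final) set of controls:
# Hypothetical refit: add the spatial lag of price to the predictors
reg.lag <- lm(lnSalePrice ~ ., data = sf.training %>%
st_set_geometry(NULL) %>%
dplyr::select(lnSalePrice, lagPrice, LotArea, SaleYr, lnPropArea,
crime.Buffer, lnRooms, Baths))
summary(reg.lag)$r.squared
One caveat: a lag built from observed neighbor prices is not available for unsold homes, so a deployed model would need to lag predicted prices instead.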