1. Introduction

1.1 The Aim of this project

This project takes Boston as its study area. To inform the next large-scale comprehensive planning process in the region, the local Metropolitan Planning Organization (MPO) has requested a forecast of urban development in 2020. We approach this from both the supply side and the demand side, using machine learning algorithms.

1.2 The Background and Motivation

The Background:

Founded in 1630, Boston is the largest city in New England and an important port city and manufacturing center. To maintain its competitiveness, Boston has continuously optimized its industrial structure, and the city has become a center of American higher education, healthcare, technology, and finance. While employment opportunities and social resources have promoted population and economic growth, they have also brought unprecedented challenges to municipal management. As the city continues to develop, studying urban expansion helps us understand how to improve land use efficiency and protect sensitive areas such as agricultural land, in pursuit of sustainable development.

The Motivation:

Our motivation is to predict which land may be developed, and which land is not suitable for development, in the context of urban expansion toward 2020. With rapid population and economic growth, urban boundaries continue to expand and occupy agricultural land, and the Boston metropolitan area has grown into one of the largest in the United States. Ever-expanding urban boundaries place many burdens on urban management, including higher public service expenditures, environmental pollution, and increased agricultural costs, and the expanding urban fringe poses challenges to urban infrastructure and the environment. The motivation of this project is therefore very practical.

1.3 The Steps for the project

  • Data Wrangling (Clean the raw data from the websites)
  • Feature engineering (Create the features and combine the features and labels in one data frame)
  • Exploratory Analysis
  • Run the Model
  • Predicting land cover demand for 2020
  • Allocation

1.4 Set Up

Here we import the libraries needed for the analysis and define the mapTheme and plotTheme functions, along with a fixed set of color palettes.

library(tidyverse)
library(sf)
library(raster)
library(knitr)
library(kableExtra)
library(tidycensus)
library(tigris)
library(FNN)
library(QuantPsyc)
library(caret)
library(yardstick)
library(pscl)
library(plotROC) 
library(ggrepel)
library(pROC)
library(grid)
library(gridExtra)
library(viridis)

mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2)
  )
}

plotTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 14)
  )
}

palette2 <- c("#41b6c4","#253494")
palette4 <- c("#a1dab4","#41b6c4","#2c7fb8","#253494")
palette5 <- c("#ffffcc","#a1dab4","#41b6c4","#2c7fb8","#253494")
palette10 <- c("#f7fcf0","#e0f3db","#ccebc5","#a8ddb5","#7bccc4",
               "#4eb3d3","#2b8cbe","#0868ac","#084081","#f7fcf0")

#This function converts a column into quintile breaks. It is used for mapping.
quintileBreaks <- function(df,variable) {
    as.character(quantile(df[[variable]],
                          c(.01,.2,.4,.6,.8),na.rm=T))
}

#This function can be used to convert a polygon sf to centroids xy coords.
xyC <- function(aPolygonSF) {
  as.data.frame(
    cbind(x=st_coordinates(st_centroid(aPolygonSF))[,1],
          y=st_coordinates(st_centroid(aPolygonSF))[,2]))
} 

#This function converts a raster to a data frame so it can be plotted in ggplot.
rast <- function(inRaster) {
  data.frame(
    xyFromCell(inRaster, 1:ncell(inRaster)), 
    value = getValues(inRaster)) }

2. Data wrangling & feature engineering

2.1 Land Cover Change data

  • We need the city boundary data of Boston. It comes from Boston's open data portal.
  • We need the land cover data. It comes from the Multi-Resolution Land Characteristics Consortium's National Land Cover Database (NLCD).
  • We use the national data for 2001 and 2011 and clip it to the Boston boundary in ArcGIS.
  • To find the land cover change, we use the ArcGIS Reclassify tool to define the developed and undeveloped areas, then use the Raster Calculator to identify the change area: cells whose land use changed are coded 1 and unchanged cells 0. Finally, unchanged cells are reclassified to NoData. (A minimal R sketch of this workflow follows this list.)
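For reference, the same clip-and-reclassify workflow can be sketched in R with the raster package. This is only a sketch under assumed inputs: the file names nlcd_2001.tif and nlcd_2011.tif stand in for the national NLCD rasters and are not the project's actual files, and both layers are assumed to share one coordinate system.

#A minimal R sketch of the ArcGIS clip / reclassify / raster-calculator steps (assumed file names)
library(raster)
library(sf)

boundary    <- st_read("boston.shp")              #city boundary from Boston's open data portal
boundary_sp <- as(boundary, "Spatial")
nlcd01 <- raster("nlcd_2001.tif")                 #hypothetical NLCD 2001 raster
nlcd11 <- raster("nlcd_2011.tif")                 #hypothetical NLCD 2011 raster

#clip the national rasters to the Boston boundary
nlcd01 <- mask(crop(nlcd01, boundary_sp), boundary_sp)
nlcd11 <- mask(crop(nlcd11, boundary_sp), boundary_sp)

#reclassify: NLCD classes 21-24 are developed (1), everything else undeveloped (0)
toDeveloped <- function(x) ifelse(is.na(x), NA, ifelse(x %in% 21:24, 1, 0))
dev01 <- calc(nlcd01, toDeveloped)
dev11 <- calc(nlcd11, toDeveloped)

#raster-calculator step: newly developed cells become 1, everything else becomes NoData
lc_change <- dev11 - dev01
lc_change[lc_change != 1] <- NA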

The following plot shows the land cover change in the Boston area.

setwd("/Users/zhangjing/Desktop/zhaoanyang/data")

BostonMSA <- 
  st_read("boston.shp") %>%
  st_transform(102286)

lc_change <-   raster("develop.tif")
lc_change01 <-   raster("lc_change01.tif")

newproj <- "+proj=lcc +lat_1=41.71666666666667 +lat_2=42.68333333333333 +lat_0=41 +lon_0=-71.5 +x_0=200000 +y_0=750000 +ellps=GRS80 +units=m +no_defs "
lc_change <- projectRaster(lc_change, crs=newproj)
lc_change01 <- projectRaster(lc_change01, crs=newproj)



ggplot() +
  geom_sf(data=BostonMSA) +
  geom_raster(data=rast(lc_change) %>% na.omit, 
              aes(x,y,fill=as.factor(value))) +
  scale_fill_viridis(discrete=TRUE, name ="Land Cover\nChange") + 
  labs(title="Development land use change In Boston") +
  mapTheme()

Next, the fishnet is created at a 100 by 100 foot resolution and subset to the Boston boundary. The vector fishnet is then plotted. The fishnet allows us to aggregate all of the features to a common unit of analysis, which is what we use to run the regression.

BostonMSA_fishnet <- 
  st_make_grid(BostonMSA, 100) %>%
  st_sf()

BostonMSA_fishnet <-
  BostonMSA_fishnet[BostonMSA,]

ggplot() +
  geom_sf(data=BostonMSA_fishnet) +
  labs(title="Fishnet, 100 foot resolution") +
  mapTheme()

Here, the land cover change raster is converted to an sf point layer, and the points are joined to the fishnet with aggregate. To speed up the mapping process, the fishnet polygons are converted to centroid points using the xyC function defined above.

The following plot shows the no-change areas and the newly developed areas as fishnet centroid points.

changePoints <-
  rasterToPoints(lc_change) %>%
  as.data.frame() %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(BostonMSA_fishnet))

fishnet <- 
  aggregate(changePoints, BostonMSA_fishnet, sum) %>%
  mutate(develop = ifelse(is.na(develop),0,1),
         develop = as.factor(develop))

ggplot() +
  geom_sf(data=BostonMSA) +
  geom_point(data=fishnet, 
             aes(x=xyC(fishnet)$x, y=xyC(fishnet)$y, colour=develop)) +
  scale_colour_manual(values = palette2,
                      labels=c("No Change","New Development"),
                      name = "") +
  labs(title = "Land cover development change", subtitle = "As fishnet centroids") +
  mapTheme()

2.2 Land Cover in 2001

Here we load the 2001 land cover data. For the regression, it is possible that the propensity for new development is a function of the existing land cover categories.

setwd("/Users/zhangjing/Desktop/zhaoanyang/data")
lc_2001 <- raster("boston_2001.tif")
lc_2001 <- projectRaster(lc_2001, crs=newproj)

#ggplot() +
#  geom_sf(data=BostonMSA) +
#  geom_raster(data=rast(lc_2001) %>% na.omit %>% filter(value > 0), 
#              aes(x,y,fill=as.factor(value))) +
#  scale_fill_viridis(discrete=TRUE, name ="") +
#  labs(title = "Land Cover, 2001") +
#  mapTheme()

From the raster values, we combine specific land cover classes into developed, forest, farm, wetlands, otherUndeveloped, and water categories.

developed <- lc_2001 == 21 | lc_2001 == 22 | lc_2001 == 23 | lc_2001 == 24
forest <- lc_2001 == 41 | lc_2001 == 42 | lc_2001 == 43 
farm <- lc_2001 == 81 | lc_2001 == 82 
wetlands <- lc_2001 == 90 | lc_2001 == 95 
otherUndeveloped <- lc_2001 == 52 | lc_2001 == 71 | lc_2001 == 31 
water <- lc_2001 == 11

names(developed) <- "developed"
names(forest) <- "forest"
names(farm) <- "farm"
names(wetlands) <- "wetlands"
names(otherUndeveloped) <- "otherUndeveloped"
names(water) <- "water"

Here, we define a function called aggregateRaster, and each raster in the list is aggregated to the fishnet.

aggregateRaster <- function(inputRasterList, theFishnet) {
  #create an empty fishnet with the same dimensions as the input fishnet
  theseFishnets <- theFishnet %>% dplyr::select()
  #for each raster in the raster list
  for (i in inputRasterList) {
  #create a variable name corresponding to the ith raster
  varName <- names(i)
  #convert raster to points as an sf
    thesePoints <-
      rasterToPoints(i) %>%
      as.data.frame() %>%
      st_as_sf(coords = c("x", "y"), crs = st_crs(theFishnet)) %>%
      filter(.[[1]] == 1)
  #aggregate to the fishnet
    thisFishnet <-
      aggregate(thesePoints, theFishnet, length) %>%
      mutate(!!varName := ifelse(is.na(.[[1]]),0,1))
  #add to the larger fishnet
    theseFishnets <- cbind(theseFishnets,thisFishnet)
  }
  #output all aggregates as one large fishnet
   return(theseFishnets)
  }

The following plot explores the land cover types in 2001.

theRasterList <- c(developed,forest,farm,wetlands,otherUndeveloped,water)

aggregatedRasters <-
  aggregateRaster(theRasterList, BostonMSA_fishnet) %>%
  dplyr::select(developed,forest,farm,wetlands,otherUndeveloped,water) %>%
  mutate_if(is.numeric,as.factor)

aggregatedRasters %>%
  gather(var,value,developed:water) %>%
  st_cast("POLYGON") %>%    #just to make sure no weird geometries slipped in
  mutate(X = xyC(.)$x,
         Y = xyC(.)$y) %>%
  ggplot() +
    geom_sf(data=BostonMSA) +
    geom_point(aes(X,Y, colour=as.factor(value))) +
    facet_wrap(~var) +
    scale_colour_manual(values = palette2,
                        labels=c("Other","Land Cover"),
                        name = "") +
    labs(title = "Land cover types, 2001",
         subtitle = "As fishnet centroids") +
   mapTheme()

2.3 Census Data

Population and its change usually have significant implications for local development demand. A population surge implies potential development pressure, while a shrinking population usually means developers are not interested in the area. Thus, we select and explore the 2000 and 2010 population data, as the following plot shows.

census_api_key("d82854b2521c584efb039f747f0343702a6211eb")

BostonPop00 <- 
  get_decennial(geography = "tract", variables = "P001001", year = 2000,
                state=25, county=025,geometry = TRUE) %>%
  rename(pop_2000 = value) %>%
  st_transform(st_crs(BostonMSA_fishnet))

BostonPop10 <- 
  get_decennial(geography = "tract", variables = "P001001", year = 2010,
                state=25, county=025,geometry = TRUE) %>%
  rename(pop_2010 = value) %>%
  st_transform(st_crs(BostonMSA_fishnet)) %>%
  st_buffer(-1)
grid.arrange(
ggplot() +
  geom_sf(data = BostonPop00, aes(fill=factor(ntile(pop_2000,5))), colour=NA) +
  scale_fill_manual(values = palette5,
                    labels=quintileBreaks(BostonPop00,"pop_2000"),
                   name="Quintile\nBreaks") +
  labs(title="Population, Boston MSA: 2000") +
  mapTheme(),

ggplot() +
  geom_sf(data = BostonPop10, aes(fill=factor(ntile(pop_2010,5))), colour=NA) +
  scale_fill_manual(values = palette5,
                    labels=quintileBreaks(BostonPop10,"pop_2010"),
                   name="Quintile\nBreaks") +
  labs(title="Population, Boston MSA: 2010") +
  mapTheme(), ncol=2)

BostonMSA_fishnet <-
  BostonMSA_fishnet %>%
  rownames_to_column("fishnetID") %>% 
  mutate(fishnetID = as.numeric(fishnetID)) %>%
  dplyr::select(fishnetID)

fishnetPopulation00 <-
  st_interpolate_aw(BostonPop00["pop_2000"],BostonMSA_fishnet, extensive=TRUE) %>%
  as.data.frame(.) %>%
  left_join(BostonMSA_fishnet, ., by=c("fishnetID"='Group.1')) %>% 
  mutate(pop_2000 = replace_na(pop_2000,0)) %>%
  dplyr::select(pop_2000)

fishnetPopulation10 <-
  st_interpolate_aw(BostonPop10["pop_2010"],BostonMSA_fishnet, extensive=TRUE) %>%
  as.data.frame(.) %>%
  left_join(BostonMSA_fishnet, ., by=c("fishnetID"='Group.1')) %>% 
  mutate(pop_2010 = replace_na(pop_2010,0)) %>%
  dplyr::select(pop_2010)

fishnetPopulation <- 
  cbind(fishnetPopulation00,fishnetPopulation10) %>%
  dplyr::select(pop_2000,pop_2010) %>%
  mutate(pop_Change = pop_2010 - pop_2000)

Because the census tract boundaries changed in 2010, it is a little difficult to compare the two years directly. A smaller unit such as the fishnet helps a lot, so the next step is to join the census data to the fishnet layer efficiently and accurately. Instead of using a traditional st_join, we adopt st_interpolate_aw, also known as areal weighted interpolation. It assumes the population is distributed evenly within each tract and assigns population to each grid cell by weighting by the intersected area. Fishnet cells that do not fall within any census geometry (e.g. water areas) are assigned a population of 0.
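Conceptually, areal weighted interpolation allocates to each grid cell the tract's population multiplied by the share of the tract's area that falls inside the cell. A hand-rolled sketch of that weight for a single tract and cell, where tr and cell are hypothetical sf objects introduced only for illustration:

#'tr' is one census tract and 'cell' is one fishnet cell (hypothetical sf objects)
w <- as.numeric(st_area(st_intersection(cell, tr)) / st_area(tr))  #share of the tract falling inside the cell
cell_pop <- tr$pop_2010 * w                                        #population allocated to that cell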

grid.arrange(
ggplot() +
  geom_sf(data=BostonPop10, aes(fill=factor(ntile(pop_2010,5))),colour=NA) +
  scale_fill_manual(values = palette5,
                    labels=substr(quintileBreaks(BostonPop10,"pop_2010"),1,4),
                   name="Quintile\nBreaks") +
  labs(title="Population, Boston: 2010",
       subtitle="Represented as tracts; Boundaries omitted") +
  mapTheme(),

ggplot() +
  geom_sf(data=fishnetPopulation, aes(fill=factor(ntile(pop_2010,5))),colour=NA) +
  scale_fill_manual(values = palette5,
                   labels=substr(quintileBreaks(fishnetPopulation,"pop_2010"),1,4),
                   name="Quintile\nBreaks") +
  labs(title="Population, Boston: 2010",
       subtitle="Represented as fishnet gridcells; Boundaries omitted") +
  mapTheme(), ncol=2)

The following plot shows the population change from 2000 to 2010 in Boston (represented by fishnet cells, in quintiles). It implies that the population decline in central Boston was quite severe between 2000 and 2010, while the area along the Charles River experienced a slight population increase over that period.

ggplot() +
  geom_sf(data=fishnetPopulation, aes(fill=factor(ntile(pop_Change,5))),colour=NA) +
  scale_fill_manual(values = palette5,
                   labels=substr(quintileBreaks(fishnetPopulation,"pop_Change"),1,4),
                   name="Quintile\nBreaks") +
  labs(title="Population Change from 2000 to 2010, Boston MSA",
       subtitle="Represented as fishnet gridcells; Boundaries omitted")+
  mapTheme()

2.4 Highway

Transportation accessibility is also a key factor that stimulates or suppresses potential development. Highway density, or the distance to the nearest highway, is a useful measure of a location's accessibility, especially in suburban areas where public transit is not as convenient. In the diagram below, we observe that proximity to highways explains almost all of the new development from 2001 to 2011, for example the development clusters in southern and eastern Boston as well as the Back Bay area. Thus, we calculate the average distance from each grid cell to the nearby highways, as the subsequent plot shows.

setwd("/Users/zhangjing/Desktop/zhaoanyang/data")
BostonHighways <-
  st_read("road1.shp") %>%
  st_transform(st_crs(BostonMSA)) %>%
  st_intersection(BostonMSA)

ggplot() +
  geom_point(data=fishnet, 
             aes(x=xyC(fishnet)[,1], y=xyC(fishnet)[,2],colour=develop),size=1.5) +
  geom_sf(data=BostonHighways) +
  scale_colour_manual(values = palette2,
                      labels=c("No Change","New Development")) +
  labs(title = "New Development and Highways",
       subtitle = "As fishnet centroids") +
  mapTheme()

emptyRaster <- lc_change01
emptyRaster[] <- NA

highway_raster <- 
  as(BostonHighways,'Spatial') %>%
  rasterize(.,emptyRaster)

highway_raster_distance <- distance(highway_raster)
names(highway_raster_distance) <- "distance_highways"

highwayPoints <-
  rasterToPoints(highway_raster_distance) %>%
  as.data.frame() %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(BostonMSA_fishnet))

highwayPoints_fishnet <- 
  aggregate(highwayPoints, BostonMSA_fishnet, mean) %>%
  mutate(distance_highways = ifelse(is.na(distance_highways),0,distance_highways))

ggplot() +
  geom_sf(data=BostonMSA) +
  geom_point(data=highwayPoints_fishnet, aes(x=xyC(highwayPoints_fishnet)[,1], 
                                             y=xyC(highwayPoints_fishnet)[,2], 
                 colour=factor(ntile(distance_highways,5))),size=1.5) +
  scale_colour_manual(values = palette5,
                      labels=substr(quintileBreaks(highwayPoints_fishnet,"distance_highways"),1,8),
                      name="Quintile\nBreaks") +
  geom_sf(data=BostonHighways, colour = "red") +
  labs(title = "Distance to Highways",
       subtitle = "As fishnet centroids; Highways visualized in red") +
  mapTheme()

2.5 The spatial lag of development

Urban development history tells us that new development from 2001 to 2011 may also have a considerable influence on future development. We believe this influence varies considerably from city to city, and even between the central city and rural areas. For example, in some areas that saw new development from 2001 to 2011, development might stagnate over the next 10 years because the land resource is limited, which actually repels future development. In some suburban areas, however, the new development of 2001 to 2011 is likely to continue over the next 10 years, though at a different rate. In short, the forecasting model needs to take previous new development into account.

Thus, we calculate the average distance from each grid cell to its k nearest previously developed cells to capture information about the surrounding development. This measure is also known as a spatial lag.

nn_function <- function(measureFrom, measureTo, k) {
  #convert the sf coordinate layers to matrices
  measureFrom_Matrix <- as.matrix(measureFrom)
  measureTo_Matrix <- as.matrix(measureTo)
  #distance from each origin point to its k nearest neighbours
  nn <- get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist
  #average those k distances for each origin point
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>%
    dplyr::select(-thisPoint) %>%
    pull()

  return(output)
}
fishnet$lagDevelopment <-
    nn_function(xyC(fishnet),
                xyC(filter(aggregatedRasters,developed==1)),
                10)

ggplot() +
  geom_sf(data=BostonMSA) +
  geom_point(data=fishnet, 
             aes(x=xyC(fishnet)[,1], y=xyC(fishnet)[,2], 
                 colour=factor(ntile(lagDevelopment,5))), size=1.5) +
  scale_colour_manual(values = palette5,
                     labels=substr(quintileBreaks(fishnet,"lagDevelopment"),1,7),
                     name="Quintile\nBreaks") +
  labs(title = "Spatial lag to 2001 development",
       subtitle = "As fishnet centroids") +
  mapTheme()

2.6 Study Area

options(tigris_class = "sf")

studyAreaCounties <- 
  county_subdivisions(state=25,county = 025,cb = TRUE) %>%
  st_transform(st_crs(BostonMSA)) %>%
  dplyr::select(NAME) %>%
  .[st_buffer(BostonMSA,-100), , op=st_intersects]


ggplot() +
  geom_sf(data=studyAreaCounties) +
  labs(title = "Study area counties") +
  mapTheme()

2.7 Create Final Dataset

Finally, we need to put all variables together into one table.

dat <- 
  cbind(
    fishnet, highwayPoints_fishnet, fishnetPopulation, aggregatedRasters) %>%
  dplyr::select(develop, developed, forest, farm, wetlands, otherUndeveloped, water,pop_2000, pop_2010, pop_Change, distance_highways,lagDevelopment) %>%
  st_join(studyAreaCounties) %>%
  mutate(developed10 = ifelse(develop == 1 & developed == 1, 0, developed)) %>%
  filter(water == 0) 

dat$NAME[is.na(dat$NAME)]<- "Boston"

3. Exploratory Analysis

In this section, we conduct the exploratory analysis.

First, we want to explore what the features can tell us about separating the no-change areas from the new development. For the bar chart, we calculate the mean distance to highways and the mean lag development for each class.

It tells us that the no-change areas are, on average, farther from the highways. This is in line with our intuition: the closer land is to a highway, the higher its development value, so land that has not changed is on average far from highways, while newly developed land is closer.

It also tells us about the lag development value: newly developed areas appear to have more development in their neighborhoods.

dat %>%
  dplyr::select(distance_highways,lagDevelopment,develop) %>%
  gather(Variable, Value, -develop, -geometry) %>%
  ggplot(., aes(develop, Value, fill=develop)) + 
    geom_bar(position = "dodge", stat = "summary", fun.y = "mean") +
    facet_wrap(~Variable) +
    scale_fill_manual(values = palette2,
                      labels=c("No Change","New Development"),
                      name="") +
    labs(title="New development as a function of the countinuous variables") +
    plotTheme() 

This bar chart reflects the 2000 and 2010 population and the population change over that period. In fact, the population trends are not very distinct here, which also suggests that Boston is a gradually saturating city where population mobility is not very strong.

dat %>%
  dplyr::select(pop_2000,pop_2010,pop_Change,develop) %>%
  gather(Variable, Value, -develop, -geometry) %>%
  ggplot(., aes(develop, Value, fill=develop)) + 
    geom_bar(position = "dodge", stat = "summary", fun.y = "mean") +
    facet_wrap(~Variable) +
    scale_fill_manual(values = palette2,
                      labels=c("No Change","New Development"),
                      name="") +
    labs(title="New development as a function of factor variables") +
    plotTheme()

The following table shows the land cover conversion between 2001 and 2011. In the Boston area, forest and other undeveloped land contribute most to the land cover change.

dat %>%
  dplyr::select(develop:otherUndeveloped,developed) %>%
  gather(Land_Cover_Type, Value, -develop, -geometry) %>%
   st_set_geometry(NULL) %>%
     group_by(develop, Land_Cover_Type) %>%
     summarize(n = sum(as.numeric(Value))) %>%
     ungroup() %>%
    mutate(Conversion_Rate = paste0(round(100 * n/sum(n), 2), "%")) %>%
    filter(develop == 1) %>%
  dplyr::select(Land_Cover_Type,Conversion_Rate) %>%
  kable() %>% kable_styling(full_width = F)
Land_Cover_Type    Conversion_Rate
developed          1%
farm               0.01%
forest             0.52%
otherUndeveloped   0.27%
wetlands           0.12%

4. Predicting for 2011

4.1 Modeling

First, we split the data into 50% training and 50% test sets. Models are estimated on the training set. We use nrow to check how many observations there are.

set.seed(3456)
trainIndex <- 
  createDataPartition(dat$developed, p = .50,
                                  list = FALSE,
                                  times = 1)
datTrain <- dat[ trainIndex,]
datTest  <- dat[-trainIndex,]

nrow(dat)

Six separate logistic regression models are estimated below to predict development change between 2001 and 2011. Features are added to the models gradually so that we can compare how performance changes as features are added.

Model1 <- glm(develop ~ wetlands + forest  + farm + otherUndeveloped, 
              family="binomial"(link="logit"), data = datTrain)

Model2 <- glm(develop ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment, 
              family="binomial"(link="logit"), data = datTrain)
              
Model3 <- glm(develop ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + pop_2000, 
              family="binomial"(link="logit"), data = datTrain)          
              
Model4 <- glm(develop ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + pop_2000 + 
              pop_2010, 
              family="binomial"(link="logit"), data = datTrain)              
            
Model5 <- glm(develop ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + pop_Change, 
              family="binomial"(link="logit"), data = datTrain)              
              
Model6 <- glm(develop ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + pop_Change + 
              distance_highways, 
              family="binomial"(link="logit"), data = datTrain) 

Here we create a data frame of pseudo R-squared values for each model and plot them for comparison. We loop through the models, retrieving the goodness of fit for each model with its different number of features.

modelList <- paste0("Model", 1:6)
map_dfc(modelList, function(x)pR2(get(x)))[4,] %>%
  setNames(paste0("Model",1:6)) %>%
  gather(Model,McFadden) %>%
  ggplot(aes(Model,McFadden)) +
    geom_bar(stat="identity") +
    labs(title= "McFadden R-Squared by Model") +
    plotTheme()
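As a cross-check, McFadden's R-squared for a single model can also be computed by hand: it is one minus the ratio of the fitted model's log-likelihood to that of an intercept-only model. A minimal sketch for Model6 (nullModel below is introduced only for this check):

#intercept-only (null) model on the same training data
nullModel <- glm(develop ~ 1, family = binomial(link = "logit"), data = datTrain)
#McFadden pseudo R-squared: 1 - logLik(model) / logLik(null)
1 - as.numeric(logLik(Model6)) / as.numeric(logLik(nullModel))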

The following density plot shows the distribution of predicted probabilities by observed class. It helps us choose a sensible classification threshold.

testSetProbs <- 
  data.frame(class = datTest$develop,
             probs = predict(Model6, datTest, type="response")) 
  
ggplot(testSetProbs, aes(probs)) +
  geom_density(aes(fill=class), alpha=0.5) +
  scale_fill_manual(values = palette2,
                    labels=c("No Change","New Development")) +
  labs(title = "Histogram of test set predicted probabilities",
       x="Predicted Probabilities",y="Density") +
  plotTheme()

4.2 Accuracy

In this part, we pick a predicted probability threshold used to classify an area as having new development.

Sensitivity is the proportion of actual positives (1's) that were predicted to be positive. Specificity is the proportion of actual negatives (0's) that were predicted to be negative.

From an urban planning perspective, depending on the issue you focus on, you may care more about sensitivity, specificity, or both.

In this project, we use two thresholds, 2% and 7%. The 2% threshold performs better on sensitivity, meaning it is better at predicting new development, while the 7% threshold works very well on specificity but cannot predict new development well.
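For clarity, both metrics can be computed by hand from the confusion matrix. The minimal sketch below does this at the 2% threshold and should agree with the yardstick results that follow:

#hand computation at the 2% threshold; class is a factor with levels "0"/"1"
predClass <- ifelse(testSetProbs$probs >= 0.02, "1", "0")
TP <- sum(predClass == "1" & testSetProbs$class == "1")  #true positives
TN <- sum(predClass == "0" & testSetProbs$class == "0")  #true negatives
FP <- sum(predClass == "1" & testSetProbs$class == "0")  #false positives
FN <- sum(predClass == "0" & testSetProbs$class == "1")  #false negatives
TP / (TP + FN)  #Sensitivity
TN / (TN + FP)  #Specificity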

options(yardstick.event_first = FALSE)

testSetProbs <- 
  testSetProbs %>% 
  mutate(predClass_02 = as.factor(ifelse(testSetProbs$probs >= 0.02 ,1,0)),
         predClass_07 = as.factor(ifelse(testSetProbs$probs >= 0.07 ,1,0))) 

testSetProbs %>%
  dplyr::select(-probs) %>%
  gather(Variable, Value, -class) %>%
  group_by(Variable) %>%
  summarize(Sensitivity = round(yardstick::sens_vec(class,factor(Value)),2),
            Specificity = round(yardstick::spec_vec(class,factor(Value)),2),
            Accuracy = round(yardstick::accuracy_vec(class,factor(Value)),2)) %>% 
  kable() %>%
  kable_styling(full_width = F)

This map shows that at the 2% threshold, specificity suffers: many areas that did not develop are predicted as developed.

predsForMap <-         
  dat %>%
    mutate(probs = predict(Model6, dat, type="response") ,
           Threshold_02_Pct = as.factor(ifelse(probs >= 0.02 ,1,0)),
           Threshold_07_Pct =  as.factor(ifelse(probs >= 0.07 ,1,0))) %>%
    dplyr::select(develop,Threshold_02_Pct,Threshold_07_Pct) %>%
    gather(Variable,Value, -geometry) %>%
    st_cast("POLYGON")

ggplot() +
  geom_point(data=predsForMap, aes(x=xyC(predsForMap)[,1], y=xyC(predsForMap)[,2], colour=Value)) +
  facet_wrap(~Variable) +
  scale_colour_manual(values = palette2, labels=c("No Change","New Development"),
                      name="") +
  labs(title="Development predictions - Low threshold")+coord_sf(crs = st_crs(102286)) + mapTheme()

The following maps show, for each grid cell, the true positives and true negatives under each threshold.

ConfusionMatrix.metrics <-
  dat %>%
    mutate(probs = predict(Model6, dat, type="response") ,
           Threshold_02_Pct = as.factor(ifelse(probs >= 0.02 ,1,0)),
           Threshold_07_Pct =  as.factor(ifelse(probs >= 0.07 ,1,0))) %>%
    mutate(TrueP_02 = ifelse(develop  == 1 & Threshold_02_Pct == 1, 1,0),
           TrueN_02 = ifelse(develop  == 0 & Threshold_02_Pct == 0, 1,0),
           TrueP_07 = ifelse(develop  == 1 & Threshold_07_Pct == 1, 1,0),
           TrueN_07 = ifelse(develop  == 0 & Threshold_07_Pct == 0, 1,0)) %>%
    dplyr::select(., starts_with("True")) %>%
    gather(Variable, Value, -geometry) %>%
    st_cast("POLYGON")%>%
    st_transform(102286)

ggplot(data=ConfusionMatrix.metrics) +
  geom_point(aes(x=xyC(ConfusionMatrix.metrics)[,1], 
                 y=xyC(ConfusionMatrix.metrics)[,2], colour = as.factor(Value))) +
  facet_wrap(~Variable) +
  scale_colour_manual(values = palette2, labels=c("Correct","Incorrect"),
                       name="") +
  labs(title="Development predictions - Low threshold") +coord_sf(crs = st_crs(102286)) + mapTheme()

5. Predicting land cover demand for 2020

The model has now been established, but it uses 2001 characteristics to predict urban expansion between 2001 and 2011. In this section we update the features to reflect 2010 conditions, so that the new model's forecast refers to roughly 2020.

Here, we update two features in our model. First, the population will change, so the pop_Change feature must be recalculated. Second, lagDevelopment is recomputed with respect to the updated development layer.

dat <-
  dat %>%
  mutate(lagDevelopment = nn_function(xyC(.), xyC(filter(.,developed10 == 1)),2))

Here we enter the 2020 county population projection in a data frame for Boston. The population change can then be calculated and plotted.

countyPopulation_2020 <- 
  data.frame(
   NAME = 
     c("Boston"),
   county_projection_2020 = 
     c(633333)) %>%
   left_join(
     dat %>%
       st_set_geometry(NULL) %>%
       group_by(NAME) %>%
       summarize(county_population_2010 = round(sum(pop_2010))))

countyPopulation_2020 %>%
  gather(Variable,Value, -NAME) %>%
  ggplot(aes(reorder(NAME,-Value),Value)) +
  geom_bar(aes(fill=Variable), stat = "identity") +
  scale_fill_manual(values = palette2,
                    labels=c("2020","2010"),
                    name="Population") +
  labs(title="Population Change by county: 2010 - 2020",
       x="County", y="Population") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  plotTheme()

5.1 Predicting development demand

The 2020 county population projection is joined to dat and pop_Change is recalculated. Model 6 is then used to predict development demand for 2020. The following map shows the predicted development demand in 2020; the northeast and southwest show high demand for development.

dat_infill <-
  dat %>%
  #calculate population change
    left_join(countyPopulation_2020) %>%
    mutate(proportion_of_county_pop = pop_2010 / county_population_2010,
           pop_2020.infill = proportion_of_county_pop * county_projection_2020,
           pop_Change = round(pop_2020.infill - pop_2010, 2)) %>%
    dplyr::select(-county_projection_2020, -county_population_2010, 
                  -proportion_of_county_pop, -pop_2020.infill) %>%
  #predict for 2020
    mutate(predict_2020.infill = predict(Model6,. , type="response"))

dat_infill %>%
  ggplot() +  
  geom_point(aes(x=xyC(dat_infill)[,1], y=xyC(dat_infill)[,2], colour = factor(ntile(predict_2020.infill,5)))) +
  scale_colour_manual(values = palette5,
                    labels=substr(quintileBreaks(dat_infill,"predict_2020.infill"),1,4),
                    name="Quintile\nBreaks") +
  geom_sf(data=BostonMSA, fill=NA, colour="black", size=1) +
  labs(title= "Development Demand in 2020: Predicted Probabilities") +
  mapTheme()

6. Comparing predicted development demand & environmental sensitivity

In the last section we estimated Boston's future development demand. However, in the actual planning and development process, it is necessary to consider not only the demand but also the supply of land, because environmentally sensitive land cannot be developed on a large scale. Comparing demand and supply can help city planners decide which land should be developed and which should not.

6.1 2011 Land Cover data

First, we read the 2011 land cover data. The following chart gives a brief look at the 2001 and 2011 land cover.

setwd("/Users/zhangjing/Desktop/zhaoanyang/data")

newproj <- "+proj=lcc +lat_1=41.71666666666667 +lat_2=42.68333333333333 +lat_0=41 +lon_0=-71.5 +x_0=200000 +y_0=750000 +ellps=GRS80 +units=m +no_defs "
lc_2011 <- raster("boston_2011.tif")
lc_2011 <- projectRaster(lc_2011, crs=newproj)

developed11 <- lc_2011 == 21 | lc_2011 == 22 | lc_2011 == 23 | lc_2011 == 24
forest11 <- lc_2011 == 41 | lc_2011 == 42 | lc_2011 == 43 
farm11 <- lc_2011 == 81 | lc_2011 == 82 
wetlands11 <- lc_2011 == 90 | lc_2011 == 95 
otherUndeveloped11 <- lc_2011 == 52 | lc_2011 == 71 | lc_2011 == 31 
water11 <- lc_2011 == 11

names(developed11) <- "developed11"
names(forest11) <- "forest11"
names(farm11) <- "farm11"
names(wetlands11) <- "wetlands11"
names(otherUndeveloped11) <- "otherUndeveloped11"
names(water11) <- "water11"

ggplot() +
  geom_sf(data=BostonMSA) +
  geom_raster(data = rbind(rast(lc_2001) %>% mutate(label = "2001"),
                           rast(lc_2011) %>% mutate(label = "2011")) %>% 
              na.omit %>% filter(value > 0), 
              aes(x,y,fill=as.factor(value))) +
  facet_wrap(~label) +
  scale_fill_viridis(discrete=TRUE, name ="") +
  labs(title = "Land Cover, 2001 & 2011a") +
  mapTheme() + theme(legend.position = "none")

Each 2011 raster is aggregated to the fishnet using the aggregateRaster function defined above. The following map shows the 2011 land cover.

theRasterList11 <- c(developed11,forest11,farm11,wetlands11,otherUndeveloped11,water11)

dat2 <-
  aggregateRaster(theRasterList11, dat) %>%
  dplyr::select(developed11,forest11,farm11,wetlands11,otherUndeveloped11,water11) %>%
  st_set_geometry(NULL) %>%
  bind_cols(.,dat) %>%
  st_sf() %>%
  st_cast("POLYGON")

dat2 %>%
  gather(var,value,developed11:water11) %>%
  st_centroid() %>%
  mutate(X = st_coordinates(.)[,1],
         Y = st_coordinates(.)[,2]) %>%
  ggplot() +
    geom_sf(data=BostonMSA) +
    geom_point(aes(X,Y, colour=as.factor(value))) +
    facet_wrap(~var) +
    scale_colour_manual(values = palette2,
                        labels=c("Other","Land Cover"),
                        name = "") +
    labs(title = "Land cover types, 2011",
         subtitle = "As fishnet centroids") +
   mapTheme()

6.2 Sensitive land cover lost

Here we define sensitive land lost as land that was forest or wetland in 2001 but was no longer so in 2011. The following map of sensitive land lost shows how development in Boston has affected the natural environment.

dat2 <-
  dat2 %>%
   mutate(sensitive_lost11 = ifelse(forest == 1 & forest11 == 0 |
                                    wetlands == 1 & wetlands11 == 0,1,0))
                      
ggplot() +
  geom_point(data=dat2, aes(x=xyC(dat2)[,1], y=xyC(dat2)[,2], colour=as.factor(sensitive_lost11))) +
  scale_colour_manual(values = palette2,
                      labels=c("No Change","Sensitive Lost"),
                      name = "") +
  labs(title = "Sensitive lands lost: 2001 - 2011",
       subtitle = "As fishnet centroids") +coord_sf(crs = st_crs(102286))+
  mapTheme()

6.3 Summary

The following chart compares the demand-side indicators (black bars) with the supply-side indicators (suitable land in green, land not suitable for development in red). There is little suitable undeveloped land available and a great deal of sensitive land, while the demand for development is high. Urban expansion pressure in Boston is therefore acute.

county_specific_metrics <- 
  dat2 %>%
  #predict development demand from our model
  mutate(Development_Demand = predict(Model6, dat2, type="response")) %>%
  #get a count of grid cells by county which we can use to calculate rates below
  left_join(st_set_geometry(dat, NULL) %>% group_by(NAME) %>% summarize(count = n())) %>%
  #calculate summary statistics by county
  group_by(NAME) %>%
  summarize(Total_Farmland = sum(farm11) / max(count),
            Total_Forest = sum(forest11) / max(count),
            Total_Wetlands = sum(wetlands11) / max(count),
            Total_Undeveloped = sum(otherUndeveloped11) / max(count),
            Sensitive_Land_Lost = sum(sensitive_lost11) / max(count),
            Mean_Development_Demand = mean(Development_Demand)) %>%
  #get population data by county
  left_join(countyPopulation_2020 %>% 
            mutate(Population_Change = county_projection_2020 - county_population_2010,
                   Population_Change_Rate = Population_Change / county_projection_2020) %>%
            dplyr::select(NAME,Population_Change_Rate))

county_specific_metrics %>%
  gather(Variable, Value, -NAME, -geometry) %>%
  mutate(Variable = factor(Variable, levels=c("Population_Change_Rate","Mean_Development_Demand",
                                              "Total_Farmland","Total_Undeveloped","Total_Forest",
                                              "Total_Wetlands","Sensitive_Land_Lost"),
                           ordered = TRUE)) %>%
  mutate(Planning_Designation = case_when(
    Variable == "Population_Change_Rate" | Variable == "Mean_Development_Demand" ~ "Demand-Side",
    Variable == "Total_Farmland" | Variable == "Total_Undeveloped"               ~ "Suitable",
    TRUE                                                                         ~ "Not Suitable")) %>%
  ggplot(aes(x=Variable, y=Value, fill=Planning_Designation)) +
    geom_bar(stat="identity", position=position_dodge(), colour="black") +
    facet_wrap(~NAME, ncol=5) +
    coord_flip() +
    scale_y_continuous(breaks = seq(.25, 1, by = .25)) +
    geom_vline(xintercept = 2.5) + geom_vline(xintercept = 4.5) +
    scale_fill_manual(values=c("black","red","darkgreen")) +
    labs(title= "County Specific Allocation Metrics", subtitle= "As rates", x="Indicator", y="Rate") +
    plotTheme() + theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position="bottom")

7. Allocation

The allocation is the deliverable of this project. For the final allocation, we use our model to identify the places that are not suitable for development, drawn in transparent red, and overlay them on the predicted development demand and the projected population.

This gives city planners a reference for where urban sprawl is likely to occur on land that is not suitable for development.

From the final results, it can be seen that the center and the south contain areas with high development potential and demand that are also environmentally sensitive; this can guide planners' decisions about whether these areas should be developed as the city continues to expand.

Boston <-
  dat2 %>%
    mutate(Development_Demand = predict(Model6, dat2, type="response")) %>%
    filter(NAME == "Boston") 

Boston_landUse <- rbind(
  filter(Boston, forest11 == 1 | wetlands11 == 1 ) %>%
  dplyr::select() %>% mutate(Land_Use = "Not Suitable"))



ggplot() +
  geom_sf(data=Boston, aes(fill=factor(ntile(Development_Demand,5))), colour=NA) +
  geom_point(data=Boston_landUse, aes(x=xyC(Boston_landUse)[,1], 
                                        y=xyC(Boston_landUse)[,2], colour=Land_Use),
                                        shape = 15, size = 2) +
  scale_fill_manual(values = palette5, name="Development\nDemand",
                    labels=substr(quintileBreaks(Boston,"Development_Demand"),1,5)) +
  scale_colour_manual(values = alpha(c("red", "red"), .1)) + 
  labs(title = "Development Potential, 2020: Boston") + mapTheme() +
  guides(fill = guide_legend(order = 1), colour = guide_legend(order = 2))

ggplot() +
  geom_sf(data=Boston, aes(fill=factor(ntile(pop_Change,5))), colour=NA) +
  geom_point(data=Boston_landUse, aes(x=xyC(Boston_landUse)[,1], 
                                        y=xyC(Boston_landUse)[,2], colour=Land_Use),
                                        shape = 15, size = 2) +
  scale_fill_manual(values = palette5, name="Population\nChange",
                    labels=substr(quintileBreaks(Boston,"pop_Change"),1,5)) +
  scale_colour_manual(values = alpha(c("red", "red"), .1)) + 
  labs(title = "Projected Population, 2020: Boston") + mapTheme() +
  guides(fill = guide_legend(order = 1), colour = guide_legend(order = 2))

7.1 Limitations

There are a few limitations of this project that could be improved in the future.

  • Boston should be divided into more neighborhoods to make the prediction finer-grained.
  • Collect more features for building the model, for example proximity to the subway.
  • We need more features to capture demand.