As the economy of Louisville, KY accelerates, the city is innovating its transportation infrastructure on multiple fronts. Shared scooters, which travel at relatively low speeds and require little physical effort from riders, are a reasonable alternative to driving motor vehicles and to traditional walking and cycling. In response to the “Move Louisville” transportation plan, implemented since 2016, our consulting team seeks to conduct a spatiotemporal comparative analysis of dockless vehicle usage patterns in Louisville that will inform scooter deployment and re-balancing decisions.
As in most American cities, one of the most cost-effective ways to reduce vehicle miles traveled in Louisville is to build dockless vehicle systems such as bike-share and e-scooter services. Specifically, scooters are intended mainly for short journeys in the urban environment, for purposes such as shopping, recreation, and social life. Most scooter users live in urban areas, where scooters serve daily transportation needs such as trips from home to shops, doctor appointments, and visits to friends. However, these shared systems face one big operational challenge known as “re-balancing” - getting scooters to stations that are anticipated to have demand but currently lack scooters. Figuring out how to solve the re-balancing problem is one of the keys to operating a successful system in Louisville.
Our team has therefore designed smart city solutions that will help shift short trips away from cars and forecast scooter demand across stations in Louisville. We propose a prediction model built in three steps: exploring spatial and temporal usage patterns, engineering street, weather, census, and time-lag features, and training and validating regression models.
#import libraries
library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(data.table)
library(ggrepel)
library(reticulate)
library(readr)
use_python("/usr/local/bin/python")
plotTheme <- theme(
plot.title =element_text(size=12),
plot.subtitle = element_text(size=8),
plot.caption = element_text(size = 6),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
axis.title.y = element_text(size = 10),
# Set the entire chart region to blank
panel.background=element_blank(),
plot.background=element_blank(),
#panel.border=element_rect(colour="#F0F0F0"),
# Format the grid
panel.grid.major=element_line(colour="#D0D0D0",size=.2),
axis.ticks=element_blank())
mapTheme <- theme(plot.title =element_text(size=12),
plot.subtitle = element_text(size=8),
plot.caption = element_text(size = 6),
axis.line=element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank(),
panel.background=element_blank(),
panel.border=element_blank(),
panel.grid.major=element_line(colour = 'transparent'),
panel.grid.minor=element_blank(),
legend.direction = "vertical",
legend.position = "right",
plot.margin = margin(1, 1, 1, 1, 'cm'),
legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))
mapTheme2 <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2)
)
}
mapTheme3 <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2)
)
}
palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette2 <- c("#6baed6","#08519c")
We first read in street network data using the Python package OSMnx. The Louisville, KY extract contains approximately 117,227 road segments and 38,365 intersections across both urban and rural areas. These roads include primary and secondary highways, driveways, cycleways, motorways, walkways, pedestrian paths, and more. Some streets are one-way; some have speed limits. These factors may affect scooter usage, so we first explore street hierarchy in sample areas of Louisville.
#Python code used to collect street network in Louisville.
#L = ox.graph_from_place('Louisville, Kentucky, USA', network_type='all')
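For readers working in R rather than Python, a roughly equivalent query can be issued with the osmdata package; this is only a hedged sketch of that alternative, not the code used to produce the figures below.
# Sketch: pull major streets for Louisville from the Overpass API with osmdata.
# The published figures were generated with OSMnx in Python instead.
# Note: querying a whole county can take a while.
library(osmdata)
major_streets <- opq("Louisville, Kentucky, USA") %>%
  add_osm_feature(key = "highway", value = c("primary", "secondary")) %>%
  osmdata_sf()
ggplot(major_streets$osm_lines) +
  geom_sf(size = 0.3) +
  labs(title = "Primary and secondary streets, Louisville (OSM)") +
  mapTheme2()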
Louisville Street Network from Python OSMnx
library(magick)
street <- image_read('/Users/zixiliu/Desktop/LouisvilleStreets.png')
image_scale(street,"1800")
Louisville Primary, Secondary Street Network vs. Residential Street Network
primary <- image_read('/Users/zixiliu/Desktop/primary.png')
residential <- image_read('/Users/zixiliu/Desktop/residential.png')
residential <- image_resize(residential, "2040")
image_append(c(primary, residential))
When inspecting the Louisville OSM data in urban areas, it became apparent that Downtown Louisville has the highest density of street networks, primarily in the Central Business District; other neighborhoods such as Cherokee Triangle and University also have very dense street networks. Here we visualize road networks in Downtown, Cloverleaf, and University, and Downtown and University clearly have denser streets.
Downtown, Cloverleaf, University Street Network (3 km radius)
downtown <- image_read('/Users/zixiliu/Desktop/downtown.png')
university <- image_read('/Users/zixiliu/Desktop/university.png')
cloverleaf <- image_read('/Users/zixiliu/Desktop/cloverleaf.png')
image_append(c(downtown, cloverleaf, university))
Another aspect we want to inspect is one-way vs. two-way streets. Downtown and University clearly have more one-way streets than Cloverleaf.
Downtown, Cloverleaf, University Oneway Streets (3 km radius)
Now that we have a clearer sense of the distribution of street networks in Louisville and how street density may affect scooter usage, we move on to inspect the dockless vehicle data.
We have collected data on dockless (scooter and bike) trips across the city from all providers, including Bird, Lime, and Spin. Here is a brief description of the variables in the dockless vehicle data:
TripID - a unique ID created by Louisville Metro
StartDate - in YYYY-MM-DD format
StartTime - rounded to the nearest 15 minutes in HH:MM format
EndDate - in YYYY-MM-DD format
EndTime - rounded to the nearest 15 minutes in HH:MM format
TripDuration - duration of the trip in minutes
TripDistance - distance of trip in miles based on company route data
StartLatitude - rounded to nearest 3 decimal places
StartLongitude - rounded to nearest 3 decimal places
EndLatitude - rounded to nearest 3 decimal places
EndLongitude - rounded to nearest 3 decimal places
DayOfWeek - 1-7 based on date, 1 = Sunday through 7 = Saturday, useful for analysis
HourNum - the hour of the StartTime, from 0-24, useful for analysis
Load the data and join service region and neighborhood information to it.
dockless <- read_csv('/Users/zixiliu/Downloads/DocklessTripOpenData_7.csv')
## Warning: 1683 parsing failures.
## row col expected actual file
## 71 EndTime valid date 24:00 '/Users/zixiliu/Downloads/DocklessTripOpenData_7.csv'
## 605 EndTime valid date 24:00 '/Users/zixiliu/Downloads/DocklessTripOpenData_7.csv'
## 840 StartTime valid date 24:00 '/Users/zixiliu/Downloads/DocklessTripOpenData_7.csv'
## 840 EndTime valid date 24:00 '/Users/zixiliu/Downloads/DocklessTripOpenData_7.csv'
## 960 EndTime valid date 24:00 '/Users/zixiliu/Downloads/DocklessTripOpenData_7.csv'
## ... ......... .......... ...... .....................................................
## See problems(...) for more details.
dockless <- dockless %>% filter(is.na(StartDate)==FALSE) %>%
filter(is.na(StartTime)==FALSE) %>%
filter(is.na(EndDate)==FALSE) %>%
filter(is.na(EndTime)==FALSE) %>%
filter(is.na(StartLatitude )==FALSE) %>%
filter(is.na(StartLongitude )==FALSE)%>%
filter(is.na(EndLatitude )==FALSE)%>%
filter(is.na(EndLongitude )==FALSE)
service = st_read("/Users/zixiliu/Downloads/Dockless Vehicle Distribution Zones v2/Dockless_Vehicle_Distribution_Zones.shp")%>%st_transform(4326)
## Reading layer `Dockless_Vehicle_Distribution_Zones' from data source `/Users/zixiliu/Downloads/Dockless Vehicle Distribution Zones v2/Dockless_Vehicle_Distribution_Zones.shp' using driver `ESRI Shapefile'
## Simple feature collection with 9 features and 2 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: 1184528 ymin: 239861.3 xmax: 1247259 ymax: 292598.8
## epsg (SRID): 2246
## proj4string: +proj=lcc +lat_1=37.96666666666667 +lat_2=38.96666666666667 +lat_0=37.5 +lon_0=-84.25 +x_0=500000.0001016001 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=us-ft +no_defs
dockless = st_join(dockless %>%
st_as_sf(., coords = c("StartLongitude", "StartLatitude"), crs = 4326) ,
service %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE)
dockless <- dockless %>% mutate(StartLongitude = unlist(map(geometry, 1)),
StartLatitude = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
select(-geometry)%>% filter(is.na(Pl2040Area)==FALSE)
nh = st_read("https://opendata.arcgis.com/datasets/6e3dea8bd9cf49e6a764f7baa9141a95_30.geojson")%>%st_transform(4326)
## Reading layer `6e3dea8bd9cf49e6a764f7baa9141a95_30' from data source `https://opendata.arcgis.com/datasets/6e3dea8bd9cf49e6a764f7baa9141a95_30.geojson' using driver `GeoJSON'
## Simple feature collection with 92 features and 5 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: -85.83734 ymin: 38.1399 xmax: -85.59608 ymax: 38.28157
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
dockless = st_join(dockless %>%
st_as_sf(., coords = c("StartLongitude", "StartLatitude"), crs = 4326) ,
nh %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE)
dockless <- dockless %>% mutate(StartLongitude = unlist(map(geometry, 1)),
StartLatitude = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
select(-geometry)%>% filter(is.na(NH_NAME)==FALSE)
Dockless Vehicle vs. Building Footprint Data Visualization, Louisville
scooter <- image_read('/Users/zixiliu/Desktop/scooters.png')
image_scale(scooter,"1800")
We then take a closer look at dockless vehicle stations in Downtown Louisville, where dockless scooter usage is most clustered. Interestingly, some areas with dense buildings do not see much scooter usage.
Downtown Dockless Vehicle vs. Building Footprint Data Visualization, Louisville
Dscooter <- image_read('/Users/zixiliu/Desktop/Dscooter.png')
image_scale(Dscooter,"1800")
We then use choropleth maps to get a sense of how trip counts vary across neighborhoods.
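For reference, a quantile version of this choropleth could be rebuilt directly in R along the following lines, using the nh polygons, trip counts from dockless, and the palette5 and mapTheme objects defined above; this is only a sketch, since the published maps were rendered separately and read in as PNGs.
# Sketch: neighborhood-level trip counts binned into quintiles and mapped.
nh %>%
  left_join(dockless %>% count(NH_NAME, name = "trips"), by = "NH_NAME") %>%
  mutate(trip_quintile = ntile(trips, 5)) %>%
  ggplot() +
  geom_sf(aes(fill = factor(trip_quintile)), colour = "white", size = 0.1) +
  scale_fill_manual(values = palette5, name = "Trips\n(quintile)", na.value = "grey90") +
  labs(title = "Dockless vehicle trips by neighborhood, Louisville") +
  mapTheme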
Dockless Vehicle Usage Quantile Choropleth Map, Louisville
choropleth <- image_read('/Users/zixiliu/Desktop/choro.png')
image_scale(choropleth,"1800")
Dockless Vehicle Usage Quantize Choropleth Map, Louisville
Quantize <- image_read('/Users/zixiliu/Desktop/quan.png')
image_scale(Quantize,"1800")
From the quantile and quantize choropleth maps, we can see that Downtown and University have the most scooter trips and that our data is imbalanced with respect to trip counts. We then explore the origin-destination flows in May - Aug 2019 to get a sense of our riders’ trips.
Dockless Vehicle Origin Destination Map, Louisville
OD <- image_read('/Users/zixiliu/Desktop/OD.png')
image_scale(OD,"1800")
We want to make sure we have cleaned all the NAs.
any(is.na(dockless))
## [1] FALSE
We then remove outliers (locations outside Louisville and unusually long trips) and filter the data to May - August 2019.
dockless <- dockless %>% filter(TripDistance <= 40 & TripDistance > 0 ) %>%
filter(StartDate >= "2019-05-01" & StartDate <= "2019-08-31") %>%
filter(StartLatitude < 38.75 & StartLatitude > 37.75 ) %>%
filter(EndLatitude < 38.75 & EndLatitude > 37.75) %>%
filter(StartLongitude > -86.36 & StartLongitude < -85.16) %>%
filter(EndLongitude > -86.36 & EndLongitude < -85.16) %>%
filter(TripDuration > 0 & TripDuration < 480)
dockless$StartInterval <- as.POSIXct(strptime(paste(dockless$StartDate,dockless$StartTime),"%Y-%m-%d %H:%M:%S"))
dockless <- dockless %>% mutate(interval60 = floor_date(ymd_hms(StartInterval), unit = "hour"),
interval15 = floor_date(ymd_hms(StartInterval), unit = "15 mins"),
week = week(StartInterval))
# Per the data dictionary above, DayOfWeek is coded 1 = Sunday through 7 = Saturday
dockless$DayOfWeek <- ordered(dockless$DayOfWeek, levels = 1:7,
labels = c("Sun","Mon","Tue","Wed","Thu","Fri","Sat"))
dockless$HourNum <- as.numeric(dockless$HourNum)
glimpse(dockless)
## Observations: 171,107
## Variables: 24
## $ TripID <chr> "00006088-2579-e0d0-6a30-a15bb878", "00008c1a-899…
## $ StartDate <date> 2019-08-21, 2019-07-03, 2019-05-04, 2019-05-07, …
## $ StartTime <time> 17:30:00, 11:00:00, 21:15:00, 17:30:00, 13:30:00…
## $ EndDate <date> 2019-08-21, 2019-07-03, 2019-05-04, 2019-05-07, …
## $ EndTime <time> 17:30:00, 11:15:00, 21:30:00, 18:00:00, 13:45:00…
## $ TripDuration <dbl> 6, 6, 7, 32, 10, 19, 10, 6, 13, 45, 1, 1, 5, 13, …
## $ TripDistance <dbl> 0.300, 0.640, 0.684, 1.740, 0.808, 0.497, 1.429, …
## $ EndLatitude <dbl> 38.265, 38.221, 38.223, 38.254, 38.257, 38.246, 3…
## $ EndLongitude <dbl> -85.739, -85.763, -85.764, -85.707, -85.760, -85.…
## $ DayOfWeek <ord> Thu, Thu, Sun, Wed, Mon, Sat, Tue, Sun, Wed, Sat,…
## $ HourNum <dbl> 17, 11, 21, 17, 13, 5, 18, 18, 11, 23, 8, 19, 15,…
## $ Dist_Zone <int> 3, 6, 6, 3, 2, 2, 5, 2, 6, 5, 6, 5, 5, 5, 2, 2, 2…
## $ Pl2040Area <fct> Northeast Core, University, University, Northeast…
## $ OBJECTID <int> 6, 45, 22, 10, 8, 8, 18, 8, 45, 37, 22, 18, 18, 3…
## $ NH_CODE <int> 13, 71, 47, 22, 17, 17, 20, 17, 71, 37, 47, 20, 2…
## $ NH_NAME <fct> BUTCHERTOWN, UNIVERSITY, OLD LOUISVILLE, CLIFTON,…
## $ SHAPEAREA <dbl> 25600550, 22759446, 33395329, 19009889, 33029828,…
## $ SHAPELEN <dbl> 22476.08, 32175.69, 31463.93, 17779.81, 27923.75,…
## $ StartLongitude <dbl> -85.736, -85.757, -85.762, -85.717, -85.758, -85.…
## $ StartLatitude <dbl> 38.263, 38.217, 38.221, 38.254, 38.251, 38.246, 3…
## $ StartInterval <dttm> 2019-08-21 17:30:00, 2019-07-03 11:00:00, 2019-0…
## $ interval60 <dttm> 2019-08-21 17:00:00, 2019-07-03 11:00:00, 2019-0…
## $ interval15 <dttm> 2019-08-21 17:30:00, 2019-07-03 11:00:00, 2019-0…
## $ week <int> 34, 27, 18, 19, 23, 26, 22, 31, 21, 32, 21, 33, 3…
Import Census Data
Using the tidycensus package, we can download census geographies and variables. These are used to test generalizability later, but we do not use them as independent variables because they end up perfectly collinear with the station fixed effects. We extract the tracts for mapping and joining purposes, creating an sf object that consists only of GEOIDs and geometries.
We then add this spatial information to our data as origin and destination fields, joining first the origin and then the destination of each trip to the census tracts.
LouisvilleCensus <-
get_acs(geography = "tract",
variables = c("B01003_001", "B19013_001",
"B02001_002", "B08013_001",
"B08012_001", "B08301_001",
"B08301_010", "B01002_001"),
year = 2017,
state = "KY",
geometry = TRUE,
county=c("Jefferson"),
output = "wide") %>%
rename(Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
Med_Age = B01002_001E,
White_Pop = B02001_002E,
Travel_Time = B08013_001E,
Num_Commuters = B08012_001E,
Means_of_Transport = B08301_001E,
Total_Public_Trans = B08301_010E) %>%
select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
Means_of_Transport, Total_Public_Trans,
Med_Age,
GEOID, geometry) %>%
mutate(Percent_White = White_Pop / Total_Pop,
Mean_Commute_Time = Travel_Time / Total_Public_Trans,
Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)
LouisvilleTracts <-
LouisvilleCensus %>%
as.data.frame() %>%
distinct(GEOID, .keep_all = TRUE) %>%
select(GEOID, geometry) %>%
st_sf
dockless = dockless%>% select(-"StartInterval")
dat_census <- st_join(dockless %>%
filter(is.na(StartLongitude) == FALSE &
is.na(StartLatitude) == FALSE &
is.na(EndLongitude) == FALSE &
is.na(EndLatitude) == FALSE) %>%
st_as_sf(., coords = c("StartLongitude", "StartLatitude"), crs = 4326),
LouisvilleTracts %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE) %>%
rename(Origin.Tract = GEOID) %>%
mutate(StartLongitude = unlist(map(geometry, 1)),
StartLatitude = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
select(-geometry)%>%
st_as_sf(., coords = c("EndLongitude", "EndLatitude"), crs = 4326) %>%
st_join(., LouisvilleTracts %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE) %>%
rename(Destination.Tract = GEOID) %>%
mutate("EndLongitude" = unlist(map(geometry, 1)),
"EndLatitude" = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
select(-geometry)
Dockless Vehicle Temporal Usage Patterns, Louisville
There is a lot that we can explore pertaining to scooter usage patterns in Louisville.
What is the distribution of scooter trip counts in Louisville by day of week? How does it differ between weekdays and weekends? And if we break the data down by service region, what does usage look like in each region?
We begin by examining the time and frequency components of our data.
First, we look at the overall time pattern - there is clearly a daily periodicity, with recurring peaks and lulls over the course of each week.
ggplot(dat_census %>%
group_by(interval60) %>%
tally())+
geom_line(aes(x = interval60, y = n))+
labs(title="Dockless Vehicle trips in Louisville, May - Aug, 2019",
x="Month",
y="Number of trips")+
plotTheme
a <- ggplot(dat_census)+
geom_freqpoly(aes(HourNum, color = DayOfWeek), stat = "count",binwidth = 1)+
labs(title="Dockless Vehicle Trips in Louisville, by day of the week, May - Aug 2019",
x="Hour",
y="Trip Counts")+
plotTheme
## Warning: Ignoring unknown parameters: binwidth
b <- ggplot(dat_census %>%
mutate(weekend = ifelse(DayOfWeek %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
geom_freqpoly(aes(HourNum, color = weekend), binwidth = 1)+
labs(title="Dockless Vehicle Trips in Louisville, weekend vs weekday, May - Aug 2019",
x="Hour",
y="Trip Counts")+
plotTheme
grid.arrange(a,b)
In general, there is more scooter usage on weekends than on weekdays, and more demand in the afternoon and evening (roughly 14:00-20:00) than in other time periods.
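A quick tabulation backs this up; the sketch below counts trips by weekend/weekday and hour and pulls the three busiest hours of each (assuming the dat_census object built above).
# Top three hours by trip count, separately for weekdays and weekends.
dat_census %>%
  mutate(weekend = ifelse(DayOfWeek %in% c("Sun", "Sat"), "Weekend", "Weekday")) %>%
  count(weekend, HourNum) %>%
  group_by(weekend) %>%
  top_n(3, n) %>%
  arrange(weekend, desc(n))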
Dockless Vehicle Spatial Usage Patterns, Louisville
df <- dat_census %>% group_by(Pl2040Area, DayOfWeek) %>% tally()
gp <- ggplot(df)
gp + geom_bar(aes(x=DayOfWeek, y = n), stat= "identity") +
facet_wrap(~ Pl2040Area,ncol = 3) +
labs(x="Day Of Week",
y="Trip Counts") +
ggtitle(label = "Trip Counts by Day of Week, Per Service Region")
We break the usage data down by service region and see that three regions account for most usage: Downtown has the most scooter trips, followed by University and Southeast Core. In Downtown there is more scooter usage on weekends than on weekdays, whereas in University there is more usage on weekdays. This is probably because people use scooters to commute to campus Monday through Friday in University, and to explore the city in downtown areas on weekends.
This implies our data is imbalanced with respect to stations, and we need to keep this in mind when building prediction models.
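One way to quantify that imbalance is to look at each service region’s share of all trips; a minimal check, assuming the dat_census object and Pl2040Area field created above:
# Trip counts and shares by service region, largest first.
dat_census %>%
  count(Pl2040Area, sort = TRUE) %>%
  mutate(share = round(n / sum(n), 3))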
Dockless Vehicle Usage Cluster Map, Louisville
The dendrogram cluster heatmap reinforces how uneven usage is across stations and times of day. We speculate that dockless vehicle usage in Louisville is closely related to weather conditions and street characteristics, so we will create those features next.
Import Weather Data
We import and visualize the weather data collected at Louisville International Airport from May-Aug, 2019.
weather.Panel <-
riem_measures(station = "SDF", date_start = "2019-05-01", date_end = "2019-08-31") %>%
replace(is.na(.), 0) %>%
mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
mutate(week = week(interval60),
DayOfWeek = wday(interval60)) %>%
group_by(interval60) %>%
summarize(Temperature = max(tmpf),
Precipitation = sum(p01i),
Wind_Speed = max(sknt),
Visibility = max(vsby)) %>%
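  # temperatures recorded as 0 (missing values zero-filled above) are replaced with a placeholder value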
mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
glimpse(weather.Panel)
## Observations: 2,928
## Variables: 5
## $ interval60 <dttm> 2019-05-01 00:00:00, 2019-05-01 01:00:00, 2019-05…
## $ Temperature <dbl> 77.0, 75.0, 72.0, 75.0, 73.0, 72.0, 73.0, 72.0, 72…
## $ Precipitation <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.…
## $ Wind_Speed <dbl> 11, 8, 8, 9, 8, 8, 9, 8, 9, 10, 9, 9, 13, 16, 16, …
## $ Visibility <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10…
grid.arrange(
ggplot(weather.Panel, aes(interval60, Precipitation)) + geom_line() +
labs(title="Precipitation", x="Hour", y="Precipitation") + plotTheme,
ggplot(weather.Panel, aes(interval60, Wind_Speed)) + geom_line() +
labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
ggplot(weather.Panel, aes(interval60, Temperature)) + geom_line() +
labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
ggplot(weather.Panel, aes(interval60, Visibility)) + geom_line() +
labs(title="Visibility", x="Hour", y="Visibility") + plotTheme,
top="Weather Data - Louisville SDF - May - Aug, 2019")
Join Street Information
We join street information in Louisville to each scooter station, including street name, highway class (primary/secondary), and whether the street is one-way. We use sjoin and nearest_points from the Python packages GeoPandas and Shapely to find the nearest street LineString to each scooter coordinate and then merge the two datasets.
dat_census['Stations'] = paste(dat_census$StartLatitude, dat_census$StartLongitude)
#Python Code for Reference
#riders = pd.read_csv('/content/drive/My Drive/census.csv')
#geometry = [Point(xy) for xy in zip(riders.StartLongitude, riders.StartLatitude)]
#riders['geometry']= geometry
#crs = {'init': 'epsg:4326'}
#startPoints = GeoDataFrame(riders, crs=crs)
#data = pd.DataFrame(data)
#data = data[-data['geometry'].isna()]
#dataLines = GeoDataFrame(data, crs=crs)
#geometry = [MultiPoint(list((x.coords))) for x in dataLines.geometry]
#dataLines['geometry'] = geometry
#strLine = pd.DataFrame(dataLines)
#scoPts = pd.DataFrame(startPoints)
#nearest = [nearest_points(x,y) for x, y in zip(scoPts.geometry, strLine.geometry)]
#nearestPts = GeoDataFrame(nearest, crs=crs)
#nearestPts = GeoDataFrame(nearestPts[0]) #original scooter locations
#nearestPts.columns = ['geometry']
#nearestP = GeoDataFrame(nearest, crs=crs)
#onStrPts = GeoDataFrame(nearestP[1]) #street locations
#onStrPts.columns = ['geometry']
#roads_region = gpd.sjoin(onStrPts, dataLines, how="inner", op = 'intersects')
#nearestPts = pd.DataFrame(nearestPts)
#joinPts = roads_region.reset_index().drop_duplicates(subset='index', #keep='first').sort_values('index').set_index('index')
#joinPts = pd.DataFrame(joinPts)
#joinPts.rename(columns = {'geometry':'geom'}, inplace = True)
#joinStr = pd.concat([nearestPts,joinPts], axis = 1)
#riders = GeoDataFrame(riders, crs = crs)
#joinStr = GeoDataFrame(joinStr, crs = crs)
#joinStr = joinStr.drop(['index_right'], axis=1)
#riders['lon'] = riders.geometry.x
#riders['lat'] = riders.geometry.y
#riders = pd.DataFrame(riders)
#riders = riders.drop(columns = ['geometry'])
#joinStr['lon'] = joinStr.geometry.x
#joinStr['lat'] = joinStr.geometry.y
#joinStr = pd.DataFrame(joinStr).drop(columns = ['geometry','geom'])
#joined = joinStr[['lat','lon','name','oneway','highway','length']]
#riders['geo'] = riders['lat'].astype(str)+'_'+riders['lon'].astype(str)
#joined['geo'] = joined['lat'].astype(str)+'_'+joined['lon'].astype(str)
#joined2 = joined.drop_duplicates(subset = 'geo')
#merged = pd.merge(riders, joined2, on = "geo", how="left")
#merged.to_csv('/content/drive/My Drive/mergedDock.csv')
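The same nearest-street join could also be expressed in R with sf::st_nearest_feature(); the toy example below is only a sketch of that idea with made-up geometries, not the pipeline used above.
# Toy data: two street segments with attributes, and one scooter start point.
streets_sf <- st_sf(
  name    = c("Main Street", "Market Street"),
  highway = c("primary", "secondary"),
  oneway  = c(TRUE, FALSE),
  geometry = st_sfc(
    st_linestring(rbind(c(-85.760, 38.250), c(-85.750, 38.250))),
    st_linestring(rbind(c(-85.760, 38.260), c(-85.750, 38.260))),
    crs = 4326))
start_point <- st_sf(
  TripID = "toy-trip",
  geometry = st_sfc(st_point(c(-85.755, 38.252)), crs = 4326))
# For each start point, find the index of the nearest street and copy its attributes.
idx <- st_nearest_feature(start_point, streets_sf)
start_point <- cbind(start_point, st_drop_geometry(streets_sf)[idx, c("name", "highway", "oneway")])
start_point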
merged <- read_csv("/Users/zixiliu/Downloads/mergedDock.csv")
## Warning: Missing column names filled in: 'X1' [1]
merged <- merged %>% na.omit()
We take a look at the ten streets with the most trip counts:
m = merged %>% group_by(name) %>% summarize("cts" = n()) %>% filter(is.na(name)==FALSE)
strNames = m[order(-m$cts),]$name[1:10]
strNames
## [1] "South 22nd Street" "Glaser Road" "Midland Avenue"
## [4] "Woodmont Park Lane" "Rosalind Court" "Stuart Avenue"
## [7] "Fontendleau Way" "Saint Xavier Street" "Cedar Street"
## [10] "Avanti Way"
Create Fishnet
length(unique(merged$Stations))
## [1] 2152
Since we have 2,152 unique stations in Louisville, we need to aggregate some of them into fishnet cells.
Boundary <- st_read("https://opendata.arcgis.com/datasets/70b8b06af5a2470d9d6a9037bac00c4b_0.geojson") %>%
st_transform(crs=102271) %>% filter(COUNTY %in% c("JEFFERSON"))
## Reading layer `70b8b06af5a2470d9d6a9037bac00c4b_0' from data source `https://opendata.arcgis.com/datasets/70b8b06af5a2470d9d6a9037bac00c4b_0.geojson' using driver `GeoJSON'
## Simple feature collection with 427 features and 32 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -89.21661 ymin: 36.50048 xmax: -82.30109 ymax: 39.12202
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
fishnet <-
st_make_grid(Boundary, cellsize = 1000) %>%
st_sf()
fishnet <-
fishnet[Boundary,] %>%
mutate(uniqueID = rownames(.)) %>%
dplyr::select(uniqueID)
ggplot() +
geom_sf(data=fishnet) +
labs(title = "Fishnet in Louisville") +
mapTheme2()
Join the fishnet to data.
merged <- st_join(merged %>% st_as_sf(., coords = c("StartLongitude", "StartLatitude"), crs = 4326),fishnet %>% st_transform(crs = 4326),join = st_intersects) %>%
mutate(StartLongitude = unlist(map(geometry, 1)),
StartLatitude = unlist(map(geometry, 2)))
One-Hot Encoding
Some of the categorical variables have too many levels, so we pre-process these features with one-hot encoding.
merged <- merged %>% mutate(weekend = ifelse(DayOfWeek %in% c("Sun", "Sat"), 1, 0))
merged <- merged %>% mutate( HW = ifelse(highway %in% c("residential", "primary", "secondary"), 1, 0))
merged <- merged %>% mutate( OneW = ifelse(oneway %in% c("TRUE"), 0, 1))
library(fastDummies)
merged <- fastDummies::dummy_cols(merged, select_columns = "NH_NAME")
merged <- merged %>% mutate(TopStr = ifelse(name %in% strNames, 1, 0))
First we have to make sure each unique station and hour/day combination exists in our data set. We start by determining the maximum number of combinations and then compare that to the actual number. We then create a data frame, study.panel, containing each unique space/time observation.
length(unique(merged$interval60)) * length(unique(merged$uniqueID))
## [1] 387528
study.panel <-
expand.grid(interval60=unique(merged$interval60),
uniqueID = unique(merged$uniqueID)) %>%
left_join(., merged %>%
select( uniqueID, StartLongitude, StartLatitude) %>%
group_by(uniqueID) %>%
slice(1))
## Joining, by = "uniqueID"
## Warning: Column `uniqueID` joining factor and character vector, coercing
## into character vector
nrow(study.panel)
## [1] 387528
We create the full panel by summarizing counts by station for each time interval, keeping census and lat/lon information for joining to other data later. We create categorical variables indicating, for example, whether a cell contains a primary/secondary/residential highway, whether it falls in the Central Business District, and whether it contains one-way streets.
ride.panel <-
merged %>%
mutate(Trip_Counter = 1) %>%
right_join(study.panel) %>%
group_by(interval60, uniqueID, StartLatitude,StartLongitude) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T),
Str = sum(TopStr, na.rm=T),
weekend = sum(weekend,na.rm=T),
HW = sum(HW,na.rm=T ),
OneW = sum(OneW,na.rm=T ),
CBD = sum(`NH_NAME_CENTRAL BUSINESS DISTRICT`,na.rm=T),
OL = sum(`NH_NAME_OLD LOUISVILLE`,na.rm=T),
BT = sum(`NH_NAME_BUTCHERTOWN`,na.rm=T),
UN = sum(`NH_NAME_UNIVERSITY`,na.rm=T),
PH = sum(`NH_NAME_PHOENIX HILL`,na.rm=T),
CT = sum(`NH_NAME_CHEROKEE TRIANGLE`,na.rm=T),
SJ = sum(`NH_NAME_SAINT JOSEPH`,na.rm=T),
TP = sum(`NH_NAME_TYLER PARK`,na.rm=T)) %>%
mutate(weekend = ifelse(weekend > 0, "1", "0"),
Str = ifelse(Str > 0, "1", "0"),
HW = ifelse(HW > 0, "1", "0"),
OneW = ifelse(OneW > 5, "1", "0"),
CBD = ifelse(CBD> 0, "1", "0"),
OL = ifelse(OL> 0, "1", "0"),
BT = ifelse(BT> 0, "1", "0"),
UN = ifelse(UN> 0, "1", "0"),
PH = ifelse(PH> 0, "1", "0"),
CT = ifelse(CT> 0, "1", "0"),
SJ = ifelse(SJ> 0, "1", "0"),
TP = ifelse(TP> 0, "1", "0")) %>%
left_join(weather.Panel) %>%
ungroup() %>%
mutate(week = week(interval60),
DayOfWeek = wday(interval60))
panel <- st_join(ride.panel %>%
filter(is.na(StartLongitude) == FALSE &
is.na(StartLatitude) == FALSE ) %>%
st_as_sf(., coords = c("StartLongitude", "StartLatitude"), crs = 4326),
LouisvilleTracts %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE) %>%
rename(Origin.Tract = GEOID) %>%
mutate(StartLongitude = unlist(map(geometry, 1)),
StartLatitude = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
select(-geometry)
panel <-
left_join(panel, LouisvilleCensus %>%
as.data.frame() %>%
select(-geometry), by = c("Origin.Tract" = "GEOID"))
Creating time-lag variables adds additional nuance about demand during a given time period - the hours before and during that day.
We can evaluate the correlations in these lags; they are quite strong, with high Pearson’s R values, though that may partly be because there are many 0s in our data.
panel <-
panel %>%
arrange(interval60,uniqueID) %>%
mutate(lagHour = dplyr::lag(Trip_Count,1),
lag2Hours = dplyr::lag(Trip_Count,2),
lag3Hours = dplyr::lag(Trip_Count,3),
lag4Hours = dplyr::lag(Trip_Count,4),
lag12Hours = dplyr::lag(Trip_Count,12),
lag1day = dplyr::lag(Trip_Count,24)) %>%
mutate(day = yday(interval60))
as.data.frame(panel) %>%
group_by(interval60) %>%
summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
gather(Variable, Value, -interval60, -Trip_Count) %>%
mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
"lag12Hours","lag1day")))%>%
group_by(Variable) %>%
summarize(correlation = round(cor(Value, Trip_Count),2))
## # A tibble: 6 x 2
## Variable correlation
## <fct> <dbl>
## 1 lagHour 1
## 2 lag2Hours 1
## 3 lag3Hours 1
## 4 lag4Hours 1
## 5 lag12Hours 0.99
## 6 lag1day 0.88
Data Split
We use data from week 32 and earlier as the training set and data after week 32 as the test set.
ride.Train <- filter(panel, week <= 32)
ride.Test <- filter(panel, week > 32)
glimpse(panel)
## Observations: 387,528
## Variables: 41
## $ interval60 <dttm> 2019-05-01, 2019-05-01, 2019-05-01,…
## $ uniqueID <chr> "424", "425", "426", "469", "470", "…
## $ Trip_Count <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Str <chr> "0", "0", "0", "0", "0", "0", "0", "…
## $ weekend <chr> "0", "0", "0", "0", "0", "0", "0", "…
## $ HW <chr> "0", "0", "0", "0", "0", "0", "0", "…
## $ OneW <chr> "0", "0", "0", "0", "0", "0", "0", "…
## $ CBD <chr> "0", "0", "0", "0", "0", "0", "0", "…
## $ OL <chr> "0", "0", "0", "0", "0", "0", "0", "…
## $ BT <chr> "0", "0", "0", "0", "0", "0", "0", "…
## $ UN <chr> "0", "0", "0", "0", "0", "0", "0", "…
## $ PH <chr> "0", "0", "0", "0", "0", "0", "0", "…
## $ CT <chr> "0", "0", "0", "0", "0", "0", "0", "…
## $ SJ <chr> "0", "0", "0", "0", "0", "0", "0", "…
## $ TP <chr> "0", "0", "0", "0", "0", "0", "0", "…
## $ Temperature <dbl> 77, 77, 77, 77, 77, 77, 77, 77, 77, …
## $ Precipitation <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Wind_Speed <dbl> 11, 11, 11, 11, 11, 11, 11, 11, 11, …
## $ Visibility <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, …
## $ week <int> 18, 18, 18, 18, 18, 18, 18, 18, 18, …
## $ DayOfWeek <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ Origin.Tract <chr> "21111004500", "21111004500", "21111…
## $ StartLongitude <dbl> -85.797, -85.786, -85.780, -85.775, …
## $ StartLatitude <dbl> 38.164, 38.165, 38.162, 38.172, 38.1…
## $ Total_Pop <dbl> 3224, 3224, 3224, 3644, 7049, 4212, …
## $ Med_Inc <dbl> 31984, 31984, 31984, 45000, 37636, 3…
## $ White_Pop <dbl> 2470, 2470, 2470, 2971, 4657, 3143, …
## $ Travel_Time <dbl> 23905, 23905, 23905, 39095, 56935, 3…
## $ Means_of_Transport <dbl> 1105, 1105, 1105, 1917, 2809, 1934, …
## $ Total_Public_Trans <dbl> 38, 38, 38, 50, 46, 38, 254, 71, 43,…
## $ Med_Age <dbl> 31.8, 31.8, 31.8, 41.6, 35.8, 39.9, …
## $ Percent_White <dbl> 0.7661290, 0.7661290, 0.7661290, 0.8…
## $ Mean_Commute_Time <dbl> 629.0789, 629.0789, 629.0789, 781.90…
## $ Percent_Taking_Public_Trans <dbl> 0.034389140, 0.034389140, 0.03438914…
## $ lagHour <dbl> NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lag2Hours <dbl> NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ lag3Hours <dbl> NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ lag4Hours <dbl> NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, 0,…
## $ lag12Hours <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ lag1day <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ day <int> 121, 121, 121, 121, 121, 121, 121, 1…
Here we select features and use linear regression to build four models: (1) station fixed effects (uniqueID) plus hour, weekend, and weather; (2) the same plus neighborhood and top-street indicators; (3) the same plus highway and one-way street indicators; and (4) a model that drops the station fixed effects and adds time-lag variables instead.
To reduce the risk of overfitting with uniqueID, we exclude it from the fourth model.
OLS Regression
reg1 <-
lm(Trip_Count ~ as.character(uniqueID) + hour(interval60) + weekend + Temperature + Wind_Speed + Visibility + Precipitation, data=ride.Train)
reg2 <-
lm(Trip_Count ~ as.character(uniqueID) + hour(interval60) + weekend + Temperature + Wind_Speed + Visibility + Precipitation + Str + CBD + OL + BT + UN + PH + CT +TP , data=ride.Train)
reg3 <-
lm(Trip_Count ~ as.character(uniqueID) + hour(interval60) + weekend + Temperature + Wind_Speed + Visibility + Precipitation + Str + CBD + OL + BT + UN + PH + CT + TP + HW + OneW,
data=ride.Train)
reg4 <-
lm(Trip_Count ~ hour(interval60) + weekend + Temperature + Wind_Speed + Visibility + Precipitation + Str + CBD + OL + BT + UN + PH + CT + TP+ lagHour + lag2Hours + lag3Hours + lag4Hours + lag12Hours + lag1day + HW + OneW,
data=ride.Train)
summary(reg4)
##
## Call:
## lm(formula = Trip_Count ~ hour(interval60) + weekend + Temperature +
## Wind_Speed + Visibility + Precipitation + Str + CBD + OL +
## BT + UN + PH + CT + TP + lagHour + lag2Hours + lag3Hours +
## lag4Hours + lag12Hours + lag1day + HW + OneW, data = ride.Train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3631 -0.0019 -0.0004 0.0008 5.8559
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.293e-02 4.342e-03 2.978 0.0029 **
## hour(interval60) 1.756e-04 3.639e-05 4.826 1.39e-06 ***
## weekend1 1.928e-01 3.142e-03 61.369 < 2e-16 ***
## Temperature 2.037e-05 2.674e-05 0.762 0.4463
## Wind_Speed 2.273e-05 6.075e-05 0.374 0.7083
## Visibility -1.618e-03 4.133e-04 -3.916 9.02e-05 ***
## Precipitation -8.629e-04 6.773e-04 -1.274 0.2026
## Str1 4.292e-01 3.726e-03 115.181 < 2e-16 ***
## CBD1 3.208e-01 4.112e-03 78.009 < 2e-16 ***
## OL1 2.327e-01 6.557e-03 35.490 < 2e-16 ***
## BT1 5.580e-01 8.391e-03 66.502 < 2e-16 ***
## UN1 7.548e-01 6.080e-03 124.144 < 2e-16 ***
## PH1 1.263e+00 6.345e-03 199.072 < 2e-16 ***
## CT1 4.638e-01 7.355e-03 63.065 < 2e-16 ***
## TP1 2.514e-02 1.434e-02 1.753 0.0796 .
## lagHour -3.169e-03 7.815e-04 -4.056 5.00e-05 ***
## lag2Hours 8.929e-03 7.984e-04 11.183 < 2e-16 ***
## lag3Hours -5.202e-04 7.813e-04 -0.666 0.5055
## lag4Hours -1.379e-03 7.806e-04 -1.767 0.0772 .
## lag12Hours 1.303e-03 7.756e-04 1.680 0.0930 .
## lag1day -3.142e-04 7.755e-04 -0.405 0.6853
## HW1 9.688e-01 3.461e-03 279.907 < 2e-16 ***
## OneW1 5.414e+00 1.198e-02 451.777 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1248 on 328521 degrees of freedom
## (24 observations deleted due to missingness)
## Multiple R-squared: 0.8031, Adjusted R-squared: 0.8031
## F-statistic: 6.092e+04 on 22 and 328521 DF, p-value: < 2.2e-16
In case we have too many variables and overfit the model, we cross-validate with lasso regression, which imposes a penalty for including too many variables. This allows us to find a reduced set of variables that yields an optimally performing model.
Compute Lasso Regression
library(glmnet)
panel2 = panel
panel2<- panel2%>%na.omit()
panel2$uniqueID = as.character(panel2$uniqueID)
panel2$hour = hour(panel2$interval60)
ride.Train2 <- filter(panel2, week <= 32)
ride.Test2 <- filter(panel2, week > 32)
x = data.matrix(data.frame(ride.Train2[c("uniqueID", "weekend","Temperature", "Wind_Speed","Visibility", "Precipitation", "Str", "CBD","OL","BT","UN","PH","CT","TP","lagHour","lag2Hours","lag3Hours","lag4Hours","lag12Hours","lag1day","HW","OneW")],stringsAsFactors =TRUE ))
y = data.matrix(data.frame(ride.Train2[c("Trip_Count")],stringsAsFactors =TRUE ))
cv.lasso <- cv.glmnet(x, y,alpha = 1 )
model = glmnet(x,y, alpha =1, lambda = cv.lasso$lambda.min)
tx = data.matrix(data.frame(ride.Test2[c("uniqueID", "weekend","Temperature", "Wind_Speed","Visibility", "Precipitation", "Str", "CBD","OL","BT","UN","PH","CT","TP","lagHour","lag2Hours","lag3Hours","lag4Hours","lag12Hours","lag1day","HW","OneW")],stringsAsFactors =TRUE ))
ty = data.matrix(data.frame(ride.Test2[c("Trip_Count")],stringsAsFactors =TRUE ))
prob <- model %>% predict(newx=tx)
mean(abs(prob - ty), na.rm = TRUE)
## [1] 0.01453942
This mean absolute error is no better than the OLS results, so we will stick with OLS for predictions.
plot(cv.lasso)
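To see which predictors the lasso actually retains at the cross-validated penalty, we can pull the non-zero coefficients; a quick check, assuming the cv.lasso object fit above.
# Coefficients at lambda.min; keep only the predictors the lasso did not shrink to zero.
lasso_coefs <- coef(cv.lasso, s = "lambda.min")
nz <- which(as.vector(lasso_coefs) != 0)
lasso_coefs[nz, , drop = FALSE]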
XGBOOST Regression
Since we do not have much time for sophisticated parameter tuning, we mostly use default settings.
#install.packages("xgboost")
library("xgboost")
dtrain = xgb.DMatrix(data = x,label = y)
dtest <- xgb.DMatrix(data = tx,label=ty)
params <- list(booster = "gbtree", objective = "reg:squarederror", eta=0.3, gamma=0, max_depth=20, min_child_weight=1, subsample=1, colsample_bytree=1)
xgbcv <- xgb.cv( params = params, data = dtrain, nrounds = 100, nfold = 5, showsd = T, stratified = T, print.every.n = 10, early.stop.round = 20, maximize = F)
## Warning: 'print.every.n' is deprecated.
## Use 'print_every_n' instead.
## See help("Deprecated") and help("xgboost-deprecated").
## Warning: 'early.stop.round' is deprecated.
## Use 'early_stopping_rounds' instead.
## See help("Deprecated") and help("xgboost-deprecated").
## [1] train-rmse:0.389851+0.000367 test-rmse:0.392309+0.002126
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 20 rounds.
##
## [11] train-rmse:0.058016+0.001310 test-rmse:0.139364+0.005130
## [21] train-rmse:0.048164+0.001354 test-rmse:0.145368+0.004916
## Stopping. Best iteration:
## [9] train-rmse:0.066456+0.001376 test-rmse:0.138833+0.004765
xgb1 <- xgb.train (params = params, data = dtrain, nrounds = 79, watchlist = list(val=dtest,train=dtrain), print.every.n = 10, early.stop.round = 10, maximize = F , eval_metric = "error")
## Warning: 'print.every.n' is deprecated.
## Use 'print_every_n' instead.
## See help("Deprecated") and help("xgboost-deprecated").
## Warning: 'early.stop.round' is deprecated.
## Use 'early_stopping_rounds' instead.
## See help("Deprecated") and help("xgboost-deprecated").
## [1] val-error:-0.011337 train-error:-0.012020
## Multiple eval metrics are present. Will use train_error for early stopping.
## Will train until train_error hasn't improved in 10 rounds.
##
## [11] val-error:-0.011283 train-error:-0.012129
## [21] val-error:-0.011283 train-error:-0.012160
## Stopping. Best iteration:
## [16] val-error:-0.011283 train-error:-0.012160
#model prediction
xgbpred <- predict(xgb1,dtest)
mean(abs(xgbpred - ty), na.rm = TRUE)
## [1] 0.01566367
summary(xgbpred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.010788 0.001670 0.001673 0.036851 0.001678 8.474325
The mean absolute error is still around 0.015-0.016, so we will stick with OLS. It is still helpful to visualize feature importance from the XGBoost model to understand which predictors matter most.
Feature Importance
#view variable importance plot
mat <- xgb.importance (feature_names = colnames(x),model = xgb1)
xgb.plot.importance (importance_matrix = mat[1:20])
We see that HW, Str, OneW, Temperature, and Wind_Speed are the most important features, which indicates that street information and weather conditions matter most for predicting riders’ usage patterns. This makes sense, since primary/secondary/residential streets are essential for scooter usage.
ride.Test.weekNest <-
ride.Test %>%
nest(-week)
## Warning: All elements of `...` must be named.
## Did you want `data = c(interval60, uniqueID, Trip_Count, Str, weekend, HW, OneW, CBD,
## OL, BT, UN, PH, CT, SJ, TP, Temperature, Precipitation, Wind_Speed,
## Visibility, DayOfWeek, Origin.Tract, StartLongitude, StartLatitude,
## Total_Pop, Med_Inc, White_Pop, Travel_Time, Means_of_Transport,
## Total_Public_Trans, Med_Age, Percent_White, Mean_Commute_Time,
## Percent_Taking_Public_Trans, lagHour, lag2Hours, lag3Hours,
## lag4Hours, lag12Hours, lag1day, day)`?
model_pred <- function(dat, fit){
pred <- predict(fit, newdata = dat)}
Making Predictions
week_predictions <-
ride.Test.weekNest %>%
mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred)) %>%
gather(Regression, Prediction, -data, -week) %>%
mutate(Observed = map(data, pull, Trip_Count),
Absolute_Error = map2(Observed, Prediction, ~ abs(.x - .y)),
MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))
The best models - the time-space model and the time-lag model - are accurate to less than an average of one ride per hour. At a glance, that is pretty good overall accuracy.
week_predictions %>%
dplyr::select(week, Regression, MAE) %>%
gather(Variable, MAE, -Regression, -week) %>%
ggplot(aes(week, MAE)) +
geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
scale_fill_manual(values = palette5) +
labs(title = "Mean Absolute Errors by model specification and week") +
plotTheme
Time Series Visualization
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
uniqueID = map(data, pull, uniqueID)) %>%
dplyr::select(interval60, uniqueID, Observed, Prediction, Regression) %>%
unnest() %>%
gather(Variable, Value, -Regression, -interval60, - uniqueID) %>%
group_by(Regression, Variable, interval60) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size = 1.1) +
facet_wrap(~Regression, ncol=1) +
labs(title = "Predicted/Observed scooter share time series", subtitle = "Louisville; A test set of 3 weeks", x = "Hour", y= "Station Trips") +
plotTheme
## Warning: `cols` is now required.
## Please use `cols = c(interval60, uniqueID, Observed, Prediction)`
## Warning: Removed 22 rows containing missing values (geom_path).
After comparing the regression results, we conclude that weather, neighborhood information and street information are significant in influencing the scooter share demand.
Moving forward, let’s stick with reg4, which seems to have the best goodness of fit generally.
We can look at our mean absolute errors by station - there seems to be a spatial pattern to our errors, but we need to dig a bit deeper to understand it.
mae <- week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
uniqueID = map(data, pull, uniqueID),
StartLatitude = map(data, pull, StartLatitude),
StartLongitude = map(data, pull, StartLongitude)) %>%
select(interval60, uniqueID, StartLatitude, StartLongitude, Observed, Prediction, Regression) %>%
unnest() %>%
filter(Regression == "DTime_Space_FE_timeLags") %>%
group_by(StartLongitude, StartLatitude) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))
## Warning: `cols` is now required.
## Please use `cols = c(interval60, uniqueID, StartLatitude, StartLongitude, Observed,
## Prediction)`
mae%>%
ggplot(.)+
geom_sf(data = LouisvilleCensus, color = "grey", fill = "transparent")+
geom_point(aes(x = StartLongitude, y = StartLatitude, color = MAE),
fill = "transparent", alpha = 0.6)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(dat_census$StartLatitude), max(dat_census$StartLatitude))+
xlim(min(dat_census$StartLongitude), max(dat_census$StartLongitude))+
labs(title="Mean Abs Error, Test Set, Model 4")+
mapTheme
Now we want to examine the spatial element of the error.
Test Prediction Error by Stations
If we plot observed vs. predicted for different times of day during the week and weekend, some patterns begin to emerge.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
uniqueID = map(data, pull, uniqueID),
StartLatitude = map(data, pull, StartLatitude),
StartLongitude = map(data, pull, StartLongitude),
DayOfWeek = map(data, pull, DayOfWeek)) %>%
select(interval60, uniqueID, StartLatitude, StartLongitude, Observed, Prediction,DayOfWeek, Regression) %>%
unnest() %>%
filter(Regression == "DTime_Space_FE_timeLags")%>%
mutate(weekend = ifelse(DayOfWeek %in% c(1, 7), "Weekend", "Weekday"), # wday() codes 1 = Sunday, 7 = Saturday
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
filter(is.na(Prediction)==FALSE)%>%
ggplot()+
geom_point(aes(x= Observed, y = Prediction), size = 0.3)+
geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
geom_abline(slope = 1, intercept = 0)+
facet_grid(time_of_day~weekend)+
labs(title="Observed vs Predicted",
x="Observed trips",
y="Predicted trips")+
plotTheme
## Warning: `cols` is now required.
## Please use `cols = c(interval60, uniqueID, StartLatitude, StartLongitude, Observed,
## Prediction, DayOfWeek)`
It seems the errors are concentrated in certain areas, and we obtain better predictions for weekdays than for weekends.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
Origin.Tract = map(data, pull, Origin.Tract),
StartLatitude = map(data, pull, StartLatitude),
StartLongitude = map(data, pull, StartLongitude),
DayOfWeek = map(data, pull, DayOfWeek) ) %>%
select(interval60, Origin.Tract, StartLongitude,
StartLatitude, Observed, Prediction, Regression,
DayOfWeek) %>%
unnest() %>%
filter(Regression == "DTime_Space_FE_timeLags")%>%
mutate(weekend = ifelse(DayOfWeek %in% c(1, 7), "Weekend", "Weekday"), # wday() codes 1 = Sunday, 7 = Saturday
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
group_by(Origin.Tract, weekend, time_of_day, StartLongitude, StartLatitude) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = LouisvilleCensus, color = "grey", fill = "transparent")+
geom_point(aes(x = StartLongitude, y = StartLatitude, color = MAE),
fill = "transparent", size = 0.5, alpha = 0.6)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(dat_census$ StartLatitude), max(dat_census$StartLatitude))+
xlim(min(dat_census$StartLongitude), max(dat_census$StartLongitude))+
facet_grid(weekend~time_of_day)+
labs(title="Mean Absolute Errors, Test Set")+
mapTheme
## Warning: `cols` is now required.
## Please use `cols = c(interval60, Origin.Tract, StartLongitude, StartLatitude, Observed,
## Prediction, DayOfWeek)`
Let’s focus on the morning commute, where station locations probably relate to likely users, who appear to be commuting into downtown Louisville. The reg4 model performs better in relatively high-income areas.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
Origin.Tract = map(data, pull, Origin.Tract),
StartLatitude = map(data, pull, StartLatitude),
StartLongitude = map(data, pull, StartLongitude),
DayOfWeek = map(data, pull, DayOfWeek),
Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
Med_Inc = map(data, pull, Med_Inc),
Percent_White = map(data, pull, Percent_White)) %>%
select(interval60, Origin.Tract, StartLongitude,
StartLatitude, Observed, Prediction, Regression,
DayOfWeek , Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
unnest() %>%
filter(Regression == "DTime_Space_FE_timeLags")%>%
mutate(weekend = ifelse(DayOfWeek %in% c(1, 7), "Weekend", "Weekday"), # wday() codes 1 = Sunday, 7 = Saturday
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
filter(time_of_day == "AM Rush") %>%
group_by(Origin.Tract, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
gather(-Origin.Tract, -MAE, key = "variable", value = "value")%>%
ggplot(.)+
geom_point(aes(x = value, y = MAE), alpha = 0.5)+
facet_wrap(~variable, scales = "free")+
labs(title="Errors as a function of socio-economic variables",
y="Mean Absolute Error (Trips)")+
plotTheme
## Warning: `cols` is now required.
## Please use `cols = c(interval60, Origin.Tract, StartLongitude, StartLatitude, Observed,
## Prediction, DayOfWeek, Percent_Taking_Public_Trans, Med_Inc,
## Percent_White)`
Based on our time-series plots and error metrics, we see that we are able to track the time components of demand, but we miss some peaks and underpredict during some periods of high demand. The subsequent maps of our errors show that these peaks have a spatial and demographic pattern; the underpredicted areas appear to be concentrated Downtown.
In terms of generalizability, reg4 performs well and achieves good prediction results. From an operations perspective, the underprediction of high demand lies mainly in the weekday commute: as the time-series plot shows, the peaks we miss are mostly on weekdays. This is probably because weekday commutes have more variation, such as road closures, traffic congestion, and public events (marathons).
To reduce these errors, there are several next steps, such as collecting more real-time data from scooter companies’ APIs (Bird, Lime, and Spin). We can also compute Global or Local Moran’s I for trip volume or error to identify differences in spatial autocorrelation, as sketched below, and then explore those areas further to understand where we are predicting poorly.
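As a sketch of that Moran’s I idea, the spdep package could be applied to the station-level MAEs computed above (assuming the mae data frame from the error-mapping step); a k-nearest-neighbour spatial weights matrix is one simple choice for point locations.
# Global Moran's I on station-level MAE using 5 nearest neighbours.
library(spdep)
mae_pts <- mae %>% ungroup() %>% filter(is.finite(MAE))
coords <- cbind(mae_pts$StartLongitude, mae_pts$StartLatitude)
knn_nb <- knn2nb(knearneigh(coords, k = 5))
weights <- nb2listw(knn_nb, style = "W")
moran.test(mae_pts$MAE, weights)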
We can also add points of interest, such as school districts, and education levels to our model. Intuitively, university students and people with higher education are more likely to use scooters. We may be able to better capture the peaks if we include more of this information in the fishnet.