The data is read in, and all rows containing NA values are dropped to remove any entries for which weather data is missing.
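# Packages used below: dplyr (filter, %>%), caret (createDataPartition, RMSE, R2), splines (bs), ggplot2
library(dplyr); library(caret); library(splines); library(ggplot2)
dat <- read.csv("~/INST377/Food_Inspection_Build/dc2020/dots_data.csv")
dat <- na.omit(dat)  # drop every row that has at least one NA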
For this model, we focus on a single location, the South Gate, South View, to keep things simple. The data is then randomly partitioned into an 80/20 training/testing split. The knots for the model are also created here; they are simpler than in the final model so that the fitted curve is smoother.
dat_filter <- filter(dat, location == 'South_Gate_South_View')
set.seed(123)
training.samples <- dat_filter$time_of_day %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- dat_filter[training.samples, ]
test.data <- dat_filter[-training.samples, ]
knots <- quantile(train.data$time_of_day, probs = c(0, 0.25, 0.5, 0.75, 1))
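The 0% and 100% quantiles passed as knots coincide with the default boundary knots of `splines::bs()` (the range of the data), which is a likely cause of the rank-deficiency warning and the two `NA` spline coefficients reported below. A minimal sketch of the alternative passes only the interior quartiles as knots. It runs on simulated data, where `x`, `temp`, and `y` are hypothetical stand-ins for `time_of_day`, `Temperature`, and `cars`.

```r
library(splines)

set.seed(123)
n <- 200
x <- runif(n, 0, 24)                  # hypothetical time of day in hours
temp <- rnorm(n, mean = 60, sd = 10)  # hypothetical temperature
y <- 200 + 150 * sin(x / 24 * 2 * pi) + 0.2 * temp + rnorm(n, sd = 20)

# Interior knots only: bs() places the boundary knots at range(x) by itself
interior_knots <- quantile(x, probs = c(0.25, 0.5, 0.75))

basis <- bs(x, knots = interior_knots)
ncol(basis)  # 3 interior knots + cubic degree 3 = 6 basis columns

fit <- lm(y ~ bs(x, knots = interior_knots) + temp)
anyNA(coef(fit))  # FALSE when every coefficient is estimable
```

With three interior knots the spline contributes six columns, the same number of spline coefficients that were actually estimated in the summaries below.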
The model is built using spline regression to emulate the curves of the dataset, while temperature is used as a linear component.
cmodel <- lm(cars ~ bs(time_of_day, knots = knots) + Temperature, data = train.data)
ggplot(train.data, aes(time_of_day, cars)) +
geom_point() +
stat_smooth(method = lm, formula = y ~ splines::bs(x, df = 5))
predictions <- cmodel %>% predict(test.data)
## Warning in predict.lm(., test.data): prediction from a rank-deficient fit
## may be misleading
data.frame( RMSE = RMSE(predictions, test.data$cars),
R2 = R2(predictions, test.data$cars) )
## RMSE R2
## 1 136.4363 0.830443
summary(cmodel)
##
## Call:
## lm(formula = cars ~ bs(time_of_day, knots = knots) + Temperature,
## data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -546.52 -75.94 -3.26 82.51 489.27
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 239.4151 34.8120 6.877 1.03e-11 ***
## bs(time_of_day, knots = knots)1 -73.1228 37.1033 -1.971 0.0490 *
## bs(time_of_day, knots = knots)2 -25.3999 40.2329 -0.631 0.5280
## bs(time_of_day, knots = knots)3 -363.8199 36.5769 -9.947 < 2e-16 ***
## bs(time_of_day, knots = knots)4 -98.0190 39.4937 -2.482 0.0132 *
## bs(time_of_day, knots = knots)5 686.8060 32.8200 20.926 < 2e-16 ***
## bs(time_of_day, knots = knots)6 630.4445 47.6572 13.229 < 2e-16 ***
## bs(time_of_day, knots = knots)7 NA NA NA NA
## bs(time_of_day, knots = knots)8 NA NA NA NA
## Temperature 0.2050 0.5025 0.408 0.6833
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 144.8 on 1071 degrees of freedom
## Multiple R-squared: 0.8092, Adjusted R-squared: 0.8079
## F-statistic: 648.9 on 7 and 1071 DF, p-value: < 2.2e-16
Evaluating the model on the test data gives an RMSE of 136.4 and an R^2 of 0.830. The coefficient of Temperature is 0.205 with a p-value of 0.6833, meaning that temperature is not a significant predictor of cars driving through the South Gate, South View.
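For reference, the two metrics can be reproduced in base R. This is a sketch assuming caret's defaults, where `RMSE()` is the root of the mean squared difference and `R2()` is the squared Pearson correlation between predictions and observations. The helper names `rmse_manual` and `r2_manual` and the toy vectors are illustrative, not part of the analysis.

```r
# Root mean squared error between predictions and observed values
rmse_manual <- function(pred, obs) sqrt(mean((pred - obs)^2))

# Squared Pearson correlation (assumed to match caret::R2's default, formula = "corr")
r2_manual <- function(pred, obs) cor(pred, obs)^2

# Toy vectors for illustration only
obs  <- c(100, 250, 400, 550)
pred <- c(110, 240, 420, 530)

rmse_manual(pred, obs)
r2_manual(pred, obs)
```

Because this R^2 is a squared correlation, it can be high even when predictions sit on a different scale than the observations, so it is worth reading alongside the RMSE.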
The pedestrian model is built the same way as the car model above, but pedestrian traffic spikes as classes change during the week, which makes the model less reliable.
tmodel <- lm(pedestrians ~ bs(time_of_day, knots = knots) + Temperature, data = train.data)
ggplot(train.data, aes(time_of_day, pedestrians)) +
geom_point() +
stat_smooth(method = lm, formula = y ~ splines::bs(x, df = 5))
predictions <- tmodel %>% predict(test.data)
## Warning in predict.lm(., test.data): prediction from a rank-deficient fit
## may be misleading
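# Note: this compares against test.data$cars; the response tmodel was fit on is test.data$pedestrians
data.frame( RMSE = RMSE(predictions, test.data$cars),
R2 = R2(predictions, test.data$cars) )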
## RMSE R2
## 1 331.4444 0.7901993
summary(tmodel)
##
## Call:
## lm(formula = pedestrians ~ bs(time_of_day, knots = knots) + Temperature,
## data = train.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -247.11 -36.48 -6.08 19.73 698.71
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -31.1998 23.3243 -1.338 0.181
## bs(time_of_day, knots = knots)1 32.9545 24.8596 1.326 0.185
## bs(time_of_day, knots = knots)2 -19.5978 26.9564 -0.727 0.467
## bs(time_of_day, knots = knots)3 11.7775 24.5068 0.481 0.631
## bs(time_of_day, knots = knots)4 -132.1387 26.4611 -4.994 6.91e-07 ***
## bs(time_of_day, knots = knots)5 397.1060 21.9897 18.059 < 2e-16 ***
## bs(time_of_day, knots = knots)6 299.0134 31.9308 9.364 < 2e-16 ***
## bs(time_of_day, knots = knots)7 NA NA NA NA
## bs(time_of_day, knots = knots)8 NA NA NA NA
## Temperature 1.4470 0.3367 4.298 1.88e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 97 on 1071 degrees of freedom
## Multiple R-squared: 0.6389, Adjusted R-squared: 0.6365
## F-statistic: 270.7 on 7 and 1071 DF, p-value: < 2.2e-16
Evaluated on the test data, the code above reports an RMSE of 331.4 and an R^2 of 0.7902. These values were computed against test.data$cars rather than test.data$pedestrians, so they do not measure how well the model predicts pedestrian counts. The coefficient of Temperature is 1.45 with a p-value of 1.88e-05, so at an alpha of 0.01, each additional degree Fahrenheit is associated with about 1.45 more people walking every 15 minutes at the South Gate, South View. Because of the outliers and spikes mentioned previously, any change in the seed drastically changes these results, but the Temperature coefficient tends to stay positive.
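One way to check how much the Temperature coefficient depends on the split is to refit the model over several seeds. The sketch below uses simulated data and a plain `sample()` split in place of `createDataPartition()`. The data frame `sim`, its values, and the helper `temp_coef_for_seed` are hypothetical, and the interior-quartile knots are an assumption rather than the knots used above.

```r
library(splines)

# Simulated stand-in for the filtered data (not the real counts)
set.seed(1)
n <- 500
sim <- data.frame(
  time_of_day = runif(n, 0, 24),
  Temperature = rnorm(n, mean = 60, sd = 10)
)
sim$pedestrians <- 50 + 40 * sin(sim$time_of_day / 24 * 2 * pi) +
  1.5 * sim$Temperature + rnorm(n, sd = 30)

# Refit on a fresh 80% training sample and return the Temperature coefficient
temp_coef_for_seed <- function(seed, data) {
  set.seed(seed)
  idx <- sample(nrow(data), size = floor(0.8 * nrow(data)))
  train <- data[idx, ]
  k <- quantile(train$time_of_day, probs = c(0.25, 0.5, 0.75))
  fit <- lm(pedestrians ~ bs(time_of_day, knots = k) + Temperature, data = train)
  unname(coef(fit)["Temperature"])
}

seeds <- 1:20
temp_coefs <- sapply(seeds, temp_coef_for_seed, data = sim)
summary(temp_coefs)   # spread of the coefficient across splits
mean(temp_coefs > 0)  # share of splits with a positive coefficient
```

Applied to the real data, the same loop would show whether "tends to stay positive" holds across a range of seeds rather than a handful of manual reruns.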
As temperature rises, pedestrian counts increase at the South Gate, South View, while the effect on cars is not statistically significant in this fit. In both cases the effect of temperature is minute compared to the effect of time of day, so temperature will not be included in the final model.
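The decision to drop temperature can be checked by comparing nested fits with and without it. The following is a sketch on simulated data; all names and values are hypothetical, and the interior-quartile knots are an assumption rather than the knots used above.

```r
library(splines)

set.seed(42)
n <- 500
sim <- data.frame(
  time_of_day = runif(n, 0, 24),
  Temperature = rnorm(n, mean = 60, sd = 10)
)
# Strong time-of-day effect, weak temperature effect (chosen for illustration)
sim$cars <- 300 + 250 * sin(sim$time_of_day / 24 * 2 * pi) +
  0.2 * sim$Temperature + rnorm(n, sd = 100)

k <- quantile(sim$time_of_day, probs = c(0.25, 0.5, 0.75))
fit_time <- lm(cars ~ bs(time_of_day, knots = k), data = sim)
fit_both <- lm(cars ~ bs(time_of_day, knots = k) + Temperature, data = sim)

# Adjusted R-squared of each fit
c(time_only = summary(fit_time)$adj.r.squared,
  with_temp = summary(fit_both)$adj.r.squared)

# F-test for the extra Temperature term
cmp <- anova(fit_time, fit_both)
cmp
```

If the adjusted R-squared barely moves and the F-test is not significant, leaving temperature out costs little predictive power.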