Bootstrapped Resampling Regression & Revenue Forecasting

By Alex Labuda

November 28, 2022

In this post, I’ll begin by visualizing marketing related time-series data which includes revenue and spend for several direct marketing channels.

I will fit a simple regression model and examine the effect of each channel on revenue.

In order to better understand the reliability of each coefficient, I will fit many bootstrapped resamples to develop a more robust estimate of each coefficient.

Lastly, I will create a forecasting model to predict future revenue. Enjoy!

library(modeltime)
library(timetk)
library(lubridate)

interactive <- FALSE

The Data

df <- 
  read_csv(
    file = "data/marketing_data.csv"
  ) %>% 
  mutate(
    date = mdy(date)
  )

Visualize the Data

First lets visualize our target variable, revenue

df %>% 
  ggplot(aes(date, revenue)) +
  geom_line() +
  labs(title = "Revenue",
       x = "")

Fit: Simple Regression Model

First we’ll start by fitting a simple regression model to the data. This is just the grand mean of our data.

df %>%
    plot_time_series_regression(
        .date_var     = date,
        .formula      = revenue ~ as.numeric(date),
        .interactive  = interactive,
        .show_summary = FALSE
    )

We can make a more accurate model by using month of year as independent variables for our model

df %>%
    plot_time_series_regression(
        .date_var     = date,
        .formula      = revenue ~ as.numeric(date) + month(date, label = TRUE),
        .interactive  = interactive,
        .show_summary = FALSE
    )

We can also introduce our spend variables as input to our model to study each channel’s effect on revenue

df %>%
    plot_time_series_regression(
        .date_var     = date,
        .formula      = revenue ~ as.numeric(date) + month(date, label = TRUE) + tv_spend +
          billboard_spend + print_spend + search_spend + facebook_spend + competitor_sales,
        .interactive  = interactive,
        .show_summary = FALSE
    )

A Simple Linear Model

Now lets take a look at our simple model output that includes direct marketing features and examine their effect of revenue

We can see that:

TV spend, print spend, competitor sales and Facebook spend are statistically significant
Print spend has the largest positive effect at increasing revenue of each of the direct marketing inputs
A large positive intercept suggests that this company already has a strong baseline of sales

# forcing intercept to 0 to just show the gap from 0 instead of a base
revenue_fit <- lm(revenue ~ as.numeric(date) + tv_spend + billboard_spend + print_spend + 
                    search_spend + facebook_spend + competitor_sales, data = df)
summary(revenue_fit)

## 
## Call:
## lm(formula = revenue ~ as.numeric(date) + tv_spend + billboard_spend + 
##     print_spend + search_spend + facebook_spend + competitor_sales, 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -440321  -86030  -59261  -12997 1618906 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       7.345e+05  1.070e+06   0.687   0.4932    
## as.numeric(date) -3.416e+01  5.754e+01  -0.594   0.5534    
## tv_spend          5.007e-01  9.069e-02   5.521 1.04e-07 ***
## billboard_spend   4.183e-02  1.216e-01   0.344   0.7313    
## print_spend       8.755e-01  3.950e-01   2.216   0.0278 *  
## search_spend      5.364e-01  6.683e-01   0.803   0.4231    
## facebook_spend    3.577e-01  2.095e-01   1.707   0.0893 .  
## competitor_sales  2.868e-01  1.154e-02  24.846  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 264700 on 200 degrees of freedom
## Multiple R-squared:  0.8681, Adjusted R-squared:  0.8634 
## F-statistic:   188 on 7 and 200 DF,  p-value: < 2.2e-16

Viz: SLM coefficients

We can plot the coefficient and confidence intervals to make it easier to see the reliability of each estimate

Print spend has the largest positive coefficient, but also the widest confidence interval
Our model is very confident in the estimated effect of competitor sales

library(broom)
library(dotwhisker)

tidy(revenue_fit) %>% 
  filter(p.value < 0.1) %>% 
  mutate(
         term = fct_reorder(term, -estimate)) %>% 
  dwplot(
    vars_order = levels(.$term),
    dot_args = list(size = 3, color = "darkred"),
  whisker_args = list(color = "darkred", alpha = 0.75)
  ) +
  labs(x = "Coefficient by channel",
       y = NULL,
       title = "Simple Linear Model Coefficients")

Fit: Bootstrap Resampling

How reliable are our coefficients?

We can fit a model using bootstrapped resamples of our data. This allows us to determine the stability of our coefficient estimates. Essentially this is fitting many small models to examine the variation of each channel’s coefficient.

By default reg_intervals uses 1,001 bootstrap samples for t-intervals and 2,001 for percentile intervals.

library(rsample)

revenue_intervals <-
  reg_intervals(revenue ~ as.numeric(date) + tv_spend + billboard_spend + print_spend + 
                    search_spend + facebook_spend + competitor_sales,
                data = df, keep_reps = TRUE)

revenue_intervals

## # A tibble: 7 × 7
##   term                .lower .estimate .upper .alpha .method         .replicates
##   <chr>                <dbl>     <dbl>  <dbl>  <dbl> <chr>     <list<tibble[,2]>
## 1 as.numeric(date) -138.      -35.1    64.7     0.05 student-t       [1,001 × 2]
## 2 billboard_spend    -0.123     0.0403  0.167   0.05 student-t       [1,001 × 2]
## 3 competitor_sales    0.263     0.287   0.304   0.05 student-t       [1,001 × 2]
## 4 facebook_spend     -0.0439    0.368   0.707   0.05 student-t       [1,001 × 2]
## 5 print_spend         0.263     0.875   1.42    0.05 student-t       [1,001 × 2]
## 6 search_spend       -0.639     0.575   1.71    0.05 student-t       [1,001 × 2]
## 7 tv_spend            0.199     0.513   0.763   0.05 student-t       [1,001 × 2]

Viz: Bootstrapped Resampled Coefficients

Crossbar chart

Here we can see the wide confidence interval surrounding search_spend.
This makes sense since the coefficient for this channel in our first linear model was not statistically significant.

revenue_intervals %>%
  filter(!term == "as.numeric(date)") %>% 
    mutate(
        term = fct_reorder(term, .estimate)
    ) %>%
    ggplot(aes(.estimate, term)) +
    geom_crossbar(aes(xmin = .lower, xmax = .upper),
                  color = "darkred", alpha = 0.8) +
  labs(x = "Coefficient by channel", 
       y = NULL, 
       title = "Bootstrapped Resampled Model Coefficients", 
       subtitle = "95% Confidence Intervals")

Time-Series Forecasting

Lets build our forecasting model!

df_ts <- 
  df %>% 
  select(date, revenue)

df_ts %>% 
  ggplot(aes(date, revenue)) +
  geom_line() +
  labs(title = "Revenue",
       x = "")

Training & Testing Splits

First, I’ll split the data into training and testing sets

splits <- 
  df_ts %>% 
  time_series_split(assess = "12 months", cumulative = TRUE)

splits %>% 
  tk_time_series_cv_plan() %>% 
  plot_time_series_cv_plan(date, revenue, .interactive = interactive) +
  labs(title = "Timeseries - Training / Testing Split")

Modeling

Now we can create the recipe for our model.

After we bake, we can see what the output looks like.

library(tidymodels)
recipe_spec_timeseries <- 
  recipe(revenue ~., data = training(splits)) %>%
    step_timeseries_signature(date) 

bake(prep(recipe_spec_timeseries), new_data = training(splits))

## # A tibble: 156 × 29
##    date        revenue date_in…¹ date_…² date_…³ date_…⁴ date_…⁵ date_…⁶ date_…⁷
##    <date>        <dbl>     <dbl>   <int>   <int>   <int>   <int>   <int>   <int>
##  1 2018-11-23 2754372.    1.54e9    2018    2018       2       4      11      10
##  2 2018-11-30 2584277.    1.54e9    2018    2018       2       4      11      10
##  3 2018-12-07 2547387.    1.54e9    2018    2018       2       4      12      11
##  4 2018-12-14 2875220     1.54e9    2018    2018       2       4      12      11
##  5 2018-12-21 2215953.    1.55e9    2018    2018       2       4      12      11
##  6 2018-12-28 2569922.    1.55e9    2018    2018       2       4      12      11
##  7 2019-01-04 2171507.    1.55e9    2019    2019       1       1       1       0
##  8 2019-01-11 2464132.    1.55e9    2019    2019       1       1       1       0
##  9 2019-01-18 2012520     1.55e9    2019    2019       1       1       1       0
## 10 2019-01-25 1738912.    1.55e9    2019    2019       1       1       1       0
## # … with 146 more rows, 20 more variables: date_month.lbl <ord>,
## #   date_day <int>, date_hour <int>, date_minute <int>, date_second <int>,
## #   date_hour12 <int>, date_am.pm <int>, date_wday <int>, date_wday.xts <int>,
## #   date_wday.lbl <ord>, date_mday <int>, date_qday <int>, date_yday <int>,
## #   date_mweek <int>, date_week <int>, date_week.iso <int>, date_week2 <int>,
## #   date_week3 <int>, date_week4 <int>, date_mday7 <int>, and abbreviated
## #   variable names ¹date_index.num, ²date_year, ³date_year.iso, ⁴date_half, …

Preprocessing Steps

Now we’ll add a few additional preprocessing steps to our recipe:

Convert a date column into a fourier series
Remove date
Normalize the inputs
one-hot encode categorical inputs (add dummies)

recipe_spec_final <- recipe_spec_timeseries %>%
    step_fourier(date, period = 365, K = 1) %>%
    step_rm(date) %>%
    step_rm(contains("iso"), contains("minute"), contains("hour"),
            contains("am.pm"), contains("xts")) %>%
    step_normalize(contains("index.num"), date_year) %>%
    step_dummy(contains("lbl"), one_hot = TRUE) 

recipe_spec_prophet <- recipe_spec_timeseries

juice(prep(recipe_spec_final))

## # A tibble: 156 × 39
##     revenue date_index…¹ date_…² date_…³ date_…⁴ date_…⁵ date_…⁶ date_…⁷ date_…⁸
##       <dbl>        <dbl>   <dbl>   <int>   <int>   <int>   <int>   <int>   <int>
##  1 2754372.        -1.71   -2.14       2       4      11      23       0       6
##  2 2584277.        -1.69   -2.14       2       4      11      30       0       6
##  3 2547387.        -1.67   -2.14       2       4      12       7       0       6
##  4 2875220         -1.65   -2.14       2       4      12      14       0       6
##  5 2215953.        -1.62   -2.14       2       4      12      21       0       6
##  6 2569922.        -1.60   -2.14       2       4      12      28       0       6
##  7 2171507.        -1.58   -1.01       1       1       1       4       0       6
##  8 2464132.        -1.56   -1.01       1       1       1      11       0       6
##  9 2012520         -1.54   -1.01       1       1       1      18       0       6
## 10 1738912.        -1.51   -1.01       1       1       1      25       0       6
## # … with 146 more rows, 30 more variables: date_mday <int>, date_qday <int>,
## #   date_yday <int>, date_mweek <int>, date_week <int>, date_week2 <int>,
## #   date_week3 <int>, date_week4 <int>, date_mday7 <int>, date_sin365_K1 <dbl>,
## #   date_cos365_K1 <dbl>, date_month.lbl_01 <dbl>, date_month.lbl_02 <dbl>,
## #   date_month.lbl_03 <dbl>, date_month.lbl_04 <dbl>, date_month.lbl_05 <dbl>,
## #   date_month.lbl_06 <dbl>, date_month.lbl_07 <dbl>, date_month.lbl_08 <dbl>,
## #   date_month.lbl_09 <dbl>, date_month.lbl_10 <dbl>, …

Model Specs

linear regression
xgboost

model_spec_lm <- linear_reg(mode = "regression") %>%
    set_engine("lm")

model_spec_xgb <- boost_tree(mode = "regression") %>% 
  set_engine("xgboost")

model_spec_prophet <- prophet_reg(mode = "regression") %>% 
  set_engine("prophet")

Workflow

We can add our recipe and model to a workflow

workflow_lm <- workflow() %>%
    add_recipe(recipe_spec_final) %>%
    add_model(model_spec_lm)

workflow_xgb <- workflow() %>% 
  add_recipe(recipe_spec_final) %>% 
  add_model(model_spec_xgb)

workflow_prophet <- workflow() %>% 
  add_recipe(recipe_spec_prophet) %>% 
  add_model(model_spec_prophet)

Fit our Model

workflow_fit_lm <- 
  workflow_lm %>% 
  fit(data = training(splits))

workflow_fit_xgb <- 
  workflow_xgb %>% 
  fit(data = training(splits))

workflow_fit_prophet <-
  workflow_prophet %>% 
  fit(data = training(splits))

model_table <- modeltime_table(
  workflow_fit_lm,
  workflow_fit_xgb,
  workflow_fit_prophet
) 

calibration_table <- model_table %>%
  modeltime_calibrate(testing(splits))

Measure Accuracy

calibration_table %>% 
    modeltime_accuracy(acc_by_id = FALSE) %>% 
    table_modeltime_accuracy(.interactive = FALSE)

.model_id	.model_desc	.type	mae	mape	mase	smape	rmse	rsq
Accuracy Table
1	LM	Test	242994.3	16.19	0.90	15.14	313485.5	0.80
2	XGBOOST	Test	260391.4	18.67	0.97	16.33	352541.8	0.76
3	PROPHET W/ REGRESSORS	Test	306623.9	21.62	1.14	18.85	384166.2	0.79

Forcasting Results

Lets visualize each model’s prediction accuracy

calibration_table %>%
  modeltime_forecast(actual_data = df_ts) %>%
  plot_modeltime_forecast(.interactive = FALSE) +
  labs(
    y = "Revenue",
    title = "Revenue",
    subtitle = "Actual vs Predicted"
  ) +
  theme(legend.position = "none")

Sources: Matt Dancho, Julia Silge

Posted on:: November 28, 2022

Length:: 8 minute read, 1641 words

Tags:: machine learning regression feature importance time-series forecasting

See Also:: Predicting customer lifetime value; Predicting chocolate bar ratings