Need help on poor forecast results

#2673

prophet

facebook


aranes-rc

opened 2 months ago


Hi, I'm pretty new to the scene and I've been stuck for days trying to fix these random spikes in my forecasts.

My data

This is what my dataset looks like:

Date        Arrivals
2008-01-01  279338
2008-02-01  265827
2008-03-01  263862
2008-04-01  235895
2008-05-01  242822
...
2025-03-01  ...

As you can see, it's monthly data. I've followed most of the tips the docs provide for non-daily data.

My holidays are also adjusted to match the aggregated data (I'm not sure if this is what the docs are telling me to do):

Maundy Thursday: 2008-03-20 -> Maundy Thursday: 2008-03-01
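
A minimal sketch of this adjustment, assuming the holidays sit in a DataFrame with Prophet's usual `holiday` and `ds` columns. The `holidays_daily` name and its single row are only illustrative:

```python
import pandas as pd

# Illustrative daily-dated holiday table (assumed name), holding only the
# example row shown above.
holidays_daily = pd.DataFrame({
    'holiday': ['Maundy Thursday'],
    'ds': pd.to_datetime(['2008-03-20']),
})

# Snap each holiday to the first day of its month so it lines up with the
# month-start timestamps of the monthly series.
holiday_adjusted = holidays_daily.copy()
holiday_adjusted['ds'] = holiday_adjusted['ds'].dt.to_period('M').dt.to_timestamp()

# Several daily holidays can collapse onto the same month start, so keep
# one row per (holiday, month) pair.
holiday_adjusted = (
    holiday_adjusted.drop_duplicates(['holiday', 'ds']).reset_index(drop=True)
)
```

With this mapping, every holiday in a given month marks that month's single row, so the effect is estimated per month rather than per day.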

Plotting the data gives the following:

(Jan 2022 is a missing value from my dataset that I kinda just filled with a temporary 'get the mean of neighboring months' solution)
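
The temporary fill could be sketched like this. The values are synthetic placeholders, not the real arrivals, and the series name `s` is an assumption:

```python
import numpy as np
import pandas as pd

# Synthetic placeholder values around the gap; not the real arrivals data.
s = pd.Series(
    [100.0, np.nan, 140.0],
    index=pd.to_datetime(['2021-12-01', '2022-01-01', '2022-02-01']),
    name='y',
)

# method='linear' ignores the index and treats the points as equally spaced,
# so a single missing month becomes the mean of its two neighbors.
filled = s.interpolate(method='linear')
```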

In monthly data, yearly seasonality can also be modeled with binary extra regressors. In particular, the model can use 12 extra regressors like is_jan, is_feb, etc. where is_jan is 1 if the date is in Jan and 0 otherwise. This approach would avoid the within-month unidentifiability seen above. Be sure to use yearly_seasonality=False if monthly extra regressors are being added.

I also followed the tip quoted above:

months = ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec']

for i, month in enumerate(months, 1):
    prophet_df[f'is_{month}'] = (prophet_df['ds'].dt.month == i).astype(int)

for month in months:
    model.add_regressor(f'is_{month}')

# ...

Handling the COVID shock

My dataset is also affected by the COVID-19 pandemic, so I again followed most of the tips from the docs.

I mainly used the following tips:

Treating COVID-19 lockdowns as one-off holidays

import pandas as pd

# One row per shock period: 'ds' is the first affected month, 'ds_upper' the last.
lockdowns = pd.DataFrame([
    {'holiday': 'lockdown_1', 'ds': '2020-03-01', 'lower_window': 0, 'ds_upper': '2020-07-01'},
    {'holiday': 'lockdown_2', 'ds': '2020-08-01', 'lower_window': 0, 'ds_upper': '2021-03-01'},
    {'holiday': 'lockdown_3', 'ds': '2021-04-01', 'lower_window': 0, 'ds_upper': '2021-12-01'},
    {'holiday': 'recovery_phase', 'ds': '2022-01-01', 'lower_window': 0, 'ds_upper': '2022-07-01'},
    {'holiday': 'recovery_phase_2', 'ds': '2022-08-01', 'lower_window': 0, 'ds_upper': '2022-09-01'},
])

# Parse both date columns as datetimes.
for t_col in ['ds', 'ds_upper']:
    lockdowns[t_col] = pd.to_datetime(lockdowns[t_col])

# The window length is given as a number of days after 'ds'.
lockdowns['upper_window'] = (lockdowns['ds_upper'] - lockdowns['ds']).dt.days
lockdowns

Changes in seasonality between pre- and post-COVID

I'm not quite sure how to tweak the custom monthly seasonality I added here, and I could use some help :/

covid_outbreak_date = '2020-03-21'
prophet_df['pre_covid'] = pd.to_datetime(prophet_df['ds']) < pd.to_datetime(covid_outbreak_date)
prophet_df['post_covid'] = ~prophet_df['pre_covid']

monthly_period = 30.5
fourier_order = 5
model.add_seasonality(name='monthly_pre_covid', period=monthly_period, fourier_order=fourier_order, condition_name='pre_covid')
model.add_seasonality(name='monthly_post_covid', period=monthly_period, fourier_order=fourier_order, condition_name='post_covid')

# ...
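
One way to inspect what a 30.5-day seasonality sees on month-start data is to compute where each timestamp falls inside the cycle. This is only a sketch: it assumes the seasonal features are sines and cosines of 2*pi*n*t/period with t measured in days since the Unix epoch, and the names `t_days`, `phase` and `first_sin` are illustrative.

```python
import numpy as np
import pandas as pd

# Month-start timestamps, like the 'ds' column of a monthly series.
ds = pd.date_range('2008-01-01', periods=12, freq='MS')

# Assumption: time is measured in days since the Unix epoch.
t_days = (ds - pd.Timestamp('1970-01-01')).days.to_numpy()

period = 30.5

# Position of each timestamp inside the 30.5-day cycle, in [0, 1).
phase = (t_days % period) / period

# First-order sine feature evaluated at the same timestamps.
first_sin = np.sin(2 * np.pi * t_days / period)

print(pd.DataFrame({
    'ds': ds,
    'phase': phase.round(3),
    'sin_1': first_sin.round(3),
}))
```

Because calendar months are 28 to 31 days long, the phase drifts from one row to the next instead of repeating, which may be worth keeping in mind when reading the seasonality panels.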

My model

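import pandas as pd
from prophet import Prophet

# Monthly dummies replace the built-in yearly seasonality (see the docs tip above).
model = Prophet(
    yearly_seasonality=False,
    seasonality_mode='multiplicative',
    holidays=pd.concat([lockdowns, holiday_adjusted])
)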

months = ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec']

for i, month in enumerate(months, 1):
    prophet_df[f'is_{month}'] = (prophet_df['ds'].dt.month == i).astype(int)

for month in months:
    model.add_regressor(f'is_{month}')

covid_outbreak_date = '2020-03-21'
prophet_df['pre_covid'] = pd.to_datetime(prophet_df['ds']) < pd.to_datetime(covid_outbreak_date)
prophet_df['post_covid'] = ~prophet_df['pre_covid']

monthly_period = 30.5
fourier_order = 10
model.add_seasonality(name='monthly_pre_covid', period=monthly_period, fourier_order=fourier_order, condition_name='pre_covid')
model.add_seasonality(name='monthly_post_covid', period=monthly_period, fourier_order=fourier_order, condition_name='post_covid')

model.fit(prophet_df)

future = model.make_future_dataframe(periods=12*6, freq='MS')

future['pre_covid'] = pd.to_datetime(future['ds']) < pd.to_datetime(covid_outbreak_date)
future['post_covid'] = ~future['pre_covid']

for i, month in enumerate(months, 1):
    future[f'is_{month}'] = (future['ds'].dt.month == i).astype(int)

forecast = model.predict(future)

plot_components displays the following:

The cross-validation errors are extremely high.

I need help!!!

Is it overfitting?
Also, how can I tell whether my model is overfitting?
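
One common heuristic, sketched here under the assumption that actuals and predictions are available for both the training window and a held-out window, is to compare the error on data the model was fitted to with the error on data it never saw. The helper names `mape` and `overfit_ratio` and all numbers below are illustrative and not part of Prophet:

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))


def overfit_ratio(train_true, train_pred, test_true, test_pred):
    """Held-out error divided by in-sample error.

    Values far above 1 hint that the model fits the training data much
    better than it generalizes.
    """
    return mape(test_true, test_pred) / mape(train_true, train_pred)


# Synthetic numbers for illustration only.
train_true = [100, 110, 120, 130]
train_pred = [101, 109, 121, 129]
test_true = [140, 150]
test_pred = [170, 110]

ratio = overfit_ratio(train_true, train_pred, test_true, test_pred)
```

A small in-sample error combined with a much larger held-out error is the usual warning sign; a ratio near 1 suggests the model is at least consistent, even if both errors are high.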
