An application of a hierarchical time series model on a dataset of historical soda sales in Greece.
On a hot summer day, you walk into the local store, looking for something to quench your thirst. You make a turn down the beverage aisle, and are greeted by a fully stocked shelf with a wide selection of ice cold sodas. It’s almost something that we take for granted, that the shelves never seem to go empty. Behind the scenes, forecasting is a critical part of every store’s operations to ensure that inventory levels are kept at just the right level. In this article, we explore the benefits of using a more granular method of time series forecasting, hierarchical time series.
The code used to create the graphics and models described below can be found here.
We will be using sales data for a chain of stores around Greece from 2012–2018. The dataset contains information about individual product sales on a monthly basis, at each of the store’s 6 locations. We are also given exogenous variables such as the population of the city in which each store is located, and the price of each product.
Using the first 6 years of data (2012–2017) as our training set, we can validate our model forecast on the last year’s (2018) data.
At each shop, 5 different brands of soda are sold, and each brand comes in 3 different container sizes. Using this relationship, and an awesome R package called fable, we can formulate a hierarchical structure on which we'll build our model.
The hierarchy we've chosen starts from the total aggregated sales at the top, which is then divided into individual store sales, then divided by brand, then by individual product. Of course, the levels of the hierarchy could be ordered differently (and another ordering might even produce a more accurate model), but if you think of what goes through a consumer's mind when buying a drink, this is quite an intuitive hierarchy. First, you decide which store to go to, then what kind of drink you want to buy, and lastly, how much (or what container) you want to buy. In total, there are 6 shops, each selling 5 different brands, each of which comes in 3 different container sizes, for a total of 90 products, or up to 90 separate time series at the bottom level.
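The article's model is built in R with fable (whose aggregate_key() function expresses exactly this structure), but the hierarchy itself is just a set of aggregation constraints. As a rough Python sketch, here is how the 90 bottom-level series roll up through the brand, shop, and total levels; the shop/brand/size labels are hypothetical stand-ins, not the dataset's real ones:

```python
from itertools import product

# Hypothetical labels standing in for the dataset's real shops/brands/sizes.
shops = [f"shop{i}" for i in range(1, 7)]     # 6 shops
brands = [f"brand{j}" for j in range(1, 6)]   # 5 brands per shop
sizes = ["small", "medium", "large"]          # 3 container sizes per brand

# Bottom level: one series key per (shop, brand, size) combination.
bottom = list(product(shops, brands, sizes))
assert len(bottom) == 90  # 6 x 5 x 3 products

def aggregate(series: dict) -> dict:
    """Roll bottom-level sales up to brand, shop, and total sums."""
    brand_totals, shop_totals, total = {}, {}, 0.0
    for (shop, brand, size), sales in series.items():
        brand_totals[(shop, brand)] = brand_totals.get((shop, brand), 0.0) + sales
        shop_totals[shop] = shop_totals.get(shop, 0.0) + sales
        total += sales
    return {"brand": brand_totals, "shop": shop_totals, "total": total}

# One month of made-up unit sales: every product sells 10 units.
month = {key: 10.0 for key in bottom}
agg = aggregate(month)
```

Counting every level, the full structure contains 1 total, 6 shop, 30 brand, and 90 product series.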
We can also visualize our time series at each level of the hierarchy. First, viewing our total sales, we can observe some clear seasonality. Sales appear to rise in the spring, peaking and dropping in the summer, briefly rising again in the fall before reaching a trough in the winter.
Moving down the hierarchy to the shop level, we can still observe strong seasonality, but irregularities start to appear in the pattern: there are larger differences in peak heights, and some years have more jagged sales patterns than others.
Digging another level deeper: each plot below represents the sales of a certain brand at a given shop. It’s tiny, but if you squint, you can see even more irregularities.
At the bottom level of the hierarchy, the time series become even more irregular. It would be quite difficult to fit all 90 plots into one picture, so below are the time series for each of the individual products sold at shop 1 (it looks similar for other shops).
As we deepen the hierarchy, there appears to be more and more noise at the bottom level. This makes sense, as when you move up the hierarchy, the time series are aggregated (in a sense, averaged), which hides some of the noise.
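This smoothing effect is easy to demonstrate with simulated data: if each of n bottom-level series carries independent noise, summing them grows the signal n-fold but the noise standard deviation only by a factor of sqrt(n), so the relative noise shrinks. A quick numpy sketch (purely illustrative, not the article's data):

```python
import numpy as np

rng = np.random.default_rng(0)

n_series, n_months = 90, 72          # 90 products, 6 years of monthly data
signal = 100.0                       # common mean level per product
noise = rng.normal(0, 20, size=(n_series, n_months))  # independent noise

bottom = signal + noise              # bottom-level series
top = bottom.sum(axis=0)             # total aggregated sales

# Coefficient of variation (noise relative to level) at each level.
cv_bottom = (bottom.std(axis=1) / bottom.mean(axis=1)).mean()
cv_top = top.std() / top.mean()

# The aggregated series is roughly sqrt(90) ~ 9.5x smoother.
```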
Now that we've specified the hierarchical structure and explored the data a little bit, two things remain to be specified: the forecasting model for each individual time series, and how the forecasts are reconciled to match our hierarchical structure.
To keep things simple, an ARIMA model is used to model each individual time series. Note that each individual bottom level series is fitted with its own ARIMA model. Of course, other forecasting models (e.g., exponential smoothing, etc.) can be used, but our objective is to compare a non-hierarchical forecast with a hierarchical one.
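In the article this is done with fable's ARIMA(), which searches for a suitable order per series. To make the "one model per bottom-level series" idea concrete, here is a sketch in Python that fits a simple AR(1), i.e. an ARIMA(1,0,0), to each of a few toy series by least squares; the data and the fixed model order are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins for individual product sales series (72 months each).
series = {f"product{i}": 100 + np.cumsum(rng.normal(0, 5, 72)) for i in range(3)}

def fit_ar1(y):
    """Least-squares fit of y[t] = c + phi * y[t-1] (an ARIMA(1,0,0) model)."""
    X = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    c, phi = np.linalg.lstsq(X, y[1:], rcond=None)[0]
    return c, phi

def forecast_ar1(y, params, steps=12):
    """Iterate the fitted recurrence forward to produce point forecasts."""
    c, phi = params
    out, last = [], y[-1]
    for _ in range(steps):
        last = c + phi * last
        out.append(last)
    return np.array(out)

# One independent model per series, mirroring the per-series ARIMA fits.
forecasts = {name: forecast_ar1(y, fit_ar1(y)) for name, y in series.items()}
```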
The reconciliation of the bottom-level forecasts is important because it ensures that our forecasts are consistent with the hierarchical structure we've specified. One might ask, "isn't it sufficient to simply add up all the bottom-level forecasts?" While that is certainly possible (the "bottom-up" approach), it is not optimal. We will be using the Minimum Trace (MinT) optimal reconciliation approach, which I believe is best explained in this book by Hyndman and Athanasopoulos. As hinted by the name, this approach minimizes the trace of the variance-covariance matrix of the reconciled forecast errors, thus minimizing the variance of the aggregated forecast errors.
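The mechanics look like this: stack the forecasts from every level into a vector y_hat, let S be the "summing matrix" that maps the bottom-level series to every level of the hierarchy, and let W be the covariance matrix of the base forecast errors. MinT reconciles via y_tilde = S (S' W^-1 S)^-1 S' W^-1 y_hat; with W = I this reduces to an ordinary least-squares projection, and bottom-up is the special case that keeps only the bottom-level forecasts. A tiny numpy sketch on a two-leaf hierarchy (total = A + B), using W = I for simplicity:

```python
import numpy as np

# Summing matrix for a 2-leaf hierarchy: rows = [total, A, B], cols = [A, B].
S = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])

# Incoherent base forecasts: the "total" forecast (12) disagrees with A+B (10).
y_hat = np.array([12.0, 6.0, 4.0])

W = np.eye(3)  # identity error covariance -> OLS reconciliation

# MinT-style reconciliation: project base forecasts onto the coherent subspace.
Winv = np.linalg.inv(W)
G = np.linalg.inv(S.T @ Winv @ S) @ S.T @ Winv
y_tilde = S @ G @ y_hat

# The reconciled forecasts are now coherent: total equals A + B.
```

Note how the disagreement between the total and its parts is spread across all three forecasts, rather than being forced onto any one level.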
Implementation and Results
Since the time required to generate the forecasts grows as we extend our hierarchy deeper, we will start from the top and work our way down.
Using the shop-level forecasts as our bottom level, we get some pretty promising results. In the image below, the sales forecasts for each shop (and total aggregated sales) are plotted on top of the actual training and test data. Visually, it’s almost impossible to tell the difference!
Let’s zoom in on just the forecast portion of the total sales, so it’s easier to examine. In the image below, the forecasts resulting from two reconciliation approaches are shown. The blue band represents the 95% prediction interval for the hierarchical forecasts reconciled using the MinT approach, and the slightly wider red band represents the forecasts if we were to forecast the total sales directly (the non-hierarchical method). The slightly bumpier black line is the actual test data.
To further compare the two methods, we can use an error metric such as MASE (a wide variety of metrics could be used; the relative performance of the two methods would be similar). The hierarchical forecast achieved a MASE of 0.817, while the non-hierarchical forecast came in with a slightly lower error at 0.793.
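As a refresher, MASE scales the mean absolute forecast error by the in-sample mean absolute error of a seasonal naive forecast, so values below 1 beat that naive benchmark. A small Python sketch with toy numbers (not the article's data):

```python
import numpy as np

def mase(actual, forecast, train, m=12):
    """Mean Absolute Scaled Error with seasonal period m (12 for monthly data).

    Scales forecast errors by the in-sample MAE of the seasonal naive
    method (predicting each point with the value m steps earlier).
    """
    actual, forecast, train = map(np.asarray, (actual, forecast, train))
    scale = np.mean(np.abs(train[m:] - train[:-m]))  # seasonal naive MAE
    return np.mean(np.abs(actual - forecast)) / scale

# Toy example with period m=3: a simple trending series.
train = np.arange(24.0)            # 24 in-sample observations
actual = np.arange(24.0, 27.0)     # next 3 true values
seasonal_naive = train[-3:]        # repeat the last "season" as the forecast
# By construction, the seasonal naive forecast scores exactly 1.0 here.
```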
We extend our hierarchy such that the bottom level is now made up of brand-level forecasts. This means we are now aggregating 30 (6 shops x 5 brands) forecasts. The aggregation occurs at the two higher levels (the shop level and the total sales level).
In the picture below, we see the forecasts at shop 6. Since we have gone lower in our hierarchy, there is more noise, and some forecast reconciliation methods do not work as well. In particular, the green line in each of the plots below represents the forecasts that would be used under the bottom-up approach, which deviate quite a bit from the actual data.
We see a similar result when these forecasts are aggregated at the shop level, and at the top level.
Even though the bottom-up approach seems to perform poorly, when we compare the MinT-reconciled hierarchical forecast with the non-hierarchical forecast, the results are still quite good. Again, we see that the prediction intervals are narrower under the hierarchical approach. MASE under the hierarchical approach is 0.821 (compared to the same non-hierarchical MASE of 0.793).
While it would be easy to extend the hierarchy further to the product level, the computing time becomes quite a burden. This hierarchy level would require computing a total of 90 forecasts, in addition to the reconciliations. As such, we will forgo this exercise, though the code required to do so has already been included in the link above.
At this point, we've observed what appears to be a bias-variance trade-off between the hierarchical and non-hierarchical forecasts. We also saw that the bias increased as we extended the depth of the hierarchy.
It is not quite visible in the plots above, but the variance of the errors also decreased as we extended the depth of the hierarchy. Creating forecasts from the brand level resulted in a variance about 5% lower than forecasts created at the shop level.
Conclusions and Further Considerations
Hierarchical forecasting is a powerful tool for extracting more information out of data that has an intrinsic hierarchical structure. The additional information results in a lower variance in the resulting forecasts, at the cost of slightly increased bias introduced by noise at the lower levels of the hierarchy.
In practice, many time series exist only in aggregated form: data at a more granular level may be hard to obtain, or the subgroups may be unclear. In these cases, cluster analysis might be used as a preliminary step to identify an effective hierarchical structure.