A Forecasting Ensemble That Actually Ships

The number that moved the business was not an accuracy number. It was weeks of inventory on hand, and it came down hard enough that the finance team asked me twice if the report was wrong. That is the only forecasting result anyone in the room actually cared about, and it is worth remembering before any of the model talk, because the model talk is the part that nearly seduced me into solving the wrong problem.

This was at a marketplace I led, where we forecast demand across a catalog in the hundreds of thousands of SKUs to feed replenishment and warehouse allocation. The brief sounded simple: predict next-week demand per SKU per region, well enough that we could hold less stock without running dry. Hold too much and you are financing dead inventory in a warehouse. Hold too little and you stock out, lose the sale, and (the part that hurts later) corrupt your own training data, because a stockout looks exactly like low demand to a model that only sees units sold.

Why no single model was enough

I tried to make one model carry it. I tried hard, because one model is so much easier to serve, retrain, and explain to a skeptical ops director. None of them was good enough alone, and they failed in different places, which turned out to be the whole point.

The classical statistical model, a seasonal ARIMA per series, was excellent on the high-volume SKUs with clean, regular history. Stable demand, strong weekly seasonality, it nailed those. It fell apart on anything sparse or spiky, and it had no way to see a promotion coming because it only knew its own past.

The sequence model, an LSTM trained across many series at once, learned shared patterns the per-series models never could. It borrowed strength from neighbors, which is exactly what you want for a SKU with three months of history. It was also a moody overfitter that would invent confident demand for items that had simply gone quiet.

Gradient boosting (XGBoost, fed with lags, calendar features, price, and promo flags) was the workhorse. It ate the messy external signals the other two could not touch. On its own it was jumpy week to week and weak on long seasonal cycles.

Here is the part nobody tells you. The ensemble did not win because three models are smarter than one. It won because the models were wrong about different SKUs, and averaging away uncorrelated errors is close to a free lunch in this game. The lift over the best single model was real and it was boring: the blend was never the best on any one series, and it was rarely the worst, and across the whole catalog that consistency is what let us drop the safety stock.
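To make the uncorrelated-errors point concrete, here is a toy simulation, not our data. It assumes three hypothetical models whose errors are independent, zero-mean, and equally noisy, which is the most favorable case; real model errors are partly correlated, so the real gain is smaller. Under these assumptions the equal-weight blend's error shrinks by roughly a factor of the square root of three.

```python
import random
import statistics


def mean_abs(values):
    """Mean absolute value of a sequence of errors."""
    return statistics.fmean(abs(v) for v in values)


def simulate_blend(n_points=5000, sigma=10.0, n_models=3, seed=0):
    """Compare each model's MAE with the MAE of an equal-weight blend.

    Assumption for illustration only: every model's error is independent,
    zero-mean Gaussian noise with the same standard deviation.
    """
    rng = random.Random(seed)
    errors = [[rng.gauss(0.0, sigma) for _ in range(n_points)]
              for _ in range(n_models)]
    single_mae = [mean_abs(e) for e in errors]
    # The blend's error at each point is the average of the models' errors.
    blend_errors = [sum(col) / n_models for col in zip(*errors)]
    return single_mae, mean_abs(blend_errors)


single_mae, blend_mae = simulate_blend()
print("single-model MAE:", [round(m, 2) for m in single_mae])
print("equal-blend MAE: ", round(blend_mae, 2))
```

The more the models' errors move together, the closer the blend's MAE drifts back toward the single-model numbers, which is why three variations of the same model buy you very little.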

The blend, written out by hand

I did not want a black-box stacker that I could not reason about at 2am during a stockout. So the blend is a per-segment weighted average, with weights learned on a recent validation window, not the full history. Recency matters because the right weighting drifts: promos got more aggressive over the year, and the boosting model’s weight crept up because it was the only one of the three that could see them coming.

import numpy as np
from scipy.optimize import minimize

def fit_blend_weights(preds, actual):
    # preds: dict of model_name -> array of predictions over the
    # validation window. actual: the true demand over that window.
    # Call this once PER SEGMENT (fast/slow/new movers), on that
    # segment's rows only, because a weighting that's right for a
    # steady SKU is wrong for a three-week-old one. Global weights were
    # the first thing I tried and the first thing the slow movers
    # punished me for.
    names = sorted(preds)
    P = np.vstack([preds[n] for n in names]).T   # rows = observations, cols = models
    actual = np.asarray(actual, dtype=float)

    def normalize(w):
        # Non-negative weights that sum to one. If the optimizer pushes
        # every weight to zero or below, fall back to an equal blend
        # instead of dividing by zero.
        w = np.clip(w, 0, None)
        total = w.sum()
        return w / total if total > 0 else np.full(len(w), 1.0 / len(w))

    def weighted_mae(w):
        blend = P @ normalize(w)                 # keep it a real average
        # MAE, not MSE. Squared error lets a couple of promo spikes
        # bully the weights toward whichever model chases spikes,
        # and that model is usually wrong the other fifty weeks.
        return np.mean(np.abs(blend - actual))

    w0 = np.ones(len(names)) / len(names)
    res = minimize(weighted_mae, w0, method="Nelder-Mead")
    return dict(zip(names, normalize(res.x)))

That is the entire clever part, and it is not very clever. Non-negative weights that sum to one, optimized for absolute error on a rolling window, fit separately for fast movers, slow movers, and new SKUs. We retrained the weights weekly. We retrained the base models much less often, because the base models were stable and the right blend was not.
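The per-segment, recent-window part can be sketched as a thin wrapper around whatever fitter you use. This is an illustration, not our production job: the row layout (dicts with "segment", "week", "actual", and "preds" keys), the integer week index, and the eight-week default window are all assumptions made for the example.

```python
from collections import defaultdict


def fit_weights_by_segment(rows, fit_fn, window_weeks=8):
    """Fit blend weights separately for each segment on a recent window.

    rows: iterable of dicts with hypothetical keys
        "segment" (str), "week" (int index), "actual" (float),
        "preds" (dict of model_name -> float).
    fit_fn: callable(preds, actual) -> dict of model_name -> weight.
    """
    rows = list(rows)
    if not rows:
        return {}
    latest_week = max(r["week"] for r in rows)
    first_week = latest_week - window_weeks + 1

    # Keep only the recent window, grouped by segment.
    by_segment = defaultdict(list)
    for r in rows:
        if r["week"] >= first_week:
            by_segment[r["segment"]].append(r)

    weights = {}
    for segment, seg_rows in by_segment.items():
        names = sorted(seg_rows[0]["preds"])
        preds = {n: [r["preds"][n] for r in seg_rows] for n in names}
        actual = [r["actual"] for r in seg_rows]
        weights[segment] = fit_fn(preds, actual)
    return weights
```

Passing fit_blend_weights as fit_fn gives one weight dict per segment. Rerunning this weekly with a sliding window is what lets the weights drift while the base models stay put.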

The serving headache nobody budgets for

One model is one artifact, one latency profile, one thing to monitor. An ensemble is three model lifecycles that have to agree on a schedule, plus a blending layer, plus the truth that your forecast is only as fresh as your slowest model. The LSTM needed a GPU box and a batch window. ARIMA was cheap but had to be fit per series, which at this catalog size is a lot of tiny fits. XGBoost was fast to score and slow to feature-engineer.

We ran them as separate batch jobs writing predictions to a shared table, keyed by SKU, region, and forecast week, with the model name as a column. The blender read that table, applied the per-segment weights, and wrote the final number. The discipline that saved us: if any base model was stale or missing for a SKU, we did not silently blend the survivors. We fell back to a named, logged default and flagged it, because a quietly degraded forecast feeding replenishment is worse than an obviously broken one. (We learned this the week the LSTM job failed and nobody noticed for two cycles. The blend kept producing perfectly plausible, perfectly worse numbers.)
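A minimal sketch of that discipline, assuming a few things the story above does not specify: the model names, the shape of the prediction records, a week-index staleness check, and a fallback whose name and value are supplied by the caller. What the default should actually be is a business decision this sketch does not make.

```python
import logging

logger = logging.getLogger("blender")

# Hypothetical model names; use whatever the shared table's model column holds.
REQUIRED_MODELS = ("arima", "lstm", "xgboost")


def blend_forecast(key, preds, weights, run_week, fallback_name, fallback_value,
                   max_age_weeks=0):
    """Blend one (sku, region, forecast_week) key, or fall back loudly.

    preds: dict of model_name -> (prediction, week_produced)
    weights: dict of model_name -> weight for this key's segment
    """
    problems = []
    for name in REQUIRED_MODELS:
        if name not in preds:
            problems.append(f"{name}:missing")
        elif run_week - preds[name][1] > max_age_weeks:
            problems.append(f"{name}:stale")

    if problems:
        # Never renormalize over the survivors: that is the silent failure.
        logger.warning("fallback %s for %s: %s", fallback_name, key, problems)
        return {"forecast": fallback_value, "source": fallback_name,
                "degraded": True, "problems": problems}

    value = sum(weights[name] * preds[name][0] for name in REQUIRED_MODELS)
    return {"forecast": value, "source": "blend",
            "degraded": False, "problems": []}
```

Writing the source and the degraded flag next to the number means a downstream dashboard can count fallbacks per cycle, which is the alarm we did not have the week the LSTM job died.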

The boring problems that actually decided it

Now the real work, the part that moved the inventory number more than any model swap.

Stockouts. A day with zero sales because you were sold out is not zero demand, but your sales table cannot tell the difference. Train on raw units sold and you teach the model to under-forecast exactly the items that already hurt you, a feedback loop that quietly starves your bestsellers. We reconstructed censored demand: where availability dropped below a threshold, we treated the observation as a lower bound and imputed from the pre-stockout run rate. Crude, and it mattered more than the LSTM.

Promotions. A promo spike is real demand, but it is demand the model should attribute to the promo, not to next Tuesday. Without a clean promo calendar as a feature, every model learns that random Tuesdays sometimes triple, and then hedges upward forever. Getting the marketing team’s promo plan into the feature pipeline, reliably and ahead of time, did more for accuracy than anything I trained.
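The "ahead of time" part is easy to get wrong when building training data, so here is a hedged sketch of a promo flag that only uses plan entries that had been published when the forecast was made. The plan layout ("sku", "start", "end", "published_on") is hypothetical, not the marketing team's actual format.

```python
from datetime import date, timedelta


def promo_flag(sku, week_start, forecast_made_on, promo_plan):
    """Return 1 if a promo known at forecast time overlaps the forecast week.

    promo_plan: list of dicts with hypothetical keys
        "sku", "start", "end" (inclusive dates), "published_on" (date).
    """
    week_end = week_start + timedelta(days=6)
    for promo in promo_plan:
        if promo["sku"] != sku:
            continue
        # Ignore promos that were entered after the forecast was produced;
        # using them in training would leak information serving never has.
        known_in_time = promo["published_on"] <= forecast_made_on
        overlaps_week = promo["start"] <= week_end and promo["end"] >= week_start
        if known_in_time and overlaps_week:
            return 1
    return 0


plan = [{"sku": "A", "start": date(2024, 3, 6), "end": date(2024, 3, 8),
         "published_on": date(2024, 2, 20)}]
print(promo_flag("A", date(2024, 3, 4), date(2024, 2, 26), plan))
```

If promos routinely land in the plan after the forecast runs, this flag will be zero in production no matter how clean the backfilled history looks, and the model will have been trained on a signal it never gets to see.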

Holidays. Obvious in theory, painful in practice across regions with different calendars and a lunar new year that walks around the solar one. A holiday feature that is off by a few days is worse than no holiday feature, because the model anchors to the wrong week.
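One way to keep a moving holiday from anchoring to the wrong week is to feed the model a signed distance to the nearest holiday from a per-region calendar, not a fixed week-of-year flag. The sketch below assumes made-up region codes and a tiny sample calendar; the two Lunar New Year dates are real (10 February 2024 and 29 January 2025), everything else about the layout is illustrative.

```python
from datetime import date

# Illustrative sample only; a real calendar needs every region and holiday.
SAMPLE_CALENDARS = {
    "CN": [date(2024, 2, 10), date(2025, 1, 29)],   # Lunar New Year
    "US": [date(2024, 11, 28)],                     # Thanksgiving 2024
}


def days_to_nearest_holiday(region, week_start, calendars, cap=28):
    """Signed days from week_start to the nearest holiday in the region.

    Positive means the holiday is still ahead, negative means it has passed.
    Returns None when the region has no holiday within `cap` days.
    """
    offsets = [(h - week_start).days for h in calendars.get(region, [])]
    if not offsets:
        return None
    nearest = min(offsets, key=abs)
    return nearest if abs(nearest) <= cap else None


# Week 5 of 2025 is holiday week in this calendar; week 5 of 2024 is not.
print(days_to_nearest_holiday("CN", date(2025, 1, 27), SAMPLE_CALENDARS))
print(days_to_nearest_holiday("CN", date(2024, 1, 29), SAMPLE_CALENDARS))
```

A week-of-year feature would treat those two weeks as the same thing, which is exactly the anchoring failure described above.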

def clean_demand(units_sold, in_stock_ratio, run_rate, floor=0.2):
    # If we were mostly out of stock, units_sold is a lower bound on
    # demand, not demand. Impute from the pre-stockout run rate so the
    # model doesn't learn to starve the exact SKUs that stocked out.
    # This one function changed the inventory number more than swapping
    # any base model did. Sit with that.
    if in_stock_ratio < floor:
        return max(units_sold, run_rate)
    return units_sold

If I could give one thing to a team starting this, it would not be a model. It would be a clean, trustworthy record of when each SKU was actually available to be bought, and a promo calendar that lands in the feature store before the promo runs, not after. Get those, and a dull three-model average will cut your inventory. Skip them, and the fanciest architecture on the leaderboard will confidently forecast the past you accidentally taught it.

So which model should you reach for? Wrong question. Ask which of your data problems is silently training the model to be wrong, and fix that first.