The Real Cost of a Customer Data Platform

The pitch is always the same, and it is always seductive. One view of the customer. Stop the four teams from each storing their own half-right copy of who this person is. Marketing stops paying to acquire users it already has. Support stops asking for an order number the system should already know. Everyone nods. The board likes it. Then you go build it and find out that a customer data platform is not really a data project at all.

I have built one of these across several business units of a consumer group: a marketplace, a payments arm, a logistics operation, and a lending product, each with its own database, its own idea of what a “user” is, and its own quiet conviction that its copy is the real one. The savings were real. We clawed back a serious chunk of marketing spend by stopping the group from buying back its own customers. But the part nobody puts in the business case is that a CDP is a standing cost, not a project with an end date. You do not finish it. You operate it forever.

The collector is the easy hard part

Start with the firehose, because everyone wants to start with the firehose. A real CDP ingests behavioral events: page views, taps, add-to-carts, payments, deliveries, app opens. At group scale this is a meaningful river of data per day, every day, with sharp peaks during sales. The collector that sits in front of it has one job and it is harder than it sounds: accept everything, lose nothing, and never become the reason a checkout is slow.

That last clause is the whole design. The collector cannot be in the critical path of anything that makes money. So it is asynchronous, it writes to a durable log first and asks questions later, and it is engineered to fail open. If the enrichment service is down, the event still lands on the log raw, and we backfill. The cardinal sin is dropping an event because some downstream consumer was unhappy. Events are facts that already happened. You do not get to refuse them.
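To make "log first, questions later" concrete, here is a minimal sketch, not the system described above. It assumes an in-memory list standing in for the durable log, and the names `DurableLog`, `collect`, and `enrich_pass` are invented for illustration. The point is the shape: the hot path only appends, and enrichment failure leaves work to backfill rather than a lost event.

```python
import json
import time
from typing import Callable


class DurableLog:
    """In-memory stand-in for an append-only log such as a Kafka topic."""

    def __init__(self) -> None:
        self.records: list[str] = []

    def append(self, record: dict) -> int:
        self.records.append(json.dumps(record))
        return len(self.records) - 1  # offset of the record just written


def collect(event: dict, log: DurableLog) -> int:
    """Hot path: persist the raw event and return. No downstream calls here."""
    return log.append({"received_at": time.time(), "raw": event})


def enrich_pass(log: DurableLog, enrich: Callable[[dict], dict],
                enriched: dict[int, dict]) -> set[int]:
    """Off the hot path: enrich what we can, return offsets left to backfill."""
    pending: set[int] = set()
    for offset, line in enumerate(log.records):
        if offset in enriched:
            continue
        try:
            enriched[offset] = enrich(json.loads(line)["raw"])
        except Exception:
            pending.add(offset)  # fail open: the raw event is already on the log
    return pending


def enricher_down(event: dict) -> dict:
    raise ConnectionError("enrichment service is down")


def enricher_up(event: dict) -> dict:
    return {**event, "country": "unknown"}  # hypothetical enrichment


log = DurableLog()
collect({"type": "add_to_cart", "user_id": "u1"}, log)
enriched: dict[int, dict] = {}
print(enrich_pass(log, enricher_down, enriched))  # {0}: nothing lost, one to backfill
print(enrich_pass(log, enricher_up, enriched))    # set(): backfill caught up
```

The second call is the backfill. Nothing about the collector changed between the two runs, which is the property you want.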

Here is the part nobody tells you about the collector: the hard problem is not throughput, it is the schema of what flows through it. Throughput you can buy. Kafka and a few partitions will eat almost anything you throw at them. What you cannot buy is agreement on what an add_to_cart event means when four teams emit one and three of them spell it differently.

Event schema governance, or the slow death by a thousand fields

Every team will invent its own event shapes if you let them, and you did let them, which is why you are here. One team sends userId, another sends user_id, a third sends customerRef and means something subtly different. One team’s price is in cents, another’s is in dollars, a third’s includes tax. Multiply that by every event type and a few years of nobody enforcing anything, and your single source of truth is a single source of plausible-looking garbage.

So you put a contract in front of the collector. Events are validated against a registered schema at the door, and an event that does not match a known shape gets quarantined, not silently accepted. This is unglamorous and it is the most valuable thing in the system. (We treated a schema change like an API change, with versioning and a deprecation window, because that is what it is.) The day we started quarantining malformed events at ingest instead of cleaning them up in a nightly batch was the day the data stopped lying to us.
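A schema gate can be sketched in a few lines. This one is an assumption-heavy toy: the `SCHEMAS` registry, the `payload` envelope, and the field names are all hypothetical, and a real deployment would more likely use a schema registry with Avro, Protobuf, or JSON Schema. What it does show is the behavior that matters: a mismatched event is kept, with its reasons, in a quarantine rather than accepted or thrown away.

```python
from dataclasses import dataclass, field

# Hypothetical registry: (event type, schema version) -> required fields and types.
SCHEMAS = {
    ("add_to_cart", 1): {"user_id": str, "sku": str, "price_cents": int},
}


def validate(event: dict) -> list[str]:
    """Return a list of problems; an empty list means the event matches."""
    schema = SCHEMAS.get((event.get("type"), event.get("schema_version")))
    if schema is None:
        return ["unknown event type or schema version"]
    payload = event.get("payload")
    if not isinstance(payload, dict):
        return ["payload must be an object"]
    problems = []
    for name, expected in schema.items():
        if name not in payload:
            problems.append(f"missing field: {name}")
        elif type(payload[name]) is not expected:
            got = type(payload[name]).__name__
            problems.append(
                f"wrong type for {name}: expected {expected.__name__}, got {got}")
    for name in sorted(payload.keys() - schema.keys()):
        problems.append(f"unexpected field: {name}")
    return problems


@dataclass
class Gate:
    accepted: list = field(default_factory=list)
    quarantined: list = field(default_factory=list)

    def submit(self, event: dict) -> bool:
        problems = validate(event)
        if problems:
            # Keep the event and the reasons so the emitting team can fix and replay.
            self.quarantined.append({"event": event, "problems": problems})
            return False
        self.accepted.append(event)
        return True


gate = Gate()
good = {"type": "add_to_cart", "schema_version": 1,
        "payload": {"user_id": "u1", "sku": "sku-9", "price_cents": 1299}}
bad = {"type": "add_to_cart", "schema_version": 1,
       "payload": {"userId": "u1", "sku": "sku-9", "price_cents": 12.99}}
gate.submit(good)
gate.submit(bad)
for problem in gate.quarantined[0]["problems"]:
    print(problem)
```

The bad event trips three checks at once: the misspelled id, the dollars-as-float price, and the unregistered field. That is roughly what one afternoon of an ungoverned event stream looks like.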

Identity resolution is the actual product

This is where the project earns the word “platform.” A raw event has identifiers attached: a logged-in user id, a device id, an email, a phone number, a cookie, a hashed card fingerprint. None of them are complete. The logged-out browsing session has a cookie and nothing else. The payment has a card and a phone. The support ticket has an email. Your job is to decide, continuously, which of these belong to the same human.

[Figure: four business units feed one collector; the schema gate rejects malformed events before they pollute anything; identity resolution stitches fragments into a profile that ML and marketing both read from.]

We ran it as an identity graph. Identifiers are nodes, observed co-occurrences are edges (this card was used by this logged-in account, this device signed into this email), and a connected component is a person. The graph is never done. New edges arrive forever, two components that looked separate suddenly merge when one event links them, and a merge can rewrite history for thousands of profiles at once.
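The connected-component idea maps naturally onto a union-find structure. The sketch below is one common way to do it, not necessarily how any production graph is stored; the `IdentityGraph` class and the `(kind, value)` identifier tuples are illustrative assumptions.

```python
from collections import defaultdict


class IdentityGraph:
    """Union-find over identifiers; each connected component is one person."""

    def __init__(self) -> None:
        self.parent: dict = {}

    def find(self, node):
        self.parent.setdefault(node, node)
        root = node
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the walk straight at the root.
        while self.parent[node] != root:
            self.parent[node], node = root, self.parent[node]
        return root

    def link(self, a, b) -> None:
        """Record an observed co-occurrence, merging components if needed."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def profiles(self) -> list[set]:
        groups = defaultdict(set)
        for node in list(self.parent):
            groups[self.find(node)].add(node)
        return list(groups.values())


graph = IdentityGraph()
graph.link(("account", "a1"), ("card", "k1"))    # card used by logged-in account
graph.link(("device", "d1"), ("email", "e1"))    # device signed into email
print(len(graph.profiles()))  # 2: two people, as far as we know

graph.link(("account", "a1"), ("device", "d1"))  # one event links them
print(len(graph.profiles()))  # 1: the components merged
```

Note what is missing: union-find merges cheaply but has no inverse operation. Once two components are joined, the structure has forgotten which edge did it.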

And you will get it wrong. The scar I carry from this is a bad merge: a shared family device and a recycled phone number collapsed two real people into one profile. Suddenly a customer is seeing recommendations for someone else’s purchases, and in a payments context that is not a cute bug, it is a trust and privacy incident. So identity resolution needs the thing every demo skips: a confidence threshold, a way to split a component back apart, and an audit trail of why two things were ever joined. Merging is easy. Un-merging, cleanly, after a marketing campaign already fired on the bad profile, is the engineering you actually get paid for.
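One way to keep merges reversible is to treat the edges, not the merged profiles, as the source of truth and recompute components from them. The following is a hedged sketch under that assumption: `Edge`, `MERGE_THRESHOLD`, the confidence scores, and the reasons are all made up for illustration, and recomputing everything from scratch is only viable here because the example is tiny.

```python
from dataclasses import dataclass

MERGE_THRESHOLD = 0.8  # hypothetical cut-off; tuning it is its own project


@dataclass(frozen=True)
class Edge:
    a: str
    b: str
    confidence: float
    reason: str  # the audit trail: why we believed these two belong together


def components(edges: list[Edge], threshold: float = MERGE_THRESHOLD) -> list[set]:
    """Connected components using only edges at or above the threshold."""
    adjacency: dict[str, set] = {}
    for e in edges:
        adjacency.setdefault(e.a, set())
        adjacency.setdefault(e.b, set())
        if e.confidence >= threshold:
            adjacency[e.a].add(e.b)
            adjacency[e.b].add(e.a)
    seen: set = set()
    result = []
    for start in adjacency:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(adjacency[node] - group)
        seen |= group
        result.append(group)
    return result


def audit(edges: list[Edge], group: set, threshold: float = MERGE_THRESHOLD) -> list[str]:
    """Every reason that currently holds this component together."""
    return [f"{e.a} ~ {e.b}: {e.reason}" for e in edges
            if e.a in group and e.b in group and e.confidence >= threshold]


edges = [
    Edge("acct:alice", "device:tablet", 0.95, "login from device"),
    Edge("acct:bob", "device:tablet", 0.90, "login from device"),
    Edge("acct:bob", "phone:555-0100", 0.95, "verified by one-time code"),
    Edge("acct:carol", "phone:555-0100", 0.40, "number on an old support ticket"),
]
merged = components(edges)
print(len(merged))  # 2: alice and bob collapsed via the shared tablet; carol stayed out

# Split: retract the edge the audit trail points at, then recompute.
bad_edge = edges[1]
after_split = components([e for e in edges if e is not bad_edge])
print(len(after_split))  # 3: alice, bob, and carol are separate again
```

The threshold kept the recycled phone number from pulling Carol in, but it did nothing about the shared tablet, because both logins were genuinely high-confidence. That is why the split path and the audit trail are not optional. What this sketch does not cover is the expensive part: finding and undoing whatever already acted on the merged profile.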

Real-time and batch are two different products wearing one name

There is a fork everyone wants to avoid and nobody can. Some of this has to be real-time: a fraud check needs the profile as it is right now, a session-based recommendation is worthless thirty minutes late. Some of it is fine in batch: the rebuild of the full identity graph, the heavy attribute computation, the audience export to the marketing tool overnight.

We ran both off the same event log and accepted that they would disagree. The streaming path maintains a fast, approximate profile for decisions that cannot wait. The batch path rebuilds the authoritative profile from the full history and corrects the streaming view’s sins. This is a lambda-ish split and I am not proud of the duplication, but the alternative, forcing one path to serve both latency profiles, gives you a system that is too slow for fraud and too expensive for the nightly rebuild. Pick your two products and let them reconcile.
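A toy version of the two paths shows why they disagree and how reconciliation settles it. The assumptions here are mine: the streaming path is approximate because it does not deduplicate retried events, the batch path deduplicates by a hypothetical `event_id`, and reconciliation simply overwrites the streaming view with the batch result. Real systems drift for more reasons than duplicates (late events, identity merges), but the shape is the same.

```python
def stream_update(view: dict, event: dict) -> None:
    """Streaming path: apply each event as it arrives, no lookback, no dedup."""
    profile = view.setdefault(event["user"], {"orders": 0, "spend_cents": 0})
    if event["type"] == "payment":
        profile["orders"] += 1
        profile["spend_cents"] += event["amount_cents"]


def batch_rebuild(log: list[dict]) -> dict:
    """Batch path: replay the full log, dropping duplicate event ids."""
    seen: set = set()
    view: dict = {}
    for event in log:
        if event["event_id"] in seen:
            continue
        seen.add(event["event_id"])
        stream_update(view, event)
    return view


def reconcile(streaming: dict, batch: dict) -> dict:
    """Overwrite the streaming view with the batch view; report what drifted."""
    drift = {user: {"streaming": streaming.get(user), "batch": profile}
             for user, profile in batch.items() if streaming.get(user) != profile}
    streaming.clear()
    streaming.update({user: dict(profile) for user, profile in batch.items()})
    return drift


event_log = [
    {"event_id": "e1", "user": "u1", "type": "payment", "amount_cents": 1200},
    {"event_id": "e1", "user": "u1", "type": "payment", "amount_cents": 1200},  # retry
    {"event_id": "e2", "user": "u1", "type": "payment", "amount_cents": 800},
]

streaming_view: dict = {}
for event in event_log:
    stream_update(streaming_view, event)
print(streaming_view["u1"])  # {'orders': 3, 'spend_cents': 3200}

batch_view = batch_rebuild(event_log)
print(batch_view["u1"])      # {'orders': 2, 'spend_cents': 2000}

drift = reconcile(streaming_view, batch_view)
print(sorted(drift))         # ['u1']
```

Both paths read the same log and share the same update function, which limits the duplication to the part that actually differs: how much history each one is allowed to look at.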

The political cost is the real cost

Now the part the architecture diagram cannot show you. To have one customer record, one team has to own it. The moment that is true, every other team has lost authority over its own definition of its own customer. The lending team can no longer decide unilaterally what “active” means. The marketing team cannot quietly add a field that suits this quarter’s campaign. The payments team, which is right to be paranoid, now has to trust that another team’s pipeline will not leak a card fingerprint into a marketing audience.

This is where these projects die, and they die quietly. Not in a design review, in a hallway. The data is fine and the meeting goes badly anyway, because you have asked four kingdoms to accept a shared currency they did not mint. What kept ours alive was treating the central team as a service provider with an SLA, not a tax authority. Other teams kept their systems of record. The CDP subscribed to their events and offered back something they could not build alone, the resolved cross-business profile, rather than demanding they surrender their tables.

The savings are real and they are large enough to fund the whole thing. Just go in knowing what you are signing: a system that needs a team forever, a schema that needs a referee forever, and an identity graph that will, on some quiet Tuesday, merge two people who should never have met. Single source of truth is not a state you reach. It is a thing you maintain, against entropy, for as long as the business exists.