
There needs to be a better way to stitch datasets together

Winston Li

Last week, I met someone for lunch and during our conversation, they asked me how I came up with the idea of building the Synthetic Society. My answer was simple: There needs to be a more versatile way to stitch datasets together.

What do I mean by that? Well, in the consumer analytics space, data comes in all shapes and forms, and it needs to be worked into something that can be used in many different ways. Plus, it has to connect with what's happening at the national level. Some datasets are snapshots of certain groups, while others are designed to represent the whole country.

Some datasets are long, meaning they have a few columns and many rows. Card transactions and mobility data often have this format: you get a ton of records per day, but each record comes with just a few columns like ID, timestamp, lat/long, or transaction amount. Others are wide, with fewer rows but lots of columns. Survey data is like this: you can ask respondents a lot of questions, but you can't sample too many of them. As a concrete comparison, one of the large US panels we license has a sample size of 100,000 with 10,000 questions (columns) per respondent. Our US mobility data, on the other hand, tracks 180 million devices and produces between one and three billion readings per day, yet each reading carries only four useful fields: timestamp, device ID, latitude, and longitude.
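To make "long" and "wide" concrete, here is a toy sketch in pandas; the column names and values are illustrative, not the actual schemas of the panels and feeds described above:

```python
import pandas as pd

# "Long" data: a huge number of rows, only a handful of columns (mobility-style).
mobility = pd.DataFrame({
    "device_id": ["d1", "d1", "d2"],
    "timestamp": pd.to_datetime(["2024-05-01 08:00", "2024-05-01 09:30", "2024-05-01 08:15"]),
    "lat": [43.65, 43.66, 49.28],
    "lon": [-79.38, -79.40, -123.12],
})

# "Wide" data: far fewer rows, but thousands of columns (survey-style).
survey = pd.DataFrame({
    "respondent_id": ["r1", "r2"],
    "q0001_age_band": ["25-34", "45-54"],
    "q0002_household_income": ["50-75k", "100-150k"],
    # ...a real panel would have thousands more question columns
})
```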

If our goal is to create a data fusion of all these sources so that information can cross-pollinate, how do we even start? There is no guarantee that individuals from Source A will overlap with those from Source B, and even if the intersection is non-empty, it'll likely be small. Say Source A covers 10% of the population (a very large sample by market research standards) and Source B covers another 10%. If both are stratified random samples, the expected overlap is only about 1% of the population. So if we want a dataset of 10,000 people carrying variables from both sources, the underlying population needs to be one million, which means matching two datasets of 100,000 people each just to keep 10,000 of them. Adding a third source with a similar sampling rate shrinks the overlap to just 0.1%. As you can see, sparsity becomes a problem really quickly.
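Here is the back-of-envelope arithmetic as a quick sketch, assuming each source is an independent stratified random sample of the same population (the numbers mirror the example above):

```python
# Expected overlap when independently sampled sources are stitched together.
population = 1_000_000
sampling_rates = [0.10, 0.10]      # Source A and Source B each cover 10% of the population

overlap_rate = 1.0
for rate in sampling_rates:
    overlap_rate *= rate           # 0.10 * 0.10 = 0.01, i.e. 1% of the population

print(overlap_rate * population)            # ~10,000 people appear in both sources
print(0.10 * 0.10 * 0.10 * population)      # add a third 10% source: ~1,000 people (0.1%)
```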

Datasets venn diagram

But wait! That's not the only problem. Granularity is an issue too. Some datasets are individual-level while others are aggregated. Individual-level data, called microdata in statistics, contains one record per respondent; in the aggregated case, respondents are averaged over a geography of tens or hundreds of people (such as a postal code or census block). Let's imagine this: a brand has data about what products each customer buys, and they wish to combine that with an aggregated source for family income. How do they match a customer to a geography containing a few hundred people? Do they assume that a customer's family income equals the average of their postal code? Or do they assume everyone in the census block is a potential buyer of their product because one person made a purchase last week?
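To see the mismatch concretely, here is a toy sketch; the tables, column names, and values below are hypothetical and only illustrate the shape of the problem:

```python
import pandas as pd

# Individual-level purchase data (microdata): one row per customer.
purchases = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3"],
    "postal_code": ["M5V", "M5V", "V6B"],
    "product": ["shampoo", "shampoo", "conditioner"],
})

# Aggregated income data: one row per postal code, not per person.
income = pd.DataFrame({
    "postal_code": ["M5V", "V6B"],
    "avg_family_income": [92_000, 88_000],
})

# The only join available is on geography, which silently assumes every
# customer earns exactly the average income of their postal code.
merged = purchases.merge(income, on="postal_code", how="left")
print(merged)
```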

To make things even more complicated, scalability is another issue. We have many datasets: some are long, some are wide, some are microdata, and some are aggregated. How do we combine them all? On top of all this, we can't use PII to match anyone because of data privacy laws.

The solution is the SynC algorithm, which takes aggregated data and reverse-engineers the aggregation process to reconstruct microdata. The census is the most comprehensive source available, so we always begin there and use it as the basis for a 1:1 statistical reconstruction of the true population. If the census says there are 40 million people in Canada, then the reconstruction needs to have 40 million synthetic individuals. If the census says Bibb County in Alabama has such-and-such averages, then the synthetic individuals there need to have the same averages. Some people have trouble wrapping their heads around the concept and think synthetic individuals are fake people. This is not the case. Think of it like any output produced by GenAI: it has no real authorship, but it is close enough to real observations to still provide lots of value. ChatGPT learns from human-written text and generates synthetic text; Midjourney learns from photos of humans and generates synthetic photos; for us, SynC picks up patterns from real datasets and creates or enriches synthetic individuals.
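As a heavily simplified illustration of the downscaling idea, and not the actual SynC algorithm, the sketch below turns published counts and averages for two hypothetical geographies into synthetic individual records; the geography IDs, the normal age model, and the spread of 12 years are all assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative stand-ins for census tables: a population count and an average age per geography.
geos = [
    {"geo_id": "block_001", "population": 500, "avg_age": 41.2},
    {"geo_id": "block_002", "population": 320, "avg_age": 35.7},
]

synthetic_people = []
for geo in geos:
    # Draw one synthetic individual per reported resident, with ages centred on the
    # published average so the reconstructed block approximately matches the aggregate.
    ages = rng.normal(loc=geo["avg_age"], scale=12.0, size=geo["population"])
    ages = np.clip(ages, 0, 100)
    synthetic_people.extend(
        {"geo_id": geo["geo_id"], "age": round(float(age))} for age in ages
    )

print(len(synthetic_people))  # equals the published population count (820)
```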

synthetic population

Once the base population is established, the rest is easy. Bigger datasets are simply matched to the base population (even without PII, we can still match with a high level of statistical confidence on age, gender, behaviors, income, visitations, etc.); smaller datasets are used to train machine learning models, which are then projected onto the entire base population, as in the sketch below. Other aggregated sources can be joined to the census by geography and downscaled in exactly the same way as the census itself. This gives us a universal way of stitching together datasets of all formats.
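Here is a minimal sketch of that projection step for a small survey, using made-up features and a plain logistic regression rather than our actual models: a variable observed only for survey respondents is learned from attributes the base population also has, then scored for every synthetic individual.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy survey microdata: 1,000 respondents with a variable we want to spread to
# everyone (say, purchase intent). Ages in years, income in thousands of dollars.
survey_age = rng.integers(18, 80, size=1_000)
survey_income = rng.normal(70, 25, size=1_000)
survey_intent = (rng.random(1_000) < 0.2 + 0.002 * survey_income).astype(int)

X_survey = np.column_stack([survey_age, survey_income])
model = LogisticRegression(max_iter=1_000).fit(X_survey, survey_intent)

# The synthetic base population already carries age and income for every individual,
# so the trained model can score all of them, not just the survey respondents.
base_age = rng.integers(18, 80, size=100_000)
base_income = rng.normal(70, 25, size=100_000)
X_base = np.column_stack([base_age, base_income])

propensity = model.predict_proba(X_base)[:, 1]  # one score per synthetic person
print(propensity.mean())
```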

Think of this like making soup. The census is the soup base - it's less rich in flavour (fewer variables) but fills the pot homogeneously (no underrepresentation). Each additional source is another ingredient - more flavourful, but it needs to be “melted” into the soup base. You decide what the best soup is by choosing your ingredients. The end product is a synthetic national population available at the individual level.