As big-time data nerds, we (unsurprisingly) believe that good data is the cornerstone of good predictions, and good predictions lead to good decisions. But sometimes, real-world data doesn’t exist, or isn’t available.
To give an example: when COVID restrictions were in place, people were behaving and consuming in ways that were very different from how they act out of quarantine. As a result, a lot of data collected in 2020 and 2021 isn’t super useful for predicting consumer behavior in 2024. There are other instances where data sets may be incomplete or unavailable, or when data sets might be troublesome to use for privacy reasons.
So what can you do when you don’t have the information you need to make a decision? Increasingly, data scientists have been experimenting with synthetic data sets. Synthetic data is, in theory, a useful way to fill in the gaps and get an understanding of trends and patterns when the picture is incomplete. Though some professionals have traditionally thought of it as ‘cheating’, synthetic data is now widely treated as a legitimate tool, and you can use its results to inform your strategic decision-making.
So how effective is synthetic data, really? And how can we get more out of it?
Where does synthetic data come from?
Synthetic data is created by sampling values from a statistical distribution chosen so that, hopefully, the results emulate the patterns that appear in real-world data sets. Historically, these sets have been limited, because past data generation techniques didn’t always yield data that reflected real patterns.
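To make the idea above concrete, here is a minimal sketch in Python. It assumes the real data is roughly bell-shaped, which is often not true in practice, and the function name `synthesize_normal` and the `real_orders` figures are made up purely for illustration.

```python
import random
import statistics


def synthesize_normal(real_values, n, seed=0):
    """Sample n synthetic values from a normal distribution fitted to real_values.

    Assumption: a normal distribution is a reasonable fit. Skewed or
    multi-peaked data would need a different distribution.
    """
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    rng = random.Random(seed)  # seeded so the output is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical daily order counts standing in for a small real-world sample.
real_orders = [102, 98, 110, 95, 105, 99, 101, 108]
synthetic_orders = synthesize_normal(real_orders, n=1000)
```

The synthetic set shares the average and spread of the sample it was fitted to, but nothing more. Any pattern the chosen distribution can’t express, such as seasonality or sudden spikes, will be missing.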
Recently, machine learning techniques have helped make synthetic datasets more accurate and interesting. Machine learning algorithms can take multiple historical data sets into consideration, and use them simultaneously to create sets that are grounded in precedent. Machine learning also makes it easier to introduce variability into synthetic data sets – it’s faster now to generate different distributions, and to process synthetic data through multiple models that can be explored and tested against.
And using multiple models can be key to getting useful insights from synthetic data. Because they’re artificially generated, synthetic data sets will sometimes lack the kind of outliers and erratic jags that are usually present in real data. Using a combination of models – some that generate data based on old trends, and some that produce totally new sets – can be a good way to work around this problem. Additionally, editing the models to take different time spans and trends into consideration can introduce even more complexity to synthetic datasets, making them more nuanced and useful.
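As a rough illustration of putting those missing outliers and jags back in, the sketch below takes an overly smooth series and randomly scales a small share of its points up or down. The helper `add_outliers` and its `rate` and `scale` parameters are hypothetical names chosen for this example, not part of any particular tool.

```python
import random


def add_outliers(values, rate=0.02, scale=3.0, seed=0):
    """Return a copy of values with roughly `rate` of the points turned into outliers.

    Each selected point is either multiplied or divided by `scale`,
    giving a mix of spikes and dips.
    """
    rng = random.Random(seed)  # seeded so the output is reproducible
    noisy = []
    for v in values:
        if rng.random() < rate:
            factor = scale if rng.random() < 0.5 else 1.0 / scale
            noisy.append(v * factor)
        else:
            noisy.append(v)
    return noisy


# A perfectly flat series, standing in for an overly tidy synthetic set.
smooth = [100.0] * 500
jagged = add_outliers(smooth, rate=0.05)
```

In practice you would tune `rate` and `scale` against whatever real data you do have, so the irregularity looks like the kind you’d actually observe.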
Why do we love synthetic data?
Synthetic data can be a great tool for predicting customer responses to a big event, like a product release. Using machine learning, data scientists can generate multiple datasets to mimic different outcomes, and then make predictions and plan around insights from these sets. It can also be useful for planning campaigns, and predicting metrics like CPC and CPA.
Synthetic data can also be generated far more quickly, and with fewer resources, than real-world data can be collected. It can help your data teams get up and running with model development and R&D without having to wait for other teams to source the data. And the faster your model goes live, the sooner you can add new data to the algorithm to make it even more accurate.
Another benefit of synthetic data? Privacy. Because all of the individual data in a synthetic dataset has been generated by a computer, and there’s no actual human data in play, there’s far less need to worry about clean rooms or privacy regulations (though if a set is modeled closely on real records, it’s still worth checking that it can’t be traced back to them). This makes processing data a little less complicated, especially if you want to use information that would be personal if it belonged to real people.
Introducing irregularity into synthetic data sets is an important step to making them useful. This helps to account for those outliers, jags, and sudden new trends that you’d observe in real life. But even more importantly, the secret is – you guessed it – making sure your real-world data has been collected with good practices. If you’re using synthetic data in the first place, it’s probably because you don’t have everything you need. But if you’re incorporating any real-world data at all, or planning to apply synthetic insights in a real-world way, you need to make sure that your data is cleaned, accurate, and comprehensible.
So I can just use synthetic data all the time, right?
Well…no! Synthetic data is a great tool for setting up and running models, testing accuracy, and making predictions when real data is lacking. But at the end of the day, there’s no replacement for real-world data. Collecting data should always be the priority.
On a recent episode of The WARC Podcast, WARC’s Anna Hamill explored how leading companies like Mondelez International are using synthetic data to create audiences for product development and new flavor creation. Oliver Feldwick, Chief Innovation Officer at The&Partnership, also discussed how his agency uses synthetic data at various development stages of a new campaign or strategy, mainly to help them refine their ideas before they apply them to real research and present them to their clients.
However, the podcast also stressed that synthetic data is no substitute for the genuine article. Instead, it should be a supplement used to fill in gaps and draw inferences when real research is unavailable.
How can I start generating synthetic data today?
Your first step is to set up proper data governance. You need a watertight plan that directs how you manage your data, from the policies on how you gather and store information to the tools that streamline your workflows.
Good data hygiene is a must – keep your data accurate by validating and updating it, removing errors, duplicates, outdated or incomplete information, and more. You need to make sure the best data is going in to make the best synthetic sets. For more info on how to achieve high quality business data, explore our free, in-depth whitepaper, “Prepping for AI.”
Similarly, you mustn’t forget an important ingredient: qualitative data. Facts and figures will form the base of your synthetic datasets, but non-numerical information like survey responses, interview transcripts, social media posts and online reviews will add important insights into customers’ attitudes, beliefs and behaviors. Be sure to use machine learning tools to sift through this valuable material and integrate it into your synthetic data.
If you’re confident about your data readiness, you can now go about creating your synthetic data. There are several ways to achieve this, but four of the most popular are:
- Asking generative AI, which has already learned the patterns within real data, to produce similarly distributed synthetic sets.
- Using rules engines to generate artificial data based on defined business rules, keeping relationships between data elements for relational integrity.
- Cloning data entities by extracting data from the source system of a specific business entity, like a customer, and masking it for compliance purposes.
- Partnering with a specialist data company that can deliver synthetic data solutions, custom designed for your business.
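To give a feel for the rules-based option in the list above, here is a small Python sketch. It is not how any specific rules engine works. The tables, field names, price points, and the 10% loyalty discount are all invented for illustration. The point is that each rule is written down explicitly, and every order points at a customer that actually exists.

```python
import random


def generate_dataset(n_customers, n_orders, seed=0):
    """Generate hypothetical customers and orders that obey a few business rules."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    customers = [
        {"customer_id": i, "segment": rng.choice(["new", "returning", "loyal"])}
        for i in range(1, n_customers + 1)
    ]
    orders = []
    for order_id in range(1, n_orders + 1):
        # Rule 1: every order references an existing customer (relational integrity).
        customer = rng.choice(customers)
        # Rule 2: quantity is between 1 and 5, price comes from a fixed list.
        quantity = rng.randint(1, 5)
        unit_price = rng.choice([4.99, 9.99, 19.99])
        # Rule 3 (made up for this example): loyal customers get 10% off.
        discount = 0.10 if customer["segment"] == "loyal" else 0.0
        total = round(quantity * unit_price * (1 - discount), 2)
        orders.append({
            "order_id": order_id,
            "customer_id": customer["customer_id"],
            "quantity": quantity,
            "unit_price": unit_price,
            "total": total,
        })
    return customers, orders


customers, orders = generate_dataset(n_customers=20, n_orders=200)
```

Because the relationships are enforced when the data is generated, you can join the two tables on `customer_id` without finding orphaned rows, which is what makes sets like this usable for testing pipelines and models.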