Dispelling misconceptions about synthetic data sets
In November last year, Amazon released AWS Data Exchange, which made it easier for AWS customers to use and share data. More recently, Google Dataset Search came out of beta, making it easy for researchers to access more than 25 million publicly available data sets.
What can we make of these moves by big tech to open up the sharing of data? They reflect a trend I have observed over the last several years (and one that makes me especially optimistic about the future growth of the world's economies) - data is becoming a service.
To put data and its importance into context, McKinsey estimates that about 1.7 megabytes a second of new data will be created for every person globally this year. Banking and financial services are central to daily life, with 65% of customers now interacting with their banks via digital channels, so it is fair to assume that the sector will play a pivotal role in the development of this trend. In fact, it is estimated that just under half (47%) of people will check in on their online banking every single day.
This puts incredible pressure on these organisations in critical ways: they are accountable for complying with data and privacy regulations (such as GDPR, which took effect in the EU in 2018), while also catering to consumers in an “always-on” digital environment.
When it comes to data provisioning for development and testing purposes, financial organisations have historically relied on two categories of data - original and anonymous. Original data contains all personally identifiable information (PII), such as a customer’s name, alongside the details of each transaction. Anonymous data (generally speaking) removes such PII, but keeps the transactional data. With both types of data, financial organisations face significant challenges in protecting this information while also remaining compliant with ever more complex data protection regulations.
Yet there is an alternative AI-driven approach for financial organisations to consider in development - so-called synthetic or synthesized data.
What Is Synthetic Data?
In essence, synthetic data is computer-synthesized data (powered by cutting-edge machine learning technology) that mimics original data.
When implemented correctly, the benefits of this approach include full data privacy compliance and a reduction in the time needed for product development and testing; synthesizing data can take as little as 10 minutes. As a result, extremely sensitive data can be unlocked to turbocharge product or service development without the risks of a potential data breach.
Even so, to truly understand synthetic data and the impact it could have on the financial industry, it is important to consider what I believe are two common misconceptions about it.
1. Synthetic data refers to anonymised data
When data anonymisation is discussed, “synthetic” is frequently used interchangeably with “modified”, meaning that original data is altered in some systematic way to make it more difficult to identify the original data points.
In fact, there are three main approaches to data provisioning currently available on the market:
anonymised data - produced by a 1-to-1 transformation from original data,
artificial data - produced by a probabilistic model, based on a sample of data,
fully synthetic data - produced by a generative model of original data which “understands” what original data should look like.
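To make the distinction between these three approaches concrete, here is a minimal Python sketch of one way to read them. Everything in it is an assumption for illustration: the records, the function names (anonymise, artificial, synthesise) and the choice of simple Gaussian models are hypothetical, and the toy per-category model only stands in for the machine learning engines discussed in this article.

```python
import hashlib
import random
import statistics

# Hypothetical toy records; names and amounts are invented for illustration.
ORIGINAL = [
    {"name": "Alice", "category": "groceries", "amount": 42.10},
    {"name": "Bob", "category": "groceries", "amount": 38.75},
    {"name": "Carol", "category": "rent", "amount": 950.00},
    {"name": "Dan", "category": "rent", "amount": 1010.00},
    {"name": "Eve", "category": "groceries", "amount": 51.20},
    {"name": "Frank", "category": "rent", "amount": 980.00},
]


def anonymise(records):
    """1-to-1 transformation: every output row still maps to one original row."""
    out = []
    for r in records:
        # Replace the name with a short hash; the transaction itself is unchanged.
        pseudonym = hashlib.sha256(r["name"].encode()).hexdigest()[:8]
        out.append({"name": pseudonym, "category": r["category"], "amount": r["amount"]})
    return out


def artificial(records, n, rng):
    """Probabilistic model fitted to a sample: one Gaussian over all amounts.

    Categories are drawn independently of amounts, so the link between
    the two columns is lost.
    """
    sample = rng.sample(records, k=4)
    amounts = [r["amount"] for r in sample]
    mu, sigma = statistics.mean(amounts), statistics.stdev(amounts)
    categories = sorted({r["category"] for r in sample})
    return [
        {"category": rng.choice(categories), "amount": round(abs(rng.gauss(mu, sigma)), 2)}
        for _ in range(n)
    ]


def synthesise(records, n, rng):
    """Toy generative model of the full data set.

    Learns how often each category occurs and how amounts are distributed
    within each category, then samples brand-new rows from that model.
    """
    by_category = {}
    for r in records:
        by_category.setdefault(r["category"], []).append(r["amount"])
    categories = list(by_category)
    weights = [len(by_category[c]) for c in categories]
    rows = []
    for _ in range(n):
        c = rng.choices(categories, weights=weights, k=1)[0]
        mu = statistics.mean(by_category[c])
        sigma = statistics.stdev(by_category[c])
        rows.append({"category": c, "amount": round(abs(rng.gauss(mu, sigma)), 2)})
    return rows


rng = random.Random(0)
print(anonymise(ORIGINAL)[0])
print(artificial(ORIGINAL, 3, rng))
print(synthesise(ORIGINAL, 3, rng))
```

In this sketch the anonymised rows still carry the exact original amounts, which is why they remain traceable, while the synthesised rows are new records that keep the relationship between category and amount.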
Synthetic data can be powerful: recent research from MIT found that by using high-quality data synthesized by an advanced machine learning engine, it is possible to get the same results for a data-driven project as using original data.
2. Synthetic data can only ever be worse than real data
There can at times be an assumption that data which mimics real data is a poor copy of the original. Yet the overall aim of synthetic data is to provide key insights that are just as good as those drawn from original data. When used right, synthetic data can be just as insightful as needed; what is really critical is that the way the data is created is optimised for what financial organisations are looking for.
With agile data synthesized by an algorithm (i.e. created and examined in minutes thanks to advanced technologies), the potential to free up a financial organisation’s staff for other tasks is obvious: even collecting original data is time-consuming, with the process taking an estimated 12.5% of development time.
The future of data, which is increasingly available on-demand, is actually already here. The financial organisations that face the data challenge directly, and utilise the power and efficiencies of synthetic data, will be the winners for digital customers, both now and into the future.
By Dr Nicolai Baldin, CEO & Founder, Synthesized