Make your company data infinitely portable and 100% private!
Properly executed privacy engineering can generate data sets that accurately represent your real-world customers, with no risk of a privacy breach and a reduction in accuracy of only 2-5%.

Introduction

Protecting customer data is fundamental to any SaaS business. It is a well-understood requirement and largely a solved problem, so long as established codes of practice are followed.

More recently I’ve also had to consider how best to anonymize that data so that my teams can use it to build data science infrastructure and machine learning models for specific use cases within our business.
Within Snapfix we’ve amassed a considerable volume of image, video and text messaging data, but for some of our use cases the anonymized customer data needs to be augmented with additional synthetic data in order to reach the volumes required to train our current and future machine learning models.

The issue of anonymization and synthetic data generation becomes even more important when my team needs to get advice or have a specific task carried out by an outside data analyst or ML engineer.

As our volume of data has increased, we’ve become more aware of these needs. It’s a common issue for data-dependent SaaS companies such as Snapfix, and there is a catch-all term for this activity: Privacy Engineering.

Privacy engineering and bias

The deeper I’ve gone down the rabbit hole of Privacy Engineering, the more I realize how important synthetic data is. It can even reach the point where a given volume of synthetic data leads to better learning outcomes than an equivalent amount of real-world customer data, because we can tweak our synthetic data generators to filter out biases so that the data better represents the customer profiles we intend to target within the marketplace.

In the case of Snapfix, we have a lot of real-world data representative of the hospitality and built-environment space, and we can use it to generate artificial data sets that are similar but tweaked for other industry verticals. We can remove biases so long as they can be generalized. For example, we may observe that 70% of the cohort within one of our real-world datasets is female, but we intend to target another vertical with the same use case where we know the male-to-female ratio is closer to 40/60. In that case we can rebalance the data using the same reweighting and resampling techniques data scientists have long used to address such imbalances, as sketched below.
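As a minimal illustration of that rebalancing step (the column names and numbers are hypothetical, and pandas stands in for whatever tooling is actually used), stratified resampling is enough to shift a 70/30 cohort to the 60/40 target:

```python
import pandas as pd

# Illustrative real-world cohort: ~70% female / 30% male,
# mirroring the example above. Columns are hypothetical.
real = pd.DataFrame({
    "gender": ["F"] * 700 + ["M"] * 300,
    "tasks_per_week": [5] * 700 + [7] * 300,  # placeholder feature
})

# Target mix for the new vertical: 60% female / 40% male.
target = {"F": 0.60, "M": 0.40}
n_out = 1_000

# Stratified resampling: draw from each group in proportion to the
# target ratio rather than the ratio observed in the source data.
parts = [
    real[real["gender"] == g].sample(n=int(n_out * p), replace=True,
                                     random_state=42)
    for g, p in target.items()
]
rebalanced = pd.concat(parts).sample(frac=1, random_state=42)  # shuffle

print(rebalanced["gender"].value_counts(normalize=True))
# F    0.6
# M    0.4
```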

Differential privacy and privacy guarantees

Before I considered extensive use of synthetic data I looked at other techniques for making our real-world data private. Noise can be inserted into a data set so that it becomes provably infeasible to reverse engineer the data in ways that could reveal a connection from that data to any individual or company. I remember in the mid 2000s reading about statistical analyses that linked multiple data sets together to reveal the location or other attributes of the individual the ‘anonymized’ records referred to (the de-anonymization of the 2006 AOL search logs is a well-known example). Differential privacy methods provide mathematical guarantees against this kind of linkage attack. Apple uses the technique for real-time user predictions, such as which emoji you are likely to use next, and the US Census Bureau uses it to anonymize citizens’ data. The disadvantage for Snapfix, or for any SaaS business less than five years old, is that a massive amount of real-world data is required to preserve a high level of accuracy in the anonymized output. This was an issue for Snapfix and is a major reason why I chose the synthetic data route rather than relying on differential privacy techniques.
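For illustration, the textbook building block of differential privacy is the Laplace mechanism: noise calibrated to a privacy budget epsilon is added to a query answer. This sketch is generic, not Apple’s or the Census Bureau’s implementation, and the numbers are invented:

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float) -> float:
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so Laplace noise with scale
    1/epsilon gives an epsilon-DP guarantee.
    """
    sensitivity = 1.0
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Smaller epsilon means stronger privacy but noisier answers. With a
# small data set the noise swamps the signal, which is exactly the
# accuracy problem described above for younger SaaS companies.
print(laplace_count(true_count=128, epsilon=1.0))   # typically within a few of 128
print(laplace_count(true_count=128, epsilon=0.05))  # can be off by dozens
```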

Releasing data to the outside world

No matter how well you anonymize your data, there is always a risk of a reverse-engineering attack that exposes personally identifiable information within the dataset. Privacy engineering greatly reduces, or even eliminates, this risk. To date, Snapfix has never released any data to outside sources, and with our increasing reliance on synthetic data it’s unlikely we ever will. Instead we intend to rely entirely on synthetic representations of our data sets whenever we need outside advice, outside data analysis or outside ML work. It’s becoming likely that we’ll rarely, or maybe never, use our real data for anything other than running the live versions of our mobile and web apps!

Load testing

Surprisingly, I haven’t come across any other company or service that mentions the benefits of synthetic data for load testing and API testing. Maybe it’s just an obvious usage, but in our case it saves a lot of time and effort. Once we have a reasonably representative synthetic data generation process in place, it removes the need to tweak our testing data and convert it into a form suited to load testing; we can rely on the de-risked synthetic data instead, sacrificing 2-5% of accuracy for 100% privacy. A sketch of the idea follows.
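Here is a hedged sketch of what that could look like: the Faker library generates synthetic payloads and a simple loop fires them at a staging endpoint. The URL and field names are invented for illustration, not Snapfix’s real API:

```python
import requests
from faker import Faker

fake = Faker()
STAGING_URL = "https://staging.example.com/api/tasks"  # hypothetical endpoint

def synthetic_task() -> dict:
    """Build one synthetic task payload containing no real customer data."""
    return {
        "title": fake.sentence(nb_words=4),
        "reporter": fake.name(),
        "location": fake.company(),
        "created_at": fake.iso8601(),
    }

# Simple load loop; a real run would plug the same generator into a
# dedicated load-testing tool such as Locust or k6.
for _ in range(1000):
    requests.post(STAGING_URL, json=synthetic_task(), timeout=5)
```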

Who’s using synthetic data sets the most?

There aren’t that many industries using synthetic data in any meaningful way yet. I’ve had to rely on the medical industry for most of my information on this. The financial space is also starting to use synthetic data, as is the gaming industry.

Other benefits

One of the other major benefits of understanding synthetic data generation is that it gives you valuable insights into ways to filter and analyze your incoming data for signs of fake users or bots. The gaming industry in particular has been able to apply its synthetic data expertise to building algorithms that detect bogus players. It’s something that may become applicable to many SaaS businesses, and one plausible shape for it is sketched below.
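The sketch is entirely hypothetical, using scikit-learn: fit an anomaly detector on features from known-genuine sessions, then flag incoming sessions that fall outside that distribution. The session features here are invented for illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative session features: [events_per_minute, session_seconds,
# distinct_pages]. In practice these would come from real telemetry.
rng = np.random.default_rng(0)
genuine_sessions = rng.normal(loc=[5, 300, 8], scale=[2, 90, 3],
                              size=(5000, 3))

# Train on known-genuine traffic; anything far outside its
# distribution is treated as suspicious.
detector = IsolationForest(contamination=0.01, random_state=0)
detector.fit(genuine_sessions)

# A bot-like session: inhumanly fast, very short, single page.
incoming = np.array([[120, 4, 1]])
print(detector.predict(incoming))  # -1 means flagged as anomalous
```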

The next steps for Snapfix

I’ve already made exploratory use of synthetic data in manual processes that have proven its effectiveness. Over the coming weeks I intend to try automatically generating fully synthetic versions of our MySQL operational database and of our BigQuery event data lake. There are services in the marketplace that make this kind of approach easier than it first appears.

It would save considerable time if we could establish a fully automated process that produces a digital twin of our operational and transactional data sets: fully synthetic, with no link whatsoever back to our real-world data, ensuring 100% privacy while still being representative of our use cases. We could potentially plug these data sets into our existing processes with no extra work to adapt our staging apps, automated and manual testing, load testing and other services to the synthetic data.

Fundamentally, Snapfix will rely less on gatekeeping and maintaining encryption and more on generated data for every purpose outside our live databases. When creating test and staging servers, we will generate them synthetically instead of taking copies of the live data and anonymizing it.
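To show the core step in miniature, here is a deliberately naive sketch: fit a “generator” to a real table and sample a synthetic twin from it. This toy version resamples each column independently, which real tools (including the services listed below) improve on by modelling cross-column correlations; the table and column names are hypothetical:

```python
import pandas as pd

def naive_synthetic_twin(real: pd.DataFrame, n_rows: int) -> pd.DataFrame:
    """Sample a synthetic table column by column.

    Each column is resampled independently from its own empirical
    distribution, so synthetic rows are recombinations rather than
    copies of real rows. A production generator would also preserve
    correlations between columns and avoid emitting rare real values
    verbatim; this toy only illustrates the overall shape of the process.
    """
    return pd.DataFrame({
        col: real[col].sample(n=n_rows, replace=True)
                      .reset_index(drop=True)
        for col in real.columns
    })

# Illustrative slice of a hypothetical MySQL tasks table.
real_tasks = pd.DataFrame({
    "status": ["open", "done", "open", "done", "open"],
    "priority": [1, 3, 2, 1, 2],
})
print(naive_synthetic_twin(real_tasks, n_rows=10))
```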

Some emerging Synthetic Data Services:

Datagen
Cvedia
Hazy
AI Reverie
Anyverse