Transcript of "The potential power of synthetic data in government"

JAMES POULTEN: Good morning everybody, hello, thank you very much for joining us, taking time out of your busy day to come along to this webinar.


Today we are going to be exploring the potential power of synthetic data in government and how we can use synthetic data to supercharge our analytics process.


I am James Poulten, I am going to be your host for today. Without further ado, let’s get underway.


So, who are we? Made Tech is a technology consultancy firm working exclusively with the public sector, be that central government – the MoJ, the Cabinet Office – or local government councils such as Hackney and the City of Westminster, along with a number of NHS and health consultancy contracts. Our whole ethos is doing good work for good causes, working ourselves out of a job and really trying to have a positive impact on all our projects.


My name is Dr James Poulten. I am the lead Data Scientist at Made Tech. I stayed at university far too long, so I have a PhD in experimental quantum matter physics and a master’s in physics from Royal Holloway, University of London. In my time as a Data Science Consultant, I have worked in a number of government departments on a whole range of projects: with the Met Office, the Home Office, Number 10, the Cabinet Office on the levelling up project, and Digital Passenger Services, helping with Passport Control.


Today we are talking about synthetic data. What is it? How can we use it?


Synthetic data is, at the root of it all, artificially generated data that we can use in an analytics process while sidestepping or removing the concerns around sensitive data leaks and privacy issues. Think of artificially generated data as starting from a random state and being shaped into a state that can be used to perform all of the complex analytics you would expect to perform on a real dataset. Because we’re not collecting that data through direct means when generating it, we don’t have to worry about leaking personal or identifiable information in our reporting.


A quick overview of the agenda today. We’re going to have an overview of synthetic data, a quick background. I’m going to walk through some of the different types of synthetic data. We’re then going to focus in on how we can make synthetic data useful and how we currently make synthetic data, and how we can apply that to our needs. Then we are going to talk a little bit about why we should be using synthetic data.


Why synthetic data? What is it, how can we use it and what do we use it for?


Synthetic data has massive benefits in a few key areas. It enables collaboration between departments. It removes concerns around GDPR and leaking of sensitive or private information. It can be used to really enhance the analytics process to speed up your analysis and improve collaboration.


Synthetic data is basically data that is modelled on real data and is able to preserve within it all of the multivariate relationships: the interplay between the various features of your dataset, the various categories, the different users and the different cases. All of that interesting analytical information that we look to interrogate and draw conclusions from is maintained within synthetic data. However, that synthetic data references people who are not real, so there are no privacy concerns and no data breach concerns. Think of it, if you will, as though you are simulating the population of a city in a computer game. You can then perform your analytics on that simulated population. As long as that simulated population has been created in such a way that it is a good facsimile or mirror of the real data, of real populations, then all of the analytics steps and processes that you develop for your simulated population can be directly transferred to your real data. You can have a good level of confidence that the analysis and those steps will still fit, apply and work with your real data.


Another fantastic bonus of synthetic data is that because you are removing all of those privacy and GDPR concerns, all the guardrails that surround sensitive information become less of an issue. It becomes a lot easier to share your data between departments and build collaborative working environments in which you can really start to harness extra data as more people begin to collaborate in that space.


What are the levels of synthetic data? We are going to go through a few of the levels now. This is not a complete list but hopefully it will give us a good understanding of what is out there currently.


Let’s start by saying it’s important to remember that synthetic data is not a new idea. The idea really took hold as far back as 1993, when it started to be applied to more sensitive information like census data, or to masking sensitive variables, although we can actually go back to 1986, when some of these techniques were first employed. It’s not a particularly new field, but with the mass explosion in computing power, as I’m sure you can understand, it’s becoming far more accessible and far easier to apply to your own datasets.


Currently, when we are discussing how we mitigate some of the concerns or risks with sharing data, we talk about a number of techniques. We talk about redaction (just removing sections of the data, features, or sensitive fields); we talk about replacing or masking (introducing dummy variables, dummy values, or encrypted columns to strip all meaning); and we talk about coarsening the data, so removing a number of significant figures, or stripping a postcode down to the first three characters.


We can also do simulation, using more applied statistical methods to generate this data – so, very simple synthetic data. Each of these current obfuscation techniques has the drawback that it removes analytic potential from your dataset. If we are removing fields or columns, or redacting sections of your data, then we can no longer perform analysis on them; they are essentially invisible to your data analyst teams.
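For a concrete picture of the first three of those techniques, here is a minimal sketch using pandas; the dataframe and its field names are illustrative assumptions, not a real dataset.

```python
# A minimal sketch of redaction, masking and coarsening using pandas.
# The dataframe and its field names are illustrative assumptions, not a real dataset.
import hashlib

import pandas as pd

df = pd.DataFrame({
    "name": ["Alice Smith", "Bob Jones"],
    "nhs_number": ["4857773456", "9434765919"],
    "salary": [31250, 48900],
    "postcode": ["SW1A 1AA", "E8 3PN"],
})

# Redaction: drop sensitive fields entirely.
redacted = df.drop(columns=["name", "nhs_number"])

# Masking: replace values with dummy or hashed values that strip all meaning.
masked = df.copy()
masked["nhs_number"] = masked["nhs_number"].apply(
    lambda v: hashlib.sha256(v.encode()).hexdigest()[:10]
)

# Coarsening: reduce precision, e.g. round salaries to the nearest thousand
# and keep only the first three characters of the postcode.
coarsened = df.copy()
coarsened["salary"] = (coarsened["salary"] // 1000) * 1000
coarsened["postcode"] = coarsened["postcode"].str[:3]
```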


What are the different levels of synthetic data generation?


First of all, we have simple, structural synthetic data. Here we are just recording the data model, essentially. What are the data types, what are the field names, how do they relate to other datasets? This is primarily used in code testing, the testing of APIs and systems. It has no real analytics value if you are looking to actually interrogate the data because it doesn’t contain any data. It is just schemas of various data fields. Unless you have particularly sensitive field names, typically, there is very little disclosure risk if your data model was released into the wider world. It is unlikely that anyone would be able to draw anything too sensitive from it.
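As a rough illustration, structural synthetic data can be as little as a recorded schema; the table, field names and relation below are assumptions for the sake of the example.

```python
# A minimal sketch of structural synthetic data: just the data model, no values.
# The table, field names and relation below are assumptions for illustration.
schema = {
    "table": "patients",
    "fields": {
        "patient_id": {"type": "string", "primary_key": True},
        "date_of_birth": {"type": "date"},
        "postcode": {"type": "string"},
        "number_of_visits": {"type": "integer"},
    },
    "relations": {"patient_id": "appointments.patient_id"},
}
```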


We then go into synthetic valid. This is something that you may already be doing, where you have your schemas, your data tables, your model, and you just fill it with random values. These tend to be plausible values, but you may be testing edge cases of ingests and transforms, and you’re making sure that the weird ones work as well.
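A minimal sketch of that idea might look like the following, filling the assumed schema above with random but structurally valid values plus a few deliberate edge cases.

```python
# A minimal sketch of synthetic valid data: the assumed schema filled with random
# but structurally valid values, plus deliberate edge cases for testing ingests
# and transforms. The values carry no analytical signal.
import random
import string
from datetime import date, timedelta

def random_row():
    return {
        "patient_id": "".join(random.choices(string.ascii_uppercase + string.digits, k=8)),
        "date_of_birth": date(1920, 1, 1) + timedelta(days=random.randint(0, 36500)),
        "postcode": "".join(random.choices(string.ascii_uppercase, k=3)),
        "number_of_visits": random.randint(0, 500),
    }

rows = [random_row() for _ in range(1000)]

# Hand-crafted "weird ones": empty strings, a leap-day birthday, an extreme count.
rows += [
    {"patient_id": "", "date_of_birth": date(2000, 2, 29), "postcode": "", "number_of_visits": 0},
    {"patient_id": "X" * 8, "date_of_birth": date(1900, 1, 1), "postcode": "ZZZ", "number_of_visits": 10**6},
]
```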


This is no use if you are looking to do any real analytics against this data, because these are just random values, probably fairly uniformly distributed. You’re basically only one step on from a structural synthetic view of your data, and again there’s very little risk of any sensitive data being released.


The next is plausible synthetic data. Here what we are doing is applying simple distribution models to particular fields or characteristics within your dataset, so your age field is suddenly populated with the distribution of ages that you would expect to see. Still, this doesn’t provide us with much analytical benefit. Although you may be able to pull out some very simple statistics, like the mean, the spread, and the skew of your data, this is only univariate analysis: you are only looking at one field at a time, in isolation.
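As a hedged illustration, plausible synthetic data might be generated along these lines, with each field drawn independently from an assumed univariate distribution.

```python
# A minimal sketch of plausible synthetic data: each field is drawn from its own
# univariate distribution. The distribution parameters here are assumptions and
# would normally be estimated per field from the real data.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

age = rng.normal(loc=41, scale=15, size=n).clip(0, 100).round()
sex = rng.choice(["F", "M"], size=n, p=[0.51, 0.49])
income = rng.lognormal(mean=10.2, sigma=0.5, size=n).round(-2)

# The univariate statistics (mean, spread, skew) look sensible, but because the
# fields are generated independently, any real relationship between them is lost.
```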


If you are looking to build any more complex analysis steps or machine learning steps, then really you are not going to be able to do it just by producing plausible synthetic data.


What we are talking about today instead is robust multivariate synthetic data. This means that each of the distributions within your fields, each of your spreads and skews, not only represents the true data of that field but also holds within it the relationships between fields: how does sex affect the age of the data you are collecting, and that sort of thing. This is where those hidden relationships and hidden distributions come in; we can really start to break down the data and perform some far more complex and in-depth analysis. This is what we are talking about when we talk about synthetic data. It’s multivariate, it contains a lot of information, and it can be interrogated quite extensively.
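A minimal sketch of the difference, assuming a made-up age and income relationship, is to sample the fields jointly rather than independently.

```python
# A minimal sketch of multivariate synthetic data: age and (log) income are
# sampled jointly from a correlated Gaussian, so the relationship between them
# survives. The mean and covariance values are assumptions; in practice they
# would be estimated from the real data.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

mean = [41, 10.2]          # mean age, mean log-income (assumed)
cov = [[225.0, 4.5],       # age variance, age / log-income covariance (assumed)
       [4.5, 0.25]]        # implies a correlation of about 0.6

age, log_income = rng.multivariate_normal(mean, cov, size=n).T
income = np.exp(log_income)

# Unlike the univariate version, older synthetic records now tend to have higher
# incomes, so "how does age relate to income" style analysis still works.
print(np.corrcoef(age, income)[0, 1])
```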


Obviously, with each of these levels of data synthesis the analytic value increases, but so does the disclosure risk. Although we are removing any personally identifiable information, any sensitive information that maps directly to data entities, the analytical value of that dataset remains intact. So although you no longer have to worry about GDPR, for example, you still do have to take care of your synthetic data, making sure you have it stored in the correct environments and that you have the correct security around it, because you now have a dataset that is a snapshot of your real data.


Now we are going to have a look at a few different methods for producing synthetic data. SMOTE is an interesting one that is used in classification problems, where you are looking at clustering your data and identifying user groups. The typical example that is always given is identifying spam emails: data where you are trying to classify which group each data point belongs to.


Here you can see – is it a blue datapoint or is it an orange datapoint? SMOTE was first proposed back in 2002. The idea behind Synthetic Minority Oversampling is that you draw vectors between each of your data points within a class or classification, and you simply generate a point that sits on each of those vectors. If you look at the diagrams on the right-hand side, I think it’s fairly well depicted. You start with your four blue datapoints, you draw lines between those four data points, and then you place your new data points on those lines. This is obviously in two dimensions; you can imagine this being in a hundred dimensions. Computers have no issue creating hundred- or thousand-dimension vectors; that’s just what they are good at. So, you can do this to whatever extent is required for your feature space.
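A hand-rolled sketch of that interpolation idea is below; in practice you would more likely reach for an established implementation such as the one in the imbalanced-learn library.

```python
# A hand-rolled sketch of the SMOTE idea: pick a minority-class point, pick one
# of its nearest neighbours in the same class, and place a new synthetic point
# somewhere along the vector between them.
import numpy as np

def smote_like(minority: np.ndarray, n_new: int, k: int = 3, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        point = minority[i]
        # indices of the k nearest neighbours of this point (excluding itself)
        dists = np.linalg.norm(minority - point, axis=1)
        neighbours = np.argsort(dists)[1 : k + 1]
        neighbour = minority[rng.choice(neighbours)]
        # the new point sits at a random position on the line between the two
        new_points.append(point + rng.random() * (neighbour - point))
    return np.array(new_points)

# Four "blue" minority points in two dimensions, as in the diagram.
minority_class = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]])
print(smote_like(minority_class, n_new=8))
```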


Now you have a far greater prevalence of that classification within your dataset, and that allows you to train your classification models using a far more evenly weighted set of classes. The issue with this in our instance, where we are talking about using synthetic data to remove the sensitivity issues, is that you are still using your real data as the majority of your data. What you are doing here is increasing the signal of a certain area of your dataset. So, you are increasing the number of spam emails within your dataset so that your classifier model has an easier time of identifying, or training to see, spam emails. But you still have the rest of your real emails as part of the dataset.


So, a really powerful technique, really widely used, but not ideal for generating synthetic data when the aim is to share or collaborate.


Next we are going to look at auto-encoders. These were established back in 1986. These are really interesting. What these essentially do is they compress your feature space into a form of component analysis. So, it tries to create meaningful vectors of your features. Rather than three features being represented by three individual columns, those three features would be compressed down into a single vector column, so now you’ve gone from three columns to one column. You have a reduced feature space, but hopefully all of that information is still maintained within that vector.


So, you have the encoder, which looks to reduce down, and then you have the decoder, which looks to explode those features back up. You take that one feature vector and explode that back out into three distinct features. That is how you reconstruct your dataset. This is a really interesting method and is widely used in the generation of synthetic data. If you have smaller sets of data and you need to increase your volume so that you can better train machine-learning models, then this is a really powerful way of doing it.
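A minimal PyTorch sketch of that encoder-decoder idea, using the three-features-to-one-vector example above, is shown here; the sizes, data and training settings are illustrative assumptions.

```python
# A minimal PyTorch autoencoder sketch, following the three-features-to-one-vector
# example above. The sizes, data and training settings are illustrative assumptions.
import torch
from torch import nn

class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(3, 1)   # compress three features into one vector column
        self.decoder = nn.Linear(1, 3)   # explode that vector back out into three features

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

x = torch.randn(256, 3)                  # stand-in for real three-feature records
for _ in range(200):
    optimiser.zero_grad()
    loss = loss_fn(model(x), x)          # reconstruction error against the real data
    loss.backward()
    optimiser.step()

# Decoding fresh latent vectors gives new records, but note the encoder has been
# trained on the real data, which is exactly the privacy caveat discussed next.
with torch.no_grad():
    synthetic = model.decoder(torch.randn(10, 1))
```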


Again, we run into the same privacy issues or concerns, because the encoder sees your real data. There’s no reason your real data couldn’t turn up reconstructed on the other side of the decoder, and no reason you wouldn’t be able to map real entities or people from your reconstructed data back to your original dataset. So, a really powerful technique, widely used to bulk up or bolster existing datasets to help with machine learning model training, but still not ideal if you are looking to collaborate, share and navigate privacy concerns.


Finally, we are going to be talking about generative adversarial networks, or GANs. That’s where we are going to spend the majority of our remaining time today. These GANs consist of two opposing neural networks fighting against each other. They were proposed back in 2014 by Ian Goodfellow et al. These are the models that have really exploded in recent years. If you’ve seen anything about generative AI art online, it is more than likely using a GAN. They are used in a whole range of different fields currently, from generating high-resolution images, to blocks of text, to synthetic tabular data, which is what we are going to be using them for. A really interesting, really cool piece of technology. The picture on the side there is one of the first portraits that was generated this way and then sold, for not a small amount of money, back in 2018 at Christie’s.


How then, do we create our synthetic data, and how do we enact these GANs?


Let’s understand first what a Generative Adversarial Network is. What you essentially have is two neural network models working in opposition to each other. You have the generator, which takes a random starting point, a random input, passes it through all of its transformation layers and generates a piece of synthetic data. That synthetic data is then passed to the discriminator, which tries to assess whether or not it is generated data.


The discriminator has a view of the real data, and is trained using the real data, so it knows what the real data looks like. The synthetic data is passed to the discriminator, and the discriminator says yes or no to that piece of synthetic data. For the rejected pieces of synthetic data, it passes a signal back to the generator which says no, that’s not good enough, try again. The generator updates all of its internal parameters and tries again; it tries to generate another piece of data.
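A heavily simplified PyTorch sketch of that loop is shown below. The network sizes and the stand-in "real" table are assumptions; the shape of the loop is the point: the generator only ever receives noise plus the discriminator's feedback, while the discriminator is the model that is shown the real records.

```python
# A heavily simplified GAN training loop in PyTorch. Network sizes and the
# stand-in "real" table are assumptions.
import torch
from torch import nn

n_features, latent_dim, batch = 8, 16, 64

generator = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, n_features))
discriminator = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
loss_fn = nn.BCELoss()

real_data = torch.randn(512, n_features)          # stand-in for the real table

for step in range(1000):
    real = real_data[torch.randint(0, len(real_data), (batch,))]
    fake = generator(torch.randn(batch, latent_dim))

    # Discriminator step: real rows should score 1, generated rows should score 0.
    d_opt.zero_grad()
    d_loss = loss_fn(discriminator(real), torch.ones(batch, 1)) + \
             loss_fn(discriminator(fake.detach()), torch.zeros(batch, 1))
    d_loss.backward()
    d_opt.step()

    # Generator step: the "try again" signal, nudging it to make fakes score 1.
    g_opt.zero_grad()
    g_loss = loss_fn(discriminator(fake), torch.ones(batch, 1))
    g_loss.backward()
    g_opt.step()

# After training, synthetic rows come purely from noise passed through the generator.
with torch.no_grad():
    synthetic_rows = generator(torch.randn(1000, latent_dim))
```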


Where the real power of a GAN comes in is that, theoretically, the generator never actually sees the real data. So, you can build these knowing that the generator is starting from a wholly random place. It’s starting from white noise being passed in through transformation layers. It’s not copying or learning any actual specifics about the real data; it is learning about the form of the real data.


That’s where the power is. Let’s now break that down a bit more. Let’s take images. Images are easier to conceptualise than tabular data, which isn’t always the most fun medium for playing with these rather heavy concepts. With an image generator, we start with a random sheet of noise. We pass that into our generator. Our generator performs a whole bunch of transforms on that data and says, let’s see if this is what the data looks like. It passes to the discriminator a slightly less noisy piece of noise.


So, that first face in the rejected attempt on the right-hand side of the screen. The discriminator says, no, I can easily detect that is a generated piece of data. That’s a generated piece of data because if I look at my real data set, none of my real data sets actually have any noise in them. That’s fake.


It then passes it back to the generator and the generator updates and tries to learn what is creating that rejection, what is signalling to the discriminator. It passes in another image, another portrait. The discriminator says well, no, I can still tell because it’s still fuzzy, but it’s less fuzzy. It feeds back to the generator that it is improving, that it is moving in the right direction, but it is still not there yet.


The generator generates a third image, and this time it passes it to the discriminator and the discriminator says, ah, yes, that’s just one of my training set, don’t worry about it. And through it goes. And there you have your generated data: completely unique and distinct from all of your real data, 100 percent synthetic. It has come from random noise inputs and been passed through transformation layers to produce a piece of data that fools the discriminator model.


I like looking at these, these are excellent. This is where we were back in 2014 when these were first introduced. You can see that it is definitely a face. I think it’s possibly better than 2015 to be honest, which looks like PS1 era graphics. We’ve come a long way, basically. In 2017 we are already at that photorealistic level. We are a number of years on from 2017 now, so yes, they are really quite powerful.


In fact, all of the images that we were using in that example have been generated – none of these people exist. These are all entirely generated faces. You can do it yourself if you go to ‘thispersondoesnotexist.com’ and play around with their face generator. Actually, it’s really useful for training documents, where rather than having to get sign off or purchase stock photos, you can now use randomly generated faces.


What actually is going on? How can we go from generating faces or portraits to generating tabular data? If we look at what the networks are actually seeing with portrait data, they are seeing a massive matrix, or matrices, of pixel intensities and pixel colours, so they are already seeing tabular data. The square of pixels can very easily be flattened out into a row of pixels, which is just a list of intensities, which then automatically becomes a table.


So now your training set of a thousand pictures becomes a thousand rows. Now it is tabular data. In fact, the complexity of doing this for picture data is in a lot of ways far higher than doing it for tabular data; the structures behind image data are far more complex and far harder to learn than those behind tabular data.
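In code, that flattening step is a one-liner; the image sizes here are assumptions for illustration.

```python
# A stack of images is already tabular data: 1,000 greyscale 28x28 images flatten
# into a 1,000-row table with one pixel-intensity column per pixel.
import numpy as np

images = np.random.rand(1000, 28, 28)        # stand-in for a real image training set
table = images.reshape(len(images), -1)      # shape (1000, 784): one row per picture
print(table.shape)
```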


So, we can apply all the techniques we use to generate images to generating synthetic tabular data. And we are: people are doing this already. The ONS Data Science Campus has been looking into generating synthetic data and they are finding it quite useful. Here we’ve got the real data and the synthetic data taken from the numerical and categorical features in the US Adult Census dataset. They have generated synthetic data using GANs, used that dataset to train models and build analytic processes, and then applied those processes to the real dataset. They have found that they are already at 80 percent performance just by training off that synthetic data. The idea is that you can very quickly spin up testing environments and pass these synthetic datasets out to analyst teams, data scientists and machine learning experts.
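The "train on synthetic, test on real" pattern they describe looks roughly like the sketch below; the datasets here are toy stand-ins, not the Adult Census data, and the accuracy printed is not the 80 percent figure quoted above.

```python
# A minimal "train on synthetic, test on real" sketch with scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)

def make_dataset(n):
    X = rng.normal(size=(n, 5))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
    return X, y

X_synth, y_synth = make_dataset(5000)     # stand-in for GAN-generated rows
X_real, y_real = make_dataset(2000)       # stand-in for the real, sensitive rows

model = RandomForestClassifier(random_state=0).fit(X_synth, y_synth)
print("accuracy on real data:", accuracy_score(y_real, model.predict(X_real)))
```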


Whilst you are working out all of the sensitivity issues, the privacy issues, you can already be disseminating this data to your teams, who can start working on the processes.


Then they can take those processes and translate them directly to real data.


Let’s look now at using synthetic data in government.


Honestly, the big impact here is going to be on GDPR concerns. When you are dealing with synthetic data, you are no longer dealing with real people’s data. You are dealing with simulated people’s data, so there is no need to anonymise or pseudonymise your specific data records. The data cannot be mapped back to real individuals.


Obviously there needs to be a level of tuning, depending on how broad your dataset is, how big an area, how big a net you are casting. You need to make sure there is a certain level of fuzziness to the data, to make sure you don’t just end up recreating people who are too realistic, too good a facsimile of real people. As long as you bear that in mind, you can quite successfully generate plausible data that you can run all of your analytics on, develop all of your processes on and you don’t have to worry about releasing sensitive information.


As I’ve been saying throughout this talk, collaboration is a huge unlock for synthetic data. Imagine being able to build a joint analytics workspace that has access to this synthetic data. You no longer have to worry about other teams coming in, collaborating with your analytics team and accessing your real production data. They can now simply join your team, access this synthetic environment, see the data, perform analysis, and help develop new tools and processes. Then you can choose at which point, if at all, you allow them to see the real data, or at which point you wish the analysis processes to be applied to the real data.


Huge potential with collaboration. Also, there are no longer the restrictions around using onshore data hubs. A lot of the more cutting-edge tools are often developed in foreign countries. Currently there is a requirement that data be stored and housed and analysed within UK data hubs, onshore data hubs. With synthetic data, those restrictions are more relaxed, so it unlocks huge potential there.


Finally, and slightly selfishly as a data scientist, synthetic data is a huge unlock for any data professional joining a project. You are no longer bound quite so tightly by all of the privacy and GDPR concerns. You don’t have to go through the anonymisation or pseudonymisation process, which can take a long time, to ensure that subjects within your data are correctly protected. Synthetic data allows you to skip over all of that and immediately start working to refine your feature sets and build in any of the transforms that are required to develop the machine learning models.


Huge potential there. It’s probably a good idea to touch on some of the limitations of synthetic data. Synthetic data is a tool, and all tools have their place and their limitations. While it does mimic real data, it doesn’t, and shouldn’t, reproduce it exactly. It’s a good place for you to develop your analysis processes, to build your models and train a first iteration of them. If you want to go on to draw real insight from your data, then it’s important to apply those models and analytics to the real data and make sure that they still hold up, and that the analysis still makes sense there.


It’s a good tool for collaboration but you still need to take it back and apply it to the real dataset. It’s also unlikely to capture anomalies. Typically, these models will smooth out your data and that’s often not how real data is. It’s not going to capture the anomalies, the edge cases, the weird instances that sometimes warrant more investigation.


Synthetic data always needs to be validated against the real data.


Finally, it’s always worth noting that these things can be hard to build. GANs are still a relatively new technology, and they are still very complex. You will likely need a data scientist or a data team to look into building your synthetic data, which has its own costs and drawbacks.


That is all we have time for today. Thank you all for listening, thank you for your time. I hope you have found that insightful and interesting. I’m going to give a shout out to the Pipeline, which is a Made Tech webinar series. I would ask you all to subscribe. If you have any further questions, feel free to reach out. There is my email address, my LinkedIn. If you are still using Twitter, there is Made Tech’s Twitter. Reach out to me or to Made Tech and we would love to chat to you about your data requirements or data problems. Thank you everyone, have a good day.
