Generating Synthetic Data with LLMs
A guide to using AI to generate synthetic data for your database and application using any LLM that is available at an endpoint.
May 21st, 2024
AI data privacy is a hot topic these days and there are two emerging ways of protecting sensitive data when working with LLMs: synthetic data and tokenization.
In this blog, we'll cover both synthetic data and tokenization and their use-cases.
Let's jump in.
Synthetic data is getting more and more attention these days with the rise of AI/ML and LLMs. It's being used by most popular foundation models to train them such aws Microsoft's PHI-3 mode. Increasingly, more companies are using synthetic data for security and privacy reasons as well as to train models. We think of this as Synthetic Data Engineering. In the simplest definition, synthetic data is data that a machine has completely made up from scratch. For example, you can program a pseudo-random number generator (PRNG) to randomly select 5 numbers between 0 and 25. You can then use these randomly selected numbers as indexes in the alphabet to randomly select 5 letters. If you put those 5 letters together, you've created a synthetic string!
Obviously, this is a very simple example but the point remains. You can write programs to create data that "looks" just like real data. Additionally, using machine learning models such as generative adversarial networks or GANs, you can create synthetic data that has the same statistical characteristics as your real data. There are a number of synthetic data generators available that address different use cases. The key is to balance generation speed and accuracy. Some use cases such as analytics call for more accuracy (statistically) while other use cases, developer testing, call for more generation speed.
Synthetic data is being used by developers to build applications and machine learning engineers to train models. For developers, synthetic data is helpful:
We're seeing new use cases come up for synthetic data all of the time as more attention and time is being spent on methods to generate higher-quality synthetic data for developers and ML engineers.
Tokenization is encryption's lesser known cousin! Tokenization is the process of converting a piece of data to another representation by using a look-up table of pre-generated tokens that have no relation to the original data. Another way to think about tokenization is to imagine a casino. When you go into a casino, you exchange cash for chips then use those chips in the casino and then when you're done, you swap the chips for cash again. Tokenization works in a similar way. You have some data that you give to a look up table, the look up table returns back a randomly generated token, then you can use that data until you're ready to swap it back for the original data.
The main difference between tokenization and encryption is that encryption is reversible from ciphertext -> plain-text while tokenization is not reversible. Meaning that if you have a token, there is no possible way to retrieve the original value without the look up table. No public or private key can reverse the data for you. This is why in some cases, tokenization can be more secure than encryption (at least theoretically). Encryption can be reversed with a key, tokenization cannot.
Lastly, similar to encryption, you can create different types of tokens that preserve the length, format and other characteristics of the input data. This is particularly useful for data processing such as lookups across databases.
Tokenization is widely used to protect sensitive data in financial services and card networks and is starting to gain popularity in other use cases as well.
Now that we've understood what encryption, tokenization and synthetic data are, let's look at the differences and similarities and better understand their use cases.
Feature | Synthetic Data | Tokenization |
---|---|---|
Input Format | Raw data from real datasets | Sensitive data (e.g., PII, payment info) |
Output Format | New, non-real dataset mimicking original patterns | Non-sensitive token or identifier |
Is Reversible | Not applicable (generates new data) | Conditional (secure lookup required) |
Generation Method | Data modeling and algorithms | Mapping to unique tokens |
Use-Cases | Seeding databases, bug bashing, machine learning training | Payment processing, data anonymization |
Data Utility | High (preserves statistical properties) | Low to medium (depends on tokenization scheme) |
Risk of Data Exposure | Low (no direct link to real data) | Low (tokens are not meaningful) |
Regulatory Compliance | Can aid in compliance by avoiding use of real data | Often used for PCI-DSS, GDPR compliance |
One of the main use-cases that we see with regards to data privacy and LLMs is the ability to anonymize or generate synthetic data for training or fine-tuning use-cases. In those situations a machine learning engineer can use either synthetic data or tokenization. What we know today is that data that is used to train LLMs is tokenized and vectorized before it's trained. This results in a high-dimensional graph where distances between tokens indict similarity.
For example, "queen" and "king" will be closer in distance in the graph than "queen" and "269745d2-1e30-4f98-beb3-3cace04769d2" or "queen" and "iosjdf09aufjw2". Why? Because semantically, queen and king have some relation while queen and UUID or queen and random string do not.
So what is important here is that we don't destroy the semantic meaning of the data that we want to anonymize/generate synthetic data for since that's how the model categorizes data!
Let's take a simple scenario. If I have a database with 3 columns and 3 rows of data that I want to protect as I train my model, it might look something like this:
first_name | last_name | age | height | |
---|---|---|---|---|
john@doe.com | chris | doe | 23 | 68 |
bill@frank.com | bill | frank | 18 | 63 |
chris@johnson.com | chris | johnson | 39 | 72 |
Let's take a look at what this looks like with synthetic data and with tokenization.
Given that our goal is to protect sensitive data without losing the semantic meaning of that data, we can create synthetic data that looks just like it but is synthetic! Here's what it would look like:
first_name | last_name | age | height | |
---|---|---|---|---|
evan@guill.com | evan | guill | 28 | 65 |
joe@storm.com | joe | storm | 19 | 64 |
larry@resan.com | larry | resan | 36 | 71 |
We've updated our PII to generate new emails, first names and last names and then for the age and height columns we've generated a random value within a +/- range of 3. This data is brand new and does not leak any of our existing sensitive data. Semantically, it still makes sense. We've introduced a little more noise in the age and height columns but we have total control over that and how we update it.
That's the power of synthetic data. You're able to generate net new data that looks like our sensitive data and maintains generally the same semantic meaning but is privacy-safe.
Looking at tokenization, there are several different types of tokenization. Length-preserving and format-preserving tokenization are powerful features that mimic the format and length of the original data but the data itself is not retained. Let's see:
first_name | last_name | age | height | |
---|---|---|---|---|
ewaqewe@wfeweas.fewfe | ioqw | adfce | 91 | 19 |
8i778@qewf.wfe | qwd | eafs | 23 | 45 |
7kj7k8@wdqoij.jwfeoi | qpzme | ewaw | 12 | 01 |
Tokenization can still anonymize sensitive data however you will likely lose the semantic meaning of the data since the tokens it generates are typically nonsense. For some use-cases this might be fine, but every bit of noise you add to the model, you make the model less accurate. So there is a trade-off and tokenization tends to add more noise than synthetic data.
Tokenization and synthetic data have similarities but also have key differences that are important for engineering and security teams to understand. As LLMs become more widely used, it's important that developers and machine learning engineers understand the tools they have available and when/how to use them. Generally, when interacting with machine learning models that care about semantic meaning, synthetic data is a great choice.
A guide to using AI to generate synthetic data for your database and application using any LLM that is available at an endpoint.
May 21st, 2024
A walkthrough of how we reduced the time it takes to generate data in Neosync by 50%. Also see our updated performance benchmarks.
April 10th, 2024