How to protect your data when using ChatGPT and other LLMs

Robert John
5 min read · Jan 8, 2024


Safeguarding Your Information: A Deep Dive into Protecting Data from Generative AI

Step-by-step process of how Microsoft Presidio de-identifies PII. (Image source)

LLMs are now nearly everywhere: people chat directly with ChatGPT and other chatbots, and products integrate them through APIs. At the core of these interactions are the prompts we submit, which sometimes include Personally Identifiable Information (PII). These prompts shape the responses the language models generate.

It’s crucial to be aware that the prompts you submit are often stored by the companies offering generative AI services. For instance, OpenAI retains your data for up to 30 days, and Anthropic retains yours for up to 28 days. That retention period can extend even further when the data is used to retrain these language models. In this article, we’ll explore effective strategies and best practices to safeguard your data when engaging with ChatGPT and similar LLMs, balancing the utility of these powerful tools against the protection of your personal information.


You might be wondering what risks come with OpenAI or other AI companies storing your prompts. There are several reasons to be careful about the information you provide to them:

1. Data breaches: A data breach occurs when confidential or sensitive information embedded in your prompts is exposed to an unsecured environment. This threat became glaringly apparent in March 2023, when OpenAI suffered a breach that briefly allowed some users to see the titles of other users’ conversations.

Let’s break it down: when you engage with a language model like ChatGPT, details from your interaction are stored, at least temporarily. That stored information is a potential target for attackers, and if someone with malicious intent gains access to it, your personal information is at risk.

2. Using prompt data to retrain models: A common practice among major AI companies is to use your prompts to retrain their models. The stated goal is to improve your overall experience with their products and services. However, this seemingly innocuous approach has raised enough concern that some companies and even entire countries have restricted the use of LLMs. Samsung, for example, banned generative AI tools such as ChatGPT after sensitive internal data was submitted in prompts, and Italy temporarily banned ChatGPT over concerns about how user data is collected and used for training.

3. Extracting training data from LLMs: Ever wondered how people manage to pull information back out of language models? There is a growing body of research on exactly that. Take, for example, “Scalable Extraction of Training Data from (Production) Language Models” by DeepMind and several universities, which demonstrates methods for recovering training data from deployed models, and it is far from the only technique out there.

The takeaway: if your data was used to train a model and someone can extract it, your data isn’t as safe as you might think. That is why training-data extraction research matters for the security of anything you put in a prompt.

I get it, you’re probably not here just for the rundown of privacy and security challenges with LLMs. So let’s get to the practical part: there are two major ways to protect your data while interacting with LLMs.

Running LLMs Locally
When it comes to safeguarding your data while interacting with LLMs, running them locally is one of the most effective options. You can operate the model without an internet connection, which gives you far more control over your data. Several tools have emerged to make this easier; two noteworthy options are Jan and PrivateGPT. For this discussion we will focus on Jan, a user-friendly desktop app that doesn’t require cloning a GitHub repository. Here’s how to get started on your Mac, Windows, or Linux computer (see the sketch after the list for a programmatic alternative):

  1. Download Jan: Get the installer, which is available for Mac, Windows, and Linux.
  2. Set up and download a model: Install the application and download an open-source model through the app.
  3. Select your model: In the top right corner, expand the tab and choose the model you want to use.
  4. Customize instructions: Tailor the model’s behavior by adding custom instructions based on your preferences.
  5. Monitor resource usage: Below, you can find details about memory consumption, giving you insight into the resources being used.
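
If you prefer to script against a local model instead of using the chat window, many local runners, Jan included, can expose an OpenAI-compatible API on localhost. The snippet below is a minimal sketch, assuming such a server is running at http://localhost:1337/v1 with a model named mistral-ins-7b-q4 already downloaded; both the port and the model name are assumptions, so substitute whatever your tool reports.

```python
# Minimal sketch: chat with a locally served model through an
# OpenAI-compatible endpoint, so no prompt data leaves your machine.
# Assumptions: a local server at http://localhost:1337/v1 and a model
# named "mistral-ins-7b-q4" -- adjust both to match your setup.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1337/v1",  # local server, not api.openai.com
    api_key="not-needed",                 # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="mistral-ins-7b-q4",
    messages=[{"role": "user", "content": "Summarize this contract clause: ..."}],
)
print(response.choices[0].message.content)
```

Because the request goes to localhost, both the prompt and the response stay on your computer.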

It’s important to note that running LLMs locally may result in slightly slower performance and could require a bit more of your computer’s resources. However, the trade-off is a heightened level of data security, providing you with more control and peace of mind.


Data anonymization and de-identification

Data anonymization means removing or masking identifying details in your data before including it in a prompt for a language model like GPT-4. This is important for safeguarding privacy and keeping information confidential. If you don’t anonymize the data, sensitive details such as names, addresses, contact numbers, or other identifiers linked to specific individuals can be exposed and misused. Tools like Microsoft’s Presidio, which also has an integration with LangChain, can help with this process.
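
Here is a minimal sketch of the direct Presidio route. It assumes you have installed presidio-analyzer and presidio-anonymizer (plus a spaCy English model such as en_core_web_lg); the sample prompt and its PII are made up.

```python
# Minimal sketch: detect and mask PII with Microsoft Presidio before
# the text ever reaches an LLM. Assumes `pip install presidio-analyzer
# presidio-anonymizer` and a spaCy English model (e.g. en_core_web_lg).
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

prompt = "Draft a follow-up email to Jane Doe (jane.doe@example.com, +1-212-555-0199)."

# Find PII entities (names, email addresses, phone numbers, ...) in the prompt.
findings = analyzer.analyze(text=prompt, language="en")

# Replace each detected entity with a placeholder such as <PERSON>.
sanitized = anonymizer.anonymize(text=prompt, analyzer_results=findings)

print(sanitized.text)
# e.g. "Draft a follow-up email to <PERSON> (<EMAIL_ADDRESS>, <PHONE_NUMBER>)."
```

You then send sanitized.text to the LLM instead of the raw prompt. If you are already using LangChain, its experimental PresidioAnonymizer wrapper offers a similar one-call interface.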

In this article, we explored the data privacy challenges that come with generative AI and covered two practical strategies for shielding your data, along with tools for each approach: running LLMs locally and anonymizing PII before it ever reaches a prompt. With those in place, you can keep the utility of these tools without handing over more of your information than you intend.


Written by Robert John

I develop machine learning models and deploy them to production using cloud services.