Employees feed ChatGPT with sensitive business data


Employees are submitting sensitive business data and privacy-protected information to large language models (LLMs) such as ChatGPT, raising concerns that artificial intelligence (AI) services could incorporate the data into their models and that the information could be retrieved at a later date if adequate data security is not in place for the service.

In a recent report, data security service Cyberhaven said it detected and blocked requests to enter data into ChatGPT from 4.2% of the 1.6 million employees at its client companies because of the risk of leaking confidential information, client data, source code, or regulated information to the LLM.

In one instance, an executive cut and pasted the company’s 2023 strategy document into ChatGPT and asked it to create a PowerPoint deck. In another case, a doctor entered a patient’s name and medical condition and asked ChatGPT to draft a letter to the patient’s insurance company.

And as more employees use ChatGPT and other AI-based services as productivity tools, the risk increases, says Howard Ting, CEO of Cyberhaven.

“There’s been this big migration of data from on-premises to the cloud, and the next big shift is going to be migrating data into these generative apps,” he says. “And how that works out [remains to be seen] — I think we’re in pregame; we’re not even in the first inning.”

With the rising popularity of OpenAI’s ChatGPT and its foundational AI model, the Generative Pre-trained Transformer or GPT-3, as well as other LLMs, companies and security professionals have begun to worry that sensitive data ingested as training data into the models could resurface when prompted by the right queries. Some are taking action: JPMorgan, for example, has restricted employees’ use of ChatGPT, and Amazon, Microsoft, and Walmart have all warned employees to be cautious when using generative AI services.

[Chart: data-out events to ChatGPT. More and more users are submitting sensitive data to ChatGPT. Source: Cyberhaven]

And as more software companies connect their applications to ChatGPT, the LLM may be collecting far more information than users, or their companies, are aware of, putting them at legal risk, Karla Grossenbacher, a partner at law firm Seyfarth Shaw, warned in a Bloomberg Law column.

“Prudent employers will include, in employee confidentiality agreements and policies, prohibitions on employees referring to or entering confidential, proprietary, or trade secret information into AI chatbots or language models such as ChatGPT,” she wrote. “On the other hand, because ChatGPT was trained on vast swaths of online information, employees could obtain and use information from the tool that is trademarked, copyrighted, or the intellectual property of another person or entity, creating legal risk for employers.”

The risk is not theoretical. In a June 2021 paper, a dozen researchers from a who’s who list of companies and universities, including Apple, Google, Harvard University, and Stanford University, found that so-called “training data extraction attacks” could successfully recover verbatim text sequences, personally identifiable information (PII), and other information from the training documents of the LLM known as GPT-2. In fact, only a single document was required for an LLM to memorize verbatim data, the researchers noted in the paper.

Picking the brain of GPT

In fact, these training data extraction attacks are one of the top adversarial concerns among machine learning researchers. Also known as “exfiltration via machine learning inference,” the attacks could harvest sensitive information or steal intellectual property, according to MITRE’s Adversarial Threat Landscape for Artificial-Intelligence Systems (ATLAS) knowledge base.

It works like this: By querying a generative AI system in a way that makes it recall specific items, an attacker could trigger the model to reproduce a specific piece of information rather than generate synthetic data. A number of real-world examples exist for GPT-3, the successor to GPT-2, including an instance where GitHub’s Copilot recalled a specific developer’s username and coding priorities.
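To make the distinction between recalled and generated text concrete, one rough heuristic is to look for long verbatim overlaps between a model's output and a document known to be sensitive. The Python sketch below is illustrative only: the function names and the 50-character threshold are assumptions made for this example, not something taken from the researchers' paper or from ATLAS, and a real audit would need access to the text in question.

```python
from difflib import SequenceMatcher


def longest_verbatim_overlap(model_output: str, reference: str) -> str:
    """Return the longest substring that appears in both texts.

    autojunk=False disables difflib's heuristic of ignoring very frequent
    characters, which would otherwise distort matches in long inputs.
    """
    matcher = SequenceMatcher(None, model_output, reference, autojunk=False)
    match = matcher.find_longest_match(0, len(model_output), 0, len(reference))
    return model_output[match.a : match.a + match.size]


def looks_memorized(model_output: str, reference: str, min_chars: int = 50) -> bool:
    """Flag output that shares a long verbatim run with the reference text.

    min_chars is a hypothetical threshold; short overlaps (common words,
    boilerplate phrases) are expected even in freshly generated text.
    """
    return len(longest_verbatim_overlap(model_output, reference)) >= min_chars
```

A long shared run suggests the output was reproduced rather than composed, while short overlaps are normal for any two English texts. The comparison is quadratic in the worst case, so it suits spot checks on short passages rather than bulk scanning.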

Beyond GPT-based offerings, other AI-based services have raised questions about whether they pose a risk. The automated transcription service Otter, for example, transcribes audio files into text, automatically identifies speakers, and allows key words and phrases to be tagged and highlighted. The fact that the company stores this information in its cloud has raised concerns among journalists.

According to Julie Wu, senior compliance manager at Otter, the company is committed to keeping user data private and has put strict compliance controls in place.

“Otter has completed its SOC2 Type 2 audit and reports and we have technical and organizational measures in place to protect personal data,” she tells Dark Reading. “Speaker identification is account-bound. Adding a speaker’s name trains Otter to recognize the speaker for future conversations you record or import into your account, but doesn’t allow speakers to be identified across accounts.”

APIs enable fast GPT adoption

ChatGPT’s popularity has surprised many businesses. More than 300 developers are using GPT-3 to power their applications, according to the latest published figures from a year ago. For example, social media company Snap and shopping platforms Instacart and Shopify all use ChatGPT through the API to add chat functionality to their mobile apps.

Based on conversations with his company’s customers, Cyberhaven’s Ting expects the shift to generative AI apps to only accelerate, with the apps used for everything from creating memos and presentations to triaging security incidents and interacting with patients.

As he puts it, his customers have been telling him: “Look, as a stopgap, I’m blocking this app right now, but my board has already told me we can’t do that. Because these tools will help our users be more productive – there is a competitive advantage – and if my competitors are using these generative AI apps and I don’t allow my users to use them, that puts us at a disadvantage.”

The good news is that education could have a major impact on whether data leaks from a given company, since a small number of employees are responsible for most of the risky requests. Fewer than 1% of employees are responsible for 80% of the incidents in which sensitive data is sent to ChatGPT, says Cyberhaven’s Ting.

“You know, there are two forms of training: There’s classroom training, like when you’re onboarding an employee, and then there’s contextual training, like when someone’s actually trying to insert data,” he says. “I think both are important, but I think the latter is much more effective from what we’ve seen.”
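The contextual training Ting describes, stepping in at the moment someone tries to paste data, can be pictured as a check that runs on a prompt before it leaves the company. The Python sketch below is a simplified illustration under stated assumptions: the pattern list, labels, and function names are hypothetical, the patterns are far from exhaustive, and nothing here reflects how Cyberhaven's product actually works.

```python
import re
from typing import List, Optional

# Hypothetical, deliberately small set of patterns for illustration.
SENSITIVE_PATTERNS = {
    "email address": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "US Social Security number": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "confidentiality marking": re.compile(
        r"\b(confidential|internal only|do not distribute)\b", re.IGNORECASE
    ),
    "private key block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def find_sensitive(text: str) -> List[str]:
    """Return the labels of every pattern that matches the text."""
    return [label for label, pattern in SENSITIVE_PATTERNS.items() if pattern.search(text)]


def warn_before_submit(text: str) -> Optional[str]:
    """Return a warning to show the user, or None if nothing was flagged."""
    hits = find_sensitive(text)
    if not hits:
        return None
    return (
        "This prompt appears to contain: "
        + ", ".join(hits)
        + ". Check company policy before sending it to an external AI service."
    )
```

Returning a message rather than silently blocking keeps the check educational: the employee learns at the moment of pasting which part of the prompt caused concern. Pattern matching of this kind will miss unmarked sensitive material such as a strategy document, so it would complement classroom training rather than replace it.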

In addition, OpenAI and other companies are working to limit the LLM’s access to personal information and sensitive data: Asking for personal details or sensitive company information currently leads to canned statements from ChatGPT declining to comply.

For example, when asked, “What is Apple’s strategy for 2023?” ChatGPT replied, “As an AI language model, I do not have access to Apple’s confidential information or future plans. Apple is a highly secretive company, and they don’t typically release their strategies or future plans to the public until they’re ready to release them.”
