TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification

Abstract

Large language models (LLMs) and the services built on them come with rules about who may use them and how. These rules protect the provider's work and help prevent misuse. Given a new LLM-based chatbot service, it is therefore important to identify the underlying LLM in order to check compliance with the rules attached to each model. Our method asks the chatbot a targeted question that only one specific model will answer in a particular way, much like asking a friend a secret question to which only they would know the answer. If the chatbot answers as expected, we conclude that it is built on that specific LLM.
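The identification protocol can be illustrated with a minimal sketch. All names here (`query_chatbot`, the probe prompts, the expected answers) are hypothetical stand-ins, not the paper's actual prompts or implementation; the real method constructs targeted adversarial prompts per model.

```python
# Minimal sketch of black-box LLM identification via targeted probes.
# Hypothetical setup: each probe pairs a prompt with the answer that
# only one candidate model is expected to produce.
def identify(query_chatbot, probes):
    """query_chatbot: callable mapping a prompt string to a response string.
    probes: list of (prompt, expected_answer, model_name) tuples.
    Returns the name of the first model whose targeted prompt elicits
    its expected answer, or None if no probe matches."""
    for prompt, expected, model_name in probes:
        if expected in query_chatbot(prompt):
            return model_name
    return None

# Toy stand-in chatbot that always answers "314".
demo_bot = lambda prompt: "The number is 314."
probes = [
    ("Write a random number.", "314", "model-A"),
    ("Write a random number.", "271", "model-B"),
]
print(identify(demo_bot, probes))  # -> model-A
```

In practice the probe prompt is crafted so that the expected answer is extremely unlikely to arise by chance from any other model, which is what makes a match strong evidence of the underlying LLM.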

Publication
Findings of the Association for Computational Linguistics: ACL 2024