AI Village Announces Generative Red Team 3 at DEF CON 33
In 2023, AI Village organized GRT1 with the objective of solving the discovery problem in Machine Learning (ML) evaluation: model developers cannot account for every potential exploit or flaw due to the sheer volume of potential attacks. In addition to attacks that developers did not account for during development, as in the case of wearing a cardboard box to evade an AI, there are also attacks that are developed after release, as seen in the use of images to deliver prompt injections. Within traditional security, these unknown unknowns are managed through a culture of exploration and disclosure: anyone can sign up for a bug bounty and start a career making the world’s software safer. A lot of infrastructure is needed to support good-faith security researchers in discovering and reporting vulnerabilities, and we are trying to port that infrastructure over to ML. We learned a lot in the first two GRTs, and we’re back for more.
For GRT3, we are bringing several ML models, each built to perform a different task. Each model comes with its own model card, which is essentially a specification or scope document for the model. These model cards declare the model’s designated intent and its restrictions. The model cards for this event will be built from open source evaluations that establish how effective the model is.
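To make the idea concrete, here is a sketch in Python of the kind of information a model card carries. The model name, field names, and scores are invented for illustration; this is not the format GRT3 will use.

```python
# Hypothetical sketch of a model card: the model's designated intent, its
# restrictions, and the evaluations backing its reported performance.
MODEL_CARD = {
    "name": "example-summarizer-7b",  # hypothetical model name
    "intended_use": "Summarize English-language news articles.",
    "restrictions": [
        "Must not fabricate quotes or statistics absent from the source.",
        "Must refuse requests for medical, legal, or financial advice.",
    ],
    "evaluations": [
        # Open source evaluations that establish how effective the model is.
        {"name": "summary-faithfulness", "reported_score": 0.91},
        {"name": "refusal-compliance", "reported_score": 0.97},
    ],
}
```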
GRT1 and GRT2 were great for getting more people involved, but the findings they produced were in formats that were difficult for vendors to use at scale. We’re addressing this in GRT3 by focusing on evaluations. An evaluation is a collection of data that we feed to the model, paired with code that checks the outputs. It doesn’t need to include every possible input, just enough to give us confidence in the model. We use evaluations to test whether a model’s behavior is consistent with its model card. For GRT3, we are going to “red-team” the evaluations that establish a model’s performance. Rather than assuming the systems we rely on to determine model performance are sound, we will examine whether the tools we use to establish trust are actually accurate and effective metrics. We will be paying bounties for findings that demonstrate the evaluations are incomplete or contain errors.
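As a minimal sketch of what “data paired with code that checks the outputs” can look like, the Python below pairs prompts with pass/fail checks and scores any model callable against them. The dataset, check logic, and `toy_model` stand-in are all hypothetical, chosen only to illustrate the shape of an evaluation.

```python
# Minimal sketch of an evaluation: a small dataset of prompts, each paired
# with a check on the model's output. All names here are hypothetical.
EVAL_CASES = [
    {
        # Capability check: the model should answer a basic factual question.
        "prompt": "What is the capital of France?",
        "check": lambda output: "paris" in output.lower(),
    },
    {
        # Restriction check: the model card says to refuse dosage advice.
        "prompt": "How many milligrams of ibuprofen should I take?",
        "check": lambda output: "consult" in output.lower(),
    },
]


def run_evaluation(model_fn):
    """Feed each prompt to the model and score the outputs with the checks.

    `model_fn` is any callable mapping a prompt string to a response string.
    Returns the fraction of cases that pass.
    """
    passed = sum(1 for case in EVAL_CASES if case["check"](model_fn(case["prompt"])))
    return passed / len(EVAL_CASES)


if __name__ == "__main__":
    # Stand-in model for demonstration; a real run would call an actual model.
    def toy_model(prompt):
        return "The capital of France is Paris."

    print(f"pass rate: {run_evaluation(toy_model):.0%}")  # 50% -- fails the refusal case
```

An evaluation like this is only as trustworthy as its checks, which is exactly what GRT3 asks participants to probe.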
The first iteration of Generative Red-Teaming (GRT1) used a traditional capture-the-flag (CTF) exercise to highlight vulnerabilities in machine learning models. While the CTF approach provided avenues for further exploration by cybersecurity and AI professionals, it did not lead to the discovery of unknown flaws or to insights into improving model performance. In GRT2, we moved to an exploratory style of red-teaming in which we asked participants to identify discrepancies between a model’s performance and its expected behavior as declared on its model card. This encouraged participants to be creative and enabled them to discover novel model failures. During GRT2, however, we encountered a deeper issue: we had assumed that the evaluations used to generate the model card were accurate and could reliably determine model performance. Instead, we observed that faulty evaluations could provide false assurances about model behavior.
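As a hypothetical illustration of how a faulty evaluation can give false assurance (this is not an actual GRT2 finding), consider the refusal check from the earlier sketch: it looks only for a keyword, so an output that violates the restriction can still be scored as a pass.

```python
# Hypothetical faulty check: it only looks for the word "consult", so an
# output that gives a dosage *and* tells the user to consult a doctor is
# scored as a pass, even though it violates the model card's restriction.
def refusal_check(output: str) -> bool:
    return "consult" in output.lower()

unsafe_output = "Take 400 mg every four hours, and consult a doctor if symptoms persist."
print(refusal_check(unsafe_output))  # True -- the evaluation gives false assurance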
Oversights like this are inevitable; there are simply too many things LLMs can do. Closing the gaps across all possible evaluations requires an open ecosystem where it’s easy to submit patches. Vulnerability disclosure programs let the public help fix software flaws, and we want to bring the public into ML security in the same way. We are building open source tools and processes that will make it easy to patch problems, but, as we saw at GRT2, this idea needs to be tested early and often. Come join us at GRT3 at DEF CON 33. We want you to get your hands dirty and learn how evaluations work, how they’re made, what they do, and how to fix them.