VALID: New AI Safety Method for Language Models

AI Safety

A team of researchers from Oxford University, Vienna University of Technology, and KAUST has developed a new framework to keep large language models (LLMs) within their intended topics for AI safety. Their method, VALID (Verified Adversarial LLM Output via Iterative Dismissal), prevents LLMs from generating off-topic responses, even under adversarial attacks. The research was presented at ICLR 2025.

Why Domain Certification is Needed

LLMs play a crucial role in customer support, healthcare, and finance, where accuracy matters. However, users can trick these models into producing off-topic or harmful content.

For example, a chatbot designed for medical advice should only provide responses related to health. If tricked, it could generate content outside this domain, such as tax fraud guidance or weapon manufacturing instructions, which poses serious legal and ethical risks.

To counter this, the researchers propose domain certification, a mathematical guarantee that an LLM will stay within its designated area even when manipulated by adversarial prompts.

Introducing VALID: A New Safety Mechanism

VALID uses a guide model (G) trained only on in-domain data for AI safety. It compares the main LLM’s responses to this model. If the response is off-topic, the system rejects it.

Key Steps:

  1. Defines the Allowed Topics – Identifies the specific domain for responses.
  2. Uses a Guide Model (G) – A smaller, specialized model trained only on in-domain data to verify the main LLM’s responses.
  3. Rejects Unsafe Outputs – The system iteratively checks if the LLM’s response matches the guide model’s domain. If it deviates, the response is rejected.

Unlike traditional guardrails, which rely on human feedback, VALID provides mathematical guarantees for safer AI interactions.

Performance & Real-World Applications

The researchers tested VALID across three key domains:

  • Shakespearean Texts – Ensuring generated text remains in the style of Shakespeare.
  • Computer Science News – Restricting LLM-generated news content to CS topics.
  • Medical QA – Ensuring a medical chatbot does not provide answers beyond health-related inquiries.

In all tests, VALID significantly reduced the probability of OOD responses, with over 99% of unsafe prompts being rejected while maintaining high accuracy for in-domain responses.

Industry Implications

This research has significant implications for industries that use LLMs in regulated environments:

  • Healthcare: Ensuring AI-driven chatbots only provide verified medical information.
  • Finance: Preventing AI assistants from generating unlicensed financial advice.
  • Legal & Compliance: Offering AI-generated responses with legally safe guarantees.

Final Thoughts

As AI models become more advanced, ensuring safe and reliable outputs is crucial. The VALID framework represents a major step toward making LLMs more trustworthy and resistant to adversarial manipulation.

With domain certification, companies can now deploy AI assistants with higher confidence, knowing they will stay within their expertise and reject misleading or harmful content.


Discover more from NewsHunt.ai

Subscribe to get the latest posts sent to your email.

Related posts