Ensuring end-to-end safety in Generative AI: a comprehensive approach
Generative Artificial intelligence (GenAI) is at the forefront of technological innovation, promising to revolutionise industries by emulating human creativity. Yet, this powerful capability necessitates a vigilant approach to safety to prevent misuse, such as spreading misinformation or creating biased content.
Generative Artificial intelligence (GenAI) is at the forefront of technological innovation, promising to revolutionise industries by emulating human creativity. Yet, this powerful capability necessitates a vigilant approach to safety to prevent misuse, such as spreading misinformation or creating biased content. Prioritising end-to-end safety is essential for harnessing GenAI’s benefits responsibly.
In this article, we outline the actionable steps you can follow to ensure the security, reliability, and trustworthiness of your GenAI models and address key concerns at every stage of the development lifecycle.
Pre-release testing: Laying the foundation for safety and security
Just like other forms of AI, attackers can exploit vulnerabilities in GenAI models to manipulate or deceive the system. Testing models regularly with Adversarial Prompting or Prompt Injection techniques is an effective way to combat this issue and prevent unsafe or unexpected results.
Using this approach, misleading instructions are given to a system to test its accuracy or its potential to return sensitive information it has either learned during training or has access to through a Retrieval Augmented Generation (RAG) or agent-based setup.
Adversarial Prompting or Prompt Injection can be approached in several ways, including:
Jailbreaking – prompting a model into a “Developer Mode” state with limited guardrails,
Obfuscation – replacing trigger words with synonyms or typos to evade filters,
Abstraction – altering the context of a prompt while maintaining the intention,
Suffix Injection – appending specific suffixes to malicious prompts to induce a response, and
Code Injection – prompting the model to run arbitrary code, either evaluating it itself or sending it to an interpreter.
These attacks can become more elaborate when split across multiple models or modalities, for example incorporating traditional vision-based adversarial attacks, but most mimic traditional system vulnerabilities, such as Structured Query Language (SQL) injection.
Generative AI systems should therefore include input and output sanitisation such as filtering and character escaping on user input. This input should then be encapsulated, ensuring it is clearly distinguishable from the application instructions.
Post-prompting is another method of achieving this. As models tend to pay more attention to the last set of instructions, systems can insert a final instruction to the prompt, clarifying the core functionality required. Ultimately, the best defence is to limit the privileges of the model or any associated agents or to remove any direct user interaction with the model.
Has the wider system and life-cycle been considered?
The above mainly focuses on so-called ‘direct attacks’, where the adversarial instructions are provided by user interaction, but it is important to also consider indirect attacks. This is where a malicious prompt is injected via a third-party data source i.e. through an API call or content retrieved through RAG. The same defences can be applied, but it’s important to remember AI models are not built in isolation. Leveraging secure-by-design infrastructure while also implementing governance and monitoring at a system-wide level is critical for ensuring security.
Model design and use: Maximising quality and trustworthiness
Despite considerable progress in reducing Generative AI bias, relevancy, and hallucinations (incorrect or misleading results generated by AI models) it remains possible for models to confidently provide incorrect information or discriminatory or offensive content to users.
The main reason for this is that the datasets used to train the foundation models are huge, unmoderated scrapes of the internet that cannot be verified for accuracy and safety.
One way to reduce the chance of this leaking into your application is by fine-tuning with your own data. This involves inserting and up-weighing correct information in the model’s internal state while reducing its preference for producing harmful content.
A more effective approach, however, is to use a combination of Retrieval Augmented Generation (RAG) and moderation guardrails. With RAG, relevant information from an internal knowledge store is first retrieved based on the user’s need. This curated information is then used by the model to formulate its response rather than any internal knowledge. Moderation guardrails can be applied before and after this to provide further protection by detecting harmful content in both the initial user interaction and the model’s response.
What can I do to test GenAI's output so I can understand and trust it more?
Generative AI models are some of the largest and most complex created to date, with GPT4 for example estimated to have around 1.8 trillion parameters. As a result these models really are ‘black boxes’, making it nearly impossible to understand why they produce a particular outcome given a specific prompt.
There are, however, some approaches that can facilitate transparency and auditability, which are essential features of a trustworthy system. One method is to simply ask the model to ‘show its working’. In this way, users can understand which parts of the input it focused on and the reasoning it followed to get to the response, meaning a user can easily check if the model made a mistake or misinterpreted something. Another approach is to break the task down into smaller, more explainable chunks. These can then be easily logged and tested in a similar way to functions in a program.
Tracking performance post deployment: Ensuring continuous evaluation and maintenance
A core facet of GenAI is its ability to generalise to new tasks, as evidenced by the emerging capabilities we have seen in different models to date.
However, when models are fine-tuned, their ability to generalise can be greatly reduced or lost. Therefore, when fine-tuning for a new domain, it’s important to recognize the process is focusing on a particular set of inputs and outputs and away from tasks uncommon to the domain.
As a result, comprehensive evaluation sets are crucial to identifying whether certain tasks are underperforming or disrupting downstream processes. These evaluation sets should include:
Task-specific benchmarking,
Statistical evaluation measuring inherent properties of the responses,
Model-based evaluation using a separate model to evaluate responses, and
Human evaluation via feedback from domain experts.
Having a well-established suite of metrics that can be applied in production means models can use user feedback to continuously improve, protecting against concept and data drift.
Summary: Embedding safety from start to finish
To make GenAI secure, robust and trustworthy, it is critical to consider safety throughout the system, end to end. Here is a roundup of the key areas to focus on:
Implement robust testing: Conduct thorough testing of GenAI models before release using techniques like Adversarial Prompting or Prompt Injection. Additionally, establish a secure-by-design infrastructure to prevent indirect attacks when integrating third-party data sources.
Fine-tune your data: Use a combination of Retrieval Augmented Generation (RAG) and moderation guardrails to fine-tune your data to enhance the accuracy and reliability of your models.
Increase transparency and auditability: Prompt models to explain their reasoning or break tasks into smaller, explainable chunks to make them easier to understand, evaluate and log.
Track and monitor performance: Continuously monitor the performance of your GenAI models to detect and address concept and data drift. Employ comprehensive evaluation methods, including task-specific benchmarking, statistical evaluation, and model-based assessment.
By taking this comprehensive approach, and prioritising safety end-to-end, you can maximise the potential of GenAI for your business while minimising its risks. As a result, you can ensure your models are designed and built to be robust, secure, and ultimately safe for everyone to use.
Contact us to discuss how Faculty can help your organisation integrate these essential safety measures and elevate your Generative AI strategy to new heights of security and reliability.