Building safer dialogue agents

Training an AI to communicate in a way that’s more helpful, correct, and harmless

In recent years, large language models (LLMs) have achieved success at a range of tasks such as question answering, summarisation, and dialogue. Dialogue is a particularly interesting task because it enables flexible and interactive communication. However, dialogue agents powered by LLMs can express inaccurate or invented information, use discriminatory language, or encourage unsafe behaviour.

To create safer dialogue agents, we need to be able to learn from human feedback. Applying reinforcement learning based on input from research participants, we explore new methods for training dialogue agents that show promise for a safer system.

In our latest paper, we introduce Sparrow – a dialogue agent that’s useful and reduces the risk of unsafe and inappropriate answers. Our agent is designed to talk with a user, answer questions, and search the internet using Google when it’s helpful to look up evidence to inform its responses.
Our new conversational AI model replies on its own to an initial human prompt.
Sparrow is a research model and proof of concept, designed with the goal of training dialogue agents to be more helpful, correct, and harmless. By learning these qualities in a general dialogue setting, Sparrow advances our understanding of how we can train agents to be safer and more useful – and ultimately, to help build safer and more useful artificial general intelligence (AGI).
Sparrow declining to answer a potentially harmful question.
How Sparrow works

Training a conversational AI is an especially challenging problem because it’s difficult to pinpoint what makes a dialogue successful. To address this problem, we turn to a form of reinforcement learning (RL) based on people's feedback, using study participants’ preference feedback to train a model of how useful an answer is. To get this data, we show our participants multiple model answers to the same question and ask them which answer they like the most. Because we show answers with and without evidence retrieved from the internet, this model can also determine when an answer should be supported with evidence.
We ask study participants to evaluate and interact with Sparrow either naturally or adversarially, continually expanding the dataset used to train Sparrow.
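As a concrete illustration of the preference-modelling step, the sketch below trains a toy reward model on pairwise human comparisons using a Bradley-Terry style loss, in the spirit of RL from human feedback. It is not Sparrow’s implementation: the `RewardModel` encoder, the random token batches, and all hyperparameters are placeholder assumptions.

```python
# Toy sketch: learn a preference (reward) model from pairwise human comparisons.
# Illustrative only -- the encoder, data, and hyperparameters are placeholders.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scores a tokenised (dialogue context, candidate answer) sequence."""
    def __init__(self, vocab_size=1000, dim=128):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # stand-in for an LLM encoder
        self.score = nn.Linear(dim, 1)

    def forward(self, token_ids):
        return self.score(self.embed(token_ids)).squeeze(-1)

def preference_loss(model, preferred_ids, rejected_ids):
    """Bradley-Terry style objective: the answer participants liked should score higher."""
    return -torch.nn.functional.logsigmoid(
        model(preferred_ids) - model(rejected_ids)
    ).mean()

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
preferred = torch.randint(0, 1000, (8, 32))  # batch of answers participants preferred
rejected = torch.randint(0, 1000, (8, 32))   # batch of answers they did not choose
loss = preference_loss(model, preferred, rejected)
loss.backward()
opt.step()
```

In Sparrow’s setting, the comparison data comes from participants choosing between multiple model answers, shown with and without retrieved evidence, and the resulting preference model then provides a reward signal for reinforcement learning.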
But increasing usefulness is only part of the story. To make sure that the model’s behaviour is safe, we must constrain its behaviour. And so, we determine an initial simple set of rules for the model, such as “don't make threatening statements” and “don't make hateful or insulting comments”. We also provide rules around possibly harmful advice and not claiming to be a person. These rules were informed by studying existing work on language harms and consulting with experts. We then ask our study participants to talk to our system, with the aim of tricking it into breaking the rules. These conversations let us train a separate ‘rule model’ that indicates when Sparrow's behaviour breaks any of the rules.

Towards better AI and better judgments

Verifying Sparrow’s answers for correctness is difficult even for experts. Instead, we ask our participants to determine whether Sparrow's answers are plausible and whether the evidence Sparrow provides actually supports the answer. When asked a factual question, Sparrow provides a plausible answer and supports it with evidence 78% of the time. This is a big improvement over our baseline models. Still, Sparrow isn't immune to making mistakes, like sometimes hallucinating facts and giving off-topic answers.

Sparrow also has room to improve its rule-following. After training, participants were still able to trick it into breaking our rules 8% of the time, but compared to simpler approaches, Sparrow is better at following our rules under adversarial probing. For instance, our original dialogue model broke rules roughly 3x more often than Sparrow when our participants tried to trick it into doing so.
Sparrow answers a question and follow-up question using evidence, then follows the “Do not pretend to have a human identity” rule when asked a personal question (sample from 9 September, 2022).
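The rule model can be pictured as a classifier over conversations: given a dialogue, it estimates whether any rule was broken. The sketch below is purely illustrative and assumes a small fixed rule list and a bag-of-words encoder; Sparrow’s actual rule set, data, and architecture differ.

```python
# Illustrative "rule model" sketch: per-rule violation probabilities for a dialogue.
# The rules, encoder, and threshold are assumptions for the example.
import torch
import torch.nn as nn

RULES = [
    "Do not make threatening statements.",
    "Do not make hateful or insulting comments.",
    "Do not give potentially harmful advice.",
    "Do not pretend to have a human identity.",
]

class RuleModel(nn.Module):
    def __init__(self, vocab_size=1000, dim=64, n_rules=len(RULES)):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # stand-in for an LLM encoder
        self.head = nn.Linear(dim, n_rules)            # one logit per rule

    def forward(self, token_ids):
        return torch.sigmoid(self.head(self.embed(token_ids)))

model = RuleModel()
dialogue = torch.randint(0, 1000, (1, 64))   # tokenised adversarial conversation
violation_probs = model(dialogue)            # shape: (1, len(RULES))
flagged = [rule for rule, p in zip(RULES, violation_probs[0]) if p > 0.5]
```

A model like this, trained on the adversarial conversations described above, could then be used, for example, as a training signal that discourages rule-breaking behaviour.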
DeepMind’s latest research at ICLR 2023

Research towards AI models that can generalise, scale, and accelerate science.
Next week marks the start of the 11th International Conference on Learning Representations (ICLR), taking place 1-5 May in Kigali, Rwanda. This will be the first major artificial intelligence (AI) conference to be hosted in Africa and the first in-person event since the start of the pandemic.
Researchers from around the world will gather to share their cutting-edge work in deep learning spanning the fields of AI, statistics and data science, and applications including machine vision, gaming and robotics. We’re proud to support the conference as a Diamond sponsor and DEI champion.
Teams from across DeepMind are presenting 23 papers this year. Here are a few highlights:
Recent progress has shown AI’s incredible performance on text and image tasks, but more research is needed for systems to generalise across domains and scales. This will be a crucial step on the path to developing artificial general intelligence (AGI) as a transformative tool in our everyday lives.
We present a new approach where models learn by solving two problems in one. By training models to look at a problem from two perspectives at the same time, they learn how to reason on tasks that require solving similar problems, which is beneficial for generalisation. We also explored the capability of neural networks to generalise by comparing them to the Chomsky hierarchy of languages. By rigorously testing 2200 models across 16 different tasks, we uncovered that certain models struggle to generalise, and found that augmenting them with external memory is crucial to improve performance.
Another challenge we tackle is how to make progress on longer-term tasks at an expert-level, where rewards are few and far between. We developed a new approach and open-source training data set to help models learn to explore in human-like ways over long time horizons.
As we develop more advanced AI capabilities, we must ensure current methods work as intended and efficiently for the real world. For example, although language models can produce impressive answers, many cannot explain their responses. We introduce a method for using language models to solve multi-step reasoning problems by exploiting their underlying logical structure, providing explanations that can be understood and checked by humans. On the other hand, adversarial attacks are a way of probing the limits of AI models by pushing them to create wrong or harmful outputs. Training on adversarial examples makes models more robust to attacks, but can come at the cost of performance on 'regular' inputs. We show that by adding adapters, we can create models that allow us to control this tradeoff on the fly.
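As a rough illustration of how adapters could expose such a tradeoff at inference time (a simplified sketch under assumed details, not the paper’s exact method), the adapter’s residual contribution can be scaled by a coefficient `alpha`, moving a single checkpoint between standard and adversarially trained behaviour:

```python
# Hedged sketch: a residual adapter whose contribution is scaled at inference time.
# alpha = 0 recovers the frozen base model; alpha = 1 applies the adapter that was
# trained on adversarial examples; intermediate values trade off the two behaviours.
import torch
import torch.nn as nn

class ScaledAdapter(nn.Module):
    def __init__(self, dim=256, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, hidden, alpha=1.0):
        return hidden + alpha * self.up(torch.relu(self.down(hidden)))

adapter = ScaledAdapter()
hidden = torch.randn(4, 256)              # activations from a frozen base layer
standard_mode = adapter(hidden, alpha=0.0)
robust_mode = adapter(hidden, alpha=1.0)
```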
Reinforcement learning (RL) has proved successful for a range of real-world challenges, but RL algorithms are usually designed to do one task well and struggle to generalise to new ones. We propose algorithm distillation, a method that enables a single model to efficiently generalise to new tasks by training a transformer to imitate the learning histories of RL algorithms across diverse tasks. RL models also learn by trial and error which can be very data-intensive and time-consuming. It took nearly 80 billion frames of data for our model Agent 57 to reach human-level performance across 57 Atari games. We share a new way to train to this level using 200 times less experience, vastly reducing computing and energy costs.
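The core idea of algorithm distillation can be sketched as a sequence-modelling problem: serialise an RL algorithm’s learning history, from early clumsy episodes to later competent ones, and train a model to predict the actions it took. The toy below is illustrative only; a GRU stands in for the causal transformer, and the episode format is invented for the example.

```python
# Toy sketch of algorithm distillation's data setup: imitate an RL algorithm's
# learning history as one long sequence. Model, features, and data are placeholders.
import torch
import torch.nn as nn

def build_learning_history(episodes):
    """episodes: list of (observations, actions, rewards), ordered from early to late training."""
    features, targets = [], []
    for obs, acts, rews in episodes:
        for o, a, r in zip(obs, acts, rews):
            features.append([o, r])  # per-step features: observation and reward
            targets.append(a)        # target: the action the RL algorithm took
    return torch.tensor(features, dtype=torch.float32), torch.tensor(targets)

class HistoryPolicy(nn.Module):
    """Stand-in for a causal transformer over the whole learning history."""
    def __init__(self, n_actions=4):
        super().__init__()
        self.net = nn.GRU(input_size=2, hidden_size=64, batch_first=True)
        self.head = nn.Linear(64, n_actions)

    def forward(self, history):        # history: (batch, time, 2)
        out, _ = self.net(history)
        return self.head(out)          # action logits at every step

# Two toy "episodes" with scalar observations, integer actions, and rewards.
episodes = [([0.1, 0.2], [1, 3], [0.0, 1.0]), ([0.3, 0.4], [2, 0], [1.0, 1.0])]
x, y = build_learning_history(episodes)
logits = HistoryPolicy()(x.unsqueeze(0))
loss = nn.functional.cross_entropy(logits.squeeze(0), y)
```

Because the training sequences span improvement over many episodes, the distilled model can, in principle, continue improving on a new task purely in-context, without updating its weights.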
AI is a powerful tool for researchers to analyse vast amounts of complex data and understand the world around us. Several papers show how AI is accelerating scientific progress – and how science is advancing AI.
Predicting a molecule's properties from its 3D structure is critical for drug discovery. We present a denoising method that achieves a new state-of-the-art in molecular property prediction, allows large-scale pre-training, and generalises across different biological datasets. We also introduce a new transformer which can make more accurate quantum chemistry calculations using data on atomic positions alone.
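As a hedged sketch of the denoising idea (heavily simplified, not the paper’s architecture): perturb a molecule’s 3D atomic coordinates with Gaussian noise and pre-train a network to predict that noise, so it learns useful structural representations before fine-tuning on property labels. A per-atom MLP stands in here for the graph network a real implementation would use.

```python
# Simplified denoising pre-training sketch: predict the Gaussian noise added to
# 3D atomic coordinates. Architecture and data are placeholders for illustration.
import torch
import torch.nn as nn

class DenoisingNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, coords):        # coords: (n_atoms, 3)
        return self.mlp(coords)       # predicted noise vector per atom

model = DenoisingNet()
coords = torch.randn(20, 3)                      # placeholder molecule with 20 atoms
noise = 0.1 * torch.randn_like(coords)
predicted_noise = model(coords + noise)
loss = ((predicted_noise - noise) ** 2).mean()   # denoising objective
```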
Finally, with FIGnet, we draw inspiration from physics to model collisions between complex shapes, like a teapot or a doughnut. This simulator could have applications across robotics, graphics and mechanical design.
See the full list of DeepMind papers and schedule of events at ICLR 2023.
Evaluating social and ethical risks from generative AI

Introducing a context-based framework for comprehensively evaluating the social and ethical risks of AI systems.
Generative AI systems are already being used to write books, create graphic designs, and assist medical practitioners, and they are becoming increasingly capable. Ensuring these systems are developed and deployed responsibly requires carefully evaluating the potential ethical and social risks they may pose.
In our new paper, we propose a three-layered framework for evaluating the social and ethical risks of AI systems. This framework includes evaluations of AI system capability, human interaction, and systemic impacts.
We also map the current state of safety evaluations and find three main gaps: context, specific risks, and multimodality. To help close these gaps, we call for repurposing existing evaluation methods for generative AI and for implementing a comprehensive approach to evaluation, as in our case study on misinformation. This approach integrates findings like how likely the AI system is to provide factually incorrect information with insights on how people use that system, and in what context. Multi-layered evaluations can draw conclusions beyond model capability and indicate whether harm — in this case, misinformation — actually occurs and spreads.
To make any technology work as intended, both social and technical challenges must be solved. So to better assess AI system safety, these different layers of context must be taken into account. Here, we build upon earlier research identifying the potential risks of large-scale language models, such as privacy leaks, job automation, and misinformation, and introduce a way of comprehensively evaluating these risks going forward.
Context is critical for evaluating AI risks.
The capabilities of AI systems are an important indicator of the types of wider risks that may arise. For example, AI systems that are more likely to produce factually inaccurate or misleading outputs may be more prone to creating risks of misinformation, causing issues like a lack of public trust.
Measuring these capabilities is core to AI safety assessments, but these assessments alone cannot ensure that AI systems are safe. Whether downstream harm manifests — for example, whether people come to hold false beliefs based on inaccurate model output — depends on context. More specifically, who uses the AI system and with what goal? Does the AI system function as intended? Does it create unexpected externalities? All these questions inform an overall evaluation of the safety of an AI system.
Extending beyond capability evaluation, we propose evaluation that can assess two additional points where downstream risks manifest: human interaction at the point of use, and systemic impact as an AI system is embedded in broader systems and widely deployed. Integrating evaluations of a given risk of harm across these layers provides a comprehensive evaluation of the safety of an AI system.
Human interaction evaluation centres the experience of people using an AI system. How do people use the AI system? Does the system perform as intended at the point of use, and how do experiences differ between demographics and user groups? Can we observe unexpected side effects from using this technology or being exposed to its outputs?
Systemic impact evaluation focuses on the broader structures into which an AI system is embedded, such as social institutions, labour markets, and the natural environment. Evaluation at this layer can shed light on risks of harm that become visible only once an AI system is adopted at scale.
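As an illustration (not an artefact from the paper), the three layers can be recorded as a simple data structure, using the misinformation case study as the example; the field names and questions below are assumptions for the sketch.

```python
# Illustrative representation of a three-layered risk evaluation plan.
from dataclasses import dataclass, field

@dataclass
class LayeredEvaluation:
    risk: str
    capability: list = field(default_factory=list)         # what the model tends to produce
    human_interaction: list = field(default_factory=list)  # what happens at the point of use
    systemic_impact: list = field(default_factory=list)    # what happens once widely deployed

misinformation_eval = LayeredEvaluation(
    risk="misinformation",
    capability=["How often does the system give factually incorrect answers?"],
    human_interaction=["Do users come to hold false beliefs after using the system?",
                       "Do outcomes differ across demographics and user groups?"],
    systemic_impact=["Do false claims spread once the system is adopted at scale?"],
)
```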
Market Impact Analysis
Market Growth Trend
| Year | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 |
|---|---|---|---|---|---|---|---|
| Growth Rate | 23.1% | 27.8% | 29.2% | 32.4% | 34.2% | 35.2% | 35.6% |
Quarterly Growth Rate
| Quarter | Q1 2024 | Q2 2024 | Q3 2024 | Q4 2024 |
|---|---|---|---|---|
| Growth Rate | 32.5% | 34.8% | 36.2% | 35.6% |
Market Segments and Growth Drivers
Segment | Market Share | Growth Rate |
---|---|---|
Machine Learning | 29% | 38.4% |
Computer Vision | 18% | 35.7% |
Natural Language Processing | 24% | 41.5% |
Robotics | 15% | 22.3% |
Other AI Technologies | 14% | 31.8% |
Competitive Landscape Analysis
Company | Market Share |
---|---|
Google AI | 18.3% |
Microsoft AI | 15.7% |
IBM Watson | 11.2% |
Amazon AI | 9.8% |
OpenAI | 8.4% |
Future Outlook and Predictions
The landscape for safer dialogue agents and responsible AI is evolving rapidly, driven by technological advancements, emerging risks, and shifting business requirements. Based on current trends and expert analyses, we can anticipate several significant developments across different time horizons:
Technology Maturity Curve
Different technologies within the ecosystem are at varying stages of maturity, influencing adoption timelines and investment priorities:
Innovation Trigger
- Generative AI for specialized domains
- Blockchain for supply chain verification
Peak of Inflated Expectations
- Digital twins for business processes
- Quantum-resistant cryptography
Trough of Disillusionment
- Consumer AR/VR applications
- General-purpose blockchain
Slope of Enlightenment
- AI-driven analytics
- Edge computing
Plateau of Productivity
- Cloud infrastructure
- Mobile applications
Technology Evolution Timeline
- Improved generative models
- Specialized AI applications
- AI-human collaboration systems
- Multimodal AI platforms
- General AI capabilities
- AI-driven scientific breakthroughs
Expert Perspectives
Leading experts in the AI technology sector provide diverse perspectives on how the landscape will evolve over the coming years:
"The next frontier is AI systems that can reason across modalities and domains with minimal human guidance."
— AI Researcher
"Organizations that develop effective AI governance frameworks will gain competitive advantage."
— Industry Analyst
"The AI talent gap remains a critical barrier to implementation for most enterprises."
— Chief AI Officer
Areas of Expert Consensus
- Acceleration of Innovation: The pace of technological evolution will continue to increase
- Practical Integration: Focus will shift from proof-of-concept to operational deployment
- Human-Technology Partnership: Most effective implementations will optimize human-machine collaboration
- Regulatory Influence: Regulatory frameworks will increasingly shape technology development
Short-Term Outlook (1-2 Years)
In the immediate future, organizations will focus on implementing and optimizing currently available technologies to address pressing AI technology challenges:
- Improved generative models
- Specialized AI applications
- Enhanced AI ethics frameworks
These developments will be characterized by incremental improvements to existing frameworks rather than revolutionary changes, with emphasis on practical deployment and measurable outcomes.
Mid-Term Outlook (3-5 Years)
As technologies mature and organizations adapt, more substantial transformations will emerge in how AI is approached and implemented:
- AI-human collaboration systems
- Multimodal AI platforms
- Democratized AI development
This period will see significant changes in system architecture and operational models, with increasing automation and integration between previously siloed functions. Organizations will shift from reactive to proactive approaches to managing AI risk.
Long-Term Outlook (5+ Years)
Looking further ahead, more fundamental shifts will reshape how AI is conceptualized and implemented across digital ecosystems:
- General AI capabilities
- AI-driven scientific breakthroughs
- New computing paradigms
These long-term developments will likely require significant technical breakthroughs, new regulatory frameworks, and evolution in how organizations approach AI as a fundamental business function rather than a technical discipline.
Key Risk Factors and Uncertainties
Several critical factors could significantly impact the trajectory of AI technology evolution.
Organizations should monitor these factors closely and develop contingency strategies to mitigate potential negative impacts on technology implementation timelines.
Alternative Future Scenarios
The evolution of technology can follow different paths depending on various factors including regulatory developments, investment trends, technological breakthroughs, and market adoption. We analyze three potential scenarios:
Optimistic Scenario
Responsible AI driving innovation while minimizing societal disruption
Key Drivers: Supportive regulatory environment, significant research breakthroughs, strong market incentives, and rapid user adoption.
Probability: 25-30%
Base Case Scenario
Incremental adoption with mixed societal impacts and ongoing ethical challenges
Key Drivers: Balanced regulatory approach, steady technological progress, and selective implementation based on clear ROI.
Probability: 50-60%
Conservative Scenario
Technical and ethical barriers creating significant implementation challenges
Key Drivers: Restrictive regulations, technical limitations, implementation challenges, and risk-averse organizational cultures.
Probability: 15-20%
Scenario Comparison Matrix
Factor | Optimistic | Base Case | Conservative |
---|---|---|---|
Implementation Timeline | Accelerated | Steady | Delayed |
Market Adoption | Widespread | Selective | Limited |
Technology Evolution | Rapid | Progressive | Incremental |
Regulatory Environment | Supportive | Balanced | Restrictive |
Business Impact | Transformative | Significant | Modest |
Transformational Impact
Redefinition of knowledge work, automation of creative processes. This evolution will necessitate significant changes in organizational structures, talent development, and strategic planning processes.
The convergence of multiple technological trends, including artificial intelligence, quantum computing, and ubiquitous connectivity, will create both unprecedented challenges and innovative new capabilities.
Implementation Challenges
Ethical concerns, computing resource limitations, talent shortages. Organizations will need to develop comprehensive change management strategies to successfully navigate these transitions.
Regulatory uncertainty, particularly around emerging AI technologies, will require flexible governance and system architectures that can adapt to evolving compliance requirements.
Key Innovations to Watch
Multimodal learning, resource-efficient AI, transparent decision systems. Organizations should monitor these developments closely to maintain competitive advantages and effective risk management.
Strategic investments in research partnerships, technology pilots, and talent development will position forward-thinking organizations to leverage these innovations early in their development cycle.
Technical Glossary
Key technical terms and definitions to help understand the technologies discussed in this article.
Understanding the following technical concepts is essential for grasping the full implications of the technologies and risks discussed in this article. These definitions provide context for both technical and non-technical readers.