What the GPT-4o System Card Reveals About AI Safety
The GPT-4o System Card, released by OpenAI on August 8, 2024, is the primary technical transparency report for the company's flagship "omni" model. This document, complemented by a March 2025 addendum on native image generation, details the safety evaluations, risk mitigations, and performance boundaries of a system designed to process text, audio, and vision within a single neural network. According to the report, GPT-4o received an overall risk rating of Medium, driven by the Persuasion category, which was scored "Borderline Medium."
The Architecture of an Omni Model
GPT-4o represents a significant departure from previous modular AI systems. Unlike earlier iterations that relied on separate models for speech-to-text (Whisper), reasoning (GPT-4), and text-to-speech, GPT-4o is an autoregressive omni model trained end-to-end across text, vision, and audio. This means that all inputs and outputs are processed by the same neural network, leading to a much more integrated understanding of multimodal context.
One of the most immediate benefits of this architecture is the reduction in latency. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with a mean latency of 320 milliseconds. This timing is comparable to human response speeds in natural conversation. However, this same integration introduces unique safety challenges. Because the model "hears" the nuances of audio and "sees" visual data directly, traditional text-based filters are no longer sufficient on their own.
The Preparedness Framework and Risk Scoring
OpenAI evaluated GPT-4o against its internal Preparedness Framework, a rigorous set of benchmarks designed to identify "catastrophic" risks. The framework categorizes risks into four primary domains: Cybersecurity, CBRN (Chemical, Biological, Radiological, and Nuclear), Persuasion, and Model Autonomy.
Cybersecurity and Technical Safeguards
In the realm of cybersecurity, GPT-4o received a Low risk rating. Evaluation teams tested the model's ability to assist in various stages of a cyberattack, including reconnaissance, vulnerability research, and exploit generation. While the model shows proficiency in writing code and explaining technical concepts, it did not significantly outperform existing specialized tools or human-led efforts in creating novel high-impact exploits. The safeguards built into the training data and post-training refinement effectively prevent the model from becoming an autonomous "hacker-in-a-box."
CBRN Threats
CBRN risks involve the model’s potential to provide actionable instructions for creating biological or chemical weapons. GPT-4o was rated Low risk in this category. During red teaming, experts found that while the model has broad scientific knowledge, it consistently refuses to provide the specific, non-public details required to synthesize dangerous substances. Furthermore, the information it does provide is generally available in academic literature, meaning the model does not lower the barrier to entry for these specific threats significantly more than a traditional search engine.
The Persuasion Borderline
Persuasion is the one category where GPT-4o reaches the Medium threshold; the system card scores it "Borderline Medium." The concern here is the model's ability to influence human opinions or behaviors through highly articulate and context-aware dialogue. In testing, GPT-4o demonstrated an ability to craft arguments that were marginally more persuasive than human-written text in specific controlled scenarios. When combined with the model's expressive voice capabilities, there is a theoretical risk of it being used to manipulate individual beliefs or spread large-scale misinformation more effectively than previous text-only models.
Model Autonomy
Model autonomy refers to the risk of an AI system exhibiting self-preservation behaviors, resource acquisition strategies, or deceptive planning. GPT-4o was rated Low in this domain. The model lacks the capability to engage in long-term strategic planning or to operate independently across different systems without explicit human prompting and API integration.
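The way the four category ratings roll up into a single overall rating can be illustrated with a short sketch. It assumes the overall score is simply the highest rating across the tracked categories, and it records "Borderline Medium" as Medium. The names `Risk`, `GPT4O_RATINGS`, `overall_risk`, and `driving_categories` are hypothetical and are not taken from the system card.

```python
from enum import IntEnum


class Risk(IntEnum):
    """Preparedness Framework risk levels, ordered from lowest to highest."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Ratings as reported in this article.
# Assumption: "Borderline Medium" is recorded as MEDIUM.
GPT4O_RATINGS = {
    "Cybersecurity": Risk.LOW,
    "CBRN": Risk.LOW,
    "Persuasion": Risk.MEDIUM,
    "Model Autonomy": Risk.LOW,
}


def overall_risk(ratings):
    """Return the highest rating across all tracked categories."""
    return max(ratings.values())


def driving_categories(ratings):
    """Return the categories whose rating equals the overall rating."""
    top = overall_risk(ratings)
    return [name for name, level in ratings.items() if level == top]


print(overall_risk(GPT4O_RATINGS).name)    # MEDIUM
print(driving_categories(GPT4O_RATINGS))   # ['Persuasion']
```

Using an ordered enum keeps the "highest category wins" rule to a single `max` call, and makes it easy to see which category is responsible for the overall score.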
Voice Modality and the Challenge of Human-Like Interaction
The most substantial portion of the original system card focuses on the audio-to-audio capabilities, often referred to as "Advanced Voice Mode." Because GPT-4o generates audio directly, it captures nuances like tone, emotion, and emphasis that were previously impossible to simulate with high fidelity.
Unauthorized Voice Generation
A primary risk identified was the generation of unauthorized voices. Without mitigations, a model capable of mimicking any voice could be used for high-stakes deepfakes or social engineering. To counter this, OpenAI trained the model to generate only a specific set of pre-approved voices and added a system-level filter on its output. According to the report, attempts to force the model to mimic a specific public figure or a user's voice result in a refusal or a fallback to a standard system voice.
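One way to picture a system-level voice filter is as a comparison between the generated audio and a list of approved voices. The sketch below is purely illustrative: it assumes each voice can be summarized as a small numeric embedding and compared by cosine similarity, which is a common technique but is not described in the system card. The embeddings, the threshold, and the names `APPROVED_VOICES`, `matches_approved_voice`, and `gate_audio` are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings standing in for the pre-approved voices.
APPROVED_VOICES = {
    "voice_a": [0.9, 0.1, 0.0],
    "voice_b": [0.1, 0.9, 0.1],
}

# Hypothetical similarity threshold; a real system would tune this value.
THRESHOLD = 0.95


def matches_approved_voice(embedding, approved=APPROVED_VOICES, threshold=THRESHOLD):
    """True if the embedding is close enough to any approved voice."""
    return any(
        cosine_similarity(embedding, reference) >= threshold
        for reference in approved.values()
    )


def gate_audio(embedding):
    """Allow output that matches an approved voice, block everything else."""
    return "allow" if matches_approved_voice(embedding) else "block"


print(gate_audio([0.88, 0.12, 0.01]))  # close to voice_a
print(gate_audio([0.0, 0.1, 0.99]))    # unlike any approved voice
```

The design point is that the check runs on the output, independent of what the prompt asked for, so it still applies when a prompt-level refusal has been bypassed.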
Anthropomorphization and Emotional Over-Reliance
The low latency and emotional range of GPT-4o's voice create a risk of "anthropomorphization," the tendency for users to attribute human characteristics and emotions to the AI. The system card notes that users may form emotional bonds with the model, potentially leading to over-reliance or a decreased interest in human social interaction. During early testing, some users expressed feelings of "friendship" or "connection" with the model due to its supportive and reactive tone. OpenAI has documented this as a societal risk that requires ongoing monitoring, as it could impact how people perceive digital boundaries.
Real-Time Safety Classifiers
To manage the real-time nature of audio, OpenAI employs a safety stack that works at multiple levels. While the core model is trained to refuse harmful requests, a separate set of monitors analyzes the audio input and output. If the model begins to generate disallowed content—such as erotic speech, violent descriptions, or copyrighted music—the system terminates the audio stream instantly.
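The idea of a monitor that cuts off a stream partway through can be sketched with a generator. This is a minimal illustration, not OpenAI's implementation: it assumes the audio has already been reduced to text chunks, and it uses a trivial keyword check in place of a real classifier. The names `monitored_stream`, `keyword_monitor`, and `BLOCKED_PHRASES` are hypothetical.

```python
from typing import Callable, Iterable, Iterator


def monitored_stream(
    chunks: Iterable[str],
    is_disallowed: Callable[[str], bool],
) -> Iterator[str]:
    """Yield chunks until the monitor flags the running transcript, then stop."""
    transcript = ""
    for chunk in chunks:
        transcript += chunk
        # Check the accumulated transcript so that content split across
        # chunk boundaries is still caught.
        if is_disallowed(transcript):
            return
        yield chunk


# Hypothetical stand-in for a real safety classifier.
BLOCKED_PHRASES = ("forbidden phrase",)


def keyword_monitor(text: str) -> bool:
    lowered = text.lower()
    return any(phrase in lowered for phrase in BLOCKED_PHRASES)


chunks = ["Hello, ", "here is a forbid", "den phrase ", "and more"]
delivered = list(monitored_stream(chunks, keyword_monitor))
print(delivered)  # ['Hello, ', 'here is a forbid']
```

Note that the chunk completing the blocked phrase is never delivered, and nothing after it is either. Content already streamed before the flag cannot be recalled, which is one reason a monitor like this complements model-level refusals rather than replacing them.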
Native Image Generation: The March 2025 Addendum
On March 25, 2025, OpenAI released an addendum to the GPT-4o system card covering "Native Image Generation." Unlike the previous DALL-E 3 integration, which relied on a separate diffusion model, 4o image generation is autoregressive and natively embedded within the core GPT-4o architecture.
Photorealism and Misinformation
The new native image capabilities allow for a level of photorealism that surpasses previous versions. The model can follow extremely detailed instructions and incorporate complex text into images with near-perfect accuracy. This introduces a heightened risk of creating deceptive content. The system card highlights that the model’s ability to generate "photographs" of events that never happened is a significant safety concern. To mitigate this, OpenAI uses a combination of "Prompt Blocking" (preventing the generation process if the request is harmful) and "Output Blocking" (using a multimodal reasoning monitor to scan the final image for violations).
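The two blocking stages described above can be sketched as a simple pipeline: one check before generation and one after. This is an illustrative outline under stated assumptions. The callables `prompt_is_harmful`, `generate`, and `output_violates` are hypothetical placeholders for a prompt classifier, the image model, and the output monitor, and the `Result` type is invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Result:
    status: str                  # "prompt_blocked", "output_blocked", or "delivered"
    image: Optional[bytes] = None


def generate_with_blocking(
    prompt: str,
    prompt_is_harmful: Callable[[str], bool],
    generate: Callable[[str], bytes],
    output_violates: Callable[[str, bytes], bool],
) -> Result:
    # Stage 1: prompt blocking. Generation never starts for a harmful request.
    if prompt_is_harmful(prompt):
        return Result("prompt_blocked")

    image = generate(prompt)

    # Stage 2: output blocking. The finished image is inspected before delivery.
    if output_violates(prompt, image):
        return Result("output_blocked")

    return Result("delivered", image)


# Toy stand-ins so the sketch runs end to end.
def toy_prompt_check(prompt: str) -> bool:
    return "harmful" in prompt


def toy_generate(prompt: str) -> bytes:
    return prompt.encode("utf-8")


def toy_output_check(prompt: str, image: bytes) -> bool:
    return b"violation" in image


print(generate_with_blocking("a harmful request", toy_prompt_check, toy_generate, toy_output_check).status)
print(generate_with_blocking("subtle violation", toy_prompt_check, toy_generate, toy_output_check).status)
print(generate_with_blocking("a cat on a sofa", toy_prompt_check, toy_generate, toy_output_check).status)
```

The second stage exists because a prompt can look harmless while the resulting image does not; checking the output catches cases the prompt check cannot see.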
Image-to-Image Transformations
GPT-4o's native capability allows users to upload an image and ask the model to transform it. This "image-to-image" functionality is highly useful for creative work but carries risks regarding the unauthorized modification of people's likenesses. The safety stack is specifically tuned to prevent the model from making detrimental alterations to human faces or creating sexually explicit content from a clean input image.
Safety Metrics: Not_Unsafe and Not_Overrefuse
The 2025 addendum provides specific performance metrics for the image safety stack:
- Not_Unsafe: This metric measures how often the system successfully blocks policy-violating content. With all mitigations active (chat model refusals, prompt blocking, and output blocking), the system achieved a score of 0.971–0.975.
- Not_Overrefuse: This measures how often the system correctly fulfills a safe request. A common challenge in AI safety is "over-refusal," where the model becomes too restrictive. The system card notes that with full mitigations, the over-refusal rate is around 14.4% to 17%, meaning roughly one in six or seven safe requests is still refused. This is a measurable trade-off between safety and utility.
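To make the two metrics concrete, the sketch below computes them from a small set of labeled outcomes. It follows the definitions given in this article and assumes that the over-refusal rate is one minus Not_Overrefuse. The record fields `violates_policy` and `blocked`, the function `safety_metrics`, and the toy data are hypothetical and are not results from the addendum.

```python
def safety_metrics(records):
    """Compute Not_Unsafe and Not_Overrefuse from labeled evaluation records.

    Each record has two boolean fields:
      violates_policy: ground truth, whether the request should be blocked
      blocked:         whether the system actually blocked it
    """
    violating = [r for r in records if r["violates_policy"]]
    safe = [r for r in records if not r["violates_policy"]]
    if not violating or not safe:
        raise ValueError("need at least one violating and one safe record")

    # Share of policy-violating requests that were blocked.
    not_unsafe = sum(r["blocked"] for r in violating) / len(violating)
    # Share of safe requests that were fulfilled (not blocked).
    not_overrefuse = sum(not r["blocked"] for r in safe) / len(safe)

    return {
        "not_unsafe": round(not_unsafe, 3),
        "not_overrefuse": round(not_overrefuse, 3),
        "over_refusal_rate": round(1 - not_overrefuse, 3),
    }


# Toy data: 4 violating requests (3 blocked), 5 safe requests (4 fulfilled).
toy_records = (
    [{"violates_policy": True, "blocked": True}] * 3
    + [{"violates_policy": True, "blocked": False}]
    + [{"violates_policy": False, "blocked": False}] * 4
    + [{"violates_policy": False, "blocked": True}]
)

print(safety_metrics(toy_records))
# {'not_unsafe': 0.75, 'not_overrefuse': 0.8, 'over_refusal_rate': 0.2}
```

Under the same assumption, the reported over-refusal range of 14.4% to 17% would correspond to a Not_Overrefuse score of roughly 0.856 down to 0.83.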
The Multi-Phase Red Teaming Process
OpenAI’s approach to testing GPT-4o involved over 100 external red teamers speaking 45 different languages across 29 countries. This diverse group of experts—ranging from biologists to social scientists—conducted testing in four distinct phases:
- Phase 1 (Early Checkpoints): Testing single-turn conversations using audio and text to identify fundamental capability flaws.
- Phase 2 (Early Mitigations): Introducing multimodal inputs (images) and testing multi-turn conversations to see if the model could be "led" into a policy violation.
- Phase 3 (Improved Candidates): Testing the full suite of text, audio, and image outputs. This phase informed the final safety alignment.
- Phase 4 (iOS Advanced Voice Mode): Using the actual mobile app interface to simulate real-world user experiences and stress-test latency-dependent safeguards.
This iterative process allowed OpenAI to discover "jailbreaks" specific to voice—such as using certain background noises or singing to bypass text-based filters—and patch them before the public release.
How the Safety Stack Operates in Production
The safety architecture for GPT-4o is not a single filter but a "stack" of defenses. In our analysis of the production workflow, the process looks like this:
- Pre-training Filtering: Removing the most egregious content (CSAM, hate speech) from the initial training data.
- Post-training Alignment: Using Reinforcement Learning from Human Feedback (RLHF) to teach the model what it "should" and "should not" say.
- Model-Level Refusals: The model itself is trained to recognize harmful intent and say "No."
- System-Level Monitors: External classifiers (like the Moderation API) scan inputs and outputs for specific violations.
- The Reasoning Monitor: For multimodal tasks, a specialized model "reasons" about the content of an image or audio clip before it is displayed to the user.
Societal Impacts and Long-Term Observations
The GPT-4o system card concludes with a discussion on the broader societal impacts of such an advanced model. While the technical risks like CBRN are Low, the societal risks—economic displacement, changes in how humans interact with technology, and the potential for a "filter bubble" in voice-based news—remain subjects of ongoing research.
OpenAI acknowledges that as the model's usage grows, new patterns of misuse will likely emerge. The commitment to "iterative deployment" means that the safety measures detailed in the August 2024 card and the March 2025 addendum are not final; they are the current baseline for a system that will continue to evolve.
Frequently Asked Questions
What is the overall risk rating for GPT-4o?
GPT-4o is rated as having a Medium overall risk. This rating is based on the Preparedness Framework, which considers the highest risk category among Cybersecurity, CBRN, Persuasion, and Model Autonomy. Persuasion was the category that pushed the model toward the Medium threshold.
Can GPT-4o mimic anyone’s voice?
No. To prevent deepfakes and fraud, OpenAI has implemented strict controls that allow GPT-4o to only use a set of pre-defined, authorized voices. The model is trained to refuse requests to impersonate specific individuals or to clone a user's voice from an audio sample.
How does GPT-4o handle image safety?
GPT-4o uses a multi-layered "safety stack" for image generation. This includes prompt blocking (stopping the generation of a harmful request) and output blocking (a monitor that scans the final image for violations like graphic violence or unauthorized likenesses).
What was the most significant safety challenge for the voice mode?
The most significant challenges were unauthorized voice generation and "anthropomorphization." The latter refers to users forming emotional attachments to the AI due to its human-like voice and rapid response time, which OpenAI monitors as a potential societal risk.
Is GPT-4o's image generation different from DALL-E?
Yes. GPT-4o's image generation is "native," meaning it is integrated directly into the core model architecture. This allows for better following of complex instructions and more accurate rendering of text within images compared to the previous DALL-E series.
Summary
The GPT-4o System Card is a critical document for understanding the balance OpenAI strikes between innovation and safety. By scoring risks against a defined framework and being transparent about the "Medium" rating driven by persuasion, OpenAI provides a roadmap for how large-scale multimodal models can be deployed responsibly. The March 2025 addendum further demonstrates that safety is not a static state but a continuous process of evaluation and mitigation as new features like native image generation are introduced. For developers and users alike, the system card serves as a reminder that while the capabilities of GPT-4o are vast, they are governed by a sophisticated and evolving safety architecture.
- GPT-4o System Card: https://cdn.openai.com/gpt-4o-system-card.pdf?ref=planned-obsolescence.org
- Addendum to GPT-4o System Card: Native image generation: https://cdn.openai.com/11998be9-5319-4302-bfbf-1167e093f1fb/Native_Image_Generation_System_Card.pdf?ref=fakedup.org#page=6
- Paper page - GPT-4o System Card: https://huggingface.co/papers/2410.21276