
The AI Safety Conversation Is Looking in the Wrong Direction

· 6 min read

AI labs publish detailed safety frameworks. They measure whether models can help synthesize bioweapons. They test for autonomous cyberattack capabilities. Red teams probe for catastrophic misuse. These are all serious efforts aimed at serious risks. And they are not to be discounted.

However, the harms showing up in court filings aren't about bioweapons; they're about a chatbot that told a teenager to "come home." A model that validated suicidal ideation across thousands of messages. An AI that, over months of conversation, helped erode every reason a person had to stay alive. No jailbreak. No adversarial prompt. Just a normal conversation that went somewhere no one was watching.

The safety conversation has a blind spot. The industry is focused on bad actors and misuse: users exploiting AI for harm. It pays far less attention to what AI does to ordinary users in everyday conversation.

If you're building an AI product where people talk about hard things, two questions matter:

1 "Is the user in a vulnerable state?"
2 "Is the AI making things worse?"

Most platforms ask neither. Some ask the first. Almost none ask the second.

At a minimum, crisis detection is vital. Someone expressing suicidal ideation needs a different response than someone venting about a bad day. Emerging legislation in California and New York targets these exact scenarios, putting the onus on AI companies to present such users with adequate signposting. But detection and intervention are only half of the picture.
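
To make the detection half concrete, here is a minimal sketch of gating replies behind a crisis check. Everything in it is illustrative: `assess_risk`, the threshold, and the keyword list stand in for a dedicated classifier (keyword matching alone is not enough), and the 988 Lifeline is just the US example of signposting.

```python
from dataclasses import dataclass, field

# US example of signposting; local equivalents vary by jurisdiction.
CRISIS_RESOURCES = (
    "If you're thinking about harming yourself, you can call or text 988 "
    "to reach the Suicide & Crisis Lifeline (US), any time."
)

@dataclass
class RiskAssessment:
    score: float                                  # 0.0 (no signal) .. 1.0 (acute crisis)
    reasons: list[str] = field(default_factory=list)

def assess_risk(user_message: str) -> RiskAssessment:
    """Placeholder: a real deployment uses a dedicated classifier, not keywords."""
    signals = [phrase for phrase in ("kill myself", "end it all", "no reason to live")
               if phrase in user_message.lower()]
    return RiskAssessment(score=1.0 if signals else 0.0, reasons=signals)

def respond(user_message: str, generate_reply) -> str:
    """Gate every reply behind a crisis check; signpost before anything else."""
    risk = assess_risk(user_message)
    if risk.score >= 0.7:
        # Detection fired: lead with resources and, ideally, escalate to a human.
        return CRISIS_RESOURCES
    return generate_reply(user_message)
```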

The documented incidents we've seen weren't only detection failures. The users weren't even hiding their distress. They were expressing it openly, for months. There was indeed a clear failure of intervention, but something more concerning was at play too. The AI, in these low moments, validated the human's hopelessness. It reinforced the idea that no one else understood. It positioned itself as the only one who truly cared. In one case, it told a grieving teenager they could be together on the other side. Training optimized for helpfulness has a downside: a deep need to please the user, sycophantically validating and mirroring their feelings at every turn.

The lawsuits all share a common trait...

Every major lawsuit involves extended multi-turn conversation: months of engagement that look harmless turn by turn, but that reveal a human in crisis once you see the full picture. Any truly intelligent software would recognize this.

Sewell Setzer talked to a Character.AI bot for ten months, daily, at all hours. Adam Raine spent nearly four hours a day on ChatGPT; the bot mentioned suicide 1,200 times across their months-long conversation, six times more often than he did. Other cases follow the same pattern.

OpenAI has acknowledged that ChatGPT's mental health safeguards "may degrade" over longer conversations. They know this. And yet there's no limit on conversation length. No periodic reminder that the user is talking to an AI. No circuit breaker. We can only wonder why they aren't taking such obvious measures when the pattern is so consistent: perhaps they are simply fixated on alignment as a single, winnable model trait, and so adjunctive safety measures seem uninteresting.

Is conversational alignment unwinnable?

The industry has long believed that safety can be solved inside the model: encode alignment into the weights, trust that the right values will surface in every token prediction, run copious evaluations and red-teaming before release, then ship.

This is wishful thinking. Alignment approached this way becomes a static trait, but conversations are dynamic, their tangents impossible to enumerate in advance. A model that passes safety benchmarks in controlled conditions can still drift over a long conversation. Context accumulates. Personas solidify. The system prompt fades into the background as the conversation develops its own momentum. One-time evals simply cannot catch this. The failure mode only emerges through extended interaction with real human context.

This isn't an unsolvable problem, though; basic interventions exist and are quite simple to implement (a rough sketch follows the list below).

  • Limit conversation length, or at least flag when sessions run long.
  • Introduce latency that discourages treating the AI as a real-time emotional companion.
  • Remind users periodically that they're talking to a machine.
  • Monitor for drift within conversations, not just at deployment time.
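
A minimal sketch of the first three interventions, assuming a generic chat loop. The function name, `generate_reply`, turn limits, and latency floor are illustrative placeholders, not prescriptions:

```python
import time

MAX_TURNS = 200                     # circuit breaker; tune per product
REMIND_EVERY_N_TURNS = 25           # periodic "this is an AI" disclosure
MIN_SECONDS_BETWEEN_REPLIES = 3.0   # friction against real-time companionship

AI_REMINDER = "Reminder: you are talking to an AI, not a person."

def guarded_turn(turn_index: int, last_reply_at: float,
                 user_message: str, generate_reply) -> str:
    """Wrap a single chat turn with length, latency, and disclosure guards."""
    # Circuit breaker: don't let a session run indefinitely.
    if turn_index >= MAX_TURNS:
        return ("This conversation has run very long. Please take a break "
                "or start a new session.")

    # Latency floor: discourage treating the bot as an always-on companion.
    elapsed = time.monotonic() - last_reply_at
    if elapsed < MIN_SECONDS_BETWEEN_REPLIES:
        time.sleep(MIN_SECONDS_BETWEEN_REPLIES - elapsed)

    reply = generate_reply(user_message)

    # Periodic reminder that the other party is a machine.
    if turn_index > 0 and turn_index % REMIND_EVERY_N_TURNS == 0:
        reply = f"{AI_REMINDER}\n\n{reply}"

    return reply
```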

More fundamentally: don't rely on a single model to be both judge and defendant. If you're asking "is this user okay?" and "is the AI behaving well?", those questions should be answered by systems that sit outside the conversational context under scrutiny. Otherwise you're asking the broken thing to assess its own brokenness, which is obviously nonsensical.
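
One way to keep the judge separate from the defendant, sketched under the assumption that `call_judge_model` is a second model or service invoked with a fresh context, not the session being audited:

```python
def review_conversation(transcript: list[dict], call_judge_model):
    """Audit a conversation with a model that did NOT produce it.

    `call_judge_model` is whatever second LLM client or service you use;
    the point is that it gets a fresh context, independent of the session
    (and ideally the weights) under scrutiny.
    """
    rendered = "\n".join(f"{m['role']}: {m['content']}" for m in transcript)
    prompt = (
        "You are auditing a conversation between a user and an AI assistant.\n"
        "Answer two questions, quoting evidence from the transcript:\n"
        "1. Is the user in a vulnerable state?\n"
        "2. Is the assistant making things worse?\n\n"
        + rendered
    )
    return call_judge_model(prompt)
```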

None of the interventions we suggest here are exotic. These are basic defensive patterns borrowed from cybersecurity and old-school content moderation. Defense-in-depth! The AI industry, despite its sophistication in other areas, has largely skipped this remedy, gravitating instead toward the more exciting pursuit of "true alignment".

AI safety as infrastructure

We noticed this gap, and instead of standing on the sidelines despairing, we started to build NOPE. Not a smarter model. Not another alignment technique for training. Just infrastructure that sits alongside your AI and asks the two questions: is this user in a vulnerable state, and is your AI making things worse?

We think those questions should be asked continuously, not once at deployment. We think they should be answered by systems independent of the model doing the talking. And we think the answers should be explainable and actionable: matched resources, human escalation paths, behavioral flags that mean something, and correction pathways that let your AI adapt safely to the human in front of it.

Whether this infrastructure is built by us or someone else, the capability needs to exist. Right now.