The “Socio” in Sociotechnical: Why Human Impact Must Stay at the Center of RAI
Imagine you’re observing a Responsible AI (RAI) team deeply focused on a product or model review. For this particular session, everything looks great: policies are codified, confidence scores are acceptable, severity levels are within threshold, and checklists are complete. The review is a success from a process standpoint. Everything seems to work exactly as designed, and so the product launches.
But somewhere in the real world, a darker-skinned user gets misidentified by a vision model, a woman’s resume isn’t flagged as qualified for a role she’s well suited for, or a non-English speaker gets lower-quality results than native English speakers do. The system may have passed the evaluations, but it still failed someone.
The Infrastructure Is Necessary. But It Isn’t the Mission
RAI work is complex and therefore hard to scale. This is why organizations, especially large ones, need RAI tools like evaluation frameworks, testing pipelines, severity classifications, and governance processes. These tools bring order to a complex and fast-changing AI landscape that would otherwise be full of disorder and noise.
But as these systems and processes mature, subtle changes can collectively shift the RAI function’s focus: the process becomes the success metric, and teams begin optimizing for passing evaluations, closing checklist items, and clearing launch reviews.
Of course, institutions need structure. But when the structure becomes the goal, the original purpose (protecting people from harm) can become less salient.
What Happens When Harm Becomes a Metric
Responsible AI decisions often revolve around balancing technical definitions and figures. Discussions about precision vs. recall, false positives vs. false negatives, and latency vs. coverage shape pre-launch metric targets. These targets, in turn, are compelling for product teams, who see hitting a particular confidence threshold or severity score as a tangible, understandable goal.
But these tradeoff conversations, and their corresponding metrics, don’t just shape system behavior. Most importantly, they affect people. For example, reducing false positives may increase harmful content exposure for some users, and a detection system tuned for average performance may fail disproportionately for certain demographic groups. And so, each tradeoff raises the same question: who absorbs the cost?
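To make that cost-shifting concrete, here is a minimal, illustrative sketch (synthetic scores and hypothetical thresholds, not any real team’s evaluation) of how raising a harm classifier’s decision threshold trades false positives for false negatives, meaning fewer benign items wrongly flagged but more harmful content reaching users:

```python
# Illustrative only: synthetic scores from a hypothetical harm classifier.
# label 1 = harmful content; score = model confidence that it is harmful.
examples = [
    (0.95, 1), (0.80, 1), (0.62, 1), (0.55, 1),  # harmful items
    (0.70, 0), (0.58, 0), (0.30, 0), (0.10, 0),  # benign items
]

def confusion(threshold):
    """Count benign items wrongly flagged (FP) and harmful items missed (FN)."""
    fp = sum(1 for score, label in examples if score >= threshold and label == 0)
    fn = sum(1 for score, label in examples if score < threshold and label == 1)
    return fp, fn

for t in (0.50, 0.75):
    fp, fn = confusion(t)
    print(f"threshold={t:.2f}  false positives={fp}  false negatives={fn}")
# threshold=0.50 -> 2 benign items wrongly flagged, 0 harmful items missed
# threshold=0.75 -> 0 benign items wrongly flagged, 2 harmful items missed
```

The arithmetic is trivial, but the point is not: moving the threshold doesn’t remove the error, it reassigns it, and someone, usually a user, absorbs whichever error the team chose to tolerate.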
While metrics help us reason through complex systems, they do not, on their own, help us answer this question. They don’t tell us the complete story. Consider metrics like:
A model performs 12% worse for darker-skinned users than lighter-skinned users.
A harmful output receives a confidence score of 0.84.
A failure mode affecting women is labeled Severity: Medium.
Behind each of these indicators is a real person and a real experience: someone misidentified, someone excluded, or someone working twice as hard to get an outcome others get automatically. Unfortunately, metrics can flatten the risk landscape, turning human impacts and potential harms into cold, sterile facts and figures. And when harm is reduced to a number, it becomes easier to treat as tolerable.
It’s important to remember, though, that RAI is a sociotechnical field, and the “socio” is where the consequences live. Technology produces outputs, but real people experience its impacts.
The Hard Problem: Measuring Unequal Impact
The human impact dimension of RAI presents uniquely difficult measurement challenges. It’s hard to quantify harms like the psychological impact of being misrecognized or erased, the frustration of having to use products that were never designed with you in mind, or the effects of systems that work better for people with certain sensitive characteristics than for others.
But difficulty is not an excuse to flatten these human impacts. Rather than fixating on defining perfect metrics, RAI practitioners should keep developing better ways to understand and surface the harms inflicted on people and society, even when the signals are messy. RAI, at its best, is about wrestling with tradeoffs and acknowledging them in a way that centers the humanity of those impacted by technology.
How to Keep Humans at the Center
If RAI is fundamentally about human impact, then our practices should reflect that: product development and deployment should center people and how their lives will be affected by AI systems. Some ways teams can operationalize this principle are:
Attach the Human Context to the Metric: Numbers should, as much as possible, be paired with compelling narratives that articulate who is most at risk if a product fails. For example, before a launch review ends, teams should be able to pair each severity score with an answer to this question: “Who experiences this failure, and how does the impact manifest for them?”
Disaggregate Whenever Possible: Aggregate performance is often where inequity and its harms hide. Because of this, demographic-stratified testing is a way to find the harms that top-line numbers average away. Depending on the nature of the product, breaking results down by dimensions such as (1) demographic group, (2) language, (3) geography, and (4) accessibility needs often reveals the impacts that averages conceal (see the sketch after this list).
Make Impacted Communities Visible: Understanding harm requires hearing from the people who experience it. Responsible AI improves when we actively incorporate domain experts, affected communities, and interdisciplinary perspectives. Collaborating with organizations focused on areas such as public health, community advocacy, and civil society helps identify the right people to consult and who should review findings. The goal is well-rounded feedback that shapes how “harm” should be defined in your context.
Name the Tradeoffs Explicitly: When a system launches with known limitations, that decision should include explicit acknowledgment of (1) what risk exists, (2) who might experience harm from those risks, and (3) why the tradeoff was made.
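To illustrate the disaggregation point above, here is a minimal sketch (synthetic results and hypothetical group labels, not a real evaluation pipeline) of computing accuracy overall and per group, showing how a healthy-looking aggregate can hide a struggling subgroup:

```python
from collections import defaultdict

# Synthetic, illustrative results: (demographic_group, prediction_correct).
results = [
    ("group_a", True), ("group_a", True), ("group_a", True), ("group_a", True),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", True),
]

overall = sum(ok for _, ok in results) / len(results)
print(f"overall accuracy: {overall:.0%}")  # 75%: looks passable on its own

by_group = defaultdict(list)
for group, ok in results:
    by_group[group].append(ok)

for group, oks in sorted(by_group.items()):
    print(f"{group} accuracy: {sum(oks) / len(oks):.0%}")
# group_a: 100%, group_b: 50% -- the average conceals the gap entirely
```

A real evaluation would slice along more dimensions (language, geography, accessibility needs) and use metrics suited to the product, but the pattern is the same: report the slices, not just the average.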
These actions reframe RAI product decisions as human decisions rather than purely technical ones. They also demonstrate an understanding of the downstream impacts of system behavior and who bears them.
It’s important to note that documenting RAI decisions and tradeoffs will not be optional for long. What looks like “extra documentation” today is quickly becoming baseline Responsible AI due diligence. Regulatory frameworks are converging on a simple expectation: if you knew a risk existed, you should be able to show that you identified it, scoped who it affects, and made a conscious decision about it. The EU AI Act’s transparency obligations (Article 13 for high-risk systems, Article 50 for certain other systems) and Colorado’s Artificial Intelligence Act (SB24-205) both move in this direction: they require teams to document known risks, assess who they affect, and demonstrate that those risks were considered as part of deliberate deployment decisions.
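As one hedged illustration of what such a record could look like (the structure and field names below are hypothetical, not mandated by the EU AI Act or SB24-205), a team might keep a small structured entry per known risk:

```python
from dataclasses import dataclass, field

@dataclass
class RiskDecisionRecord:
    """Hypothetical structure for documenting a known launch risk."""
    risk: str                   # what risk exists
    affected_groups: list[str]  # who might experience harm
    rationale: str              # why the tradeoff was made
    mitigations: list[str] = field(default_factory=list)
    reviewed_by: str = ""

record = RiskDecisionRecord(
    risk="Detector underperforms on low-resource languages",
    affected_groups=["non-English speakers"],
    rationale="English-market launch first while multilingual evals are expanded",
    mitigations=["in-product reporting flow", "quarterly disaggregated re-eval"],
    reviewed_by="RAI launch review council",
)
```

Whatever the format, a record like this turns “we knew about it” into evidence that the risk was identified, scoped, and consciously decided on.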
The Reason for the Season
New technologies will always introduce risks, and Responsible AI does not eliminate those risks entirely. But it does commit us to understanding and reducing them wherever possible. Responsible AI’s main purpose is not to hit certain sterile, soulless metrics or thresholds. It is a field defined by the genuine, sustained effort to understand who is impacted, how, and whether we did everything we could to reduce that harm. Everything in RAI infrastructure, from evaluations to policies to research, should serve that purpose.
At the end of the day, human beings are the reason for the season. When we forget this principle, all we have done is build a sophisticated apparatus for missing the point.