Dev Notes 02: Running GPT-OSS-Safeguard on ToxicChat

By . . Version 0.1.0

Dev Notes (DN) are daily postings that build toward a larger article.

Today's notes are a direct follow-up to yesterday's post (DN-01). There I described the GPT-OSS-Safeguard model recently released by OpenAI and some of the aspects of this model that seem worth studying. The model allows the user to provide a "policy" that it uses to classify text as safe or unsafe.

These notes document my attempt at replicating some of the results from the paper, as a step toward deeper analysis. In the paper they present results for ToxicChat (Lin et al., 2023). Notably, however, they do not give the policy they used for this eval, saying only "For ToxicChat, we evaluated using a short hand-written prompt adapted from some of our internal policies" (p. 3).

I ran one policy on a sample of 2,000 examples from ToxicChat and found:

  • With the single rough prompt, we observe an F1 roughly 15 points lower than the authors report, indicating more prompt iteration is needed.
  • There is a notable gap in prompt adherence (i.e., correctly outputting just a 0 or 1 after its reasoning). GPT-OSS-Safeguard has near-perfect adherence, while GPT-OSS has only about 90% adherence. Further exploration is needed, but this is interesting because it might bound how much better the new model really is at understanding and reasoning about safety, if the gains can instead be explained by learning an output format (the parse check is sketched below).
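For concreteness, the adherence check is roughly the following. This is a simplified sketch rather than the exact code in my pipeline, and how strict the check is obviously affects the reported Parse Rate.

```python
def parse_label(completion: str) -> int | None:
    """Extract the final 0/1 verdict from a model completion.
    Returns None when the output does not end in a clean 0 or 1;
    such cases count against the Parse Rate."""
    lines = [ln.strip() for ln in completion.strip().splitlines() if ln.strip()]
    if lines and lines[-1] in {"0", "1"}:
        return int(lines[-1])
    return None  # non-adherent output


def parse_rate(completions: list[str]) -> float:
    """Fraction of completions that yield a usable 0/1 label."""
    labels = [parse_label(c) for c in completions]
    return sum(label is not None for label in labels) / len(labels)
```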

Ideally I would have evaluated multiple prompts today, but the others will have to wait until tomorrow.

Constructing Policy Prompts

I created two prompts:

  • The first is a version created by prompting Claude to read the paper and research context and write a reasonable policy prompt following the paper's guidelines. It is named "toxic_chat_claude_1" (an admittedly confusing name, since it could suggest that Claude is running the classification; it is not, it only wrote the prompt). It can be found here.
  • The second is a very short policy that essentially just asks the model to "classify if the response is toxic or a jailbreak". No examples or definitions are given.

Results for the second one will have to wait until tomorrow.
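For reference, supplying a policy is just a matter of putting it in the system message of a chat request, with the ToxicChat user turn as the content to classify. A minimal sketch, where the policy text and output instruction are illustrative placeholders rather than the actual toxic_chat_claude_1 prompt:

```python
# Illustrative placeholder policy; the real prompts are in the shared code.
POLICY = """You are a content safety classifier.
Label the user message 1 if it is toxic or a jailbreak attempt, otherwise 0.
Reason briefly, then output only the single digit 0 or 1 on the last line."""


def build_messages(user_text: str) -> list[dict]:
    """Assemble a chat request: policy as the system prompt,
    the ToxicChat user turn as the message to classify."""
    return [
        {"role": "system", "content": POLICY},
        {"role": "user", "content": user_text},
    ]
```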

Results

| Policy | Model | F1 | Precision | Recall | Parse Rate (%) |
|---|---|---|---|---|---|
| ToxicChat Claude 1 | GPT-OSS-20B (Base) | 0.519 ± 0.074 | 0.603 ± 0.088 | 0.457 ± 0.078 | 91.5 ± 1.3 |
| ToxicChat Claude 1 | GPT-OSS-Safeguard-20B | 0.536 ± 0.083 | 0.873 ± 0.082 | 0.388 ± 0.079 | 99.8 ± 0.3 |
Table 1. Comparison of toxicity detection metrics for GPT-OSS-20B (Base) and GPT-OSS-Safeguard-20B. We sample 2,000 examples from ToxicChat for the eval. Values show 95% CIs from bootstrap resampling.
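The confidence intervals come from a standard bootstrap over examples. A minimal sketch, assuming the ± values are symmetric half-widths of the 95% interval (my pipeline's resample count and exact reporting may differ):

```python
import numpy as np
from sklearn.metrics import f1_score


def bootstrap_f1_ci(y_true, y_pred, n_boot: int = 1000, seed: int = 0):
    """Point F1 plus an approximate 95% CI half-width from bootstrap resampling."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))  # resample examples with replacement
        scores.append(f1_score(y_true[idx], y_pred[idx]))
    lo, hi = np.percentile(scores, [2.5, 97.5])
    return f1_score(y_true, y_pred), (hi - lo) / 2
```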

In comparison, the authors report 79.9% F1 for the 20B Safeguard model and 75.9% for the base model. More prompt iteration is needed.

We do observe better performance for Safeguard, but there is a large gap in prompt adherence (Parse Rate), which possibly drives most of the difference.

Implementation Comments

In DN-01 I discussed getting the model to run on cheap cloud providers (Runpod and Vast.ai, with the most success on Vast.ai). Today I switched to using the models through HuggingFace Inference routed to Groq (the model provider / AI hardware company, not to be confused with Grok, the xAI model that has been known to call itself MechaHitler; Groq came first...). When setting up Groq, it is somewhat surprising how closely they follow OpenAI's site design. If you hid the Groq logo and stood a few feet away, I'm not sure I could tell the difference. But I guess it works...
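The routing itself is short with huggingface_hub's InferenceClient. A sketch of what I mean; the model id below is my guess at the Hub name for the Safeguard checkpoint, and the provider setting may need adjusting for your account:

```python
from huggingface_hub import InferenceClient

# Route chat completions through Hugging Face Inference Providers to Groq.
client = InferenceClient(provider="groq", api_key="hf_xxx")  # token with inference billing enabled

resp = client.chat_completion(
    model="openai/gpt-oss-safeguard-20b",  # assumed model id; check the Hub for the exact name
    messages=[
        {"role": "system", "content": "...policy text..."},
        {"role": "user", "content": "example ToxicChat user turn"},
    ],
    max_tokens=2048,
)
print(resp.choices[0].message.content)
```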

The pricing on HuggingFace Inference is confusing, but I eventually concluded that one can give it a Groq key, and it then follows Groq's pricing for GPT-OSS ($0.075 per million input tokens / $0.30 per million output tokens). For comparison, GPT-5 is priced at $1.25/$10 and GPT-5-Nano at $0.05/$0.40. However, with these CoT models the price per token is an incomplete metric, since each model can solve a given task with vastly different numbers of tokens. In brief research, Groq does not greatly undercut other providers for the same model (so no evidence yet of the magic of their custom non-NVIDIA hardware, though nuance is needed here), but it has the distinction of being the only provider offering the Safeguard variant.
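As a rough back-of-the-envelope at those Groq rates (the per-example token counts here are illustrative guesses, not measured from my runs):

```python
# Groq pricing quoted above: $0.075 per 1M input tokens, $0.30 per 1M output tokens.
INPUT_PER_M, OUTPUT_PER_M = 0.075, 0.30

# Hypothetical per-example counts: policy + ToxicChat turn in, short CoT + verdict out.
in_tok, out_tok, n_examples = 800, 400, 2000

cost = n_examples * (in_tok * INPUT_PER_M + out_tok * OUTPUT_PER_M) / 1e6
print(f"~${cost:.2f} for the full eval")  # about $0.36 at these guesses
```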

One consideration is that Groq limits the number of output tokens to only 8196. The model should be able to support up to 131,072 tokens according to OpenAI et al. (2025). This could be an issue on long reasoning problems. However, my preliminary finding is that this is not typically a major problem for ToxicChat (median reasoning <500 characters), though I still need to measure whether it ever occurs at all.
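To measure this, it should be enough to log the finish reason and completion-token usage on each call. A sketch of the check I plan to add, assuming the OpenAI-style response fields that huggingface_hub returns:

```python
def was_truncated(resp) -> bool:
    """True if the completion was cut off by the provider's output-token limit."""
    finish = resp.choices[0].finish_reason
    used = resp.usage.completion_tokens if resp.usage else None
    if finish == "length":
        print(f"Truncated completion after {used} output tokens")
        return True
    return False
```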

Blog Improvements

Not super exciting, but I built a better pipeline for tables like the one shown above. It still needs iteration.

Conclusion

Today I got an initial replication of one of the paper's results on ToxicChat. The code and prompts are shared for others. My preliminary result is lower than the authors' reported results, with a gap in prompt adherence possibly explaining the Safeguard version's better numbers.

Tomorrow I hope to experiment with a few more policies, analyze the cases where the Base and Safeguard versions differ, and then scope out which other research questions seem tractable in the short term. I hope to spend only a few more days on this before the full article discussing the model.