Dev Notes 02: Running GPT-OSS-Safeguard on ToxicChat
By David Gros. Version 0.1.0

Dev Notes (DN) are part of daily postings that work towards a larger article. Today's notes are a direct follow-up to yesterday's post (DN-01). There I described the GPT-OSS-Safeguard model recently released by OpenAI and some of the interesting aspects of this model that seem worth studying. The model allows the user to provide a "policy" that it uses to classify text as safe or unsafe. These notes document my attempt at replicating some of the results from the paper as a step towards deeper analysis.

In the paper they present results for ToxicChat (Lin et al., 2023). Notably, however, they do not give the policy they used for this eval, saying only "For ToxicChat, we evaluated using a short hand-written prompt adapted from some of our internal policies." (pg. 3). I ran one policy on a sample of 2,000 examples from ToxicChat; the findings are summarized in the Results section below. Ideally today I would have evaluated multiple prompts, but the rest will have to wait until tomorrow.
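For concreteness, the evaluation sample can be drawn roughly as in the sketch below. This is a minimal sketch rather than the exact code in my repo; the dataset id, config name, and column names are assumptions about the ToxicChat release on Hugging Face.

```python
# Minimal sketch of drawing the 2,000-example eval sample (dataset id, config
# name, and column names are assumptions, not necessarily what my pipeline uses).
from datasets import load_dataset

ds = load_dataset("lmsys/toxic-chat", "toxicchat0124", split="test")
sample = ds.shuffle(seed=0).select(range(2000))

examples = [
    {"text": row["user_input"], "label": int(row["toxicity"])}  # 1 = toxic / unsafe
    for row in sample
]
```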
Constructing Policy Prompts

I created two prompts (both are shared along with the code); results for the second one will have to wait until tomorrow.
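To give a sense of the format, here is an illustrative skeleton of a policy prompt and how it gets paired with the content to classify. This is not the actual policy I evaluated; the wording, the label vocabulary, and the policy-in-the-system-message convention are all illustrative assumptions.

```python
# Illustrative skeleton only -- NOT the actual policy used in the eval.
POLICY = """\
# Toxicity Policy (illustrative)

Decide whether the USER message violates the policy.

Violations include:
- Insults, harassment, or hateful remarks aimed at a person or group
- Requests for the assistant to produce such content

Non-violations include:
- Ordinary questions or task requests, even on sensitive topics

Respond with exactly one word: VIOLATING or SAFE.
"""

def build_messages(user_text: str) -> list[dict]:
    """Pair the policy (system message) with the content to classify (user message)."""
    return [
        {"role": "system", "content": POLICY},
        {"role": "user", "content": user_text},
    ]
```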
Results

GPT-OSS-20B with policy "ToxicChat Claude 1", on a 2,000-example ToxicChat sample:

| Model variant | F1 | Precision | Recall | Parse Rate (%) |
|---|---|---|---|---|
| Base | 0.519 ± 0.074 | 0.603 ± 0.088 | 0.457 ± 0.078 | 91.5 ± 1.3 |
| Safeguard | 0.536 ± 0.083 | 0.873 ± 0.082 | 0.388 ± 0.079 | 99.8 ± 0.3 |

In comparison, the authors reported 79.9% F1 for 20B Safeguard and 75.9% for the base model, so more prompt iteration is needed. We do observe better performance for Safeguard, but there is a large gap in prompt adherence (Parse Rate), which possibly drives most of the difference.
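For reference, the metrics are computed in the usual way; the sketch below shows roughly what that looks like. How unparsable responses are handled matters for interpreting the gap — here they are simply dropped before scoring and reported separately as Parse Rate, which is an assumption about the pipeline rather than a guarantee (the ± intervals in the table are likewise not reproduced in this sketch).

```python
# Sketch of scoring, assuming unparsable outputs are dropped before computing
# F1 / precision / recall and reported separately as Parse Rate.
from sklearn.metrics import precision_recall_fscore_support

def score(preds: list, labels: list) -> dict:
    """preds[i] is 1 (unsafe), 0 (safe), or None when no label could be parsed."""
    parsed = [(p, y) for p, y in zip(preds, labels) if p is not None]
    parse_rate = 100.0 * len(parsed) / len(preds)
    y_pred = [p for p, _ in parsed]
    y_true = [y for _, y in parsed]
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", pos_label=1
    )
    return {"f1": f1, "precision": precision, "recall": recall,
            "parse_rate": parse_rate}
```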
Implementation Comments

In DN-01 I discussed getting the model to run on cheap cloud providers (Runpod and Vast.ai, with the most success on Vast.ai). Today I switched to using the models through HuggingFace Inference routed to Groq. Groq is the model provider / AI-hardware company, not to be confused with Grok, the xAI model that has been known to call itself MechaHitler. Groq came first...

Setting up Groq, it is somewhat surprising how closely they follow OpenAI's site design. If you hid the Groq logo and stood a few feet away, I'm not sure I could tell the difference. But I guess it works...

The pricing on HuggingFace Inference is confusing, but I eventually concluded one can give it a Groq key, and it then follows Groq's pricing for GPT-OSS ($0.075 per M input tokens / $0.30 per M output tokens). For comparison, GPT-5 is priced at $1.25/$10 and GPT-5-Nano at $0.05/$0.40. However, with these CoT models the price per token is an incomplete metric, as each model can solve a given task with vastly different numbers of tokens. From brief research, Groq does not greatly undercut other providers for the same model — so no evidence yet of the magic of their custom non-NVIDIA hardware, though nuance is needed here — but it has the distinction of being the only provider offering the Safeguard variant.

One consideration is that Groq limits the number of output tokens to only 8,192, while the model should support up to 131,072 tokens according to OpenAI et al. (2025). This could be an issue on long reasoning problems. However, my preliminary finding is that this is not typically a major problem for ToxicChat (median reasoning <500 characters), though I still need to measure whether the cap is ever actually hit.
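For reference, the call through HuggingFace Inference is just the OpenAI-style chat API; below is a minimal sketch. The provider string, model id, and the usage/cost bookkeeping are my assumptions about this routing setup rather than verified details of the pipeline.

```python
# Sketch of routing a classification call through Hugging Face Inference to Groq.
# The model id and provider string are assumptions; build_messages is from the
# earlier policy sketch.
import os
from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="groq",                 # route requests to Groq
    api_key=os.environ["HF_TOKEN"],  # HF token; the Groq key is attached in HF settings
)

resp = client.chat_completion(
    messages=build_messages("example ToxicChat user message"),
    model="openai/gpt-oss-safeguard-20b",  # assumed model id
    max_tokens=1024,
)
print(resp.choices[0].message.content)  # reasoning may arrive in a separate field

# Rough cost bookkeeping at Groq's listed GPT-OSS rates ($0.075 / $0.30 per M tokens),
# which also makes it easy to check how close completions get to the output cap.
usage = resp.usage
cost = usage.prompt_tokens * 0.075 / 1e6 + usage.completion_tokens * 0.30 / 1e6
print(f"~${cost:.6f} for this call; completion tokens: {usage.completion_tokens}")
```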
Blog Improvements

Not super exciting, but I built a better pipeline for generating tables like the one shown above. It still needs iteration.
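The pipeline itself isn't worth showing in full, but the core of the idea is just turning a results dictionary into a table the blog can render; a toy version (not my actual code) looks like:

```python
# Toy version of the results-to-table step (not the actual pipeline).
import pandas as pd

rows = [
    {"Model": "Base", "F1": 0.519, "Precision": 0.603, "Recall": 0.457, "Parse Rate (%)": 91.5},
    {"Model": "Safeguard", "F1": 0.536, "Precision": 0.873, "Recall": 0.388, "Parse Rate (%)": 99.8},
]
print(pd.DataFrame(rows).to_markdown(index=False))
```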
Conclusion

Today I got an initial replication of one of the results on ToxicChat. The code and prompts are shared for others. My preliminary result is lower than the authors' reported results, with a gap in prompt adherence possibly explaining the Safeguard version's better results.

Tomorrow I hope to experiment with a few more policies, analyze the cases where the Base and Safeguard versions differ, and then scope out which other research questions seem tractable in the short term. I hope to spend only a few more days on this before the full article discussing the model.
