Dev Notes 03: With GPT-OSS-Safeguard a Simple Policy Outperforms Detailed Prompt on ToxicChat
By David Gros. . Version 0.1.0 Dev Notes (DN) are part of daily postings which work towards a larger article. Today's notes builds on the previous two days (DN-01, DN-02). In DN-01 I described research questions on GPT-OSS-Safeguard, an OpenAI-developed model which classifies text as unsafe or safe according a given natural language policy. In DN-02 I described replication progress on running the model on the ToxicChat dataset (Lin et al., 2023), running a single prompt. These notes describe today's effort on running on multiple prompts. Preliminary findings from only the 20B models on ToxicChat on three prompts are: The following describes the prompts ("policies") tested today. Click the names to view the full prompt in the WIP Github repo accompanying this post. "ToxicChat Claude 1": In this prompt I gave Claude the GPT-OSS-Safeguard paper (OpenAI, 2025) and prompt guide and had it write a prompt. It is a reasonable explanation of the task in a few hundred words. "Toxic Simple": The second is a very short policy essentially just asking it to "Please identify if the user message is toxic or a jailbreak" along with the output format. No examples or definitions are given. "ToxicChat Known Dataset": This prompt exploits the world knowledge of the model in that the model might already know what the ToxicChat dataset is as it was released before GPT-OSS was trained. In a brief exploration trying to get the GPT-OSS to describe details of the dataset, it hallucinated almost all details, writing 100s of tokens of incorrect description (wrong authors, wrong data format, wrong data format, etc). Still it was still able to output a reasonable explanation of what such a dataset The ToxicChat is highly imbalanced. Only 7% of the examples are labeled as positive (toxic). In most collections of chat turns, it is reasonable that toxic turns much more rare than safe turns. However, this makes experimenting on the dataset somewhat annoying. One has to run a few thousand examples to get reasonably tight confidence intervals on the values. I attempted an algorithm where we oversample positive examples and then reweighted the values back to the origional distribution. However, for reasons that are unclear to me, this resulted vastly different numbers than sampling from the true distribution. It is unclear if this is because of a property of the approach or a bug in the vibecoded code for this that I couldn't find in brief debugging. For now I randomly sample 2000 examples with uniform sampling. We observe GPT-OSS-Safeguard gets better F1 scores across 3 prompts than the base model on ToxicChat. The best result is observed using a Known Dataset prompt, obtaining an F1 of approximately 0.691 ± 0.061. However, the large confidence intervals makes comparison challenging, and the rank ordering compared to the simple prompt could change with more data. We note for all prompts the output parse rate (ie, whether it returned 0 or 1 in its output. See GH for exact parser) was almost ten points higher than on the base model. This deserves further exploration. Here we demonstrate evidence of how prompting of policies can dramatically impact the final results in this task. More evidence is needed on if GPT-OSS-Safeguard is more robust to prompt variations. While we now get closer to the reported result, we still underperform these results by about 10 points F1, and more iteration is needed. Tomorrow I hope to increase the sample size to get tighter confidence intervals, try a few additional prompts (possibly thinking more about smart sampling issue for quicker iteration), and then moving on to deeper comparison of the model behaviour beyond just output quality.
Methods
Prompts
could be about, indicating general understanding of the problem area. This prompt attempts to exploit this. We provide the bibtex entry of the actual dataset, and give a policy of classifying data in ToxicChat. This is a bit quirky, but also a weird property of running old datasets on new models. Even if you exclude the data, the model could learn more about how people talk about the domain after publication.Failed Attempts at Stratified Sampling
Results
GPT-OSS-20B Base Safeguard Policy F1 Precision Recall Parse Rate F1 Precision Recall Parse Rate ToxicChat Claude 1 0.497 ± 0.076 0.589 ± 0.089 0.430 ± 0.076 92.3 ± 1.2 0.525 ± 0.082 0.881 ± 0.081 0.375 ± 0.073 99.8 ± 0.2 Toxic Simple 0.590 ± 0.062 0.582 ± 0.073 0.600 ± 0.075 93.8 ± 1.1 0.675 ± 0.069 0.808 ± 0.079 0.581 ± 0.078 99.6 ± 0.3 ToxicChat Known Dataset 0.510 ± 0.065 0.497 ± 0.078 0.525 ± 0.073 92.2 ± 1.2 0.691 ± 0.061 0.689 ± 0.071 0.694 ± 0.071 99.2 ± 0.4 Conclusions
