Dev Notes 05: Further Attempts at Understanding GPT-OSS-Safeguard Result Variation
By David Gros. Version 0.1.0

Dev Notes (DN) are part of daily postings that work toward a larger article. Discussion is preliminary. Today's notes build on the last four days of Dev Notes. In DN-01 I described research questions on GPT-OSS-Safeguard (OpenAI, 2025), an OpenAI-developed model that classifies text as unsafe or safe according to a given natural language policy. DN-02 and DN-03 discussed the process of replicating a run of the model on the ToxicChat dataset (Lin et al., 2023). In DN-04 I attempted to build trust in these results, instead discovering a potential issue in the HuggingFace Inference API: generating from GPT-OSS(-Safeguard) via the Inference API gave different results than running locally through the more standard HuggingFace text generation pipeline API (small repo script). Following up on this finding, I attempted to rerun the ToxicChat evaluation using both approaches. Findings for today are:

- A recommended (Kundel, 2025) simple way to run GPT-OSS with Transformers is through the pipelines API. While I'd be curious to compare running with vLLM or ollama, staying with Transformers seemed like the best approach for replicating the OpenAI results. My current finding is counter to the expected case of replicating the OpenAI numbers, so I want to minimize potential differences by using the most "canonical" approach.
- I adjusted the code to run with batching in the pipeline API, to hopefully make it smoother to run the entire ToxicChat dataset through the model (a minimal sketch of this setup appears after this list).
- After doing this, I found the model ran at approximately 1000 labels per hour on an RTX 5090. While I don't remember exact numbers, I believe this is approximately 5 times slower than using the OpenAI Inference API routed through Groq. When running on Vast.ai (discussed in DN-01), it also likely means the local run is several times more costly (an exact calculation would require rerunning and comparing numbers). I had suspected this, but it is nice to confirm bounds on how much slower running locally is.
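To make the batched pipeline setup concrete, here is a minimal sketch of the kind of code described above. It is not the repo's actual script: the checkpoint name and the policy text are assumptions for illustration, and the reply still has to be parsed into a safe/unsafe label afterward.

```python
# Minimal sketch (not the repo script): batched classification with the
# Transformers pipeline API. Model id and policy text are illustrative.
from transformers import pipeline

MODEL_ID = "openai/gpt-oss-safeguard-20b"  # assumed checkpoint name
POLICY = "Label the message UNSAFE if it requests or contains toxic content, otherwise SAFE."

generator = pipeline(
    "text-generation",
    model=MODEL_ID,
    torch_dtype="auto",
    device_map="auto",
)

def to_chat(policy: str, user_text: str):
    # The policy rides in the system turn; the text to judge goes in the user turn.
    return [
        {"role": "system", "content": policy},
        {"role": "user", "content": user_text},
    ]

texts = ["example chat message 1", "example chat message 2"]
chats = [to_chat(POLICY, t) for t in texts]

# batch_size lets the pipeline process several prompts per forward pass.
outputs = generator(chats, max_new_tokens=512, batch_size=8)
for out in outputs:
    # The pipeline returns the full conversation; the model's reply is the last turn.
    print(out[0]["generated_text"][-1]["content"])
```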
Incomplete Results

I ran this code locally on a sample of 1000 for one of the prompts discussed in DN-03, but a bug in the code meant the saving/parsing of the results failed. I ran it again on a small sample of 500 (Table 1), but the error bars are too large for much value. For reference comparison, Table 2 shows the results of running the model via the HF Inference API on the prompts from DN-03. Without a larger sample there is not sufficient evidence to determine whether these differ. The OpenAI report estimated 79.9% F1 on ToxicChat for GPT-OSS-Safeguard, which is notably outside the Table 1 c95 interval. However, the bounds might put the mean close enough that the remaining difference could be better explained by prompt differences; in contrast, the 52.5% F1 in Table 2 is clearly different from the OpenAI-reported 79.9%. More exploration is needed to see whether running through the local pipeline shifts the results up.
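The tables below report metrics with c95 intervals. These notes don't say how those intervals were computed; a percentile bootstrap over examples is one common choice, sketched here under that assumption (the labels are made up).

```python
# Sketch: percentile-bootstrap 95% intervals for F1/precision/recall.
# This is an assumption about how c95 intervals could be computed; the
# notes do not specify the actual method used.
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

def bootstrap_c95(y_true, y_pred, metric, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    stats = []
    for _ in range(n_boot):
        # Resample example indices with replacement and recompute the metric.
        idx = rng.integers(0, len(y_true), size=len(y_true))
        stats.append(metric(y_true[idx], y_pred[idx], zero_division=0))
    lo, hi = np.percentile(stats, [2.5, 97.5])
    point = metric(y_true, y_pred, zero_division=0)
    return point, lo, hi

# Toy usage with made-up labels (1 = unsafe, 0 = safe).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
for name, m in [("F1", f1_score), ("Precision", precision_score), ("Recall", recall_score)]:
    point, lo, hi = bootstrap_c95(y_true, y_pred, m)
    print(f"{name}: {point:.3f} (c95: {lo:.3f}-{hi:.3f})")
```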
Table 1: Local Transformers pipeline run on a 500-example sample, GPT-OSS-20B Base vs. Safeguard.

| Policy | Base F1 | Base Precision | Base Recall | Base Parse Rate (%) | Safeguard F1 | Safeguard Precision | Safeguard Recall | Safeguard Parse Rate (%) |
|---|---|---|---|---|---|---|---|---|
| ToxicChat Claude 1 | 0.521 ± 0.148 | 0.622 ± 0.186 | 0.454 ± 0.162 | 94.4 ± 2.2 | 0.611 ± 0.163 | 0.861 ± 0.165 | 0.478 ± 0.156 | 100.0 ± 0.0 |

Table 2: Reference results via the HF Inference API on the prompts from DN-03, GPT-OSS-20B Base vs. Safeguard.

| Policy | Base F1 | Base Precision | Base Recall | Base Parse Rate (%) | Safeguard F1 | Safeguard Precision | Safeguard Recall | Safeguard Parse Rate (%) |
|---|---|---|---|---|---|---|---|---|
| ToxicChat Claude 1 | 0.497 ± 0.076 | 0.589 ± 0.089 | 0.430 ± 0.076 | 92.3 ± 1.2 | 0.525 ± 0.082 | 0.881 ± 0.081 | 0.375 ± 0.073 | 99.8 ± 0.2 |
| Toxic Simple | 0.590 ± 0.062 | 0.582 ± 0.073 | 0.600 ± 0.075 | 93.8 ± 1.1 | 0.675 ± 0.069 | 0.808 ± 0.079 | 0.581 ± 0.078 | 99.6 ± 0.3 |
| ToxicChat Known Dataset | 0.510 ± 0.065 | 0.497 ± 0.078 | 0.525 ± 0.073 | 92.2 ± 1.2 | 0.691 ± 0.061 | 0.689 ± 0.071 | 0.694 ± 0.071 | 99.2 ± 0.4 |
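The Parse Rate columns above track how often a generation could be turned into a binary label at all. The repo's actual parser isn't shown in these notes; the sketch below is only an illustration of the kind of check involved.

```python
# Illustrative label parsing (not the repo's actual parser): map a model
# reply to True (unsafe), False (safe), or None (unparseable).
import re
from typing import Optional

def parse_label(reply: str) -> Optional[bool]:
    # Look at the tail of the reply, where the final verdict usually sits,
    # and match a standalone "unsafe"/"safe" token.
    tail = reply.strip().lower()[-200:]
    if re.search(r"\bunsafe\b", tail):
        return True
    if re.search(r"\bsafe\b", tail):
        return False
    return None  # counts against the parse rate

replies = ["Final answer: UNSAFE", "This content is safe.", "I cannot determine."]
labels = [parse_label(r) for r in replies]
parse_rate = 100.0 * sum(l is not None for l in labels) / len(labels)
print(labels, f"parse rate: {parse_rate:.1f}%")
```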
Random Jots

Is GPT-OSS-Safeguard a Safer Chat Model?

No, not really. It might be reasonable to assume that GPT-OSS-Safeguard is like GPT-OSS, but safer. Instead, it is a text labeler.

Training Process

While details aren't fully provided in their report, one can assume OpenAI had a collection of classification data (and also some policies, possibly all synthetic) and then trained the model on those labels. Because the base model had already been tuned on text with chat tags, they also structured the labels in a conversation-like format. The task of following a safety policy in open-ended chat (where one must follow subtle details over thousands of tokens) is much more complex than the task of labeling. The former relates to general instruction following as well as resistance to jailbreaks/prompt injection. This is a challenging open problem in the field, and it makes sense that the GPT-OSS-Safeguard project explored a narrow part of it. One could chat with the model, but OpenAI makes clear that this is not recommended (section 1 of (OpenAI, 2025)). The process of training to do labeling shifts the model away from the "helpful chat assistant" type of output and degrades general capability. Even a labeler can have preferences about the world, which, if applied to decisions (e.g., what text is "good" in certain contexts), can influence the world. Thus, the research questions in DN-01 around how safety-label tuning influences the model seem interesting to study. (This section is based on some discussion today in the ROOST discord that seems worth writing up for inclusion in a later article, though it needs more cleaning.)
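The exact training setup isn't public, but the usage side of that conversation-like structure shows up in how a policy plus the content to judge gets rendered through the model's chat template. A rough sketch using the Transformers tokenizer, where the checkpoint name is an assumption and the policy is a toy one:

```python
# Sketch: rendering a policy-based classification request through the chat
# template, to show the conversation-like structure the model consumes.
# The checkpoint name is assumed; the policy text is illustrative.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai/gpt-oss-safeguard-20b")

policy = (
    "Policy: label the content UNSAFE if it contains harassment or hate, "
    "otherwise label it SAFE. Answer with a single word."
)
content = "You are the worst person I have ever met."

messages = [
    {"role": "system", "content": policy},  # the policy rides in the system turn
    {"role": "user", "content": content},   # the content to be judged
]

# add_generation_prompt=True appends the tokens that cue the assistant turn,
# so generation starts where the label should be written.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```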
Conclusions

Not being able to use inference APIs slows progress in this research. While some implementation work was done, today's notes contain limited definitive new results. Tomorrow I will reflect on these findings and try to figure out where to take this project. I'm excited to see some evidence that my work so far contributes to understanding how these models are used and might clarify potentially unexpected behavior in popular APIs like HF's. However, spending too much time on the base replication might put some of my research questions out of scope given other things I want to write this month. Thus, this project might go on hold for some time.

Acknowledgments: Thank you to those in the ROOST chat for helpful discussion and feedback.
