Dev Notes 05: Further Attempts at Understanding GPT-OSS-Safeguard Result Variation

By . . Version 0.1.0

Dev Notes (DN) are part of daily postings which work towards a larger article. Discussion is preliminary.

Today's notes build on the last four days of Dev Notes. In DN-01 I described research questions on GPT-OSS-Safeguard (OpenAI, 2025), an OpenAI-developed model which classifies text as unsafe or safe according to a given natural language policy. DN-02 and DN-03 discussed replicating the evaluation of the model on the ToxicChat dataset (Lin et al., 2023). In DN-04 I attempted to build trust in these results, instead discovering a potential issue in the HuggingFace Inference API: generating from GPT-OSS(-Safeguard) via the Inference API gave different results than running locally through the more standard HuggingFace text generation pipeline API (small repo script). Following up on this finding, I attempted to rerun the ToxicChat evaluation using both approaches.

Findings for today are:

  • Using the HuggingFace text generation pipeline with batching on an RTX 5090 yields a somewhat slow pace of approximately 1000 labels per hour from GPT-OSS. This is notably slower than running through the Inference API routed to Groq.
  • Others tentatively agree that the behavior of the HuggingFace Inference API might indicate a bug worth investigating. I appreciate the help getting this raised with the HuggingFace/OpenAI side.

Local Inference Process

A simple, recommended (Kundel, 2025) way to run GPT-OSS with Transformers is through the pipelines API. While I'd be curious to compare running with vLLM or Ollama, staying with Transformers seemed like the best approach for replicating the OpenAI results. My current finding runs counter to the expected case of replicating the OpenAI numbers, so I want to minimize potential differences by using the most "canonical" approach.
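For concreteness, below is a minimal sketch of what that pipeline usage looks like. The model id and the policy-in-system-prompt framing are my assumptions for illustration, not necessarily the exact setup in the repo script.

```python
# Minimal sketch of running GPT-OSS through the Transformers pipelines API.
# The model id and prompt structure are assumptions for illustration.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",  # swap for the Safeguard checkpoint as needed
    torch_dtype="auto",
    device_map="auto",  # requires accelerate
)

messages = [
    {"role": "system", "content": "POLICY: <natural language policy text>"},
    {"role": "user", "content": "Text to classify under the policy"},
]
out = generator(messages, max_new_tokens=256)
# With chat-style input, the pipeline appends the assistant turn to the conversation.
print(out[0]["generated_text"][-1]["content"])
```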

I adjusted the code to run with batching in the pipeline API, to hopefully make it smoother to run the entire ToxicChat dataset through the model (see the sketch below). After doing this, I found the models ran at approximately 1000 labels per hour on an RTX 5090. While I don't remember exact numbers, I believe this is approximately 5 times slower than using the Inference API routed through Groq. When running on Vast.ai (discussed in DN-01) it also likely means it is several times more costly; again, an exact calculation here would take rerunning and comparing numbers. I had suspected this, but it is nice to confirm bounds on how much slower running locally is.
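The batching change mostly amounts to handing the pipeline a list of conversations along with a batch_size. A hedged sketch, reusing the `generator` above and assuming a list `user_texts` of ToxicChat inputs:

```python
# Hedged sketch of batched labeling over many prompts with the pipeline API.
# `user_texts` is an assumed list of ToxicChat user messages.

# Some tokenizers need an explicit pad token before batched generation.
if generator.tokenizer.pad_token_id is None:
    generator.tokenizer.pad_token_id = generator.tokenizer.eos_token_id

batched_inputs = [
    [
        {"role": "system", "content": "POLICY: <natural language policy text>"},
        {"role": "user", "content": text},
    ]
    for text in user_texts
]

# batch_size trades throughput for GPU memory; 8 is illustrative, not tuned.
results = generator(batched_inputs, max_new_tokens=256, batch_size=8)
labels = [r[0]["generated_text"][-1]["content"] for r in results]
```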

Incomplete Results

I ran this code locally on a sample of 1000 for one of the prompts discussed in DN-03, but a bug in the code meant the saving/parsing of the results failed. I ran it again on a smaller sample of 500, but the error bars are too wide to be of much value.

| Policy | Base F1 | Base Precision | Base Recall | Base Parse Rate (%) | Safeguard F1 | Safeguard Precision | Safeguard Recall | Safeguard Parse Rate (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ToxicChat Claude 1 | 0.521 ± 0.148 | 0.622 ± 0.186 | 0.454 ± 0.162 | 94.4 ± 2.2 | 0.611 ± 0.163 | 0.861 ± 0.165 | 0.478 ± 0.156 | 100.0 ± 0.0 |

Table 1. Results from locally running the GPT-OSS-20B Base and Safeguard models on the ToxicChat Claude 1 policy. c95 confidence intervals are shown, calculated via bootstrapping. Only 500 examples are sampled, resulting in large error bars.
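For reference, the kind of bootstrapped c95 interval reported above can be computed roughly as follows. This is my own illustration (binary labels, unsafe = 1), not the repo's actual evaluation code.

```python
# Illustrative bootstrapped 95% confidence interval for F1,
# similar in spirit to the intervals in Tables 1 and 2.
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_f1_ci(y_true, y_pred, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample examples with replacement
        scores.append(f1_score(y_true[idx], y_pred[idx]))
    point = f1_score(y_true, y_pred)
    lo, hi = np.percentile(scores, [2.5, 97.5])
    return point, lo, hi
```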

For reference, Table 2 shows the comparison numbers from running the model via the HF Inference API on the prompts in DN-03.

| Policy | Base F1 | Base Precision | Base Recall | Base Parse Rate (%) | Safeguard F1 | Safeguard Precision | Safeguard Recall | Safeguard Parse Rate (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ToxicChat Claude 1 | 0.497 ± 0.076 | 0.589 ± 0.089 | 0.430 ± 0.076 | 92.3 ± 1.2 | 0.525 ± 0.082 | 0.881 ± 0.081 | 0.375 ± 0.073 | 99.8 ± 0.2 |
| Toxic Simple | 0.590 ± 0.062 | 0.582 ± 0.073 | 0.600 ± 0.075 | 93.8 ± 1.1 | 0.675 ± 0.069 | 0.808 ± 0.079 | 0.581 ± 0.078 | 99.6 ± 0.3 |
| ToxicChat Known Dataset | 0.510 ± 0.065 | 0.497 ± 0.078 | 0.525 ± 0.073 | 92.2 ± 1.2 | 0.691 ± 0.061 | 0.689 ± 0.071 | 0.694 ± 0.071 | 99.2 ± 0.4 |

Table 2. Results from running the GPT-OSS-20B Base and Safeguard models via the remote HuggingFace Inference API with 2000 samples.

Without a larger sample there is not sufficient evidence to determine whether the two setups differ. The OpenAI report estimated 79.9% F1 on ToxicChat for GPT-OSS-Safeguard, which is notably outside the Table 1 c95 interval. However, the Table 1 bounds place the mean close enough that the remaining gap could plausibly be explained by prompt differences, whereas the 52.5% in Table 2 is clearly different from the OpenAI-reported 79.9%. More exploration is needed to see whether running through the local pipeline shifts the results up.

Random Jots

Is GPT-OSS-Safeguard a Safer Chat Model?

No, not really.

It might be reasonable to expect GPT-OSS-Safeguard to be like GPT-OSS, but safer. Instead, it is a text labeler.

While details aren't fully provided in their report, one can assume OpenAI had a large set of classification data and policies (possibly all synthetic) and then trained the model on those labels. Because the base model had already been tuned on text with chat tags, they also structured the labels in a conversation-like format.
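As a purely hypothetical illustration (nothing below is from OpenAI's report), such a conversation-formatted labeling example might look roughly like:

```python
# Hypothetical sketch of a policy-labeling example framed as a conversation.
# The roles, policy wording, and label string are illustrative guesses,
# not the actual training format described by OpenAI.
training_example = [
    {"role": "system", "content": "POLICY: Flag content that harasses or threatens a person."},
    {"role": "user", "content": "Example message to be judged against the policy."},
    {"role": "assistant", "content": "unsafe"},  # the target label the model learns to emit
]
```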

The task of following a safety policy in open ended chat (where one must follow subtle details over thousands of tokens) is much more complex than the task of labeling. The former relates to general instruction following as well as resistance to jailbreaks/prompt injection. This is a challenging open problem in the field, and it makes sense why the GPT-OSS-Safeguard project explored a narrow part of this.

One could chat with the model, but OpenAI makes clear this is not recommended (section 1 of OpenAI, 2025). Training the model to do labeling shifts its output away from the "helpful chat assistant" style and degrades general capability.

Even a labeler can have preferences about the world, which, if applied to decisions (e.g., what text is "good" in certain contexts), can influence the world. Thus, the research questions in DN-01 around how safety label tuning influences the model seem interesting to study.

(This section is based on some discussion today in the ROOST Discord that seems worth writing up for inclusion in a later article, though it needs more cleaning.)

Conclusions

Not being able to use inference APIs slows progress in this research. While some implementation work was done, today's notes include few definitive new results. Tomorrow I will reflect on these findings and try to figure out where to take this project. I'm excited to see some evidence that my work so far contributes to understanding how to use these models and might clarify potentially unexpected behavior in popular APIs like HF's. However, spending too much time on the base replication might put some of my research questions out of scope given other things I want to write this month. Thus, this project might go on hold for some time.


Acknowledgments: Thank you to those in the ROOST chat for helpful discussion and feedback.