Dev Notes 04: Confusion/Caution(?) on running GPT-OSS-Safeguard with HuggingFace Inference

By . . Version 0.1.0

Dev Notes (DN) are part of daily postings which work towards a larger article. Discussion is preliminary.

Today's notes build on the last three days of Dev Notes. In DN-01 I described research questions on GPT-OSS-Safeguard, an OpenAI-developed model that classifies text as unsafe or safe according to a given natural-language policy. In DN-02 and DN-03 I described replication progress on running the model on the ToxicChat dataset (Lin et al., 2023) with three separate prompts. The results of that experiment suggested that a simple prompt outperformed a more complex one. Today I set out to build better trust in that result. After some much-appreciated feedback, I investigated the data formatting used (specifically, proper use of the Harmony chat format) and found surprising disagreements in the output.

Findings for today include:

  • Using the HuggingFace Inference API to call GPT-OSS and GPT-OSS-Safeguard currently appears to give different results than calling them through the HuggingFace text-generation pipeline. Given how popular this library is, this could affect many developers and needs investigation to confirm.
  • My current recommendation is to use these models only through the pipeline, though this is based on limited experimentation.

Verifying Model Data Format

Background

I shared these notes in the ROOST Discord and in the GitHub discussion thread. (ROOST is an organization being spun up by Camille François et al. around open source safety tools; it collaborated with OpenAI on the GPT-OSS-Safeguard release.) I was grateful to receive some feedback on the notes so far. Juliet Shen asked whether I was properly using the Harmony chat format, as their prior findings implied it might dramatically influence the results. (For background, Harmony is OpenAI's current format for presenting a prompt to a chat model. The LLM only sees a sequence of tokens, so each turn of the conversation has to be delimited with particular special tokens. During training the LLM can learn circuits that rely on the existence of those tokens, and omitting them can degrade performance.) While my use of established APIs made me believe I was including them, I set out to investigate.
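
As a concrete sanity check, the Harmony formatting can be rendered locally with the tokenizer's chat template and inspected directly. This is a minimal sketch; the model id, policy text, and example message are placeholders rather than my exact setup.

```python
# Minimal sketch: render the Harmony-formatted prompt locally so the special
# tokens delimiting each turn (e.g. <|start|>, <|message|>, <|end|>) are visible.
# Model id and message contents are illustrative placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-safeguard-20b")

messages = [
    {"role": "system", "content": "POLICY: label the message as unsafe or safe."},
    {"role": "user", "content": "Example ToxicChat message to classify."},
]

# tokenize=False returns the rendered text rather than token ids, so the
# Harmony delimiters can be read directly.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```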

In DN-02 I discussed switching from running the model on a cloud instance from Vast.ai to using HuggingFace Inference, routed through Groq (not the same as Grok). This API helps unify running models across different providers, but it adds layers of abstraction that leave me somewhat removed from the actual tokens going into the model. However, the interface (the code path used in DN-03 is here) is passed OpenAI-API-style content dicts, so I had thought I could comfortably assume that the HF Inference API was properly processing them into the chat format the given model prefers.
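
For reference, the Inference API call in question looks roughly like the sketch below, assuming the huggingface_hub InferenceClient with provider routing; the model id and messages are placeholders, not the exact DN-03 code.

```python
# Rough sketch of the Inference API path (placeholders, not the exact DN-03 code):
# OpenAI-style message dicts handed to huggingface_hub's InferenceClient,
# with the request routed to Groq as the provider.
from huggingface_hub import InferenceClient

client = InferenceClient(provider="groq")  # provider routing; assumes a recent huggingface_hub

response = client.chat_completion(
    model="openai/gpt-oss-safeguard-20b",  # illustrative model id
    messages=[
        {"role": "system", "content": "POLICY: label the message as unsafe or safe."},
        {"role": "user", "content": "Example ToxicChat message to classify."},
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```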

To investigate this I started trying to understand how the HuggingFace Inference API handles the data. It was unclear what HF sends over the wire, and whether it formats the text with Harmony tokens client-side or not. I vibecoded a script that intercepts the network requests and seemingly confirmed that HF sends the OpenAI-style dicts over the wire. This mostly matches expectations, but it means any formatting with Harmony tokens happens opaquely server-side, so I could not confirm locally whether it was being applied.
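
One low-tech way to see the client-side payload, without a custom interception script, is Python's built-in HTTP debug logging; this is a sketch of the general idea rather than the script I actually used.

```python
# Sketch: dump the raw HTTP request that the client library sends, so the wire
# payload (OpenAI-style JSON dicts, no Harmony tokens) can be inspected.
# This only shows the client side; any server-side Harmony formatting stays opaque.
import http.client
import logging

http.client.HTTPConnection.debuglevel = 1  # echoes request lines, headers, and body
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("urllib3").setLevel(logging.DEBUG)

# ...then issue the InferenceClient call as usual and read the dumped "send:" lines.
```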

Unfortunately, this meant my remaining option was to run the model on my own machine, where I can confirm the prompt formatting is applied properly, and compare those outputs against the results I get back from HF Inference.
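
The local side of that comparison is roughly the following; as above, the model id and messages are placeholders, and running a model of this size locally assumes a suitable GPU.

```python
# Sketch of the local comparison path: the transformers text-generation pipeline
# applies the chat template (and thus the Harmony tokens) locally, where it can
# be inspected, unlike the hosted Inference API.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="openai/gpt-oss-safeguard-20b",  # illustrative model id
    device_map="auto",
)

messages = [
    {"role": "system", "content": "POLICY: label the message as unsafe or safe."},
    {"role": "user", "content": "Example ToxicChat message to classify."},
]

out = generator(messages, max_new_tokens=256)
# With chat-style input, generated_text is the conversation with the new
# assistant turn appended; compare this against the Inference API's answer.
print(out[0]["generated_text"][-1])
```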

Surprisingly, I found that these differed, not only for the newer GPT-OSS-Safeguard but also for GPT-OSS. On 50 ToxicChat examples sampled for balanced labels, 6/50 received different labels from GPT-OSS depending on whether it was run through the pipeline API or the Inference API. For GPT-OSS-Safeguard the count was 1/50. Such a small sample will not reveal the true rate of disagreement, but the fact that there are any differences is alarming and needs investigation.

There are several candidate explanations:

  • (1) A bug in my code
  • (2) A bug in HuggingFace Hub
  • (3) A bug in Groq

I get different outputs from the Inference API and the pipeline in a fairly straightforward reproduction (a small script), which reduces the surface area for #1. Candidates #2 and #3 happen on a third-party backend and are hard to investigate.

There's also another possible option:

  • (4) It's not an issue of Harmony tokens, but instead a difference in defaults for configurable aspects like reasoning effort (a quick check is sketched below).
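
A quick way to probe (4) is to render the prompt locally and see which reasoning effort the chat template bakes into the Harmony system header, since the hosted backend may apply a different default. The `reasoning_effort` argument is my assumption based on the gpt-oss model card and may differ across transformers versions; the model id is again illustrative.

```python
# Probe for candidate (4): check the reasoning-effort line the chat template
# writes into the Harmony system header. The reasoning_effort kwarg is an
# assumption (documented for gpt-oss, but version-dependent).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-safeguard-20b")
messages = [{"role": "user", "content": "Example ToxicChat message to classify."}]

rendered = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    reasoning_effort="low",  # assumption: supported by the gpt-oss chat template
)
print([line for line in rendered.splitlines() if "Reasoning" in line])
```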

More exploration is needed. Let me know on GitHub if you have any feedback or if you have encountered a similar issue.

Conclusions

Today left more questions than answers. I will likely need to run the models without the Inference API, which slightly complicates things but is likely doable. The findings from DN-03 should be viewed with uncertainty.

I had hoped to hit a self-imposed deadline of having my thoughts together on GPT-OSS-Safeguard by Friday. I think it will still be possible to provide some insight into the model in that time. However, the main contribution might end up being advice and debugging tips for using the model (and possibly, if confirmed, identifying a bug in a major library).


Acknowledgments: Thank you to Juliet Shen for suggesting I investigate this further.