Dev Notes 04: Confusion/Caution(?) on running GPT-OSS-Safeguard with HuggingFace Inference
By David Gros. Version 0.2.0

Dev Notes (DN) are part of daily postings that work toward a larger article. Discussion is preliminary.

Today's notes build on the last three days of Dev Notes. In DN-01 I described research questions on GPT-OSS-Safeguard, an OpenAI-developed model that classifies text as unsafe or safe according to a given natural-language policy. In DN-02 and DN-03 I described replication progress on running the model on the ToxicChat dataset (Lin et al., 2023) with three separate prompts. The results of that experiment suggested that a simple prompt outperformed a more complex prompt. Today I set out to build better trust in that result. After some well-appreciated feedback, I investigated the data formatting used (proper use of the Harmony chat format) and found surprising disagreements in the output. The main finding for today: the same prompts produce different labels through HuggingFace Inference than through a local run, which puts the DN-03 results in doubt.

Verifying Model Data Format

Background

I shared these notes in the ROOST discord and GitHub discussion thread (ROOST is an organization being spun up by Camille François et al. around open source safety tools; it collaborated with OpenAI on the GPT-OSS-Safeguard release). I was grateful to receive some feedback on the notes so far. Juliet Shen and Vinay Rao brought up whether I was properly using the Harmony chat format, as their prior findings suggested it might dramatically influence the results. (For background, the Harmony format is OpenAI's current way of formatting a prompt for a chat model. The LLM only sees a sequence of tokens, so there needs to be some way of denoting each turn of the conversation; this is done with particular special tokens. During training the LLM can learn circuits that rely on the existence of those tokens, and omitting them can degrade performance.) While my use of established APIs made me believe I was including these tokens, I set out to investigate.
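To make the Harmony format concrete: rendering a conversation through the model's own chat template shows the special tokens involved. Below is a minimal sketch (not a script from these notes); it assumes the openai/gpt-oss-safeguard-20b checkpoint on HuggingFace and that its tokenizer ships a Harmony chat template, and the token names in the comment are also an assumption.

```python
# Minimal sketch: render a chat locally to inspect the Harmony formatting.
# The model id and the expected token names are assumptions, not verified here.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-safeguard-20b")

messages = [
    {"role": "system", "content": "POLICY: Flag harassment or hate. Respond 0 (safe) or 1 (unsafe)."},
    {"role": "user", "content": "Example conversation text to classify."},
]

# tokenize=False returns the formatted string so the special tokens are visible.
formatted = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(formatted)  # expect Harmony-style markers (e.g. <|start|>...<|message|>...<|end|>) if the template is applied
```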
In DN-02 I discussed switching from running the model on a cloud instance from Vast.ai to using HuggingFace Inference, routed through Groq (not the same as Grok). This API helps unify running models across different providers, but it adds layers of abstraction that leave me somewhat removed from the actual tokens going into the model. However, the interface (the codepath used in DN-03 is here) is passed OpenAI-API-style content dicts, so I had assumed the HF Inference API was properly converting these into the chat format the given model prefers.
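For reference, the call has roughly this shape (a sketch, not the DN-03 code itself; the provider string and model id are assumptions about my setup). The key point is that the client is handed plain role/content dicts, and any Harmony formatting has to happen somewhere downstream.

```python
# Rough sketch of the HF Inference codepath (not the exact DN-03 script).
# The provider string and model id are assumptions about my setup.
from huggingface_hub import InferenceClient

client = InferenceClient(provider="groq")  # HF Inference routing through a provider

response = client.chat_completion(
    model="openai/gpt-oss-safeguard-20b",
    messages=[
        {"role": "system", "content": "POLICY: ... Respond with 0 (safe) or 1 (unsafe)."},
        {"role": "user", "content": "Conversation text from ToxicChat."},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```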
Investigation Methods

To investigate, I started by trying to understand how the HuggingFace Inference API handles the data. It was unclear what HF actually sends over the wire, and in particular whether it formats the text with Harmony tokens client-side or not. I vibecoded a script that intercepts the network requests, and it seemingly confirmed that HF sends the OpenAI-style dicts over the wire. This mostly matches expectations, but it means any formatting with Harmony tokens happens opaquely server-side, and I could not confirm locally whether it was being applied.
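The interception does not need anything elaborate. A minimal sketch of the idea, assuming the client sends its HTTP traffic through the requests library (if it uses a different HTTP stack, the same trick applies to that stack's send/request hook; an HTTP debugging proxy such as mitmproxy would also work):

```python
# Sketch: log outgoing request bodies to see what actually goes over the wire.
# Assumes the inference client uses the `requests` library under the hood.
import json
import requests

_original_send = requests.Session.send

def logging_send(self, request, **kwargs):
    # request.body is the serialized payload that will be sent
    if request.body:
        try:
            print(json.dumps(json.loads(request.body), indent=2))
        except (TypeError, ValueError):
            print(request.body)
    return _original_send(self, request, **kwargs)

requests.Session.send = logging_send
# ...now make the chat_completion call as usual and inspect the printed payload.
```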
Unfortunately, this left me with essentially one option: run the prompts on my own machine, where I can confirm the proper use of the prompt formatting, and compare those outputs with the results I get back from HF Inference.
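The local "pipeline" codepath looks roughly like the following (again a sketch: the model id and generation settings are assumptions, and the real comparison script lives in the repo). Passing role/content dicts to a local transformers pipeline lets the model's own chat template be applied on my machine, where it can be inspected.

```python
# Sketch of the local "pipeline" codepath used for the comparison.
# Model id and generation settings are assumptions, not the exact script.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="openai/gpt-oss-safeguard-20b",
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {"role": "system", "content": "POLICY: ... Respond with 0 (safe) or 1 (unsafe)."},
    {"role": "user", "content": "Conversation text from ToxicChat."},
]

# With chat-style input the pipeline applies the model's chat template
# (Harmony) locally before generation.
out = pipe(messages, max_new_tokens=512)
print(out[0]["generated_text"][-1]["content"])  # the new assistant turn
```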
Findings

I ran 50 ToxicChat examples, sampled with balanced labels, through both codepaths and compared the labels. Surprisingly, they differed, and not only for the newer GPT-OSS-Safeguard but also for GPT-OSS. Of the 50 ToxicChat examples, 6/50 received different labels from GPT-OSS when run through the pipeline API vs. the Inference API. For GPT-OSS-Safeguard the number was 1/50.
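The comparison itself is just a loop over the sampled examples. A minimal sketch is below, where classify_remote and classify_local are hypothetical wrappers around the two codepaths sketched above (each returning a 0/1 label), not functions from the actual repo script.

```python
# Sketch of the label comparison between the two codepaths.
# classify_remote / classify_local are hypothetical wrappers, passed in
# as callables, standing in for the HF Inference and local pipeline runs.
def compare_labels(texts, classify_remote, classify_local):
    """texts: ToxicChat examples sampled with balanced labels."""
    diffs = []
    for text in texts:
        remote = classify_remote(text)  # label via HF Inference / Groq
        local = classify_local(text)    # label via local transformers pipeline
        if remote != local:
            diffs.append((text, remote, local))
    print(f"{len(diffs)}/{len(texts)} examples received different labels")
    return diffs
```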
Such a small sample will not reveal the true rate of differences, but the fact that there are any differences at all is alarming and needs investigation. There are several possible candidates: (1) a bug in my own code, (2) the chat formatting being handled differently on the serving side, or (3) some other difference in the third-party serving stack. I get different outputs from the Inference and Pipeline codepaths in a fairly straightforward reproduction (a small repo script), which reduces the surface for #1. Candidates #2 and #3 happen on a 3rd-party backend and are hard to investigate. There is also at least one other possible explanation I have not ruled out. More exploration is needed. Let me know on GH if you have any feedback or if you have encountered a similar issue.

Edit (Nov 6): thanks @andrewmchang for helping flag this with OpenAI/HF. I will probably not edit this post again, so check the thread/future notes if you would find updates helpful.

Conclusions

Today left more questions than answers. I will likely need to run the models without using the Inference API; this slightly complicates things but is likely doable. The findings from DN-03 should be viewed with uncertainty. I had hoped to hit a self-imposed deadline of having my thoughts together on GPT-OSS-Safeguard by Friday. I think it will still be possible to provide some insight into the model in that time, but the main contribution might end up being advice or debugging tips on using the model (and possibly identifying a bug in a major library, if confirmed).

Acknowledgments: Thank you to those in the ROOST community for suggesting I investigate this further and for helping me understand the results.
