Dev Notes 01: Exploration and Setup of GPT-OSS-Safeguard

By . . Version 1.0.0

This is the first of possibly several "Dev Notes" started as part of daily posts in November. These document some of the process of working toward a larger article.

A few days ago OpenAI released an interesting model, "GPT-OSS-Safeguard" (OpenAI, 2025b). It is a follow-up to GPT-OSS (OpenAI et al., 2025a), a version of their headline product with openly downloadable weights. They took that base model and tuned it to classify whether text is safe under a given policy. It is intended for use cases like content moderation, with a "bring your own policy" approach. They contrast this with prior moderation models where the policy is baked in.

Some Questions For Exploration

After reading the paper, I was left wondering several things:

  • How does all the training on safety policies influence how the model thinks?
    • Does this model have different views on morality than the base model?
    • How have the internals of the model shifted from this training? There has been some prior work on "model diffing", and I view applying model-diffing techniques to GPT-OSS-Safeguard as an interesting extension (see the sketch after this list).
  • What are some characteristics of cases where GPT-OSS-Safeguard succeeds, but GPT-OSS does not? What are the new CoT patterns here that enable it?
  • How does the user's choice of natural language influence the model? For example, does the model make different safety judgments when given English, Chinese, or a low-resource language? ChatGPT has an astounding 700 million(!) weekly users (Chatterji et al., 2025), and many millions of them are chatting in a language other than English (the exact breakdown is unclear: Chatterji et al. (2025) at OpenAI analyze ChatGPT usage by country GDP groupings, but do not appear to report the underlying country or language distribution). While the GPT-OSS-Safeguard model card gives Multilingual MMLU results evaluating capabilities (while very oddly leaving English MMLU out of their Table 3 results), they omit multilingual evaluation of anything related to safety. This is challenging to work on, as unsafe utterances often rely on slang or subtle parts of language, making translation and curation difficult. More discussion is needed.
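
As a concrete starting point for the model-diffing question, here is a minimal sketch of the kind of comparison I have in mind: load both checkpoints and rank parameters by the relative size of the weight delta. The Hugging Face model ids and the assumption that the two checkpoints share parameter names are my own guesses, any quantized weights may need dequantizing before such a comparison is meaningful, and loading two 20B models in one process is memory heavy (a real run would likely stream tensors shard by shard instead).

from transformers import AutoModelForCausalLM

BASE_ID = "openai/gpt-oss-20b"                 # assumed HF id for the base model
SAFEGUARD_ID = "openai/gpt-oss-safeguard-20b"  # assumed HF id for the safeguard variant

base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype="auto")
safeguard = AutoModelForCausalLM.from_pretrained(SAFEGUARD_ID, torch_dtype="auto")

safeguard_params = dict(safeguard.named_parameters())
deltas = []
for name, p_base in base.named_parameters():
    p_safe = safeguard_params.get(name)
    if p_safe is None or p_safe.shape != p_base.shape:
        continue  # skip parameters that do not line up one-to-one
    rel = (p_safe.float() - p_base.float()).norm() / (p_base.float().norm() + 1e-8)
    deltas.append((name, rel.item()))

# The largest relative changes give a crude picture of where the safety tuning concentrated.
for name, rel in sorted(deltas, key=lambda x: -x[1])[:20]:
    print(f"{rel:8.4f}  {name}")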

Attempts At Running The Model

I don't have access to a lab machine that can run the model right now. I attempted to coax my desktop with 32GB of RAM into running it on CPU, but it did not cooperate. Instead, I turned to cheap cloud providers.
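
For the curious, the CPU-only attempt looked roughly like the sketch below, a plain transformers load with accelerate allowed to spill weights to disk. The model id and offload folder are my own placeholders, and with only 32GB of RAM this did not get far for me.

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "openai/gpt-oss-safeguard-20b"  # assumed HF id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map="auto",         # no GPU here, so accelerate places weights on CPU and disk
    offload_folder="offload",  # spill layers that do not fit in RAM to this folder
)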

Some providers stretch rules to operate more cheaply than established clouds such as AWS, Azure, GCP, etc. Since 2017 NVIDIA has fought its way to the top of the AI totem pole (recently becoming the first company with a $5T market cap), in one small part by restricting how its GPUs are used, which helps it maintain high margins across product segments. According to their license agreements, their consumer cards (such as the RTX 4090), originally aimed at gaming, cannot be used in datacenters. However, some providers still offer these cards. Some brand themselves as a "marketplace" for connecting with small independent hosts, as a possible workaround. But they still basically try to be like a normal cloud, and at least one (Runpod) offers "secure cloud" instances, which certainly makes it seem like they are operating a datacenter. Overall it is unclear how this works, but they are about a tenth the price of renting a true datacenter card.

Runpod. After brief research, Runpod seemed like the best initial choice. After selecting some options, handing over $10, and uploading an SSH key, I had a machine up and running. However, I ran into a series of issues connecting to it. The first seemed to be due to an undocumented lack of support for RSA SSH keys; giving it an ed25519 key made it happy. However, the connection was a weird (proxied?) SSH connection that would not cooperate with scp, rsync, or VS Code remote development. The docs say the latter should work, but in brief tries it would not. Bummer. I moved on, and will hopefully find another use for the credits when I don't need interactive use.

Vast AI. This worked. It was also cheaper than Runpod for a more powerful machine with lower latency. Interestingly, the machine I used booted right into a tmux session. I actually like this. However, it did break VS Code remote development. Manually configuring ~/.ssh/config with a RemoteCommand allowed it to connect:

Host vast-gpu
    HostName <FILL IP>
    Port <FILL PORT> 
    User root
    IdentityFile ~/.ssh/id_ed25519
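    # Work around the auto-started tmux session that breaks VS Code remote development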
    RemoteCommand bash -l
    RequestTTY no

With a bit of reallocating the machine (loading two 20B models plus the environment used over 150GB of disk, with more expected), I was able to locally run both GPT-OSS and its safeguarding sibling.
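
For reference, a minimal generation sanity check with transformers looks roughly like the sketch below; the chat pipeline is just one option, and the model id and throwaway policy text are placeholders, since I have not yet settled on a "correct" prompt format for the safeguard model.

from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="openai/gpt-oss-safeguard-20b",  # swap in openai/gpt-oss-20b for the base model
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {"role": "system", "content": "You are a content-safety classifier. Policy: no harassment."},
    {"role": "user", "content": "Classify this message: 'you are the worst'"},
]
result = generator(messages, max_new_tokens=256)
print(result[0]["generated_text"][-1]["content"])  # last message is the model's reply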

Huggingface Inference. My initial curiosity was around model diffing for the safeguard variant, which needs local parameter access. However, some of my questions just need blackbox access. While the various HF blackbox APIs don't always support the latest models, it seems like this might work.
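
The blackbox route would look something like the sketch below via huggingface_hub, assuming some inference provider actually serves the model (which I have not yet confirmed); the model id and policy text are again placeholders.

from huggingface_hub import InferenceClient

# Uses the token saved by `huggingface-cli login`
client = InferenceClient(model="openai/gpt-oss-safeguard-20b")

response = client.chat_completion(
    messages=[
        {"role": "system", "content": "You are a content-safety classifier. Policy: no harassment."},
        {"role": "user", "content": "Classify this message: 'you are the worst'"},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)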

Replicating The Results

An initial step in exploring the model is replicating their results on the datasets they use (ToxicChat (Lin et al., 2023) and a moderation dataset (Markov et al., 2023)). Unfortunately, they do not seem to give the prompts for their evals. I need to experiment more and see if I can replicate their numbers with a mix of policy prompts (a first guess is sketched below). Please share if you have found some sample prompts for the model.
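
To make "a mix of policy prompts" concrete, here is my own first guess at a binary toxicity policy for a ToxicChat-style eval; the policy wording, label format, and parsing are all my assumptions rather than anything OpenAI has published.

TOXICITY_POLICY = """\
You are a content moderator. Apply the following policy to the user message.

Policy: A message violates the policy if it contains harassment, hate, sexual content
involving minors, credible threats of violence, or instructions for causing serious harm.
Rudeness without targeted harassment does not violate the policy.

Answer on a single line with "VIOLATES" or "ALLOWED", followed by a one-sentence reason.
"""

def build_messages(user_text: str) -> list[dict]:
    return [
        {"role": "system", "content": TOXICITY_POLICY},
        {"role": "user", "content": f"Message to classify:\n{user_text}"},
    ]

def parse_label(model_output: str) -> int:
    # Map the model's verdict onto ToxicChat's 0/1 toxicity labels.
    verdict = model_output.strip().splitlines()[0].upper()
    return 1 if verdict.startswith("VIOLATES") else 0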

Conclusion

There are a lot of interesting things to explore with GPT-OSS-Safeguard. After today I have a better understanding of the model (and, for that matter, knowledge that it exists; I only saw it this morning), a better understanding of prior work on model diffing, and the infrastructure to run it. Hopefully I'll have some results starting tomorrow.