Dev Notes 12: Blog Improvements and Playing With Wikidata Prompts

Version 0.1.0

Dev Notes (DN) document progress towards a larger article. Discussion is preliminary.

These notes document continued progress on exploring Wikidata for LLM evaluation and interpretability, building on the last four days of notes. In DN-08 I gave a rough outline of some of the topics that might be interesting to explore with LLMs. This includes questions like studying LLM coverage of certain sets of knowledge (e.g., how many birthdays a given LLM knows, and how we can measure that number), as well as studying how well LLMs can recognize and articulate patterns in these entities. DN-09 discusses progress in processing Wikidata and assembles some motivation and prior-work references. DN-10 discusses creating a "miniwikidata" that is easier to work with for the study. DN-11 discusses how these can be formed into prompts for LLMs to test how well they find connections between entities.

Some findings and progress for today:

  • We show the full process of constructing prompts, sampling completions from an LLM (in this case GPT-4.1-nano), and scoring the resulting outputs.
  • GPT-4.1-nano scores poorly at finding the pattern when |T| = 10, |F| = 10, and s = 0.5 (using DN-11#1.2 notation); on a small sample of 20 examples it never completely matches the pattern. Larger samples and larger models are needed.
  • There were several blog improvements, such as LaTeX support, quotes, and listings, which should make this and future articles a little nicer.

Simple Run of LLMs on Prompts

The process

We first gather query sets for the dataset. As described in DN-11#1.2, we sample a set T of true entities matching a (property, target) pattern. We also select a set F of entities that do not have this (property, target). Finally, we choose a fraction s: the fraction of entities in F that have the given property but resolve to a different target.
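
As a rough sketch, this sampling might look something like the following (the entity dictionaries and helper function are hypothetical placeholders, not the actual miniwikidata code):

import random

def sample_query_set(entities, prop, target, t_size=10, f_size=10, s=0.5):
    # `entities` is assumed to be a list of dicts like
    # {"label": "Nick Drake", "claims": {"occupation": ["composer", "singer"]}}.
    true_pool = [e for e in entities if target in e["claims"].get(prop, [])]
    # Entities that have the property but with a different target value.
    near_pool = [e for e in entities
                 if prop in e["claims"] and target not in e["claims"][prop]]
    # Entities without the property at all.
    far_pool = [e for e in entities if prop not in e["claims"]]

    T = random.sample(true_pool, t_size)
    n_near = round(s * f_size)  # fraction s of F shares the property
    F = random.sample(near_pool, n_near) + random.sample(far_pool, f_size - n_near)
    return T, F

The entities in T and F are then interleaved into a prompt like the one below.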

You will be given some examples of labeled text.
Try to identify the pattern of the labeling.
---
'Brandy Norwood' -> true
'Kip S. Thorne' -> false
'Sachin Dev Burman' -> true
'Lilas Ikuta' -> true
'Frank Farian' -> true
'The Phoenician Scheme' -> false
'Vivianne Miedema' -> false
'MC Ren' -> true
'Bill Wyman' -> true
'Jelly Roll Morton' -> true
'Why Him?' -> false
'William Baldwin' -> false
'Rich Piana' -> false
'The Real Housewives' -> false
'Choi Siwon' -> false
'Casey Affleck' -> false
'Nick Drake' -> true
'Lin-Manuel Miranda' -> true
'Chamillionaire' -> true
'Rui Patrício' -> false
---
What is the pattern of the labeling?
Feel free to think through the problem if necessary.
Place your final answer in an `<final_answer>` xml tag.
Listing 1. Prompt for binary connection finding

These examples have a ground truth of (occupation, composer).
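
A minimal sketch of how a prompt like Listing 1 could be assembled from T and F (the template wording is copied from the listing; the function itself is illustrative rather than the real code):

import random

PATTERN_PROMPT = """You will be given some examples of labeled text.
Try to identify the pattern of the labeling.
---
{examples}
---
What is the pattern of the labeling?
Feel free to think through the problem if necessary.
Place your final answer in an `<final_answer>` xml tag."""

def build_pattern_prompt(T, F):
    labeled = [(e["label"], "true") for e in T] + [(e["label"], "false") for e in F]
    random.shuffle(labeled)  # mix the true and false examples together
    examples = "\n".join(f"'{name}' -> {label}" for name, label in labeled)
    return PATTERN_PROMPT.format(examples=examples)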

I use the following prompt to convert this into a natural language label.

We are studying entities which have a certain property and value in Wikidata.
We are forming this as a binary classification problem where entities
are classified as true if they have the property and value, and false otherwise.
---
Property: {property} ({property_id})
Target Value: {target} ({target_id})
---
Convert this relationship into a natural language description.
Please give only your natural language description for the classification rule.
It should only be a single sentence.
Listing 2. Prompt for converting a (property, target) in NL
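
Filling this template for the running example is plain string formatting. A small sketch (the template text is that of Listing 2, and the Wikidata IDs are, to the best of my knowledge, P106 for occupation and Q36834 for composer):

NL_RULE_TEMPLATE = """We are studying entities which have a certain property and value in Wikidata.
We are forming this as a binary classification problem where entities
are classified as true if they have the property and value, and false otherwise.
---
Property: {property} ({property_id})
Target Value: {target} ({target_id})
---
Convert this relationship into a natural language description.
Please give only your natural language description for the classification rule.
It should only be a single sentence."""

nl_rule_prompt = NL_RULE_TEMPLATE.format(
    property="occupation", property_id="P106",  # P106: occupation
    target="composer", target_id="Q36834",      # Q36834: composer
)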

This converts the ground truth into

An entity is classified as true if it has the occupation of composer.

After running the main Listing 1 prompt through GPT-4.1-nano using my package lmwrapper, I get the following result:

Names labeled as true are full personal names 
  (including stage names, nicknames, or names with multiple parts) 
  that clearly identify an individual, whereas names labeled as false are 
  either phrases, titles, or names with features (like initials, accents, 
  or less common formats) that make them not straightforward 
  personal names in this context.
Listing 4. GPT-4.1-nano final answer after reasoning
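
Because the main prompt asks for the answer inside an `<final_answer>` xml tag, pulling it out of the raw completion can be a simple regex. A sketch (the call to GPT-4.1-nano itself goes through lmwrapper and is not shown here):

import re

def extract_final_answer(completion_text: str) -> str | None:
    # Grab whatever the model placed inside the <final_answer> tag.
    match = re.search(r"<final_answer>(.*?)</final_answer>",
                      completion_text, flags=re.DOTALL)
    return match.group(1).strip() if match else None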

Finally, we use a scoring prompt:

We are studying a task of identifying the relationship between collections of entities.
This was the ground truth answer:
---
{gt_answer}
---
This was the answer given:
---
{answer}
---
Give an integer score between 0 and 3
  0: The answer is completely wrong and/or misleading (eg, the ground truth is countries in Europe, and the answer says it is a kind of fruit)
  1: The answer does not capture the pattern, but generally would classify correctly (eg, the ground truth is countries in Europe, and the answer says place names)
  2: The answer seems partially correct, but misses some nuance (eg, the ground truth is countries in Europe, and the answer just says countries)
  3: The answer captures the classification category of the ground truth
In your output to the user please only give the score, no other text.
Listing 5. Scoring prompt

In this case the output scores a 0.
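
A sketch of this scoring step, assuming some generic call_llm helper is passed in (the real calls again go through lmwrapper):

def score_answer(score_template: str, gt_answer: str, answer: str, call_llm) -> int:
    # Fill the Listing 5 template, ask the judge model, and parse the 0-3 score.
    prompt = score_template.format(gt_answer=gt_answer, answer=answer)
    raw = call_llm(prompt)
    return int(raw.strip())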

Small Sample Results

We run 20 prompts through GPT-4.1-nano. The score distribution is shown in Table 1:

Score | Count | Percent | At Least %
0     | 13    | 65.00%  | 100.00%
1     | 6     | 30.00%  | 35.00%
2     | 1     | 5.00%   | 5.00%
3     | 0     | 0.00%   | 0.00%
Table 1. Score distribution for GPT-4.1-nano responses
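
Table 1 is straightforward to compute from the list of per-prompt scores. A small sketch (variable and function names are illustrative):

from collections import Counter

def score_distribution(scores, max_score=3):
    counts = Counter(scores)
    n = len(scores)
    rows = []
    for s in range(max_score + 1):
        at_least = sum(counts[k] for k in range(s, max_score + 1))
        rows.append((s, counts[s], 100 * counts[s] / n, 100 * at_least / n))
    return rows  # (score, count, percent, at-least percent)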

We observe that it does not generally succeed at this task. In previous hand tests using Claude Sonnet 4.5 this seemed to work better. I will explore using a larger model tomorrow. Today's notes demonstrate our ability to construct prompts from Wikidata and evaluate the results.

Blog Features

I spent probably too much time today building some extra blog features. In DN-11 I explained some of the hyperparameters in this exploration, and it would be nice to have math-mode styling like that used above, so I built some features for this.

LaTeX Support This includes inline equations like E = mc^2 and display equations:

P(A|B) = \frac{P(B|A)P(A)}{P(B)}

I use KaTeX for this, rendering server-side in Node. Only if there are math equations do we include the necessary font files for KaTeX. I explored a bit about how font files work, and an interesting idea is packing only the characters actually used in the article to reduce the font size. However, this is just an interesting hyper-optimization, and it was not implemented.
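
As a rough sketch of that conditional font logic (the real pipeline renders with KaTeX in Node; the stylesheet path here is a made-up placeholder):

def maybe_add_katex_assets(rendered_html: str) -> str:
    # KaTeX wraps rendered math in spans with class "katex", so only pages
    # that actually contain math need the extra stylesheet and fonts.
    katex_css = '<link rel="stylesheet" href="/assets/katex.min.css">'  # placeholder path
    if 'class="katex"' in rendered_html:
        return rendered_html.replace("</head>", katex_css + "\n</head>", 1)
    return rendered_html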

Quote Support While a previous article used quotes, I now have a more general way of doing it.

All models are wrong, but some are useful.

— George Box

Code Blocks I added support for basic code blocks/listings; these were used above. They also support syntax highlighting via Pygments, as sketched after Listing 6.

def greet(name):
    return f"Hello, {name}!"

print(greet("World"))
Listing 6. An example code listing with highlighting
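
Under the hood, a listing like this can be produced with a few lines of Pygments (a minimal sketch; the actual build code wires the output into the blog templates):

from pygments import highlight
from pygments.lexers import get_lexer_by_name
from pygments.formatters import HtmlFormatter

code = 'def greet(name):\n    return f"Hello, {name}!"\n\nprint(greet("World"))\n'
formatter = HtmlFormatter(linenos=True)       # also emit the line-number gutter
html = highlight(code, get_lexer_by_name("python"), formatter)
css = formatter.get_style_defs(".highlight")  # CSS rules for the token classes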

A lot of this also happened while simultaneously exploring workflows for using Claude Code Web conveniently. Unlike some of its vibecode-first web competitors, it does not have good support for previewing frontend rendering. Part of the pitch for background agents is being able to have multiple threads working on tasks simultaneously, but in practice this was somewhat clunky. I had a few commands in my shell history that I toggled between for checking out, pulling, and previewing the different branches that were all being worked on in Claude Code Web. Another possible way to do this is the GitHub CLI; I might look into ergonomic ways of working with that, but it still seems somewhat clunky. There are plenty of low-hanging startup/open-source ideas here for making this process smoother. I might prototype something for a future article.

Conclusion

Here we made some progress on using Wikidata to study LLM connection-finding ability. Tomorrow I hope to run on more examples and with a larger model.