Dev Notes 12: Blog Improvements and Playing With Wikidata Prompts
By David Gros. Version 0.1.0

Dev Notes (DN) document progress towards a larger article. Discussion is preliminary.

These notes document continued progress on exploring Wikidata for LLM evaluation and interpretability. They build on the last four days of notes. In DN-08 I gave a rough outline of some of the topics that might be interesting to explore with LLMs. This includes questions like studying LLM coverage of certain sets of knowledge (eg, studying how many birthdays a given LLM knows, and how one might determine that number), as well as studying how well LLMs can recognize and articulate patterns in these entities. DN-09 discusses progress in processing Wikidata and assembles some motivation and prior work references. DN-10 discusses creating a "miniwikidata" that is easier to work with for the study. DN-11 discusses how these can be formed into prompts for LLMs to test how well they find connections between entities.

Some findings and progress for today:

Simple Run of LLMs on Prompts

The Process

We first gather query sets for the dataset. As described in DN-11#1.2, we sample a set of true entities matching a (property, target) pattern. We also select a set of entities which do not have this (property, target); a chosen fraction of these negatives also have the given property but resolve to a different target. In the example below, the ground truth pattern is (occupation, composer).
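Concretely, the sampling step might look roughly like the sketch below. The data structures, helper name, and default fractions here are hypothetical stand-ins for the actual miniwikidata files and DN-11 hyperparameters, not the real pipeline code.

import random

# Sketch of query-set sampling: positives match (property, target); a fraction of the
# negatives are "hard" (same property, different target) and the rest lack the property.
def sample_query_set(entities, prop, target, n_pos=10, n_neg=10, hard_neg_frac=0.5):
    """entities: dict mapping entity label -> dict of property -> set of target values."""
    positives = [e for e, claims in entities.items()
                 if target in claims.get(prop, set())]
    hard_negatives = [e for e, claims in entities.items()
                      if prop in claims and target not in claims[prop]]
    easy_negatives = [e for e, claims in entities.items() if prop not in claims]

    n_hard = int(hard_neg_frac * n_neg)
    # Assumes each pool is large enough to sample from without replacement.
    return (random.sample(positives, n_pos),
            random.sample(hard_negatives, n_hard)
            + random.sample(easy_negatives, n_neg - n_hard))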
The main prompt built from one such query set (Listing 1) is:
You will be given some examples of labeled text.
Try to identify the pattern of the labeling.
---
'Brandy Norwood' -> true
'Kip S. Thorne' -> false
'Sachin Dev Burman' -> true
'Lilas Ikuta' -> true
'Frank Farian' -> true
'The Phoenician Scheme' -> false
'Vivianne Miedema' -> false
'MC Ren' -> true
'Bill Wyman' -> true
'Jelly Roll Morton' -> true
'Why Him?' -> false
'William Baldwin' -> false
'Rich Piana' -> false
'The Real Housewives' -> false
'Choi Siwon' -> false
'Casey Affleck' -> false
'Nick Drake' -> true
'Lin-Manuel Miranda' -> true
'Chamillionaire' -> true
'Rui Patrício' -> false
---
What is the pattern of the labeling?
Feel free to think through the problem if necessary.
Place your final answer in an `<final_answer>` xml tag.
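Assembling a Listing-1-style prompt from the sampled names is plain string formatting; a minimal sketch (the wording matches the listing above, but the function name is mine):

import random

def build_pattern_prompt(positives, negatives):
    """Format labeled entity names into a Listing 1 style prompt."""
    labeled = [(name, "true") for name in positives] + [(name, "false") for name in negatives]
    random.shuffle(labeled)  # interleave true and false examples
    lines = [
        "You will be given some examples of labeled text.",
        "Try to identify the pattern of the labeling.",
        "---",
    ]
    lines += [f"'{name}' -> {label}" for name, label in labeled]
    lines += [
        "---",
        "What is the pattern of the labeling?",
        "Feel free to think through the problem if necessary.",
        "Place your final answer in an `<final_answer>` xml tag.",
    ]
    return "\n".join(lines)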
I use the following prompt to convert this into a natural language label:
We are studying entities which have a certain property and value in Wikidata.
We are forming this as a binary classification problem where entities
are classified as true if they have the property and value, and false otherwise.
---
Property: {property} ({property_id})
Target Value: {target} ({target_id})
---
Convert this relationship into a natural language description.
Please give only your natural language description for the classification rule.
It should only be a single sentence.
This converts the ground truth into:

An entity is classified as true if it has the occupation of composer.

After running the main Listing 1 prompt to GPT-4.1-nano using my package lmwrapper, I get the following result:
Names labeled as true are full personal names
(including stage names, nicknames, or names with multiple parts)
that clearly identify an individual, whereas names labeled as false are
either phrases, titles, or names with features (like initials, accents,
or less common formats) that make them not straightforward
personal names in this context.
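Before scoring, the model's answer is pulled out of the `<final_answer>` tag that the Listing 1 prompt asks for. A minimal sketch (the helper name is mine):

import re

def extract_final_answer(completion: str) -> str:
    """Return the text inside <final_answer>...</final_answer>, or the whole completion if the tag is missing."""
    match = re.search(r"<final_answer>(.*?)</final_answer>", completion, re.DOTALL)
    return match.group(1).strip() if match else completion.strip()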
Finally we use a score prompt to grade the answer against the ground truth:

We are studying a task of identifying the relationship between collections of entities.
This was the ground truth answer:
---
{gt_answer}
---
This was the answer given:
---
{answer}
---
Give an integer score between 0 and 3
0: The answer is completely wrong and or misleading (eg, the ground truth is countries in Europe, and the answer says it is a kind of fruit)
1: The answer is does not capture the pattern, but generally would classify correctly (eg, the ground truth is countries in Europe, and the answer says place names)
2: The answer seems partially correct, but misses some nuance (eg, the ground truth is countries in Europe, and the answer just says countries)
3: The answer captures the classification category of the ground truth
In your output to the user please only give the score, no other text.
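Filling in the score template and reading back the judge's integer reply is straightforward string handling; a sketch, assuming the template uses str.format-style {gt_answer} and {answer} slots as written above (helper names are mine):

def build_score_prompt(template: str, gt_answer: str, answer: str) -> str:
    """Fill the score-prompt template shown above."""
    return template.format(gt_answer=gt_answer, answer=answer)

def parse_score(reply: str) -> int:
    """The judge is asked to reply with only the score, but be slightly defensive."""
    for token in reply.split():
        token = token.strip(".")
        if token in {"0", "1", "2", "3"}:
            return int(token)
    raise ValueError(f"could not parse a score from: {reply!r}")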
Small Sample Results
In this case the output scores a 0. We run 20 prompts through GPT-4.1-nano. The score distribution is shown in the table below:

Score  Count  Percent  At Least %
0      13     65.00%   100.00%
1      6      30.00%   35.00%
2      1      5.00%    5.00%
3      0      0.00%    0.00%

We observe that it does not generally succeed at this task. In previous hand tests using Claude Sonnet 4.5 this seemed to work better. I will explore using a larger model tomorrow. Today's notes demonstrate our ability to construct prompts from Wikidata and evaluate the results.
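For reference, the table above can be computed from the raw list of integer scores with a small helper; a sketch:

from collections import Counter

def score_table(scores, levels=(0, 1, 2, 3)):
    """Return (score, count, percent, percent at least this score) rows."""
    counts = Counter(scores)
    total = len(scores)
    rows = []
    for level in levels:
        count = counts[level]
        at_least = sum(counts[l] for l in levels if l >= level)
        rows.append((level, count, 100 * count / total, 100 * at_least / total))
    return rows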
Blog Features

I spent probably too much time today building some extra blog features.

Math Support

In DN-11 I explained some of the hyperparameters in this exploration, and it would be nice to have math-mode styling for notation like that. I built some features for this. These include inline equations as well as display equations. I use KaTeX for this, rendering serverside in node. Iff there are math equations, we include the necessary font for KaTeX. I explored a bit about how font files work and an interesting idea of packing only the characters that are actually used in the article to reduce the font size. However, this is just an interesting hyper-optimization, and it was not implemented.
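The "only ship the KaTeX font when it is needed" check could be as simple as looking for KaTeX markup in the rendered HTML; a sketch with hypothetical asset paths (KaTeX's server-side renderToString wraps output in elements with the katex class):

def needs_katex_assets(rendered_html: str) -> bool:
    # KaTeX's renderToString output uses <span class="katex"> wrappers.
    return 'class="katex' in rendered_html

def head_links(rendered_html: str) -> list:
    links = ['<link rel="stylesheet" href="/css/main.css">']  # hypothetical base stylesheet
    if needs_katex_assets(rendered_html):
        # KaTeX's stylesheet pulls in its fonts, so it is only added when math is present.
        links.append('<link rel="stylesheet" href="/css/katex.min.css">')
    return links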
Quote Support

While a previous article used quotes, I now have a more general way of doing it.

All models are wrong, but some are useful.
Code Blocks

I added support for basic code blocks/listings. These were used above. However, they also support syntax highlighting via Pygments. For example:
def greet(name):
    return f"Hello, {name}!"

print(greet("World"))
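Highlighting like that can be produced at build time with Pygments' standard API; a minimal sketch:

from pygments import highlight
from pygments.lexers import get_lexer_by_name
from pygments.formatters import HtmlFormatter

def highlight_block(code: str, language: str = "python") -> str:
    """Render a code block to HTML spans; the matching CSS comes from HtmlFormatter().get_style_defs()."""
    return highlight(code, get_lexer_by_name(language), HtmlFormatter())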
A lot of this also simultaneously happened with exploring workflows for how to use Claude Code Web conveniently. Unlike some of its vibecode-first web competitors, it does not have good support for previewing frontend rendering. Part of the pitch for background agents is to be able to have multiple threads working on tasks simultaneously. In practice, this was somewhat clunky. I had a few commands in my shell history that I toggled between for checking out, pulling, and previewing different branches that were all being done in Claude Code Web.

A possible way to do this as well is the GitHub CLI. I might look into ergonomic ways of working with that, but it is still somewhat clunky. There are plenty of low-hanging startup/open source ideas here for making this process smoother. I might prototype this for a future article.

Conclusion

Here we made some progress on using Wikidata to study LLM connection-finding ability. Tomorrow I hope to run on more examples and with a larger model.
