Dev Notes 11: Progress In Identifying Connected Entity Sets for LLM Evaluation
By David Gros. Version 0.1.0.

Dev Notes (DN) discuss incremental progress towards a larger article. Discussion is preliminary.

Today's notes discuss progress towards constructing prompts for LLMs to find the connection between related entities, motivated by problems in automated mechanistic interpretability. This is all related to a general curiosity about Wikidata and its application for studying LLMs.

These notes build on the last three days of notes. In DN-08 I gave a rough outline of some of the topics that might be interesting to explore with LLMs. This includes questions like studying LLM coverage of certain sets of knowledge (eg, studying how many birthdays a given LLM knows, and how it knows that given number), as well as studying how well LLMs can recognize and articulate patterns in these entities. DN-09 discusses progress in processing Wikidata. DN-10 discusses creating a "miniwikidata" that is easier to work with for the study; it also collects some writing on the motivation for this study and starts gathering references to prior work to contextualize it. I won't repeat all the motivation in these notes, but plan to in the future article.

Some findings and progress for today:

In DN-10 I discussed the current pipeline so far for filtering the data. It involves first using page view data to get a set of interesting entities. I use a sample of 20 hours of page view data (since Wikipedia reports data with each hour getting its own file). I duplicate the resulting top pages here for reference.

| Rank | Page | Views | Wikidata ID |
| --- | --- | --- | --- |
| 1 | Deaths in 2024 | 121,859 | Q123489953 |
| 2 | Cleopatra | 105,883 | Q635 |
| 3 | J. D. Vance | 97,342 | — |
| 4 | XXXTentacion | 82,487 | Q28561969 |
| 5 | Weightlifting at the 2024 Summer Olympics – Women's 49 kg | 75,740 | Q116495986 |
| 6 | Bob Menendez | 75,108 | Q888132 |
| 7 | YouTube | 73,930 | Q866 |
| 8 | Pornhub | 72,666 | Q936394 |
| 9 | Lyle and Erik Menéndez | 71,410 | — |
| 10 | Biggest ball of twine | 67,400 | Q4906916 |
| 11 | Kepler's Supernova | 66,182 | Q320670 |
| 12 | .xxx | 65,772 | Q481 |
| 13 | Tim Walz | 65,425 | Q2434360 |
| 14 | 2024 Summer Olympics | 62,453 | Q995653 |
| 15 | Matthew Hudson-Smith | 61,339 | Q16575549 |
| 16 | Deadpool & Wolverine | 61,319 | Q102180106 |
| 17 | Pushpa 2: The Rule | 57,510 | Q112083510 |
| 18 | 2024 Indian general election | 55,112 | Q65042773 |
| 19 | XXX (2002 film) | 52,061 | Q283799 |
| 20 | Portal:Current events | 51,853 | Q4597488 |

Tangent: I mentioned yesterday I was confused by the inclusion of Cleopatra in this data. In discussion with a friend, they pointed out that Cleopatra and J.D. Vance were apparently part of a brief 2024 viral trend of dissing Vance as being like Cleopatra because he wore eyeliner. This might explain things a bit. It still likely means I need a larger sample of hours, because for that brief hour Vance and Cleopatra were just getting a lot of views. A larger sample would reveal whether this held for more of 2024 and help me narrow in on the important entities; a rough sketch of aggregating a larger sample of hours is below.
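As a concrete illustration, here is a minimal sketch of what summing views over many hourly files could look like. The directory layout and filenames are hypothetical, and the per-line format is my recollection of the standard Wikimedia "pageviews" hourly dumps, so treat this as a starting point rather than the actual pipeline code.

```python
import glob
import gzip
from collections import Counter

# Sum article views across many hourly dump files rather than a handful.
# Assumed line format (from memory of the Wikimedia "pageviews" dumps):
#   <domain_code> <page_title> <count_views> <total_response_size>
view_totals = Counter()

for path in sorted(glob.glob("pageviews/pageviews-2024*.gz")):  # hypothetical local layout
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.split(" ")
            if len(parts) != 4:
                continue
            domain, title, views, _ = parts
            if domain == "en":  # desktop English Wikipedia only, for simplicity
                view_totals[title] += int(views)

for title, views in view_totals.most_common(20):
    print(f"{views:>10,}  {title}")
```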
From that page view step, we then download these entities and form a mini graph of the data in an easy-to-work-with Parquet file. Today I wanted to convert these into actual prompts for the LLMs.

Filtering to Good Property Sets

As a reminder, the Wikidata examples are (source, property, target) triplets. Some of these are properties like gender where the values are concentrated in a few targets. (This is its own interesting exploration: about 60% of entities with that property map to male, a bit of a gender bias in Wikidata entities, though this is a weird sample of 10k of the 100k top page view entities. Broader exploration would be interesting.) So to get good ones for our connections game, we filter to properties where the most frequent target represents at most 40% of that property's triplets. Additionally, we filter to properties that occur at least 50 times in our sample, and targets that occur at least 10 times. A rough sketch of this filtering step is below.
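To make the thresholds concrete, here is a minimal sketch of the filtering step in pandas. The Parquet filename and the column names (source, property, target) are my assumptions about the mini-Wikidata schema rather than the actual code, and I interpret the target threshold as a count per (property, target) pair.

```python
import pandas as pd

# Hypothetical schema: one row per (source, property, target) triplet.
triples = pd.read_parquet("mini_wikidata_triples.parquet")

MIN_PROPERTY_COUNT = 50     # property must appear at least 50 times in the sample
MIN_TARGET_COUNT = 10       # (property, target) pair must appear at least 10 times
MAX_TOP_TARGET_FRAC = 0.40  # most frequent target may cover at most 40% of a property

prop_counts = triples["property"].value_counts()
pair_counts = triples.groupby(["property", "target"]).size()

# Fraction of each property's triplets taken up by its single most common target.
top_target_frac = pair_counts.groupby(level="property").max() / prop_counts

prop_ok = (prop_counts >= MIN_PROPERTY_COUNT) & (top_target_frac <= MAX_TOP_TARGET_FRAC)
good_props = prop_ok[prop_ok].index

# Keep (property, target) pairs that are frequent enough to build a query set from.
good_pairs = pair_counts[pair_counts >= MIN_TARGET_COUNT]
good_pairs = good_pairs[good_pairs.index.get_level_values("property").isin(good_props)]

print(good_pairs.sort_values(ascending=False).head(20))
```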
Forming the query sets

There are several formulations of this "super connection" game. We start with a binary formulation. We have a set T of entities which share a given (property, target). We also have a set F of entities which do not match this (property, target). One way to build F is to sample entities from the set of all entities. However, this might lead to weird results. For example, if we have (property=occupation, target=actor), then drawing F from the set of all entities will include some entities which are not even people. In this case the LLM might identify the false pattern of just "are these people or not". To control for this, we set a configurable parameter s, which is the fraction of the entities in F that share the same property as those in T. (This was written quickly, and I have a sad lack of LaTeX math support in my backend renderer; I will try to add proper notation in a future post.) A sketch of how the query sets and a prompt could be assembled is below.
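Here is a rough sketch of assembling T, F, and the prompt text under those definitions. The function name, column names, and the labels mapping (entity ID to English label) are my assumptions, and the prompt wording simply follows the example shown in the next section, so this is a sketch rather than the exact pipeline.

```python
import random
import pandas as pd

def build_prompt(triples: pd.DataFrame, labels: dict, prop: str, target: str,
                 n_true: int = 10, n_false: int = 10, s: float = 0.5,
                 seed: int = 0) -> str:
    """Build one binary "connections" prompt for a chosen (property, target) pair.

    s is the fraction of false entities that share the property but not the target.
    """
    rng = random.Random(seed)

    # True set T: entities that have this exact (property, target) claim.
    matches = triples[(triples["property"] == prop) & (triples["target"] == target)]
    true_ids = rng.sample(sorted(matches["source"].unique()), n_true)

    # Harder false candidates: same property, different target.
    # (A fuller version would also exclude entities holding the target via another claim.)
    same_prop = set(triples.loc[triples["property"] == prop, "source"]) - set(matches["source"])
    # Easier false candidates: entities without this property at all.
    other = set(triples["source"]) - set(matches["source"]) - same_prop

    n_hard = round(s * n_false)
    false_ids = (rng.sample(sorted(same_prop), n_hard)
                 + rng.sample(sorted(other), n_false - n_hard))

    # Shuffle true and false examples together and render the prompt text.
    examples = [(labels[e], "true") for e in true_ids] + [(labels[e], "false") for e in false_ids]
    rng.shuffle(examples)
    body = "\n".join(f"'{name}' -> {verdict}" for name, verdict in examples)
    return ("You will be given some examples of labeled text.\n"
            "Try to identify the pattern of the labeling.\n"
            "---\n" + body + "\n---\n"
            "What is the pattern of the labeling?")
```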
Parameters in the Query Set

Here is an example of a prompt we construct when |T| = 10, |F| = 10, and s = 0.5. This is the part where I'd ask the reader if they can figure it out. It is also the part where I'd nicely hide the answer; the site doesn't have a component for that right now, so the answer is given right after the prompt.
You will be given some examples of labeled text.
Try to identify the pattern of the labeling.
---
'David Cronenberg' -> true
'Tim Duncan' -> false
'Catherine O'Hara' -> true
'Yang Hyun-suk' -> false
'A Wrinkle in Time' -> false
'Drew Seeley' -> true
'Chris Potter' -> true
'Jamie Drysdale' -> true
'Guillermo del Toro's Pinocchio' -> false
'Summer McIntosh' -> true
'Connor Price' -> true
'Tyler Hynes' -> true
'Annabelle Dexter-Jones' -> false
'Stephan James' -> true
'Kat Timpf' -> false
'Cindy Breakspeare' -> true
'Alice in Wonderland' -> false
'Ari Up' -> false
'Kalyani Priyadarshan' -> false
'Gordon Getty' -> false
---
What is the pattern of the labeling?
The true entities share the property of (place of birth, Toronto). So in this little sample example, can LLMs figure it out? Claude 4.5, ChatGPT (whatever the default is on the free web GUI), and Gemini 2.5 Pro all identify the pattern as "these entities are Canadian". This is true, but it misses the full nuance of the feature.

Well, that was the story anyway. When exploring this a bit more as a spot check, I actually uncovered that Drew Seeley (apparently famous for providing the singing voice for Troy Bolton in High School Musical, dubbing over Zac Efron) is listed as born in Toronto in Wikidata, while other sources online list Ottawa. Hm... I didn't cherry pick this example (I printed a few prompts and then selected an interesting-seeming one), but having even one mislabel in a sample this small is something that needs to be investigated further. One way to spot-check a claim like this is sketched below.
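For instance, here is a minimal sketch of pulling the place-of-birth (P19) claims for the article's linked Wikidata item via the wbgetentities API action. The response parsing reflects my recollection of the JSON layout, so treat it as a starting point rather than verified code.

```python
import requests

# Look up the Wikidata item linked to the English Wikipedia article,
# then print its place-of-birth (P19) claims so they can be checked by hand.
API = "https://www.wikidata.org/w/api.php"
params = {
    "action": "wbgetentities",
    "sites": "enwiki",
    "titles": "Drew Seeley",
    "props": "claims",
    "format": "json",
}
data = requests.get(API, params=params, timeout=30).json()

for qid, entity in data.get("entities", {}).items():
    for claim in entity.get("claims", {}).get("P19", []):  # P19 = place of birth
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value", {})
        print(qid, "place of birth ->", value.get("id"))
```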
Conclusion

These notes were again somewhat rambly, unfortunately. We show how to generate prompts, but not how to run them at large scale. That seems in sight, though. This exploration will continue tomorrow.