Dev Notes 11: Progress In Identifying Connected Entity Sets for LLM Evaluation

By . . Version 0.1.0

Dev Notes (DN) discuss incremental progress towards a larger article. Discussion is preliminary.

Today's notes discuss progress towards constructing prompts for LLMs to find the connection between related entities, motivated by problems in automated mechanistic interpretability. This is all related to a general curiosity about Wikidata and its application to studying LLMs.

These notes build on the last three days of notes. In DN-08 I gave a rough outline of some of the topics that might be interesting to explore with LLMs. This includes questions like studying LLM coverage of certain sets of knowledge (eg, studying how many birthdays a given LLM knows, and how we can know that number), as well as studying how well LLMs can recognize and articulate patterns in these entities. DN-09 discusses progress in processing Wikidata. DN-10 discusses creating a "miniwikidata" that is easier to work with for the study; it also collects some writing on the motivation for this study and starts gathering references to prior work to contextualize it. I won't repeat all the motivation in these notes, but plan to in the future article.

Some findings and progress for today:

  • I explored some methods of filtering the properties down to a good set for playing a "super connection" game.
  • I built up a method for converting these into queries and outlined some of the progress still needed here.

Filtering to Good Property Sets

In DN-10 I discussed some of the current pipeline so far in filtering the data.

It involves first using page view data to get a set of interesting entities. I use a sample of 20 hours of page view data (since Wikipedia reports views with each hour getting its own file). I reproduce the resulting top entities here for reference.

| Rank | Page | Views | Wikidata ID |
|------|------|-------|-------------|
| 1 | Deaths in 2024 | 121,859 | Q123489953 |
| 2 | Cleopatra | 105,883 | Q635 |
| 3 | J. D. Vance | 97,342 | (not joined) |
| 4 | XXXTentacion | 82,487 | Q28561969 |
| 5 | Weightlifting at the 2024 Summer Olympics – Women's 49 kg | 75,740 | Q116495986 |
| 6 | Bob Menendez | 75,108 | Q888132 |
| 7 | YouTube | 73,930 | Q866 |
| 8 | Pornhub | 72,666 | Q936394 |
| 9 | Lyle and Erik Menéndez | 71,410 | (not joined) |
| 10 | Biggest ball of twine | 67,400 | Q4906916 |
| 11 | Kepler's Supernova | 66,182 | Q320670 |
| 12 | .xxx | 65,772 | Q481 |
| 13 | Tim Walz | 65,425 | Q2434360 |
| 14 | 2024 Summer Olympics | 62,453 | Q995653 |
| 15 | Matthew Hudson-Smith | 61,339 | Q16575549 |
| 16 | Deadpool & Wolverine | 61,319 | Q102180106 |
| 17 | Pushpa 2: The Rule | 57,510 | Q112083510 |
| 18 | 2024 Indian general election | 55,112 | Q65042773 |
| 19 | XXX (2002 film) | 52,061 | Q283799 |
| 20 | Portal:Current events | 51,853 | Q4597488 |
Table 1. Top 20 Wikipedia/Wikidata entity pages by views (sampled from 20 random hours in 2024). Not all pages are successfully joined to a Wikidata entity. Table entry hyperlinks lead to the Wikipedia and Wikidata pages.

Tangent: I mentioned yesterday I was confused by the inclusion of Cleopatra in this data. In discussion with a friend, they pointed out that there was a brief viral trend in 2024 of dissing Vance by comparing him to Cleopatra because he wore eyeliner. This might explain things a bit. It still likely means I need a larger sample of hours, because for that brief window Vance and Cleopatra were just getting a lot of views. A larger sample would reveal whether this held for more of 2024 and help me narrow in on the important entities.

We then download these entities, and form a mini graph of the data in an easy-to-work-with Parquet file.
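For concreteness, here is a minimal sketch of what that flattening step might look like, assuming the downloaded entities sit in a line-delimited JSON file of standard Wikidata entity objects; the file name, column names, and exact field handling are placeholders for illustration, not the actual pipeline code.

```python
# Sketch: flatten downloaded Wikidata entities into (source, property, target)
# triplets and store them in a single Parquet file. File/column names are
# illustrative assumptions, not the real pipeline's.
import json
import pandas as pd

rows = []
with open("entities.json") as f:          # assumed: one entity JSON object per line
    for line in f:
        entity = json.loads(line)
        source = entity["id"]             # e.g. "Q635"
        for prop, claims in entity.get("claims", {}).items():
            for claim in claims:
                value = claim["mainsnak"].get("datavalue", {}).get("value")
                # keep only item-valued claims (targets that are other entities)
                if isinstance(value, dict) and "id" in value:
                    rows.append((source, prop, value["id"]))

triplets = pd.DataFrame(rows, columns=["source", "property", "target"])
triplets.to_parquet("miniwikidata.parquet", index=False)
```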

Forming the Query Sets

Today I wanted to convert these into actual prompts for the LLMs. As a reminder, the Wikidata examples are (source, property, target) triplets. Some properties, like gender, have their values concentrated in just a few targets. (That is its own interesting exploration: about 60% of entities with that property map to male, a bit of a gender bias in Wikidata entities, though this is a weird sample of 10k of the 100k top page view entities. Broader exploration would be interesting.) So to get good candidates for our connections game, we filter to properties where the most frequent target accounts for at most 40% of that property's triplets. Additionally, we filter to properties that occur at least 50 times in our sample and to targets that occur at least 10 times.
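A rough sketch of that filtering over the triplet Parquet file is below. The thresholds are the ones stated above; the file and column names are assumptions carried over from the earlier sketch.

```python
# Sketch: keep (property, target) pairs suitable for the connections game.
# Filters: property occurs >= 50 times, target occurs >= 10 times, and the
# most frequent target of a property accounts for at most 40% of it.
import pandas as pd

triplets = pd.read_parquet("miniwikidata.parquet")  # assumed columns: source, property, target

prop_counts = triplets["property"].value_counts()
pair_counts = triplets.groupby(["property", "target"]).size().rename("count").reset_index()

# fraction of each property's triplets taken by each target
pair_counts["share"] = pair_counts["count"] / pair_counts["property"].map(prop_counts)

# property-level checks: frequent enough, and no single target dominates
prop_ok = (prop_counts >= 50) & (pair_counts.groupby("property")["share"].max() <= 0.40)

good_pairs = pair_counts[
    pair_counts["property"].map(prop_ok).astype(bool)   # property passes both checks
    & (pair_counts["count"] >= 10)                      # target occurs >= 10 times
]
```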

Parameters in the Query Set

There are several formulations of this "super connection" game. We start with a binary formulation. We have a set T of entities that share a (property, target), and a set F of entities that do not match this (property, target). One way to build F is to sample from the set of all entities. However, this can lead to weird results. For example, if we have (property=occupation, target=actor), then drawing F from the set of all entities will include some entities that are not even people. In this case the LLM might identify the false pattern of just "are these people or not". So we set a configurable parameter s, which is the fraction of entities in F that share the same property as those in T. (This was written quickly; I sadly lack LaTeX math support in my backend render and will try to add it in a future post.)
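Below is a minimal sketch of how T, F, and the prompt could be assembled, assuming the triplet DataFrame from earlier plus a labels dict mapping entity IDs to English names. The function and its signature are hypothetical placeholders, not the code I'm actually running.

```python
# Sketch: sample T (entities matching a (property, target) pair) and F
# (non-matching entities, a fraction s of which share the property but not
# the target), then render the labeled prompt. Names are illustrative.
import random

def build_prompt(triplets, labels, prop, target, n_true=10, n_false=10, s=0.5, seed=0):
    rng = random.Random(seed)

    true_pool = set(triplets.loc[
        (triplets["property"] == prop) & (triplets["target"] == target), "source"])
    hard_pool = set(triplets.loc[
        (triplets["property"] == prop) & (triplets["target"] != target), "source"]) - true_pool
    easy_pool = set(triplets["source"]) - true_pool - hard_pool  # entities lacking the property

    n_hard = int(round(s * n_false))   # distractors that share the property
    t_ids = rng.sample(sorted(true_pool), n_true)
    f_ids = rng.sample(sorted(hard_pool), n_hard) + rng.sample(sorted(easy_pool), n_false - n_hard)

    examples = [(labels[q], "true") for q in t_ids] + [(labels[q], "false") for q in f_ids]
    rng.shuffle(examples)

    body = "\n".join(f"'{name}' -> {label}" for name, label in examples)
    return (
        "You will be given some examples of labeled text.\n"
        "Try to identify the pattern of the labeling.\n"
        "---\n" + body + "\n---\n"
        "What is the pattern of the labeling?"
    )
```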

Here is an example of a prompt we construct when |T| = 10, |F| = 10, and s=0.5.

You will be given some examples of labeled text.
Try to identify the pattern of the labeling.
---
'David Cronenberg' -> true
'Tim Duncan' -> false
'Catherine O'Hara' -> true
'Yang Hyun-suk' -> false
'A Wrinkle in Time' -> false
'Drew Seeley' -> true
'Chris Potter' -> true
'Jamie Drysdale' -> true
'Guillermo del Toro's Pinocchio' -> false
'Summer McIntosh' -> true
'Connor Price' -> true
'Tyler Hynes' -> true
'Annabelle Dexter-Jones' -> false
'Stephan James' -> true
'Kat Timpf' -> false
'Cindy Breakspeare' -> true
'Alice in Wonderland' -> false
'Ari Up' -> false
'Kalyani Priyadarshan' -> false
'Gordon Getty' -> false
---
What is the pattern of the labeling?

This is the part where I'd ask the reader if they can figure it out. It is also the part where I'd nicely hide the answer. The site doesn't have a component for that right now, so here's the answer.

The true entities share the (place of birth, Toronto) property. So in this little sample example, can LLMs figure it out? Claude 4.5, ChatGPT (whatever the default is on the free web GUI), and Gemini 2.5 Pro all identify the pattern as "these entities are Canadian". This is true, but it misses the full nuance of the feature.

Well, that was the story anyway. When exploring this a bit more as a spot check, I actually uncovered that "Drew Seeley" (apparently famous for providing the singing voice for Troy Bolton in High School Musical, dubbing over Zac Efron) is listed as born in Toronto in Wikidata, while other sources online list Ottawa. Hm... I didn't particularly cherry-pick this (I printed a few prompts and then selected an interesting-seeming one). But having even one mislabel from Wikidata in this small sample is something that needs to be investigated further.

Conclusion

These notes were, unfortunately, again somewhat rambly. We showed how to generate prompts, but not yet how to run them at large scale. That seems within sight, though. This exploration will continue tomorrow.