In Progress -- Applications of Wikidata for LLM Evaluation and Automatic Mechinterp Validation: A Pilot Study and Commentary

By . . Version 0.0.0

This is an early in-progress draft that will be built out over the next few days. It builds on several Dev Notes (08, 09, 10, 11, 12). This debatably should be another Dev Note, but I want to start condensing an article outline today even if results and structure will evolve. Many numbers are still missing as I sketch out the outline, and any discussion is extremely preliminary.

Abstract: There is great interest in understanding how LLMs represent knowledge, as well as in using LLMs themselves for automated interpretability of LLM internals. In this pilot study we explore some uses of Wikidata, a structured knowledge base with over 1.6 billion facts, to help shed light on these questions.

In the first part, we apply Wikidata to study LLM knowledge comprehensiveness, exploring questions like "how many total birthdays does an LLM know?". Of the 7 million birthdays in Wikidata, a model like X appears to know X%.

In the second part, we apply Wikidata to study LLMs' ability to articulate patterns in text. We show an LLM a set of entities which all match a pattern (e.g., all born in Toronto) and a set which do not, and task the LLM with articulating the pattern. This has parallels to the task of using LLMs for automated interpretability of sparse features or circuits. Our task highlights the many design decisions involved, such as how examples are selected and how prompts are constructed. In a small pilot we find that LLMs like GPT-X can often get close to the pattern, but finding the true underlying pattern is rare. This points to the challenge of faithfully reconstructing unknown features.

This small exploration was done during one week and is not in-depth. However, it does contribute preliminary results for discussion, and provides pointers to interesting areas of future work. We also contribute artifacts ("wikidata-tiny24", "wikiconnect-tinybinary") which can make it easier for researchers to explore Wikidata.

Introduction

Modern LLMs combine incredible knowledge with incredible pattern-finding abilities. One can ask an LLM like Claude Sonnet 4.5 fairly obscure trivia, say "What is Paul Rudd's birthday?", and it will instantly recall that the Ant-Man actor's birthday is April 6, 1969. Simultaneously, when asked "What is the connection between Paul Rudd and Tobey Maguire?", Claude Opus 4.5 without web search will work out that the Ant-Man and Spider-Man stars shared the screen in the fairly obscure 1999 film The Cider House Rules (notably, the smaller Claude Sonnet model misses this connection). Somewhere in the model this vast trove of knowledge and connections is stored and retrieved, though exactly how this happens remains poorly understood. Understanding these internals could have important value for managing LLM deception, auditing, and alignment.

So how are we to study this process?

Wikidata? Wikidata is like Wikipedia, but structured. Data is primarily represented in the form (subject entity, property, target). This provides a rich source for studying these questions, one which deserves exploration and discussion. Knowledge bases have long had a role in AI, and many others have studied connections between LLMs and sources like Wikidata. However, we X.

We study the following research questions: Narrow

Background

Preliminaries: At a simple level, LLMs take in tokens, each of which becomes a dense vector; these vectors pass through several layers, where each layer updates an opaque dense vector and passes information between tokens, before the final dense representation is projected to predict the next token. Vectors are considered dense when most elements are non-zero.

Individual elements of these dense vectors do not always correspond to concepts or algorithms the model has learned, as several features can be represented in superposition across the same dimensions (Elhage et al., 2022).

Feature Extraction and Explanation: To address the challenges of dense vectors, several techniques have been proposed. Sparse autoencoders (Bricken et al., 2023; Cunningham et al., 2023) and crosscoders (Lindsey et al., 2024) can help convert dense representations inside the network into sparser activations which might relate to individual concepts in the model. However, even once sparse features are found, one is still at an impasse: one only knows that there is some text where a given feature activates more and some text where it activates less. One does not necessarily know what concept the feature corresponds to, or whether it corresponds to an explainable concept at all.
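
For intuition, the encoder/decoder shape of a sparse autoencoder is roughly the following (a minimal sketch assuming a plain ReLU parameterization, not any particular paper's exact setup):

    # Minimal sketch of a sparse autoencoder forward pass on one residual vector.
    import torch

    def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
        f = torch.relu(W_enc @ x + b_enc)   # (d_sae,) mostly-zero feature activations
        x_hat = W_dec @ f + b_dec           # (d_model,) reconstruction of the dense vector
        return f, x_hat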

One approach to this is to show an LLM examples of the text where a given (sparse) feature activates and have it explain the pattern. Bills et al. (2023) showed GPT-4's ability to explain certain neurons (e.g., that a neuron appears to fire on movies), which was validated by having another instance of GPT-4 simulate activations from the explanation to measure how well the explanation predicted the neuron's behavior. However, there was no known ground truth for the underlying neuron. Sherburn et al. (2024) created 20 hand-crafted categories of rules and measured how well LLMs could articulate the rules from classification examples. The manual process of defining these rules meant that exploration was limited to this relatively small set of rules.
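
At its core, the simulate-and-score step reduces to comparing real and simulated activations, for example via correlation (a simplified sketch of the scoring idea only, not Bills et al.'s actual pipeline):

    # Simplified sketch: score an explanation by how well activations simulated
    # from it track the real per-token activations.
    import numpy as np

    def explanation_score(real_acts, simulated_acts) -> float:
        real = np.asarray(real_acts, dtype=float)
        sim = np.asarray(simulated_acts, dtype=float)
        return float(np.corrcoef(real, sim)[0, 1])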

Please see Section 6 for further discussion of prior work.

Processing Wikidata for Easier Use

Basic Structure

Wikidata represents facts primarily as triples of (subject entity, predicate/property, target object). These are referred to as "snaks" in the database, though interestingly there does not appear to be any agreed-upon etymology for that name. For example, one can have the subject "Paul Rudd (Q276525)" and the property "date of birth"; the target can be either another entity or a literal value such as a date. Additionally, Wikidata supports qualifiers on a triple, such as a reference source or when a measurement (e.g., "population") was taken. For our study we mostly ignore qualifiers.
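
For concreteness, here is a minimal sketch of how one of these statements might be represented in code (field names are my own, not Wikidata's JSON schema; P569 is the standard "date of birth" property, and the entity ID follows the example above):

    # Minimal sketch of one Wikidata statement ("snak") plus optional qualifiers.
    from dataclasses import dataclass, field

    @dataclass
    class Statement:
        subject: str       # e.g. "Q276525" (Paul Rudd)
        predicate: str     # e.g. "P569" (date of birth)
        target: str        # another entity ID or a literal such as a date
        qualifiers: dict = field(default_factory=dict)  # e.g. {"P585": "2024-01-01"} (point in time)

    rudd_birthday = Statement(subject="Q276525", predicate="P569", target="1969-04-06")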

There are several ways of working with Wikidata. The main way is a query language called SPARQL, which is like SQL but designed for graph-structured data. There is a public query interface which supports about 5 simultaneous requests, and each request must complete in about 1 second. One can also use a REST endpoint to fetch all the data for an individual entity, at a rate of about X entities per second. Alternatively, one can download a full dump, which as of late 2025 is about 120 GB compressed.
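
As an example of the per-entity route, the sketch below pulls one entity's full JSON from the public Special:EntityData endpoint (the User-Agent string is a placeholder, and error handling is minimal):

    # Sketch: fetch all data for a single entity from the public endpoint
    # https://www.wikidata.org/wiki/Special:EntityData/<QID>.json
    import requests

    def fetch_entity(qid: str) -> dict:
        url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
        resp = requests.get(url, headers={"User-Agent": "wikidata-pilot/0.1"}, timeout=30)
        resp.raise_for_status()
        return resp.json()["entities"][qid]

    entity = fetch_entity("Q276525")                   # Paul Rudd, per the earlier example
    birth_claims = entity["claims"].get("P569", [])    # P569 = date of birth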

For some of our RQs, using SPARQL is challenging within the 1-second API limits. Working with a full dump is expensive and contains an abundance of data not useful for our study (e.g., a ton of chess matches).

We create a "mini wikidata" which is a subgraph focusing on key entities.

Identifying Key Entities: Wikidata internally does not have a notion of an entity's importance. As a workaround, we leverage the fact that most important Wikidata entities have a corresponding Wikipedia page. We estimate the 100k most viewed Wikipedia pages in 2024 by sampling 20 hours from the hourly page view dumps. Table X shows some of the entities at the top, middle, and bottom of this high-view 100k estimate. Some appear to represent temporary trends, highlighting that we would likely benefit from a larger sample.
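
Roughly, the tallying step looks like the sketch below (file paths are illustrative; the hourly files live under dumps.wikimedia.org/other/pageviews/ and use a space-separated "domain title views bytes" line format):

    # Sketch: tally English Wikipedia desktop views across sampled hourly dumps.
    import gzip
    from collections import Counter

    def tally_views(dump_paths):
        counts = Counter()
        for path in dump_paths:
            with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
                for line in f:
                    parts = line.split(" ")
                    if len(parts) != 4:
                        continue
                    domain, title, views, _bytes = parts
                    if domain == "en":          # desktop English Wikipedia
                        counts[title] += int(views)
        return counts

    # top_pages = tally_views(sampled_hourly_files).most_common(100_000)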

We then use the REST API to connect these with Wikidata, finding a match for X% of the pages. From this list of 100k we sample 10k entities, for which we download the full data and process it into a clean tabular structure that is easier for us or others to work with. These are a much more manageable size and will be available on HuggingFace Datasets later this week.
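
The title-to-item lookup itself is straightforward via the MediaWiki API's pageprops (a sketch; batching and error handling are omitted):

    # Sketch: map a Wikipedia page title to its Wikidata item ID ("wikibase_item").
    import requests

    def title_to_qid(title: str) -> str | None:
        resp = requests.get(
            "https://en.wikipedia.org/w/api.php",
            params={
                "action": "query",
                "prop": "pageprops",
                "ppprop": "wikibase_item",
                "titles": title,
                "format": "json",
            },
            headers={"User-Agent": "wikidata-pilot/0.1"},
            timeout=30,
        )
        pages = resp.json()["query"]["pages"]
        for page in pages.values():
            return page.get("pageprops", {}).get("wikibase_item")
        return None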

Knowledge Comprehensiveness

In our first application of Wikidata, we explore knowledge comprehensiveness. We narrow in on the question "How many birthdays has an LLM memorized?". These are relatively stable facts, which humans have basic intuitions about remembering. In humans there is a (crf?)

Who Gets A Wikibirthday?

Using a SPARQL query, we estimate there are about X million entities with a birthday in Wikidata. This is an interesting population, which we briefly discuss. Gender, occupation, month
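
The estimate comes from a count query along these lines (shown for illustration; the exact count may need sampling or batching to stay within the public endpoint's limits):

    # Sketch of the COUNT query behind the estimate (P569 = date of birth).
    import requests

    QUERY = """
    SELECT (COUNT(?person) AS ?n) WHERE {
      ?person wdt:P569 ?dob .
    }
    """

    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "wikidata-pilot/0.1"},
        timeout=60,
    )
    n_birthdays = int(resp.json()["results"]["bindings"][0]["n"]["value"])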

We randomly sample 1000 of these birthdays for testing LLM knowledge.
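
One simple way to score each sampled person is exact match on a normalized date string, sketched below; the `ask` argument is a hypothetical wrapper around whichever chat model is tested, and the prompt wording is just an assumption, not a settled protocol:

    # Hedged sketch of per-person scoring; `ask` is a hypothetical model wrapper.
    import re
    from typing import Callable

    def knows_birthday(ask: Callable[[str], str], name: str, true_date: str) -> bool:
        """Ask for a birthday in YYYY-MM-DD form and compare it to Wikidata's value."""
        answer = ask(f"What is {name}'s date of birth? "
                     "Reply with only the date in YYYY-MM-DD format.")
        match = re.search(r"\d{4}-\d{2}-\d{2}", answer)
        return bool(match) and match.group(0) == true_date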

Measuring Knowledge Coverage

Directions

It seems interesting to study what determines. Prior work has identified some mechanisms for how LLMs embed time (ref). Prior work has also done deep dives into trying to work out specific mechanisms, such as X-digit arithmetic and X.

Connection-Finding Methodology

Here we'll go through our process of finding the connecting entities.

Filtering Data

Not all triples are good for this task. The filtering starts with the process of sampling popular entities described above in Section 2.2. We then process our mini-Wikidata with X triples, filter to properties that occur at least X times, filter out cases where the property occurs X times, and keep properties with a "concentration" of at least X.
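
The frequency part of this filter is simple to sketch (the thresholds are placeholders, and the "concentration" criterion is omitted here since its exact definition is still being pinned down):

    # Sketch of the property-frequency filter over (subject, property, target) triples.
    from collections import Counter

    def filter_by_property_frequency(triples, min_count=50):   # threshold is a placeholder
        prop_counts = Counter(prop for _subj, prop, _tgt in triples)
        return [t for t in triples if prop_counts[t[1]] >= min_count]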

Entity Set Selection Design Space

Prompt Construction Design Space

Evaluating

Modeling Design Space

Connection-Finding Results

Model Classification Skill

Rule Articulation Accuracy

Influence of Entity Set

compare 10 v 50.

Related Work

In Section X we discussed prior work establishing some open problems. Here we briefly survey some of the other work in this area.

Other Applications of Wikidata

NYT Connections

Future Directions

Artifact Contributions

Limitations and Challenges

This is only a pilot study, and it is not extensive. Here are some challenges for further work in this area.

Wikidata Has Errors. The exact reliability rate for these kinds of facts is unclear. As one scales up the number of examples given in a prompt, one increases the chance that at least one of them is erroneous.
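
For illustration with made-up numbers: if each fact independently has a 1% chance of being wrong, a prompt containing 50 examples has about a 1 - 0.99^50 ≈ 40% chance of including at least one error.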

Relations Can Be Obscure. Many relations in Wikidata can be fairly obscure. Additional curation work is needed to make sure the tasks serve as a good analogue for the questions we care to ask about LLM knowledge and internals.

Conclusion

Wikidata is a really fun source to play around with. This pilot study was a partial revisiting of some work I did in 2023 framed around synthesizing executable programs to explain X. Wikidata queries are a subset of that.

I remain deeply uncertain about the tractability of explaining any proportion of the mechanics of models. I was somewhat disa

Acknowledgements: Thanks