Dev Notes 08: Initial Explorations with Wikidata
By David Gros. Version 0.1.0

Dev Notes (DN) documents track progress towards a larger article. Discussion is preliminary.

One amazing part of modern LLMs is their vast knowledge. They can also learn very complicated patterns. There has been some prior work trying to understand mechanistically what enables LLMs to retrieve and reason about these facts, but this exploration often does not make full use of available online resources. In these notes I discuss getting set up for an exploration of Wikidata as a stepping stone to a mechanistic interpretability study, and some initial findings.

What and Why of Wikidata

Wikidata is a structured format behind some of the data on Wikipedia. At a basic level, it is a triplet dataset which stores (entity, relationship, entity) statements. For example, (Barack Obama, born on, August 4, 1961). Together there are billions of these connections, forming a complex web. I hope to study some of the ways LLMs represent these connections, as well as how well they can verbally identify connections in text.

How Many Birthdays Do LLMs Know

Ask a modern LLM about an even slightly famous person, and it has somehow memorized their birthday. Using Wikidata, I hope to first explore one basic question: estimating what fraction of people with a Wikipedia-listed birthday various LLMs have memorized. This gives some hints at the capacity of the models to memorize facts. Additionally, I'm curious to peer into the network to see what is happening for birthdays it has memorized versus those it has not. When humans memorize birthdays, they can rely on complicated patterns. There are some friends' birthdays I remember by recalling a specific party that was right after New Year's Day; my memory of the fact goes through remembering some other event. It would be interesting if LLMs have similar patterns.
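To make the measurement concrete, here is a minimal sketch of the kind of harness this question implies. Everything in it is hypothetical: `ask_llm` is a stand-in for a real model call, the two gold entries stand in for Wikidata date-of-birth (P569) statements, and exact ISO-date matching is just one possible scoring rule, not a settled design.

```python
from datetime import date

# Gold (entity, date of birth) pairs, standing in for Wikidata P569 statements.
GOLD_BIRTHDAYS = {
    "Barack Obama": date(1961, 8, 4),
    "Ada Lovelace": date(1815, 12, 10),
}

def ask_llm(question: str) -> str:
    """Stand-in for a real model call; a real harness would hit an LLM API."""
    canned = {
        "When was Barack Obama born?": "1961-08-04",
        "When was Ada Lovelace born?": "1815-12-11",  # deliberately wrong
    }
    return canned[question]

def fraction_memorized(gold: dict[str, date]) -> float:
    """Ask about each person and count exact ISO-date matches."""
    hits = 0
    for person, dob in gold.items():
        answer = ask_llm(f"When was {person} born?")
        hits += (answer.strip() == dob.isoformat())
    return hits / len(gold)

print(fraction_memorized(GOLD_BIRTHDAYS))  # prints 0.5 with the canned answers
```

A real run would also need answer normalization (models often reply in prose like "August 4, 1961"), which is where most of the harness complexity would live.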
How many birthdays are on Wikidata

Birthdays are a nice fact to study: they are clearly a relationship from just one entity, they don't really change after being added, and most people with a birthday on Wikipedia have been referenced on the internet a fair amount. I put together some rough queries for counting certain relationships. The working number I have is about 7M birthdays. I hope to do a bit more validation to see whether these queries are comprehensive.

Super Connections

A popular word game is NYT Connections. Players are given 16 words and must organize them into 4 connected groups. There has been some study of using this game to evaluate LLMs, but these evaluations are limited to a few hundred puzzles. I think there are opportunities to extend the idea via the Wikidata graph to create tens of thousands of very challenging versions.

Connection to Mech Interp

One approach to understanding certain features within LLMs is to show an LLM cases where a neuron fires, and then try to get it to explain the pattern. This has some overlap with the NYT Connections game, where you must recognize patterns in words. By making a "super connections" framing, I think there are ways to study the limits of this approach, as well as to explore prompting approaches that do and do not work well.

Downloading and Processing Wikidata

A full dump of Wikidata is about 100GB to download. As discussed, there is an endpoint for making SPARQL database queries on Wikidata. However, it is pretty unclear how complete these queries are or what the rate limits are. Downloading and preprocessing the dataset will take hours of wall-clock compute time. I began this download so that I have it as an option if I need it.

Conclusions

These notes are mostly just getting ready for a new project. They are somewhat quick and brain-dump-y, but I hope to have them cleaned up. Like in DN-07, I found myself somewhat distracted today trying to finish up a prior exploration as well as some other activities. I hope to have some starting results to share tomorrow.
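For reference, the rough counting queries mentioned above are along these lines. This is a sketch rather than the exact query used: it assumes the standard Wikidata identifiers P31 ("instance of"), Q5 ("human"), and P569 ("date of birth"), and it only builds (does not send) a request URL for the public endpoint at query.wikidata.org, since a full count like this can easily hit the endpoint's timeout.

```python
import urllib.parse

# Count humans (P31 = Q5) that have a date of birth (P569).
# The wd:/wdt: prefixes are predefined on the Wikidata Query Service.
BIRTHDAY_COUNT_QUERY = """
SELECT (COUNT(?person) AS ?n) WHERE {
  ?person wdt:P31 wd:Q5 ;
          wdt:P569 ?dob .
}
"""

ENDPOINT = "https://query.wikidata.org/sparql"

def build_request_url(query: str) -> str:
    """URL for a GET request asking for JSON results. Not sent here:
    a heavy aggregate like this may time out on the public endpoint."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return f"{ENDPOINT}?{params}"

print(build_request_url(BIRTHDAY_COUNT_QUERY))
```

Validating comprehensiveness would mean comparing counts like this against the full dump, e.g. checking whether restricting to Q5 misses humans typed only via subclasses.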
