Dev Notes 09: Notes on Finding Key Entities and Relations in Wikidata

By . . Version 0.1.0

Dev Notes (DN) discuss progress towards a larger article. Discussion is preliminary.

In yesterday's DN-08 I discussed initial progress on an exploration of using Wikidata to study some facets of LLM knowledge and connection-finding abilities. Today's notes document some progress towards assembling a set of entities for that exploration.

Some key findings:

  • I am shifting away from using the full Wikidata dump for this project. The public query endpoint is easier to work with and may still get me what I want, and a comprehensive pass over the dump is infeasible within my time budget.
  • I learned some details about approaches for retrieving relevant entities for this exploration.

The following notes are unfortunately more of a rough ramble than well-condensed thoughts.

Building a SPARQL API

Wikidata uses a query language called SPARQL. Like SQL it is declarative, but it is structured around the graph nature of Wikidata: a query is written as triple patterns (subject, predicate, object) that are matched against the knowledge graph.
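
For concreteness, here is a minimal sketch of what such a query looks like against the public endpoint at https://query.wikidata.org/sparql (the entity and property IDs are just illustrative, and the User-Agent string is made up):

```python
import requests

ENDPOINT = "https://query.wikidata.org/sparql"

# Each line in the WHERE block is a triple pattern; this one reads
# "?item is an instance of (P31) human (Q5)".
QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q5 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
"""

resp = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "wikidata-exploration/0.1 (dev notes example)"},
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["item"]["value"], row["itemLabel"]["value"])
```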

I spent some time exploring SPARQL queries to understand how they work and what I can actually do with them.

I built out a basic API that makes queries. I also cache the results, so that code which issues the same query more than once does not hit the endpoint redundantly. I explored the limits of the SPARQL endpoint as well: the main constraint seems to be that queries must complete within 60 seconds. In my experiments this is somewhat constraining, but not completely limiting.
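
The caching layer can be as simple as keying an on-disk store by a hash of the query text. A minimal sketch of that idea (not my exact implementation; the function and cache-directory names are made up):

```python
import hashlib
import json
import pathlib

import requests

ENDPOINT = "https://query.wikidata.org/sparql"
CACHE_DIR = pathlib.Path("sparql_cache")  # hypothetical cache location
CACHE_DIR.mkdir(exist_ok=True)

def run_query(query: str) -> dict:
    """Run a SPARQL query, reusing cached JSON results when available."""
    key = hashlib.sha256(query.encode()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    resp = requests.get(
        ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "wikidata-exploration/0.1 (dev notes example)"},
        timeout=70,  # the endpoint itself aborts queries after 60 seconds
    )
    resp.raise_for_status()
    cache_file.write_text(resp.text)
    return resp.json()
```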

I built the API with this constraint in mind. It handles multiple concurrent connections and retries, has some ability to deal with pagination, and converts the SPARQL JSON results into Parquet files for easier downstream handling, since I later want to process the results with Polars.
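
The JSON-to-Parquet step is mostly about flattening the bindings structure the endpoint returns. Something along these lines, as a sketch that assumes the hypothetical run_query helper above (retries, concurrency, and LIMIT/OFFSET pagination omitted):

```python
import polars as pl

def bindings_to_parquet(result_json: dict, path: str) -> pl.DataFrame:
    """Flatten SPARQL JSON results into a Polars DataFrame and write Parquet."""
    variables = result_json["head"]["vars"]
    rows = [
        {v: binding.get(v, {}).get("value") for v in variables}
        for binding in result_json["results"]["bindings"]
    ]
    df = pl.DataFrame(rows)
    df.write_parquet(path)
    return df

# Usage: bindings_to_parquet(run_query(QUERY), "humans.parquet")
```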

Finding most popular relations

With some iteration I got a set of roughly the top 100 relationships; attempting to query a larger number in SPARQL caused issues. However, many of these relationships were not very useful: they included reference IDs and other uninformative properties.
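
For reference, the shape of query I mean is something like the following. This is a hypothetical sketch rather than my exact query; the naive form scans every triple, which is presumably why larger limits run into the 60-second timeout:

```python
# Count how often each property appears as a direct claim.
TOP_PROPERTIES_QUERY = """
SELECT ?prop ?propLabel (COUNT(*) AS ?uses) WHERE {
  ?s ?p ?o .
  ?prop wikibase:directClaim ?p .  # map the predicate back to its property entity
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?prop ?propLabel
ORDER BY DESC(?uses)
LIMIT 100
"""
```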

This is a starting point to explore with, but I might need a more manual process to curate common and useful relations.

Finding the most popular entities

When constructing the set of items for playing the "super connections" game with the LLMs, I want some notion of how significant each entity is. I don't want to include truly obscure entities, or at least when I do, I want to know how that affects the results. I explored two approaches here. One is to look at the number of Wikipedia editions a given entity links to (its sitelink count); however, sorting by this metric ran into issues. Another is to look at the number of inbound edges into each entity, but this seems likely to surface many uninteresting entities that are not good for this exploration.
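
For the sitelink approach, the count is exposed directly in the query service via wikibase:sitelinks. A sketch of the kind of query involved (restricting to one class and pre-filtering is one way to dodge the timeout; the class and threshold here are arbitrary choices for illustration):

```python
# Rank humans by how many wiki editions have an article about them.
SITELINKS_QUERY = """
SELECT ?item ?itemLabel ?links WHERE {
  ?item wdt:P31 wd:Q5 ;
        wikibase:sitelinks ?links .
  FILTER(?links > 200)  # pre-filter so the sort stays tractable
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?links)
LIMIT 100
"""
```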

An alternative is to consider how popular the entity is on Wikipedia itself. This seems the most reliable signal. Unfortunately Wikipedia doesn't appear to publish this in a convenient format: there are pageview dumps, but they are aggregated hourly. My working plan is to download a random sample of ~10 hourly dumps from 2025 and aggregate their counts. This should let me get, say, the top 10,000 entities on Wikipedia, though some work is needed to estimate how easily these page titles can be linked back to Wikidata entries. From there, the approach might be to look at which properties backlink into those entities, or which properties forward-link out of them, in order to get a better set of interesting properties.
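
The aggregation step of that plan would look roughly like this with Polars, assuming already-downloaded and decompressed hourly files from https://dumps.wikimedia.org/other/pageviews/ (each line is "wiki_code page_title views bytes", with "en" denoting English Wikipedia; this is a sketch, not tested against the real files):

```python
import polars as pl

def top_titles(paths: list[str], n: int = 10_000) -> pl.DataFrame:
    """Sum pageviews across sampled hourly dump files and keep the top titles."""
    frames = [
        pl.read_csv(
            path,
            separator=" ",
            has_header=False,
            quote_char=None,  # titles may contain quote characters
            new_columns=["wiki", "title", "views", "bytes"],
        )
        for path in paths
    ]
    return (
        pl.concat(frames)
        .filter(pl.col("wiki") == "en")  # English Wikipedia only
        .group_by("title")
        .agg(pl.col("views").sum())
        .sort("views", descending=True)
        .head(n)
    )
```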

More effort is needed in this area.

Conclusions

Once again these notes were annoyingly rushed, with other things going on. I hope to continue iterating and to get a full set of valid entities for this exploration soon.