Dev Notes 14: Progress on Gathering and Processing IMDb Data
By David Gros. Version 0.1.0

Dev Notes (DN) document progress towards some larger article. Discussion is preliminary.

Today's notes build on yesterday's notes, DN-13, which explored the distribution of a movie's IMDb ratings. That exploration was motivated by a quirk of the movie KPop Demon Hunters, whose rating distribution has a noticeable gap at the 9 rating.

Gathering Data

I gathered the ratings of the 3000 films of the 21st century with the most ratings. At #2020, KPop Demon Hunters is fairly deep into this list (which is partially why I gathered so many). I originally sorted by number of ratings rather than rating score because I didn't want to include only "good" movies. However, I later merged this list of the top 3000 21st-century films by number of ratings with a list of the top 3000 films of any era by rating value.

Filters? In an initial analysis, some of the films that really stand out as having a peculiar distribution are non-American films or fairly unknown films. I don't currently have the country of origin in my data. I do have the film's MPA rating (PG, PG-13, etc.) as well as the Metascore, if listed. For now I will filter out movies that are missing these ratings. This leaves 2862 films.

How many have a 9-gap?

Let $p_i$ be the percent of raters who gave a movie a score of $i$, where $i \in \{1, \dots, 10\}$. We say a film has a 9-gap at all if $p_9 < p_8$ and $p_9 < p_{10}$.

Among my current set of 2862 films, about 1658 (58%) have any 9-gap. So it's not that unusual for people to shy away from giving out 9s. However, often this 9-gap is fairly small and not very noticeable (because the number of 10s is also low).

Quantifying a 9-gap

Next we try to look for the largest 9-gap. There are a couple of reasonable ways to do this. Here's my current thought:

$$(p_{10} - p_9) + (p_8 - p_9) - |p_{10} - p_8|$$

The first two terms just capture how many percentage points the 9 is below the 10 and the 8. The third term looks at the absolute difference between the 10 and the 8. Some films just have a strong bias towards 10s. Other films have more of a "9-cliff" than a "9-gap". Including this third term helps narrow in on films where the 10 and the 8 are similar, creating a clear gap.
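As a minimal sketch of the definitions above (not the actual code behind these notes), the following Python assumes each film's ratings are available as a list of ten percentages. The function names, the equal weighting of the three terms, the `balance_term` flag, and the sample distributions are all illustrative assumptions.

```python
from typing import Sequence


def has_nine_gap(pct: Sequence[float]) -> bool:
    """True if the share of 9s is below both the share of 8s and of 10s.

    pct[i - 1] is the percent of raters who gave score i (1 through 10).
    """
    p8, p9, p10 = pct[7], pct[8], pct[9]
    return p9 < p8 and p9 < p10


def nine_gap_score(pct: Sequence[float], balance_term: bool = True) -> float:
    """Score the size of the 9-gap in percentage points.

    Assumes equal weights on all terms. With balance_term=True, the
    absolute difference between the 10s and the 8s is subtracted, which
    penalizes films where one side towers over the other.
    """
    p8, p9, p10 = pct[7], pct[8], pct[9]
    score = (p10 - p9) + (p8 - p9)
    if balance_term:
        score -= abs(p10 - p8)
    return score


# Made-up distributions for illustration only (percent for scores 1..10).
films = {
    "clear gap": [2, 1, 1, 2, 4, 8, 16, 24, 12, 30],
    "ten-heavy": [3, 1, 1, 1, 2, 3, 5, 8, 6, 70],
    "no gap": [1, 1, 2, 3, 6, 12, 25, 28, 15, 7],
}

# Rank films from largest to smallest 9-gap score.
ranked = sorted(films, key=lambda name: nine_gap_score(films[name]), reverse=True)
print(ranked)
```

In this toy data the "ten-heavy" film technically has a 9-gap, but the third term pushes it below the "clear gap" film. Calling `nine_gap_score(..., balance_term=False)` reverses that order, which shows how sensitive the ranking is to this choice.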
The third term is debatable, though, and arguably should be removed.

Highest 9-gap

Among the current set of films, the top 9-gap films are:

[Table: top films ranked by 9-gap score]

So KPop Demon Hunters, at number 6, does indeed make the top 10 here. It is also the mean rating on the list. However, this is method sensitive. For example, if we change the formula to

$$(p_{10} - p_9) + (p_8 - p_9)$$

without the absolute value term keeping the 8 near the 10, it does not make the top 10. I think this term is fair for the question though.

"I'm Still Here" is a Brazilian film that comes in at number 1. However, this is mostly because it has a bunch of 10s and very few 8s or 9s. So it is not really a 9-gap.

Other kinds of distributions?

I originally just wanted to focus on the 9-gap. However, there are a lot of other interesting distributions. For example: "10ers" with the most 10s; "anti-9-gap" films, where the 9 is greater than both the 8 and the 10 (these appear to be very rare); a generalization of bimodality; films whose ratings are mostly 1s or 10s; or generally just high entropy / high uniformity. These are each interesting in their own way, but I need to figure out which should make the cut for an article.

Other Notes on Wikidata Project

I'm trying not to abandon that project, as it is cool. However, between this and some other things going on today, I did not add to that draft.

Conclusions

This IMDb data is really interesting. Today's notes demonstrate being able to rank films by these values. KPop Demon Hunters does indeed have one of the largest gaps, but not quite the highest. It seems like I have basically all I need for an article here. I want to make a few components to nicely render these distribution graphs. I think that is doable tomorrow.
