Dev Notes 07: Additional Notes on Streets Analysis and Updates
By David Gros. Version 0.1.0

Dev Notes (DN) document progress on larger articles. Discussion is preliminary.

Today's notes are somewhat peculiar, as they focus on improvements to yesterday's article rather than work towards an upcoming article. I had hoped to begin an AI-focused article with connections to mechanistic interpretability. However, there were several things I still wanted to do for yesterday's analysis of street names, and I ended up spending most of my time on that.

Findings for today include:

Yesterday's article discussed work in scraping street names from OpenStreetMap (as discussed in DN-06 and the Nov 9 article on most common names). These updates have been merged into the main article already, but I am breaking them out here to discuss a few more things that didn't make it into the article, and to more cleanly meet the slightly arbitrary goal of having a distinct post every day of November.

We add analysis looking at the most common words in the scraped data. While I had expected "Street" to be most common, "Road" took the lead instead. "North" is the most common direction mentioned.

Additionally, I explored which words are unusual in street names. We use an algorithm (TF-IDF) that search engines use to find keywords. It identifies words that appear in the streets of a given state while being unusual in other states. We plot the top two street keywords for every state in Figure 2.

To identify distinctive words for each state (Figure 2), we adapt the TF-IDF method (Spärck Jones, 1972). Each street name is treated as a document, and we use binary TF (0 or 1), so TF-IDF simplifies to just the IDF score. For each word, IDF = ln(total streets / number of streets containing the word). We sum these scores over all streets in each state and use the sums to rank words. To avoid common terms, we drop the 25 most frequent words as "stop words" (like road, street, avenue) and require at least 10 occurrences of a word in a state to include it.
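The scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the article: the function name `state_keywords`, the lowercase whitespace tokenization, and the choice to pick stop words by how many street names contain them are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def state_keywords(streets, n_stop=25, min_count=10, top_k=2):
    """Rank distinctive words per state with a binary-TF variant of TF-IDF.

    streets: iterable of (state, street_name) pairs (hypothetical input shape).
    Returns a dict mapping each state to its top_k highest-scoring words.
    """
    streets = list(streets)
    total = len(streets)

    doc_freq = Counter()                  # street names containing each word
    state_counts = defaultdict(Counter)   # same count, split by state
    for state, name in streets:
        # Binary TF: a word counts at most once per street name.
        words = set(name.lower().split())  # assumed tokenization
        doc_freq.update(words)
        state_counts[state].update(words)

    # Assumption: "stop words" are the n_stop words found in the most names.
    stop_words = {w for w, _ in doc_freq.most_common(n_stop)}

    result = {}
    for state, counts in state_counts.items():
        # Summing IDF over every street in the state that contains the word
        # equals (streets in state with the word) * IDF(word).
        scores = {
            w: c * math.log(total / doc_freq[w])
            for w, c in counts.items()
            if w not in stop_words and c >= min_count
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[state] = ranked[:top_k]
    return result


if __name__ == "__main__":
    toy = (
        [("TX", "Ranch Road")] * 3 + [("TX", "Main Road")] * 2
        + [("ME", "Pine Road")] * 3 + [("ME", "Main Road")] * 2
    )
    # Small thresholds so the toy data produces output.
    print(state_keywords(toy, n_stop=1, min_count=2))
```

Because the per-state score is a count multiplied by the IDF, a state with more streets can accumulate a higher raw score for the same word, which is worth keeping in mind when comparing scores across states.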
It's unclear if TF-IDF is the best way to do this. It is partly biased towards larger states. Also, the decision of where to put the "stop word" limit is not principled, but it affects the result. I briefly researched and thought about more information-theoretic approaches. For example, how much information does learning a given street name give you about which state it is in? However, this gets complicated. Given that I was partly writing for a general audience, being able to gesture at a "standard search engine algorithm", which reasonably fits TF-IDF, seemed like the best approach.

While this won't go in the main article, I briefly explored president surnames in the streets data. Unsurprisingly, Washington (in this case with his state honors) leads as the most common president surname. Currently there are 11 roads with Obama, 79 with Trump, and 7 with Biden. It will be interesting to see how this shifts over time.

The values in Figure 2 are loosely semantically coded (e.g., "Lake" is blue, "forest" is green). This kind of detail would traditionally have added extra work to making a visualization. However, using an LLM, I could just ask it to read through the list of all the keywords and then make a dictionary with a color associated with every one. It feels like a magical paradigm of semantic programming.

Additionally, for the first time in this project I explored "transpiling via LLMs". I developed a way of parsing the OpenStreetMap data in Python (with heavy help from LLMs). However, this was inefficient, using more memory than my laptop had available. Rather than optimize the code, I was able to ask an LLM to port it to Rust. This worked remarkably well. It did not quite work on the first try, as the output differed slightly from that of the Python code. However, with brief iteration, I arrived at a solution that was vastly more efficient than the Python code. This paradigm of "LLMs as compiler" is remarkable.

Today's notes are a somewhat weird reflection on the past article.
I hope to have interesting new things to discuss later in the week.

Improvements on State Street Analysis
Most Common Words
Keyword Analysis
President Surnames
Thoughts on the Magic of LLM-based development
Conclusions
