- The current focus of the tool is to visualize how GPT-2 stores knowledge about specific topics.
- The dataset consists of descriptions of cities with more than 100k inhabitants, scraped from Wikipedia (link). It totals ~4.4k cities.
- The first 1k characters of each city's Wikipedia page are scraped and cleaned, then run through the encoder from the GPT-2 repo. Each prompt is truncated to its first 30 tokens and serialized to disk for later use (a preprocessing sketch follows the examples below).
- First 5 examples from the dataset:
0 / 4386 https://en.wikipedia.org/wiki/Kabul : Kabul is the capital and largest city of Afghanistan, located in the eastern section of the country. It is also a municipality, forming part of the
1 / 4386 https://en.wikipedia.org/wiki/Herat : Herāt is an oasis city and the third-largest city of Afghanistan. In 2020, it had an estimated population of 574,276
2 / 4386 https://en.wikipedia.org/wiki/Kandahar : Kandahar is a city in Afghanistan, located in the south of the country on the Arghandab River, at an elevation of 1,
3 / 4386 https://en.wikipedia.org/wiki/Mazar-i-Sharif : Mazār-i-Sharīf, also called Mazār-e Sharīf, or just Mazar, is the fourth
4 / 4386 https://en.wikipedia.org/wiki/Jalalabad : Jalalabad is the fifth-largest city of Afghanistan. It has a population of about 356,274, and serves as the capital of N
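A minimal sketch of this preprocessing step, assuming the Hugging Face `transformers` tokenizer as a stand-in for the encoder in the GPT-2 repo; the `pages` mapping (URL → cleaned page text) is hypothetical:

```python
# Sketch of the prompt-building step: first 1k chars -> encode -> first 30 tokens.
# Assumes Hugging Face's GPT2Tokenizer in place of the GPT-2 repo's encoder;
# `pages` (url -> cleaned text) is a hypothetical input.
import pickle
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def build_prompts(pages: dict[str, str], max_tokens: int = 30) -> dict[str, list[int]]:
    """Encode the first 1k characters of each page, truncated to 30 tokens."""
    prompts = {}
    for url, text in pages.items():
        token_ids = tokenizer.encode(text[:1000])
        prompts[url] = token_ids[:max_tokens]
    return prompts

# Serialize to disk for the extraction passes below.
# prompts = build_prompts(pages)
# with open("prompts.pkl", "wb") as f:
#     pickle.dump(prompts, f)
```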
Extracting latent info
- Each prompt is encoded and passed through GPT-2 to extract activations (TODO: also extract logit lens).
- For each MLP (multi-layer perceptron) layer, we track which neurons fired at any point during the sequence, and at what activation value (see the hook sketch after this list).
- (TODO: Extract logit lens)
- Clustering is done after compressing the latent vectors down to 50 dims with PCA, then further to 2 dims with t-SNE (sketched below).
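A minimal sketch of the activation capture, assuming the Hugging Face `GPT2LMHeadModel`; "fired" is taken here to mean a positive post-GELU activation, which is an assumption since the notes don't specify the criterion:

```python
# Capture per-layer MLP activations via forward hooks on each block's
# activation module, then report which neurons fired anywhere in the
# sequence and their peak activation. "Fired" = post-GELU value > 0
# (assumed threshold, not specified in the notes).
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
mlp_acts: dict[int, torch.Tensor] = {}

def make_hook(layer_idx: int):
    def hook(module, inputs, output):
        # output: (batch, seq, 4 * hidden) post-GELU MLP activations
        mlp_acts[layer_idx] = output.detach()
    return hook

handles = [
    block.mlp.act.register_forward_hook(make_hook(i))
    for i, block in enumerate(model.transformer.h)
]

@torch.no_grad()
def fired_neurons(token_ids: list[int], layer: int):
    model(torch.tensor([token_ids]))
    acts = mlp_acts[layer][0]               # (seq, 4 * hidden)
    peak = acts.max(dim=0).values           # strongest activation per neuron
    fired = (peak > 0).nonzero().flatten()  # neurons that fired at any point
    return fired, peak[fired]
```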
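For the logit-lens TODO above, a hedged sketch of the standard formulation: project each layer's residual-stream hidden state through the final layer norm and the unembedding matrix to read off intermediate next-token predictions:

```python
# Standard logit-lens sketch (the notes only mark this as TODO):
# apply ln_f and the unembedding to every intermediate hidden state.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def logit_lens(token_ids: list[int]) -> torch.Tensor:
    out = model(torch.tensor([token_ids]), output_hidden_states=True)
    # hidden_states: one (batch, seq, hidden) tensor per layer, plus embeddings
    stacked = torch.stack(out.hidden_states)        # (layers+1, 1, seq, hidden)
    normed = model.transformer.ln_f(stacked)        # final layer norm
    return normed @ model.transformer.wte.weight.T  # (layers+1, 1, seq, vocab)
```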
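And a sketch of the projection feeding the clustering, using scikit-learn; `latents` is a hypothetical `(n_cities, d)` array of the per-prompt latent vectors described above:

```python
# PCA to 50 dims, then t-SNE down to 2, as described in the notes.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def project_2d(latents: np.ndarray, seed: int = 0) -> np.ndarray:
    """Compress latent vectors to 50 dims with PCA, then to 2 dims with t-SNE."""
    reduced = PCA(n_components=50, random_state=seed).fit_transform(latents)
    return TSNE(n_components=2, random_state=seed).fit_transform(reduced)
```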