- The current focus of the tool is to visualize how GPT-2 stores knowledge about specific topics.
- The dataset consists of descriptions of cities with more than 100k inhabitants, scraped from Wikipedia (link). It totals ~4.4k cities.
- The first 1k characters of each city's Wikipedia page are scraped and cleaned, then run through the encoder from the GPT-2 repo. Each prompt is truncated to its first 30 tokens and serialized to disk for later use (a preprocessing sketch follows the examples below).
- First 5 examples from the dataset:
0 / 4386 https://en.wikipedia.org/wiki/Kabul : Kabul is the capital and largest city of Afghanistan, located in the eastern section of the country. It is also a municipality, forming part of the
1 / 4386 https://en.wikipedia.org/wiki/Herat : Herāt is an oasis city and the third-largest city of Afghanistan. In 2020, it had an estimated population of 574,276
2 / 4386 https://en.wikipedia.org/wiki/Kandahar : Kandahar is a city in Afghanistan, located in the south of the country on the Arghandab River, at an elevation of 1,
3 / 4386 https://en.wikipedia.org/wiki/Mazar-i-Sharif : Mazār-i-Sharīf, also called Mazār-e Sharīf, or just Mazar, is the fourth
4 / 4386 https://en.wikipedia.org/wiki/Jalalabad : Jalalabad is the fifth-largest city of Afghanistan. It has a population of about 356,274, and serves as the capital of N
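A minimal sketch of this preprocessing step, assuming the Hugging Face `transformers` tokenizer as a stand-in for the encoder in the GPT-2 repo; the `pages` mapping (URL → cleaned page text) is hypothetical:

```python
# Sketch of the prompt-building step: first 1k chars -> encode -> first 30 tokens.
# Assumes Hugging Face's GPT2Tokenizer in place of the GPT-2 repo's encoder;
# `pages` (url -> cleaned text) is a hypothetical input.
import pickle
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def build_prompts(pages: dict[str, str], max_tokens: int = 30) -> dict[str, list[int]]:
    """Encode the first 1k characters of each page, truncated to 30 tokens."""
    prompts = {}
    for url, text in pages.items():
        token_ids = tokenizer.encode(text[:1000])
        prompts[url] = token_ids[:max_tokens]
    return prompts

# Serialize to disk for the extraction passes below.
# prompts = build_prompts(pages)
# with open("prompts.pkl", "wb") as f:
#     pickle.dump(prompts, f)
```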
Extracting latent info
- Each prompt is encoded and passed through GPT-2 to extract activations (TODO: also extract logit lens).
- For each MLP (multi-layer perceptron) layer, we track which neurons fired at any point during the sequence, and at what activation value (see the hook sketch after this list).
- (TODO: Extract logit lens)
- Clustering is done after compressing the latent vectors down to 50 dims with PCA, then further to 2 dims with t-SNE (sketched below).
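A minimal sketch of the activation capture, assuming the Hugging Face `GPT2LMHeadModel`; "fired" is taken here to mean a positive post-GELU activation, which is an assumption since the notes don't specify the criterion:

```python
# Capture per-layer MLP activations via forward hooks on each block's
# activation module, then report which neurons fired anywhere in the
# sequence and their peak activation. "Fired" = post-GELU value > 0
# (assumed threshold, not specified in the notes).
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
mlp_acts: dict[int, torch.Tensor] = {}

def make_hook(layer_idx: int):
    def hook(module, inputs, output):
        # output: (batch, seq, 4 * hidden) post-GELU MLP activations
        mlp_acts[layer_idx] = output.detach()
    return hook

handles = [
    block.mlp.act.register_forward_hook(make_hook(i))
    for i, block in enumerate(model.transformer.h)
]

@torch.no_grad()
def fired_neurons(token_ids: list[int], layer: int):
    model(torch.tensor([token_ids]))
    acts = mlp_acts[layer][0]               # (seq, 4 * hidden)
    peak = acts.max(dim=0).values           # strongest activation per neuron
    fired = (peak > 0).nonzero().flatten()  # neurons that fired at any point
    return fired, peak[fired]
```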
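For the logit-lens TODO above, a hedged sketch of the standard formulation: project each layer's residual-stream hidden state through the final layer norm and the unembedding matrix to read off intermediate next-token predictions:

```python
# Standard logit-lens sketch (the notes only mark this as TODO):
# apply ln_f and the unembedding to every intermediate hidden state.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def logit_lens(token_ids: list[int]) -> torch.Tensor:
    out = model(torch.tensor([token_ids]), output_hidden_states=True)
    # hidden_states: one (batch, seq, hidden) tensor per layer, plus embeddings
    stacked = torch.stack(out.hidden_states)        # (layers+1, 1, seq, hidden)
    normed = model.transformer.ln_f(stacked)        # final layer norm
    return normed @ model.transformer.wte.weight.T  # (layers+1, 1, seq, vocab)
```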
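And a sketch of the projection feeding the clustering, using scikit-learn; `latents` is a hypothetical `(n_cities, d)` array of the per-prompt latent vectors described above:

```python
# PCA to 50 dims, then t-SNE down to 2, as described in the notes.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def project_2d(latents: np.ndarray, seed: int = 0) -> np.ndarray:
    """Compress latent vectors to 50 dims with PCA, then to 2 dims with t-SNE."""
    reduced = PCA(n_components=50, random_state=seed).fit_transform(latents)
    return TSNE(n_components=2, random_state=seed).fit_transform(reduced)
```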