@jordantgh
Last active December 3, 2023 23:38
G Factor Thoughts

This classic blog post attempts a critical analysis of the general intelligence factor, 'g'. This factor is what emerges from a factor analysis of the correlations between various tests of mental ability, and it is purported to explain a large fraction of the variance in performance across tests. Using Thomson's ability-sampling model, Cosma Shalizi created some simulated test data and performed factor analysis on it. With 11 tests that draw from 500 shared and 500 test-specific 'abilities', all of them independent random variables, a single factor emerges that explains roughly 30% of the variance in test performance. This is taken to undermine one of the core ideas of 'g' theory, since it shows that the 'single factor' needn't correspond to any single variable of interest. While I in fact affirm this conclusion about 'g', I don't think this is a particularly strong argument concerning the reality or meaningfulness of g.

We must first make sure we don't get confused by the fact that, in the simulation, all of the 'abilities' are uncorrelated; this is irrelevant. Summing random variables creates structure in the data, and factor analysis captures that structure. For example, if we have 'tests' $S_1 = A + B$ and $S_2 = B + C$, it's of course expected that $\mathrm{Cor}(S_1, S_2)$ is nonzero, as the two tests share an underlying variable. What's under discussion in the blog post is what happens when you generalise this to arbitrary numbers of variables loaded onto our 'test' variables $S_i$.
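To make this concrete, here is a minimal sketch in Python (the variable names and the use of standard-normal abilities are my own illustrative assumptions, not taken from the blog post): two 'tests' that share one of their underlying abilities end up correlated at about 0.5.

```python
# Two 'tests' built by summing independent standard-normal 'abilities'
# A, B, C; both tests share the ability B.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
A, B, C = rng.normal(size=(3, n))

S1 = A + B
S2 = B + C

# Cov(S1, S2) = Var(B) = 1 and Var(S1) = Var(S2) = 2, so Cor(S1, S2) = 0.5
print(np.corrcoef(S1, S2)[0, 1])  # ~0.5
```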

To break down the simulations in the blog, let's consider a simple worked example. We'll suppose there is a set of 'shared abilities' that can potentially influence multiple tests (think 'eyesight', which could influence many kinds of test you might subject a person to: reading, driving, archery, and so on), and multiple sets of 'unique abilities' that are specific to each test (for example, 'plant knowledge' might exclusively influence a 'gardening skill' test).

"Number of tests": 5
"Shared abilities": {A, B, C, D, E, F, G, H, I, J, K}
"Unique Abilities": {
    "Test 1": {a1, b1, c1},
    "Test 2": {a2, b2, c2, d2},
    "Test 3": {a3, b3},
    "Test 4": {a4, b4, c4, d4},
    "Test 5": {a5, b5, c5, d5, e5}
}

Now, for each test, we draw abilities at random from each relevant category (anywhere from 1 to all 11 of the shared abilities, plus some subset of that test's unique abilities).

"Test 1": {C, H, J, I, K, E, D, a1, b1, c1},
"Test 2": {B, E, K, G, C, H, A, J, d2},
"Test 3": {K, H, G, A, D, I, C, F, a3, b3},
"Test 4": {H, K, G, C, I, d4, c4, a4, b4},
"Test 5": {H, K, G, D, I, B, C, E, J, e5}

In this case, it so happens that a trio of abilities (C, H and K) appears in all five tests. Others, e.g. G, appear in multiple tests. For the purposes of our simulation, the data-generating process is random and uninteresting, so there is no special significance to attach to an ability, or a combination of abilities, appearing multiple times. However, suppose we believe our battery of tests is meant to capture something we care about, like athletic performance; we might then wonder what exactly these variables are that have predictive value across a range of tests, and whether there's anything significant in their co-occurrence. I'll have more to say on this point later.
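To see how this kind of draw plays out, here is a minimal sketch of the toy data-generating process above (the specifics are my assumptions, not from the blog: every ability is an independent standard-normal variable across people, and each test score is the unweighted sum of the abilities sampled into it). The shared abilities alone are enough to induce positive correlations between the tests.

```python
# Toy version of the worked example: 11 shared abilities, a few unique
# abilities per test, each test summing a random subset of both.
# Assumptions (mine): abilities are independent N(0, 1) across people and
# test scores are unweighted sums of the sampled abilities.
import numpy as np

rng = np.random.default_rng(1)

shared = list("ABCDEFGHIJK")                 # pool of 11 shared abilities
n_unique = {1: 3, 2: 4, 3: 2, 4: 4, 5: 5}    # unique abilities available per test

# Decide at random which abilities feed each test
composition = {}
for t, k in n_unique.items():
    uniques = [f"{chr(ord('a') + i)}{t}" for i in range(k)]  # a1, b1, ...
    picked_shared = rng.choice(shared, size=rng.integers(1, 12), replace=False)
    picked_unique = rng.choice(uniques, size=rng.integers(1, k + 1), replace=False)
    composition[t] = list(picked_shared) + list(picked_unique)

# Simulate people: one independent N(0, 1) draw per ability per person
n_people = 5_000
abilities = {a for members in composition.values() for a in members}
scores = {a: rng.normal(size=n_people) for a in abilities}

tests = np.column_stack([sum(scores[a] for a in composition[t])
                         for t in sorted(composition)])

print(composition)                           # which abilities went into which test
print(np.round(np.corrcoef(tests, rowvar=False), 2))  # typically all-positive correlations
```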

What matters for now are the simulations that Shalizi carried out. As I mentioned, he used a large number of shared and unique variables. The first point to make is that, having run the code, I found quite a bit of run-to-run variation in the variance explained by the leading factor -- between 20% and 50%. One factor explains this: the degree of overlap between the abilities feeding the tests. In runs where the tests look more similar, the $R^2$ of the leading factor is higher. This won't surprise anyone with a basic understanding of the underlying statistics, but it bears mention because there's a bit of subtlety in what a factor really is and the nature of its relationship to a variable of interest. When we say that tests have overlapping abilities, we are saying that the same collections of abilities contribute to multiple tests. The abilities at the core of those overlapping sets will, on average, be the largest components of the leading factor in a factor analysis. The case made in the blog is basically that the presence of multiple variables in our leading factor undermines, at the least, a monolithic interpretation of the factor; in the case of 'g' and 'intelligence', this cuts against monolithic interpretations of intelligence. However, as I alluded to a moment ago, we need to remember that real-life tests are not randomly drawing from collections of uncorrelated random variables. If a collection of abilities, either independently or jointly, carries predictive value, we can ask whether there is something meaningful about that particular collection of variables. To be more concrete: if $C + H + G$ is predictive in a range of tests, is there a meaningful 'X' such that $X = C + H + G$ is worth writing down?
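To make the run-to-run variation concrete, here is a minimal re-implementation in Python (Shalizi's original code is in R; the sampling fractions, sample sizes, and the use of the largest eigenvalue of the correlation matrix as a proxy for the leading factor's share of variance are my assumptions, so the exact numbers won't match his):

```python
# Thomson-style simulation: 11 tests, each summing a random subset of 500
# shared abilities plus its own 500 unique abilities, all independent N(0, 1).
# The leading factor's share of variance is approximated by the largest
# eigenvalue of the test correlation matrix divided by the number of tests
# (a PCA-style proxy, not a maximum-likelihood factor analysis).
import numpy as np

def leading_factor_share(n_people=1_000, n_tests=11, n_shared=500,
                         n_unique=500, rng=None):
    rng = rng or np.random.default_rng()
    shared = rng.normal(size=(n_people, n_shared))
    tests = []
    for _ in range(n_tests):
        # each test samples a random fraction of the shared pool (assumption)
        frac = rng.uniform(0.2, 0.8)
        mask = rng.random(n_shared) < frac
        unique_part = rng.normal(size=(n_people, n_unique)).sum(axis=1)
        tests.append(shared[:, mask].sum(axis=1) + unique_part)
    corr = np.corrcoef(np.column_stack(tests), rowvar=False)
    return np.linalg.eigvalsh(corr)[-1] / n_tests

# Repeat the whole simulation to see the spread across runs
print([round(leading_factor_share(), 2) for _ in range(10)])
```

Runs in which the tests happen to sample heavily overlapping slices of the shared pool push the leading factor's share up; runs with little overlap pull it down.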

At this point, we should ask ourselves what a 'real', 'meaningful', or 'monolithic' variable in science looks like. Are there any clear examples? What is it that we want to contrast the g factor with? A 'proper' physiological measurement like height might come to mind. Few would doubt that height is real, at least not outside of a philosophy seminar. But it's trivial to show that height can be decomposed into a linear combination of arbitrarily many underlying variables: leg length, torso length, head length, and so on; we can be as granular as we like. If we conduct a battery of basketball-related tests and measure height, eye colour and hair colour, we will probably recover a single dominant factor mostly loaded on height. Will that change if, instead of measuring height, we measure leg, torso, and head length individually? Of course it won't. Neither the statistical properties nor the underlying reality of the situation will have changed, but we'll now be able to point to a factor loaded with multiple component variables. Even a bona fide 'IQ organ' in the brain, whose mass solely determines intelligence, would be consistent with Shalizi's simulations; all that's needed is to split the measurements up. If this point goes through, then we've established that the mere fact that a factor is composed of multiple component variables doesn't imply anything in particular about the ontological status of the factor, negative or positive.
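As a quick sanity check on the height example (a minimal sketch; the test battery, effect sizes, and noise levels are made up purely for illustration), the same eigenvalue-share proxy as above shows a single dominant component whether height enters the battery as one measurement or as its three parts:

```python
# Height example: three basketball 'tests' driven by total height plus noise.
# We compare a battery that records height as one variable with one that
# records leg, torso, and head lengths separately. (Effect sizes and noise
# levels are illustrative assumptions.)
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

leg, torso, head = rng.normal(size=(3, n))
height = leg + torso + head
eye_colour, hair_colour = rng.normal(size=(2, n))    # irrelevant to basketball

basketball = [height + rng.normal(size=n) for _ in range(3)]

def leading_share(columns):
    corr = np.corrcoef(np.column_stack(columns), rowvar=False)
    return np.linalg.eigvalsh(corr)[-1] / corr.shape[0]

combined = [height, eye_colour, hair_colour, *basketball]
split    = [leg, torso, head, eye_colour, hair_colour, *basketball]

print(round(leading_share(combined), 2))  # a single dominant component
print(round(leading_share(split), 2))     # still a single dominant component,
                                          # even with height split into parts
```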

That said, on the 'positive' end of this (I have so far focused on the negative end), it remains the case that 'g realism' can't really follow from the mere existence of a single dominant factor in a factor analysis. My point hitherto has been narrow: many variables in a factor $\not\implies$ the factor is itself 'meaningless'. For a factor to be meaningful is still a high bar. I do not have any grand theory of what constitutes a 'proper' scientific variable in this sense, but it seems likely to depend on a background theoretical framework. Consider the case of g. One critique of g theory and (more often) of IQ tests is that they are culturally relative: the test batteries index heavily on the experience of Western researchers and their cultural norms. If this is right, one of the abilities composing the factor would be proximity to Western cultural norms. Whatever theory of intelligence we might have in mind, it seems unlikely this would be part of it. This could be seen as an example of 'overfitting' in a statistical sense (it is a function of picking out peculiar features of a restricted data set), but there are deeper problems: some variables will feature in any human testing endeavour, like motivation, energy levels, and so on. These reflect what Chomsky called the performance/competence gap in linguistics, a distinction which applies broadly across scientific domains. The issue is that we don't have a deep, principled, mechanistic account of the phenotype we want to measure. If we had a clear concept of intelligence, or of its physiological components, we could skip the test batteries and measure the relevant parameters directly.
