Car Parts Show Us How Some Genetic Stats Mislead

1. We can “read” genes with ease now, but still can’t say what most of them “mean.” Mastering precisely how they “cause” higher-level traits will require clearer “causology” and fitter metaphors.

2. Genes (more precisely, gene products) contribute to fiendishly complex processes that confound the standard stats grinder. To illustrate, imagine scrutinizing cars and their parts like we do bodies and genes in “genome-wide association studies” (GWAS). The details don’t matter here, beyond that a car-GWAS would analyze a car-level trait like fuel efficiency by variations in the properties of all the car’s parts.

3. Consider a car sold in standard and sporty models. The latter have larger, gas-guzzling engines and optional pimped-up painted brake calipers. If sporty buyers more often pick red brakes, then, statistically speaking, red brakes bring greater gas-guzzling “risk,” even though brake color does nothing to fuel use.

4. If I’m not mistaken (please correct me, stats geeks), no stats-only data wizardry can distinguish such non-causal entanglements from causal ones (p-values can’t discern “phantom patterns”).

5. Generally, part-level properties can have non-causal and non-random “links” to higher-level traits. And including non-causal factors distorts the statistics (misallocating the variation that seems “explained by,” “accounted for,” or “linked to”). Lacking causal insights, you always run the “red-brake” risk.
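
To make the red-brake risk concrete, here is a minimal simulation sketch. Every number in it is an assumption made up for illustration (the 50/50 model split, the 70% and 10% red-brake rates, the fuel-use figures), and `simulate_cars` and `red_brake_gap` are hypothetical helper names. Brake color never enters the fuel-use calculation, yet red-brake cars look thirstier in the pooled data.

```python
import random

def simulate_cars(n=10_000, seed=0):
    """Return (sporty, red_brakes, fuel_use) tuples; all numbers are made up."""
    rng = random.Random(seed)
    cars = []
    for _ in range(n):
        sporty = rng.random() < 0.5
        # Assumption: sporty buyers choose red calipers far more often.
        red_brakes = rng.random() < (0.7 if sporty else 0.1)
        # Fuel use depends only on the model's engine, never on brake color.
        fuel_use = (12.0 if sporty else 8.0) + rng.gauss(0, 1.0)
        cars.append((sporty, red_brakes, fuel_use))
    return cars

def red_brake_gap(cars):
    """Mean fuel use of red-brake cars minus mean fuel use of the rest."""
    red = [fuel for _, is_red, fuel in cars if is_red]
    other = [fuel for _, is_red, fuel in cars if not is_red]
    return sum(red) / len(red) - sum(other) / len(other)

cars = simulate_cars()
naive_gap = red_brake_gap(cars)                                # pooled comparison
sporty_gap = red_brake_gap([c for c in cars if c[0]])          # sporty models only
standard_gap = red_brake_gap([c for c in cars if not c[0]])    # standard models only

print(f"all cars:       {naive_gap:+.2f}")
print(f"sporty only:    {sporty_gap:+.2f}")
print(f"standard only:  {standard_gap:+.2f}")
```

The pooled gap is large, while the gaps within each model hover near zero. Note what made the second comparison possible: we already knew that the model drives fuel use and we had recorded it. The split came from causal insight, not from the statistics.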

6. Regarding metaphors, gene products work more like words than car parts (genes aren’t static “blueprints”). They act via sentence-like structures with collective effects and multiple “meanings.” But we lack the rules (~cellular syntax, gene grammar) for how parts of biology compose life’s activity-sentences.

7. Genes also sort of work like music: Typically “played” in precise synchrony to orchestrate many molecular melodies (simultaneous biochemical sentences) enabling enormous ensemble effects.

8. And life typically has way more moving parts than cars, and more complex transient causal structures. Its traits often have multiple hetero-causal etiologies (roadmaps exhibiting sufficient but not necessary logic). Current stats can’t disentangle hetero-causal effects (larger type-mixed samples often won’t help).
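
One way to picture “sufficient but not necessary” logic is a toy sketch like the following. It assumes two hypothetical variants, A and B, each of which produces the trait on its own; the 5% frequencies and the function name `share_of_cases_with_a` are invented for illustration and do not describe any real gene.

```python
import random

def share_of_cases_with_a(n, seed, freq_a=0.05, freq_b=0.05):
    """Fraction of trait-bearing individuals who carry hypothetical variant A."""
    rng = random.Random(seed)
    cases = 0
    cases_with_a = 0
    for _ in range(n):
        has_a = rng.random() < freq_a
        has_b = rng.random() < freq_b
        # Either variant alone is sufficient for the trait; neither is necessary.
        if has_a or has_b:
            cases += 1
            cases_with_a += has_a
    return cases_with_a / cases

small = share_of_cases_with_a(2_000, seed=1)
large = share_of_cases_with_a(200_000, seed=1)

print(f"share of cases carrying A, n=2,000:   {small:.2f}")
print(f"share of cases carrying A, n=200,000: {large:.2f}")
```

In this toy world A is a certain cause for everyone who carries it, yet in the pooled sample it shows up in only about half the cases. The bigger sample pins that mixed figure down more precisely, but it still describes the blend of two routes rather than either route.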

9. All this is sort of known (e.g., genetic architecture, causal roles) yet “jump-to-the-genes” GWASing continues (with rickety elaborations like polygenic scoring).

10. Thankfully, fitter thinking is afoot—for instance, geno-pheno mapping (Massimo Pigliucci), better “Laws of Biology” (Kevin Mitchell), Reductionist Bias Corrections (Krakauer), and Causal Structure Modeling (Judea Pearl).

11. Biology and social science need less primarily parts-focused thinking (you can’t grasp chess by studying the properties of its pieces alone), and ways to handle different kinds of causes and roles: see Krakauer’s Figure 4, Aristotle’s four causes, Tinbergen’s four questions, Marr’s three levels. Much in these fields is more process- or algorithm-shaped (often resisting Occam’s Razor).

12. Related iffy thinking exists far beyond genomics. As mostly practiced, stats presume a flat or “heap” causal structure that’s often ill-suited for process-oriented life, or car making, or even cooking (cooks need step-by-step recipes to turn parts into wholes).

13. Statistical analysis without causal insights often runs the red-brake risk. The habit of adding variables to “control for” factors can misallocate variation (itself often a nonsensical or low-quality quantification).
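
One well-known way that “controlling for” a variable can backfire is when that variable is a shared effect of the things being compared (often called a collider). The sketch below is an illustration under invented assumptions, not an analysis of real data: `x` and `y` are two independent part properties, and `c` is a hypothetical downstream quantity (say, price) that both of them push up.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var_x = sum((a - mx) ** 2 for a in xs)
    var_y = sum((b - my) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(1)
n = 20_000
x = [rng.gauss(0, 1) for _ in range(n)]   # property of part 1
y = [rng.gauss(0, 1) for _ in range(n)]   # property of part 2, independent of x
# Assumption: c is caused by both x and y, plus some noise.
c = [a + b + rng.gauss(0, 0.5) for a, b in zip(x, y)]

raw_corr = pearson(x, y)

# "Control for" c the crude way: compare only cars within a narrow band of c.
band = [(a, b) for a, b, k in zip(x, y, c) if abs(k) < 0.5]
controlled_corr = pearson([a for a, _ in band], [b for _, b in band])

print(f"x vs y, all cars:          {raw_corr:+.2f}")
print(f"x vs y, within a c band:   {controlled_corr:+.2f}")
```

Across all cars `x` and `y` are unrelated, but inside the band a strong negative link appears: among cars with similar `c`, a high `x` must be offset by a low `y`. Whether a given control variable helps or hurts depends on where it sits in the causal structure, which the data alone do not reveal.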

14. Similar structureless-sausage data risks pervade black box approaches to Big Data and AI.

15. You know that correlation doesn’t imply causation, but AI doesn’t “know” that.

Illustration by Julia Suits, The New Yorker cartoonist & author of The Extraordinary Catalog of Peculiar Inventions