Originally written 18 Jul 1998; Last addition 21 Jul 2019

A survey of the literature about replicability of cognitive brain imaging.

A letter about replicability of cognitive brain imaging that I tried to publish in 2003 : Responses

Misunderstanding in cognitive brain imaging.

0. Definition of 'Cognitive Brain Imaging' (CBI)

By 'Cognitive Brain Imaging' (CBI) I mean any imaging technique which looks at structures of and in the brain, with the purpose of understanding the mechanisms of the brain. This is to differentiate cognitive brain imaging from clinical brain imaging, which is used to identify damage to the brain, and to map it to guide treatments. The discussion is only about cognitive imaging. I consider only PET and MRI studies.

1. Should we really expect to see much in CBI?

There are great expectations of CBI in the field of cognitive science. These expectations, however, are based on naive and wishful thinking. There are two main problems with these expectations: the resolution of imaging and the differences between individuals.

Resolution

Brain imaging techniques like PET and MRI have a resolution that is still far from being really useful. Currently, this is in the range of millimeters. The working of the brain, however, is far too complex to be based on working units of that magnitude. Thus, the pattern of activity at higher resolution must be important. Most people realize that, but they also assume that the pattern of activity at the millimeter level is necessarily significant too.

The last assumption is simply nonsensical. Two neuron activity patterns which are functionally different do not necessarily give different activity at a resolution of millimeters. Because a volume of a few cubic millimeters contains a large number of neurons (~1,000,000; [5 Jul 2008] Logothetis 2008 (doi) computes 5.5 million neurons in a typical voxel), changes in the pattern of activity inside it probably have a small probability of showing as a change in the total activity of the whole volume. Hence, most of the activity and changes in activity in the brain cannot be detected by PET and MRI.
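To make the point concrete, here is a minimal sketch in Python. It is a toy model, not a model of real neurons: the neuron count, the on/off coding and the shuffling are all assumptions chosen for illustration. It builds two fine-scale patterns that differ in a large fraction of their units yet give exactly the same voxel total, which is all that a coarse measurement would see. It shows only that such pairs exist, not how often they occur in a real brain.

```python
import random

def voxel_total(pattern):
    """Total activity of a voxel: the only quantity a coarse measurement sees."""
    return sum(pattern)

def shuffled_copy(pattern, seed):
    """Return a rearranged pattern with the same activity levels in other places."""
    rng = random.Random(seed)
    other = list(pattern)
    rng.shuffle(other)
    return other

# Hypothetical size, far smaller than the neuron count of a real voxel.
N_NEURONS = 100_000

rng = random.Random(0)
pattern_a = [rng.randint(0, 1) for _ in range(N_NEURONS)]  # toy on/off units
pattern_b = shuffled_copy(pattern_a, seed=1)

# Count the units whose state differs between the two patterns.
differing = sum(a != b for a, b in zip(pattern_a, pattern_b))

print("units whose state differs:", differing)
print("voxel totals:", voxel_total(pattern_a), voxel_total(pattern_b))
```

The two totals are identical by construction, so any method that reads only the total cannot tell these two patterns apart.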

On the other hand, there is probably a lot of 'noise' in the brain, i.e. activity that is not related to the task that the brain is believed by the researchers to be doing at the time they are looking at it. Part of this 'noise' is activity related to other tasks that the brain is doing at the same time, and part of it is real noise, i.e. has no functional significance. Currently, we have no way to directly distinguish between functional activity that is relevant to the task, functional activity that is irrelevant to the task, and noise. Thus we cannot tell whether the patterns of activity that we see in PET or MRI are significant.

Note that this does not tell us that the patterns seen in PET and fMRI are necessarily insignificant. In some regions, the changes in patterns of activity may be enough to actually show in the resolution of PET and fMRI. The problem is that we cannot know that, unless we have an independent test of the significance of the low-resolution activity pattern.

[5 Jul 2008] Logothetis 2008 (Nature 453, 869-878 (12 June 2008)) gives a much more authoritative and up-to-date critique of fMRI. But he completely ignores variations between individuals and reproducibility.

Differences between individuals

Brain damage research (Neuropsychology) tells us that in the cortex, where most cognitive operations happen, function is localized only for input and output and perception/generation of phonemes. Thus most of the functions of cognition are variable between individuals, and do not have specific locations in general. Almost all CBI studies at the moment try to find a specific location for cognitive functions, and therefore are bound to fail. In general, we should not expect to see the same patterns of activity in different individuals.

The expectation that CBI will find location of functions is based on what I call the 'dogma of cognitive science', or the 'sameness assumption', in Reasoning Errors.

As in the case of resolution, that does not tell us that we cannot pick up anything with PET and MRI, because there may be patterns of activity that are the same across individuals (e.g. patterns of activity associated with the basic learning mechanism). It does tell us that we have to be cautious, and need some way of testing whether what is seen is real data.

Because of these two problems, we must have a way of testing whether the pattern of activity that we detect in CBI is real. Currently, it seems that the only way to distinguish between relevant and irrelevant data is to check if it is reproducible. If a specific task reproducibly evokes a specific pattern of activity, it probably has some significance. This is discussed in the next section.

2. Reproducibility

The hallmark of real results in a laboratory scientific experiment is their reproducibility. If you can reproduce the relevant conditions in the experiment, you must get the same results. This of course applies to cognitive brain imaging, and, as discussed above, is essential for checking whether the results are real or not. So, are the results of CBI reproducible?

2.1 Publishing Bias

In all science, there is a bias against publishing negative results, including failure to reproduce published results. This starts with the unwillingness of the researchers to write negative results papers, continues in the tendency of editors and reviewers to reject these papers as uninteresting and unconstructive, and ends with the readers finding them boring. Importantly, it also affects the ability of the researchers to get funded. Thus, there is a constant need to guard against a too 'positive' tendency. In most fields of scientific research, people are aware of this, but not in cognitive brain imaging.

In private conversations, researchers in cognitive brain imaging admit that there are 'tons' of unpublished studies that show 'weird' results. In any field of research outside cognitive sciences that would cause red lights to show up, but not in cognitive brain imaging. Instead, they concentrate on publishing those studies that seem to show real results. This has an escalating effect, because once a trend of not publishing negative results is established, the mood in the field becomes even more 'positive' oriented, which makes it even more difficult to publish negative results.

2.2 Reproducibility of the published work.

Even with this concentration of positive thinking, the researchers in this field cannot find enough good results to publish, and most of the published work is irreproducible, as shown in my paper about replicability in brain imaging. For that, they have to ditch the concept of reproducibility, at least implicitly. A convincing example that this has already happened is here. Note that the irreproducibility of the paper was not considered a problem by any of the reviewers or the editors that I contacted.

2.3 A publication that addresses the reproducibility of PET studies

The first paper that I know of that "tries to address the question of reproducibility" is:

Poline J-B, Vandenberghe R, Holmes AP, Friston KJ, Frackowiak RSJ. Reproducibility of PET activation studies: lessons from a multi-center European experiment. NeuroImage 4, 34-54 (1996)

I put the double quotes because the authors of this paper clearly have no intention to objectively evaluate their results. Their main data, presented in their figure 2, clearly demonstrates irreproducible results.

In case it is not obvious from the pictures that the results are irreproducible, pretend that you are a researcher with the results of one of the studies shown in figure 2, and based on the other results, evaluate these questions:

For each pixel that your study identifies as active, what is the probability that it is going to be reliably reproducible in other studies? In other words, how many of your positive results are real rather than false positives?
For each pixel that your study identifies as active, how many pixels that your study identifies as non-active are active in most of the other studies? In other words, how many false negatives do you have, compared to your positive results?

For your study to be considered reproducible, the answer to the first question has to be close to 1, while the answer to the second has to be close to 0. Otherwise, it means that your study has either many false positives or many false negatives, or both. None of the studies shown in the pictures passes both tests.
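The two questions can be sketched as a small Python calculation. This is my illustration, not anything done by Poline et al: the representation of each study as a set of active pixel ids, the function names, and the reading of "most of the other studies" as "more than half" are all assumptions.

```python
def active_in_most(others, pixel):
    """True if more than half of the other studies mark this pixel as active."""
    votes = sum(1 for study in others if pixel in study)
    return votes > len(others) / 2

def reproducibility_answers(own, others):
    """Answer the two questions for one study.

    own    -- set of pixel ids that this study identifies as active
    others -- list of sets, one per other study (hypothetical representation)

    Returns (confirmed, missed):
      confirmed -- fraction of own active pixels that are active in most
                   other studies (should be close to 1)
      missed    -- pixels inactive here but active in most other studies,
                   per active pixel of this study (should be close to 0)
    """
    if not own:
        raise ValueError("the study has no active pixels")
    n_confirmed = sum(1 for p in own if active_in_most(others, p))
    candidates = set().union(*others) - own
    n_missed = sum(1 for p in candidates if active_in_most(others, p))
    return n_confirmed / len(own), n_missed / len(own)

# Toy data, invented for illustration only.
own_study = {1, 2, 3, 4}
other_studies = [{1, 2, 5}, {1, 5, 6}, {1, 2, 5, 7}]
print(reproducibility_answers(own_study, other_studies))  # (0.5, 0.25)
```

In the toy data half of the study's active pixels are not confirmed by the others, so it would fail the first test even though its second number is fairly low.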

You can repeat the exercise with only a sub-group of the studies (e.g. only 3d studies, or eliminating the last 3 'low sensitivity' studies). You will still see that in general, the pattern of activity in any single study cannot be regarded as showing the real pattern of activity.

Nevertheless, the authors repeatedly state that the results show reproducibility. They also go on to discuss possible explanations of the variability, like the distinction of 3d vs. 2d studies and low-sensitivity studies. As discussed above, even with these distinctions the results still show irreproducibility. More importantly, all the studies are supposed to show the same results. If we say, for example, that the 3d results are real, then the results of the 2d studies, which are different, cannot be real, and we have to dump all the studies that have been done until now using 2d.

Small areas of activity were reproducible in almost all the studies, according to the authors. However, from each individual study, it is not possible to predict those regions, which means that the results of a single study cannot be used on their own.

In addition, these reproducible regions are similar to those that were identified by brain damage. While this is reassuring, the main interest in brain imaging is the hope that it can tell us something that we don't already know. This study does not support this hope.

The failure of this study is emphasized by the fact that it didn't come up with any actual useful piece of data that can be built on in further research.

2.3.1 Another study about replication

I found this paper later:

Casey, B.J., Cohen, J.D., Craven, K.O., Davidson, R., Nelson, C.A., Noll, D.C., Hu, X., Lowe, M., Rosen, B., Truwit, C., & Turski, P. (1998). Reproducibility of fMRI results across four institutions using a spatial working memory task. NeuroImage, 8, 249-261.

(It was published a year after I wrote the replicability paper).

This is a four-center study in the US, using fMRI data rather than PET like Poline et al above. The task that they chose involves a specific motor action, which from brain damage data is known to be fairly localized, and also some "working memory", which is not known to be localized.

The data itself clearly shows irreproducibility, but the authors try to hide it.

First, they collect results from two subjects from each site, and generate a dataset from it (table 1), simply by picking out voxels in which there was activity from at least 6 out of 8 or 5 out of 7. As shown in section 3 below, this approach can generate "significant" results even when the data is totally random between the subjects. Because the data involves a motor task, it is not random, so it is even easier to generate "significant" results. The authors do not bother to tell us how many voxels there are where there is activity from fewer than 6 (or 5) subjects, so it is not possible to tell if the distribution is similar to the one in section 3 below. However, from table 1 we can see that there were no locations with activity from all 8 subjects, 5 locations with activity from 7 subjects, and 17 locations with activity from 6 subjects, i.e. a large increase from 8 to 7 to 6. Projecting this to smaller numbers suggests many areas with activation from a small number of subjects, i.e. many irreproducible locations of activity.
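A rough sketch of why such a threshold can pass voxels in purely random data, and why the counts rise steeply as the threshold is lowered, is the binomial calculation below. It is not the calculation of section 3, and it uses no numbers from Casey et al: the voxel count and the per-subject probability of a voxel being "active" are hypothetical, and the subjects are assumed to be completely independent.

```python
from math import comb

def prob_exactly(k, n_subjects, p_active):
    """Binomial probability that exactly k of n independent subjects show activity."""
    return comb(n_subjects, k) * p_active**k * (1 - p_active) ** (n_subjects - k)

def expected_counts(n_voxels, n_subjects, p_active):
    """Expected number of voxels active in exactly k subjects, for every k."""
    return {
        k: n_voxels * prob_exactly(k, n_subjects, p_active)
        for k in range(n_subjects + 1)
    }

# Hypothetical values, chosen only for illustration.
N_VOXELS = 20_000   # voxels examined
P_ACTIVE = 0.15     # chance that a voxel is "active" in any one subject

counts = expected_counts(N_VOXELS, 8, P_ACTIVE)
for k in (8, 7, 6, 5, 4):
    print(f"active in exactly {k} of 8 subjects: {counts[k]:.2f} voxels expected")

passing = counts[6] + counts[7] + counts[8]
print(f"expected voxels passing the '6 out of 8' rule: {passing:.2f}")
```

Under these assumptions the expected count grows at each step from 8 to 7 to 6 and keeps growing below the threshold, which is the shape of distribution that the numbers in table 1 hint at.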

During all the discussion the authors refer to the areas where there is activity from 6 or more subjects as "reliable" activity. However, "reliable" means that it will reproduce, and these areas are clearly not reproducible, because they don't even reproduce inside this study. For example, of the 11 areas identified in Table 1 for the Memory vs. Motor condition, only two have similar (within 10 mm) areas in the data from the Boston site (table 2a), and four don't have a matching peak within 20 mm in the Boston data (the results in the other cases are not better; see here for the complete listing). The authors themselves apparently did not bother to carry out this check, presumably because the result was not to their liking.

In the second part of the analysis, the authors list the areas that each site identified in tables 2-4. Clearly the data diverge, as is also clear from figure 3. The authors highlight two areas that are, in their words, "revealing almost exact matches", one for each condition. However:

This is one area out of 15-20 areas that each site identified for each condition, so the data from each site is mostly irreproducible (more than 90%).
Even in these "almost exact matches", the distances between the coordinates of different sites are up to 17mm, which is a large distance.
The local patterns of activity in each area are completely different between the sites (from Figure 3), so the difference in location is not just a displacement.
Most importantly, these two areas are clearly not reproducible, because neither of them has been identified in the pooled data (table 1), i.e. they fail to reproduce even within this study.

The fact that the authors completely ignore these points, in particular the last point, clearly demonstrates that they have no intention to evaluate the data properly. They do discuss differences between sites, but mainly to suggest excuses for the differences. They do not try to evaluate the implications of these differences for the reliability of imaging data. In the discussion, the authors also claim that there is similarity between their data and some previous papers, but they don't present an actual comparison.

2.4 A review trying to cope with the variability in the results.

Another amusing discussion is in

Benjamin Martin Bly and Stephen M. Kosslyn (1997). Functional anatomy of object recognition in humans: evidence from positron emission tomography and functional magnetic resonance imaging. Current Opinion in Neurology, 10, 5-9.

These authors obviously would have liked the data to give some useful results, but when they collect all the relevant studies the results are obviously irreproducible. In the words of the authors, there is a large variability. The authors try to consider several possible excuses, but never lose their faith in the power of imaging to give reproducible results. Even when they go as far as suggesting that object recognition is 'opportunistic' (by which they seem to mean 'variable in time'), they still believe that proper studies can find reproducible results.

It stands out that even though they actually admit the possibility of variations in time in the same individual ('opportunistic system'), and acknowledge the variability between studies, they do not believe in inter-individual differences. They completely ignore this point when discussing the problems associated with averaging between individuals. Obviously, if there are variations between individuals, averaging becomes a noise-generating procedure.

The data they present is best explained by assuming that 90-100% of it is just irreproducible noise, either because the imaging techniques do not capture real data for the reasons discussed in section 1 above, or because there are variations between individuals, or both. At most, the data suggests that during object recognition the temporal and occipitotemporal cortex tend to be noisier than other parts of the cortex. This is quite compatible with brain damage data, but, as in 2.3 above, does not actually add anything to it, and demonstrates that imaging is unlikely to give better results. It seems that for the authors this conclusion is not only unacceptable, but actually incomprehensible.

2.5 A discussion of the validity of PET and MRI

An interesting case is in The promise and limit of neuroimaging. Even though the name is quite suggestive, the authors (actually philosophers), who intend "to identify and to critically evaluate the epistemic status of PET, with a goal of better understanding both its potential and its limitations", ignore the question of (ir)reproducibility altogether. Amusingly, the examples of actual research they bring [section 3.1 and 3.2] show a case of irreproducibility, but the authors explain it by assuming that the second experiment was poorly designed for duplication of the first one. Maybe they are right in this case, but it is still true that nobody has reproduced the results of the first paper (or the second paper).

2.6 A Review of imaging of visual recognition

Farah and Aguirre (1999, TICS, V.3, P.179) discuss imaging of visual recognition. They collect data from 17 studies, and it is scattered all over the posterior cortex. In the authors' words (p.181):

The only generalization that one can make on the basis of these data is that visual recognition is a function of the posterior half of the brain!

(Exclamation mark in the source)
The authors explicitly say that they are disappointed, and refer to their results as "(non)-results". Then they go on to look for explanations. Even though their data is clearly irreplicable, they do not consider this possibility. They go as far as mentioning the possibility that imaging shows "epiphenomenal" activity, because they need an explanation of why their data contain activities in regions that brain damage studies suggest are not essential. They dismiss this problem by simply ignoring it in the rest of the discussion. They then go on to the usual blurb about different experimental settings, and as usual ignore the question of why researchers don't repeat the experiments with the same settings. They next go on cheerfully to introduce newer paradigms, which they believe will sort out the problem.

This case is especially interesting, because the presentation of the data makes it absolutely clear that it is not replicable, yet the authors do not even mention this possibility. One explanation is that they intentionally mislead the reader, but I would say that this is unlikely. Rather, it seems that this is an example of a "theory-driven blindness", where the theoretical prejudices of the authors make them blind to what their data 'tells' them.

2.7 A book that mentions the lack of replicability in imaging

[25Nov2001]

I didn't read the book "The new phrenology: the limits of localized cognitive processes in the brain" by William Uttal, but apparently it mentions the problem of lack of replicability in cognitive brain imaging. I first heard about the book by reading a review that tries to bury it as deep as possible in Nature ("Bumps on the brain", Nature, Vol 414, p. 151, 8 November 2001). The review is an interesting phenomenon on its own, and here are my comments on it.

2.8 The fMRI data center

[28Nov2001]

There is an fMRI data center, in which they want researchers to put their datasets. One of the goals of this center is "Providing all data necessary to interpret, analyze, and replicate these fMRI studies." At least they did not completely ignore it, but there is no sign that they are aware that replicability is a problem, and hence needs special attention. Nevertheless, this center makes it easier for researchers to try to replicate previous studies, so maybe it will make it clearer to them that they can't. Since this is a relatively new enterprise, its effect, if any, will take some time.

[ 4 Oct 2004] The enterprise doesn't seem to catch on. It seems that more than 95% of the datasets are from Journal of Cognitive Neuroscience, so in general authors in the field don't bother to deposit unless they publish in JoCN, which probably forces it, because the editor (Gazzaniga) is the driver of the database.

2.9 Studies that test the reproducibility

[18Jan2002]

Here (Roberts et al, American Journal of Neuroradiology 21:1377-1387 (8 2000) ) is an article that really tests the reproducibility of fMRI. It is worth noting that these are radiologists, rather than cognitive scientists, and therefore are much less affected by the dogmas of cognitive science. Their conclusion is:

Quantitative magnetoencephalography reliably shows increasing cortical area activated by increasing numbers of stimulated fingers. For functional MR imaging in this study, intra- and interparticipant variability precluded resolution of an effect of the extent of stimulation. We conclude that magnetoencephalography is more suited than functional MR imaging to studies requiring quantitative measures of the extent of cortical activation.

If that had been in a journal that cognitive scientists read, I probably wouldn't have had to write this page.

I read only the abstract of this (Miki et al, Jpn J Ophthalmol 2001 Mar-Apr;45(2):151-5). They seem also to fail to reproduce.

Same authors as above, previous study (Miki et al, American Journal of Neuroradiology 21:910-915 (5 2000)). Failed to reproduce, even within the same subject.

This article [also here] (McGonigle, D.J., Howseman, A.M., Athwal, B.S., Friston, K.J., Frackowiak, R.S.J., & Holmes, A.P. (2000). Variability in fMRI: An examination of intersession differences. NeuroImage, 11, 708-734) checks inter-session variability, and finds a lot of it (see their figures 2 and 3). In fact, their findings undermine all the studies using fMRI. However, they carefully avoid discussing reproducibility of fMRI. They try to check their results by Random Effects Analysis. They obviously think it is a good idea, but their data does not support it, and they say in the conclusion:

Our assumption of Normally distributed intersession residuals was not supported by close examination of some of our data, and so we accept that future work is required before random-effects models can be used to their full potential.

This implies that even though it doesn't work yet, they are sure that the "random-effects models" will work after some "future work". They don't give the basis for this confidence. See below why random effect analysis is unlikely to be as useful as they think.

This article ( Stability, Repeatability, and the Expression of Signal Magnitude in Functional Magnetic Resonance Imaging. Mark S. Cohen, PhD, and Richard M. DuBois. JOURNAL OF MAGNETIC RESONANCE IMAGING 10:33-40 (1999)) tries to introduce a new approach to evaluate fMRI data. Among other things, they say (p.34):

The most popular approach now in practice is to count the number of voxels that exceed a nominal correlation threshold (often without correcting for the number of samples), sometimes applying additional constraints of spatial contiguity by using ad hoc approaches such as the split-half t-test or convolution smoothing. While it is statistically sound to compare this measure across trials, we show here that all such data are highly suspect, and have such high variance from subject to subject, and trial to trial, that both the statistical power and reliability of the fMRI experiment is compromised severely.

(Italics in the original text).

The basic claim, that fMRI is unreliable, is the same as what I say. However, they claim to actually prove it in their multi-session experiment. We may need to take this statement with some salt, because to advance their new approach they want to slag off the existing ones. However, this statement apparently passed the review process of the Journal of Magnetic Resonance Imaging, so it wasn't that unacceptable to the referees.

In this abstract (NeuroImage Human Brain Mapping 2002 Meeting, Poster No.: 10184) they actually test for replication across individuals, and find "high inter subject variability" in all tasks. Nevertheless, they took the task that looked to them most reproducible (i.e. least irreproducible) and used it for the patient that was their main interest. They also claim that the results within subject over time "evidenced the involvement of the same cortical network", which is difficult to interpret. It is fair to say that their conclusion is that follow-up studies should rely on "language tasks of known stability across subjects and time.", though they omit to say that their tests were found not to have such stability over subjects.

Here is an article (Arthurs and Boniface, Trends in Neurosciences, Volume 25, Issue 1, 1 January 2002, Pages 27-31) with a promising title: "How well do we understand the neural origins of the fMRI BOLD signal?". The text does discuss the question and says that we don't really. It would have been natural for them to ask "if we don't really understand it, can we be sure it generates real data?", and the first step to answer that is to check if it generates reproducible results. However, the authors don't ask this question, and seem to take it as absolutely granted that the fMRI signal gives valid results.

[ 5 Apr 2003 ] In this review (Scott and Johnsrude Trends in Neurosciences, Volume 26, Issue 2 , February 2003, Pages 100-107) of "The neuroanatomical and functional organization of speech perception" they don't check for reproducibility, but in their Box 2 (p. 103) they present a "Meta-analysis" of imaging studies of speech perception. The figure shows what is clearly a random distribution, the kind of thing you will get if you fill a box with variously coloured chips and then spill it on the floor. That doesn't bother the authors, and based on this data they say:

There is, thus, some evidence for parallel and hierarchical responses in human auditory areas.

Clearly, even if the data they show is not completely random, there isn't any sign in it of anything parallel or hierarchical.

In the same issue as the previous item, there is a review (Ugurbil, Toth and Kim, Trends in Neurosciences, Volume 26, Issue 2, February 2003, Pages 108-114) titled "How accurate is magnetic resonance imaging of brain function?", which sounds promising, but there is nothing in it about reproducibility.

[28Jul2003] This article (pdf) (Schultz et al, Arch Gen Psychiatry. 2000;57:331-340) doesn't intentionally test for reproducibility. It compares activity in face recognition between two groups of normal persons, and a group of autists. While it does seem that the activity in the autist group is more different from the normal groups than the normal groups are between themselves, the activities of the normal groups are clearly different (Figure 3, A and B compared to C and D). The authors claim that in the ROI in the right fusiform gyrus the activity is not significantly different between the normal groups, but since it is clear that the activities are different, that just shows that they use lousy statistics.

This study (pdf) (Disbrow et al, J. Comp. Neurol. 418:1-21, 2000) is actually a serious study, for a change. Like the Roberts group above (with which they cooperate, and Disbrow took part in both papers), they are not cognitive scientists, and their research can be best described as "comparative and developmental brain anatomy" (Web page). They compare activations in the somatosensory cortex between individuals, and present the data properly, i.e. in a way that lets the reader see the actual data. The somatosensory area is one of the "gross features" of the cortex, which was known from brain damage studies, and is determined by input into the cortex from extra-cortical sources, so it is one place where we should expect better reproducibility. The study does show that the activations cluster across individuals, but also that there is considerable variability ("somewhat variable" in the authors' words) across individuals. It is the lack of this kind of study for other features that were claimed by imagers which is the most convincing evidence for the general lousiness of cognitive brain imaging.

[5 Aug 2003 ] In this article (Hasson et al, Neuron, Vol. 37, 1027-1041, March 27, 2003) they show their results from 5 different subjects in Figure 2, and they are clearly wildly divergent. Nevertheless, they go on to average over all the subjects, as if it actually makes sense (see discussion in the next section).

[ 13 Aug 2003 ] This paper (Dehaene et al, Cognitive Neuropsychology, 2003, V. 20, pp. 487-506) compares data from several imaging studies of number processing (using different tasks), and claims "high consistency of activations" (legend of Figure 1). However, in the five locations for which they show data in table 1 from more than one study, all of them spread over 2cm or more, which is a large distance in the cortex, and certainly cannot be regarded as reproduction of the result. Since the different studies did different experiments, it is not really an example of lack of reproducibility, but it is also not an example of reproducibility. In addition, of the 8 studies in table 1, six are from the laboratory of the authors themselves, so they are not actually independent. [ In a conference in the beginning of July 2003 I argued with some people about the reproducibility of imaging, and the only paper that they could quote as showing reproducibility was this one (which was still in press). ]

This Editorial (Reproducibility of Results and Dynamic Causal Modeling in fMRI: The New Perspectives in fMRI Research Award, John Darrell Van Horn, Journal of Cognitive Neuroscience, Vol. 15, Issue 7 - October 2003, pages 923-924) says:

If fMRI results are unreliable, how can they be used to construct theories of cognitive function?

Which suggests he is actually worried about replication. Later he also says:

Replications of neuroimaging studies have occurred only rarely in the literature (e.g., Chawla et al., 1999; Zarahn, Aguirre, & D'Esposito, 2000), presumably due to the economics of performing fMRI studies and the strong desire of researchers to make unique contributions to the field.

Even the two references that he brings are not real. Chawla et al, Neuroimage, 9, 508-515 don't present any data to show replication. For Zarahn et al, Brain Res Cogn Brain Res. 2000 Jan;9(1):1-17 I didn't find the full paper online, but the abstract makes it clear that at most they show replication with the same subjects and the same research group. It seems the author selected these two articles simply because the word "replication" appeared in their titles.

Notwithstanding these promising statements, the author (one of the authors of Miller et al below) is not worried at all about replicability. He just introduces two articles that introduce new methods for analysing the data, one of which is called "Reproducibility Maps" (see here). However, this "Reproducibility" is of voxels within the same study, i.e. it is not reproducibility in the normal scientific sense of the word, which needs to be reproduction across research groups. Maybe in principle this method can be applied across studies, but it is certainly not the way it is used in the article. There is also a commentary, which also mentions "Reproducibility" in its title, but it is the same "voxel Reproducibility" rather than cross-research-centers Reproducibility.

The article about "Reproducibility Maps" (Liou et al, Journal of Cognitive Neuroscience, V 15 No 7 Page: 935-945, doi) starts with this sentence:

Historically, reproducibility has been the sine qua non of experimental findings that are considered to be scientifically useful.

and then goes on to discuss the "voxel reproducibility". It seems that these people have heard about reproducibility and its importance, but didn't really understand the concept (more likely, they just pretend).

It may be regarded as good news that these people at least pretend to be worried about reproducibility, but this kind of writing only creates a smokescreen, because it gives the impression that reproducibility (in the sense of reproducing results across research centers) is being discussed when it isn't. [29 Apr 2004] The latest Nature Neuroscience contains another demonstration (first author the same as above). In this review (Sharing neuroimaging studies of human cognition, Van Horn et al, Nature Neuroscience 7, 473 - 481 (2004), doi) they say:

The reproducibility of fMRI-related effects in previously published data has also been explored.

And then go on to discuss the article about "Reproducibility Maps".

Here is an abstract that claims to test reproducibility across research centers and find very consistent results. I couldn't find any full article associated with this, so it is probably unsubstantiated waffle rather than a real result.

This article (Tegeler et al, Human Brain Mapping 7:267-283(1999)) really tests reproducibility, but of whole images, rather than peaks. Since the data in imaging studies are the peaks, and whole-image reproducibility is not related to peak reproducibility, this is actually irrelevant.

[7 Nov 2003] It should also be noted that by now there are quite a few articles that claim to achieve reproduction within the same subject or group of subjects (typically referred to as "intrasubject", "intra-subject", "test-retest", "inter-session"). While these studies are interesting, they cannot tell us anything about cross-research-center reproducibility, or even cross-subject reproducibility.

This article (Intersubject Synchronization of Cortical Activity During Natural Vision, Hasson et al, Science, Vol 303, Issue 5664, 1634-1640, 12 March 2004) is very misleading, because it falsely gives a strong impression of similarity of activity across subjects. Their main claim is that they "found a striking level of voxel-by-voxel synchronization between individuals" (abstract). In the text they say: "Thus, on average over 29% ± 10 SD of the cortical surface showed a highly significant intersubject correlation during the movie (Fig. 1A)."

First, it should be noted that high significance of correlation is not the same as high correlation. Even a very low correlation can give high significance if you have enough data points, and Hasson et al have many data points. They do not tell us the actual level of correlation, even though it is the more useful piece of information. This may be because the level is low, or because they don't understand the distinction. They do show correlation levels in Fig 5c, but there it is between regions, rather than intersubject.
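The distinction between significance and level of correlation can be illustrated with a few lines of code. This is only a minimal sketch: the correlation of 0.1 and the numbers of data points are hypothetical and are not taken from Hasson et al, the function name is mine, and the p-value uses a normal approximation to the t distribution, which is reasonable only for large n.

```python
import math

def correlation_p_value(r: float, n: int) -> float:
    """Approximate two-sided p-value for a Pearson correlation r from n points.

    Uses t = r * sqrt((n - 2) / (1 - r^2)) and a normal approximation to the
    t distribution (an assumption that is reasonable only when n is large).
    """
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return math.erfc(abs(t) / math.sqrt(2.0))

# Hypothetical values for illustration only.
r = 0.1  # a weak correlation: it explains r^2 = 1% of the variance
for n in (100, 1000, 10000):
    p = correlation_p_value(r, n)
    print(f"r = {r}, r^2 = {r * r:.2f}, n = {n}: p ~ {p:.3g}")
```

The correlation stays the same in all three lines, and only the p-value shrinks as n grows, which is why a "highly significant" correlation by itself says little about how similar the activity actually is.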

Much more important, though, is the fact that the correlations differ from one pair of subjects to another. From Figures S1 and S3 it is clear that apart from a strong tendency towards the back of the head (i.e. the visual areas), there is no similarity between the pair-wise correlations. In other words, if a voxel is correlated between two subjects, it doesn't tell us if it is correlated between these subjects and any of the rest of the subjects, or among the rest of the subjects. If anything, this is a demonstration of variability across individuals, rather than similarity (except the back-head tendency). However, only readers who make the effort to load the supplementary material can see it.

The fact that the data shows variability rather than similarity clearly undermines all the hype of this paper, of which there is quite a lot.

This paper also uses the usual trick of "localizing scans" to find patches sensitive to faces and buildings, and then shows that they are sensitive to faces and buildings. There is some novelty in that they show that for the same subjects these patches are the same for static images and a movie. However, these patches are obviously different across individuals (they don't even bother to compare them), but by including the discussion inside a paper about "intersubject synchronization" they give the impression that they are the same across individuals. Since these patches are irrelevant to the main point of the article ("synchronization"), it looks like this is done intentionally.

This article (Machielsen et al, fMRI of Visual Encoding: Reproducibility of Activation, Human Brain Mapping 9:156-164(2000)) says that "We performed an fMRI study involving within and between-subject reproducibility during encoding of complex visual pictures.", but I don't see a between-subject comparison of peaks of activity. They find large variability within subjects.

In this article (Vandenbroucke et al, Interindividual differences of medial temporal lobe activation during encoding in an elderly population studied by fMRI, NeuroImage V. 21, 1, January 2004, Pages 173-180 doi) they genuinely look at differences across individuals in the MTL (which is not cortex), and find a lot. Since they were looking at an "elderly population", they conclude that in old age averaging is not reliable. Somehow it completely escaped their attention that it may be unreliable in a young population too. (I don't actually know how reproducible the results are in the MTL, in general).

[ 4 Oct 2004] In this article (Murphy and Garavan, NeuroImage Volume 22, Issue 2 , June 2004, Pages 879-885) they actually test reproducibility, between two groups of 20 subjects, and say (in the abstract):

These analyses revealed that although the voxelwise overlap may be poor, the locations of activated areas provide some optimism for studies with typical sample sizes. With n = 20 in each of two groups, it was found that the centres-of-mass for 80% of activated areas fell within 25 mm of each other.

This is more or less what you would expect from a random distribution, so it is not a reason for any optimism whatsoever. Obviously these authors simply cannot conceive of the possibility that the data is irreproducible.

This article (Marshall et al, Radiology 2004;233:868-877) (if link is broken try here), explicitly states in the CONCLUSION:

The generally poor quantitative task repeatability highlights the need for further methodologic developments before much reliance can be placed on functional MR imaging results of single-session experiments.

Unfortunately, the way they say it may leave non-experts with the impression that there is no problem with multiple-session studies, and that there is a significant number of such studies. In addition, the fact that they used old subjects (69 years old) leaves it open to claims that it is the age that causes the lack of replicability. But it is definitely a start.

In the abstract (materials and method), they say: "Within-session, between-session, and between-subject variability was assessed by using analysis of variance testing of activation amplitude and extent." But in the article itself they don't discuss between-subject variability at all, only within-subject variability.

[25 Jul 2012] There are now several articles that claim to show inter-subject correlations when viewing movies (from 2012, from 2010 and from 2008, all available full-text online, sharing two authors (Jääskeläinen and Sams)). They do quite a lot of mathematical analysis to reach their conclusions, and checking it requires quite deep analysis, which I didn't do. However, if the results are actually significant, they should reproduce across studies. In the above studies, they don't even try to compare the actual correlations across studies. Considering that they share authors, the most likely reason for this is that it is clear to them that they are different. That suggests quite strongly that we don't have anything significant here.

This one (2010) and this one (2010) are similar stuff from other groups. Again it is difficult to evaluate the significance without deep analysis, and there is no effort to compare to other studies. There is here (Trends in Cognitive Science, article in press 2012, "Brain-to-brain coupling: a mechanism for creating and sharing a social world") an opinion article by authors of these two articles discussing the issue, and comparison of correlations across studies is not mentioned, even in their "Questions for future research", which does not promise much. This one (2011) by another group (referred to by the opinion article) is not better.

2.10 A comment from a reviewer

[15 July 2005] In this review article (See me, hear me, touch me: multisensory integration in lateral occipital-temporal cortex, Michael S Beauchamp, Current Opinion in Neurobiology Volume 15, Issue 2 , April 2005, Pages 145-153 ), the author says:

Another important issue is the high degree of inter-subject and -laboratory variability observed in fMRI studies.

So he is aware of the "high degree of variability", but it doesn't seem to worry him too much. He then says:

STS, LO and MT are attractive targets for a review, because there is some consensus about their anatomical locations.

(STS, LO and MT are names of regions of the cortex).
He seems to think that there isn't even "some consensus" about the anatomical location of other regions, which is further than I would go. All this, however, does not prevent him from using fMRI. He seems to be completely unaware of the importance of reproducibility.

2.11 Maybe a real effort to cope with the issue

[26 Apr 2006]

In this press release (UCI Receives Major Grant to Help Create National Methods and Standards for Functional Brain Imaging, 13 Mar 2006), they report about a grant (of 24.3M$) to "... standardize functional magnetic resonance imaging and help make large-scale studies on brain disease and illness possible for the first time".

In the release the director of the receiving consortium says:

There are around 7,000 locations in the U.S. right now doing MRI scanning, yet there is no way to combine the data in a useful way from these sites,

The release then says that "Although brain imaging technology has generated remarkable progress in understanding how mental and neurological diseases develop, it has been nearly impossible for one laboratory to share and compare findings with other labs". It does not explain how it is possible to progress in understanding anything if laboratories cannot share and compare their findings.

Obviously these people know that there is a problem, and it seems that they genuinely try to correct it. It seems though that they still believe that the issue is mainly variability between sites and hardware, and they don't realize that there is the issue of reproducibility across individuals. But it is certainly some progress to at least try to reproduce across sites.

Later in the release it says:

FBIRN established for the first time that the difference between MRI scanners and techniques across centers is so great that the value of multi-site studies is undermined,

Obviously, the "difference .. across centers" undermines the value of all studies, not only multi-site studies. It is an interesting question whether these people really don't understand it, or are just being careful not to offend too many researchers.

[17 Oct 2007] This didn't get that far. From reading their publications on this page, they concentrated on cross-site variations, and did not deal with cross-subject variations at all. Cross-site variation is also interesting to some extent, but without data on cross-subject variations it doesn't tell us much.

2.12 Variability due to analysis method

[28 Apr 2006]

In this issue of Human Brain Mapping (Volume 27, Issue 5 (May 2006)) they discuss reproducibility, but what they mean is reproducibility of the results from the same data when it is analysed by different methods. They seem to think that the methods of analysis introduce quite a lot of variability.

It is interesting that they spend that amount of effort on looking for reproducibility across methods, but don't do the same for reproducibility using the same method across individuals and sites. Obviously the latter is the important question, but discussing it will lead to awkward answers.

2.13 A Review of Functional Imaging Studies on Category Specificity

[27 Apr 2007] In this review (Christian Gerlach, Journal of Cognitive Neuroscience. 2007;19:296-314) it says in the abstract: "Not a single area is consistently activated for a given category across all studies." But it does not occur to him that this means that there is a question of reproducibility that needs to be addressed.

2.14 A real effort to test reproducibility (but even the authors don't take it seriously)

[6 Nov 2007]

This article (Bertrand Thirion, Philippe Pinel, Sébastien Mériaux, Alexis Roche, Stanislas Dehaene, and Jean-Baptiste Poline. Analysis of a large fMRI cohort: Statistical and methodological issues for group analyses. Neuroimage, 35(1):105--120, March 2007; Full text) tests reproducibility by using a large number of subjects, splitting them into groups and comparing between the groups. Their results clearly invalidate the vast majority of studies until now, if not all of them. Most clearly, they conclude that you need at least 20 subjects to get a reliable result, and most studies use a much smaller number.

It should be noted that this is based on very simple tasks, and is unlikely to generalize to more complex tasks, which are likely to be more variable.

Interestingly, the authors themselves do not seem to take their results seriously, and they have published many studies with far fewer subjects. To check if they still do, I looked up the publications page of one of the authors, Stanislas Dehaene, and downloaded the two articles that were listed before and after the article above. The first was done with 9 subjects, the second was done with 12 subjects. Obviously, the study above shows that their results in the two latter studies are unreliable, but that doesn't bother them.

2.15 An article claiming "reproducible across subjects"

[21 Mar 2012]

This article claims in the abstract that their results are "reproducible across subjects". I haven't seen the full text yet.

In this blog entry it says they use three subjects, so it doesn't sound promising.

The article is critical of other fMRI studies, and claims in the abstract to "challenge that view [localization view of brain function]".

2.16 Giving up on reproducing and trying something else

[27 Mar 2012] In this article (What Makes Different People's Representations Alike: Neural Similarity Space Solves the Problem of Across-subject fMRI Decoding, Rajeev D. S. Raizada and Andrew C. Connolly 2012; Full text from an author's web page) they figured out that you cannot reproduce activity across subjects. Instead they look at what they call "similarity space".

They say things like "However, just as the literal fingerprints on people's hands are idiosyncratic to individuals, the “neural fingerprints” of representations in their brains may also be subject-unique. Indeed, this has found to be the case." (bottom of first column). However, they don't state explicitly that previous "results" of fMRI studies are not reproducible.

Their technique achieves something, and in principle it can work on top of stochastic connectivity, because similarity between patterns of activity of concepts can be a result of learning. Their results, however, are not as promising as they pretend. The concepts that they use are very very very gross compared to the subtleties of human thinking, and they are already not 100% accurate. It seems unlikely that it will work for more fine-grained concepts.

In addition, the method as currently described suffers from combinatorial explosion. The authors discuss it and mention a further study in which they develop heuristics to deal with 92 concepts. It remains to be seen how well these heuristics work, and whether they can be scaled to larger numbers.

Without scaling to fine-grained, and hence numerous, concepts, it is not obvious that this technique will be of any use in understanding the underlying mechanisms. Nevertheless, it produces results with some probability of being reproducible, which is a significant improvement.

By the way, the term "connect" or any of its derivatives does not appear in this article, which is a little surprising considering that the underlying reason for any result they have must be something to do with the actual connectivity.

2.17 An editorial in Nature Neuroscience

[19 Sep 2017]

Nature Neuroscience has published in Feb 2017 an editorial, following a paper that shows some serious problems in the analysis of fMRI: Fostering reproducible fMRI research. But they completely miss the point. I discuss this in detail, because it is a typical example of the way the field in general "deals" with the issue of reproducibility.

The main problem with the editorial is what is missing, which is a discussion of comparison of data between independent researchers. In fMRI research, you start with raw data coming from the machine itself (which is already processed to some extent). You then analyse this raw data to get high-level data, like "region X is active in situation Y". The high-level data is obviously the interesting result. Thus "reproducible fMRI" must be reproduction of the high-level data.

In principle, if the raw data is reproducible and the analysis is consistent, the high-level data will be reproducible too. But it is also possible for the raw data to be different from various experimental reasons (like different machines, different stimuli) and still be analysed to produce reproducible high-level data. So to know if fMRI research is reproducible in an interesting way, we must either check that the high-level data is reproducible, by comparing it across research groups, or check that the raw data is reproducible and that the analysis is consistent.

The editorial completely ignores the question of reproducibility of the data, either the raw data or the high-level data. All the steps that they take and recommend are about improving the analysis. It stands out the most when they write:

We also encourage researchers to deposit their data sets in recommended data repositories (http://www.nature.com/sdata/policies/repositories) so that they can be aggregated for large-scale analyses across studies, potentially improving the statistical power and robustness of any conclusions that may arise from these analyses.

Obviously, such data repositories can be used to compare the data across research groups, and hence check for reproducibility, but that possibility did not occur to the writers. All they can think of is aggregating the data.

This is quite representative of the current state in the field. They do talk about reproducibility, but only consider ways of improving the analysis. The reproducibility of the data is not discussed at all (because they know it is not reproducible and don't want to admit it).

It is worth noting that comparing high-level data across studies like I did in 1997 is quite easy. The reason they don't do it is because they know the results, and they are negative as they were in 1997.

Apart from failing to discuss comparison of data, the editorial is a biased and dishonest argument that tries to defend fMRI research. After some introduction, they mention the study that raised methodological issues and a previous study that raised questions of reproducibility, and then write:

Unfortunately, such reports have unintentionally harmed this technique's reputation and called into question the merit of published fMRI research.

This sentence could have been written: "Such reports raise doubts about reproducibility of fMRI research," but it was not. Instead, it is "Unfortunately", and "unintentionally", and "harmed this technique's reputation", all terms with negative connotations, obviously with the intention of associating these connotations with "such reports". They continue:

Are these criticisms warranted and, even if the answer is 'no', how can the scientific community address the negative connotations associated with this research?

Whether these criticisms are warranted is an interesting question, but they are not actually discussing it in the rest of the editorial. Instead, they imply that the answer is 'no', and go on to worry about the "negative connotations" associated with "this research" (not obvious if this refers to fMRI research or the critical articles). They continue:

Even with the innumerable parameters that may differ between individual fMRI studies-study and task designs, scanner protocols, subject sampling, image preprocessing and analysis approaches, choice of statistical tests and thresholds, and correction for multiple comparisons, to name a few-many findings are reliably reproduced across labs. For example, the brain regions associated with valuation, affect regulation, motor control, sensory processing, cognitive control and decision-making show concordance across different fMRI studies in humans; these findings have also been supported by animal research drawing on more invasive and direct measures.

That would be a plausible defence of fMRI research if the bulk of fMRI research was about mapping these regions, and the passage implies that this is the case. But that is false, because fMRI is used for much more detailed studies (because mapping these areas doesn't tell you much, and it can be done by other methods). These more detailed studies come with more detailed data, which is not reproducible.

The editorial doesn't consider the more detailed data, and instead continues:

These converging results should be highlighted in commentaries regarding research reproducibility, and critiques should be constructively balanced with potential solutions.

That is just ridiculous. They are not discussing young children that require encouragement and guidance. They are discussing the work of supposedly serious scientists, who should expect non-"constructively balanced" critiques and come up with their own solutions. Really the message here is: "Don't talk about reproducibility without pretending that, in general, fMRI produces reproducible results" (because it "unintentionally harmed this technique's reputation and called into question the merit of published fMRI research."). They continue:

In doing so, these critiques can provide an opportunity to revisit methods and highlight caveats, allowing the neuroimaging community to refine their methodological and analytical approaches and adopt practices that ultimately lead to more robust and reproducible results (http://www.ohbmbrainmappingblog.com/blog/keep-calm-and-scan-on).

And what about rejecting spurious results? Rejecting spurious results is what makes science different from other approaches to knowledge, which is why the main method of doing it, i.e. reproducible research, is so important. But the editors of Nature Neuroscience manage to forget it here.

The link is to another response to the paper that provoked the editorial, a pretty technical blog post by the "Organization for Human Brain Mapping" which is all about analysis, without any consideration of checking reproducibility by comparing data across research groups.

This blog has a link to an "OHBM COBIDAS Report" (whatever that means) titled "Best Practices in Data Analysis and Sharing in Neuroimaging using MRI". This is long and technical. They define "replication" and "replicability" to mean what I mean when I write "reproduction" and "reproducibility", and define "reproduction" to mean what I would call "re-analysis". However, checking replication by comparing data appears in this document only in an abstract sense and in references to other fields. They don't discuss how to actually do it, who should do it, and how to encourage doing it, even though in the "scope" they write: "Hence while this entire work is about maximizing replicability,...".

2.18 "Small sample sizes reduce the replicability of task-based fMRI studies" - actual progress

[29 Dec 2018]

Here (Small sample sizes reduce the replicability of task-based fMRI studies: Benjamin O. Turner, Erick J. Paul, Michael B. Miller & Aron K. Barbey, 7 Jun 2018) is an article where they actually check and conclude that you need more than 100 subjects to get really reproducible results. They do it by splitting the subjects in each study into two groups, and checking if the results are reproducible between the groups. As they point out, that means that the replicability is between measurements on the same scanner which are processed in the same way, so it is the best case, and the replicability between different research groups should actually be even worse.

This would invalidate almost all of the fMRI literature, and the authors try to kind of hide this by using mild language. They keep talking about "modest replicability", for example in the abstract:

We find that the degree of replicability for typical sample sizes is modest and that sample sizes much larger than typical (e.g., N=100) produce results that fall well short of perfectly replicable.

Any two random patterns can have "modest replicability", all you need to do is to ignore the bits that are different between them. So "modest replicability" really means no replicability, but it does not sound as bad. They suggest doing research with larger groups, and some other ideas, without explicitly stating that the results until now are useless.

It is difficult to blame the authors for not being more explicit about the uselessness of existing research, because they probably wouldn't be able to publish the paper otherwise, but it does reduce the impact of what they say. Nevertheless, it is going in the right direction.

The journal in which it is published, Communications Biology, is a predatory journal (i.e. where the authors pay), so the quality of the research is not obvious.

2.19 "False-positive neuroimaging" - kind of actual progress

[21 Jul 2019]

A new article (doi) in BioRxiv now reports a survey of 135 functional Magnetic Resonance Imaging studies in which researchers claimed replications of previous findings. They found that 42.2% did not report coordinates at all, and that of those that did, 42.9% had peaks more than 15 mm away. In other words, the majority of the replication claims were bogus.
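The "majority" follows from combining the two percentages. Here is a minimal sketch of the arithmetic, assuming (as I do in this section) that a claim without coordinates and a claim with peaks more than 15 mm away both count as unsupported; the variable names are mine.

```python
# Fractions quoted from the survey.
no_coordinates = 0.422               # claims that report no coordinates at all
far_given_coordinates = 0.429        # of the rest, peaks more than 15 mm away

# Fraction of all claims that report coordinates but have distant peaks.
far_peaks = (1.0 - no_coordinates) * far_given_coordinates

# Assumption: both kinds of claim are counted as unsupported.
unsupported = no_coordinates + far_peaks
print(f"unsupported replication claims: {unsupported:.1%}")
```

Under this assumption roughly two thirds of the claims are unsupported, and about one third remain that are not obviously bogus.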

It can be argued that this is an improvement from my survey where I found 0% replication. However, I surveyed all the papers that I could find about fMRI, not only ones that claim replication. Considering that they considered only claims of replication, it is not obviously an improvement at all.

The fact that more than half of the replication claims are bogus tells us that researchers now know that they need to at least pretend replication, but in most cases they don't succeed in doing it, so they make bogus claims instead. The less than half of the cases where the claims are not obviously bogus don't tell us much either, because by now there are many imaging results, and authors can select which of these results their latest experiment happens to replicate (there is nothing about pre-registering in this survey). Thus we don't know if any of these is real.

On top of this, researchers will obviously not try to replicate results that they already know are not replicable. Thus these replication efforts were:

Applied only to the fraction of results that researchers believed they could replicate.
Contain only the cases where the authors decided to publish.
More than half are bogus.
We don't know how many of the non-obviously bogus ones are real replication rather than the authors just picking some previous result that looked similar.

Together, all these factors make the survey most compatible with a rate of replicability of a few percent at most, and also compatible with 0% replication.

Thus the progress that this shows is that:

Researchers now do think they should show replication.
It is possible that some results are replicable.
It is possible to actually publish such a paper.

3. Averaging of images

In many cases (almost all of published research in PET and MRI) the results are averaged over several subjects, to get significant results. Obviously, if there are variations between individuals, this procedure will generate noise. However, most of the researchers in the field simply ignore the possibility of variations between individuals, and hence feel free to use averaging to enhance the significance of the results.

Some researchers in the field understand that averaging necessarily gives more significant results, but not all of them. For example, in a collection of tutorial essays (Posner, M. I., Grossenbacher, P. G., & Compton, P. W. (1994). Visual attention. In M. Farah & G. Ratcliff (Eds.), The neuropsychology of high-level vision: Collected tutorial essays (pp. 217-239)), Posner et al (1994) say (P. 220):

It is remarkable that at the millimeter range of precision most studies have shown it possible to sum activations over subjects who perform in the same tasks in order to obtain significance. This suggests that even high-level semantic tasks have considerable anatomical specificity in different subjects and is perhaps the most important results of the PET work.

This, of course, is plain nonsense. In case that is not obvious, here is a hypothetical example. Assume the following for some experiment (the numbers are selected so all the results are integers, but they are typical):

The 'noise' in the activity (standard deviation of activity) of the brain in the control condition is 18X, where X is some fraction of the average activity in the control condition.
The task in the experiment causes a set of half of the pixels in the image of each individual subject to increase their activity by around 18X, as compared to the control condition. This set is the same between experiments on the same individual subject, but there is no correlation between the sets of 'active' pixels in the brains of different subjects.
The researchers take as significant only pixels where the activity in the task is higher than the activity in the control condition by more than two standard deviations.

In this case, none of the pixels in any of the subjects in the experiment will show a significant (over 36X) increase in the task. However, if the researchers average over 9 subjects, for example, then the level of noise goes down by a factor of 3 (the square root of 9), to 6X. Since each pixel is active in half of the subjects on average, the average increase in activity in the task (compared to control) is 9X (compared to the 18X increase in those pixels that are active in each individual). However, this increase will be distributed unevenly, and since we assume no correlation between subjects, the distribution will be binomial. On average, out of every 512 pixels, there are going to be:



Number of    Subjects showing        Average
pixels       increase of 18X         increase

    1             9                   18X
    9             8                   16X
   36             7                   14X
   84             6                   12X
  126             5                   10X
  126             4                    8X
   84             3                    6X
   36             2                    4X
    9             1                    2X
    1             0                    0X

Since the averaged noise is 6X, any pixel with average increase of activity of more than 12X will be regarded as having 'significant change'. Thus, after averaging, around 9% ([1 + 9 + 36]/512) of the pixels will become 'significantly active' in the task.
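The numbers in this hypothetical example can be checked with a short script. This is only a sketch of the example above: the variable names are mine, activity is measured in units of X, and, like the example, it follows only the expected increase in each pixel and does not add the noise itself.

```python
from math import comb, sqrt

N_SUBJECTS = 9
SINGLE_SUBJECT_NOISE = 18.0   # standard deviation in one subject, in units of X
ACTIVE_INCREASE = 18.0        # increase in an 'active' pixel, in units of X

averaged_noise = SINGLE_SUBJECT_NOISE / sqrt(N_SUBJECTS)   # 6X
threshold = 2.0 * averaged_noise                           # 12X

# Each pixel is active in a given subject with probability 1/2, independently,
# so out of 2**9 = 512 pixels comb(9, k) are active in exactly k subjects.
total_pixels = 2 ** N_SUBJECTS
significant = 0
print("pixels  subjects  average increase")
for k in range(N_SUBJECTS, -1, -1):
    pixels = comb(N_SUBJECTS, k)
    average_increase = k * ACTIVE_INCREASE / N_SUBJECTS     # 2k, in units of X
    if average_increase > threshold:
        significant += pixels
    print(f"{pixels:6d}  {k:8d}  {average_increase:5.0f}X")

fraction = significant / total_pixels
print(f"'significant' pixels: {significant}/{total_pixels} = {fraction:.1%}")
```

Changing N_SUBJECTS or the factor of 2 in the threshold shows how easily the fraction of 'significant' pixels moves, even though nothing in the simulated brains is shared between subjects.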

Note that this happens even though we assume that there is no correlation at all between the subjects. Any correlation that does exist between brains (e.g. visual input in the back of the cortex), even if very weak, will increase the tendency of averaging to generate 'significant results' out of random noise.

The reason for these 'ghost results' is that while the averages of activity and their standard deviations go down on averaging, the extremes of the distribution of activities do not go down on averaging, and it is these extremes that appear as significant results.

Note that the threshold of what is 'significant' is a free variable, which can be adjusted freely to generate the 'best result' out of data. It is therefore possible to generate 'significant results' in almost any situation.

The only way to check if these 'significant results' are real is to compare them to other studies, either with the same technique or with another technique. If the result is reliably reproducible, it is unlikely to be random. The problem with this, of course, is that the results of cognitive imaging studies are not reliably reproducible, as discussed above.

4. Random-effect analysis

[6Jun2001]

Random-effect analysis is an effort to judge the significance of an effect by comparing its magnitude to the variance between the subjects. In cognitive brain imaging, the effect is the difference of activity in each pixel between two (or more) conditions. The underlying assumption is that the subjects' distribution of differences will reflect the distribution in the general population, even if their mean is different because of sampling error. Therefore, this analysis can, in principle, reject spurious random effects that are just sampling errors from a variable population. One of the reviewers of my replicability paper used it as the main counter-argument.

Obviously, the first point to note is that random-effect analysis has not yet produced any reproducible result in the cortex. It is thus not actually improving the results. At most, it may have reduced the rise in the number of published papers.

Assuming the underlying data is indeed irreproducible, either because of the resolution problem or because of variability between individuals, the reason for lack of reproducibility is clear. The question is why random-effect analysis doesn't reject all the papers.

The first possibility, which may explain most of the cases, is that random-effect analysis is either not performed, or performed wrongly. The actual procedure of doing the random-effect analysis makes the analysis more complex, and hence adds possibilities of doing things wrongly.

Another reason for the failure of random-effect analysis to block noise papers is that the analysis is based on the assumption that the distribution of the magnitude of the effect in the sample is the same as in the general population (i.e. has similar variance). That is (approximately) correct if the mean is different between the sample and the population because the sample contains several large outliers, and random-effect analysis can easily eliminate these cases. However, it can go wrong in two cases:

(1) Pixels where the variance is too large in the sample and which are rejected even though they shouldn't be. This creates many false negatives, which is an especially serious problem, because most of the researchers are unaware that this is a problem.
(2) Pixels where the mean is different because the sample contains individuals from only one side of the distribution of the population, which also reduces the variance, and therefore the random-effect analysis may fail to reject the result.

The "advertised" way over both of these problems is to use a large number of subjects, which should take both the distribution and the mean of the sample closer to the values in the population. Note that in case (2) above, i.e. for pixels in which the sample's variance is too low, increasing the number of subjects will cause an increase in the variance of the sample. For most of the pixels, it will cause a decrease in the variance in each pixel, and therefore the analysis will reject fewer of them. Thus the number of accepted pixels will increase.

One problem with this solution is that it does mean using many subjects, which makes studies more difficult. The main problem, though, is that it works properly only if the background activity is flat, i.e. if the only real difference between the conditions is some pixels becoming more/less active, and the rest of the pixels stay at the same level of activity, with only random fluctuations. Obviously, this is unlikely to be true, and there are variations in the mean level of activity of regions in the cortex between conditions, and that makes the analysis much messier.

For simplicity, let's assume the only real difference in some region between two conditions (A and B) is that the mean activity of the pixels in this region increases in condition B. In this case, any pattern of activity that will come out of an experiment is noise, rather than real result, so we need to consider if random-effect analysis can prevent getting a result, i.e. patterns of activity, in this case. Consider three cases:

(1) The increase of the mean activity between the conditions is large compared to the variance in the population. In this case, if we had used all the population as a sample, almost all the pixels should show a significant difference, and in the case of a population of infinite size all of them will (provided the increase is above the resolution of the equipment). For a small number of subjects in the experiment, the variance can be too large, and hence most of the pixels are rejected, thus leaving a random pattern of accepted pixels.

In this case, increasing the number of subjects in the sample will decrease the variances of the pixels, thus increasing the number of accepted pixels. With enough subjects, all the pixels will be accepted, and no pattern will emerge. However, the number of subjects that is required to reach this state is dependent on the increase between the conditions, so the researchers need to measure this increase and calculate the number of subjects that is required accordingly. I don't think anybody ever did this.

On the other hand, the researchers have a way of decreasing the number of accepted pixels: they can increase the threshold for acceptance. Since researchers like a nice result, which in this case means only a small part of the pixels being accepted, they can increase the threshold until they get a nice result. Normally increasing the threshold increases the reliability of the result, so this would not be regarded as a bad practice. However, in this case, when the problem is that most of the pixels are false negatives, increasing the threshold makes the result worse by making more false negatives, and allows the researchers to generate patterns where there aren't any.

(2) The increase is similar in magnitude to the variance in the population. In this case it is not obvious what the result actually should be, and even using all the population as a sample will still give a random pattern. Only a much larger sample than the whole population will actually give a consistent result, because it will have smaller variation and show all pixels active as in the previous case. In this case, any number of subjects will give some pattern.

(3) The increase is much smaller than the variance in the population. In this case, all pixels should be rejected, and using all the population as a sample they will be. Increasing the number of subjects will reject more pixels, thus getting closer to the correct result. However, how many subjects will be required is difficult to figure out: since when the increase is 0 this number is small, and when it is the same size as the population variance (the previous case) it is much larger than the number of people in the population, it is clear that the required number of subjects is very strongly dependent on the size of the increase. Thus it requires a very accurate measurement of the increase to know the number of subjects that is required. I don't think anybody tried to do this.

Thus if the average activity in a region is not flat across the conditions, random-effect analysis will generate random patterns of activity. Since it is unlikely that the average is completely flat between conditions over the entire cortex, random-effect analysis cannot reliably reject false results.
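The first of the three cases can be illustrated with a small simulation. This is a sketch under loud assumptions, not a model of any real analysis package: every pixel gets the same true increase, the between-subject noise is Gaussian and independent between pixels, a plain one-sample t statistic stands in for the random-effect test, and all the numbers (increase of 2, spread of 1, threshold of 5, 4 or 40 subjects, 2000 pixels) are arbitrary values chosen for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

def accepted_pixels(n_subjects, n_pixels, shift, sd, t_threshold, rng):
    """Return the set of pixel indices whose one-sample t statistic exceeds the threshold."""
    accepted = set()
    for pixel in range(n_pixels):
        # Difference (condition B minus condition A) for each subject:
        # the same true shift in every pixel, plus between-subject noise.
        diffs = [rng.gauss(shift, sd) for _ in range(n_subjects)]
        t = mean(diffs) / (stdev(diffs) / sqrt(n_subjects))
        if t > t_threshold:
            accepted.add(pixel)
    return accepted

rng = random.Random(1)   # fixed seed so the run is repeatable
N_PIXELS = 2000
PARAMS = dict(n_pixels=N_PIXELS, shift=2.0, sd=1.0, t_threshold=5.0, rng=rng)

# Two independent "studies" with few subjects, and one with many subjects.
small_a = accepted_pixels(4, **PARAMS)
small_b = accepted_pixels(4, **PARAMS)
large = accepted_pixels(40, **PARAMS)

frac_small_a = len(small_a) / N_PIXELS
frac_small_b = len(small_b) / N_PIXELS
frac_large = len(large) / N_PIXELS
overlap = len(small_a & small_b) / max(1, len(small_a | small_b))

print(f"4 subjects, study A: {frac_small_a:.0%} of pixels accepted")
print(f"4 subjects, study B: {frac_small_b:.0%} of pixels accepted")
print(f"overlap between A and B: {overlap:.0%} of the pixels accepted by either")
print(f"40 subjects: {frac_large:.0%} of pixels accepted")
```

With few subjects each simulated study accepts only part of the pixels, and a different part each time, even though the true increase is identical everywhere. With many subjects nearly every pixel is accepted and the pattern disappears. Raising `t_threshold` brings a "pattern" back.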

It is worth noting that when random-effect analysis is used in other fields, it is normally used to test whether some manipulation has an effect, and in this case it is valid. In imaging, where what is looked for are peaks of activity, that is not good enough, because the effect in a voxel may be a result of a change in all the area plus noise in this voxel.

5. What experiments should be done in CBI

As my paper shows, current CBI studies are irreplicable. Thus the first target should be establishing the replicability of CBI. As discussed in [1] above, CBI suffers two main problems: resolution and variability across individuals. Hence a useful step would be to decouple these problems.

It is not obvious what can be done about the resolution of fMRI and PET, but the variability across individuals can be dealt with by repeating studies on the same individual. As I found in my paper, this is not done currently.

The basic design of an experiment to test the CBI method would be something like this: First, find the pattern of activity in the brain of one individual that is associated with some task or concept. Then repeat the same experiment (ideally by another research group) with the same individual, and see if the result is the same.

What is 'the same' in this context is not obvious. A possible criterion is whether these patterns can be used to identify the task or concept that the individual is thinking about. For example, in the first stage the researchers would find the patterns associated with thinking about cow, sheep, table and chair. Then the experiment would be repeated, and the researchers would see if using the results of the first experiment and only the patterns of activity from the second experiment they can identify what the individual was thinking about in the second experiment.
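One way to make this criterion concrete is sketched below. It is only an illustration of a possible scoring rule: the text above does not specify how patterns should be matched, so the choice of Pearson correlation, the function names, the number of voxels and the noise level are all my assumptions. The simulated data also assumes that a stable underlying pattern exists for each concept, which is exactly what the real experiment would have to establish.

```python
import random
from math import sqrt

def pearson(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def identify(pattern, templates):
    """Return the label of the first-experiment pattern most correlated with `pattern`."""
    return max(templates, key=lambda label: pearson(pattern, templates[label]))

def simulate_session(true_patterns, noise_sd, rng):
    """Hypothetical measurement: the underlying pattern plus independent noise per voxel."""
    return {label: [v + rng.gauss(0.0, noise_sd) for v in values]
            for label, values in true_patterns.items()}

rng = random.Random(0)   # fixed seed so the run is repeatable
CONCEPTS = ["cow", "sheep", "table", "chair"]
N_VOXELS = 100

# Assumed stable per-concept patterns for one individual (simulated, not real data).
true_patterns = {c: [rng.gauss(0.0, 1.0) for _ in range(N_VOXELS)] for c in CONCEPTS}

first_experiment = simulate_session(true_patterns, noise_sd=0.2, rng=rng)
second_experiment = simulate_session(true_patterns, noise_sd=0.2, rng=rng)

# Use only the first experiment's patterns to label the second experiment's patterns.
guesses = {c: identify(second_experiment[c], first_experiment) for c in CONCEPTS}
hits = sum(guesses[c] == c for c in CONCEPTS)
print(f"identified {hits} of {len(CONCEPTS)} concepts")
```

In a real test the two dictionaries would come from two separate scanning sessions of the same individual, and the interesting number is how far the hit rate is above chance (1 in 4 here).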

If the patterns are the same (at least good enough to identify what the individual thinks about), it tells us two important things:

Some of the patterns in the resolution of PET and fMRI are actually significant.
There are replicable patterns of activity associated with a concept in the brain. Most neuroscientists would take this for granted, but currently we don't have direct evidence for this, and the 'opportunistic system' suggestion from Bly & Kosslyn [2.4 above] would predict otherwise.

If the patterns are not the same, the conclusion is less clear. It may be either because of the lack of resolution or because of lack of replicable patterns of activity in the brain.

[8Feb2000] Just found this, which tries to measure replicability within the same subject(s) over time. They claim that there is replicability, but don't give enough information to judge if it is serious. I e-mailed the author to find if there are newer results. [27Jul2001: the link is dead.]

[28Nov2000] This one is interesting, because it shows in-subject comparison between imaging in 1.5T and 3T (T (Tesla) is the unit of the strength of the magnetic field). They look at activity in M1 and V1, which is most likely to be reproducible. The results (in Fig 4) show quite large differences even inside the same subject between the different fields. If these differences are typical, the experiment above will fail to show anything useful.

5.1 They now do these experiments.

Haxby et al (Science, p. 2425-2430, Vol 293, 28 Sep 2001) actually perform a similar experiment to the one I describe above. They show the subjects pictures from different categories, divide the data from each subject into two sets, and check if they can figure out which category a subject is looking at by comparing the pattern of activity to the patterns of activity found in the other set. They claim high success rates. Of course it still remains to be seen if this is reproducible.

The most interesting point about this article is their treatment of the question of similarity across subjects. Almost all of the previous brain imaging studies were based on the assumption that there are patterns of activity which are similar across subjects, and these patterns are what the researchers were looking for. This paper is based solely on within-subject comparison, and the authors do not assume or look for similar patterns across subjects. This is a quite different theoretical assumption. Haxby et al, however, do not make this point explicit, and a non-alert reader is unlikely to realize that the paper is based on a very different theoretical underpinning from previous studies.

The commentators in the same issue (Cohen and Tong, Science, p. 2405-2407, Vol 293, 28 Sep 2001) completely missed this point. In a quite long comment they discuss this article and another brain imaging study in the same issue. However, they completely ignore the similarity assumption, and don't mention at all that the Haxby et al study is based solely on within-subject comparison.

Haxby et al do mention the question of cross-subject comparison (p.2427), but claim that the warping methods (methods to account for differences in brains' 3D shapes) are inadequate. This is a ridiculous assertion, because the warping methods can easily cope with the resolution of their study (3.5mm). In addition, they could show the patterns that they show in Fig. 3 for more than one subject, to enable visual comparison, but they didn't. They leave it to the reader to work out that there simply aren't any cross-subject correlations in their data, though they do hint about it in their comment that single-unit recording in monkeys did not reveal larger scale topographic arrangement.

[ 26 Nov 2004] They now have a further study in this vein. This article (Hanson, Matsuka and Haxby, NeuroImage Volume 23, Issue 1, September 2004, Pages 156-166; probable full text) does a more serious analysis. In this paper they mention that "1) inter-subject variability in terms of object coding is high, notwithstanding that the VMT mask is relatively small" (bottom of p.14 of the pdf file), but avoid showing comparisons between individuals. This paper argues against the idea of a face area, but they don't mention the problems that I discuss in section 7 below.

[ 27 Apr 2004] Today I went to a lecture by Haxby, in which he discussed these patterns. At the end of the talk I asked him if he tried to compare the patterns across subjects. His answer was quite odd. He clearly didn't compare across individuals, and thought that it won't work because of the "high resolution", but then also said that they are working on better methods to try to align the patterns across individuals. He seems to both believe and not believe that the patterns can be matched across subjects. For the rest of the audience (Cambridge University department of psychology) the issue was of no interest.

5.2 Real progress?

[07Nov2002]

([15July2005] No.)

This is a new article in the Journal of Cognitive Neuroscience: Extensive Individual Differences in Brain Activations Associated with Episodic Retrieval are Reliable Over Time, Michael B. Miller, John Darrell Van Horn, George L. Wolford, Todd C. Handy, Monica Valsangkar-Smyth, Souheil Inati, Scott Grafton and Michael S. Gazzaniga, Journal of Cognitive Neuroscience. 2002;14:1200-1214. ([30Oct2003] dead link Abstract only here ).

They explicitly tested for variability between individuals in their group, and obviously found that there is a lot of variability. They also found that this variability is stable within individuals.

In their discussion, they try not to attack group studies (i.e. almost all studies until now, including their own studies), but do make it reasonably clear that their results actually invalidate these studies.

The principal author of this study (Gazzaniga) is quite a central figure in Cognitive Neuroscience, so there is some chance that this paper will not be simply ignored.

While the results of this study nicely agree with what I am saying, that doesn't mean that they are going to be reproducible. Until they are, this study cannot be regarded as supporting evidence. I actually e-mailed the authors suggesting that they ask another research center to reproduce the results with the same individuals. I got an interesting answer from the first author, which suggests that they are not going to pursue this line of research.

6. What fMRI actually measures

Logothetis et al (Neurophysiological investigation of the basis of the fMRI signal, Nature, 2001, V.412, p.150) tried to identify what causes the fMRI signal, i.e. what causes the haemodynamic responses that fMRI detects. This article was hyped to some extent as an important advance, but it doesn't promise much.

The first important point is that it was found that the fMRI signal correlates with the LFP (local field potential), but not (or not so well) with action potentials. Since action potentials are the way neurons transfer information from their body to the axon and hence to other neurons, this means that fMRI does not correlate with information processing.

The commentators on the article gloss over this point by writing about "local processing", referring to processes inside the dendrites. These, however, have no functional significance if they don't affect the firing of action potentials, so on their own they cannot be regarded as information processing, local or otherwise.

In the "news and views" about this article (Bold insights, Nature, 2001, V.412, p.120), Marcus E. Raichle infers from this article that: "For the neurophysiologists, who seek to understand the biochemical and biophysical processes underlying neural, the absence of action potentials must not be interpreted as the absence of information". He doesn't tell us how information processing can happen without action potentials. As far as I can tell, the sole argument is based on the 'religious' belief that fMRI must be associated with information processing. I E-mailed Raichle asking how information processing can happen without action potentials, but I would be surprised if I get an answer.

There are some claims that the LFP also represents the activity of small local neurons (i.e. non-pyramidal ones), for example Logothetis himself in The Underpinnings of the BOLD Functional Magnetic Resonance Imaging Signal, J. Neurosci. 2003 23: 3963-3971. That would give the LFP a little more credibility, but not much. The pyramidal cells are the majority of cells, and because of their size and huge number of synapses, they form an even larger part of the synapses in the cortex. If the LFP doesn't correlate with their activity, it still doesn't really correlate with information processing.

The other important point in Logothetis et al is that they show that the fMRI signal lags more than 2 seconds behind the neural response (that was actually more or less known before, but not measured directly). That means that it can be used to monitor only mental operations that cause patterns of activity which are fixed for seconds. Most of thinking proceeds much faster than that, so fMRI will never be useful for monitoring thinking. Only simple operations, like perceiving a fixed stimulus or repeatedly performing a simple motor task, can be monitored by fMRI. Note that the restriction is a result of the behaviour of the underlying variable of fMRI (haemodynamic response), and hence cannot be overcome by improving the technique.

[ 8 Dec 2003] There are now claims that they can measure the magnetic field that results from neuronal activity directly here (full article). I have no idea how real that is, though in principle it can work.

A point that was discussed in the article and commentaries is that Logothetis et al show that the neural response in many cases is much more significant (by an order of magnitude) than the fMRI response. That means that the fMRI measurements always contain a very large number of false negatives, i.e. pixels where there is a response in the level of activity, but not a significant fMRI signal. Therefore, the patterns of fMRI signals do not actually reflect properly the patterns of neural activity (even the lagging LFP that they correlate with).

In short, this paper shows that the fMRI signal is not correlated with information processing, lags behind the activity that it is correlated with, and does not properly reflect it.

[23 Jun 2003] This one made me laugh: Pessoa et al (Neuroimaging Studies of Attention: From Modulation of Sensory Processing to Top-Down Control, J. Neurosci. 2003 23: 3990-3998), after referring to this paper, say:

Thus, in some case fMRI may reveal significant activation that may have no counterpart in single-cell physiology.

As mentioned above, Logothetis et al showed that fMRI will necessarily have a huge number of false negatives (i.e. where it fails to show activity even when there is activity), because its signal has significance an order of magnitude lower than the neural responses. This, however, seems to have completely passed by these authors, who infer the inverse problem, i.e. that the single-cell physiology will miss activity that the fMRI shows. It seems that they cannot perceive the possibility that the fMRI signal is not an accurate reflection of the underlying activity. Note their usage of the verb "reveal", with the strong connotations of being true and useful, rather than more neutral terms (like "indicate", "show", "give", "suggest").

7. The "Fusiform Face Area"

The "Fusiform Face Area" (FFA) was quoted by the editor of Neuron as an example of a "reasonably robust" result (here). There are many references to the FFA, for example, Tootell et al (Roger B. H. Tootell, Doris Tsao, and Wim Vanduffel, Neuroimaging Weighs In: Humans Meet Macaques in "Primate" Visual Cortex, J. Neurosci. 2003 23: 3981-3989), in a mini-review article, say:

This basic face/non-face distinction has been replicated consistently in many laboratories,( Puce et al 1996, Allison et al 1999, Halgren et al 1999, Haxby et al 1999, Tong and Nakayama 1999, Hoffman and Haxby 2000, Hasson et al 2001), a significant accomplishment in itself.

As a side point, we can note that these authors clearly know that there is a problem with reproducibility in cognitive brain imaging, but they completely ignore this point in the rest of the article (as do the rest of the mini-reviews in this issue of J. Neurosci.). However, they do claim that the FFA is reproducible. What they don't tell you is that the FFA is "reproducible" in different locations in different individuals. In fact the variability is such that researchers find the FFA by looking for regions that are differentially active when the subject views faces in each individual brain. For example, in Tong et al (COGNITIVE NEUROPSYCHOLOGY, 2000, 17 (1/2/3), 257-279), an article from the principal investigator (Kanwisher) who is normally credited with identifying the FFA (in this article), they use this method of identifying the FFA in individual subjects, and justify it by saying (p.3 of the pdf file):

Such individual localisation was crucial because the FFA can vary considerably in its anatomical location and spatial extent across subjects (Kanwisher et al., 1997).

Thus the FFA is not reproducible across individuals, but you can find face-sensitive patches in each individual and call them "the FFA". And that is what Tootell et al above call a "significant accomplishment". Tootell et al are actually aware of the technique that is used to achieve "reproducibility" (at least that is what Tootell told me by e-mail), but couldn't be bothered to tell it to their readers.

Faces are very special "objects" for humans, both because all of us are "experts" in them (because we have a lot of experience looking at them and interpreting them), and (more importantly) because they are associated with emotional response (because most humans reflect their emotional state in their face). Therefore, it is plausible that we have many (learned) face-selective patches in the high-level vision areas. The "FFA" seems to be a good place to look for this kind of patch, but that doesn't mean that the area is specialized for faces. The stochastic connectivity of the cortex (for which the "considerable variation across subjects" can be regarded as another small piece of evidence) rules out any face-specific circuits inside this area. What is possible is that because of the emotional content of faces, the connectivity to extra-cortical structures which take part in processing emotional responses (most importantly, the amygdala) may determine where it is easier to find face-selective patches.

It should also be noted that there are already suggestions (backed by some evidence) that this area tends to be more active when we view objects for which we have high expertise (and we all have high expertise for faces) (also this one). This suggestion, because of its generality, has a better chance of actually being true (though, obviously, we still need to see reproduction of the data), but because other objects don't reflect emotions the way faces do, other objects will never have such strong responses as faces do.

[ 10 Dec 2007 ] A review article by promoters of the FFA fails to consider the question of emotions and their reflection in faces. The words 'emotion' or 'feel' do not appear at all in the article, which is quite stunning considering the strong association between facial expressions and emotions.

Note also that in the original article they found FFA patches only in 12 out of 15 subjects. Because of the positive publication bias, we don't know in how many other subjects it is not possible to find these patches. It is also not known in what percentage of subjects it is possible to find such patches in other regions.

The idea of looking for patches in individual brains and then discussing them as if they are reproducible is used in other cases, too. For example, Kanwisher herself uses it in A Cortical Area Selective for Visual Processing of the Human Body, Paul E. Downing, Yuhong Jiang, Miles Shuman, Nancy Kanwisher, Science Sep 28 vol293, 2001: 2470-2473. Since studies of this kind are effectively comparing the data with itself, they are more likely to generate results that are reproducible, but are not interesting (e.g. that face-sensitive patches are sensitive to faces). They therefore are effective in making the discussion of reproducibility of imaging more confused. More of the "body" stuff (full text), and more, and a quite funny argument about the "EBA" (p.126, V.8, No.2, Feb 2005, Nature Neuroscience).

[2Apr2004] The latest Science contains an example of how the localizing is used (Contextually Evoked Object-Specific Responses in Human Visual Cortex, Cox et al, Science, Vol 304, Issue 5667, 115-117, 2 April 2004). The localizing is barely mentioned in the text of the article ("Once we localized the FFA for each subject"), and there is another sentence in Figure 3. The supporting material gives more information, including the fact that for two out of nine subjects, the localizer scan found nothing. The vast majority of readers, though, are not going to notice any of this, and will get the impression that the FFA is a well defined area. The fact that this is not what the article tries to show makes the effect even stronger, because the readers are unlikely to pay much attention to side-issues.

[7 May 2006] The idea of a localizer apparently went to the head of many researchers, to the point that some have difficulty publishing articles that don't localize. See the exchange in "Comments and Controversies" in Neuroimage, Volume 30, Issue 4, Pages 1077-1470 (1 May 2006). Note that neither side raises the question of reproducibility of the results they get, which should be the first question in a discussion of methodology.

[6 Oct 2008] Just found this (Vul, E. & Kanwisher, N. (in press). Begging the question: The non-independence error in fMRI data analysis. To appear in Hanson, S. & Bunzl, M (Eds.), Foundations and Philosophy for Neuroimaging.). This exposes some of the mistakes in fMRI. The interesting thing about it is that it is written by Kanwisher, the inventor of the "Fusiform Face Area". She actually mentions her 1997 article (above) as one case of showing bad figures. But she ignores other problems, and reproducibility is not mentioned, as usual.

8. Imaging of the LGN

[ 2 Nov 2004 ] Other parts of the brain are more ordered than the cerebral cortex, so in principle they may show higher reproducibility, if the reason for lack of reproducibility is the variability in the cortex across individuals. This article (Schneider et al, Retinotopic Organization and Functional Subdivisions of the Human Lateral Geniculate Nucleus: A High-Resolution Functional Magnetic Resonance Imaging Study The Journal of Neuroscience, October 13, 2004, 24(41):8975-8985 (full text)) doesn't promise much. In their figure 2 they show data from two "representative" subjects. They say (p. 8978):

In the coronal plane, the representation of the horizontal meridian was oriented at an 45° angle, dividing the bottom visual field, represented in the medial-superior section of the LGN, and the top visual field, represented in the lateral-inferior section. Although the extent of activations varied somewhat among subjects, the general pattern of retinotopic polar angle organization was consistent.

The last sentence is a very optimistic view of their data. Apart from the gross similarity that they describe in the first sentence, the data for the two subjects is completely different. It is obvious that not only every voxel is different, but even each group of 2x2x2 pixels varies across the individuals almost randomly, apart from the gross similarity.

9. Serious criticism of some studies

This article (Voodoo Correlations in Social Neuroscience, Vul et al, In Press, Perspectives on Psychological Science , First author web page) contains very serious criticism of some papers using fMRI. Surprisingly, they don't mention reproducibility in their paper.

In one rebuttal of this paper, Response to "Voodoo Correlations in Social Neuroscience" by Vul et al. - summary information for the press, they state that there were many replications of the findings that are discussed by Vul et al. Of the references they give, one doesn't actually appear in the list of references (Gu and Hahn 2007), and I couldn't easily find online the text of the review (Singer and Leiberg, 2009) and one of the papers (Lamm et al 2007a). I checked the other four references they give (Lamm et al 2007b, Saarela et al 2007, Jackson et al 2006, Jackson et al 2005). I couldn't see anything that could count as comparing the actual data (rather than interpretations) between these articles and previous publications. This is a clear case of the bogus references technique.

In a sense, it is progress that they actually think about reproducibility, but they obviously don't take the concept seriously.

The authors of the original paper answer here. They seem not to have bothered to read the references.

The first author of the original article, Edward Vul, also wrote this book chapter. Again, there is serious criticism of some articles, but no reference to reproducibility.

-----------------------------------------------------

Yehouda Harpaz
yh@maldoo.com
18jul98
http://human-brain.org/