A survey of the literature about replicability of cognitive brain imaging.
A letter about replicability of cognitive brain imaging that I tried to publish in 2003 : Responses
By 'Cognitive Brain Imaging' (CBI) I mean any imaging technique that looks at structures of and in the brain, with the purpose of understanding the mechanisms of the brain. This is to differentiate cognitive brain imaging from clinical brain imaging, which is used to identify damage to the brain and to map it to guide treatment. The discussion here is only about cognitive imaging, and I consider only PET and MRI studies.
In the field of cognitive science there are great expectations from CBI. These expectations, however, are based on naive and wishful thinking. There are two main problems: the resolution of the imaging and the differences between individuals.
Brain imaging techniques like PET and MRI have a resolution that is still far from being really useful. Currently, it is in the range of millimeters. The working of the brain, however, is far too complex to be based on working units of that size. Thus, the pattern of activity at higher resolution must be important. Most people realize that, but they also assume that the pattern of activity at the millimeter level is necessarily significant too.
The latter assumption is simply nonsensical. Two patterns of neuronal activity that are functionally different do not necessarily give different activity at a resolution of millimeters. Because a volume of a few cubic millimeters contains a large number of neurons (~1,000,000; [5 Jul 2008] Logothetis 2008 (doi) computes 5.5 million neurons in a typical voxel), changes in the pattern of activity inside it have only a small probability of showing up as a change in the total activity of the whole volume. Hence, most of the activity and changes of activity in the brain cannot be detected by PET and MRI.
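To make this concrete, here is a toy sketch (entirely made-up numbers: one imaginary voxel containing a million 'neurons', a fifth of them active during the task): two mostly different activity patterns give exactly the same voxel-level signal.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons = 1_000_000          # neurons in one imaginary voxel
n_active = 200_000             # assume 20% of them fire during the task

# Two functionally different patterns: different sets of neurons are active,
# but the number of active neurons (and hence the summed signal) is identical.
pattern_a = np.zeros(n_neurons)
pattern_a[rng.choice(n_neurons, n_active, replace=False)] = 1.0
pattern_b = np.zeros(n_neurons)
pattern_b[rng.choice(n_neurons, n_active, replace=False)] = 1.0

overlap = np.sum(pattern_a * pattern_b) / n_active
print(f"overlap between the two patterns: {overlap:.1%}")   # ~20%, i.e. mostly different neurons
print(f"voxel-level signal A: {pattern_a.sum():.0f}")
print(f"voxel-level signal B: {pattern_b.sum():.0f}")       # identical totals
```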
On the other hand, there is probably a lot of 'noise' in the brain, i.e. activity that is not related to the task that the researchers believe the brain is doing at the time they are looking at it. Part of this 'noise' is activity related to other tasks that the brain is doing at the same time, and part of it is real noise, i.e. has no functional significance. Currently, we have no way to directly distinguish between functional activity that is relevant to the task, functional activity that is irrelevant to the task, and noise. Thus we cannot tell whether the patterns of activity that we see in PET or MRI are significant.
Note that this does not tell us that the patterns seen in PET and fMRI are necessarily insignificant. In some regions, the changes in patterns of activity may be enough to actually show in the resolution of PET and fMRI. The problem is that we cannot know that, unless we have an independent test of the significance of the low-resolution activity pattern.
[5 Jul 2008] Logothetis 2008 (Nature 453, 869-878 (12 June 2008)) gives a much more authoritative and up-to-date critique of fMRI. But he completely ignores variations between individuals and reproducibility.
Brain damage research (neuropsychology) tells us that in the cortex, where most cognitive operations happen, function is localized only for input, output, and the perception/generation of phonemes. Thus most of the functions of cognition are variable between individuals, and do not have specific locations in general. Almost all CBI studies at the moment try to find specific locations for cognitive functions, and are therefore bound to fail. In general, we should not expect to see the same patterns of activity in different individuals.
The expectation that CBI will find location of functions is based on what I call the 'dogma of cognitive science', or the 'sameness assumption', in Reasoning Errors.
As in the case of resolution, this does not tell us that we cannot pick up anything with PET and MRI, because there may be patterns of activity that are the same across individuals (e.g. patterns of activity associated with the basic learning mechanism). It does tell us that we have to be cautious, and need some way of testing whether what is seen is real data.
Because of these two problems, we must have a way of testing whether the pattern of activity that we detect in CBI is real. Currently, it seems that the only way to distinguish between relevant and irrelevant data is to check whether it is reproducible. If a specific task reproducibly evokes a specific pattern of activity, it probably has some significance. This is discussed in the next section.
The hallmark of real results in a laboratory scientific experiment is their reproducibility. If you reproduce the relevant conditions of the experiment, you must get the same results. This of course applies to cognitive brain imaging, and as discussed above, it is essential for checking whether the results are real or not. So, are the results of CBI reproducible?
In all science, there is a bias against publishing negative results, including failures to reproduce published results. This starts with the unwillingness of researchers to write negative-results papers, continues with the tendency of editors and reviewers to reject these papers as uninteresting and unconstructive, and ends with readers finding them boring. Importantly, it also affects the ability of researchers to get funded. Thus, there is a constant need to guard against a too 'positive' tendency. In most fields of scientific research people are aware of this, but not in cognitive brain imaging.
In private conversations, researchers in cognitive brain imaging admit that there are 'tons' of unpublished studies that show 'weird' results. In any field of research outside the cognitive sciences that would set off red lights, but not in cognitive brain imaging. Instead, researchers concentrate on publishing those studies that seem to show real results. This has an escalating effect, because once a trend of not publishing negative results is established, the mood in the field becomes even more 'positively' oriented, which makes it even more difficult to publish negative results.
Even with this concentration of positive thinking, researchers in this field cannot find enough good results to publish, and most of the published work is irreproducible, as shown in my paper about replicability in brain imaging. To cope with that, they have to ditch the concept of reproducibility, at least implicitly. A convincing example that this has already happened is here. Note that the irreproducibility of the paper was not considered a problem by any of the reviewers or editors that I contacted.
The first paper that I know of that "tries to address the question of reproducibility" is:
Poline J-B, Vandenberghe R, Holmes AP, Friston KJ, Frackowiak RSJ. Reproducibility of PET activation studies: lessons from a multi-center European experiment. NeuroImage 4, 34-54 (1996)
I put the double quotes because the authors of this paper clearly have no intention to objectively evaluate their results. Their main data, presented in their figure 2, clearly demonstrates irreproducible results.
In case it is not obvious from the pictures that the results are irreproducible, pretend that you are a researcher with the results of one of the studies shown in figure 2, and evaluate, based on the other results, two questions: what fraction of the regions that are active in your study are also active in the other studies, and what fraction of the regions that are active in the other studies are missing from your study?
For your study to be considered reproducible, the answer to the first question has to be close to 1, while the answer to the second has to be close to 0. Otherwise, it means that your study has either many false positives or many false negatives, or both. None of the studies shown in the pictures passes both tests.
You can repeat the exercise with only a sub-group of the studies (e.g. only the 3D studies, or eliminating the last three 'low sensitivity' studies). You will still see that, in general, the pattern of activity in any single study cannot be regarded as showing the real pattern of activity.
Nevertheless, the authors repeatedly state that the results show reproducibility. They also go on to discuss possible explanations of the variability, like the distinction between 3D and 2D studies and the low-sensitivity studies. As discussed above, even with these distinctions the results still show irreproducibility. More importantly, all the studies are supposed to show the same results. If we say, for example, that the 3D results are real, then the results of the 2D studies, which are different, cannot be real, and we have to dump all the studies that have been done until now using 2D.
Small areas of activity were reproducible in almost all the studies, according to the authors. However, from each individual study, it is not possible to predict those regions, which means that the results of a single study cannot be used on their own.
In addition, these reproducible regions are similar to those that were identified by brain damage research. While this is reassuring, the main interest in brain imaging is the hope that it can tell us something that we don't already know. This study does not support this hope.
The failure of this study is emphasized by the fact that it didn't come up with any actual useful piece of data that can be built on in further research.
I found this paper later:
Casey, B.J., Cohen, J.D., Craven, K.O., Davidson, R., Nelson, C.A., Noll, D.C., Hu, X., Lowe, M., Rosen, B., Truwit, C., & Turski, P. (1998). Reproducibility of fMRI results across four institutions using a spatial working memory task. NeuroImage, 8, 249-261.
(It was published a year after I wrote the replicability paper).
This is a four-center study in the US, using fMRI data rather than PET like Poline et al above. The task that they chose involves a specific motor action, which from brain damage data is known to be fairly localized, and also some "working memory", which is not known to be localized.
The data itself clearly shows irreproducibility, but the authors try to hide it.
First, they collect results from two subjects from each site, and generate a dataset from them (table 1), simply by picking out voxels in which there was activity in at least 6 out of 8 or 5 out of 7 subjects. As shown in section 3 below, this approach can generate "significant" results even when the data is totally random between the subjects. Because the data involves a motor task, it is not random, so it is even easier to generate "significant" results. The authors do not bother to tell us how many voxels there are with activity from fewer than 6 (or 5) subjects, so it is not possible to tell whether the distribution is similar to the one in section 3 below. However, from table 1 we can see that there were no locations with activity from all 8 subjects, 5 locations with activity from 7 subjects, and 17 locations with activity from 6 subjects, i.e. a large increase from 8 to 7 to 6. Projecting this to smaller numbers suggests many areas with activation from a small number of subjects, i.e. many irreproducible locations of activity.
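To make the projection concrete, here is a back-of-the-envelope sketch (my own model, not anything from the paper): if we assume each location is activated independently in each subject with some probability p, the ratio of the reported counts at 6 and 7 subjects fixes p, and the same model then predicts the counts at smaller numbers of subjects.

```python
from math import comb

n_subjects = 8
count_7 = 5      # reported: locations with activity in 7 of 8 subjects
count_6 = 17     # reported: locations with activity in 6 of 8 subjects

# Under an independent-activation model, count(k) is proportional to
# C(8, k) * p**k * (1 - p)**(8 - k); the ratio count_6/count_7 then gives p.
odds = (count_6 / count_7) * comb(n_subjects, 7) / comb(n_subjects, 6)   # = (1 - p) / p
p = 1.0 / (1.0 + odds)
print(f"implied per-subject activation probability: p ~ {p:.2f}")

def projected(k):
    """Expected count at k subjects, scaled from the reported count at 6."""
    weight = comb(n_subjects, k) * p**k * (1 - p)**(n_subjects - k)
    weight_6 = comb(n_subjects, 6) * p**6 * (1 - p)**2
    return count_6 * weight / weight_6

for k in range(6, 1, -1):
    print(f"locations active in {k} of 8 subjects: ~{projected(k):.0f}")
```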
Throughout the discussion the authors refer to the areas where there is activity from more than 6 subjects as showing "reliable" activity. However, "reliable" means that it will reproduce, and these areas are clearly not reproducible, because they don't even reproduce inside this study. For example, of the 11 areas identified in Table 1 for the Memory vs. Motor condition, only two have similar (within 10 mm) areas in the data from the Boston site (table 2a), and four don't have a matching peak within 20 mm in the Boston data (the results in the other cases are not better; see here for the complete listing). The authors themselves apparently did not bother to carry out this check, presumably because the result was not to their liking.
In the second part of the analysis, the authors list the areas that each site identified in tables 2-4. Clearly the data diverge, as is also clear from figure 3. The authors highlight two areas that are, in their words, "revealing almost exact matches", one for each condition. However:
The fact that the authors completely ignore these points, in particular the last point, clearly demonstrates that they have no intention to evaluate the data properly. They do discuss differences between sites, but mainly to suggest excuses for the differences. They do not try to evaluate the implications of these differences for the reliability of imaging data. In the discussion, the authors also claim that there is similarity between their data and some previous papers, but they don't present an actual comparison.
Another amusing discussion is in
Benjamin Martin Bly and Stephen M. Kosslyn (1997). Functional anatomy of object recognition in humans: evidence from positron emission tomography and functional magnetic imaging. Current Opinion in Neurology, 10, 5-9.
These authors obviously would have liked the data to give some useful results, but when they collect all the relevant studies, the results are obviously irreproducible. In the words of the authors, there is a large variability. The authors consider several possible excuses, but never lose their faith in the power of imaging to give reproducible results. Even when they go as far as suggesting that object recognition is 'opportunistic' (by which they seem to mean 'variable in time'), they still believe that proper studies can find reproducible results.
It stands out that even though they actually admit the possibility of variations over time in the same individual (an 'opportunistic system'), and acknowledge the variability between studies, they do not believe in inter-individual differences. They completely ignore this point when discussing the problems associated with averaging across individuals. Obviously, if there are variations between individuals, averaging becomes a noise-generating procedure.
The data they present is best explained by assuming that 90-100% of it is just irreproducible noise, either because the imaging techniques do not capture real data, for the reasons discussed in section 1 above, or because there are variations between individuals, or both. At most, the data suggests that during object recognition the temporal and occipitotemporal cortex tend to be noisier than other parts of the cortex. This is quite compatible with brain damage data but, like in 2.3 above, does not actually add anything to it, and demonstrates that imaging is unlikely to give better results. It seems that for the authors this conclusion is not only unacceptable, but actually incomprehensible.
An interesting case is in The promise and limit of neuroimaging. Even though the title is quite suggestive, the authors (actually philosophers), who intend "to identify and to critically evaluate the epistemic status of PET, with a goal of better understanding both its potential and its limitations", ignore the question of (ir)reproducibility altogether. Amusingly, the examples of actual research they bring [sections 3.1 and 3.2] show a case of irreproducibility, but the authors explain it by assuming that the second experiment was poorly designed for duplicating the first one. Maybe they are right in this case, but it is still true that nobody has reproduced the results of the first paper (or the second paper).
Farah and Aguirre (1999, TICS, V.3, P.179) discuss imaging of visual recognition. They collect data from 17 studies, and it is scattered all over the posterior cortex. In the authors' words (p. 181):
The only generalization that one can make on the basis of these data is that visual recognition is a function of the posterior half of the brain!
(Exclamation mark in the source)
The authors explicitly say that they are disappointed, and refer to
their results as "(non)-results". Then they go on to look for
explanations. Even though their data is clearly irreplicable, they do
not consider this possibility. They go as far as mentioning the
possibility that imaging shows "epiphenomenal" activity, because they
need an explanation of why their data contains activity in regions that
brain damage studies suggest are not essential. They dismiss this
problem by simply ignoring it in the rest of the discussion. They then
go on to the usual blurb about different experimental settings, and as
usual ignore the question why researchers don't repeat the experiments
with the same settings. They next go on cheerfully to introduce newer
paradigms, which they believe will sort out the problem.
This case is especially interesting, because the presentation of the data makes it absolutely clear that it is not replicable, yet the authors do not even mention this possibility. One explanation is that they are intentionally misleading the reader, but I would say that this is unlikely. Rather, it seems that this is an example of "theory-driven blindness", where the theoretical prejudices of the authors make them blind to what their data 'tells' them.
I didn't read the book "The New Phrenology: The Limits of Localizing Cognitive Processes in the Brain" by William Uttal, but apparently it mentions the problem of lack of replicability in cognitive brain imaging. I first heard about the book by reading a review that tries to bury it as deep as possible in Nature ("Bumps on the brain", Nature, Vol 414, p. 151, 8 November 2001). The review is an interesting phenomenon on its own, and here are my comments on it.
There is an fMRI data center, in which they want researchers to deposit their datasets. One of the goals of this center is "Providing all data necessary to interpret, analyze, and replicate these fMRI studies." At least they did not completely ignore the issue, but there is no sign that they are aware that replicability is a problem, and hence needs special attention. Nevertheless, this center makes it easier for researchers to try to replicate previous studies, so maybe it will make it clearer to them that they can't. Since this is a relatively new enterprise, its effect, if any, will take some time.
[ 4 Oct 2004] The enterprise doesn't seem to be catching on. It seems that more than 95% of the datasets are from the Journal of Cognitive Neuroscience, so in general authors in the field don't bother to deposit unless they publish in JoCN, which probably forces it, because the editor (Gazzaniga) is the driver of the database.
Here (Roberts et al, American Journal of Neuroradiology 21:1377-1387 (8 2000)) is an article that really tests the reproducibility of fMRI. It is worth noting that these are radiologists, rather than cognitive scientists, and are therefore much less affected by the dogmas of cognitive science. Their conclusion is:
Quantitative magnetoencephalography reliably shows increasing cortical area activated by increasing numbers of stimulated fingers. For functional MR imaging in this study, intra- and interparticipant variability precluded resolution of an effect of the extent of stimulation. We conclude that magnetoencephalography is more suited than functional MR imaging to studies requiring quantitative measures of the extent of cortical activation.
If that had been in a journal that cognitive scientists read, I probably wouldn't have had to write this page.
I read only the abstract of this (Miki et al, Jpn J Ophthalmol 2001 Mar-Apr;45(2):151-5). They also seem to fail to reproduce.
Same authors as above, previous study (Miki et al, American Journal of Neuroradiology 21:910-915 (5 2000)). Failed to reproduce, even within the same subject.
This article [also here] (McGonigle, D.J., Howseman, A.M., Athwal, B.S., Friston, K.J., Frackowiak, R.S.J., & Holmes, A.P. (2000). Variability in fMRI: An examination of intersession differences. NeuroImage, 11, 708-734) checks inter-session variability, and finds a lot of it (see their figures 2 and 3). In fact, their findings undermine all the studies using fMRI. However, they carefully avoid discussing the reproducibility of fMRI. They try to check their results by Random Effects Analysis. They obviously think it is a good idea, but their data does not support it, and they say in the conclusion:
Our assumption of Normally distributed intersession residuals was not supported by close examination of some of our data, and so we accept that future work is required before random-effects models can be used to their full potential.
This implies that even though it doesn't work yet, they are sure that the "random-effects models" will work after some "future work". They don't give the basis for this confidence. See below why random-effects analysis is unlikely to be as useful as they think.
This article ( Stability, Repeatability, and the Expression of Signal Magnitude in Functional Magnetic Resonance Imaging. Mark S. Cohen, PhD, and Richard M. DuBois. JOURNAL OF MAGNETIC RESONANCE IMAGING 10:33-40 (1999)) tries to introduce a new approach to evaluate fMRI data. Among other things, they say (p.34):
The most popular approach now in practice is to count the number of voxels that exceed a nominal correlation threshold (often without correcting for the number of samples), sometimes applying additional constraints of spatial contiguity by using ad hoc approaches such as the split-half t-test or convolution smoothing. While it is statistically sound to compare this measure across trials, we show here that all such data are highly suspect, and have such high variance from subject to subject, and trial to trial, that both the statistical power and reliability of the fMRI experiment is compromised severely.
(Italics in the original text.)
The basic claim, that fMRI is unreliable, is the same as what I say. However, they claim to actually prove it in their multi-session experiment. We may need to take this statement with a grain of salt, because to advance their new approach they want to slag off the existing ones. However, this statement apparently passed the review process of the Journal of Magnetic Resonance Imaging, so it wasn't that unacceptable to the referees.
In this abstract (NeuroImage Human Brain Mapping 2002 Meeting, Poster No.: 10184) they actually test for replication across individuals, and find "high inter subject variability" in all tasks. Nevertheless, they took the task that looked to them most reproducible (i.e. least irreproducible) and used it for the patient that was their main interest. They also claim that within subjects over time the data "evidenced the involvement of the same cortical network", which is difficult to interpret. It is fair to say that their conclusion is that follow-up studies should rely on "language tasks of known stability across subjects and time", though they omit to say that their own tests were found not to have such stability across subjects.
Here is an article (Arthurs and Boniface, Trends in Neurosciences, Volume 25, Issue 1, 1 January 2002, Pages 27-31) with a promising title: "How well do we understand the neural origins of the fMRI BOLD signal?". The text does discuss the question, and says that we don't really. It would have been natural for them to ask "if we don't really understand it, can we be sure it generates real data?", and the first step to answer that is to check whether it generates reproducible results. However, the authors don't ask this question, and seem to take it absolutely for granted that the fMRI signal gives valid results.
[ 5 Apr 2003 ] In this review (Scott and Johnsrude, Trends in Neurosciences, Volume 26, Issue 2, February 2003, Pages 100-107) of "The neuroanatomical and functional organization of speech perception" they don't check for reproducibility, but in their Box 2 (p. 103) they present a "meta-analysis" of imaging studies of speech perception. The figure shows what is clearly a random distribution, the kind of thing you would get if you filled a box with variously coloured chips and then spilled it on the floor. That doesn't bother the authors, and based on this data they say:
There is, thus, some evidence for parallel and hierarchical responses in human auditory areas.
Clearly, even if the data they show is not completely random, there isn't any sign in it of anything parallel or hierarchical.
In the same issue as the previous item, there is a review (Ugurbil, Toth and Kim, Trends in Neurosciences, Volume 26, Issue 2, February 2003, Pages 108-114) titled "How accurate is magnetic resonance imaging of brain function?", which sounds promising, but there is nothing in it about reproducibility.
[28 Jul 2003] This article (pdf) (Schultz et al, Arch Gen Psychiatry. 2000;57:331-340) doesn't intentionally test for reproducibility. It compares activity during face recognition between two groups of normal persons and a group of autistic subjects. While it does seem that the activity in the autistic group is more different from the normal groups than the normal groups are from each other, the activities of the normal groups are clearly different (Figure 3, A and B compared to C and D). The authors claim that in the ROI in the right fusiform gyrus the activity is not significantly different between the normal groups, but since it is clear that the activities are different, that just shows that they use lousy statistics.
This study (pdf) (Disbrow et al, J. Comp. Neurol. 418:1-21, 2000) is actually a serious study, for a change. Like the Roberts group above (with which they cooperate; Disbrow took part in both papers), they are not cognitive scientists, and their research can best be described as "comparative and developmental brain anatomy" (web page). They compare activations in the somatosensory cortex between individuals, and present the data properly, i.e. in a way that lets the reader see the actual data. The somatosensory area is one of the "gross features" of the cortex, which was known from brain damage studies, and is determined by input into the cortex from extra-cortical sources, so it is one place where we should expect better reproducibility. The study does show that the activations cluster across individuals, but also that there is considerable variability ("somewhat variable" in the authors' words) across individuals. It is the lack of this kind of study for the other features claimed by imagers which is the most convincing evidence for the general lousiness of cognitive brain imaging.
[5 Aug 2003 ] In this article (Hasson et al, Neuron, Vol. 37, 1027-1041, March 27, 2003) they show their results from 5 different subjects in Figure 2, and they are clearly wildly divergent. Nevertheless, they go on to average over all the subjects, as if that actually makes sense (see the discussion in the next section).
[ 13 Aug 2003 ] This paper (Dehaene et al, Cognitive Neuropsychology, 2003, V. 20, pp. 487-506) compares data from several imaging studies of number processing (using different tasks), and claims "high consistency of activations" (legend of Figure 1). However, in the five locations for which they show data in table 1 from more than one study, all of them spread over 2 cm or more, which is a large distance in the cortex, and certainly cannot be regarded as reproduction of the result. Since the different studies did different experiments, this is not really an example of lack of reproducibility, but it is also not an example of reproducibility. In addition, of the 8 studies in table 1, six are from the laboratory of the authors themselves, so they are not actually independent. [ In a conference in the beginning of July 2003 I argued with some people about the reproducibility of imaging, and the only paper that they could quote as showing reproducibility was this one (which was still in press). ]
If fMRI results are unreliable, how can they be used to construct theories of cognitive function?
This suggests he is actually worried about replication. Later he also says:
Replications of neuroimaging studies have occurred only rarely in the literature (e.g., Chawla et al., 1999; Zarahn, Aguirre, & D'Esposito, 2000), presumably due to the economics of performing fMRI studies and the strong desire of researchers to make unique contributions to the field.
Even the two references that he brings are not real examples. Chawla et al, Neuroimage, 9, 508-515, don't present any data to show replication. For Zarahn et al, Brain Res Cogn Brain Res. 2000 Jan;9(1):1-17, I didn't find the full paper online, but the abstract makes it clear that at most they show replication with the same subjects and the same research group. It seems the author selected these two articles simply because the word "replication" appeared in their titles.
Notwithstanding these promising statements, the author (one of the authors of Miller et al below) is not worried at all about replicability. He just introduces two articles that present new methods for analysing the data, one of which is called "Reproducibility Maps" (see here). However, this "reproducibility" is of voxels within the same study, i.e. it is not reproducibility in the normal scientific sense of the word, which requires reproduction across research groups. Maybe in principle this method can be applied across studies, but that is certainly not the way it is used in the article. There is also a commentary, which also mentions "reproducibility" in its title, but it is the same "voxel reproducibility" rather than cross-research-center reproducibility.
The article about "Reproducibility Maps" (Liou et al, Journal of Cognitive Neuroscience, V 15 No 7 Page: 935-945, doi) starts with this sentence:
Historically, reproducibility has been the sine qua non of experimental findings that are considered to be scientifically useful.
It then goes on to discuss the "voxel reproducibility". It seems that these people have heard about reproducibility and its importance, but didn't really understand the concept (more likely, they just pretend).
It may be regarded as good news that these people at least pretend to be worried about reproducibility, but this kind of writing only creates a smokescreen, because it gives the impression that reproducibility (in the sense of reproducing results across research centers) is being discussed when it isn't. [29 Apr 2004] The latest Nature Neuroscience contains another demonstration (the first author is the same as above). In this review (Sharing neuroimaging studies of human cognition, Van Horn et al, Nature Neuroscience 7, 473-481 (2004), doi) they say:
The reproducibility of fMRI-related effects in previously published data has also been explored.
And then they go on to discuss the article about "Reproducibility Maps".
Here is an abstract that claims to test reproducibility across research centers and to find very consistent results. I couldn't find any full article associated with it, so it is probably unsubstantiated waffle rather than a real result.
This article (Tegeler et al, Human Brain Mapping 7:267-283 (1999)) really tests reproducibility, but of whole images, rather than peaks. Since the data in imaging studies are the peaks, and whole-image reproducibility is not related to peak reproducibility, this is actually irrelevant.
[ 7 Nov 2003] It should also be noted that by now there are quite a few articles that claim to achieve reproduction within the same subject or group of subjects (typically referred to as "intrasubject", "intra-subject", "test-retest" or "inter-session"). While these studies are interesting, they cannot tell us anything about cross-research-center reproducibility, or even cross-subject reproducibility.
This article (Intersubject Synchronization of Cortical Activity During Natural Vision, Hasson et al, Science, Vol 303, Issue 5664, 1634-1640, 12 March 2004) is very misleading, because it falsely gives a strong impression of similarity of activity across subjects. Their main claim is that they "found a striking level of voxel-by-voxel synchronization between individuals" (abstract). In the text they say: "Thus, on average over 29% ± 10 SD of the cortical surface showed a highly significant intersubject correlation during the movie (Fig. 1A)."
First, it should be noted that high significance of a correlation is not the same as a high correlation. Even a very low correlation can give high significance if you have enough data points, and Hasson et al have many data points. They do not tell us the actual level of correlation, even though it is the more useful piece of information. This may be because the level is low, or because they don't understand the distinction. They do show correlation levels in Fig 5c, but there the correlation is between regions, rather than between subjects.
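To illustrate the difference, here is a minimal sketch (invented numbers, chosen only to make the point): with 10,000 time points, a correlation of about 0.1, which explains only about 1% of the variance, comes out as enormously 'significant'.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_timepoints = 10_000            # a long time course gives many data points
r_target = 0.1                   # a weak correlation, ~1% shared variance

x = rng.standard_normal(n_timepoints)
y = r_target * x + np.sqrt(1 - r_target**2) * rng.standard_normal(n_timepoints)

r, p = stats.pearsonr(x, y)
print(f"correlation r = {r:.3f}, shared variance = {100 * r * r:.1f}%")
print(f"p-value = {p:.1e}")      # a tiny p-value ('highly significant') despite the weak correlation
```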
Much more important, though, is the fact that the correlations between pairs of subjects differ from pair to pair. From Figures S1 and S3 it is clear that, apart from a strong tendency towards the back of the head (i.e. the visual areas), there is no similarity between the pair-wise correlations. In other words, if a voxel is correlated between two subjects, that doesn't tell us whether it is correlated between these subjects and any of the rest of the subjects, or among the rest of the subjects. If anything, this is a demonstration of variability across individuals, rather than similarity (except for the back-of-the-head tendency). However, only readers who make the effort of loading the supplementary material will see it.
The fact that the data shows variability rather than similarity clearly undermines all the hype of this paper, of which there is quite a lot.
This paper also uses the usual trick of "localizing scans" to find face- and building-sensitive patches, and then shows that they are sensitive to faces and buildings. There is some novelty in showing that, for the same subjects, these patches are the same for static images and a movie. However, these patches are obviously different across individuals (the authors don't even bother to compare them), but by including the discussion inside a paper about "intersubject synchronization" they give the impression that they are the same across individuals. Since these patches are irrelevant to the main point of the article ("synchronization"), it looks like this is done intentionally.
This article (Machielsen et al, fMRI of Visual Encoding: Reproducibility of Activation, Human Brain Mapping 9:156-164 (2000)) says that "We performed an fMRI study involving within and between-subject reproducibility during encoding of complex visual pictures.", but I don't see a between-subject comparison of peaks of activity. They find large variability within subjects.
In this article (Vandenbroucke et al, Interindividual differences of medial temporal lobe activation during encoding in an elderly population studied by fMRI, NeuroImage V. 21, 1, January 2004, Pages 173-180, doi) they genuinely look at differences across individuals in the MTL (which is not cortex), and find a lot of them. Since they were looking at an "elderly population", they conclude that in old age averaging is not reliable. Somehow it completely escaped their attention that it may be unreliable in young populations too. (I don't actually know how reproducible results in the MTL are in general.)
[ 4 Oct 2004] In this article (Murphy and Garavan, NeuroImage Volume 22, Issue 2 , June 2004, Pages 879-885) they actually test reproducibility, between two groups of 20 subjects, and say (in the abstract):
These analyses revealed that although the voxelwise overlap may be poor, the locations of activated areas provide some optimism for studies with typical sample sizes. With n = 20 in each of two groups, it was found that the centres-of-mass for 80% of activated areas fell within 25 mm of each other.
This is more or less what you would expect from a random distribution, so it is not a reason for any optimism whatsoever. Obviously these authors simply cannot conceive of the possibility that the data is irreproducible.
This article (Marshall et al, Radiology 2004;233:868-877) (if the link is broken try here) explicitly states in the CONCLUSION:
The generally poor quantitative task repeatability highlights the need for further methodologic developments before much reliance can be placed on functional MR imaging results of single-session experiments.
Unfortunately, the way they say it may leave non-experts with the impression that there is no problem with multiple-session studies, and that there is a significant number of such studies. In addition, the fact that they used old subjects (69 years old) leaves it open to claims that it is the age that causes the lack of replicability. But it is definitely a start.
In the abstract (materials and methods), they say: "Within-session, between-session, and between-subject variability was assessed by using analysis of variance testing of activation amplitude and extent." But in the article itself they don't discuss between-subject variability at all, only within-subject variability.
[ 25 Jul 2012] There are now several articles that claim to show inter-subject correlations when viewing movies (from 2012, from 2010 and from 2008, all available full-text online, sharing two authors (Jääskeläinen and Sams)). They do quite a lot of mathematical analysis to reach their conclusions, and it requires quite deep analysis to check it, which I didn't do. However, if the results are actually significant, they should reproduce across studies. In the above studies, they don't even try to compare the actual correlations across studies. Considering that they share authors, the most likely reason for this is that it is clear to them that the correlations are different. That suggests quite strongly that we don't have here anything significant.
This one (2010) and this one (2010) are similar stuff from other groups. Again, it is difficult to evaluate the significance without deep analysis, and there is no effort to compare to other studies. There is here (Trends in Cognitive Science, article in press 2012, "Brain-to-brain coupling: a mechanism for creating and sharing a social world") an opinion article by authors of these two articles discussing the issue, and comparison of correlations across studies is not mentioned, even in their "Questions for future research", which does not promise much. This one (2011), by another group (referred to by the opinion article), is no better.
Another important issue is the high degree of inter-subject and -laboratory variability observed in fMRI studies.
So he is aware of the "high degree of variability", but it doesn't seem to worry him too much. He then says:
STS, LO and MT are attractive targets for a review, because there is some consensus about their anatomical locations.
(STS, LO and MT are names of regions of the cortex.)
In this press release (UCI Receives Major Grant to Help Create National Methods and Standards for Functional Brain Imaging, 13 Mar 2006), they report a grant (of $24.3M) to "... standardize functional magnetic resonance imaging and help make large-scale studies on brain disease and illness possible for the first time".
In the release the director of the receiving consortium says:
There are around 7,000 locations in the U.S. right now doing MRI scanning, yet there is no way to combine the data in a useful way from these sites,
The release then says that "Although brain imaging technology has generated remarkable progress in understanding how mental and neurological diseases develop, it has been nearly impossible for one laboratory to share and compare findings with other labs". It does not explain how it is possible to progress in understanding anything if laboratories cannot share and compare their findings.
Obviously these people know that there is a problem, and it seems that they genuinely try to correct it. It seems, though, that they still believe that the issue is mainly variability between sites and hardware, and they don't realize that there is also the issue of reproducibility across individuals. But it is certainly some progress to at least try to reproduce across sites.
Later in the release it says:
FBIRN established for the first time that the difference between MRI scanners and techniques across centers is so great that the value of multi-site studies is undermined,
Obviously, the "difference .. across centers" undermines the value of all studies, not only multi-site studies. It is an interesting question whether these people really don't understand this, or are just being careful not to offend too many researchers.
[ 17 Oct 2007] This didn't get that far. Judging from the publications listed on this page, they concentrated on cross-site variations, and did not deal with cross-subject variations at all. Cross-site variation is also interesting to some extent, but without data on cross-subject variations it doesn't tell us much.
In this issue of Human Brain Mapping (Volume 27, Issue 5 (May 2006)) they discuss reproducibility, but what they mean is reproducibility of the results from the same data when it is analysed by different methods. They seem to think that the methods of analysis introduce quite a lot of variability.
It is interesting that they spend that amount of effort on looking for reproducibility across methods, but don't do the same for reproducibility using the same method across individuals and sites. Obviously the latter is the important question, but discussing it would lead to awkward answers.
This article (Bertrand Thirion, Philippe Pinel, Sébastien Mériaux, Alexis Roche, Stanislas Dehaene, and Jean-Baptiste Poline. Analysis of a large fMRI cohort: Statistical and methodological issues for group analyses. Neuroimage, 35(1):105-120, March 2007; Full text) tests reproducibility by using a large number of subjects, splitting them into groups and comparing between the groups. Their results clearly invalidate the vast majority of studies until now, if not all of them. Most clearly, they conclude that you need at least 20 subjects to get a reliable result, and most studies use a much smaller number.
It should be noted that this is based on very simple tasks, and is unlikely to generalize to more complex tasks, which are likely to be more variable.
Interestingly, the authors themselves do not seem to take their results seriously, and they have published many studies with far fewer subjects. To check whether they still do, I looked up the publications page of one of the authors, Stanislas Dehaene, and downloaded the two articles that were listed before and after the article above. The first was done with 9 subjects, the second with 12 subjects. Obviously, the study above shows that the results of these two studies are unreliable, but that doesn't bother them.
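The kind of check that Thirion et al perform can be sketched roughly like this (a toy version with simulated data and an arbitrary threshold, not their actual pipeline): split the subjects into two independent groups, compute a group activation map for each, and see how well the two thresholded maps agree.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_subjects, n_voxels = 40, 5000
true_effect = np.zeros(n_voxels)
true_effect[:100] = 0.5                      # assume 100 genuinely responsive voxels

# Each subject's contrast map = true effect + subject-specific variability/noise.
subject_maps = true_effect + rng.standard_normal((n_subjects, n_voxels))

def group_map(maps, threshold=3.0):
    """Threshold a one-sample t-map computed across the subjects in one group."""
    t, _ = stats.ttest_1samp(maps, 0.0, axis=0)
    return t > threshold

half_a = group_map(subject_maps[: n_subjects // 2])
half_b = group_map(subject_maps[n_subjects // 2 :])

dice = 2 * np.sum(half_a & half_b) / (half_a.sum() + half_b.sum())
print(f"group A: {half_a.sum()} voxels, group B: {half_b.sum()} voxels, overlap (Dice) = {dice:.2f}")
```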
[21 Mar 2012]
This article claims in the abstract that their results are "reproducible across subjects". I have not yet seen the full text.
In this blog entry it says they use three subjects, so it doesn't sound promising.
The article is critical of other fMRI studies, and claims in the abstract to "challenge that view [localization view of brain function]".
[27 Mar 2012] In this article (What Makes Different People's Representations Alike: Neural Similarity Space Solves the Problem of Across-subject fMRI Decoding, Rajeev D. S. Raizada and Andrew C. Connolly, 2012; full text from an author's web page) they figured out that you cannot reproduce activity across subjects. Instead they look at what they call "similarity space".
They say things like "However, just as the literal fingerprints on people's hands are idiosyncratic to individuals, the “neural fingerprints” of representations in their brains may also be subject-unique. Indeed, this has found to be the case." (bottom of the first column). However, they don't state explicitly that previous "results" of fMRI studies are not reproducible.
Their technique achieves something, and in principle it can work on top of stochastic connectivity, because similarity between patterns of activity of concepts can be a result of learning. Their results, however, are not as promising as they pretend. The concepts that they use are very, very gross compared to the subtleties of human thinking, and they are already not 100% accurate. It seems unlikely that the technique will work for more fine-grained concepts.
In addition, the method as currently described suffers from a combinatorial explosion. The authors discuss it and mention a further study in which they develop heuristics to deal with 92 concepts. It remains to be seen how well these heuristics work, and whether they can be scaled to larger numbers.
Without scaling to fine-grained, and hence numerous, concepts, it is not obvious that this technique will be of any use in understanding the underlying mechanisms. Nevertheless, it produces results with some probability of being reproducible, which is a significant improvement.
By the way, the term "connect" or any of its derivatives does not appear in this article, which is a little surprising considering that the underlying reason for any result they have must be something to do with the actual connectivity.
[19 Sep 2017]
Nature Neuroscience has published in Feb 2017 an editorial, following
a paper that shows some serious problems in the analysis of fMRI: Fostering reproducible fMRI
research. But they completely miss the point. I discuss this in
detail, because it is a typical example of the way the field in
general "deals" with the issue of reproducibility.
The main problem with the editorial is what is missing, which is a
discussion of comparison of data between independent researchers. In
fMRI research, you start with raw data coming from the machine itself
(which is already processed to some extent). You then analyse this raw
data to get high-level data, like "region X is active in situation Y".
The high-level data is obviously the interesting result. Thus
"reproducible fMRI" must be reproduction of the high-level data.
In principle, if the raw data is reproducible and the analysis is
consistent, the high-level data will be reproducible too. But it is
also possible for the raw data to be different from various
experimental reasons (like different machines, different stimuli) and
still be analysed to produce reproducible high-level data. So to know
if fMRI research is reproducible in an interesting way, we must
either check that the high-level data is reproducible, by comparing it
across research groups, or check that the raw data is reproducible and
that the analysis is consistent.
The editorial completely ignores the question of reproducibility
of the data, either the raw data or the high-level data. All the steps
that they take and recommend are about improving the analysis.
It stands out the most when they write:
This is quite representative of the current state in the field. They
do talk about reproducibility, but only consider ways of improving the
analysis. The reproducibility of the data is not discussed at
all (because they know it is not reproducible and don't want to admit it).
It is worth noting that comparing high-level data across studies like
I did in 1997 is quite easy. The reason
they don't do it is because they know the results, and they are negative as
they were in 1997.
Apart from failing to discuss comparison of data, the editorial is a biased and dishonest argument in trying to defend fMRI research. After
some introduction, they mention the study that raised methodological
issues and a previous study that raised questions of reproducibility,
and then write:
The editorial doesn't consider the more detailed data, and instead
continues:
The link is to another response to the paper that provoked the
editorial, pretty technical blog by "Organization for Human Brain
Mapping" which is all about analysis, without any consideration of
checking reproducibility by comparing data across research groups.
This blog has a link to an "OHBM COBIDAS Report" (whatever that means)
titled "Best Practices in Data Analysis and Sharing in Neuroimaging
using MRI". This is long and technical. They define "replication" and
"replicability" to mean what I mean when I write "reproduction" and
"reproducibility", and define "reproduction" to mean what I would call
"re-analysis". However, checking replication by comparing data appears
in this document only in an abstract sense and in references to other
fields. They don't discuss how to actually do it, who should do it, and
how to encourage doing it, even though in the "scope" they write:
"Hence while this entire work is about maximizing replicability,...".
Here (Small sample sizes reduce the replicability of task-based fMRI studies:
Benjamin O. Turner, Erick J. Paul, Michael B. Miller & Aron K.
Barbey, 7 Jun 2018)
is an article where they actually check and conclude that you need
more than 100 subjects to get really reproducible results. They do it by
splitting the subjects in studies to two groups, and checking if the
results are reproducible between the groups. As they point out, that
means that the replicability is between measurements on the same
scanner which are processed in the same way, so it is actually a best case, and the replicability between different research groups would be even worse.
This would invalidate almost all of the fMRI literature, and the authors
try to kind of hide this by using mild language. They keep talking about "modest replicability", for example in the abstract. Any two random patterns can have "modest replicability"; all you need to do is to ignore the bits that are different between them. So "modest replicability" really means no replicability, but it does not sound as bad. They suggest doing research with larger groups, and some
other ideas, without explicitly stating that the results until now are
useless.
It is difficult to blame the authors for not being more explicit about
the uselessness of existing research, because they probably wouldn't
be able to publish the paper otherwise, but it does reduce the impact
of what they say. Nevertheless, it is going in the right direction.
The journal in which it is published, Communications Biology,
is a predatory journal (i.e. one where the authors
pay), so the quality of the research is not obvious.
A new article
(doi)
in bioRxiv now reports a survey of 135 functional Magnetic Resonance Imaging studies in which researchers claimed replications of previous findings. They found that 42.2% did not report coordinates at all, and that of those that did, 42.9% had peaks more than 15 mm away. In other words, the majority of the replication claims (42.2% + 57.8% × 42.9%, i.e. around two thirds) were bogus.
It can be argued that this is an improvement from my survey where I found 0% replication.
However, I surveyed all the papers that I could find about fMRI, not
only ones that claim replication. Given that they considered only claims of replication, it is not obviously an improvement at all.
The more-than-half bogus claims of replication tell us that researchers now know that they need to at least pretend replication, but don't succeed in doing it in most of the cases, so instead they make bogus claims. The less-than-half of the cases where the claims are not obviously bogus don't tell us much either, because by now there are many imaging results, and authors can select which of these results their latest experiment happens to replicate (there is nothing about pre-registering
in this survey). Thus we don't know if any of these is real.
On top of this, researchers will obviously not try to replicate
results that they already know are not replicable. Thus these
replication efforts were:
Thus the progress that this shows is that:
Some researchers in the field understand that averaging necessarily
gives more significant results, but not all of them. For example, in a
collection of tutorial essays (Posner, M. I., Grossenbacher, P. G.,
& Compton, P. W. (1994). Visual attention. In M. Farah
& G. Ratcliff (Eds.), The neuropsychology of high-level vision:
Collected tutorial essays (pp. 217-239)), Posner et al
(1994) say (P. 220):
This, of course, is plain nonsense. In case that is not obvious, here is a hypothetical example. Assume the following for some experiment (the numbers are selected so that all the results are integers, but they are typical):
- The level of noise in each pixel is 18X (in some arbitrary unit X), and a change counts as 'significant' only if it is more than twice the level of noise.
- In each subject, half of the pixels increase their activity by 18X during the task, and the other half do not change.
- Which pixels are active is completely uncorrelated between subjects.

In this case, none of the pixels in any of the subjects in the experiment will show a significant (over 36X) increase in the task. However, if the researchers average over 9 subjects, for example, then the level of noise goes down by a factor of 3, to 6X. The average increase in activity in the task (compared to control) is 9X (compared to the 18X increase in those pixels that are active in each individual). However, this increase will be distributed unevenly, and since we assume no correlation between subjects, the distribution will be binomial. On average, out of every 512 pixels, there are going to be:
- 1 pixel that is active in all 9 subjects (average increase 18X)
- 9 pixels that are active in 8 of the 9 subjects (average increase 16X)
- 36 pixels that are active in 7 of the 9 subjects (average increase 14X)
- 84 pixels that are active in 6 of the 9 subjects (average increase 12X), and so on.

Since the averaged noise is 6X, any pixel with an average increase of activity of more than 12X will be regarded as having a 'significant change'. Thus, after averaging, around 9% ([1 + 9 + 36]/512) of the pixels will become 'significantly active' in the task.
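A small simulation of this hypothetical example, using exactly the numbers above (noise of 18X, half of the pixels in each subject increasing by 18X, no correlation between subjects, averaging over 9 subjects, significance taken as twice the averaged noise), gives the same ~9%:

```python
import numpy as np

rng = np.random.default_rng(3)
n_subjects, n_pixels = 9, 512 * 1000    # many blocks of 512 pixels, for a stable estimate
noise, increase = 18.0, 18.0            # noise level and task-related increase, in units of X

# In each subject a random half of the pixels increases by 18X during the task;
# which pixels they are is completely uncorrelated between subjects.  Note that
# no single subject passes the single-subject threshold of 36X (twice the noise).
active = rng.random((n_subjects, n_pixels)) < 0.5
n_active_subjects = active.sum(axis=0)
mean_increase = increase * n_active_subjects / n_subjects   # average increase per pixel

averaged_noise = noise / np.sqrt(n_subjects)     # the noise goes down to 6X after averaging
threshold = 2 * averaged_noise                   # an average increase above 12X is 'significant'

fraction = np.mean(mean_increase > threshold)
print(f"'significantly active' pixels after averaging: {fraction:.1%}")   # ~9%
```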
Note that this happens even though we assume that there is no correlation at all between the subjects. Any correlation that
does exist between brains (e.g. visual input in the back of the
cortex), even if very weak, will increase the tendency of averaging to
generate 'significant results' out of random noise.

The reason for these 'ghost results' is that while the averages of
activity and their standard deviations go down on averaging, the
extremes of the distribution of activities do not go down on
averaging, and it is these extremes that appear as significant
results. Note that the threshold of what is 'significant' is a free
variable, which can be adjusted freely to generate the 'best result'
out of data. It is therefore possible to generate 'significant
results' in almost any situation.

The only way to check if these 'significant results' are real is to compare them to other studies, either with the same technique or with another technique. If the result is reliably reproducible, it is unlikely to be random. The problem with this, of course, is that the results of cognitive imaging studies are not reliably reproducible, as discussed above.

[6Jun2001] Random-effect analysis is an effort to judge the significance of an
effect by comparing its magnitude to the variance between the
subjects. In cognitive brain imaging, the effect is the difference of
activity in each pixel between two (or more) conditions. The
underlying assumption is that the subjects' distribution of
differences will reflect the distribution in the general population,
even if their mean is different because of sampling error. Therefore,
this analysis can, in principle, reject spurious random effects that
are just sampling errors from a variable population.
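In practice this amounts to a voxelwise one-sample t-test of the per-subject differences, roughly as in this sketch (simulated data and an arbitrary uncorrected threshold, not any specific package's implementation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_subjects, n_voxels = 12, 1000

# Per-subject effect = difference in activity between two conditions, per voxel.
# Here the data is pure between-subject variability, with no consistent effect.
subject_effects = rng.standard_normal((n_subjects, n_voxels))

# Random-effects analysis: compare the mean effect across subjects to the
# between-subject variance, i.e. a one-sample t-test in every voxel.
t_values, p_values = stats.ttest_1samp(subject_effects, 0.0, axis=0)

significant = p_values < 0.001           # an (uncorrected) voxelwise threshold
print(f"voxels passing p < 0.001 with no real effect: {significant.sum()}")
```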
One of the reviewers of my replicability paper used it as the main counter-argument.

Obviously, the first point to note is that random-effect analysis hasn't yet produced any reproducible result in the cortex. It is thus not actually improving the results. At most, it may have reduced the rise in the number of published papers.

Assuming the underlying data is indeed irreproducible, either
because of the resolution problem or because of variability between
individuals, the reason for lack of reproducibility is clear. The
question is why random-effect analysis doesn't reject all the papers.
The first possibility, which may explain most of the cases, is that
random-effect analysis is either not performed, or performed wrongly.
The actual procedure of doing the random-effect analysis makes the
analysis more complex, and hence adds possibilities of doing things wrongly.

Another reason for the failure of random-effect analysis to block noise
papers is that the analysis is based on the assumption that the
distribution of the magnitude of the effect in the sample is the same
as in the general population (i.e. has similar variance). That is
(approximately) correct if the mean is different between the sample
and the population because the samples contain several large outliers,
and random-effect analysis can easily eliminate these cases. However,
it can go wrong in two cases: (1) pixels in which most of the subjects in the sample happen to deviate in the same direction, so that the sample's mean is off without its variance being inflated; and (2) pixels in which the sample's variance happens to be much lower than the variance in the population, so that even a modest effect looks significant.

The "advertised" way over both of these problems is to use a large
number of subjects, which should take both the distribution and the
mean of the sample closer to the values in the population. Note that
in case (2) above, i.e. for pixels in which the sample's variance is
too low, increasing the number of subjects will cause an increase
in the variance of the sample. For most of the pixels, it will cause a
decrease in the variance in each pixel, and therefore the analysis
will reject fewer of them. Thus the number of accepted pixels will increase.

One problem with this solution is that it does mean using many
subjects, which makes studies more difficult. The main problem,
though, is that it works properly only if the background activity is
flat, i.e. if the only real difference between the conditions is some
pixels becoming more/less active, and the rest of the pixels stay at
the same level of activity, with only random fluctuations. Obviously,
this is unlikely to be true, and there are variations in the mean
level of activity of regions in the cortex between conditions, and
that makes the analysis much messier.

For simplicity, let's assume the only real difference in some
region between two conditions (A and B) is that the mean activity of
the pixels in this region increases in condition B. In this case, any
pattern of activity that will come out of an experiment is noise,
rather than real result, so we need to consider if random-effect
analysis can prevent getting a result, i.e. patterns of activity, in
this case. Consider three cases:
Thus if the average activity in a region is not flat across the
conditions, random-effect analysis will generate random patterns of
activity. Since it is unlikely that the average is completely flat
between conditions over the entire cortex, random-effect analysis
cannot reliably reject false results.

It is worth noting that when random-effect analysis is used in other
fields, it is normally used to test whether some manipulation has an
effect, and for that purpose it is valid. In imaging, where what is
looked for are peaks of activity, that is not good enough, because the
effect in a voxel may be the result of a change across the whole area
plus noise in this voxel.
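To make the argument concrete, here is a minimal sketch (my own
illustration, not taken from any of the studies discussed here) of the
situation just described: the only real effect is a uniform increase
of the regional mean in condition B, plus independent between-subject
noise in each voxel, and the "random-effect analysis" is the usual
per-voxel one-sample t-test across subjects. The subject and voxel
counts, the size of the shift and the threshold are arbitrary choices
of mine.

    # Minimal illustration: a region where the only real difference between
    # conditions A and B is a uniform increase of the mean activity, plus
    # per-subject, per-voxel noise. A voxel-wise one-sample t-test (the usual
    # random-effect analysis) still picks out a scattered subset of voxels,
    # i.e. a spurious "pattern".
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_subjects, n_voxels = 12, 400   # arbitrary illustrative sizes
    regional_shift = 1.0             # uniform B-minus-A increase over the region
    noise_sd = 1.0                   # between-subject variability per voxel

    # B-minus-A difference for each subject and voxel: the same shift everywhere,
    # so no voxel is "special"; all structure in the result is noise.
    diff = regional_shift + rng.normal(0.0, noise_sd, size=(n_subjects, n_voxels))

    # Random-effect analysis: test, voxel by voxel, whether the mean difference
    # across subjects is non-zero, using the between-subject variance.
    t, p = stats.ttest_1samp(diff, popmean=0.0, axis=0)
    significant = p < 0.001          # typical uncorrected voxel threshold

    print(f"{significant.sum()} of {n_voxels} voxels pass p < 0.001")
    print("indices of 'active' voxels:", np.flatnonzero(significant)[:10], "...")
    # Re-running with another seed (i.e. another group of subjects) passes a
    # different set of voxels: the "pattern" is not reproducible, even though
    # every voxel passes a valid statistical test.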
As my paper shows, current CBI
studies are irreplicable. Thus the first target should be establishing
the replicability of CBI. As discussed in [1] above, CBI suffers from
two main problems: resolution and variability across individuals.
Hence a useful step would be to decouple these problems.

It is not obvious what can be done about the resolution of fMRI and
PET, but the variability across individuals can be dealt with by
repeating studies on the same individual. As I found in my paper, this
is currently not done.

The basic design of an experiment to test the CBI method would
be something like this: First, find the pattern of activity in the
brain of one individual that is associated with some task or concept.
Then repeat the same experiment (ideally by another research group)
with the same individual, and see if the result is the same.
What is 'the same' in this context is not obvious. A possible
criterion is whether these patterns can be used to identify the task
or concept that the individual is thinking about. For example, in the
first stage the researchers would find the patterns associated with
thinking about cow, sheep, table and chair. Then the experiment would
be repeated, and the researchers would see whether, using the results
of the first experiment and only the patterns of activity from the
second experiment, they can identify what the individual was thinking
about in the second experiment.

If the patterns are the same (at least good enough to identify
what the individual thinks about), it tells us two important things:
If the patterns are not the same, the conclusion is less clear. It
may be either because of the lack of resolution or because of lack of
replicable patterns of activity in the brain.
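The identification criterion can be stated very compactly. The sketch
below is only my own illustration of it (the data layout, the use of
Pearson correlation and the made-up example data are my assumptions,
not a published protocol): patterns from the first session serve as
templates, and each pattern from the second session is assigned to the
concept whose template it matches best.

    # Illustrative sketch of the identification criterion described above.
    # 'session1' and 'session2' are assumed to map concept names to activity
    # patterns (1-D arrays of voxel values) measured in two separate sessions.
    import numpy as np

    def identify(pattern, templates):
        """Return the concept whose session-1 template correlates best
        (Pearson correlation) with the given session-2 pattern."""
        def corr(a, b):
            return np.corrcoef(a, b)[0, 1]
        return max(templates, key=lambda concept: corr(pattern, templates[concept]))

    def identification_rate(session1, session2):
        """Fraction of session-2 patterns assigned to the correct concept."""
        hits = sum(identify(p, session1) == concept for concept, p in session2.items())
        return hits / len(session2)

    # Hypothetical usage with made-up data for four concepts:
    rng = np.random.default_rng(1)
    concepts = ["cow", "sheep", "table", "chair"]
    stable = {c: rng.normal(size=200) for c in concepts}            # stable part
    session1 = {c: stable[c] + 0.5 * rng.normal(size=200) for c in concepts}
    session2 = {c: stable[c] + 0.5 * rng.normal(size=200) for c in concepts}
    print("identification rate:", identification_rate(session1, session2))
    # If the patterns were not stable across sessions, the rate would drop
    # to chance level (0.25 for four concepts).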
Haxby et al (Science, p. 2425-2430, Vol 293, 28 Sep 2001) actually
perform an experiment similar to the one I describe above. They showed
the subjects pictures from different categories, divided the data from
each subject into two sets, and checked whether they could figure out
which category a subject was looking at by comparing the pattern of
activity to the patterns of activity found in the other set. They
claim high success rates. Of course, it still remains to be seen if
this is reproducible.

The most interesting point about this article is their treatment of
the question of similarity across subjects. Almost all of the previous
brain imaging studies were based on the assumption that there are
patterns of activity which are similar across subjects, and these
patterns are what the researchers were looking for. This paper is
based solely on within-subject comparison, and the authors do not
assume or look for similar patterns across subjects. This is a quite
different theoretical assumption. Haxby et al, however, do not make
this point explicit, and a non-alert reader is unlikely to realize
that the paper is based on a very different theoretical underpinning
from previous studies. The commentators in the same issue (Cohen and
Tong, Science, p. 2405-2407, Vol 293, 28 Sep 2001) completely missed
this point. In a quite long comment they discuss this article and
another brain imaging study in the same issue. However, they
completely ignore the similarity assumption, and don't mention at all
that the Haxby et al study is based solely on within-subject
comparison.

Haxby et al do mention the question of cross-subject
comparison (p.2427), but claim that the warping methods (methods to
account for differences in the 3D shapes of brains) are inadequate.
This is a ridiculous assertion, because the warping methods can easily
cope with the resolution of their study (3.5mm). In addition, they
could have shown the patterns of Fig. 3 for more than one subject, to
enable visual comparison, but they didn't. They leave it to the reader
to work out that there simply aren't any cross-subject correlations in
their data, though they do hint at it in their comment that
single-unit recording in monkeys did not reveal a larger-scale
topographic arrangement.

[26 Nov 2004] There is now a further study in this vein. This
article (Hanson, Matsuka and Haxby, NeuroImage, Volume 23, Issue 1,
September 2004, Pages 156-166; probable full text) does a more serious
analysis. In this paper they mention that "1) inter-subject
variability in terms of object coding is high, notwithstanding that
the VMT mask is relatively small" (bottom of p.14 of the pdf file),
but avoid showing comparisons between individuals. The paper argues
against the idea of a face area, but they don't mention the problems
that I discuss in section 7 below.
[27 Apr 2004] Today I went to a lecture by Haxby, in which he
discussed these patterns. At the end of the talk I asked him whether
he had tried to compare the patterns across subjects. His answer was
quite odd. He clearly didn't compare across individuals, and thought
that it wouldn't work because of the "high resolution", but then also
said that they are working on better methods to try to align the
patterns across individuals. He seems to both believe and not believe
that the patterns can be matched across subjects. For the rest of the
audience (Cambridge University department of psychology) the issue was
of no interest.
([15July2005] No.)
This is a new article in the Journal of Cognitive Neuroscience:
Extensive Individual Differences in Brain Activations Associated
with Episodic Retrieval are Reliable Over Time, Michael B. Miller,
John Darrell Van Horn, George L. Wolford, Todd C. Handy, Monica
Valsangkar-Smyth, Souheil Inati, Scott Grafton and Michael S.
Gazzaniga, Journal of Cognitive Neuroscience, 2002;14:1200-1214.
([30Oct2003] dead link; abstract only here.)
They explicitly tested for variability between individuals in their
group, and obviously found that there is a lot of variability. They
also found that this variability is stable within individuals. In
their discussion, they try not to attack group studies (i.e. almost
all studies until now, including their own studies), but they do make
it reasonably clear that their results actually invalidate these
studies. The principal author of this study (Gazzaniga) is quite a
central figure in Cognitive Neuroscience, so there is some chance that
this paper will not be simply ignored.
While the results of this study nicely agree with what I am saying,
that does not mean that they are going to be reproducible. Until they
are, this study cannot be regarded as supporting evidence. I actually
e-mailed the authors suggesting that they ask another research center
to try to reproduce the results with the same individuals. I got an
interesting answer from the first author, which suggests that they are
not going to pursue this line of research.
Logothetis et al (Neurophysiological investigation of the basis of the
fMRI signal, Nature, 2001, V.412, p.150) tried to identify what causes
the fMRI signal, i.e. what causes the haemodynamic responses that fMRI
detects. This article was hyped to some extent as an important
advance, but it doesn't promise much.

The first important point is that it was found that the fMRI signal
correlates with the LFP (local field potential), but not (or not so
well) with action potentials. Since action potentials are the way
neurons transfer information from their cell body along the axon and
hence to other neurons, this means that fMRI does not correlate with
information processing.

The commentators on the article gloss over this point by writing about
"local processing", referring to processes inside the dendrites.
These, however, has no functional significant if they don't affect the
firing of action potentials, so on their own cannot be regarded as
information processing, local or otherwise. In the "news and views" about this article (Bold insights, Nature,
2001, V.412, p.120), Marcus E. Raichle infers from this article that:
"For the neurophysiologists, who seek to understand the biochemical
and biophysical processes underlying neural, the absence of action
potentials must not be interpreted as the absence of information". He
doesn't tell us how information processing can happen without action
potentials. As far as I can tell, the sole argument is based on the
'religious' belief that fMRI must be associated with information
processing. I e-mailed Raichle asking how information processing can
happen without action potentials, but I would be surprised if I got an
answer.

There are some claims that the LFP also represents the activity of
small local neurons (i.e. non-pyramidal ones), for example Logothetis
himself in The Underpinnings of the BOLD Functional Magnetic Resonance
Imaging Signal, J. Neurosci. 2003, 23:3963-3971. That would give the
LFP a little more credibility, but not much. The pyramidal cells are
the majority of cells, and because of their size and huge number of
synapses, they form an even larger part of the synapses in the cortex.
If the LFP doesn't correlate with their activity, it still doesn't
really correlate with information processing.
The other important point in Logothetis et al is that they show that
the fMRI signal lags more than 2 seconds behind the neural response
(that was actually more or less known before, but not measured
directly). That means that it can be used to monitor only mental
operations that cause patterns of activity which are fixed for
seconds. Most of thinking proceeds much faster than that, so fMRI will
never be useful for monitoring thinking. Only simple operations, like
perceiving a fixed stimulus or repeatedly performing a simple motor
task, can be monitored by fMRI. Note that the restriction is a result
of the behaviour of the underlying variable of fMRI (the haemodynamic
response), and hence cannot be overcome by improving the technique.
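To see why a lag of a couple of seconds matters, here is a small
sketch using a simple gamma-shaped haemodynamic response (a common
textbook approximation, chosen by me for simplicity; it is not the
response function measured by Logothetis et al): two brief bursts of
neural activity half a second apart merge into one slow bump that
peaks several seconds later, so their timing cannot be recovered from
the fMRI signal.

    # Rough illustration of the sluggishness of the haemodynamic response.
    # The gamma-shaped response used here is a simple approximation chosen
    # for illustration, not the response measured by Logothetis et al.
    import numpy as np

    dt = 0.1                        # time step in seconds
    t = np.arange(0, 30, dt)

    def hrf(t, peak=6.0):
        """Simple gamma-shaped haemodynamic response peaking ~'peak' seconds."""
        h = (t / peak) ** peak * np.exp(peak - t)
        return h / h.sum()

    # Two brief neural events 0.5 s apart ("thinking"-scale timing).
    neural = np.zeros_like(t)
    neural[round(1.0 / dt)] = 1.0
    neural[round(1.5 / dt)] = 1.0

    bold = np.convolve(neural, hrf(t))[: len(t)]   # predicted fMRI-like signal
    print("neural events at 1.0 s and 1.5 s")
    print("BOLD peak at %.1f s" % t[bold.argmax()])
    # The two events merge into a single slow bump peaking seconds later,
    # so their order and timing cannot be recovered from the fMRI signal.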
A point that was discussed in the article and commentaries is that
Logothetis et al show that the neural response is in many cases much
more significant (by an order of magnitude) than the fMRI response.
That means that the fMRI measurements always contain a very large
number of false negatives, i.e. pixels where there is a response in
the level of activity, but no significant fMRI signal. Therefore, the
patterns of fMRI signals do not actually properly reflect the patterns
of neural activity (even the lagging LFP that they correlate with).
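A minimal sketch of this point, with numbers that are mine and purely
illustrative: if the fMRI effect at a truly responding voxel is an
order of magnitude less significant than the neural response, then at
any threshold strict enough to keep false positives rare, almost all
truly responding voxels fail to reach it.

    # Illustration (with made-up numbers) of why a signal an order of
    # magnitude less significant than the underlying neural response must
    # miss most responding voxels at any reasonable threshold.
    import numpy as np

    rng = np.random.default_rng(2)
    n_voxels = 10_000
    neural_z = 8.0          # assumed significance (z-score) of the neural response
    fmri_z = neural_z / 10  # fMRI effect an order of magnitude less significant
    threshold = 3.1         # roughly p < 0.001, one-sided

    noise = rng.normal(0.0, 1.0, size=n_voxels)
    detected_neural = (neural_z + noise) > threshold
    detected_fmri = (fmri_z + noise) > threshold

    print(f"responding voxels detected from the neural response: {detected_neural.mean():.3f}")
    print(f"responding voxels detected from the fMRI signal:     {detected_fmri.mean():.3f}")
    # With these numbers essentially all responding voxels are detectable from
    # the neural response, but only about one percent from the fMRI signal,
    # i.e. the fMRI map misses almost all of them.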
In short, this paper shows that the fMRI signal is not correlated
with information processing, lags behind the activity that it is
correlated with, and does not properly reflect it.

[23 Jun 2003] This one made me laugh: Pessoa et al (Neuroimaging
Studies of Attention: From Modulation of Sensory Processing to
Top-Down Control, J. Neurosci. 2003, 23:3990-3998), after referring to
this paper, say:
The "Fusiform Face Area" (FFA) was quoted by the editor of
Neuron as an example of a "reasonably robust" result (here). There are many references to the
FFA, for example, Tootel et all (Roger B. H. Tootell, Doris
Tsao, and Wim Vanduffel Neuroimaging Weighs In: Humans Meet Macaques
in "Primate" Visual Cortex J. Neurosci. 2003 23: 3981-3989), in
a mini-review article, say:
2.17 An editorial in Nature Neuroscience
We also encourage researchers to deposit their data sets in
recommended data repositories
(http://www.nature.com/sdata/policies/repositories) so that they can
be aggregated for large-scale analyses across studies, potentially
improving the statistical power and robustness of any conclusions that
may arise from these analyses.
Obviously, such data repositories can be used to compare the data
across research groups, and hence check for reproducibility, but that
possibility did not occur to the writers. All they can think of is
aggregating the data.
Unfortunately, such reports have unintentionally harmed this
technique's reputation and called into question the merit of published
fMRI research.
This sentence could have been written: "Such reports raise doubts
about reproducibility of fMRI research," but it was not. Instead, it
is "Unfortunately", and "unintentionally", and "harmed this
technique's reputation", all terms with negative connotations,
obviously with the intention of associating these connotations with
"such reports". They continue:
Are these criticisms warranted and, even if the answer
is 'no', how can the scientific community address the negative
connotations associated with this research?
Whether these criticisms are warranted is an interesting question, but
they do not actually discuss it in the rest of the editorial. Instead,
they imply that the answer is 'no', and go on to worry about the
"negative connotations" associated with "this research" (it is not
obvious whether this refers to fMRI research or to the critical
articles).
They continue:
Even with the innumerable parameters that may differ between
individual fMRI studies - study and task designs, scanner protocols,
subject sampling, image preprocessing and analysis approaches, choice
of statistical tests and thresholds, and correction for multiple
comparisons, to name a few - many findings are reliably reproduced
across labs. For example, the brain regions associated with valuation,
affect regulation, motor control, sensory processing, cognitive
control and decision-making show concordance across different fMRI
studies in humans; these findings have also been supported by animal
research drawing on more invasive and direct measures.
That would be a plausible defence of fMRI research if the bulk of fMRI
research were about mapping these regions, and it thus implies that
this is the case. But that is false, because fMRI is used for much
more detailed studies (mapping these regions doesn't tell you much,
and it can be done by other methods). These more detailed studies come
with more detailed data, which is not reproducible.
These converging results should be highlighted in commentaries
regarding research reproducibility, and critiques should be
constructively balanced with potential solutions.
That is just ridiculous. They are not discussing young children who
require encouragement and guidance. They are discussing the work of
supposedly serious scientists, who should expect critiques that are
not "constructively balanced" and come up with their own solutions.
Really the message here is: "Don't talk about reproducibility without
pretending that, in general, fMRI produces reproducible results",
because otherwise it has "unintentionally harmed this technique's
reputation and called into question the merit of published fMRI
research". They continue:
In doing so, these critiques can provide an opportunity to revisit
methods and highlight caveats, allowing the neuroimaging community to
refine their methodological and analytical approaches and adopt
practices that ultimately lead to more robust and reproducible results
(http://www.ohbmbrainmappingblog.com/blog/keep-calm-and-scan-on).
And what about rejecting spurious results? Rejecting spurious results
is what makes science different from other approaches to knowledge,
which is why the main method of doing it, i.e. reproduction of
research, is so important. But the editors of Nature Neuroscience
manage to forget it here.
2.18 "Small sample sizes reduce the replicability of task-based
fMRI studies" - actual progress
29 Dec 2018
We find that the degree of replicability for typical sample sizes is
modest and that sample sizes much larger than typical (e.g., N=100)
produce results that fall well short of perfectly replicable.
2.19 "False-positive neuroimaging" - kind of actual progress
21 Jul 2019
Together, all these factors make the survey best compatible with a
replicability rate of a few percent at most, and also compatible with
0% replication.
3. Averaging of images
In many cases (almost all published research in PET and MRI) the
results are averaged over several subjects, to get significant
results. Obviously, if there are variations between individuals, this
procedure will generate noise. However, most of the researchers in the
field simply ignore the possibility of variations between individuals,
and hence feel free to use averaging to enhance the significance of
the results.

It is remarkable that at the millimeter range of precision
most studies have shown it possible to sum activations over subjects
who perform in the same tasks in order to obtain significance. This
suggests that even high-level semantic tasks have considerable
anatomical specificity in different subjects and is perhaps the most
important results of the PET work.
Number of pixels   Number of subjects showing increase   Average of increase of 18X
       1                          9                                 18X
       9                          8                                 16X
      36                          7                                 14X
      84                          6                                 12X
     126                          5                                 10X
     126                          4                                  8X
      84                          3                                  6X
      36                          2                                  4X
       9                          1                                  2X
       1                          0                                  0X
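My reading of this table (an assumption, since the surrounding text
does not spell it out) is that it illustrates averaging over 9
subjects: 512 pixels, with each subject showing an 18X increase in a
random half of the pixels, so a pixel in which k subjects show the
increase averages 18k/9 = 2kX, and the expected number of such pixels
is C(9, k). A short sketch reproduces the table under that reading:

    # Reproduces the counts in the table above under one reading of it (my
    # assumption): 512 pixels, 9 subjects, each subject showing an 18X
    # increase in each pixel independently with probability 1/2. The expected
    # number of pixels in which exactly k subjects show the increase is C(9, k).
    from math import comb

    n_subjects, n_pixels, increase = 9, 512, 18

    print("pixels  subjects with increase  average")
    for k in range(n_subjects, -1, -1):
        expected_pixels = n_pixels * comb(n_subjects, k) / 2 ** n_subjects
        average = increase * k / n_subjects
        print(f"{expected_pixels:6.0f}  {k:22d}  {average:6.0f}X")
    # The "pattern" obtained by averaging is entirely a product of chance
    # overlap between individually random patterns.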
4. Random-effect analysis
5. What experiments should be done in CBI
[8Feb2000] Just found this,
which tries to measure replicability within the same subject(s) over
time. They claim that there is replicability, but don't give enough
information to judge whether it is serious. I e-mailed the author to
find out if there are newer results. [27Jul2001: the link is dead.]

[28Nov2000] This one is interesting, because it shows in-subject
comparison between imaging at 1.5T and 3T (T (Tesla) is the unit of
the strength of the magnetic field). They look at activity in M1 and
V1, which is most likely to be reproducible. The results (in Fig 4)
show quite large differences even inside the same subject between the
different field strengths. If these differences are typical, the
experiment above will fail to show anything useful.
5.1 They now do these experiments.
5.2 Real progress?
[07Nov2002]
6. What fMRI actually measures
[8 Dec 2003] There are now claims that they can measure the magnetic
field that results from neuronal activity directly here (full
article). I have no idea how real that is; in principle it could work.
Thus, in some case fMRI may reveal significant activation that may
have no counterpart in single-cell physiology.
As mentioned above, Logothetis et al showed that fMRI will necessarily
have a huge number of false negatives (i.e. where it fails to show
activity even when there is activity), because its signal has
significance an order of magnitude lower than the neural responses.
This, however, seems to have completely passed by these authors, who
infer the inverse, i.e. that single-cell physiology will miss activity
that the fMRI shows. It seems that they cannot perceive the
possibility that the fMRI signal is not an accurate reflection of the
underlying activity. Note their usage of the verb "reveal", with its
strong connotations of being true and useful, rather than more neutral
terms (like "indicate", "show", "give", "suggest").
7. The "Fusiform Face Area"
This basic face/non-face distinction has been replicated consistently
in many laboratories (Puce et al 1996, Allison et al 1999, Halgren et
al 1999, Haxby et al 1999, Tong and Nakayama 1999, Hoffman and Haxby
2000, Hasson et al 2001), a significant accomplishment in itself.
As a side point, we can note that these authors clearly know that
there is a problem with reproducibility in cognitive brain imaging,
but they completely ignore this point in the rest of the article (as
do the rest of the mini-reviews in this issue of J. Neurosci.).
However, they do claim that the FFA is reproducible. What they don't
tell you is that the FFA is "reproducible" in different locations in
different individuals. In fact, the variability is such that
researchers find the FFA by looking, in each individual brain, for
regions that are differentially active when the subject views faces.
For example, in Tong et al (Cognitive Neuropsychology, 2000, 17
(1/2/3), 257-279), an article from the principal investigator
(Kanwisher) who is normally credited with identifying the FFA (in this
article), they use this method of identifying the FFA in individual
subjects, and justify it by saying (p.3 of the pdf file):

Such individual localisation was crucial because the FFA can vary
considerably in its anatomical location and spatial extent across
subjects (Kanwisher et al., 1997).

Thus the FFA is not reproducible across individuals, but you can find
face-sensitive patches in each individual and call them "the FFA". And
that is what Tootell et al above call a "significant accomplishment".
Tootell et al are actually aware of the technique that is used to
achieve "reproducibility" (at least that is what Tootell told me by
e-mail), but couldn't be bothered to tell their readers.
Faces are very special "objects" for humans, both because all of us
are "experts" in them (we have a lot of experience looking at them and
interpreting them), and (more importantly) because they are associated
with emotional responses (most humans reflect their emotional state in
their face). Therefore, it is plausible that we have many (learned)
face-selective patches in the high-level vision areas. The "FFA" seems
to be a good place to look for this kind of patch, but that doesn't
mean that the area is specialized for faces. The stochastic
connectivity of the cortex (for which the "considerable variation
across subjects" can be regarded as another small piece of evidence)
rules out any face-specific circuits inside this area. What is
possible is that, because of the emotional content of faces, the
connectivity to extra-cortical structures which take part in
processing emotional responses (most importantly, the amygdala) may
determine where it is easier to find face-selective patches.
It should also be noted that there are already suggestions (backed by
some evidence) that this area tends to be more active when we view
objects for which we have high expertise (and we all have high
expertise for faces) (also this one). This suggestion, because of its
generality, has a better chance of actually being true (though,
obviously, we still need to see reproduction of the data), but because
other objects don't reflect emotions the way faces do, other objects
will never produce responses as strong as faces do.
[10 Dec 2007] A review article by promoters of the FFA fails to
consider the question of emotions and their reflection in faces. The
words 'emotion' and 'feel' do not appear at all in the article, which
is quite stunning considering the strong association between facial
expressions and emotions.
Note also that in the original article they found FFA patches only in 12 out of 15 subjects. Because of the positive publication bias, we don't know in how many other subjects it is not possible to find these patches. It is also not known in what percentage of subjects it is possible to find such patches in other regions.
The idea of looking for patches in individual brains and then
discussing them as if they are reproducible is used in other cases,
too. For example, Kanwisher herself uses it in A Cortical Area
Selective for Visual Processing of the Human Body, Paul E. Downing,
Yuhong Jiang, Miles Shuman, Nancy Kanwisher, Science, Vol 293, 28 Sep
2001: 2470-2473. Since these kinds of studies effectively compare the
data with itself, they are more likely to generate results that are
reproducible but not interesting (e.g. that face-sensitive patches are
sensitive to faces). They are therefore effective in making the
discussion of reproducibility of imaging more confused. More of the
"body" stuff (full text), and more, and a quite funny argument about
the "EBA" (p.126, V.8, No.2, Feb 2005, Nature Neuroscience).
[2Apr2004] The latest Science contains an example of how the
localizing is used (Contextually Evoked Object-Specific Responses in
Human Visual Cortex, Cox et al, Science, Vol 304, Issue 5667, 115-117,
2 April 2004). The localizing is barely mentioned in the text of the
article ("Once we localized the FFA for each subject"), and there is
another sentence in Figure 3. The supporting material gives more
information, including the fact that for two out of nine subjects the
localizer scan found nothing. The vast majority of readers, though,
are not going to notice any of this, and will get the impression that
the FFA is a well-defined area. The fact that this is not what the
article tries to show makes the effect even stronger, because readers
are unlikely to pay much attention to side issues.
[7 May 2006] The idea of the localizer apparently went to the heads of
many researchers, to the point that some have difficulty publishing
articles that don't localize. See the exchange in "Comments and
Controversies" in Neuroimage, Volume 30, Issue 4, Pages 1077-1470 (1
May 2006). Note that neither side raises the question of
reproducibility of the results they get, which should be the first
question in a discussion of methodology.
[6 Oct 2008] Just found this (Vul, E. & Kanwisher, N. (in press).
Begging the question: The non-independence error in fMRI data
analysis. To appear in Hanson, S. & Bunzl, M (Eds.), Foundations and
Philosophy for Neuroimaging.) This exposes some of the mistakes in
fMRI. The interesting thing about it is that it is co-authored by
Kanwisher, the inventor of the "Fusiform Face Area". She actually
mentions her 1997 article (above) as one case of showing bad figures.
But she ignores other problems, and reproducibility is not mentioned,
as usual.
[2 Nov 2004] Other parts of the brain are more ordered than the
cerebral cortex, so in principle they may show higher reproducibility,
if the reason for the lack of reproducibility is the variability in
the cortex across individuals. This article (Schneider et al,
Retinotopic Organization and Functional Subdivisions of the Human
Lateral Geniculate Nucleus: A High-Resolution Functional Magnetic
Resonance Imaging Study, The Journal of Neuroscience, October 13,
2004, 24(41):8975-8985 (full text)) doesn't promise much. In their
figure 2 they show data from two "representative" subjects. They say
(p. 8978):

In the coronal plane, the representation of the horizontal meridian
was oriented at an 45° angle, dividing the bottom visual field,
represented in the medial-superior section of the LGN, and the top
visual field, represented in the lateral-inferior section. Although
the extent of activations varied somewhat among subjects, the general
pattern of retinotopic polar angle organization was consistent.

The last sentence is a very optimistic view of their data. Apart from
the gross similarity that they describe in the first sentence, the
data for the two subjects is completely different. It is obvious that
not only is every voxel different, but even each group of 2x2x2 pixels
varies across the individuals almost randomly, apart from the gross
similarity.
This article (Voodoo Correlations in Social Neuroscience, Vul et al, In Press, Perspectives on Psychological Science, First author web page) contains very serious criticism of some papers using fMRI. Surprisingly, they don't mention reproducibility in their paper.
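The statistical point behind their criticism, the non-independence (or
"double dipping") error, is easy to demonstrate. The sketch below is
my own illustration, not their analysis: if voxels are selected
because they correlate with a behavioural measure, and the correlation
is then computed on the same data in the selected voxels, pure noise
produces impressively high "correlations".

    # Illustration (my own, not the analysis of Vul et al) of the
    # non-independence error: selecting voxels by their correlation with a
    # behavioural score and then reporting the correlation of the selected
    # voxels on the same data inflates the result even when there is no
    # real relation at all.
    import numpy as np

    rng = np.random.default_rng(3)
    n_subjects, n_voxels = 16, 5000
    behaviour = rng.normal(size=n_subjects)              # behavioural scores
    activity = rng.normal(size=(n_subjects, n_voxels))   # pure noise "activity"

    def corr_with_scores(data, scores):
        """Pearson correlation of each voxel (column) with the scores."""
        b = (scores - scores.mean()) / scores.std()
        a = (data - data.mean(axis=0)) / data.std(axis=0)
        return (a * b[:, None]).mean(axis=0)

    r = corr_with_scores(activity, behaviour)
    selected = r > 0.6                                   # "significant" voxels

    print(f"{selected.sum()} voxels selected out of {n_voxels} (pure noise)")
    print(f"mean correlation in selected voxels: {r[selected].mean():.2f}")
    # An independent second data set for the same voxels would give
    # correlations near zero; the reported value is a selection artefact.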
In one rebuttal of this paper, Response to "Voodoo Correlations in
Social Neuroscience" by Vul et al. - summary information for the
press, they state that there were many replications of the findings
that are discussed by Vul et al. Of the references they give, one
doesn't actually appear in the list of references (Gu and Hahn 2007),
and I couldn't easily find online the text of the review (Singer and
Leiberg, 2009) or one of the papers (Lamm et al 2007a). I checked the
other four references they give (Lamm et al 2007b, Saarela et al 2007,
Jackson et al 2006, Jackson et al 2005). I couldn't see anything that
could count as comparing the actual data (rather than interpretations)
between these articles and previous publications. This is a clear case
of the bogus references technique.
In a sense, it is progress that they actually think about reproducibility, but they obviously don't take the concept seriously.
The authors of the original paper answer here. They seem not to have bothered to read the references.
The first author of the original article, Edward Vul, also wrote this book chapter. Again, there is serious criticism of some articles, but no reference to reproducibility.
-----------------------------------------------------
Yehouda Harpaz