Critique of Vision by Marr

1. Preface

[1.1] Vision was written by David Marr and is his major work. However, it is not identical to Marr's work, both because it does not cover everything Marr did and because it was published posthumously, and hence may contain inaccuracies that Marr would have corrected had he been able to. This text discusses the ideas presented in the book Vision, rather than Marr's work as a whole.

2. Introduction

[2.1] The ideas presented in Vision (David Marr (1982). Vision. San Francisco: Freeman.) had, and still have, a large impact on research in human perception, and also in human cognition. They are widely quoted, and used as a basis for many models and research programs.

[2.2] It is easy to see why. The ideas in Vision are presented clearly and succinctly. In a relatively thin volume, the author packed more theory, and explained it more clearly, than many other authors in far larger volumes. In addition, these ideas simplify theorizing about perception, and give a firm base for model building. Not surprisingly, they are adopted enthusiastically by many researchers.

[2.3] In this text, I will argue that this is a mistake. The ideas in Vision are based on many assumptions which are, at the least, unsupported by evidence, and which in many cases seem to contradict it. In principle, it is possible to base a theory on unsupported assumptions, but only on condition that tests of these assumptions are included in the research program. It is not always possible to test the basic assumptions directly, but their validity must always remain open to doubt.

[2.4] That is not what is done in Vision. The basic assumptions are presented as obvious truths, or introduced with a hand-waving argument, and then taken for granted in the rest of the book. Rather than using the evidence, with the help of models, to test the basic assumptions, the book uses these assumptions as the main criteria for checking the validity of the models. The evidence from the real system (the brain) is relegated to a secondary role, if it is considered at all. This leads to unrealistic models, which are unlikely to contribute to our understanding of the human brain.

[2.5] Other researchers who adopt these ideas are likely to make the same mistakes in their models. More importantly, adopting these ideas uncritically blinds people to criticism of their work. My intention in this text is to dispel the 'magic' of these ideas, and to show why they are unlikely to be correct and how misleading they are.

[2.6] The text is aimed at people who have read Vision and found it at least interesting. It would be difficult to follow the logic without having read the book. In fact, as I write this, I assume that the reader is ready to go back to the book and check whether what I claim it says is actually correct.

3. Organization

[3.1] In Part I, I discuss the main assumptions in Vision. This corresponds more or less, though not exactly, to Part I of the book. In Part II, I examine the more detailed discussions and models that appear in Part II of the book. In particular, I try to show how the basic assumptions lead to unlikely ideas and models.

[3.2] In sectioning the text, I do not follow the sub-headings (chapters and sections) of the book. When I quote from the book, I give the page number in the first edition. In some places, I have added the name of the section in Vision in italics, to make it easier to find the quote if the book is printed in a different format.

Part I : The philosophy and the approach

I.1 Representation

[I.1.1] Representation is a central concept in Vision, and in current thinking about human cognition. However, it is a slightly vague term, and this vagueness is used to introduce confusion into the discussion. The treatment in Vision is a good example of this confusion.

[I.1.2] The idea of representation is introduced in the second paragraph of Vision (p. 3): (1) "For if we are capable of knowing what is where in the world, our brains must somehow be capable of representing this information." This statement may seem reasonable, but only because the term representation is vague enough that the actual meaning of the sentence is not really clear.

[I.1.3] Later, representation is defined (P. 20 Representation and description): (2) "A representation is a formal system for making explicit certain entities or types of information, together with a specification of how the system does this."

This, also, does not look objectionable.

[I.1.4] However, combining the two statements gives: (3) 'If we are capable of knowing what is where in the world, our brains must have a formal system for making explicit certain entities or types of information, together with a specification of how the system does this.'

[I.1.5] This statement may be what the author actually thinks, but I suspect that even he would admit that its correctness is not obvious (unless the term 'formal' is stripped of any meaning). Faced with statement (3), most readers would probably require some argument to support it, in particular for the requirement of a formal system. Instead, statement (3) is sneaked in in two portions, as statement (1) and, later, definition (2). It works because the uncritical reader, on reading statement (1), automatically adjusts his understanding of the term representation to fit this statement, and later accepts definition (2) without checking whether it agrees with the sense in which the term was used earlier.

[I.1.6] I am not sure whether the author himself was aware of the trick he is playing on the reader. If he unconsciously considered statement (3) to be obviously true, he could have missed the confusion completely. This kind of confusion, where an identity between two different entities is achieved by using the same word to describe both (in this case representation for "whatever brains use" and for "formal systems"), is very common in cognitive science. See Reasoning errors for more examples.

I.2 Levels Of Analysis and the question of implementation in neurons

[I.2.1] The concept of levels of analysis is probably the most used (mostly abused) idea in Vision. The concept itself is reasonable, but it can easily be distorted. It is already distorted in the definitions of the three levels. For example, in the top level (P. 24 The Three levels): "[T]he performance of the device is characterized as a mapping from one kind of information to another, the abstract properties of this mapping are defined precisely, and its appropriateness and adequacy for the task at hand are demonstrated."

[I.2.2] This is based on the assumption that the device can be described as performing a precise input-output mapping. This assumption, which is at the foundation of the approach, is not made explicit, and is not discussed anywhere in Vision. The definition above also assumes that we can judge the 'appropriateness and adequacy' of the mapping, again an assumption which is not discussed anywhere. The levels are introduced by discussing the properties of a cash register (p. 22-23 Process), and in this case the assumptions above are clearly true. But the extrapolation from the cash register to human perception is unjustified.

[I.2.3] Both of the above assumptions are probably based on the assumption that the processes in the human brain are 'mathematical', by which I mean processes that are, apart from being formal, also expressible in relatively compact mathematical notation. This assumption is made clear (though not explicit) by the names of the top and middle levels ('computational' and 'algorithmic'). Both of these terms 'smell' of 'mathematical' processes.

[I.2.4] In addition, the word 'computational' adds another opening for confusion. As quoted above, this level describes the performance of the system, so something like 'performal' (or maybe 'behavioral') would be more accurate. For most people, 'computational level' indicates the level at which the internal computations are described, which is what is called in Vision the 'algorithmic level'. The literature is full of algorithmic models that are described as 'computational', and even in Vision the distinction is not kept. For example, when the 'computational theory' of stereopsis is introduced, it starts (p. 111-112 Measuring Stereo Disparity):

"Three steps are involved in measuring stereo disparity: (1) A particular location on the surface in the scene must be selected from one image; (2) That same location must be identified in the other image; and (3) the disparity between the two corresponding image points must be measured."

This is clearly an algorithmic theory, because it describes the internal operation of the system and is not based on measurement of input and output (see the definition above).
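To see that the three steps describe internal operations rather than an input-output mapping, they can be written down directly as a procedure. The sketch below is a toy on one-dimensional 'images'. The patch matching by sum of squared differences, the function name measure_disparity and its parameters are my own assumptions for illustration, and are not the matching rule proposed in Vision.

```python
def measure_disparity(left, right, position, half_width=2, max_shift=5):
    """Toy version of the three quoted steps on 1D intensity lists.

    Assumes position - half_width >= 0 and that the patch fits inside left.
    """
    # Step 1: select a location (here a small patch) in the left image.
    patch = left[position - half_width: position + half_width + 1]

    best_shift, best_cost = None, None
    # Step 2: identify the same location in the right image by trying
    # every candidate shift and keeping the best-matching patch.
    for shift in range(-max_shift, max_shift + 1):
        start = position + shift - half_width
        end = position + shift + half_width + 1
        if start < 0 or end > len(right):
            continue  # candidate patch would fall outside the image
        candidate = right[start:end]
        cost = sum((a - b) ** 2 for a, b in zip(patch, candidate))
        if best_cost is None or cost < best_cost:
            best_shift, best_cost = shift, cost

    # Step 3: the disparity is the offset between the two locations.
    return best_shift


# The same intensity bump appears three samples further right in the right image.
left = [0, 0, 0, 1, 3, 7, 3, 1, 0, 0, 0, 0, 0, 0]
right = [0, 0, 0, 0, 0, 0, 1, 3, 7, 3, 1, 0, 0, 0]
disparity = measure_disparity(left, right, position=5)
```

Every line of this procedure is a statement about what happens inside the system, which is exactly what the definition of the top level excludes.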

[I.2.5] This confusion leads to the worst abuse of the concept of levels: disregarding the question of implementation. That is not actually what is suggested in Vision, but the book takes it for granted that anything that can be implemented by mathematical functions and computer programs can also be implemented by neural systems. For example, on p. 152 (neural implementation of stereo fusion): "A complete neural implementation of the second stereo matching algorithm just described has not yet been formulated. One reason is that such a formulation was not worth the considerable work involved until we were reasonably certain from implementation studies and psychophysics that the algorithm works and is roughly correct". The 'implementation studies' here obviously refer to computer implementation studies, and it is assumed that once they show that the model is implementable on a computer, it will follow that it is implementable in neural systems.

However, many researchers interpret the level theory as saying that computational models do not need to be implementable at all. They use the concept of levels to justify models which cannot be implemented by real neurons, by saying that they are 'computational models', and therefore need not be implementable. For example, the most frequent objection to my paper that shows that symbolic models cannot be implemented by neurons is that 'they are computational models'.

The view in Vision, that the models have to be implementable but that this can be checked on a computer, is less nonsensical, but still unreasonable. At the level of implementation, computers and neural systems are totally different things, and you simply cannot infer from one to the other (see Computer models of cognition and brain symbols for a discussion). You can simulate neurons on computers, and if the simulation is good then some deductions about neurons may be possible, but that is not what is envisaged in Vision.

In the Conversation at the end of the book, the question of implementation in neurons (as opposed to computers) is touched on several times. One of these is the 'implementation' of the del-square-G function (d2G), which is discussed below [II.2]. In another place, the 'interviewer' asks (in response to a specific suggestion the author makes) (P. 355): "But how on earth do you do that with neurons?", and is answered: "Hold on there - we'll face that next. But note that basically, it's not difficult computationally." The 'face that next', however, boils down to the guess that the implementation is more like Barlow's neural dogma than Hebb's cell assembly. There isn't even a hint of a description of an implementation with neurons.

I.3 The vision system : Basic assumptions

[I.3.1] The discussion of the vision system starts by defining a representational framework (p. 31-39 A Representational Framework for Vision). This definition, however, is arbitrary, and it is based on several unsupported assumptions rather than on real argument.

[I.3.2] The first of these assumptions is that the vision system is separate from the thinking system, with well-defined boundaries. In evolutionary terms, this is extremely unlikely, because there is no pressure for separating vision from the rest of the thinking system. Yet in Vision it is taken for granted without any supporting argument. (This can be regarded as an aspect of the modularity assumption, discussed below.)

[I.3.3] The second assumption is that vision is representational, which, as quoted above, means that it is a formal system. Again, there is no serious discussion of this assumption. This assumption is also a priori unlikely to be true, because a formal system would, in general, tend to break in a more consistent way than the way human vision breaks after brain damage.

[I.3.3] At the end of the book, in the Conversation, there is a discussion of some possible alternatives to the approach taken in Vision (p. 340-341), but they are all based on these two assumptions as well.

[I.3.4] That the purpose of the visual system is recognizing objects, and that the rest can be ".. hung off a theory in which the main job of vision was to derive a representation of shape" (P. 36 Advanced Vision), is an explicit assumption. This one is obviously nonsense: the purpose of any information processing is to improve the behavior of the individual, and hence to improve its reproductive rate. Identifying shapes is only part of the job, and other attributes, like movement and functional properties, are as important, if not more so. There is no explanation in the book of how these can be 'hung off' shape recognition, even though they require temporal and detailed information, both of which are explicitly discarded by the model presented in Vision.

[I.3.5] An additional fundamental assumption is that visual processing is done in discrete stages. Up to the cortex, this makes sense, but most of the ideas in Vision are about what happens in the cortex. There is no evidence at all for discrete stages of processing in the cortex. This assumption is tied up with the modularity assumption, which is discussed below.

[I.3.6] The 'mathematical' nature of the processes in vision (on top of their being formal) is another fundamental and unlikely assumption. Even if the processes are formal, the complexity of the connectivity in the human cortex makes it unlikely that they can be described by compact mathematical notation.

On top of the 'mathematical' assumption, it is also assumed that the processes in vision are precise. Without precision, most of the specific models that are suggested will not work. This assumption is unsupported by evidence, and is unlikely on two counts:

First, to calculate a mathematical function precisely, precise connectivity is needed, at the level of single connections. Except in the simplest neural systems, which involve only a few individual cells, this kind of precision is not seen in any neural system, including the human brain.
Second, if the underlying systems could calculate mathematical functions precisely, there should be no problem for the whole system (i.e. the person) in learning to use this ability when needed, yet humans cannot calculate mathematical functions precisely (except for simple arithmetic).

I.4 The Modularity Assumption

[I.4.1] The modularity assumption is an important ingredient of the approach in Vision, but it is presented explicitly quite late (p. 99-102 Modular Organization of the Human Visual Processor). The section starts with some irrelevant discussion, the main point of which is that when a human observer is deprived of any other interpretable visual input, he or she can use disparity to perceive depth. (In the book it is assumed that this is absolute depth, which is obviously wrong, because the perceived distance between the floating square and its background is smaller than the resolution of absolute depth perception. It must be relative depth.) It is deduced from this that there is no top-down component in the processing (P. 102). It is not clear whether the processing meant here is the processing in this specific case or processing in general. If the former, then the deduction is obviously right, but also useless. The correct interpretation is probably the latter, i.e. that in general the top-down component is weak. In this case the deduction is simply invalid, because there is no reason to believe that when many kinds of input are interpretable, the system will behave the same way as it does when only the disparity is interpretable.

From this the discussion jumps to modularity, without giving any explicit explanation of how. Presumably, it is assumed that the experiment above shows that humans have a separate module for stereopsis. However, this deduction is obviously invalid, because a non-modular system, when it is deprived of other sources of information, would also be expected to use disparity only. That is probably the reason that the jump from the experiment to modularity is not explained.

Then modularity is introduced (p. 102): "Computer scientists call the separate pieces of a process its modules, and the idea that a large computation can be split up and implemented as a collection of parts that are as nearly independent of one another as the overall task allows, is so important that I was moved to elevate it to a principle, the principle of modular design. This principle is important because if a process is not designed in this way, a small change in one place has consequences in many other places. As a result, the process as a whole is extremely difficult to debug or improve whether by a human or in the course of natural evolution, because a small change to improve one part has to be accompanied by many simultaneous, compensatory changes elsewhere."

[I.4.2] This displays a complete misunderstanding of natural selection. In nature, there is no such thing as a "small change to improve one part". All changes are random mutations, and selection acts on the performance of the system as a whole. Therefore natural selection does not have to compensate elsewhere: it simply favors those changes that are positive overall, and rejects those that are negative overall. Thus the argument above, while true of a human designer (and in general of any designer who understands the internal workings of the system), is simply irrelevant in the case of natural selection.

[I.4.3] To support the principle of modularity, it has to be argued that the peaks of performance in the field of possible systems, which natural selection would tend to converge to, are modular. In the general case, this is obviously false, because modularity is a restriction of the possible systems, without any direct benefits. Hence, we can expect modularity only when there is some specific reason for it.

[I.4.4] One possible reason for modularity is that two (or more) functions place contradictory requirements on the system. For example, it is probably not possible to build an organ that effectively pumps blood and oxygenates it at the same time, so these two functions are handled by two separate modules (heart and lungs). Another reason may be simply historical: a new module, handling a specific function, developed separately from the system, and merged with it at some point in the past.

[I.4.5] In Vision, however, all this is ignored, and Vision is far from being the only place. A large number of researchers make the same mistake, either explicitly or implicitly. This is because once modularity is assumed, it is relatively easy to come up with models for the specific modules of the system. However, if the modularity assumption is wrong, all these models are a waste of time, so it is essential to consider the plausibility of the modularity assumption.

[I.4.7] In the Conversation at the end of the book, the modularity assumption is raised again (p. 344-345). The argument there seems to be that "we can actually prove theorems that show the modules will always work in the real world." This is another piece of nonsense. The theorems show that the modules will perform some mathematical functions, but they don't show that the real vision system performs the same functions, and they don't even show that any vision system can be built from these modules.

I.6 The primitives of the visual system

[I.6.1] Table 1-1 (p. 37) lists the primitives of the visual system. Most of the list seems agreeable, but it is important to note that this listing implicitly says that these primitives are identifiable in some way in the visual system, and that other primitives do not exist, or at least are not as important. Without these implicit statements (i.e. if other important primitives are allowed, or if these primitives are treated as abstract entities), the rest of the book does not make much sense, because it ignores other primitives and assumes that these primitives are concrete entities. Thus to support the model in Vision, it is necessary to show not only that the evidence is compatible with these primitives, but also that they are concrete entities (in at least some sense), and that all other primitives are less important. This is worth noting, because some people claim that evidence which is compatible with some of the primitives is evidence for this model.

Part II : Vision

II.1 Early visual processing

[II.1.1] On P. 41 (Physical background of early vision) it says: "The purpose of early visual processing is to sort out which changes are due to what factors and hence to create representation in which the four factors are separated". No evidence or argument is brought to support separate representations, here or anywhere else in the book.

[II.1.2] Is there evidence for separate representations? The answer is plainly no. Some authors have claimed that there are two separate "streams", dorsal and ventral, but the distinction is more imaginary than real. The data from brain damage, experiments on monkeys and non-invasive techniques on humans suggest that some regions are more sensitive to colour than others, and some regions are more sensitive to movement than others. They do not suggest anything like a separate representation. Considering that many researchers have looked for separate representations, the fact that none were found hints that they do not exist.

[II.1.3] Hence, the evidence hints against the assumption of separate representations. So why make it? Because it fits with the basic assumptions discussed in Part I. The main point is that if all the factors are weak-represented (i.e. represented in a non-formal way) together, it is more difficult to postulate a 'mathematical' description for them. Thus it is assumed that they are separated, to fit the assumption about the 'mathematical' nature of vision. That this has no support in the evidence is not considered relevant to the discussion.

II.2 Zero crossing

[II.2.1] The discussion of edge detection is the most ridiculous part of Vision. At the end of the discussion (p. 64 The physiological detection of zero-crossing), it is acknowledged that in a neural system there is a very simple solution to the problem of edge detection. Yet more than 12 pages (54-64, 337-338 Zero-Crossing and the raw primal sketch) are spent on introducing and discussing the Del-square-G function (d2G).

[II.2.2] The d2G comes from nowhere. Using it requires a considerable amount of precise computation, which would require precise and complex connectivity among the neurons that execute it, far beyond what is seen anywhere in a biological system. In the models in Vision, it is used to find edges, which, as acknowledged in the book, can be done easily with a simple neuronal circuit.
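For reference, the function under discussion can be written out explicitly. The form below assumes a normalised two-dimensional Gaussian with standard deviation \sigma and writes r for the distance from the centre. Normalisation and sign conventions vary between presentations, so this should be read as one common form rather than as a quotation from the book.

```latex
G(x, y) = \frac{1}{2\pi\sigma^{2}} \exp\!\left(-\frac{r^{2}}{2\sigma^{2}}\right),
\qquad r^{2} = x^{2} + y^{2}

\nabla^{2} G(r) = -\frac{1}{\pi\sigma^{4}}
\left(1 - \frac{r^{2}}{2\sigma^{2}}\right)
\exp\!\left(-\frac{r^{2}}{2\sigma^{2}}\right)
```

Using it for edge detection means convolving the image with this function and then locating the points where the result changes sign.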

[II.2.3] There is an effort to support the d2G by showing a fit with experimental results (figure 2-17). However, to get the fit, all that is needed is a response function that has a negative peak followed by a positive peak, and is flat elsewhere (see the top left corner of figure 2-17). The number of functions that fit this description is infinite, and such a function can easily be produced by two receptors and an inhibitory cell. In addition, the fit is actually quite poor, considering that the author could select the best results of the best experiments, that the results shown are from two separate experiments, and that they are in one dimension only (the G function in the discussion is two-dimensional). He could easily have shown a better fit to any function he wanted, by selecting results from the right experiment. He probably didn't bother, because he was so convinced of the correctness of his assumptions.
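To make the point about the number of candidate functions concrete, here is a minimal sketch, not a model of any real circuit. It applies a sampled one-dimensional second derivative of a Gaussian and a crude centre-minus-neighbours operator to an ideal step edge. The kernel sizes, the value of sigma, the edge-replicating border handling and all function names are my own illustrative assumptions, and the crude operator is only a stand-in for a small lateral-inhibition circuit.

```python
import math


def d2g_kernel(sigma, radius):
    """Sampled 1D second derivative of a Gaussian, negated so the centre is positive."""
    kernel = []
    for x in range(-radius, radius + 1):
        g = math.exp(-x * x / (2.0 * sigma * sigma))
        kernel.append((1.0 / sigma ** 2 - x * x / sigma ** 4) * g)
    # Remove the small residual sum left by sampling and truncation,
    # so that uniform regions give a zero response.
    mean = sum(kernel) / len(kernel)
    return [k - mean for k in kernel]


def filter_signal(signal, kernel):
    """Apply a symmetric kernel, replicating the border values at both ends."""
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, k in enumerate(kernel):
            idx = min(max(i + j - radius, 0), n - 1)
            acc += k * signal[idx]
        out.append(acc)
    return out


def zero_crossings(response, eps=1e-9):
    """Indices i where the response changes sign between i and i + 1."""
    return [i for i in range(len(response) - 1)
            if (response[i] < -eps and response[i + 1] > eps)
            or (response[i] > eps and response[i + 1] < -eps)]


# An ideal step edge between samples 19 and 20.
step = [0.0] * 20 + [1.0] * 20

d2g_response = filter_signal(step, d2g_kernel(sigma=2.0, radius=8))
simple_response = filter_signal(step, [-0.5, 1.0, -0.5])
```

Both responses are flat away from the edge, dip negative just before it and rise positive just after it, and both place their single sign change at the same position. With the opposite sign convention for the kernels the order of the two peaks is reversed.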

[II.2.4] In the Conversation at the end of the book (P. 337-338), the author continues to insist that the retina calculates d2G. He goes as far as displaying a diagram of the retina, the mathematical representation of d2G, and an electronic implementation of it, and claims that these "are similar at the most general level of description of their function". Anybody who has even a little acquaintance with the computational properties of neurons, and knows what calculating d2G and convolution means, can clearly see that this is nonsense.

[II.2.5] The d2G is used again to explain directional selectivity (p. 167-175 Directional Selectivity), again essentially for edge detection, which can easily be done by a few neurons. It is mentioned again and again throughout the rest of the book.

[II.2.6] The absurdity of the d2G idea does not actually do any harm to the rest of the theory in Vision, because it is used only for edge detection, which humans clearly do by some mechanism. It does show how an extremely unlikely and complex mechanism is preferred, even when a much simpler and more likely mechanism is known, simply because the unlikely solution is mathematically nicer. Another problem with this section, mainly the Conversation part (P. 337-338), is that it gives the reader the impression that neurons are known to calculate precise mathematical functions, without bringing any real evidence for it.

II.3 Stereopsis

[II.3.1] The discussion of stereopsis is amusing. Several pages (116-122) are spent presenting an algorithm to explain the 'computational theory' (which is really algorithmic) of stereopsis. Then there is a comparison to other models, and the author boasts about how much better his model is, because it is based on computational theory:

"The reason for elaborating upon this point is simply to help my overall argument that intellectual precision of approach is of crucial importance in studying the computational abilities of the visual system. Unless the computational theory of a process is correctly formulated, the algorithm will almost certainly be wrong." (p. 124 Cooperative algorithms and the stereo matching problem ).

[II.3.2] However, on the next page he goes on to discuss further evidence, which shows that his model is as wrong as the rest of the models. According to the text in Vision, this model, like the rest of the models, cannot account for human performance even on the experiments which were used to generate it, because it ignores the importance of eye movements (P. 125). On P. 127, it is admitted that the basic idea of this model, and of the rest of the models, is wrong. Thus, the "intellectual precision" was not actually useful in preventing the postulation of a wrong model. It should be noted that most of the evidence that is quoted is older than the model, and hence was available when it was formulated.

[II.3.3] It is quite amazing that the author uses, as an example of the usefulness of his approach, a case where it failed like all the other approaches. The only advantage of his model was that it was compatible with several assumptions that he thought were important, and as far as the author is concerned, that is all that matters.

[II.3.4] In the rest of the chapter, another model is presented, which is more sensible, because it does not rely as heavily on precise and iterative computations as the first one.

II.4 Motion

[II.4.1] In the discussion of motion, there is again a presentation of an algorithm as a 'computational theory' (p. 165, Directional Selectivity, computational theory), and then the d2G function is invoked again, repeating the same mistakes as before, in [II.2] above.

[II.4.2] Later there is a discussion of the correspondence problem and a model for it (Ullman's), followed by 'A new look at the correspondence problem' (p. 202). In this, two problems are distinguished, the object-identity problem and the structure-from-motion problem. It says (P. 204): "My argument is that the theory should consider the two problems separately, because they have somewhat different computational requirements". Again, the 'computational' considerations take over, helped by the modularity assumption, and the question of what the real system (the brain) actually does plays no role at all.

[II.4.3] Then two theories are suggested: one for an object that is moving and changing, and one for an object that is only moving (P. 205). The rest of the discussion is about the structure-from-motion problem, i.e. the theory of an object that is moving and not changing. The other problem is completely ignored.

[II.4.4] This is obvious nonsense: the problem of objects that are moving and changing is much more demanding, and humans obviously have a system that deals with it quite effectively. This system must be able to cope with all kinds of changes, so the simple case of no change would be easy for it. Thus there is no need for a separate system to deal with the latter problem, and the assumption that there is such a system is extremely unlikely. The only justification for it is that it is relatively easy to discuss in precise mathematical terms.

II.5 Shape

[II.5.1] In discussing shape, a new idea is introduced: the generalized cone (P. 223 implications of the assumptions). The author is obviously fascinated by the properties of these entities, and is convinced that they play a part in vision. This conviction is based solely on the properties of generalized cones, and not on any evidence about human performance.

[II.5.2] The idea of using generalized cones to represent objects is another piece of obvious nonsense. Objects (the text mentions football, pyramid, leg, arm, snake, tree trunk, stalagmite, horse (8 cones)) can be represented this way only when they are stripped of all the unique features that make them what they are. Humans can easily recognize an arm as an arm (and not, say, a branch) because they do use the unique features of the arm. This, however, makes life complicated, so it is ignored, and the mathematically neat solution is adopted.

[II.5.3] The rest of this section is an interesting contrast to (most of) the rest of the book. The author seems not to have done much work in this area, so he discusses only the appearance of the real world and other people's theories. Thus, he is able to give a more objective opinion, and actually raises objections which are similar to some of the objections that I raise against his theories.

II.6 The 2.5D sketch

II.6.1 General

[II.6.1.1] Some people regard the concept of the 2.5D sketch as an important contribution of Vision. However, the vision system must have weak-representation features (i.e. some neural activity) of the visual input that are no longer 2D, yet are not yet 3D. The only way around this is the extremely implausible assumption that 3D information is somehow read off directly from the 2D input. Thus the 2.5D sketch is simply the last stage which is not 3D, and the novelty of the treatment in Vision is in the description of the 2.5D sketch, rather than in its existence.

[II.6.1.2] There are several significant points in the treatment here. The first is the assumption that it is a separate stage with well-defined characteristics. This is just the modularity assumption, and, as usual, it is not based on any evidence. The second point is the contents of the 2.5D sketch, which are assumed to be depth (distance from the observer), surface orientation and discontinuities. As discussed in [I.6] above, this implicitly excludes anything else, but without any evidence.

II.6.2 Depth in the 2.5D sketch

[II.6.2.1] The discussion of depth in this section deserves a close look. The main question is the relative importance of absolute and relative depth in the vision system.

[II.6.2.2] The discussion of the 'possible forms for the representation' [in the 2.5D sketch] starts by arguing for the importance of absolute depth (P. 279, bottom), basing the argument on the stereopsis theory, and goes on to bring some 'supporting evidence'. However, on p. 282 the author seems suddenly to realize that there is a problem with this, as it doesn't fit the evidence from human performance, which is far better on relative depth than on absolute depth. He therefore changes his mind, and says (P. 282):

"But in fact, stereopsis and structure from motion are both suited to delivering information about how things are changing locally rather than the absolute depth - stereopsis because the brain rarely seems to know the actual absolute angle of convergence of the two eyes, dealing instead only with variations in it, and structure from motion because the analysis is local and orthographic, thus yielding only local changes in depth."

[II.6.2.3] As far as stereopsis is concerned, this statement contradicts the argument made three pages earlier. No explanation is given for this contradiction.

[II.6.2.4] Even after this, the author seems to find it difficult to deprive absolute depth of its primary role. In the conclusion of this discussion (P. 283), the absolute depth (r) and orientation (s) are the primary variables, and relative depth is a qualified add-on.

[II.6.2.5] It is interesting to note that, as in the d2G case [II.2], the model does not actually require absolute depth, because the next stage, the 3D model, can be based on relative depth just as well. Thus, the insistence on looking at absolute depth is solely the result of the 'computational approach'. It stems from the fact that the mathematics of absolute depth is quite straightforward to see. On the other hand, computing relative distance from disparity, without first computing the absolute disparity and distance accurately, is a much more complex problem. In Vision, this is a good enough reason to ignore the evidence and concentrate on absolute distance.

II.6.3 Coordinate systems in the 2.5D sketch

[II.6.3.1] A third significant point about the 2.5D sketch in Vision is that it is retinocentric (p. 283-285, Possible Coordinate systems).

[II.6.3.2] Normally, 'retinocentric' means a coordinate system which is centered on the retina and also rotates when the retina (eye) rotates, and that is how it is used most of the time in Vision. However, the first paragraph of the discussion of coordinate systems (p. 283-284) says: "Relative depth and surface orientation are obtained along and relative to the line of sight, not any external frame. So at least initially, we are almost forced to expect a retinocentric frame within which to express the results of each process." This argument says nothing about rotation, so it cannot be used to distinguish between retinocentric, head-centric (rotates with the head), body-centric (rotates with the body), or viewer-centric (never rotates) coordinate systems. All that this argument supports is the weaker statement that the coordinate system is centered on the retina. Thus this argument supports a retinocentric coordinate system only because the meaning of the term is confused.

[II.6.3.3] The other argument for a retinocentric coordinate system is that anything else would require too much internal buffering (P. 284): "This provide another reason for expecting a retinocentric frame, because if one used a frame that had already allowed for eye movements, it would have to have the foveal resolution everywhere." This statement is nonsense on two counts:

Since the discussion is of the 2.5D sketch, which is quite removed from the raw input, it may buffer only the useful information that the earlier processing generated, rather than all the information of the raw input. That this point is ignored is really odd, as that is exactly what the presented approach would predict.
The number of neurons in the vision system in the cortex (> 10**9) is far larger than the number of neurons in the optic nerve (~10**6), which delivers all the information from the retina to the brain. Thus the vision system can easily buffer all the output of the retina many times over, and could easily cope with a non-retinocentric frame, even if it kept all the information.
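
To make the scale of the second point concrete, the ratio of the two figures quoted above can be worked out directly. This is only a back-of-the-envelope sketch: the variable names are hypothetical, the figures are the rough orders of magnitude given above, and treating each neuron as one unit of buffering capacity is a deliberate simplification.

```python
# Rough orders of magnitude quoted in the text (not precise measurements).
cortical_vision_neurons = 10**9   # lower bound for the cortical vision system
optic_nerve_neurons = 10**6       # approximate size of the optic nerve

# Simplifying assumption: one neuron can hold one optic-nerve "unit" of input.
retina_outputs_bufferable = cortical_vision_neurons // optic_nerve_neurons

print(retina_outputs_bufferable)  # 1000
```

Even under this crude assumption, the cortex could hold on the order of a thousand complete copies of the retinal output.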

[II.6.3.4] The text continues (P. 284): "Such luxurious memory capacity would be wasteful, unnecessary, and in violation of our own experience as perceivers, because if things were really like this, we would be able to build up a perceptual impression of the world that was everywhere as detailed as it is at the center of the gaze."

[II.6.3.4] The first claim of this sentence, i.e. that the memory capacity (which obviously exists) "would be wasteful, unnecessary", is not explained here. We cannot even explain it by the fact that the 3D model does not use it, because it could.

[II.6.3.5] The second part of the last quoted sentence is odd. First, it should be noted that foveal resolution is available only for areas that are in the center of the gaze. Thus, 'everywhere' here has to be replaced by something like 'everywhere that was in the center of the gaze in the last short time interval'. The "short time interval" corresponds to the typical time it takes the visual input to fade away, and is probably in the region of a second, though it is very variable and context-dependent. (The same point also applies to the use of the term 'everywhere' in the previous quote [II.6.3.3].)

[II.6.3.6] After this change, it seems to me that the last part of the quote is quite a reasonable description of what most people perceive. It is certainly more likely than what we would expect from the assumption that information is held only in a retinocentric coordinate system. That assumption would predict a small circle, with a radius of about 2 degrees, of high (foveal) resolution, and the rest an undifferentiated mass with only colour, texture and movement, but without shapes and objects.

[II.6.3.7] In the conversation, the point gets some more discussion (P. 352): "Just suppose, for example, that a 2.5D sketch had foveal resolution everywhere and was driven by a foveal retina in the usual way. Immediately, the memory has to contain out-of-date information (or nothing) in most of its capacity." What is wrong with 'out-of-date' information? There are at least two reasons why we do need this 'out-of-date' information:

The system may not have finished extracting all the information it can from the visual input. As long as the memory capacity allows, it does not make sense to throw away the 'out-of-date' information in the visual input, unless the system is sure that it cannot use it anymore. There is no reason to assume that this happens frequently, or even that the vision system has any mechanism to verify whether the visual input is still useful or not.
The 'out-of-date' information is useful in figuring out the movements of objects. The importance of this is obvious, but in Vision, motion is assumed to play a role only in earlier stages, and any role of motion analysis in later stages is completely ignored. This is derived from the assertion that the vision system is a shape-recognition system, and the rest can be 'hung off' this [I.3.4] (though movement of objects is probably used for shape recognition as well).

II.7 The 3D model

II.7.1 Object-based representation

[II.7.1.1] The most prominent suggestion in the 3D model is that humans construct a representation of objects in an object-based coordinate system (p. 298-300, Choices in the design of the shape representation). There is no evidence for this, and the argument for it is based on the assertion that this is necessary to explain recognition of complex objects.

[II.7.1.2] This, however, is simply false. For example, recognition of a horse is used as an example in the discussion (p. 322-325, Interaction between derivation and recognition). However, humans can efficiently recognize a horse by seeing only a small part of it (e.g. a hoof, the nose, the ears, the mane, the tail (note that we are talking of a real horse, not a picture of it)), by seeing its silhouette, from a sketch, from a chess piece, etc., all of which do not contain enough information for a full 3D model of a horse. Thus for recognizing a horse, humans do not need 3D information about the horse as a whole.

[II.7.1.3] This does not explain how humans actually do it, but it makes it clear that humans use mechanisms which are much more complex and efficient than anything that is described in Vision. It also makes it clear that these mechanisms, in most cases, do not need a full 3D model of the object that is being recognized. This contradicts the assumption that humans use the 3D model of the object (e.g. the horse) to search an internal database.

[II.7.1.4] For more difficult cases, humans presumably require full 3D information, but that does not show that they construct an object-based representation. Because the recognition mechanisms are far more complex than the description in Vision, the arguments that rely on the latter mechanisms are not actually relevant.

[II.7.1.5] Note that I am not arguing here that humans never construct an object-based representation, only that this is not a normal part of the object recognition process, and if it is used, it is done only in very special cases.

II.7.2 Simple Primitives

[II.7.2.1] The other significant suggestion is that the primitives in the 3D model are very simple. This is explicitly argued for, by claiming that the complexity of the primitives is limited by the type of information that can be reliably derived by prior processes (P. 301, Primitives). This is wrong on two counts:

The analysis of the vision system up to this point is obviously far from complete, so we cannot actually know what information the earlier processes can reliably derive.
More importantly, there is no reason to believe that the higher parts of the vision system discard unreliable information. Everything from our experience suggests that vision uses any information it can get from the visual input. Combining unreliable information correctly can lead to reliable results, and that is probably what the brain is doing. The natural behavior of networks of neurons makes this kind of system extremely easy to implement.

[II.7.2.2] That the second point is ignored is somewhat surprising. There is nothing non-mathematical about using unreliable information, and unreliability can be quantified and represented like any other entity (most easily by expressing it as the probability of being correct). Hence it is ignored here not because it doesn't fit the basic assumptions, but because ignoring it supports the line of argument.
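
The claim that unreliable information can be combined into a reliable result can be illustrated with a minimal sketch, which is not taken from Vision. It assumes that each cue casts a binary vote on some hypothesis (e.g. 'there is an edge here'), that each cue has a known probability of being correct strictly between 0 and 1, and that the cues err independently of each other. The function name combine_cues and its parameters are hypothetical. Under these assumptions the votes can be combined by adding log-odds (a naive Bayes combination):

```python
import math


def combine_cues(votes, prior=0.5):
    """Return the probability that a hypothesis is true, given unreliable cues.

    votes: list of (says_true, p_correct) pairs, where says_true is the cue's
           vote and p_correct is its probability of being correct (0 < p < 1).
    prior: probability of the hypothesis before looking at any cue.

    Assumes the cues make their errors independently (naive Bayes).
    """
    log_odds = math.log(prior / (1.0 - prior))
    for says_true, p_correct in votes:
        # Evidence carried by one cue, expressed as a log-likelihood ratio.
        evidence = math.log(p_correct / (1.0 - p_correct))
        log_odds += evidence if says_true else -evidence
    # Convert the accumulated log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


# Five cues, each right only 70% of the time, all agreeing.
agreeing = combine_cues([(True, 0.7)] * 5)
# Two equally reliable cues that disagree.
conflicting = combine_cues([(True, 0.7), (False, 0.7)])
print(round(agreeing, 3), round(conflicting, 3))
```

Under these assumptions, five agreeing cues that are each right only 70% of the time give a combined confidence of about 0.986, while two equally reliable cues that disagree cancel out to 0.5. Real visual cues are unlikely to be fully independent, so this shows only the principle, not a model of what the brain does.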

[II.7.2.3] The simplicity of primitives leads to a system of object recognition which works with object representations like the ones drawn in figures 5-10 and 5-11. These seem to be extremely counter-intuitive. For example, they don't resemble primitive drawings of the same objects, either by children or by adults.

II.7.3 Modularity(?) in the 3D model

[II.7.3.1] The third suggestion about the 3D model is that it is modular (P. 302, Organization). However, here modularity means something different from what it means in the rest of the text, because the modules are not different parts of the process or computation (see the quote in [I.4.1] above), but simply different chunks of data. This is similar to a computer storing different chunks of data in different parts of its storage system. Calling this modularity is overloading the word, and serves only to increase the confusion about modularity.

-----------------------------------------------------------------------------

Yehouda Harpaz
yh@maldoo.com
26Sep96
http://human-brain.org/