Multimodal Spatial Cognition

The traditional notion of spatial cognition, broadly defined, refers to how we think about and represent the structure, entities, and relations of space; central to this endeavor is how we learn and act in our environment. Our research interests bridge several domains, theoretical and applied, but the core of our program is linked by a fundamental interest in what we call multimodal spatial cognition (MSC). MSC deals with topics such as spatial learning and navigation from different sensory inputs, the effects of multimodal and cross-modal interactions on the mental representation of space, and comparisons of spatial computations, spatial problem solving, and spatial behavior across different information sources. The aim of MSC is to compare how different combinations of the primary spatial senses of vision, touch, and 3-D spatialized audio (where objects are heard as coming from a specific location in space), as well as non-perceptual inputs such as spatial language (descriptive terminology such as left-right, front-back, etc.), support complex spatial operations.

In most cases, spatial cognition research is synonymous with visual spatial cognition, as the vast majority of work in this area deals with vision as the means of accessing, remembering, and acting in the world. This visuocentric focus is not without merit: compared to its sister senses (e.g., touch and hearing), the visual system is exquisitely tuned for conveying spatial information by providing distal access to the environment via a large field of view, using parallel information processing, and making use of a large-bandwidth “pipe” to the brain. However, to repeat one of the core conceptual refrains of this site (and our research), vision does not hold all the spatial cards: all of our senses encode spatial information to one degree or another. Although your phenomenology may suggest otherwise, vision is not necessary for successful spatial behavior. Good evidence comes from the superior navigational abilities of blind animals and humans alike, as well as from the accurate performance of sighted people on many common tasks done without visual support (e.g., your ability to walk from your bedroom to the bathroom in the middle of the night). When one considers that much of what is perceived through vision is spatial, and that audition and touch convey many of the same spatial properties as vision (e.g., position, direction, configuration, relation, and the like), the ability to accurately perform spatial behaviors without vision is not surprising. What remains unknown is how far the envelope can be pushed for non-visual inputs to specify information and support tasks generally performed using vision. This high-level theoretical question motivates much of our research.

Example:

Imagine that you are asked to learn four different maps of equal size and complexity, but each must be learned from a different spatial rendering condition. Your task is to learn each map and find the best route from a given start position to a pub that is equidistant from all four starting points. For the first map, you must learn the space by hearing what is around you (spatial auditory information). For the second, you feel what is around you (haptic information). For the third, you see what is around you (visual information). For the fourth, you receive verbal descriptions of your surroundings (spatial language information). Assume that all four maps provide access to the requisite landmarks, distances, and turn angles to find the target and acquire metric and topological knowledge.

Do you think that you would be able to find the route and learn the map equally well from each of the four presentation modes? Would you use the same exploratory strategies for each? Would your mental representation of the map be based on the modality used at learning (e.g., audition, touch, vision, or language), or would it be independent of sensory-specific information (e.g., based on amodal spatial information)? If you believe there is a fundamental difference between inputs, what is the critical information that is, or is not, available between the conditions? Could inputs that poorly convey this critical information be “augmented” or “supplemented” such that their information content would be as useful as that of others for supporting spatial behaviors? Thinking beyond this map example, do you think parameters like scale (large- or small-scale spaces), environment type (indoor or outdoor spaces), or information content (quantity and complexity of the space) affect a space’s learnability, memorability, or navigability as a function of the input modality? Such questions are at the heart of much of our research in the lab. We use various techniques to address these issues, including psychophysical, cognitive, and usability paradigms incorporating physical, virtual, augmented, and mixed reality environments.

In a recent project similar in spirit to the hypothetical map example discussed above, we found that almost all participants subjectively reported that learning a map using vision was significantly easier and more natural than hearing verbal descriptions or feeling a tactile rendering of the space. One might reasonably conclude from this intuition that their learning would not be the same for maps rendered from these three inputs and that visual apprehension should yield the best test performance. Interestingly, even when people self-reported that they did badly using non-visual modes of environmental access, their data didn’t “agree”. Indeed, our findings from this research demonstrated that their test performance on wayfinding tasks was almost identical across all three conditions, suggesting that they built up and accessed functionally equivalent cognitive maps. These results are consistent with various other multimodal projects showing similar findings: people may be unfamiliar with spatial tasks performed without vision, and often doubt their abilities and accuracy, but in actuality, they are quite adept at using these information sources for carrying out all manner of tasks in a way equivalent to vision (see the Functional Equivalence section for a discussion).