Indoor Scene Knowledge Acquisition using Natural Language Descriptions
Published: 2013
Abstract:
Saranya Kesavan. Unpublished Master’s Thesis, May 2013, UMaine. (N.A. Giudice: thesis advisor)
The existing research addressing non-visual indoor navigation is limited to route guidance between locations (i.e., the corridor network). This focus ignores many critical regions contained within indoor spaces (e.g., rooms and lobbies), locations that are often as challenging to learn and navigate without vision as the routes connecting them. To address this challenge, this thesis investigates the use of natural language (NL) descriptions as a non-visual medium for providing access to indoor scenes, including room structure, furniture placement, and the location of salient landmarks. The work is part of a larger project to develop a system, called the Describer for Indoor Scenes (DISc), that uses automatically generated NL descriptions to represent indoor scenes based on photos taken in real time from mobile devices. To develop cognitively comprehensible NL descriptions of indoor scenes, it is critical to first understand how humans describe and interpret a scene in order to support spatial behavior. To this end, six behavioral experiments were conducted to characterize scene descriptions generated by human observers and to optimize these descriptions, based on cognitive constraints and the structure of the linguistic information included, to best support non-visual learning, representation, and navigation.
The visual information that can be captured about a scene from photographs is potentially limited, both in quality and quantity, compared to the information apprehended from real-time scene perception. Importantly for the DISc system, results from experiments 1, 2, and 3 converge to demonstrate that photographic observations are functionally equivalent to real-time observations of indoor scenes in supporting spatial behavior, and show that photographs can be used as an information source in DISc. The data collected in these experiments showed that humans adopted different scene description strategies. To understand how the description strategy (i.e., the order in which objects are described) affected scene learning and reconstruction, a fourth behavioral experiment was conducted. Results from this experiment suggest that following a cyclic path while describing an indoor scene (called a “Round-About strategy”) was the most efficient approach for acquiring and representing spatial knowledge.
The results from the first four experiments revealed that people used two different angular units (clock face and degree measurements) to describe directional information. However, it was not clear from the extant literature how angular units affect the listener’s spatial apprehension or which measure yields the most accurate performance. As directional information is critical for specifying the location of objects in a scene, this question was addressed in a fifth experiment, with results demonstrating that the most accurate performance occurred when angular directions were given as clock face units rather than degree measurements (e.g., 1:00 versus 30 degrees). Results also demonstrated that participants were equally accurate at producing angular values in 15 degree, or half-hour, increments (e.g., 1:30), which is meaningful as this is a 100% increase in precision over the standard clock face units employed in previous studies.
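As an illustration of the two angular units compared here, the following minimal Python sketch converts between degree measurements and clock face units at half-hour (15 degree) resolution. It is not taken from the thesis or from DISc: the function names are hypothetical, and the convention that 12:00 means straight ahead with angles increasing clockwise is an assumption.

```python
def degrees_to_clock(bearing_deg: float) -> str:
    """Round a bearing to the nearest half-hour clock face direction.

    Assumed convention (not from the thesis): 0 degrees is straight
    ahead (12:00) and angles increase clockwise, so one hour spans
    30 degrees and a half hour spans 15 degrees.
    """
    # Count 15-degree steps around the dial, wrapping 360 back to 0.
    half_hours = round((bearing_deg % 360) / 15) % 24
    hour, half = divmod(half_hours, 2)
    hour = 12 if hour == 0 else hour  # a clock face shows 12, not 0
    return f"{hour}:{'30' if half else '00'}"


def clock_to_degrees(clock: str) -> float:
    """Convert a clock face direction such as '1:30' to degrees."""
    hour_text, minute_text = clock.split(":")
    hours = int(hour_text) % 12 + int(minute_text) / 60
    return hours * 30.0


if __name__ == "__main__":
    for angle in (0, 30, 45, 90, 330):
        print(angle, "degrees is about", degrees_to_clock(angle))
```

Under these assumptions, 30 degrees maps to 1:00 and 45 degrees maps to 1:30, matching the examples in the abstract.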
The sixth and final behavioral experiment was conducted to investigate whether the optimized NL scene descriptions support non-visual navigation of indoor scenes and whether performance differs when using static or updated descriptions, meaning that they were either given from a fixed user perspective in the scene (as was done in the earlier experiments) or given from a perspective that changed based on the user’s position and orientation. Results showed a clear advantage for updated NL descriptions on navigation accuracy, indicating that to be maximally effective, DISc should implement descriptions based on the user’s real-time position and orientation as they move. Taken together, the results of the six human experiments extend earlier research on route navigation by showing that optimizing NL indoor scene descriptions based on perceptual and cognitive factors led to efficient spatial learning, representation, and navigation. These empirical results provide the much-needed proof of concept for the efficacy of future development of DISc as a fully automated NL scene description system.
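The difference between static and updated descriptions can be sketched as a recomputation of each object's direction relative to the user. The Python example below is a hypothetical illustration, not the DISc implementation: the function name, the 2D room coordinates (x to the east, y to the north), and the heading convention (degrees clockwise from north) are all assumptions.

```python
import math


def egocentric_bearing(user_xy, heading_deg, object_xy):
    """Direction of an object relative to the way the user is facing.

    Assumed convention (not from the thesis): positions are (x, y) room
    coordinates with y pointing north, and heading_deg is measured
    clockwise from north. The result is in degrees clockwise from
    straight ahead, in the range [0, 360).
    """
    dx = object_xy[0] - user_xy[0]
    dy = object_xy[1] - user_xy[1]
    # atan2(dx, dy) gives the compass-style angle clockwise from north.
    absolute_deg = math.degrees(math.atan2(dx, dy))
    return (absolute_deg - heading_deg) % 360


if __name__ == "__main__":
    table = (2.0, 0.0)

    # Static description: always computed from the fixed starting pose.
    print("static:", egocentric_bearing((0.0, 0.0), 0.0, table))

    # Updated description: recomputed after the user turns to face east.
    print("updated:", egocentric_bearing((0.0, 0.0), 90.0, table))
```

In this sketch the static description keeps placing the table to the user's right even after the user has turned to face it, whereas the updated description places it straight ahead.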