May 2, 2016
by Chris Curran
Advancements in gesture tracking, motion tracking, eye tracking, and other technologies are laying the groundwork for natural interaction methods that will be essential for the success of augmented reality.
Augmented reality (AR) experiences present new challenges for human-device interaction. Augmented (and virtual) reality devices are designed primarily to be used without any tangible interface at all. As a result, most established interaction technologies are of little use in the AR world. In a typical AR environment—working in an industrial scenario or interacting with a video game, for example—it simply isn’t possible to type, tap, click, and swipe.
Whenever a new technology emerges, the way people interact with that technology often must play a game of catch-up. The advent of computers in the 1940s and 1950s, with their punch cards and manual switches, eventually led to the first alphanumeric computer keyboards in the 1960s. When the graphical user interface (GUI) became popular in the 1980s, the mouse finally came into common use. As mobile devices have begun to dominate the technology scene, touch-screen interfaces have become the preferred means of interacting with them. (See Figure 1.) How will people communicate their intentions in augmented reality?
This article examines the trends in interaction methods in AR.
Figure 1: Each computing platform has evolved with a dominant interaction method. The methods that will dominate AR and VR applications are yet to be established.
Evolving to natural interaction methods
Human-computer interaction methods of the past forced human actions to fit what the computer could understand. The earliest computer interfaces were switches and lights. Keyboards were created so people could punch commands onto cards and paper tape; later, the keyboard was connected directly to the computer and eventually supplemented by the mouse. Touch interfaces took shape with the introduction of the Apple iPhone, and touch screens are now the standard interface for phones, tablets, and many computers. In each case, the tradeoffs were optimized for the computer.
AR experiences are different because the computer augments reality by providing computer-generated content and overlaying it on the real world. Because of this real-world interaction, the computer must fit to what a human can understand. Interaction methods based on speech, gestures, motion, and eye movement are more natural for humans, as these are the methods they routinely use to interact with the physical world and each other.
Figure 2 illustrates how interfaces have been evolving from those optimized for computers toward those optimized for humans, and hence more natural. Interfaces are also evolving along a second dimension. “The [AR] system will know what you are doing, and it will anticipate what you will do next. This capability will move the interaction from deterministic to probabilistic scenarios,” suggests Barry Po, senior director of product and business development at NGRAIN.
The shift is in how accurately the computer understands the intent of the user. When a user presses a key or clicks the mouse, the computer knows exactly what the user wants and does it. With speech or gestures, however, the system must infer the user’s intent from the action. Because such interpretation always carries some ambiguity, intent is understood in a probabilistic manner. As systems become smarter and understand the context in which the user is working, they will get better at understanding user intent.
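To make the shift from deterministic to probabilistic interaction concrete, here is a minimal sketch of how such a system might behave. The recognizer scores and the confidence threshold are hypothetical; the point is that the system acts on an inferred intent only when its confidence is high enough, and otherwise waits or asks for clarification.

```python
import math

def softmax(scores):
    """Turn raw gesture-match scores into a probability distribution."""
    exps = {g: math.exp(s) for g, s in scores.items()}
    total = sum(exps.values())
    return {g: e / total for g, e in exps.items()}

def interpret(scores, threshold=0.7):
    """Act on the most likely intent only when confidence clears a threshold."""
    probs = softmax(scores)
    best = max(probs, key=probs.get)
    if probs[best] >= threshold:
        return best   # confident enough to execute the command
    return None       # ambiguous: wait for more input or ask the user

# Hypothetical scores from a gesture recognizer
print(interpret({"swipe_left": 2.5, "swipe_right": 0.3, "tap": 0.1}))
# prints "swipe_left"
```

A deterministic interface has no analogue of the `None` branch; handling that ambiguity gracefully is what makes probabilistic interaction usable.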
Figure 2: Over time, interfaces and interaction methods are becoming more and more natural and are optimized for humans rather than for computers. At the same time, interaction methods move away from being deterministic to probabilistic, in that the intent of the user is interpreted from the action in a probabilistic manner.
The evolution and utility of interaction methods are tied to their usefulness in the real world. Workplaces can be dirty, noisy, bright, or dark and can present other challenges not usually encountered at the desk where a desktop or laptop computer is typically used. Interaction methods must work across such conditions.
Different interaction methods have utility in different use cases. In some cases, the involvement of the fingers and hands is not ideal. For example, if a user is wearing a head-up display and is disassembling a piece of equipment or driving a car, conventional typing is almost always impossible. Using a touch screen also can be unrealistic or can present a safety hazard.
Many new interface technologies are advancing to offer a set of building blocks that developers can use in various AR devices and AR applications. These technologies will be the basis for intuitive interfaces that, like the mouse, keyboard, and touch screen today, enable the user to select the best interface for the task.
The following sections highlight some of the most promising interface technologies.
Gesture and motion tracking
Gesture and motion tracking already have an established history in consumer electronics. Smartphones and tablets have long incorporated the fundamental building blocks of motion tracking—the gyroscope, compass, and GPS—to determine which way the device is oriented. Simple AR apps are widely available for these devices, letting users overlay restaurant locations, real estate prices, or the location of their parked car atop a live image of the real world.
Video game technology has incorporated gestures. The Nintendo Wii relies heavily on gestures, delivered via gyroscope-equipped controllers, to let users interact with many of its game titles. The Kinect system (part of the Microsoft Xbox ecosystem) allows users to control the action in some activities without a controller at all. Instead, a camera tracks and records the motions of the player. In some games, the player’s face or body is incorporated into the game environment directly, a sort of reverse AR.
Motion and gesture tracking often go together. Motion tracking only orients the device or user in space. On its own, it doesn’t provide an interface but rather points the direction for the user’s interactions. Gestures take over from there, and then some complex analysis must reconcile the position of both real and virtual objects in the AR space and manage the way they interact with one another.
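Before gestures even come into play, the motion-tracking layer must produce a stable orientation estimate, typically by fusing a fast-but-drifting gyroscope with a noisy-but-drift-free accelerometer. A minimal complementary-filter sketch (the sensor readings and blend factor below are illustrative, not from any particular device):

```python
def complementary_filter(pitch_deg, gyro_rate_dps, accel_pitch_deg, dt, alpha=0.98):
    """Fuse gyroscope integration (responsive but drifting) with the
    accelerometer's gravity-derived angle (noisy but drift-free)."""
    gyro_estimate = pitch_deg + gyro_rate_dps * dt   # integrate angular rate
    return alpha * gyro_estimate + (1 - alpha) * accel_pitch_deg

# One 10 ms update: device rotating at 10 deg/s, accelerometer reads 0.2 deg
pitch = complementary_filter(0.0, 10.0, 0.2, 0.01)
print(round(pitch, 3))  # prints 0.102
```

Running this once per sensor sample keeps the estimate responsive to quick motion while the small accelerometer term continuously corrects gyroscope drift.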
Leap Motion was an early innovator in gesture and motion tracking. Cofounder and CTO David Holz started the company in 2010 after he became frustrated with the 3-D modeling tools available then on conventional PCs. Recalls CEO Michael Buckwald, “David realized quickly that creating something really simple like a coffee cup took longer on a computer than it would take a five-year-old to create the same thing out of clay.” The goal was to make computer interactions more natural by using gestures that people already intuitively use and understand.
While human hands and fingers comprise a relatively small number of moving parts, tracking them all is a surprisingly difficult task, as it entails accurately tracking 10 fingers through a camera. Systems used for gaming and related applications track gross movements of hands, arms, and body and do not generally track fingers. Leap Motion has developed a solution that can accurately follow 10 fingers and their joints in real time. (See Figure 3.)
Figure 3: Leap Motion tracking in augmented reality—tracking the 10 fingers and joints for a gesture interface.
Source: Leap Motion
Cameras aren’t the only technology being used to track movement. Google’s Project Soli, currently in development, relies on high-frequency radar to track motion very precisely—and does not require camera lenses or moving parts. The MIT-developed WiTrack and the AllSee system developed at the University of Washington use low-power radio waves to track a person’s movement—even through obstacles and walls. A different approach altogether, taken by Xsens, uses wearable inertial sensors to measure motion directly.
While new methods have been evolving, gesture-based interaction methods that rely on a camera are being integrated into some AR devices. Many smartglasses have a front-facing camera and can track a user’s hands the way a computer would track the movements of a mouse. In some cases, as through Microsoft’s HoloLens and Meta’s Meta 2 headset, users can directly manipulate holographically projected objects in their field of view.
These are early days, and most gesture interfaces have issues with responsiveness, accuracy, and supporting a wide range of gestures. Continued advances in technology and algorithms to process and recognize gestures will mitigate these concerns.
Eye tracking
Another technology that offers great promise as an AR input system is eye tracking, which monitors the motion of the eye to determine where the user is looking. In a sense, the eyes become the mouse pointer and can even be used to select icons or items from a list. That capability is particularly useful in situations where a voice command isn’t realistic or a gesture can’t be made (such as when the user is carrying a load with both hands).
Eye tracking historically has been used not as an interface technology but as a diagnostic one. Researchers—primarily in marketing and usability testing—commonly use eye tracking to measure engagement with ads, web pages, or other marketing materials. By determining what the subject is looking at, the researcher can determine which elements on that ad or web page are the most engaging or where the user is becoming confused.
When eye tracking is used as an interface, the underlying technology is essentially the same as in those diagnostic applications. A high-resolution, high-speed camera is aimed at the eye, and even the tiniest movements are detected and recorded. Software then infers what the user is actually looking at.
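One common way to turn that gaze estimate into a “click” is dwell selection: the system fires a selection once the gaze has rested on a target long enough. A simplified sketch of the idea, with illustrative screen coordinates, sample rate, and dwell threshold:

```python
def hit_target(gaze, targets):
    """Return the name of the target whose rectangle contains the gaze point."""
    x, y = gaze
    for name, (left, top, right, bottom) in targets.items():
        if left <= x <= right and top <= y <= bottom:
            return name
    return None

def dwell_select(samples, targets, sample_dt=0.02, dwell_time=0.5):
    """'Click' a target once the gaze has rested on it for dwell_time seconds."""
    current, elapsed = None, 0.0
    for gaze in samples:
        hit = hit_target(gaze, targets)
        if hit == current and hit is not None:
            elapsed += sample_dt
            if elapsed >= dwell_time:
                return hit            # dwell threshold reached: select
        else:
            current, elapsed = hit, 0.0   # gaze moved: restart the timer
    return None
```

The dwell threshold is the key tuning parameter: too short and users trigger selections just by looking around (the “Midas touch” problem); too long and the interface feels sluggish.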
Sune Alstrup Johansen, CEO of The Eye Tribe, notes that this capability opens up some exciting new avenues for augmented reality UI development. “The UI will know what a user wants to do before the user even thinks about it,” he says. “If someone is reading an e-book, for instance, the book will know when they want to turn the page” by following their eye movements, sentence by sentence.
The Eye Tribe’s technology uses infrared cameras to track eyes, which Johansen says is a far more accurate way to track the movements of the pupil than through cameras that capture visible light. SensoMotoric Instruments (SMI) also uses infrared light in its eye tracking technology, which was most recently integrated into the Epson Moverio AR headset as a reference design.
Why infrared? Visible light cameras, Johansen says, can capture crude eye movements—whether a person is looking up, down, left, or right, for example—but those cameras don’t have the accuracy required to capture the delicate pupil movements involved in a high-grade interface.
Although eye tracking doesn’t yet have the adoption of gesture tracking, it is a natural fit for the VR and AR market. Adoption of the technology likely will occur in VR headsets first, as VR is a more controlled environment without any issues related to ambient light or brightness. “Every professional AR headset will include eye tracking,” Johansen predicts. For now, eye tracking remains a niche technology, but it could gain much wider adoption in future AR devices.
Other AR interface technologies
Gestures and eye tracking aren’t the only emerging interface methods. A few additional technologies that present interesting alternative interface methodologies are:
- Voice: Voice control is the most widely used of the modern AR interface technologies. Voice-based control has long been a dream for a wide range of high-tech equipment, from computers to cars. Apple’s Siri, Microsoft’s Cortana, and Amazon’s Alexa (on the Echo) are all speech-based interaction solutions that continue to improve. However, using the technology in high-noise environments and accommodating hundreds of accents remain challenging.
- Facial tracking: Similar to eyes, facial tracking can support interaction by recognizing a user’s facial expressions. Dynamixyz primarily has worked with motion capture and facial modeling for Hollywood and is now adapting its technology for the AR market. CEO Gaspard Breton says detecting simple expressions like a smile or frown is easy. It is more interesting when a device can detect a person’s general mood. “When a person is driving a car, the device could know that the driver is getting tired,” Breton says.
- Brain control: And of course, there’s the possibility that humans won’t need to use any of their external appendages as an interface. Emotiv has designed a headset that incorporates mental commands detected through 14 EEG channels. The current release of its standalone device works as a crude PC interface, but it shows promise for gaming control and digital art creation.
Augmented reality presents myriad challenges for how humans interact with it, requiring technology to stitch together various realities into a cohesive and seamless experience. No single technology is likely to achieve this cohesiveness on its own, and the future of AR will likely revolve around a collection of these technologies coming to prominence together.
Approaches that span gesture tracking, motion tracking, eye tracking, and speech are evolving in their capabilities and accuracy. What the relative mix of these approaches will be in future AR solutions is unclear. But the approaches clearly complement each other, as different interface technologies are bound to work better in different environments. For example, speech won’t work well in a loud factory, and many tracking systems—gesture or eye—currently have problems working in bright environments, such as outside on a sunny day.
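The complementary nature of these modalities suggests that future AR systems may pick an input method at runtime based on conditions. A toy rule-based sketch of that idea (the thresholds and modality names are hypothetical, chosen only to mirror the tradeoffs described above):

```python
def choose_modality(noise_db, lux, hands_busy):
    """Pick an input method whose known weakness does not apply to the
    current environment (hypothetical thresholds for illustration)."""
    if noise_db < 70:
        return "voice"       # quiet enough for reliable speech recognition
    if hands_busy:
        return "eye"         # user is carrying a load: gaze selection
    if lux < 10000:
        return "gesture"     # camera-based tracking degrades in direct sunlight
    return "controller"      # no natural modality suits: fall back to hardware
```

A real system would likely blend modalities probabilistically rather than switch hard between them, but even this crude fallback logic illustrates why no single interface technology can cover every environment.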
Today, these interaction technologies are isolated components, and innovators are still learning to build with them. In the future, they likely will come together in a common interaction fabric that developers can build on and that lets users intuitively interact with both physical and digital items in the physical spaces where they work.