August 19, 2016
Innovations and improving capabilities across 3-D perception, 3-D modeling, 3-D mapping, and 3-D display will bring augmented reality solutions to the 3-D physical world that people work and live in.
For the most part, the use of digital technology is locked behind a 2-D screen and is thus two-dimensional. Yet the physical world is three-dimensional. The promise of augmented reality (AR) is to bring the digital world seamlessly into the physical world. For instance, Volvo is experimenting with AR smartglasses that customers in the showroom can don, so they can view a 3-D representation of a car outfitted with their choices of features, options, and colors.
To know where and when to place such content, an AR system must be able to see in three dimensions, build models of working locations, recognize objects that might be in the way, tell the difference between a permanent wall and a temporary group of people, and see where users are—from the perspective of the users.
To achieve these requirements, AR technology must advance in many 3-D capabilities. This article describes the innovations in 3-D perception, modeling, mapping, and displays.
The 3-D ecosystem of augmented reality
In the physical world, AR solutions must contend with 3-D information at three levels. First is the environment in which the user operates, which could be outdoors or in a warehouse, office building, factory floor, or other interior space. Second is the particular object or equipment of interest to the user, which could be a part, component, or machine the user is working on. Third is the digital content projected into the user’s field of view, which could be 3-D objects or 2-D instructions, charts, video, and so on. Across these three levels, 3-D capabilities will span four distinct but somewhat overlapping areas, as illustrated in Figure 1:
- 3-D perception: The ability to sense depth and capture 3-D information about objects and spaces
- 3-D modeling: The ability to generate a model from the 3-D information ahead of time, or in real time
- 3-D mapping: The ability to locate the position and the orientation of the user and any objects in the 3-D environment
- 3-D display: The ability to project 3-D information to the user within the field of view in the smartglasses, smartphones, or tablets
Figure 1: To engage effectively with a 3-D physical world, AR technology needs to advance on many 3-D capabilities. AR solutions must contend with 3-D information at three levels: environment, equipment, and AR device. Across these levels, 3-D capabilities will include perception, modeling, mapping, and display.
3-D perception: Capturing and recognizing the physical world
3-D perception is accomplished through devices that sense depth and collect 3-D information in raw form, often called a point cloud. A point cloud is a collection of points that represents the 3-D space or object. As Figure 2 illustrates, there are three main approaches to capturing such information:
- Stereo vision: Emulates human sight. Each eye perceives a slightly different image, and the combination is what gives humans depth perception. The stereo vision approach uses two offset cameras and triangulation to calculate the distance to the object. This method is relatively inexpensive and is suitable when absolute accuracy is not essential.
- Time of flight: Pulses infrared light or lasers (invisible to the eye) at the object, measures the time it takes for the light to return, and computes the distance. The system acquires a 3-D equivalent of an image bitmap, where the collection of points approximates the object. This approach is accurate but requires more expensive equipment.
- Structured light: Projects an infrared pattern (invisible to the eye), photographs the pattern through a separate camera, and then calculates distances and angles from the distortions of the pattern. This method provides the appropriate balance of cost and accuracy and can also be packaged in small form factors.
Figure 2: The methods of depth sensing and scanning.
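The geometry behind two of these methods reduces to one-line formulas. The sketch below uses the standard textbook relations, not any vendor's implementation, and the camera and timing values are illustrative:

```python
# Illustrative depth calculations for two of the sensing methods above.
# The formulas are standard geometry, not tied to any specific device.

C = 299_792_458.0  # speed of light, m/s

def stereo_depth(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Stereo vision: depth from the pixel disparity between two offset cameras."""
    return focal_px * baseline_m / disparity_px

def tof_distance(round_trip_s: float) -> float:
    """Time of flight: the pulse travels to the object and back, so halve it."""
    return C * round_trip_s / 2.0

# A camera pair with a 700-pixel focal length and a 10 cm baseline that sees
# a 35-pixel disparity puts the object 2 m away.
print(stereo_depth(700, 0.10, 35))   # 2.0
# A 20 ns round trip corresponds to roughly 3 m.
print(tof_distance(20e-9))           # ~2.998
```

Structured light is omitted here because its distance calculation depends on the geometry of the projected pattern rather than on a single closed-form relation.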
Although all three methods have been used over the years, structured light is the approach used in many of the solutions emerging today. Detecting depth is useful, but detecting motion is even more powerful as it can help track faces, hands, and gestures and thereby enable interactions. Many solutions today combine depth sensing and motion tracking.
One of the first consumer products to use structured light was the Microsoft Kinect sensor for Xbox gaming applications in 2010. It was based on technology licensed from PrimeSense, which Apple acquired in 2013. More recently, similar technology is being packaged to generate models and maps of spaces. For example, Occipital has developed Structure Sensor as an accessory for tablets to scan objects and people as well as build 3-D maps of interior spaces. Matterport, a 3-D mapping vendor, uses structured light in its solution to capture rich 3-D data of homes and buildings to enhance the real estate sales process.
Intel RealSense technology combines an HD camera, an infrared camera, and an infrared laser projector to sense depth and track motion. The Google Project Tango solution combines 3-D motion tracking with depth sensing and works in a tablet or a smartphone. Lenovo is planning to introduce a smartphone in mid-2016 that has built-in support for Google Project Tango.
Integrating 3-D perception into smartglasses places high demands on processing and bandwidth. This adds to the size and weight of the smartglasses, which must become lighter to spur adoption. Some innovations still in the lab will help improve this tradeoff.
The Korea Advanced Institute of Science and Technology (KAIST) has integrated stereo vision cameras into its K-Glass smartglasses to provide 3-D sensing and perception. A key innovation was the creation of a power-efficient vision algorithm, so processing wouldn’t quickly drain the onboard battery.
Motion contrast 3-D laser scanner technology from Northwestern University mitigates processing and bandwidth constraints by rescanning only the areas where visual changes are detected. This approach is not unlike video compression techniques, where a video is compressed by storing only the visual changes, thereby requiring less storage and bandwidth. Such advancements will make possible the integration of 3-D perception into smartglasses.
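The rescan-on-change principle can be illustrated with a toy depth grid: compare the latest frame against the previous one and flag only the cells whose reading moved beyond a noise threshold. This is a simplified sketch of the idea, not Northwestern's actual scanner logic, and the frames and threshold are made up:

```python
# Toy sketch of change-driven rescanning: only cells whose depth reading
# moved more than a noise threshold are marked for a fresh high-detail scan.

def cells_to_rescan(prev, curr, threshold=0.05):
    """Return (row, col) indices where depth changed beyond the threshold."""
    changed = []
    for r, (prev_row, curr_row) in enumerate(zip(prev, curr)):
        for c, (p, q) in enumerate(zip(prev_row, curr_row)):
            if abs(p - q) > threshold:
                changed.append((r, c))
    return changed

prev_frame = [[1.00, 1.00, 2.50],
              [1.00, 1.01, 2.50]]
curr_frame = [[1.00, 1.20, 2.50],   # a hand moved into this cell
              [1.00, 1.01, 2.49]]   # sub-threshold jitter is ignored

print(cells_to_rescan(prev_frame, curr_frame))  # [(0, 1)]
```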
3-D modeling: From point clouds to usable models
A cloud of points is of little use until it is assembled into a coherent picture. That is what modeling does. Modeling provides flexibility in where and how AR can be deployed. Without a model, a system depends on users being in predetermined places.
Modeling transforms point cloud data into a 3-D model that can be used in computer-aided design (CAD) programs or other 3-D applications, such as training, gaming, and previews in auto showrooms. (See Figure 3.) In some sense, modeling is the most mature capability among the four discussed in this article, building on years of research and development in robotics, computer graphics, machine vision, and other areas. There are open-source libraries that have the algorithms to generate a 3-D model from a point cloud. Open Source Computer Vision (OpenCV) and the Point Cloud Library (PCL) are free for commercial use. OpenCV originated with Intel in 2006 to supply open-source code for computer vision. PCL was created for 3-D perception applications in robotics.
The model can be built ahead of time or in real time as the user navigates a space. “Having the map ahead of time is significant, because then someone can preplan the important workflow,” explains Amir Rubin, CEO of Paracosm, a cloud service for generating 3-D models. Paracosm works with third-party perception sensors and scanners, and the service can be used for AR and for more general machine perception.
Matterport is another vendor that builds 3-D models ahead of time. Matterport uses its own hardware and focuses on real estate and virtual reality applications. Google’s Project Tango, on the other hand, creates a model of the environment in real time. Over the long term, devices will likely take advantage of both—models created ahead of time, and models generated in real time to be precise and to account for changes in the environment.
A common approach to 3-D modeling is to represent the surface of an object or environment with polygons, or a mesh. NGRAIN, a 3-D modeling vendor, instead uses the concept of voxels. “You can think of voxels as tiny grains of sand. The same way that you can build a sand castle out of grains of sand, you can build 3-D content in voxels,” explains Barry Po, senior director of product and business development at NGRAIN. One benefit of using voxels is the ability to represent the interior of objects rather than just the surfaces. “We can model any equipment or material that has an interior density of some kind,” says Po.
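The voxel idea can be sketched by snapping each point of a cloud into a grid of fixed-size cells; every occupied cell is one "grain of sand." This is a generic illustration with made-up points and grid size, not NGRAIN's representation:

```python
import math

# Minimal voxelization sketch: map each 3-D point to the integer index of
# the fixed-size cell (voxel) that contains it. A real modeler would also
# fill interior voxels; this only marks the cells the points fall into.

def voxelize(points, voxel_size):
    """Map a point cloud (iterable of (x, y, z)) to the set of occupied voxels."""
    return {tuple(math.floor(c / voxel_size) for c in p) for p in points}

cloud = [(0.12, 0.03, 0.95),
         (0.14, 0.04, 0.97),   # lands in the same 10 cm cell as the first point
         (0.85, 0.42, 0.15)]

print(sorted(voxelize(cloud, voxel_size=0.10)))  # [(1, 0, 9), (8, 4, 1)]
```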
3-D mapping for navigation and orientation
The key challenges for mapping come from what an AR system must accomplish:
- The system must know where the user is: location.
- The system must know which way the user is looking: orientation.
- The system must choose the correct content given the user’s location and orientation.
- The content must appear in the right place.
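The last two requirements above can be sketched under simplifying assumptions: a 2-D floor plan, a user pose given as position plus heading, and a fixed horizontal field of view. Content anchored to an object is selected only when the object falls inside the user's view cone; the poses and parameters below are illustrative, not from any AR platform:

```python
import math

# Toy sketch of content selection: given the user's 2-D position and heading,
# show only content whose anchor point falls inside the view cone.

def in_view(user_xy, heading_deg, anchor_xy, fov_deg=90.0, max_range=10.0):
    """True if the anchor is within range and within the horizontal FOV."""
    dx = anchor_xy[0] - user_xy[0]
    dy = anchor_xy[1] - user_xy[1]
    if math.hypot(dx, dy) > max_range:
        return False
    bearing = math.degrees(math.atan2(dy, dx))
    # Smallest signed angle between heading and bearing, in (-180, 180].
    off_axis = (bearing - heading_deg + 180.0) % 360.0 - 180.0
    return abs(off_axis) <= fov_deg / 2.0

user, heading = (0.0, 0.0), 0.0             # facing along +x
print(in_view(user, heading, (5.0, 1.0)))   # True: almost straight ahead
print(in_view(user, heading, (0.0, 5.0)))   # False: 90 degrees off axis
print(in_view(user, heading, (20.0, 0.0)))  # False: beyond range
```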
These challenges are not unique to AR and are shared with robotics, indoor location-based services, and other technologies. All of them depend on precise methods of identifying locations indoors. GPS works outdoors but is not reliable indoors, and GPS accuracy is probably not adequate for superimposing digital information in precise proximity to specific objects.
Many technologies—such as beacons, Wi-Fi triangulation, Bluetooth, RFID, and others—are competing to fill the void and provide the infrastructure that will produce precise location information indoors. However, standards are yet to emerge, and, in many cases, building owners must invest in an infrastructure that provides location information. GPS coverage is global, but indoor infrastructure will likely draw on multiple technologies and solutions.
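Triangulating an indoor position from beacon ranges can be sketched as follows. This is the standard two-circle-subtraction method for 2-D trilateration with made-up beacon positions, not any vendor's protocol:

```python
import math

# Toy 2-D trilateration sketch: locate a user from distances to three beacons
# at known positions, as a Wi-Fi- or beacon-based indoor system might.

def trilaterate(b1, d1, b2, d2, b3, d3):
    """Solve the 2x2 linear system obtained by subtracting circle equations."""
    (x1, y1), (x2, y2), (x3, y3) = b1, b2, b3
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    c1 = d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2
    c2 = d1**2 - d3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a11 * a22 - a12 * a21
    return ((c1 * a22 - c2 * a12) / det, (a11 * c2 - a21 * c1) / det)

# Beacons at three corners of a 4 m x 3 m room; true position is (1, 1).
x, y = trilaterate((0, 0), math.hypot(1, 1),
                   (4, 0), math.hypot(3, 1),
                   (0, 3), math.hypot(1, 2))
print(round(x, 6), round(y, 6))  # 1.0 1.0
```

Real indoor systems must also cope with noisy range estimates, so they typically fuse many measurements rather than solving three exact circle equations.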
Techniques such as simultaneous localization and mapping (SLAM), which is commonly used by robots for navigation, allow devices to generate a real-time map of the environment and locate themselves in that map.
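The core loop of SLAM (integrate motion, add newly seen landmarks to the map, correct the pose when a known landmark is re-observed) can be caricatured in one dimension. Real SLAM systems track uncertainty and work in 3-D; everything below, including the "door" landmark, is a deliberately minimal toy:

```python
# Toy 1-D SLAM sketch: a device integrates noisy odometry, and when it
# re-observes a landmark already in its map, it corrects its pose estimate.
# A drastic simplification of real SLAM (no covariances, one dimension).

def slam_1d(odometry, observations):
    """odometry: reported step lengths (possibly drifting).
    observations: per step, a list of (landmark_id, measured_offset) pairs,
    where measured_offset is the landmark's position relative to the device.
    Returns the final pose estimate and the landmark map."""
    pose, landmarks = 0.0, {}
    for step, obs in zip(odometry, observations):
        pose += step                      # dead reckoning (accumulates drift)
        for lm_id, offset in obs:
            if lm_id in landmarks:
                # Known landmark: snap the pose back using the map.
                pose = landmarks[lm_id] - offset
            else:
                # New landmark: add it to the map at the estimated position.
                landmarks[lm_id] = pose + offset
    return pose, landmarks

pose, landmarks = slam_1d(
    [1.0, 1.1, 1.0],                          # second step over-reports by 0.1
    [[("door", 2.0)], [], [("door", 0.0)]])   # the door is re-observed at step 3
print(pose, landmarks)  # 3.0 {'door': 3.0} -- the re-observation removes the drift
```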
Over time, the 3-D mapping capability will come to depend on infrastructure, much as GPS does outdoors, and the operating systems of smartglasses will likely surface location and orientation information and make it available to applications.
3-D display: Presenting content in the field of view
Once location and orientation are known, 3-D content can be displayed in the field of view based on the model of the environment. The 3-D content must be correctly matched to its surroundings, or the differences could disrupt the user’s attention. The system must match lighting and achieve enough accuracy in the model to place content adjacent to physical items and have everything look correct. For smartglasses, display performance is largely a function of their optical capabilities. Trends in optical capabilities are covered in the article “Breakthroughs in optics that are reshaping augmented reality.”
3-D content is data rich and can strain both the bandwidth to the device and the onboard processing. Techniques to reduce file size, through compression or innovative standards, will play a role in the future. “You need to reduce the files and encode them into something we can work with,” says Andy Gstoll, chief marketing officer of Wikitude, an AR authoring solution provider. The company has a 3-D encoder to optimize file size, often by reducing unnecessary detail or other visual information, and return a proprietary format that it can store and then ultimately display.
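One generic size-reduction technique is coordinate quantization: storing vertex coordinates as small integers within the model's bounding interval instead of full-precision floats. The sketch below illustrates the idea only; it is not Wikitude's proprietary encoder, and the vertex values are made up:

```python
# Generic coordinate-quantization sketch: represent coordinates in [lo, hi]
# as 16-bit integers, trading a bounded loss of precision for smaller files.

def quantize(values, lo, hi, bits=16):
    """Map floats in [lo, hi] to integers in [0, 2**bits - 1]."""
    scale = (2**bits - 1) / (hi - lo)
    return [round((v - lo) * scale) for v in values]

def dequantize(codes, lo, hi, bits=16):
    """Invert quantize(); each value is off by at most half a quantization step."""
    scale = (hi - lo) / (2**bits - 1)
    return [lo + c * scale for c in codes]

verts = [0.0, 1.23456, 2.5, 9.99]
codes = quantize(verts, 0.0, 10.0)
restored = dequantize(codes, 0.0, 10.0)
step = 10.0 / (2**16 - 1)

# Every restored coordinate is within half a step (~0.08 mm here) of the original.
print(all(abs(a - b) <= step / 2 for a, b in zip(verts, restored)))  # True
```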
Innovations in 3-D technologies are integral to the evolution of AR. “The world we live and work in is 3-D, and a lot of augmentation will make sense only when it is anchored to 3-D spaces,” observes Gstoll.
The promise of 3-D capabilities in AR will open up many opportunities. Engineers could bring up 3-D models of parts and rotate them in space, getting a view that is more encompassing than what is possible on a computer monitor. Retail customers could see themselves wearing clothing or accessories, turning to view all angles in a mirror, without physically donning the items. Field service personnel could see how an equipment casing opens and then have the virtual version guide them through a series of repair actions.
Advances in 3-D perception likely will be integrated into AR devices over time. The 3-D modeling capabilities will combine models generated ahead of time with real-time modeling and mapping. The infrastructure for indoor positioning, both in coverage and precision, is yet to develop. But AR devices will provide operating-system-level support to surface the location and orientation information and make that available to applications and developers. Already, 3-D displays are very good and are expected to get even better. When all of these capabilities come together in a seamless manner, the technology will indeed merge the 2-D digital world with the 3-D physical world of work and play.