New Intel Center Driving the Future of Visual Computing
Abstract
Imagine taking a virtual 3D tour of Paris, creating a lifelike model of your home on the fly, or dancing in front of a screen that captures your moves and creates a virtual practice session to discuss with your instructor. These are a few potential applications of visual computing, a fast-growing field of technology that combines photorealism, HD video and audio, interactivity, and computational modeling to enable real-time, lifelike immersive experiences.
The demand for visual computing is accelerating in parallel with the growth of mobile devices such as smart phones, tablet computers, and netbooks. Consumers already use these devices to access multimedia content. In the future, they will want to use the devices to interact with the world in richer ways, to enjoy realistic simulated experiences that blur the line between the physical and virtual worlds.
Investing in the future
Intel is launching a new research center to advance the field of visual computing. The Intel Science and Technology Center for Visual Computing (ISTC-VC) will develop innovations in lifelike computer graphics, natural user interfaces, and realistic virtual humans that will make people's technology experiences more immersive in the future. The goal is to drive visual computing applications that look, act and feel real, and to make the technology broadly accessible to consumers.
The ISTC-VC is a collaboration between Intel and experts from eight top US universities in the field of visual computing: Stanford, UC Berkeley, Cornell, Princeton, the University of Washington, Harvard, UC Davis, and UC Irvine. Stanford will be the hub of the virtual center, driving collaborations that will explore fundamental research problems and potential applications in this emerging space.
The center will be co-led by Stanford professor Pat Hanrahan, who will serve as the center's Chief Scientist, and Intel Senior Principal Engineer Jim Hurley, who will serve as Technical Director. Initially, about two to three dozen academic and Intel researchers will participate in the center. The ISTC-VC also expands on the work of the Intel Visual Computing Institute at Saarland University in Germany, launched in 2009 as Intel's hub for visual computing research in Europe. The two centers will coordinate their work and collaborate on complementary projects.
While much of the ISTC-VC's research will focus on the development of systems and application software to enable visual computing, Intel will use the results to guide the development of future hardware platforms. Intel has been at the forefront of developing hardware to enable visual computing, most recently with the January launch of the second generation Intel® Core™ processor family microarchitecture, codenamed "Sandy Bridge." The new processor, the world's fastest and the first delivered on Intel's cutting-edge 32nm process technology, combines a GPU and CPU to deliver powerful graphics and computational capabilities to a broad range of mainstream devices, including smart phones and other mobile technologies. Sandy Bridge will also be used as a platform for the center's research.
Overview of the Research
The work of the ISTC-VC is divided into four overlapping research themes: scalable real-time simulation; perceiving people and places; content development; and graphics and systems. Intel is funding the research, but the findings and software prototypes developed by the center will be made widely available to the research community, to encourage the development of new visual computing applications and supporting technology.
Real-time simulation
Imagine: A woman stands in front of a virtual dressing room displayed on her 3D TV. As she gestures with her hands, the lifelike avatar on the TV screen mimics her movements, selecting dresses off a virtual rack of clothing and trying them on, one after another. As the woman raises her hand, so does her avatar. As she adjusts the collar of one dress, so does her virtual twin. As she twirls in front of the TV, the virtual dress flares and she observes how gracefully the fabric moves.
One focus of the center's research is on developing the capability to create such realistic simulations, and to enable scalable (up to Internet scale) simulations. This is the largest of the four themes, with more than ten collaborators involved.
The researchers will apply physics-based modeling to simulate natural processes, such as water flowing, cloth draping, and facial animations that look, act and feel real (Figure 1). To create greater realism in appearance and movement and enable real-time execution, they will strive to integrate motion, sound and light automatically into the simulations (e.g., the motion of simulated waves would automatically generate the sound of waves as well as realistic light reflections as a byproduct of the simulation). That's a difficult technical problem to calculate in real time, and will require the development of sophisticated algorithms.
Figure 1. 3D models of a woven wool scarf (created at Cornell University), facial muscles (Stanford University), and faucet dripping water (Cornell University). One goal of the researchers is to produce simulations with integrated light, sound and motion, driven by computer algorithms based on simulated physics.
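The kind of physics-based modeling described above can be illustrated at toy scale. Below is a minimal mass-spring sketch of a hanging strip of fabric (a stand-in for one row of threads in a scarf), stepped with semi-implicit Euler integration. All constants and names here are illustrative assumptions, not part of any ISTC-VC code; a production cloth simulator models far more (collisions, bending, per-fiber detail).

```python
# Minimal physics-based sketch: a chain of point masses connected by
# springs, one end pinned, pulled down by gravity. All constants are
# illustrative assumptions chosen for numerical stability.

def simulate_chain(n=10, steps=200, dt=0.01, k=400.0, rest=0.1,
                   damping=0.98, gravity=-9.8):
    # positions (x, y); node 0 is pinned, like a scarf held at one end
    pos = [(i * rest, 0.0) for i in range(n)]
    vel = [(0.0, 0.0)] * n
    for _ in range(steps):
        forces = [(0.0, gravity)] * n           # unit masses assumed
        for i in range(n - 1):                  # spring i <-> i+1
            dx = pos[i + 1][0] - pos[i][0]
            dy = pos[i + 1][1] - pos[i][1]
            length = (dx * dx + dy * dy) ** 0.5 or 1e-9
            f = k * (length - rest)             # Hooke's law
            fx, fy = f * dx / length, f * dy / length
            forces[i] = (forces[i][0] + fx, forces[i][1] + fy)
            forces[i + 1] = (forces[i + 1][0] - fx, forces[i + 1][1] - fy)
        for i in range(1, n):                   # node 0 stays pinned
            vx = (vel[i][0] + forces[i][0] * dt) * damping
            vy = (vel[i][1] + forces[i][1] * dt) * damping
            vel[i] = (vx, vy)
            pos[i] = (pos[i][0] + vx * dt, pos[i][1] + vy * dt)
    return pos
```

Even this toy version hints at the cost problem: each step touches every spring, and a full garment has millions of interacting fibers, which is why real-time execution demands the algorithmic advances the center is pursuing.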
Achieving realism in real time
The visual computing community has succeeded in developing highly realistic simulations, such as the simulated woven wool scarf shown in Figure 1. The scarf has exquisite detail, even at high resolution, because each fiber was modeled. The movement of the fabric is highly realistic; if you drop the scarf, it floats naturally to the ground. But it can take a week or more to run the simulation. The challenge for the ISTC-VC is to generate such simulations in real time. Unless simulations are fast enough to be interactive, applications such as the virtual dressing room will not be possible.
Integrating light, sound and motion automatically is one requirement for generating real-time simulations. Another is parallel computing, to accelerate execution by parceling out the workload to multiple processors. For a mobile device with limited processing power, that might mean executing simulations in the cloud and sending the rendered images back to the device display. For a desktop PC, it may make more sense to do the rendering locally. Partitioning the computational tasks among various devices and the cloud will be another challenge for future visual applications.
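The partitioning idea can be sketched with a toy example: split a frame's rows into independent tasks and hand them to a pool of workers, which could equally be local cores or remote cloud nodes. The "scene" below is just a brightness function of pixel coordinates, and the function names and thread pool are illustrative assumptions; the point is only that each chunk can be computed independently and reassembled in order.

```python
# Sketch: parceling a rendering workload out to a pool of workers.
# A thread pool stands in for whatever executes the chunks (cores,
# GPU, or cloud machines); the shading function is a placeholder.
from concurrent.futures import ThreadPoolExecutor

def render_row(y, width=16):
    # toy shading: brightness depends only on pixel coordinates
    return [(x * y) % 256 for x in range(width)]

def render_parallel(height=16, workers=4):
    # each row is an independent task; map() preserves row order
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(render_row, range(height)))
```

Because the rows share no state, the same split works whether the workers sit on one chip or across a network, which is exactly the device-versus-cloud partitioning question the researchers will face.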
Simplifying simulations to reduce the number of calculations required also can help to optimize the applications. It might be possible to reduce the detail in a scarf, or the number of times that light bounces off a virtual object in a scene, without compromising the user experience. The objective is to produce simulations that are realistic to the eye but also can be executed rapidly.
Scaling simulated worlds
The complexity of simulations increases with the number of people and objects involved. Consider a virtual character who dives into a swimming pool. If another character enters or leaves the pool, the motion of the water changes. Multiply that by dozens or hundreds of characters and actions, and the simulation becomes highly complex. Generating such a simulation in real time adds to the complexity.
That's a challenge the Center will face as the investigators explore how to develop Internet-scale simulations in real time. Such a capability would enable participants in online games, virtual worlds and social networking sites to generate their own 3D objects, avatars and animations, creating new possibilities for richer interaction, far beyond the scripted content offered in virtual environments today.
Today many virtual worlds are proprietary, and the underlying architectures restrict the number of participants and the ability to move content within regions and across virtual environments. The researchers will look beyond these current constraints and search for ways to create highly scalable simulations that could be moved across virtual environments by distributing workloads among many servers.
Perceiving people and places
In the future, devices will be equipped with suites of sensors, from cameras, microphones, and accelerometers to depth sensors and GPS systems, that will capture huge amounts of data about the surrounding environment. The center will focus on how devices can understand or "perceive" the content they are capturing.
The ability to perceive people and places, to understand the user's context based on images, location and other data, could enable devices to perform relevant activities. For instance, if the user is viewing a historical building, sensors might trigger an online search to find matching images of the building and display background information about the site on the device's screen.
To achieve this vision of the future will require tackling a fundamental issue in computer vision: how to enable a camera to interpret what it sees. To a camera, a photograph is a collection of pixels that blend into the background. Enabling a camera to identify shapes in a photo and attach semantic meaning to them-to evaluate groupings of pixels and "know" that they represent a person or object-requires sophisticated algorithms. Understanding context (e.g., the photo was taken at a golf course, and Joe is present) adds to the complexity, but this contextual information enables you to make richer queries, such as searching for other photos of Joe at a golf course.
To meet this challenge, researchers will strive to develop algorithms that infer what the sensors in the camera are "seeing." One goal is to segment the images captured by the camera (including people) into components that can be tagged automatically, analyzed, and reassembled to create new 2D and 3D content. Facial recognition software is a start, but the researchers want to go further, to deconstruct images of people into their component parts (arms, legs, hats, coats, and so on). Collaborators at U.C. Irvine have made progress in this area (Figure 2).
Figure 2. Researchers at U.C. Irvine are exploring ways to segment photos of people into components that can be tagged, tracked and manipulated to create new 3D content.
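One low-level building block of segmentation, grouping adjacent pixels into connected regions, can be sketched as follows. Real person segmentation depends on learned models that attach semantic labels; this illustrative flood-fill version only groups touching foreground pixels in a binary mask, and the function name and representation are assumptions for the sketch.

```python
# Sketch: connected-component labeling on a binary mask, a basic
# precursor to segmenting an image into taggable parts. mask is a
# list of rows of 0/1 values; returns per-pixel labels and a count.
def label_components(mask):
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for sy in range(h):
        for sx in range(w):
            if mask[sy][sx] and not labels[sy][sx]:
                count += 1                       # new region found
                stack = [(sy, sx)]
                labels[sy][sx] = count
                while stack:                     # flood fill the region
                    y, x = stack.pop()
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = count
                            stack.append((ny, nx))
    return labels, count
```

The hard part the researchers are tackling starts where this ends: deciding that a labeled region is "an arm" or "a hat" requires semantic understanding, not just pixel adjacency.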
Computer vision "in the wild"
To make computer vision useful to consumers, it must work "in the wild," outside of the controlled setting of a lab, and on a variety of devices. People must be able to take photos with their cameras wherever they go and have the devices recognize the images they've captured.
It's difficult to make computer vision work in the wild, because so many factors come into play. In a highly constrained environment, such as a studio or lab, lighting is under control, the subject can be carefully placed, and it's possible to experiment all day to achieve the desired image, or to capture images and infer shapes relatively easily. In the wild, by contrast, the environment is unconstrained, making computer vision a far more difficult challenge. The lighting changes as clouds move across the sky, and the background changes as people pass by or as a breeze causes objects to sway. Some objects might reflect strong light into the camera, while others might block the camera's view of the subject.
Each of these variables could require an order of magnitude more computation to handle robustly. In some circumstances, a perfect solution may not be possible at all.
Content creation
Today, the development of sophisticated computer graphics content requires expertise in professional content creation tools. Researchers will study ways to bring this capability to the mainstream amateur. With that in mind, they will focus on creating algorithms and interfaces that will make it easy, fast, and inexpensive for the average consumer to create and manipulate 3D content, such as models, videos, and animation, from the data captured by their devices.
At its most basic, this will involve creating 3D models from sequences of 2D images made by snapping numerous photos of a person or object from different angles. The center will pursue two projects to learn how to transform 2D content into 3D models. In one project, called "Little People, Big World," researchers will scan thousands of photos of a large urban environment that have been uploaded to the Internet, to construct a 3D model of the environment. Another project, dubbed "Big People, Little Items," will focus on generating smaller scale 3D models of individual items.
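At the geometric heart of turning overlapping photos into 3D models is triangulation: intersecting the viewing rays of the same feature seen from two known camera positions. The 2D sketch below is a deliberate simplification of that step (real structure-from-motion solves it in 3D, for thousands of matched features, with initially unknown camera poses); the function and its parameters are illustrative assumptions.

```python
# Sketch: triangulating a feature's position from two cameras at known
# 2D positions, each reporting a bearing angle to the feature.
from math import cos, sin

def triangulate(cam1, angle1, cam2, angle2):
    # ray i: cam_i + t * (cos(angle_i), sin(angle_i))
    d1 = (cos(angle1), sin(angle1))
    d2 = (cos(angle2), sin(angle2))
    denom = d1[0] * d2[1] - d1[1] * d2[0]    # 2D cross product
    if abs(denom) < 1e-12:
        return None                           # rays are parallel: no fix
    dx, dy = cam2[0] - cam1[0], cam2[1] - cam1[1]
    t = (dx * d2[1] - dy * d2[0]) / denom     # distance along ray 1
    return (cam1[0] + t * d1[0], cam1[1] + t * d1[1])
```

Repeated across many photos and many matched features, this same intersection principle is what lets thousands of tourist snapshots be fused into a single 3D model of a city.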
The center also will explore the tools needed to enable collaborative 3D content creation, leveraging the wealth of visual content that consumers have uploaded to photo sharing websites. One project in this area will employ a mobile phone game to encourage people to upload photos to help build a model of a place. Users will be awarded points based on the uniqueness of their photos-not for duplicating photos already snapped numerous times but for the shot no one else got. The idea is to make the work of creating models collaborative and fun.
Developing more natural, intuitive interfaces is essential to making content development more accessible. The computing community is moving beyond the pull-down menu, mouse and keyboard, developing new interfaces such as gesture recognition software. Among other things, the center will focus on enhancing gesture recognition to make it more natural and responsive, and bringing this enhanced capability to a wide range of devices. Such technology could be leveraged to make it possible for consumers with no CAD expertise to "sculpt" virtual models onscreen, and it could help to enable simulations such as the virtual dressing room, whereby computers recognize and respond to gestures.
Another area of focus is 3D animations, which are more difficult to create than 3D models. Among other things, the researchers will experiment with a puppeteering system whereby a person could act out an animation in front of a camera, using puppets. The computer would record the puppets' movements and use them to create a basic animation, which users could edit and refine. The ultimate goal is to develop animation tools that are so natural and intuitive that a child could use them.
Potential applications
Once computers are able to capture, tag, analyze, and manipulate sensor data, the potential for user-generated 3D applications will be enormous. Realtors could upload 3D models of homes for sale. A homeowner could generate a 3D model of her living room, enabling her to move furniture around virtually, change the fabric of a sofa, and envision a new decorating scheme. Small businesses could create compelling 3D models to help sell their products online, or to build more engaging websites. Medical schools could use realistic models of humans to train physicians. Urban planners could take photos of a neighborhood and have the camera attach semantic labels to objects (cars, houses, fire hydrants) and capture spatial relationships through sensors, to aid in understanding infrastructure needs (Figure 3).
Figure 3. Semantically labeled 3D model, suitable for urban planning. (Cornell University)
The ability to generate realistic animations easily could be useful in numerous applications involving movement, such as dance, yoga, martial arts, and sports instruction. For instance, a yoga teacher could perform poses in front of the camera, which would record his movements and map them to a virtual character to create a 3D animation. Students could pause the animation and spin the 3D model around to view the instructor's yoga positions from different angles, creating a powerful learning tool. The technology could be used to create more realistic "machinima" (3D computer animation generated in real time while gaming), or to create more lifelike videos and short movies to post online.
Virtual 3D tourism could be a compelling application. For instance, today the world's major museums only have the capacity to display a tiny fraction of their collections at any given time, due to physical constraints. In the virtual world, there are no such constraints, so 100% of a museum's content could be made accessible to the public.
Imagine taking a virtual 3D tour of the Louvre and having the entire collection available to you, tagged, tracked, and modeled, so you can choose which route to take through the museum and which artwork to see, generating your own compelling 3D experience. With all the artwork in the museum's massive collection deconstructed into components, a computer could mine the data to identify patterns that would be impossible to detect manually. Perhaps certain colors, people, or motifs appear in the artwork of different regions and eras. In this way you might gain new insights into particular historical periods or cultural trends that might otherwise go undetected.
If a computer could automatically tag the objects in photos as the images are captured, it would be possible to conduct intelligent image search. Suppose you had a large digital photo collection and wanted to find a photo of your friend Mary at a train station in Boston. GPS tagging could help to narrow the search, and if the computer had deconstructed each image of Mary into components (e.g., purse, hair and hat) and tagged them, it might be able to identify her in one image even if her face were not visible, by making an association with components from another photo.
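A toy version of this kind of tag-and-location search might look like the following. The record layout, tag names, and distance threshold are all illustrative assumptions; a real system would match learned visual features and index millions of photos, but the filtering logic is the same in spirit.

```python
# Sketch: searching semantically tagged photos, optionally narrowed
# by GPS proximity. Photo records and the radius are illustrative.
from math import hypot

def search(photos, required_tags, near=None, radius=0.5):
    hits = []
    for p in photos:
        # every required tag must appear on the photo
        if not set(required_tags) <= set(p["tags"]):
            continue
        # optional GPS filter: straight-line distance in degrees
        if near is not None:
            if hypot(p["gps"][0] - near[0], p["gps"][1] - near[1]) > radius:
                continue
        hits.append(p["id"])
    return hits
```

The interesting research problem is upstream of this function: producing the tags automatically and reliably, so that "Mary" and "train station" are attached to the right pixels without human labeling.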
In short, semantic tagging enables the computer to search more efficiently for related photos. The center will investigate the development of Internet-based video and photo search algorithms to enable this capability.
The ability to conduct intelligent image searches could be useful in augmenting memory as well. Suppose you were "life logging," recording everything you do throughout the day. The ability to intelligently search through all of the day's content, even months later, would make it easy to "remember" details of people, places and activities that might otherwise be forgotten.
Graphics and systems research
In addition to addressing the software needed to power visual computing experiences, the researchers will explore the hardware required to support those next-generation workloads, with the objective of influencing Intel hardware platforms (mobile, PC, cloud). The key is to enable future platform capabilities to coincide with the availability of new visual computing applications.
Future architectures will include programmable pipelines for mainstream graphics. One objective is to give graphics processors the flexibility to do more than render images. For instance, the graphics processors of the future might assist in performing the calculations required to generate realistic simulations. (Intel's second generation Core processor is a step in this direction.)
The center also will address the development of programmable processors for photography and computer vision. One project will focus on a programmable camera processor that would give users more flexibility in how a device captures images. For instance, a camera might be able to take multiple shots using a variety of settings, and combine the images to create more realistic lighting effects or greater depth of field. Or the device could enable the user to perform more sophisticated editing of photos, such as changing the focus to a different object or person. The key is to add more intelligence to cameras, giving the average user the ability to transform imperfect images into professional-quality photographs.
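One simple instance of combining multiple shots is exposure fusion: weighting each pixel toward whichever frame exposed it best. The sketch below fuses two grayscale exposures by favoring mid-range pixel values; a real camera pipeline would first align the frames and use far more sophisticated, multi-scale weighting, and everything here (function name, weighting formula) is an illustrative assumption.

```python
# Sketch: naive exposure fusion of two grayscale images (lists of rows
# of 0..255 ints). Each pixel is a weighted average, with weights
# favoring well-exposed (mid-gray) values over blown-out or dark ones.
def fuse_exposures(img_a, img_b):
    def weight(v):
        # 1.0 at mid-gray (127.5), near 0 at pure black/white;
        # the small epsilon avoids division by zero
        return 1.0 - abs(v - 127.5) / 127.5 + 1e-6

    fused = []
    for row_a, row_b in zip(img_a, img_b):
        row = []
        for a, b in zip(row_a, row_b):
            wa, wb = weight(a), weight(b)
            row.append(round((a * wa + b * wb) / (wa + wb)))
        fused.append(row)
    return fused
```

A programmable camera processor would let exactly this kind of per-pixel logic run on the device at capture time, rather than in desktop editing software afterward.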
System-on-chip (SOC) architectures are among the many other future architectures the researchers will address, along with architectures to support Internet-scale visual computing as well as large-scale distributed image analysis and 3D visualization. In addition to investigating future architectures, the center will focus on creating the tools and development environments needed to utilize those architectures effectively.