Semantic image understanding : from the web, in large scale, with real-world challenging data

Li, Jia; Stanford University, Computer Science Department

Abstract/Contents

Abstract: Humans can effortlessly perceive a rich amount of semantic information from the visual world, including the objects within it, the scene environment, and the events or activities taking place. Such information is critical for us to enjoy our lives. In computer vision, an important open problem is to endow computers and intelligent agents with the ability to extract semantically meaningful information as humans do. The primary goal of my research is to design and demonstrate visual recognition algorithms that bridge the gap between visual intelligence and human perception. Toward this goal, we have developed rigorous statistical models to represent large-scale, challenging real-world data, especially data from the Internet. Because visual features are the starting point of computer vision algorithms, we also propose a novel high-level image representation that encodes the abundant semantic and structural information within an image.
We first focus on introducing principled generative models for our rich visual world, ranging from recognizing objects in an image, to a detailed understanding of scene and activity images, to inferring the relationships among large-scale user images and related textual data. We propose a robust noise-rejection system for object recognition built on a non-parametric topic model, the hierarchical Dirichlet process (HDP), which learns the object model and re-ranks noisy web images containing the object in an iterative, online fashion. It learns the object model fully automatically, freeing researchers from the heavy manual labor of labeling training examples for object recognition. This framework has been tested on a large-scale corpus of over 400,000 images and won the Software Robot First Prize in the 2007 Semantic Visual Recognition Competition.
Understanding our visual world goes beyond simply recognizing objects. We then present a generative model for understanding complex scenes in which objects, humans, and scene backgrounds interact. For detailed understanding of an image, we propose the first model for event recognition in a static image, combining the objects that appear in the event with the scene environment in which the event takes place. We are interested not only in predicting the category of an unknown image, but also in how pixels form coherent objects and in the semantic concepts related to them. We propose the first principled graphical model that tackles three very challenging vision tasks in one framework: image classification, object annotation, and object segmentation. Our statistical model encodes the relationships among pixel visual properties, object identities, textual concepts, and the image class. It departs from previous work by operating at a much larger scale, using challenging real-world user photos, such as noisy Flickr images and user tags, to learn the model automatically.
Interpreting single images is an important cornerstone for inferring relationships among large collections of images in order to organize them effectively. We propose a joint visual-textual model based on the nested Chinese Restaurant Process (nCRP). Our model combines textual semantics (user tags) with visual image content, learning a semantically and visually meaningful image hierarchy from thousands of Flickr user images with noisy user tags. Used as a knowledge base, the hierarchy performs significantly better on image classification and annotation than state-of-the-art algorithms.
Visual recognition algorithms start from a representation of the image, the so-called image feature. While the goal of visual recognition is to recognize semantically meaningful object and scene content, all previous work has relied on low-level feature representations such as filter banks, textures, and colors, creating the well-known semantic gap. We propose a fundamentally new image feature, Object Bank, which uses hundreds or thousands of object-sensing filters (i.e., pre-trained object detectors) to represent an image. Instead of representing an image by its color, texture, or similar properties, Object Bank describes an image by the objects appearing in it and their locations. By encoding rich descriptive semantic and structural information about an image, Object Bank is extremely robust and powerful for complex scene understanding, including classification, retrieval, and annotation.

Description

Type of resource	text
Form	electronic; electronic resource; remote
Extent	1 online resource.
Publication date	2011
Issuance	monographic
Language	English

Creators/Contributors

Associated with	Li, Jia
Associated with	Stanford University, Computer Science Department
Primary advisor	Li, Fei Fei, 1976-
Thesis advisor	Li, Fei Fei, 1976-
Thesis advisor	Koller, Daphne
Thesis advisor	Ng, Andrew Y, 1976-
Advisor	Koller, Daphne
Advisor	Ng, Andrew Y, 1976-

Subjects

Genre	Theses

Bibliographic information

Statement of responsibility	Jia Li.
Note	Submitted to the Department of Computer Science.
Thesis	Ph. D. Stanford University 2011
Location	electronic resource

Access conditions

License: This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
