Visual learning with weakly labeled video
- With the rising popularity of Internet photo and video sharing sites like Flickr, Instagram, and YouTube, a large amount of visual data is uploaded to the Internet every day. In addition to pixels, these images and videos are often tagged with the visual concepts and activities they contain, leading to a natural source of weakly labeled visual data, in which we are not told where within the images and videos these concepts or activities occur. By developing methods that can effectively utilize weakly labeled visual data for tasks that have traditionally required clean data with laborious annotations, we can take advantage of the abundance and diversity of visual data on the Internet. In the first part of this thesis, we consider the problem of complex event recognition in weakly labeled video. In weakly labeled videos, the complex events we are interested in are often not temporally localized, and the videos contain varying amounts of contextual or unrelated segments. In addition, the complex events themselves often vary significantly in the actions they consist of, as well as the sequences in which those actions occur. To address this, we formulate a flexible, discriminative model that is able to learn the latent temporal structure of complex events from weakly labeled videos, resulting in a better understanding of the complex events and improved recognition performance. The second part of this thesis tackles the problem of object localization in weakly labeled video. To this end, we focus on several aspects of the object localization problem. First, using object detectors trained from images, we formulate a method for adapting these detectors to work well in video data by discovering and adapting them to examples automatically extracted from weakly labeled videos. Then, we separately explore the use of large amounts of negative and positive weakly labeled visual data for object localization. With only negative weakly labeled videos that do not contain a particular visual concept, we show how a very simple metric allows us to perform distributed object segmentation in potentially noisy, weakly labeled videos. With only positive weakly labeled images and videos that share a common visual concept, we show how we can leverage correspondence information between images and videos to identify and detect the common object. Lastly, we consider the problem of learning temporal embeddings from weakly labeled video. Using the implicit weak label that videos are sequences of temporally and semantically coherent images, we learn temporal embeddings for frames of video by associating frames with the temporal context that they appear in. These embeddings are able to capture semantic context, which results in better performance for a wide variety of standard tasks in video.
|Type of resource
|electronic; electronic resource; remote
|1 online resource.
|Stanford University, Department of Computer Science.
|Li, Fei Fei, 1976-
|Statement of responsibility
|Submitted to the Department of Computer Science.
|Thesis (Ph.D.)--Stanford University, 2015.
- © 2015 by Kevin Dechau Tang