Classifying images into object or scene categories according to their content is an important topic in computer vision with many applications. In the real world, an image or an object is usually associated with rich contexts, which play an important role in how human vision performs categorization. In this thesis, we explore modeling these contexts for effective image categorization, and address the issues of defining, representing, and learning contexts in three categorization scenarios: single-label categorization, multi-label categorization, and pixel-level categorization, i.e., scene parsing.
We define two typical contextual relations between local features, i.e., a semantic conceptual relation and a spatial neighboring relation, and propose a local-feature-based Contextual Bag-of-Words (CBoW) model for single-label image categorization that builds on the popular Bag-of-Words (BoW) representation. The conceptual relation is learned from the similarity of the class distributions induced by the visual words corresponding to local features, and the spatial neighboring relation is learned as a confidence that neighboring visual words are relevant. Classification is performed with a support vector machine (SVM) using a kernel designed to incorporate this relational information.
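To make the conceptual relation more concrete, the following is a minimal sketch of one way such a relation could enter a kernel. It is not the CBoW formulation itself. It assumes that each visual word's class distribution is estimated from labeled BoW histograms, that similarity between two words is measured by the Bhattacharyya coefficient of their class distributions, and that the kernel takes the bilinear form x^T S y. The function names are hypothetical, and the spatial neighboring relation is not modeled here.

```python
import numpy as np


def word_class_distributions(histograms, labels, n_classes, eps=1e-12):
    """Estimate P(class | visual word) from labeled BoW histograms.

    histograms: (n_images, n_words) array of visual-word counts.
    labels:     (n_images,) array of integer class labels.
    """
    n_words = histograms.shape[1]
    counts = np.zeros((n_words, n_classes))
    for c in range(n_classes):
        # Total occurrences of each visual word in images of class c.
        counts[:, c] = histograms[labels == c].sum(axis=0)
    counts += eps  # avoid division by zero for unused words
    return counts / counts.sum(axis=1, keepdims=True)


def conceptual_similarity(dists):
    """Word-to-word similarity matrix S (assumed measure).

    S[i, j] is the Bhattacharyya coefficient between the class
    distributions of words i and j. S is a Gram matrix, hence
    positive semidefinite.
    """
    root = np.sqrt(dists)
    return root @ root.T


def contextual_kernel(X, Y, S):
    """Bilinear kernel k(x, y) = x^T S y between BoW histograms."""
    return X @ S @ Y.T
```

Because S is positive semidefinite in this sketch, the resulting Gram matrix is a valid kernel and could be passed to an SVM implementation that accepts precomputed kernels.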
Multi-label image categorization is more challenging than the single-label case, yet closer to real-world applications, since real-world images are usually associated with multiple labels. Conventional algorithms for multi-label image data predominantly rely on holistic image similarities, ignoring that each label essentially characterizes only a local region. Guided by the multi-label contexts of a collection of multi-label images, we propose Contextual Image Decomposition (CID) to obtain an optimal representation for each label of a set of multi-labeled images without explicit segmentation. The multi-label context is defined such that local label representations of the same category are similar across different im...