Fine-grained Visual Categorization using PAIRS: Pose and Appearance Integration for Recognition

Pei Guo: PhD Qualifying Process

Thursday, October 12, 12:30 PM

3350 TMCB

Advisor: Ryan Farrell

In Fine-grained Visual Categorization (FGVC), the differences between similar categories are often highly localized on a small number of object parts, and significant pose variation therefore constitutes a great challenge for identification. To address this, we propose extracting image patches using pairs of predicted keypoint locations as anchor points.
The benefits of this approach are twofold: (1) we achieve explicit top-down visual attention on object parts, and (2) the extracted patches are pose-aligned and thus contain stable appearance features. We employ the popular Stacked Hourglass Network to predict keypoint locations and report state-of-the-art keypoint localization results on the challenging CUB-200-2011 dataset. Anchored by these predicted keypoints, pose-aligned patches are extracted, and a specialized appearance classification network is trained for each patch. An aggregating network then combines the patch networks' individual predictions to produce a final classification score. Our PAIRS algorithm attains an accuracy of 88.6%, an increase of 1.1% over the current state of the art. Enhancing the base PAIRS model with single-keypoint patches produces a further improvement, yielding a new state-of-the-art accuracy of 89.2% and demonstrating the power of integrating pose and appearance features.
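To make the idea of a keypoint-pair-anchored patch concrete, here is a minimal sketch in Python with NumPy. It is not the implementation behind the results above. The function name `extract_pair_patch`, the patch size, the `margin` parameter, and the nearest-neighbor sampling are all assumptions chosen for illustration. The sketch builds a similarity transform (rotation, uniform scale, translation) that places the first keypoint at a fixed spot on the left of the patch and the second at a fixed spot on the right, so the same pair of parts always lands in the same patch location regardless of the object's pose.

```python
import numpy as np


def extract_pair_patch(image, kp_a, kp_b, out_h=32, out_w=64, margin=8):
    """Sample a pose-aligned patch anchored by two keypoints (illustrative only).

    image : array of shape (H, W) or (H, W, C)
    kp_a, kp_b : (x, y) keypoint locations in image coordinates
    The patch is laid out so that kp_a maps to column `margin` and kp_b maps to
    column `out_w - 1 - margin`, both on the row `out_h // 2`.
    """
    h, w = image.shape[:2]
    ax, ay = kp_a
    dx, dy = kp_b[0] - ax, kp_b[1] - ay

    # Distance between the two anchor columns inside the patch.
    span = out_w - 1 - 2 * margin
    if span <= 0:
        raise ValueError("margin is too large for the requested patch width")
    if dx == 0 and dy == 0:
        raise ValueError("the two keypoints must be distinct")

    # Patch pixel grid, expressed as offsets from the first anchor position.
    rows, cols = np.mgrid[0:out_h, 0:out_w]
    du = cols - margin
    dv = rows - out_h // 2

    # Similarity transform: scale and rotate patch offsets so that the patch's
    # horizontal anchor axis maps onto the image vector kp_a -> kp_b.
    x = ax + (du * dx - dv * dy) / span
    y = ay + (du * dy + dv * dx) / span

    # Nearest-neighbor sampling; coordinates outside the image are clamped.
    xi = np.clip(np.rint(x).astype(int), 0, w - 1)
    yi = np.clip(np.rint(y).astype(int), 0, h - 1)
    return image[yi, xi]
```

Under these assumptions, one patch would be extracted for each keypoint pair of interest and passed to that pair's own classifier. A practical version would likely use bilinear interpolation and handle keypoints predicted as not visible, both of which are omitted here.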