Part-based methods have been widely applied to face verification in the wild, since they are more robust to local variations in pose, illumination, and other factors. However, most part-based approaches are built on hand-crafted features, which may not suit the specific purpose of face verification. In this paper, we propose to learn a part-based feature representation under the supervision of face identities through a deep model, so that the generated representations are more robust and better suited to face verification. The proposed framework consists of two carefully designed components: 1) a deep mixture model (DMM) that finds accurate patch correspondence and 2) a convolutional fusion network (CFN) that extracts part-based facial features. Specifically, the DMM robustly depicts the spatial-appearance distribution of patch features over faces via several Gaussian mixtures, which provide accurate patch correspondence even in the presence of local distortions. The DMM then feeds only the patches that preserve identity information to the subsequent CFN. The proposed CFN is a two-layer cascade of convolutional neural networks: 1) a local layer built on face patches to handle local variations and 2) a fusion layer that integrates the responses from the local layer. The CFN jointly learns and fuses multiple local responses to optimize verification performance. The resulting composite representation exhibits robustness to pose and illumination variations and achieves performance comparable to state-of-the-art methods on two benchmark data sets.
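The core DMM idea, fitting a Gaussian mixture to joint spatial-appearance patch descriptors so that patches from different faces assigned to the same component are treated as corresponding, can be sketched with a plain EM fit. This is a minimal illustration on synthetic 3-D descriptors (x, y, mean intensity); the function names, the tiny descriptor, and the diagonal-covariance simplification are ours, not the paper's:

```python
import numpy as np

def fit_gmm(X, k, iters=60):
    """Fit a diagonal-covariance Gaussian mixture to descriptors X of shape (n, d) via plain EM."""
    n, d = X.shape
    # Deterministic init: pick k points spread along the first coordinate.
    order = np.argsort(X[:, 0])
    means = X[order[np.linspace(0, n - 1, k).astype(int)]].copy()
    variances = np.ones((k, d))
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: log joint density of each point under each component -> responsibilities.
        log_p = (np.log(weights)[None, :]
                 - 0.5 * (((X[:, None, :] - means[None, :, :]) ** 2 / variances[None, :, :])
                          + np.log(2 * np.pi * variances[None, :, :])).sum(axis=2))
        log_p -= log_p.max(axis=1, keepdims=True)   # stabilise before exponentiating
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances from responsibilities.
        nk = resp.sum(axis=0) + 1e-9
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        variances = (resp.T @ X ** 2) / nk[:, None] - means ** 2 + 1e-6
    return means, variances, weights

def assign(X, means, variances, weights):
    """Hard-assign each descriptor to its most likely mixture component."""
    log_p = (np.log(weights)[None, :]
             - 0.5 * (((X[:, None, :] - means[None, :, :]) ** 2 / variances[None, :, :])
                      + np.log(2 * np.pi * variances[None, :, :])).sum(axis=2))
    return log_p.argmax(axis=1)

# Toy spatial-appearance descriptors for two kinds of patches across many faces;
# patches that land in the same component are put in correspondence.
rng = np.random.default_rng(0)
eye_patches = rng.normal([0.3, 0.35, 0.2], 0.03, size=(40, 3))
mouth_patches = rng.normal([0.5, 0.75, 0.6], 0.03, size=(40, 3))
descriptors = np.vstack([eye_patches, mouth_patches])
m, v, w = fit_gmm(descriptors, k=2)
labels = assign(descriptors, m, v, w)
```

Because the mixture is fit on position and appearance jointly, a patch displaced by a local distortion can still be assigned to the right component when its appearance matches, which is the sense in which the correspondence is robust.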
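The two-layer CFN structure, local convolutional responses computed per patch and then integrated by a fusion layer, can likewise be sketched in a forward pass. This is a shape-level numpy sketch under our own simplifications (global average pooling per filter, a linear fusion map); the real CFN is a learned network trained end to end, and all names here are illustrative:

```python
import numpy as np

def conv2d_valid(x, k):
    """Plain valid-mode 2-D cross-correlation (no padding, stride 1)."""
    H, W = x.shape
    kh, kw = k.shape
    return np.array([[np.sum(x[i:i + kh, j:j + kw] * k)
                      for j in range(W - kw + 1)]
                     for i in range(H - kh + 1)])

def local_layer(patch, filters):
    """Local layer: convolution + ReLU + global average pooling -> one response per filter."""
    return np.array([np.maximum(conv2d_valid(patch, f), 0).mean() for f in filters])

def cfn_forward(patches, filters_per_patch, W_fuse, b_fuse):
    """Fusion layer: concatenate all local responses and apply a learned linear map."""
    local = np.concatenate([local_layer(p, f)
                            for p, f in zip(patches, filters_per_patch)])
    return W_fuse @ local + b_fuse

# Three 8x8 patches, each with its own bank of four 3x3 filters, fused to a 16-D embedding.
rng = np.random.default_rng(0)
patches = [rng.normal(size=(8, 8)) for _ in range(3)]
filter_banks = [rng.normal(size=(4, 3, 3)) for _ in range(3)]
W_fuse = rng.normal(size=(16, 12))   # 3 patches x 4 responses = 12 local responses
b_fuse = np.zeros(16)
embedding = cfn_forward(patches, filter_banks, W_fuse, b_fuse)
```

Giving each patch its own filter bank lets the local layer specialise to that facial part, while the fusion layer learns how much each part's response should contribute to the verification embedding.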