Facial recognition for surveillance applications still remains challenging in uncontrolled environments, especially with the appearances of masks/veils and different ethnicities effects. Multimodal facial biometrics recognition becomes one of the major studies to overcome such scenarios. However, to cooperate with multimodal facial biometrics, many existing deep learning networks rely on feature concatenation or weight combination to construct a representation layer to perform its desired recognition task. This concatenation is often inefficient, as it does not effectively cooperate with the multimodal data to improve on recognition performance. Therefore, this paper proposes using multi-feature fusion layers for multi modal facial biometrics, thereby leading to significant and informative data learning in dual-stream convolutional neural networks. Specifically, this network consists of two progressive parts with distinct fusion strategies to aggregate RGB data and texture descriptors for multimodal facial biometrics. We demonstrate that the proposed network offers a discriminative feature representation and benefits from the multi-feature fusion layers for an accuracy-performance gain. We also introduce and share a new dataset for multimodal facial biometric data, namely the Ethnic-facial dataset for benchmarking. In addition, four publicly accessible datasets, namely AR. FaceScrub, IMDB_WIKI, and YouTube Face datasets are used to evaluate the proposed network. Through our experimental analysis, the proposed network outperformed several competing networks on these datasets for both recognition and verification tasks.