The integration of multisensory information plays a crucial role in autonomous robotics to forming robust and meaningful representations of the environment. In this work, we investigate how robust multimodal representations can naturally develop in a self-organizing manner from co-occurring multisensory inputs. We propose a hierarchical architecture with growing self-organizing neural networks for learning human actions from audiovisual inputs. The hierarchical processing of visual inputs allows to obtain progressively specialized neurons encoding latent spatiotemporal dynamics of the input, consistent with neurophysiological evidence for increasingly large temporal receptive windows in the human cortex. Associative links to bind unimodal representations are incrementally learned by a semi-supervised algorithm with bidirectional connectivity. Multimodal representations of actions are obtained using the co-activation of action features from video sequences and labels from automatic speech recognition. Experimental results on a dataset of 10 full-body actions show that our system achieves state-of-the-art classification performance without requiring the manual segmentation of training samples, and that congruent visual representations can be retrieved from recognized speech in the absence of visual stimuli. Together, these results show that our hierarchical neural architecture accounts for the development of robust multimodal representations from dynamic audiovisual inputs. (C) 2016 The Authors. Published by Elsevier B.V.