While audio-driven talking head generation has achieved highly realistic multi-speaker results, previous works rely on predefined auxiliary data such as 3D model parameters, facial landmarks, and head pose angles. Such explicit supervision is expensive: scanning 3D models requires special devices in a controlled lab environment, and landmarks require manual annotation. In this paper, we propose, for the first time, a multi-speaker talking video generation framework that uses no predefined priors. We first design a novel style code manipulator that explores the latent space of a pretrained StyleGAN3 and generates a sequence of style codes within the distribution of the generator. In this way, we achieve identity-preserving head pose matching without any predefined supervision. Furthermore, by leveraging the power of StyleGAN3, our framework achieves high-quality video generation. Finally, we adopt a sync loss, computed from an expert discriminator that maps audio and visual features into a unified space, for better lip synchronization. Our framework is fully unsupervised, as it includes no model trained with additional annotated data. Experimental results show that our method generates high-quality videos and performs competitively with state-of-the-art methods that rely on supervision.
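To make the style code manipulator concrete, the following is a minimal sketch of one plausible realization: the source portrait is inverted into the W+ space of the pretrained StyleGAN3, and a temporal network predicts small per-frame offsets from audio features, so each frame is decoded as G(w_id + Δw_t). The GRU backbone, layer sizes, offset scaling, and all module names here are our own assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class StyleCodeManipulator(nn.Module):
    """Hypothetical sketch: predict per-frame W+ offsets from audio so that
    generated codes stay near the inverted identity code w_id, preserving
    identity while letting head pose and mouth shape vary over time."""

    def __init__(self, audio_dim=768, w_dim=512, num_ws=16, hidden=512):
        super().__init__()
        self.num_ws, self.w_dim = num_ws, w_dim
        self.temporal = nn.GRU(audio_dim, hidden, batch_first=True)
        self.to_offset = nn.Linear(hidden, num_ws * w_dim)

    def forward(self, w_id, audio_feats):
        # w_id: (B, num_ws, w_dim) identity code from StyleGAN3 inversion
        # audio_feats: (B, T, audio_dim) per-frame audio features
        h, _ = self.temporal(audio_feats)            # (B, T, hidden)
        delta = self.to_offset(h)                    # (B, T, num_ws * w_dim)
        delta = delta.view(*delta.shape[:2], self.num_ws, self.w_dim)
        # Bounded, small offsets (assumed scale 0.1) keep the codes within
        # the generator's learned W distribution, as the abstract requires.
        return w_id.unsqueeze(1) + 0.1 * torch.tanh(delta)  # (B, T, num_ws, w_dim)
```

The returned code sequence would then be fed frame by frame to the frozen StyleGAN3 synthesis network; keeping the generator fixed is what lets the framework inherit its image quality without extra supervision.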
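The sync loss described above follows the expert-discriminator pattern popularized by SyncNet and Wav2Lip: two encoders embed an audio window and the corresponding lip frames into a shared space, and the loss rewards high cosine similarity for in-sync pairs. Below is a minimal sketch under that assumption; the encoder architectures, input shapes, and the mapping of similarity to a probability are placeholders, not the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SyncLoss(nn.Module):
    """SyncNet-style sync loss: BCE on the cosine similarity between audio
    and visual embeddings produced by a (frozen) expert discriminator."""

    def __init__(self, audio_encoder: nn.Module, visual_encoder: nn.Module):
        super().__init__()
        # Assumed to be the two halves of a pretrained expert discriminator
        # that maps both modalities into one unified feature space.
        self.audio_encoder = audio_encoder
        self.visual_encoder = visual_encoder

    def forward(self, mel: torch.Tensor, frames: torch.Tensor) -> torch.Tensor:
        a = F.normalize(self.audio_encoder(mel), dim=-1)     # (B, D)
        v = F.normalize(self.visual_encoder(frames), dim=-1)  # (B, D)
        sim = (a * v).sum(dim=-1).clamp(-1.0, 1.0)           # cosine in [-1, 1]
        p = (sim + 1.0) / 2.0                                # map to (0, 1)
        # Penalize deviation from the "in sync" label of 1.
        return F.binary_cross_entropy(p, torch.ones_like(p))

# Minimal usage with stand-in linear encoders (hypothetical shapes):
aud_enc = nn.Sequential(nn.Flatten(), nn.Linear(80 * 16, 512))
vis_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 5 * 48 * 96, 512))
loss = SyncLoss(aud_enc, vis_enc)(torch.randn(4, 80, 16),
                                  torch.randn(4, 3, 5, 48, 96))
```

Because only the generator receives gradients through this loss while the expert discriminator stays frozen, the term improves lip synchronization without adding any annotation beyond the raw audio-video pairing.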