For human-like embodied agents, including virtual avatars and social robots, making appropriate gestures while speaking is crucial in human-agent interaction. Co-speech gestures enrich the interaction experience and make agents appear lifelike. Existing gesture authoring approaches, such as keyframe animation and motion capture, are too expensive to apply when a large number of utterances require gestures. Generating human-like gestures automatically is also difficult because of the complexity of human motion and the ambiguity of the speech-gesture mapping. In this dissertation, I present a data-driven approach that learns gesticulation skills from a large corpus of human gesticulation videos. The proposed automatic gesture generation model uses the multimodal context of speech text, audio, and speaker identity to generate gestures reliably. By incorporating this multimodal context and an adversarial training scheme, the model outputs gestures that are human-like and that match the content and rhythm of the speech. In addition, to overcome a limitation of automatic generation, namely that the output motion is hard to modify as a gesture designer intends, I introduce an interactive gesture authoring toolkit, named SGToolkit, which accommodates fine-level pose controls and coarse-level style controls from users. A user study showed that the toolkit is preferred over manual authoring and that the generated gestures were human-like and appropriate to the input speech. This dissertation also proposes a new quantitative evaluation metric for gesture generation models to meet the demand for objective evaluation and to accelerate development in the field.