Recent deep learning-based approaches to music analysis have had a significant impact on procedural content generation in music-based games. However, without an understanding of the unique features of each platform and interface, auto-generated content remains less valuable than manually designed content. Hand-crafted datasets are required to improve content quality across platforms, but most rhythm games permit only indirect access to such data, in the form of players' gameplay experience and its replay videos. We develop a vision-based approach that extracts content through video analysis into a format called a beatmap. We survey common visual features of well-known rhythm games and construct a mapping from their content to our beatmap model using multiple object detection. Our method correctly detects the button, type, and timing of each action, and extracts beatmap representations for our target game.
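To make the detection-to-beatmap step concrete, the following is a minimal, hypothetical sketch, not the paper's implementation: all names (`Detection`, `Note`, `detections_to_beatmap`) are assumptions, and it presumes that some object detector has already produced per-frame classifications of (button, note type). It collapses runs of consecutive frames showing the same note into single timed events.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple

class NoteType(Enum):
    TAP = "tap"
    HOLD = "hold"

@dataclass
class Detection:
    """One per-frame detection, already classified (hypothetical format)."""
    frame_index: int
    button: int
    note_type: NoteType

@dataclass
class Note:
    """One beatmap entry: which button, what kind of action, and when (ms)."""
    time_ms: int
    button: int
    note_type: NoteType

def detections_to_beatmap(detections: List[Detection], fps: float) -> List[Note]:
    """Collapse per-frame detections into discrete beatmap notes.

    A note visible across consecutive frames yields several detections;
    we keep only the first frame of each run per (button, type) and
    convert its frame index to a timestamp using the video frame rate.
    """
    ordered = sorted(detections,
                     key=lambda d: (d.button, d.note_type.value, d.frame_index))
    last_frame: Dict[Tuple[int, NoteType], int] = {}
    notes: List[Note] = []
    for d in ordered:
        key = (d.button, d.note_type)
        prev = last_frame.get(key)
        # A gap of more than one frame starts a new note for this button/type.
        if prev is None or d.frame_index - prev > 1:
            notes.append(Note(time_ms=round(d.frame_index / fps * 1000),
                              button=d.button,
                              note_type=d.note_type))
        last_frame[key] = d.frame_index
    return sorted(notes, key=lambda n: n.time_ms)

if __name__ == "__main__":
    # Two runs on button 0 (frames 10-12 and frame 40) yield two tap notes.
    dets = [Detection(10, 0, NoteType.TAP), Detection(11, 0, NoteType.TAP),
            Detection(12, 0, NoteType.TAP), Detection(40, 0, NoteType.TAP)]
    for note in detections_to_beatmap(dets, fps=60.0):
        print(note)  # notes at ~167 ms and ~667 ms
```

The run-merging heuristic here is one simple choice; a real pipeline would also need to handle detector dropouts and hold-note durations, which this sketch deliberately omits.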