Earlier studies have used reinforcement learning theory to explain how animals explore a task space to maximize reward. However, most empirical tests rely heavily on simple task paradigms, which limits our understanding of how agents explore an uncharted world with infinitely many options, a setting that inevitably entails the sparse reward problem. Here, we test the theoretical idea that metacognition1,2, the ability to introspect and estimate one’s own level of uncertainty, guides efficient exploration. We designed a novel two-stage decision-making task with infinitely many choices and sparse rewards (on average, 90% of rewards fall within fewer than 8% of options in reward states) and collected behavioral data from 88 subjects. First, we identified two key variables guiding exploration: uncertainty about the environmental structure (state-space uncertainty) and information about the reward structure (reward information). To further understand exploration dynamics, we dissociated the effects of
the two variables as a function of learning stage. We found that state-space uncertainty is significantly correlated with individual metacognitive ability measured in an independent perception task3. Interestingly, highly metacognitive subjects act on state-space uncertainty throughout learning, whereas the effect of reward information on their exploration behavior diminishes after the early learning stage. Note that this learning bias toward the environmental structure and away from the reward structure is a near-optimal exploration strategy for the sparse reward problem. In contrast, the effects of both variables persist in the low-metacognition subject group. Our theory is further supported by the finding that the high-metacognition group showed higher task performance and sampling efficiency in the test phase following the learning phase. Taken together, our work elucidates a crucial role of metacognition in fostering a sample-efficient, near-optimal exploration strategy that resolves uncertainty about environmental and reward structures.