The increasing demand on a versatile high-performance metasurface requires a freeform design method that can handle a huge design space, which is many orders of magnitude larger than that of conventional fixed-shape optical structures. In this work, we formulate the designing process of one-dimensional freeform Si metasurface beam deflectors as a reinforcement learning problem to find their optimal structures consistently without requiring any prior metasurface data. During training, a deep Q-network-based agent stochastically explores the device design space around the learned trajectory optimized for deflection efficiency. The devices discovered by the agents show overall improvements in maximum efficiency compared to the ones that stateof-the-art baseline methods find at various wavelengths and deflection angles. Furthermore, the efficiencies of the devices generated by agents trained from different neural network initializations have a small variance, demonstrating the robustness of the proposed design method.