Production scheduling of semiconductor manufacturing tools is a challenging problem due to the complexity of the equipment and systems in modern wafer fabs. In this study, we focus on the photolithography toolset and model it as a non-identical parallel machine scheduling problem with random lot arrivals and auxiliary resource constraints. The proposed methodology aims to learn a near-optimal scheduling policy by incorporating work-in-process (WIP) levels, mask availability, and job tardiness. We propose an Action Filter (AF) that eliminates illogical actions and speeds up the agents' learning process. The model was evaluated in a simulation environment inspired by practical photolithography scheduling problems, across various settings with reticle and qualification constraints. Our experiments demonstrate improved performance over typical rule-based strategies: relative to our learning methods, the weighted shortest processing time (WSPT) and apparent tardiness cost with setups (ATCS) rules perform 28% and 32% worse, respectively, in terms of weighted tardiness.
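
To make the idea of filtering illogical actions concrete, the following minimal sketch removes lot-to-tool assignments that violate qualification or reticle constraints before a dispatching decision is made. It is an illustration under stated assumptions and not the implementation evaluated in this study: the names `Lot`, `feasible_actions`, `wspt_choice`, `qualified`, and `reticle_at` are hypothetical, and the WSPT rule stands in for whatever policy scores the remaining actions.

```python
from dataclasses import dataclass


@dataclass
class Lot:
    lot_id: str
    layer: str        # assumed: the layer determines the required reticle
    weight: float     # priority weight of the lot
    proc_time: float  # processing time on the tool


def feasible_actions(lots, tool, qualified, reticle_at):
    """Return indices of waiting lots that `tool` could legally start."""
    allowed = []
    for i, lot in enumerate(lots):
        # Qualification constraint: the tool must be qualified for the layer.
        is_qualified = lot.layer in qualified.get(tool, set())
        # Reticle constraint: the reticle must exist and be either free
        # (None) or already loaded on this tool.
        reticle_ok = (lot.layer in reticle_at
                      and reticle_at[lot.layer] in (None, tool))
        if is_qualified and reticle_ok:
            allowed.append(i)
    return allowed


def wspt_choice(lots, allowed):
    """WSPT: among allowed lots, pick the largest weight / processing time."""
    if not allowed:
        return None
    return max(allowed, key=lambda i: lots[i].weight / lots[i].proc_time)


# Hypothetical example data (not taken from the study).
lots = [
    Lot("A", "M1", weight=1.0, proc_time=10.0),
    Lot("B", "M2", weight=3.0, proc_time=10.0),
    Lot("C", "M1", weight=2.0, proc_time=5.0),
]
qualified = {"T1": {"M1"}, "T2": {"M1", "M2"}}
reticle_at = {"M1": None, "M2": "T2"}  # reticle M2 is loaded on tool T2

allowed_t1 = feasible_actions(lots, "T1", qualified, reticle_at)
chosen_t1 = wspt_choice(lots, allowed_t1)
```

In this example, tool `T1` is not qualified for layer `M2`, so lot `B` is filtered out and only lots `A` and `C` remain as candidates. A learned policy could apply the same filter by restricting its choice to the indices in `allowed_t1`, which is one plausible reading of how such a filter shrinks the action space.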