The objective of the proposed work is monocular vision-based 6-DOF pose estimation of a non-cooperative target spacecraft relative to the chaser satellite in rendezvous operations. In this work, a high-resolution transformer network replaces the convolutional neural network (CNN) for predicting the feature points of the target satellite. The self-attention mechanism inside the transformer overcomes the limitations of CNNs with respect to translation equivariance, 2D neighborhood awareness, and long-range dependencies. First, the 3D model of the target satellite is reconstructed using the inverse direct linear transform (IDLT) method. Then, the pose estimation pipeline is built from a learning-based image-processing subsystem and a geometric-optimization pose solver. The image-processing subsystem first localizes the target using a CNN-based architecture; the key-point detection network then regresses the 2D key points using the transformer-based network. Afterward, the predicted key points, chosen based on their confidence scores, are matched to the corresponding 3D points, and the pose is computed using the efficient perspective-n-point (EPnP) method. The pose is then refined using the iterative non-linear Gauss-Newton method. The proposed architecture is trained and tested on the spacecraft pose estimation dataset and shows superior accuracy in both translation and rotation. Owing to the self-attention mechanism, the architecture is robust to the drastically changing cluttered backgrounds and lighting conditions in space images. Moreover, the method consumes fewer computational resources, requiring fewer floating-point operations and trainable parameters at a low input image resolution.
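The last two stages of the pipeline, confidence-based key-point selection and Gauss-Newton pose refinement, can be illustrated with a minimal sketch. It is not the implementation of the proposed work: it assumes a pinhole camera, an axis-angle rotation parameterization, a top-n selection rule, a finite-difference Jacobian, and an initial pose of the kind an EPnP solver would supply. All function names, the camera intrinsics, and the synthetic cube model are hypothetical.

```python
import numpy as np

def rodrigues(rvec):
    """Convert an axis-angle vector to a 3x3 rotation matrix."""
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rvec / theta
    Kx = np.array([[0.0, -k[2], k[1]],
                   [k[2], 0.0, -k[0]],
                   [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * Kx + (1.0 - np.cos(theta)) * (Kx @ Kx)

def project(pose, pts3d, K_cam):
    """Pinhole projection of 3D model points; pose = [rvec (3), tvec (3)]."""
    R = rodrigues(pose[:3])
    pts_cam = pts3d @ R.T + pose[3:]      # model frame -> camera frame
    uvw = pts_cam @ K_cam.T               # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3]       # perspective division

def select_keypoints(kp2d, conf, pts3d, n_keep):
    """Keep the n_keep most confident 2D key points and their 3D counterparts
    (assumed selection rule, for illustration only)."""
    idx = np.argsort(conf)[::-1][:n_keep]
    return kp2d[idx], pts3d[idx]

def gauss_newton_refine(pose0, pts3d, kp2d, K_cam, iters=20, eps=1e-6):
    """Refine a 6-DOF pose by Gauss-Newton on the reprojection error."""
    pose = np.asarray(pose0, dtype=float).copy()
    for _ in range(iters):
        r = (project(pose, pts3d, K_cam) - kp2d).ravel()
        # Forward-difference Jacobian of the residual w.r.t. the 6 parameters.
        J = np.empty((r.size, 6))
        for j in range(6):
            d = np.zeros(6)
            d[j] = eps
            r_j = (project(pose + d, pts3d, K_cam) - kp2d).ravel()
            J[:, j] = (r_j - r) / eps
        # Gauss-Newton step: least-squares solution of J @ step = -r.
        step, *_ = np.linalg.lstsq(J, -r, rcond=None)
        pose += step
        if np.linalg.norm(step) < 1e-10:
            break
    return pose

# Usage on synthetic data: cube corners stand in for the 3D model points.
K_cam = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
model = np.array([[x, y, z] for x in (-0.5, 0.5)
                  for y in (-0.5, 0.5) for z in (-0.5, 0.5)])
true_pose = np.array([0.1, -0.2, 0.3, 0.2, -0.1, 8.0])
kp = project(true_pose, model, K_cam)
scores = np.array([0.9, 0.2, 0.8, 0.7, 0.1, 0.95, 0.6, 0.85])
kp_sel, model_sel = select_keypoints(kp, scores, model, n_keep=6)
initial = true_pose + np.array([0.05, -0.05, 0.05, 0.1, 0.1, 0.5])
print(gauss_newton_refine(initial, model_sel, kp_sel, K_cam))
```

The finite-difference Jacobian keeps the sketch short; an analytic Jacobian would be the more usual choice in practice, and the perturbed `initial` pose here merely stands in for a coarse solver output.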