Avatar-mediated mixed-reality telepresence enables physically distant users to collaborate remotely. However, the users' spaces are heterogeneous: objects differ in arrangement and shape across spaces, which makes it challenging to animate an avatar while preserving the motion context of the user. To address this problem, we propose a real-time neural framework that retargets upper-body motion to virtual avatars in dissimilar environments. Our architecture, trained in a supervised manner, incorporates a Mixture of Experts to learn a well-conditioned latent space for diverse upper-body motions and a Transformer-style attention mechanism to capture temporal dependencies between the user's and the avatar's motion histories. Through quantitative and qualitative evaluations, we demonstrate that our fast and lightweight architecture retargets upper-body motion, including gaze, deictic gestures, and environment contacts, to virtual avatars in dissimilar environments in real time. Our work is well suited to animating virtual avatars in telepresence scenarios such as interactive learning and remote collaboration.