This paper presents Bat-FNet, a neural network that mimics the biological structure of the bat, enabling perception of three-dimensional (3D) environments through the fusion of auditory and visual information. Autonomous vehicles typically rely on visual sensors such as RADAR, LIDAR, and RGB cameras, together with sound sensors such as ultrasonic transducers. Visual sensors are vulnerable to adverse weather, in which sight is not secured, while ultrasonic sensors, although robust, are used only for distance measurement [54]. Inspired by bats, which use their eyes and ears in concert to survive in complex environments, Bat-FNet recognizes the location and size of a target object. We demonstrate the superiority of the fusion network via mean squared error (MSE) and intersection over union (IoU) scores, and we show robustness against image distortion through the complementary use of ultrasonic and camera sensors.
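As a minimal illustration of the IoU evaluation metric mentioned above, the following sketch computes IoU for two axis-aligned boxes; the box coordinates here are hypothetical examples, not values from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Hypothetical predicted box vs. ground-truth box
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.1429
```

An IoU of 1 indicates a perfect match between the predicted and ground-truth boxes, while 0 indicates no overlap.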