Programming on a GPU has been made considerably easier with the introduction of virtual memory features, which support common pointer-based semantics between the CPU and the GPU. However, supporting virtual memory on a GPU comes with additional costs and overhead, the largest being the support for address translation. The fact that a massive number of threads run concurrently on a GPU means that the translation lookaside buffers (TLBs) are oversubscribed most of the time. Our investigation into a diverse set of GPU workloads shows that TLB miss rates can be extremely high (up to 99%), which inevitably leads to significant performance degradation due to long-latency page-table walks. Our profiling of TLB-sensitive workloads reveals a high degree of page sharing across the different cores of a GPU. In many applications, a page can be accessed in temporal proximity by multiple cores following similar memory access patterns. To exploit the inherent sharing present in GPU workloads, we propose Valkyrie, an integrated cooperative TLB prefetching mechanism and an inter-L1-TLB probing scheme that can efficiently reduce TLB bottlenecks in GPUs. Our evaluation using a diverse set of GPU workloads reveals that Valkyrie achieves an average speedup of 1.95×, while adding modest hardware overhead.