Recent advances in memory system design, along with 3D stacking, have made near-data processing (NDP) a more feasible way to accelerate different workloads. In this work, we explore the near-data processing opportunity of a fundamental operation: linked-list traversal (LLT). We propose a new NDP architecture that neither changes the existing sequential programming model nor requires any modification to the core microarchitecture. Instead, we exploit the packetized interface between the core and the memory modules to off-load LLT for NDP. We assume a system with multiple memory modules (e.g., Hybrid Memory Cube (HMC) modules) interconnected by a memory network. Our initial evaluation shows that simply off-loading LLT computation to the memory modules can actually reduce performance because of the additional off-chip memory network channel traversals. Thus, we first propose NDP-aware data localization, which exploits packaging locality (including locality within a single memory module and within a memory vault) to minimize latency and improve energy efficiency. To improve overall throughput and maximize parallelism, we then propose batching multiple LLT operations together to amortize the cost of NDP by utilizing the highly parallel execution of NDP processing units and the high bandwidth of 3D-stacked DRAM. Our evaluation shows that the combination of NDP-aware data localization and batching can provide significant improvements in performance and energy efficiency.
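The effect of data placement on off-loaded LLT can be illustrated with a small toy model. The sketch below is not the proposed architecture. It only counts how often a traversal would cross from one memory module to another, under the assumption that each list node carries a hypothetical `module` id. The names `Node`, `build_list`, `traverse`, and `batch_by_module` are illustrative and do not come from this work.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Node:
    key: int
    module: int                    # hypothetical id of the memory module holding this node
    next: Optional["Node"] = None


def build_list(keys: List[int], placement: Callable[[int], int]) -> Optional[Node]:
    """Build a singly linked list; placement(i) picks the module of the i-th node."""
    head = None
    for i in reversed(range(len(keys))):
        head = Node(keys[i], placement(i), head)
    return head


def traverse(head: Optional[Node], target: int) -> Tuple[bool, int]:
    """Search for target; return (found, number of cross-module pointer hops)."""
    hops = 0
    node = head
    while node is not None:
        if node.key == target:
            return True, hops
        # Following a pointer into a different module models one
        # traversal of the off-chip memory network.
        if node.next is not None and node.next.module != node.module:
            hops += 1
        node = node.next
    return False, hops


def batch_by_module(
    requests: List[Tuple[Node, int]]
) -> Dict[int, List[Tuple[Node, int]]]:
    """Group (head, target) requests by the module that holds each list head."""
    batches: Dict[int, List[Tuple[Node, int]]] = {}
    for head, target in requests:
        batches.setdefault(head.module, []).append((head, target))
    return batches


keys = list(range(8))
scattered = build_list(keys, lambda i: i % 4)   # nodes spread over 4 modules
localized = build_list(keys, lambda i: 0)       # all nodes in one module

print(traverse(scattered, 7))   # every pointer hop changes module
print(traverse(localized, 7))   # no hop leaves the module
```

In this toy model, the scattered list pays one network crossing per pointer dereference, while the localized list pays none. Grouping requests with `batch_by_module` loosely mirrors the idea of issuing several traversals to one module together, so that the off-load cost is shared across them.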