Sorting is a fundamental operation in many applications. Efficient sorting can exploit data locality by subdividing the data and processing the parts in parallel. This work presents a high-performance, area-efficient near-memory radix sort accelerator that performs end-to-end sorting locally. Its parallel 1-bit radix sorter processes multiple keys per cycle to achieve high throughput. In experiments on a Xilinx Zynq UltraScale+ ZCU104 FPGA, the accelerator achieves up to a 10x speedup over a CPU. Its low area cost allows it to be integrated into each processing node of a distributed computing system.
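For readers unfamiliar with 1-bit radix sorting, the following is a minimal software sketch of the general technique, not the accelerator's hardware design. It assumes non-negative integer keys; the function name `radix_sort_1bit` and the parameter `key_bits` are hypothetical and chosen only for illustration. Each pass inspects one bit, from least to most significant, and stably partitions the keys into those with a 0 and those with a 1 in that position.

```python
def radix_sort_1bit(keys, key_bits=32):
    """LSD radix sort with 1-bit digits (illustrative sketch only).

    Assumes every key is a non-negative integer smaller than 2**key_bits.
    """
    out = list(keys)
    for bit in range(key_bits):
        # Stable partition on the current bit: keys with a 0 keep their
        # relative order and come first, followed by keys with a 1.
        zeros = [k for k in out if not (k >> bit) & 1]
        ones = [k for k in out if (k >> bit) & 1]
        out = zeros + ones
    return out


print(radix_sort_1bit([5, 3, 8, 1], key_bits=4))  # [1, 3, 5, 8]
```

In this software model each pass visits the keys one at a time. A parallel hardware sorter would instead examine the selected bit of several keys in the same cycle, which is one plausible reading of how multiple keys per cycle are processed.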