In real-time memory-intensive image processing and vision applications, increasing image resolution requires the use of external SDR/DDR memories. However, the arbitrary pixel access patterns used in most algorithms reduce memory throughput by increasing access latency. Efficient cache design is therefore paramount in real-time memory-intensive applications. A cache's effectiveness depends on the spatial and temporal locality of data access. In image processing, spatial locality refers to neighboring pixels, both horizontally and vertically, in the 2-D image. However, the conventional caches used in general-purpose processors cannot capture vertical locality. We propose a rolling cache optimized for image format and algorithms, a method to reduce the miss penalty by moving the cache horizontally and vertically, and a parallel processing architecture with interpolation, multilevel caches, and multiple caches. To support our idea, we compare it with other types of caches and show that the average memory access time and the memory bandwidth are reduced by 28% and 74%, respectively, for a 2048×2048 image. The rolling cache outperforms a 16-way set-associative cache, while its tag memory is slightly larger than that of a direct-mapped cache. Using two different applications, we show that the proposed architecture is applicable to algorithms whose data access follows an arbitrary curve or a block-wise pattern, which is the usual case in image processing and vision algorithms. For applications based on local data access in resource-limited systems, the proposed architecture makes it possible to achieve high performance at a lower operating frequency.
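
As an illustration of the vertical-locality problem described above, the following minimal sketch simulates a conventional direct-mapped cache (not the proposed rolling cache) on a row-major image. The line size, number of lines, 1-byte pixels, and the names `miss_rate` and `pixel_address` are all assumptions chosen for illustration; none of them come from the paper.

```python
# Toy model of a conventional direct-mapped cache. All parameters are
# illustrative assumptions, not values from the paper.

WIDTH = 2048  # pixels per image row, assuming 1 byte per pixel


def pixel_address(x, y, width=WIDTH):
    """Byte address of pixel (x, y) in a row-major image buffer."""
    return y * width + x


def miss_rate(addresses, line_bytes=32, num_lines=64):
    """Fraction of accesses that miss in a direct-mapped cache."""
    tags = [None] * num_lines  # one stored tag per cache line
    misses = 0
    total = 0
    for addr in addresses:
        block = addr // line_bytes   # memory block holding this byte
        index = block % num_lines    # cache line the block maps to
        tag = block // num_lines     # identifies which block occupies the line
        if tags[index] != tag:       # cold miss or conflict miss
            tags[index] = tag
            misses += 1
        total += 1
    return misses / total


# Walk 256 neighboring pixels along a row, then along a column.
horizontal = [pixel_address(x, 0) for x in range(256)]
vertical = [pixel_address(0, y) for y in range(256)]

print("horizontal walk miss rate:", miss_rate(horizontal))
print("vertical walk miss rate:  ", miss_rate(vertical))
```

Under these assumptions the horizontal walk misses only once per 32-byte line, whereas every step of the vertical walk lands in a different memory block that maps to the same cache line, so each access misses. This is the behavior that motivates a cache organized around the 2-D image layout.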