We propose an organization for the on-chip memory system
of a chip multiprocessor, in which 16 processors share
a 16MB pool of 256 L2 cache banks. The L2 cache is organized
as a non-uniform cache architecture (NUCA) array
with a switched network embedded in it for high performance.
We show that this organization can support the
spectrum of degrees of sharing: unshared, in which each
processor has a private portion of the cache, thus reducing
hit latency; completely shared, in which every processor
shares the entire cache, thus minimizing misses; and every
point in between. We find the optimal degree of sharing for
a number of cache bank mapping policies, and also evaluate
a per-application cache partitioning strategy. We conclude
that a static NUCA organization with a sharing degree of two
or four works best across a suite of commercial and scientific
parallel workloads. We also demonstrate that migratory, dynamic
NUCA approaches improve performance significantly
for a subset of the workloads at the cost of increased power
consumption and complexity, especially when per-application
cache partitioning strategies are applied.
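
To make the notion of sharing degree concrete, the following minimal sketch works out the arithmetic implied by the figures above (16 processors, 256 banks, 16MB). It assumes that the cache is divided evenly among the sharing groups and that processors are grouped contiguously by ID; both are illustrative assumptions, and the names `SharingConfig`, `sharing_config`, and `partition_of` are hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass

# Figures stated in the abstract.
NUM_PROCESSORS = 16
NUM_BANKS = 256
TOTAL_CACHE_KB = 16 * 1024  # 16MB


@dataclass(frozen=True)
class SharingConfig:
    degree: int               # processors sharing one cache partition
    num_partitions: int       # independent partitions of the L2
    banks_per_partition: int
    kb_per_partition: int


def sharing_config(degree: int) -> SharingConfig:
    """Partition sizes for a given sharing degree.

    Assumption: banks and capacity are split evenly among partitions.
    """
    if degree <= 0 or NUM_PROCESSORS % degree != 0:
        raise ValueError("sharing degree must evenly divide the processor count")
    num_partitions = NUM_PROCESSORS // degree
    return SharingConfig(
        degree=degree,
        num_partitions=num_partitions,
        banks_per_partition=NUM_BANKS // num_partitions,
        kb_per_partition=TOTAL_CACHE_KB // num_partitions,
    )


def partition_of(processor_id: int, degree: int) -> int:
    """Partition index for a processor.

    Assumption: processors are grouped contiguously by ID; the actual
    grouping would depend on the physical layout.
    """
    if not 0 <= processor_id < NUM_PROCESSORS:
        raise ValueError("processor_id out of range")
    return processor_id // degree


if __name__ == "__main__":
    for d in (1, 2, 4, 8, 16):
        print(sharing_config(d))
```

Under these assumptions, a sharing degree of 1 gives each processor a private 1MB partition of 16 banks, while a sharing degree of 16 exposes all 256 banks to every processor. Intermediate degrees trade the short hit latency of the former against the lower miss rate of the latter.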