In highly parallel ILP (instruction-level parallel) processors, one of the performance bottlenecks is insufficient instruction fetch bandwidth. Fetch bandwidth is determined by several factors, such as the instruction cache hit ratio, branch prediction accuracy, basic block size, and the percentage of taken branches in the instruction stream.
A conventional fetch unit, which can fetch one basic block per cycle, delivers about five instructions per cycle, the average basic block size of general-purpose application programs. Because such a fetch unit cannot fully feed a highly parallel execution unit, it becomes necessary to fetch multiple basic blocks per cycle. A conventional cache memory hinders this effort because taken branches make the instruction fetch noncontiguous: a branch instruction and its target are typically located in different cache lines.
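The bandwidth ceiling described above can be illustrated with a toy model (an assumption of this sketch, not a model from the thesis): a fetch unit that stops at the end of each basic block delivers at most one block per cycle, so its sustained bandwidth tracks the average block size rather than the fetch width.

```python
import random

def conventional_fetch_bandwidth(block_sizes, fetch_width=16):
    """Model a fetch unit that stops at each taken branch:
    at most one basic block (capped at fetch_width) per cycle."""
    fetched = sum(min(b, fetch_width) for b in block_sizes)
    cycles = len(block_sizes)  # one basic block consumes one cycle
    return fetched / cycles

random.seed(0)
# Synthetic stream: basic blocks of 3-7 instructions (average ~5).
blocks = [random.randint(3, 7) for _ in range(10_000)]
print(round(conventional_fetch_bandwidth(blocks), 2))
```

Even with a 16-instruction fetch width, the model sustains only about five instructions per cycle, which is why fetching multiple basic blocks per cycle matters.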
Cache misses also limit instruction fetch bandwidth by introducing clock cycle overhead for the miss service. To reduce this overhead and improve the efficiency of the instruction fetch unit, prefetching brings instructions into the cache before the processor needs them.
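The general idea of instruction prefetching can be sketched with a minimal next-line prefetcher (a generic textbook scheme used here for illustration only, not the scheme proposed in this thesis): whenever a line is fetched, the sequentially following line is brought in as well, so straight-line code misses only once.

```python
LINE_SIZE = 16  # bytes per cache line (illustrative value)

class PrefetchingICache:
    """Toy instruction cache with next-line prefetching: each fetch
    also installs the sequentially following line, so a later demand
    for that line hits instead of missing."""

    def __init__(self):
        self.lines = set()
        self.hits = self.misses = 0

    def fetch(self, pc):
        line = pc // LINE_SIZE
        if line in self.lines:
            self.hits += 1
        else:
            self.misses += 1
            self.lines.add(line)
        self.lines.add(line + 1)  # prefetch the next sequential line

ic = PrefetchingICache()
for pc in range(0, 256, 4):  # straight-line code, 4-byte instructions
    ic.fetch(pc)
print(ic.hits, ic.misses)  # 63 hits, 1 miss
```

Sequential prefetching fails exactly where taken branches redirect the stream, which motivates steering prefetches with a branch predictor instead.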
In this thesis, we propose two hardware schemes to improve instruction fetch bandwidth: multiple basic block fetching and instruction prefetching. In the proposed multiple basic block fetching scheme, called the path-classified trace cache, paths are classified to improve the trace cache hit ratio, and basic blocks are joined to reduce the hardware cost of implementing the trace cache. The proposed instruction prefetching scheme, called look-ahead prefetching, reduces the miss penalty of the first-level instruction cache by using a branch predictor to accurately predict the target lines to prefetch.
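The core trace cache mechanism can be sketched as follows. This is a minimal illustration of the general technique, with hypothetical names and structure, assuming traces are indexed by a starting PC plus a taken/not-taken path; it is not the path-classified design itself.

```python
class TraceCache:
    """Minimal trace cache sketch: traces are indexed by (start PC,
    branch path), so a hit supplies several joined basic blocks in a
    single fetch cycle, even across taken branches."""

    def __init__(self):
        self.traces = {}  # (start_pc, path_bits) -> instruction list

    def fill(self, start_pc, path_bits, basic_blocks):
        # Join consecutive basic blocks into one contiguous trace.
        joined = [insn for block in basic_blocks for insn in block]
        self.traces[(start_pc, path_bits)] = joined

    def fetch(self, start_pc, path_bits):
        # Hit: the whole multi-block trace is delivered at once.
        # Miss (None): fall back to the conventional fetch path.
        return self.traces.get((start_pc, path_bits))

tc = TraceCache()
# Three basic blocks recorded along the path "taken, not-taken".
tc.fill(0x400, "TN", [["i0", "i1"], ["i2", "i3", "i4"], ["i5"]])
print(tc.fetch(0x400, "TN"))  # whole trace on a path match
print(tc.fetch(0x400, "TT"))  # different path: miss (None)
```

The lookup key makes explicit why path behavior matters: the same start PC with a different branch outcome is a miss, which is the problem that classifying paths is meant to mitigate.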
To investigate the performance of the proposed schemes, we constructed trace-driven simulation models for the proposed schemes which were ...