In recent years, dynamic program analysis (DPA) has been widely used in various fields such as profiling, finding bugs, and security. However, existing solutions have their own weaknesses. Software solutions provide flexibility in DPA but they suffer from tremendous performance overhead. In contrast, core-level hardware engines rely on specialized integrated logics and attain extremely fast computation, but they have a limited functional extensibility because the logics are tightly coupled with the host processor. To mend this, a prior system-level approach utilizes an existing channel to integrate their hardware without necessitating the host architecture modification and introduced great potential in performance. Nevertheless, the prior work does not address the detailed design and implementation of the engine, which is quite essential to leverage the deployment on real systems. To address this, in this article, we propose an implementation of programmable DPA hardware engine, called program analysis unit (PAU). PAU is an application-specific instruction-set processor (ASIP) whose instruction set is customized to reflect common features of various DPA methods. With the specialized architecture and programmability of software, our PAU aims at fast computation and sufficient flexibility. In our case studies on several DPA techniques, we show that our ASIP approach can be successfully applicable to complex DPA schemes while providing hardware-backed power in performance and software-based flexibility in analysis. Recent experiments on our FPGA prototype revealed that the performance of PAU is 4.7-13.6 times faster than pure software DPA, and the power/area consumption is also acceptably small compared to today's mobile processors.