Static timing analysis (STA) for large-scale modern circuits requires extensive handling of false paths, multi-cycle paths, and other types of path exceptions. Despite the linear nature of timing propagation, we show that exception-aware STA is NP-hard and thus requires a long runtime to solve using conventional CPU-based methods. To overcome this runtime challenge, we propose a general CPU-GPU heterogeneous algorithm, HeteroExcept, that can handle common types of path exceptions and efficiently generate an accurate path report. Our algorithm targets runtime efficiency at the scale of thousands of exception rules and millions of circuit elements. To further improve the performance, we optimize our GPU implementation by introducing a cost-effective data exchange strategy between CPU and GPU. Experimental results demonstrate up to 6.84× and 12.93× speed-up compared to industrial timers, PrimeTime and OpenSTA.