We embed special function units (SFUs) in the homogeneous stream processors (SPs) of a graphics processing unit (GPU) to improve its performance on modern programmable shaders, which make poor use of a single-instruction multiple-data (SIMD) architecture. We also compact instructions to reduce the size of the instruction memory, and we reduce area requirements by using a partial SFU in the SPs together with a lookup table shared among multiple SFUs. Compared with a baseline SIMD architecture, the result is an increase of 88% in utilization and a reduction of 27% in the normalized area-delay product. We verified our architecture on a field-programmable gate array (FPGA) evaluation platform with an ARM9 host processor and a full 3-D graphics pipeline.