To protect against malicious attacks, symmetric security algorithms are commonly deployed on high-performance embedded systems such as network routers, database servers, and UTM systems. These algorithms must therefore be fast enough not to degrade the overall performance of such systems. We aim to optimize ARIA, a Korean symmetric block cipher similar to AES, for use on these systems. To this end, we propose three low-level techniques for improving the performance of ARIA at the software level. First, we exploit the 64-bit processing capability of current high-performance processors to reduce the number of instructions required to implement ARIA. Second, we attempt to maximize the utilization of hardware resources so as to increase instruction-level parallelism. Third, we apply low-level optimization techniques to reduce both the instruction count and the dependencies between instructions. By combining all three techniques, we improve ARIA performance by up to 47 percent over compiler-optimized code.
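As a rough illustration of the first technique, the sketch below XORs a 128-bit block with a 128-bit round key (the width ARIA uses for both) in two ways: with four 32-bit operations and with two 64-bit operations. It is only a minimal sketch, not the implementation described here; the function names `xor128_w32`, `xor128_w64`, and `xor128_self_check` are hypothetical and chosen for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Baseline: XOR two 128-bit values as four 32-bit words (four XORs). */
void xor128_w32(uint8_t out[16], const uint8_t a[16], const uint8_t b[16])
{
    uint32_t x[4], y[4];
    memcpy(x, a, 16);            /* memcpy avoids unaligned or aliasing loads */
    memcpy(y, b, 16);
    for (int i = 0; i < 4; i++)
        x[i] ^= y[i];
    memcpy(out, x, 16);
}

/* 64-bit variant: the same result with two 64-bit XORs. */
void xor128_w64(uint8_t out[16], const uint8_t a[16], const uint8_t b[16])
{
    uint64_t x[2], y[2];
    memcpy(x, a, 16);
    memcpy(y, b, 16);
    x[0] ^= y[0];
    x[1] ^= y[1];
    memcpy(out, x, 16);
}

/* Hypothetical self-check: both variants must produce identical bytes. */
void xor128_self_check(void)
{
    uint8_t a[16], b[16], r32[16], r64[16];
    for (int i = 0; i < 16; i++) {
        a[i] = (uint8_t)(i * 17);
        b[i] = (uint8_t)(255 - i);
    }
    xor128_w32(r32, a, b);
    xor128_w64(r64, a, b);
    assert(memcmp(r32, r64, 16) == 0);
}
```

Because XOR acts on each bit independently, the word width does not change the result, so the 64-bit variant halves the number of XOR operations for this step. How much of that saving carries over to a full cipher depends on the remaining layers and on the compiler, which this sketch does not model.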