2024年3月6日发(作者:苑建木)
稀疏矩阵向量乘的访存分析和优化
Memory Accessing Analysis of Sparse Matrix Vector Multiplication and
Optimization
Abstract: Sparse Matrix Vector multiplication (SpMV) is one of the most important kernel in computing science.
Both theoretical analysis and practices demonstrate that SpMV is a memory-intensive application. However
state-of-the-art compilers cannot make full use of the modern processors’ characters, the usage ratio of SpMV
bandwidth is only about 10%. In this paper, based on the memory access characters of current processors, we
optimize memory access in SpMV instruction pipeline using SIMD instruction. The experimental results show that
there are 63% and 89% performance improvement on Intel Nehalem and Sandy Bridge, 30% and 36%
improvement on AMD Opteron 6168 and Opteron 8374 compared with the standard SpMV implementation. As for
matrices in real applications, there are 10% performance gained on Nehalem and Sandy Bridge, nevertheless little
performance gain is achieved on the two AMD Opteron platforms.
Key words:
sparse matrix vector multiplication; SpMV; pipeline; SIMD; Memory Access
摘 要: 稀疏矩阵向量乘(SpMV)是科学计算中最重要的核心算法之一。理论分析和实际测试结果都表明,SpMV属于访存密集型应用。由于目前主流编译器尚不能充分利用现代处理器的访存特性,SpMV对带宽利用率仅为10%。本文通过探索现代处理器的访存特征,使用单指令流多数据流(SIMD)对SpMV进行流水线访存优化。实验表明与标准SpMV相比,优化后的SpMV在Intel Nehalem和Sandy Bridge架构上流水线性能分别有63%和89%的提升,在AMD Opteron 6168 和Opteron 8374 HE上分别有30%和36%的提升。SpMV在实际矩阵进行的性能测试中,Intel Nehalem和Sandy Bridge架构上均有10%的性能提高,而在两个AMD Opteron平台上性能提高不明显。
关键字:稀疏矩阵向量乘; SpMV; 流水线; SIMD; 访存
中图法分类号:
1 引言
很多科学和工程问题,如物理模拟(粒子碰撞,核爆炸模拟)、医学成像,通过建模,最终可以转化成线性方程组求解问题。由于计算时间的要求,大规模线性方程组往往采用迭代法求解,如Gauss-Seidel 迭代、Jacobi迭代、代数多重网格等。实际工程中,求解问题的矩阵规模一般很大、耗时较长,因此很有必要对其进行优化以提高工程效率。迭代方法的核心操作是矩阵向量乘(y
2024年3月6日发(作者:苑建木)
稀疏矩阵向量乘的访存分析和优化
Memory Accessing Analysis of Sparse Matrix Vector Multiplication and
Optimization
Abstract: Sparse Matrix Vector multiplication (SpMV) is one of the most important kernel in computing science.
Both theoretical analysis and practices demonstrate that SpMV is a memory-intensive application. However
state-of-the-art compilers cannot make full use of the modern processors’ characters, the usage ratio of SpMV
bandwidth is only about 10%. In this paper, based on the memory access characters of current processors, we
optimize memory access in SpMV instruction pipeline using SIMD instruction. The experimental results show that
there are 63% and 89% performance improvement on Intel Nehalem and Sandy Bridge, 30% and 36%
improvement on AMD Opteron 6168 and Opteron 8374 compared with the standard SpMV implementation. As for
matrices in real applications, there are 10% performance gained on Nehalem and Sandy Bridge, nevertheless little
performance gain is achieved on the two AMD Opteron platforms.
Key words:
sparse matrix vector multiplication; SpMV; pipeline; SIMD; Memory Access
摘 要: 稀疏矩阵向量乘(SpMV)是科学计算中最重要的核心算法之一。理论分析和实际测试结果都表明,SpMV属于访存密集型应用。由于目前主流编译器尚不能充分利用现代处理器的访存特性,SpMV对带宽利用率仅为10%。本文通过探索现代处理器的访存特征,使用单指令流多数据流(SIMD)对SpMV进行流水线访存优化。实验表明与标准SpMV相比,优化后的SpMV在Intel Nehalem和Sandy Bridge架构上流水线性能分别有63%和89%的提升,在AMD Opteron 6168 和Opteron 8374 HE上分别有30%和36%的提升。SpMV在实际矩阵进行的性能测试中,Intel Nehalem和Sandy Bridge架构上均有10%的性能提高,而在两个AMD Opteron平台上性能提高不明显。
关键字:稀疏矩阵向量乘; SpMV; 流水线; SIMD; 访存
中图法分类号:
1 引言
很多科学和工程问题,如物理模拟(粒子碰撞,核爆炸模拟)、医学成像,通过建模,最终可以转化成线性方程组求解问题。由于计算时间的要求,大规模线性方程组往往采用迭代法求解,如Gauss-Seidel 迭代、Jacobi迭代、代数多重网格等。实际工程中,求解问题的矩阵规模一般很大、耗时较长,因此很有必要对其进行优化以提高工程效率。迭代方法的核心操作是矩阵向量乘(y