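Speeding up NumPy matrix multiplication 300x on an M1 Mac (by linking OpenBLAS)
Note: this article is for readers who want a native M1 Python 3.9 environment that is comfortable for linear algebra work.
Prerequisites
This article assumes the following:
- An M1 Mac
- The Arm build of Homebrew
- The Arm build of Python 3.9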
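As a quick sanity check of these prerequisites, the interpreter's architecture can be inspected from Python itself. The sketch below assumes that a native Arm build on macOS reports `arm64` from `platform.machine()`, while an interpreter running under Rosetta 2 reports `x86_64`. The helper name `is_native_arm` is hypothetical and not part of any library.

```python
import platform
import sys


def is_native_arm(machine: str) -> bool:
    """Return True if the machine string denotes a native Arm64 build."""
    # macOS reports "arm64"; Linux on Arm hardware reports "aarch64".
    return machine.lower() in ("arm64", "aarch64")


if __name__ == "__main__":
    machine = platform.machine()
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} on {machine}")
    print("native Arm build" if is_native_arm(machine) else "not a native Arm build")
```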
Installation
The following commands build and install numpy/scipy from source, linked against OpenBLAS.
# Required to build numpy/scipy
% brew install openblas gfortran
% pip3 install cython pybind11
# Magic incantation: tells the build where Homebrew's OpenBLAS lives
% export OPENBLAS="$(brew --prefix openblas)/lib/"
# Build from source
% pip3 install --no-binary :all: --no-use-pep517 numpy
# Bonus: scipy as well (note that this build takes quite a while)
% pip3 install --no-binary :all: --no-use-pep517 scipy
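Once the build finishes, it is worth confirming that numpy actually picked up OpenBLAS. The following is a minimal sketch, assuming `numpy.show_config()` prints its report to standard output in the `key = value` format seen in the logs later in this article. Other numpy versions may format the report differently, in which case the parser would need adjusting. The helper names `mentions_openblas` and `numpy_config_text` are hypothetical.

```python
import contextlib
import io


def mentions_openblas(config_text: str) -> bool:
    """Return True if a 'libraries' line of the config dump names openblas."""
    for line in config_text.splitlines():
        # Section headers such as "openblas_info:" appear even when the
        # library is missing, so only "libraries = [...]" lines count.
        if line.strip().startswith("libraries") and "openblas" in line.lower():
            return True
    return False


def numpy_config_text() -> str:
    """Capture the text that numpy.show_config() prints."""
    import numpy as np  # imported lazily so the parser itself needs no numpy

    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        np.show_config()
    return buffer.getvalue()


if __name__ == "__main__":
    print("OpenBLAS linked:", mentions_openblas(numpy_config_text()))
```

If this prints `False` after the install, a likely cause is that pip reused a prebuilt wheel or that the `OPENBLAS` variable was not set in the shell that ran the build.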
Benchmark
I measured performance by running the following benchmark script:
▼ https://gist.github.com/markus-beuckelmann/8bc25531b11158431a5b09a45abd6276
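The core of such a benchmark can be sketched in a few lines. This is not the gist's code, only a minimal illustration that times `numpy.dot` on two random square matrices and keeps the best of several runs. The function name `time_matmul` and its default arguments are hypothetical, chosen for this example.

```python
import time

import numpy as np


def time_matmul(n: int = 4096, repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds for an n x n matrix product."""
    rng = np.random.default_rng(0)
    a = rng.random((n, n))
    b = rng.random((n, n))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        np.dot(a, b)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    print(f"Dotted two 4096x4096 matrices in {time_matmul():.2f} s.")
```

Without an optimized BLAS, the 4096x4096 case can take minutes per repetition, so a smaller `n` (for example `time_matmul(n=512)`) is a sensible first try.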
Name | Python Platform | BLAS / LAPACK | Matrix multiplication (4096x4096) [s] | Dot product (1x524288) [ms] | SVD (2048x1024) [s] | Cholesky decomposition (2048x2048) [s] | Eigendecomposition (2048x2048) [s] |
---|---|---|---|---|---|---|---|
Pure NumPy | aarch64 (Homebrew) | - | 298.54 | 1.09 | 13.27 | 2.12 | 73.81 |
NumPy + OpenBLAS | aarch64 (Homebrew) | OpenBLAS | 0.95 | 0.28 | 2.49 | 0.11 | 10.27 |
NumPy + Intel MKL | intel (Miniconda) | Intel MKL | 2.53 | 0.08 | 0.96 | 0.22 | 8.16 |
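Matrix multiplication became roughly 300 times faster than with pure NumPy (298.54 s down to 0.95 s). However, the native Arm build of OpenBLAS still loses to Intel MKL running under Rosetta 2 emulation on the dot product, SVD, and eigendecomposition; it comes out ahead only on matrix multiplication and Cholesky decomposition. Why is that?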
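The ratios behind these statements can be recomputed from the table. The sketch below copies the timings from the table into Python dictionaries (the variable and key names are illustrative) and derives the speedup of OpenBLAS over pure NumPy, as well as the faster of the two optimized builds for each benchmark.

```python
# Timings copied from the table above (dot product in ms, all others in s).
pure = {"matmul": 298.54, "dot": 1.09, "svd": 13.27, "cholesky": 2.12, "eig": 73.81}
openblas = {"matmul": 0.95, "dot": 0.28, "svd": 2.49, "cholesky": 0.11, "eig": 10.27}
mkl = {"matmul": 2.53, "dot": 0.08, "svd": 0.96, "cholesky": 0.22, "eig": 8.16}

# Speedup of the OpenBLAS build over pure NumPy, per benchmark.
speedup = {name: pure[name] / openblas[name] for name in pure}

# Which of the two optimized builds is faster, per benchmark.
winner = {
    name: "OpenBLAS" if openblas[name] < mkl[name] else "Intel MKL"
    for name in openblas
}

for name in pure:
    print(f"{name:9s} speedup {speedup[name]:7.1f}x  faster: {winner[name]}")
```

This gives a speedup of about 314x for matrix multiplication, and shows OpenBLAS ahead on matrix multiplication and Cholesky decomposition while Intel MKL is ahead on the other three benchmarks.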
Benchmark logs
Pure NumPy (aarch64)
Dotted two 4096x4096 matrices in 298.54 s.
Dotted two vectors of length 524288 in 1.09 ms.
SVD of a 2048x1024 matrix in 13.27 s.
Cholesky decomposition of a 2048x2048 matrix in 2.12 s.
Eigendecomposition of a 2048x2048 matrix in 73.81 s.
This was obtained using the following Numpy configuration:
blas_mkl_info:
NOT AVAILABLE
blis_info:
NOT AVAILABLE
openblas_info:
NOT AVAILABLE
atlas_3_10_blas_threads_info:
NOT AVAILABLE
atlas_3_10_blas_info:
NOT AVAILABLE
atlas_blas_threads_info:
NOT AVAILABLE
atlas_blas_info:
NOT AVAILABLE
blas_info:
NOT AVAILABLE
blas_src_info:
NOT AVAILABLE
blas_opt_info:
NOT AVAILABLE
lapack_mkl_info:
NOT AVAILABLE
openblas_lapack_info:
NOT AVAILABLE
openblas_clapack_info:
NOT AVAILABLE
flame_info:
NOT AVAILABLE
atlas_3_10_threads_info:
NOT AVAILABLE
atlas_3_10_info:
NOT AVAILABLE
atlas_threads_info:
NOT AVAILABLE
atlas_info:
NOT AVAILABLE
lapack_info:
NOT AVAILABLE
lapack_src_info:
NOT AVAILABLE
lapack_opt_info:
NOT AVAILABLE
numpy_linalg_lapack_lite:
language = c
define_macros = [('HAVE_BLAS_ILP64', None), ('BLAS_SYMBOL_SUFFIX', '64_')]
NumPy w/ OpenBLAS (aarch64)
qiita@m1 ~ % python numpy_benchmark.py
Dotted two 4096x4096 matrices in 0.95 s.
Dotted two vectors of length 524288 in 0.28 ms.
SVD of a 2048x1024 matrix in 2.49 s.
Cholesky decomposition of a 2048x2048 matrix in 0.11 s.
Eigendecomposition of a 2048x2048 matrix in 10.27 s.
This was obtained using the following Numpy configuration:
blas_mkl_info:
NOT AVAILABLE
blis_info:
NOT AVAILABLE
openblas_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/opt/homebrew/opt/openblas/lib/']
language = c
define_macros = [('HAVE_CBLAS', None)]
runtime_library_dirs = ['/opt/homebrew/opt/openblas/lib/']
blas_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/opt/homebrew/opt/openblas/lib/']
language = c
define_macros = [('HAVE_CBLAS', None)]
runtime_library_dirs = ['/opt/homebrew/opt/openblas/lib/']
lapack_mkl_info:
NOT AVAILABLE
openblas_lapack_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/opt/homebrew/opt/openblas/lib/']
language = c
define_macros = [('HAVE_CBLAS', None)]
runtime_library_dirs = ['/opt/homebrew/opt/openblas/lib/']
lapack_opt_info:
libraries = ['openblas', 'openblas']
library_dirs = ['/opt/homebrew/opt/openblas/lib/']
language = c
define_macros = [('HAVE_CBLAS', None)]
runtime_library_dirs = ['/opt/homebrew/opt/openblas/lib/']
NumPy w/ Intel MKL (x86_64)
Dotted two 4096x4096 matrices in 2.53 s.
Dotted two vectors of length 524288 in 0.08 ms.
SVD of a 2048x1024 matrix in 0.96 s.
Cholesky decomposition of a 2048x2048 matrix in 0.22 s.
Eigendecomposition of a 2048x2048 matrix in 8.16 s.
This was obtained using the following Numpy configuration:
blas_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/Users/qiita/miniconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/Users/user/miniconda3/include']
blas_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/Users/qiita/miniconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/Users/user/miniconda3/include']
lapack_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/Users/qiita/miniconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/Users/user/miniconda3/include']
lapack_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/Users/qiita/miniconda3/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/Users/qiita/miniconda3/include']
Source: https://qiita.com/atksh/items/3022de521f55ae654793 (original article on Qiita; copyright belongs to the original author)