Speeding up NumPy matrix multiplication 300x on an M1 Mac (linking OpenBLAS)


* This article is for readers who want an M1-native Python 3.9 environment that is comfortable for linear algebra work.

Prerequisites

This article assumes the following:

  • M1 Mac
  • Arm build of Homebrew
  • Arm build of Python 3.9

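Installation

The following commands build numpy/scipy from source and install them linked against OpenBLAS.

# Build dependencies for numpy/scipy
% brew install openblas gfortran
% pip3 install cython pybind11
# The magic incantation: tell the build where to find OpenBLAS
% export OPENBLAS="$(brew --prefix openblas)/lib/"
# Build from source
% pip3 install --no-binary :all: --no-use-pep517 numpy
# Optionally, scipy as well (note that this build takes quite a while)
% pip3 install --no-binary :all: --no-use-pep517 scipy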
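
To confirm that the build actually picked up OpenBLAS, one option is to inspect what `numpy.show_config()` prints. The following is a minimal sketch, not an official check. The helper names `classify_blas` and `numpy_config_text` are hypothetical, and the string matching is a heuristic that assumes the configuration layout shown in the logs at the end of this article. The handling of `name:` lines is an extra assumption about newer NumPy releases.

```python
import contextlib
import io


def classify_blas(config_text: str) -> str:
    """Guess the BLAS backend from the text printed by numpy.show_config().

    Heuristic only (an assumption, not a NumPy API): inspect the lines that
    list linked libraries. Section headers such as "openblas_info:" followed
    by "NOT AVAILABLE" are deliberately ignored.
    """
    for line in config_text.lower().splitlines():
        stripped = line.strip()
        # "libraries = [...]" is the layout seen in this article's logs;
        # "name: ..." is assumed for newer NumPy output.
        if not (stripped.startswith("libraries") or stripped.startswith("name:")):
            continue
        if "mkl" in stripped:
            return "Intel MKL"
        if "openblas" in stripped:
            return "OpenBLAS"
    return "unknown"


def numpy_config_text() -> str:
    """Capture the output of numpy.show_config() as a string."""
    import numpy  # imported lazily so classify_blas works without NumPy

    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        numpy.show_config()
    return buffer.getvalue()


# Example with a fragment copied from the OpenBLAS log in this article.
sample = "openblas_info:\n    libraries = ['openblas', 'openblas']\n"
print(classify_blas(sample))
```

On an actual installation, `print(classify_blas(numpy_config_text()))` would report the guess for the NumPy that the current interpreter imports.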

Verification

I measured performance by running the following benchmark script:
https://gist.github.com/markus-beuckelmann/8bc25531b11158431a5b09a45abd6276
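
For readers who only want a quick sanity check, the kind of measurement involved can be sketched in a few lines. This is not the code from the gist above; it is a simplified illustration in which the function name `time_matmul`, the best-of-N timing strategy, and the default parameters are my own assumptions.

```python
import time

import numpy as np


def time_matmul(n: int, repeats: int = 3, seed: int = 0) -> float:
    """Return the best wall-clock time in seconds for one n x n matrix product."""
    rng = np.random.default_rng(seed)
    a = rng.random((n, n))
    b = rng.random((n, n))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        a @ b  # dispatched to the linked BLAS when one is available
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # The article's table uses n = 4096; a small n keeps this sketch quick
    # even on a NumPy build without an optimized BLAS.
    n = 256
    print(f"{n}x{n} matrix product: {time_matmul(n):.4f} s")
```

Taking the best of several runs reduces the influence of one-off warm-up costs; raising `n` to 4096 gives a number roughly comparable to the matrix product column in the table.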

| Name | Python platform | BLAS / LAPACK | Matrix product (4096x4096) [s] | Dot product (1x524288) [ms] | SVD (2048x1024) [s] | Cholesky decomposition (2048x2048) [s] | Eigendecomposition (2048x2048) [s] |
|---|---|---|---|---|---|---|---|
| Pure NumPy | aarch64 (Homebrew) | - | 298.54 | 1.09 | 13.27 | 2.12 | 73.81 |
| NumPy + OpenBLAS | aarch64 (Homebrew) | OpenBLAS | 0.95 | 0.28 | 2.49 | 0.11 | 10.27 |
| NumPy + Intel MKL | intel (Miniconda) | Intel MKL | 2.53 | 0.08 | 0.96 | 0.22 | 8.16 |

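Matrix multiplication became about 300x faster than with pure NumPy (298.54 s down to 0.95 s). But why does the native Arm build of OpenBLAS lose to Intel MKL, which runs emulated under Rosetta 2, on the dot product, SVD, and eigendecomposition? In this measurement OpenBLAS comes out ahead only on matrix multiplication and Cholesky decomposition.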
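
The ratios behind these statements can be recomputed directly from the table. The sketch below uses only the timings reported in this article; the dictionary layout and the helper names `speedup` and `faster_backend` are hypothetical and exist purely for illustration.

```python
# Timings copied from the table in this article
# (seconds, except the dot product, which is in milliseconds).
TIMINGS = {
    "Pure NumPy": {"matmul": 298.54, "dot": 1.09, "svd": 13.27, "cholesky": 2.12, "eig": 73.81},
    "OpenBLAS": {"matmul": 0.95, "dot": 0.28, "svd": 2.49, "cholesky": 0.11, "eig": 10.27},
    "Intel MKL": {"matmul": 2.53, "dot": 0.08, "svd": 0.96, "cholesky": 0.22, "eig": 8.16},
}


def speedup(baseline: str, candidate: str) -> dict:
    """Per-task ratio baseline_time / candidate_time (above 1 means candidate is faster)."""
    return {
        task: TIMINGS[baseline][task] / TIMINGS[candidate][task]
        for task in TIMINGS[baseline]
    }


def faster_backend(first: str, second: str) -> dict:
    """For each task, name whichever of the two backends has the smaller time."""
    return {
        task: first if TIMINGS[first][task] < TIMINGS[second][task] else second
        for task in TIMINGS[first]
    }


# Speedup of OpenBLAS over the pure NumPy build, task by task.
for task, ratio in speedup("Pure NumPy", "OpenBLAS").items():
    print(f"{task:9s} {ratio:7.1f}x")

# Which backend wins each task in the OpenBLAS vs. Intel MKL comparison.
print(faster_backend("OpenBLAS", "Intel MKL"))
```

These are single measurements on one machine, so the ratios should be read as rough indications rather than precise figures.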

Benchmark logs

Pure NumPy (aarch64)

Dotted two 4096x4096 matrices in 298.54 s.
Dotted two vectors of length 524288 in 1.09 ms.
SVD of a 2048x1024 matrix in 13.27 s.
Cholesky decomposition of a 2048x2048 matrix in 2.12 s.
Eigendecomposition of a 2048x2048 matrix in 73.81 s.

This was obtained using the following Numpy configuration:
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
  NOT AVAILABLE
atlas_3_10_blas_threads_info:
  NOT AVAILABLE
atlas_3_10_blas_info:
  NOT AVAILABLE
atlas_blas_threads_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
blas_info:
  NOT AVAILABLE
blas_src_info:
  NOT AVAILABLE
blas_opt_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
  NOT AVAILABLE
openblas_clapack_info:
  NOT AVAILABLE
flame_info:
  NOT AVAILABLE
atlas_3_10_threads_info:
  NOT AVAILABLE
atlas_3_10_info:
  NOT AVAILABLE
atlas_threads_info:
  NOT AVAILABLE
atlas_info:
  NOT AVAILABLE
lapack_info:
  NOT AVAILABLE
lapack_src_info:
  NOT AVAILABLE
lapack_opt_info:
  NOT AVAILABLE
numpy_linalg_lapack_lite:
    language = c
    define_macros = [('HAVE_BLAS_ILP64', None), ('BLAS_SYMBOL_SUFFIX', '64_')]

NumPy w/ OpenBLAS (aarch64)

qiita@m1 ~ % python numpy_benchmark.py 
Dotted two 4096x4096 matrices in 0.95 s.
Dotted two vectors of length 524288 in 0.28 ms.
SVD of a 2048x1024 matrix in 2.49 s.
Cholesky decomposition of a 2048x2048 matrix in 0.11 s.
Eigendecomposition of a 2048x2048 matrix in 10.27 s.

This was obtained using the following Numpy configuration:
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/homebrew/opt/openblas/lib/']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/opt/homebrew/opt/openblas/lib/']
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/homebrew/opt/openblas/lib/']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/opt/homebrew/opt/openblas/lib/']
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/homebrew/opt/openblas/lib/']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/opt/homebrew/opt/openblas/lib/']
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/homebrew/opt/openblas/lib/']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/opt/homebrew/opt/openblas/lib/']

NumPy w/ Intel MKL (x86_64)

Dotted two 4096x4096 matrices in 2.53 s.
Dotted two vectors of length 524288 in 0.08 ms.
SVD of a 2048x1024 matrix in 0.96 s.
Cholesky decomposition of a 2048x2048 matrix in 0.22 s.
Eigendecomposition of a 2048x2048 matrix in 8.16 s.

This was obtained using the following Numpy configuration:
blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/qiita/miniconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/user/miniconda3/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/qiita/miniconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/user/miniconda3/include']
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/qiita/miniconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/user/miniconda3/include']
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/qiita/miniconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/qiita/miniconda3/include']