Speeding up NumPy matrix multiplication 300x on an M1 Mac (linking against OpenBLAS)


Note: this article is for readers who want an M1-native Python 3.9 environment with fast linear algebra.

Prerequisites

This article assumes the following:

M1 Mac
Arm build of Homebrew
Arm build of Python 3.9
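
One way to confirm the last assumption is to ask the interpreter which architecture it is running on. The sketch below is an illustration and not part of the original setup. The helper names `is_native_arm` and `describe_interpreter` are hypothetical. It relies on `platform.machine()` reporting `arm64` for a native build on macOS, whereas an x86_64 build running under Rosetta 2 reports `x86_64`.

```python
import platform
import sys


def is_native_arm(machine):
    """Return True if the reported machine type is 64-bit Arm.

    macOS reports "arm64"; Linux reports "aarch64" for the same architecture.
    """
    return machine.lower() in ("arm64", "aarch64")


def describe_interpreter():
    """Summarize the running interpreter's version and architecture."""
    version = "{}.{}".format(*sys.version_info[:2])
    machine = platform.machine()
    if is_native_arm(machine):
        kind = "native Arm"
    else:
        kind = "not Arm (possibly running under Rosetta 2)"
    return "Python {} on {} ({})".format(version, machine, kind)


print(describe_interpreter())
```

If the output does not mention Arm, the interpreter is probably an Intel build, and the build steps in this article would not produce a native NumPy.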

Installation

The following commands build numpy/scipy from source, link them against OpenBLAS, and install them.

# required to build numpy/scipy
% brew install openblas gfortran
% pip3 install cython pybind11
# the "magic" step: tell the build where Homebrew's OpenBLAS lives
% export OPENBLAS="$(brew --prefix openblas)/lib/"
# build from source
% pip3 install --no-binary :all: --no-use-pep517 numpy
# optionally scipy as well (note that this build takes quite a while)
% pip3 install --no-binary :all: --no-use-pep517 scipy
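
After the build it is worth checking that NumPy really picked up OpenBLAS. The following is a minimal sketch, not the author's procedure: it captures the text printed by `numpy.show_config()` and collects the names found on `libraries = [...]` lines. The function names are hypothetical, and the parser assumes the output layout shown in the logs later in this article; newer NumPy releases print a different layout, so the regular expression would need adapting there.

```python
import ast
import contextlib
import io
import re


def linked_libraries(config_text):
    """Collect library names from `libraries = [...]` lines of show_config() text.

    Assumes the section layout used in this article's logs (an assumption,
    not a stable NumPy interface).
    """
    names = set()
    pattern = r"^\s*libraries\s*=\s*(\[.*\])\s*$"
    for match in re.finditer(pattern, config_text, re.MULTILINE):
        # The right-hand side is a Python list literal such as ['openblas', 'openblas'].
        names.update(ast.literal_eval(match.group(1)))
    return sorted(names)


def capture_numpy_config():
    """Return the text that numpy.show_config() prints to stdout."""
    import numpy  # imported lazily so the parser above works without NumPy

    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        numpy.show_config()
    return buffer.getvalue()
```

Calling `linked_libraries(capture_numpy_config())` on a successful build should list `openblas`. An empty list suggests that no external BLAS was found, as in the "Pure NumPy" log in this article.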

Benchmark

I measured performance by running the following benchmark script:
▼ https://gist.github.com/markus-beuckelmann/8bc25531b11158431a5b09a45abd6276
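
For readers who only want a quick check without the full script, the idea of timing a matrix product can be sketched as follows. This is not the code from the linked gist. The helper names, the matrix size of 1024, and the choice to report the best of several runs are all assumptions made for illustration.

```python
import time


def best_time(fn, repeats=3):
    """Return the fastest wall-clock time, in seconds, of `repeats` calls to fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def matmul_benchmark(size=1024, repeats=3):
    """Time the product of two random size x size matrices (requires NumPy)."""
    import numpy as np  # imported lazily; best_time itself needs only the stdlib

    rng = np.random.default_rng(0)
    a = rng.random((size, size))
    b = rng.random((size, size))
    return best_time(lambda: a @ b, repeats)
```

Running `matmul_benchmark()` before and after relinking gives a rough before/after comparison. The absolute numbers will not match the table below, because the matrix size and the timing method differ.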

Name	Python Platform	BLAS / LAPACK	Matrix product (4096x4096) [s]	Dot product (length 524288) [ms]	SVD (2048x1024) [s]	Cholesky decomposition (2048x2048) [s]	Eigendecomposition (2048x2048) [s]
Pure NumPy	aarch64 (Homebrew)	-	298.54	1.09	13.27	2.12	73.81
NumPy + OpenBLAS	aarch64 (Homebrew)	OpenBLAS	0.95	0.28	2.49	0.11	10.27
NumPy + Intel MKL	intel (Miniconda)	Intel MKL	2.53	0.08	0.96	0.22	8.16

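For the matrix product, linking OpenBLAS made NumPy roughly 300 times faster than pure NumPy (298.54 s down to 0.95 s). Still, why does the native Arm build of OpenBLAS lose to Intel MKL, which is running emulated under Rosetta 2, on the dot product, SVD, and eigendecomposition? OpenBLAS is ahead only on the matrix product and the Cholesky decomposition.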
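
The ratios behind this comparison can be recomputed directly from the table. The snippet below is only a small arithmetic sketch: the timings are copied from the table above, and the names `TIMINGS` and `compare` are introduced here for illustration.

```python
# Timings copied from the results table, in the order
# (pure NumPy, NumPy + OpenBLAS, NumPy + Intel MKL under Rosetta 2).
TIMINGS = {
    "matrix product [s]": (298.54, 0.95, 2.53),
    "dot product [ms]": (1.09, 0.28, 0.08),
    "SVD [s]": (13.27, 2.49, 0.96),
    "Cholesky [s]": (2.12, 0.11, 0.22),
    "eigendecomposition [s]": (73.81, 10.27, 8.16),
}


def compare(timings):
    """Map each benchmark to (OpenBLAS speedup over pure NumPy, faster BLAS)."""
    result = {}
    for name, (pure, openblas, mkl) in timings.items():
        winner = "OpenBLAS" if openblas < mkl else "MKL"
        result[name] = (pure / openblas, winner)
    return result


for name, (speedup, winner) in compare(TIMINGS).items():
    print("{:<24} {:7.1f}x vs pure NumPy, faster BLAS: {}".format(name, speedup, winner))
```

The matrix product comes out at a little over 300x, which is where the figure in the title comes from.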

Benchmark logs

Pure NumPy (aarch64)

Dotted two 4096x4096 matrices in 298.54 s.
Dotted two vectors of length 524288 in 1.09 ms.
SVD of a 2048x1024 matrix in 13.27 s.
Cholesky decomposition of a 2048x2048 matrix in 2.12 s.
Eigendecomposition of a 2048x2048 matrix in 73.81 s.

This was obtained using the following Numpy configuration:
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
  NOT AVAILABLE
atlas_3_10_blas_threads_info:
  NOT AVAILABLE
atlas_3_10_blas_info:
  NOT AVAILABLE
atlas_blas_threads_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
blas_info:
  NOT AVAILABLE
blas_src_info:
  NOT AVAILABLE
blas_opt_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
  NOT AVAILABLE
openblas_clapack_info:
  NOT AVAILABLE
flame_info:
  NOT AVAILABLE
atlas_3_10_threads_info:
  NOT AVAILABLE
atlas_3_10_info:
  NOT AVAILABLE
atlas_threads_info:
  NOT AVAILABLE
atlas_info:
  NOT AVAILABLE
lapack_info:
  NOT AVAILABLE
lapack_src_info:
  NOT AVAILABLE
lapack_opt_info:
  NOT AVAILABLE
numpy_linalg_lapack_lite:
    language = c
    define_macros = [('HAVE_BLAS_ILP64', None), ('BLAS_SYMBOL_SUFFIX', '64_')]

NumPy w/ OpenBLAS (aarch64)

qiita@m1 ~ % python numpy_benchmark.py 
Dotted two 4096x4096 matrices in 0.95 s.
Dotted two vectors of length 524288 in 0.28 ms.
SVD of a 2048x1024 matrix in 2.49 s.
Cholesky decomposition of a 2048x2048 matrix in 0.11 s.
Eigendecomposition of a 2048x2048 matrix in 10.27 s.

This was obtained using the following Numpy configuration:
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/homebrew/opt/openblas/lib/']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/opt/homebrew/opt/openblas/lib/']
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/homebrew/opt/openblas/lib/']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/opt/homebrew/opt/openblas/lib/']
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/homebrew/opt/openblas/lib/']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/opt/homebrew/opt/openblas/lib/']
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/homebrew/opt/openblas/lib/']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/opt/homebrew/opt/openblas/lib/']

NumPy w/ Intel MKL (x86_64)

Dotted two 4096x4096 matrices in 2.53 s.
Dotted two vectors of length 524288 in 0.08 ms.
SVD of a 2048x1024 matrix in 0.96 s.
Cholesky decomposition of a 2048x2048 matrix in 0.22 s.
Eigendecomposition of a 2048x2048 matrix in 8.16 s.

This was obtained using the following Numpy configuration:
blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/qiita/miniconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/user/miniconda3/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/qiita/miniconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/user/miniconda3/include']
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/qiita/miniconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/user/miniconda3/include']
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/qiita/miniconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/qiita/miniconda3/include']

Author And Source

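More information on this topic (Speeding up NumPy matrix multiplication 300x on an M1 Mac (linking against OpenBLAS)) can be found at https://qiita.com/atksh/items/3022de521f55ae654793

Author attribution: information about the original author is available at the original URL. Copyright belongs to the original author.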
