Algorithm Selection on x86_64 vs AArch64 Part I

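In this post I look at benchmarking several programs that use different algorithms to scale audio volume. I will benchmark five different algorithms that act on an incoming stream of samples. Scaling audio in real time on a 48 kHz signal with 16-bit samples means handling 96,000 bytes or more of data per second, so efficiency is key to making sure nothing is lost or delayed. With that in mind, let's look at a few ways to scale audio volume.
The incoming samples are simulated as follows: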

void vol_createsample(int16_t* sample, int32_t sample_count) {
        int i;
        for (i=0; i<sample_count; i++) {
                sample[i] = (rand()%65536)-32768;
        }
        return;
}

Algorithm 1 - vol0.c - Naive

int16_t scale_sample(int16_t sample, int volume) {

        return (int16_t) ((float) (volume/100.0) * (float) sample);
}

int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
        for (x = 0; x < SAMPLES; x++) {
                out[x]=scale_sample(in[x], VOLUME);
        }

// ---- This part sums the samples. (Why is this needed?)
        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples. (Why is this needed?)
        printf("Result: %d\n", ttl);

 return 0;
}

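This first algorithm takes the naive route of multiplying each sample by the scale factor. That converts each integer sample to floating point and back again, which is very expensive at this scale. I'm going to go out on a limb and say this can be made more efficient; I predict this one will perform the worst. (Also note: the summing and printing parts of the code are needed so that the compiler does not optimize away the actual sample-scaling part of the program.)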

Algorithm 2 - vol1.c - Fixed-point

int16_t scale_sample(int16_t sample, int volume) {

        return ((((int32_t) sample) * ((int32_t) (32767 * volume / 100) <<1) ) >> 16);
}

int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
        for (x = 0; x < SAMPLES; x++) {
                out[x]=scale_sample(in[x], VOLUME);
        }

// ---- This part sums the samples.
        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples.
        printf("Result: %d\n", ttl);

        return 0;
}

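This algorithm avoids the floating-point conversion that bogs down the previous code; instead it does an integer multiply followed by a bit shift. That should save time compared with the last algorithm, but it will probably still be the second or third slowest.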

Algorithm 3 - vol2.c - Precalculated

int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

        static int16_t* precalc;

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]

        precalc = (int16_t*) calloc(65536,2);
        if (precalc == NULL) {
                printf("malloc failed!\n");
                return 1;
        }

        for (x = -32768; x <= 32767; x++) {
 // Q: What is the purpose of the cast to uint16_t in the next line?
                precalc[(uint16_t) x] = (int16_t) ((float) x * VOLUME / 100.0);
        }

        for (x = 0; x < SAMPLES; x++) {
                out[x]=precalc[(uint16_t) in[x]];
        }

// ---- This part sums the samples.
        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples.
        printf("Result: %d\n", ttl);

        return 0;
}

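This algorithm precalculates the result for all 65,536 possible sample values (-32768 through 32767), so the program only needs to look up the result for each sample. That produces a 128 KB table covering every possible 16-bit value, which is small compared with the size of an audio file. Performance in this case hinges on how quickly the 128 KB of table data can be fetched through the cache compared with how fast the math units could do the calculation. Once again, I think this may be the second or third slowest algorithm.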

Algorithm 4 - vol4.c - Inline SIMD

int main() {

#ifndef __aarch64__
        printf("Wrong architecture - written for aarch64 only.\n");
#else


        // these variables will also be accessed by our assembler code
        int16_t*        in_cursor;              // input cursor
        int16_t*        out_cursor;             // output cursor
        int16_t         vol_int;                // volume as int16_t

        int16_t*        limit;                  // end of input array

        int             x;                      // array iterator
        int             ttl=0 ;                 // array total

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]


        // set vol_int to fixed-point representation of the volume factor
        // Q: should we use 32767 or 32768 in next line? why?
        vol_int = (int16_t)(VOLUME/100.0 * 32767.0);

        // Q: what is the purpose of these next two lines?
        in_cursor = in;
        out_cursor = out;
        limit = in + SAMPLES;

        // Q: what does it mean to "duplicate" values in the next line?
        __asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h

        while ( in_cursor < limit ) {
                __asm__ (
                        "ldr q0, [%[in_cursor]], #16    \n\t"
                        // load eight samples into q0 (same as v0.8h)
                        // from [in_cursor]
                        // post-increment in_cursor by 16 bytes
                        // and store back into the pointer register


                        "sqrdmulh v0.8h, v0.8h, v1.8h   \n\t"
                        // with 32 signed integer output,
                        // multiply each lane in v0 * v1 * 2
                        // saturate results
                        // store upper 16 bits of results into
                        // the corresponding lane in v0

                        "str q0, [%[out_cursor]],#16            \n\t"
                        // store eight samples to [out_cursor]
                        // post-increment out_cursor by 16 bytes
                        // and store back into the pointer register

                        // Q: What do these next three lines do?
                        : [in_cursor]"+r"(in_cursor), [out_cursor]"+r"(out_cursor)
                        : "r"(in_cursor),"r"(out_cursor)
                        : "memory"
                        );
        }

// --------------------------------------------------------------------

        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

        // Q: are the results usable? are they correct?
        printf("Result: %d\n", ttl);

        return 0;

#endif
}

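This algorithm uses inline assembly to process multiple values at once with SIMD. As such, it will almost certainly perform better than the previous algorithms. Because the inline assembly uses AArch64 SIMD instructions, it only builds on AArch64 systems, so we will have to see how it performs there and leave x86_64 out of the benchmarking for this algorithm.
There are five points marked with "Q" in the code, which I address below.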

// Q: should we use 32767 or 32768 in next line? why?
        vol_int = (int16_t)(VOLUME/100.0 * 32767.0);

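(1) The value is multiplied by 32767 rather than 32768 to prevent integer overflow: 32767 is the largest value an int16_t can hold, so at a VOLUME of 100 a factor of 32768 would not fit.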

// Q: what is the purpose of these next two lines?
        in_cursor = in;
        out_cursor = out;

(2) in_cursor and out_cursor are set to point to the first elements of the in and out arrays. The loop below uses them to read samples from in and to write the scaled samples to out, respectively.

 // Q: what does it mean to "duplicate" values in the next line?
        __asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h

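(3) vol_int represents the volume as a signed 16-bit integer. The dup instruction duplicates that scaling factor from a 32-bit general-purpose register (operand %w0) into every 16-bit lane of the vector register v1.8h.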

// Q: What do these next three lines do?
                        : [in_cursor]"+r"(in_cursor), [out_cursor]"+r"(out_cursor)
                        : "r"(in_cursor),"r"(out_cursor)
                        : "memory"

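(4) These three lines are the operand lists of the second asm statement in this program. The first line lists the output operands, the second the input operands, and the last the clobbers, which here is memory.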

// Q: are the results usable? are they correct?
        printf("Result: %d\n", ttl);

(5) The results here should be usable, because the sqrdmulh instruction above saturates its results, which prevents overflow.

Algorithm 5 - vol5.c - Intrinsics

int main() {

#ifndef __aarch64__
        printf("Wrong architecture - written for aarch64 only.\n");
#else

        register int16_t*       in_cursor       asm("r20");     // input cursor (pointer)
        register int16_t*       out_cursor      asm("r21");     // output cursor (pointer)
        register int16_t        vol_int         asm("r22");     // volume as int16_t

        int16_t*                limit;          // end of input array

        int                     x;              // array iterator
        int                     ttl=0;          // array total

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]

        vol_int = (int16_t) (VOLUME/100.0 * 32767.0);

        in_cursor = in;
        out_cursor = out;
        limit = in + SAMPLES ;

        while ( in_cursor < limit ) {
                // What do these intrinsic functions do?
                // (See gcc intrinsics documentation)
                vst1q_s16(out_cursor, vqrdmulhq_s16(vld1q_s16(in_cursor), vdupq_n_s16(vol_int)));

                // Q: Why is the increment below 8 instead of 16 or some other value?
                // Q: Why is this line not needed in the inline assembler version
                // of this program?
                in_cursor += 8;
                out_cursor += 8;
        }

// --------------------------------------------------------------------

        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

        // Q: Are the results usable? Are they accurate?
        printf("Result: %d\n", ttl);

        return 0;
#endif
}

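This last algorithm also uses SIMD, but through compiler intrinsics rather than inline assembler. It should benefit from the same parallel processing as the previous algorithm, so I expect either this one or the previous one to come out on top. Again, a few sections are flagged for clarification, which I will do now.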

// What do these intrinsic functions do?
                // (See gcc intrinsics documentation)
                vst1q_s16(out_cursor, vqrdmulhq_s16(vld1q_s16(in_cursor), vdupq_n_s16(vol_int)));

(1) These intrinsic functions are equivalent to the instructions used in the last program: vst1q_s16 corresponds to str, vqrdmulhq_s16 to sqrdmulh, vld1q_s16 to ldr, and vdupq_n_s16 to dup.

// Q: Why is the increment below 8 instead of 16 or some other value?
                // Q: Why is this line not needed in the inline assembler version
                // of this program?
                in_cursor += 8;

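(2) The pointers are incremented by 8 because each intrinsic call processes eight elements at a time. Pointer arithmetic on an int16_t* counts elements, so 8 elements is the same 16 bytes the assembler version stepped by. In the inline assembler program the pointers were incremented for us by the post-indexed ldr and str instructions, but here we have to do it manually.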

// Q: Are the results usable? Are they accurate?
        printf("Result: %d\n", ttl);

(3) Once again, because we are using the intrinsic equivalent of sqrdmulh, the results are saturated and potential overflow is avoided, so the output should be reliable.
In the next post I will put these algorithms to the test and benchmark them to find out which is fastest. More soon!

Reference

More information on this topic (Algorithm Selection on x86_64 vs AArch64 Part I) can be found here: https://dev.to/gusmccallum/algorithm-selection-on-x8664-vs-aarch64-part-i-5ff6
