Intelligent Computing Systems, Experiment 1: BANGC Operator Implementation and TensorFlow Integration

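Experiment Overview and Notes

This lab is one of the experiments that accompany the book "Intelligent Computing Systems" by Chen Yunji of Cambricon.
In this lab you develop an operator in the intelligent programming language (BANGC), use it to extend the operators of the high-performance library (CNML), and finally integrate it into the programming framework (TensorFlow). The aim is to learn how to extend both the high-performance library and the programming framework, so that you can design and optimize new operators for specific application scenarios on DLP hardware and keep up with the practical needs of fast-moving intelligent algorithms.
This post does not give direct answers to the lab. It describes the main pitfalls in the lab instead. As I understand it, the main goal of the lab is to get you familiar with developing MLU operators in BANGC, and to show how much acceleration intelligent hardware delivers for specific workloads such as neural networks. However, probably because of limits on space and effort, the lab document that ships with the book and the documentation on the Cambricon official site do not clearly explain how some interfaces are used, and I ran into many problems while doing the lab. So this post covers the parts that the documentation neither introduces nor explains, without reducing the overall difficulty of the lab.
For the concrete lab steps, refer to the official Cambricon lab guide.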

Experiment Content

For the preliminary steps, setting up the lab environment and applying for a lab server, simply follow the official tutorial.

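Operator Implementation

In this section you implement the PowerDifference operator in the intelligent programming language BCL (BANGC). That is, the following two files

/opt/AICSE-demo-student/demo/style_transfer_bcl/src/bangc/PluginPowerDifferenceOp/plugin_power_difference_kernel.h   plugin_power_difference_kernel.mlu

must be completed.
No interface in this part needs special attention, but readers who bought the book "Intelligent Computing Systems" should be careful: the language presented in the book is an unofficial DLP language meant for teaching, not BANGC. Use the book only as a reference for the development logic, and check the official BANGC documentation for how each interface is actually used. For example, the book introduces vector multiplication with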

__vec_mul(output, input_1, input_2, LEN)

whereas the corresponding BANGC code is

__bang_mul(output, input_1, input_2, LEN)

Also pay attention to the order and types of the operator's parameters, because they determine the order in which parameters must be passed in every later step.

__mlu_entry__ void PowerDifferenceKernel(A, B, C, D, E)

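Look at PowerDiff.cpp under the same path. This is the single-operator test program used later, and it passes the parameters in this order: input mlu_input1, input mlu_input2, exponent pow, output mlu_output, and vector length dims_a. I recommend using this given parameter order and these data types, which reduces the amount of code you have to change later.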

 cnrtKernelParamsBufferAddParam(params, &mlu_input1, sizeof(half*));
 cnrtKernelParamsBufferAddParam(params, &mlu_input2, sizeof(half*));
 cnrtKernelParamsBufferAddParam(params, &pow, sizeof(int));
 cnrtKernelParamsBufferAddParam(params, &mlu_output, sizeof(half*));
 cnrtKernelParamsBufferAddParam(params, &dims_a, sizeof(int));
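
To make the parameter order concrete, here is a minimal host-side model of the kernel in plain Python. It is not BANGC code. It assumes that PowerDifference computes the element-wise value (input1 - input2) ** pow, which this post does not state explicitly, and that the kernel walks the vector in fixed-size chunks, as a limited on-chip buffer would require. The names `power_difference_ref`, `vec_sub`, `vec_mul`, and `CHUNK` are hypothetical; `vec_mul` only imitates the (output, input_1, input_2, LEN) argument shape of `__bang_mul`.

```python
CHUNK = 64  # hypothetical on-chip buffer size, in elements


def vec_sub(output, input_1, input_2, length):
    """Element-wise output[i] = input_1[i] - input_2[i] for i < length."""
    for i in range(length):
        output[i] = input_1[i] - input_2[i]


def vec_mul(output, input_1, input_2, length):
    """Element-wise output[i] = input_1[i] * input_2[i] for i < length."""
    for i in range(length):
        output[i] = input_1[i] * input_2[i]


def power_difference_ref(input1, input2, pow, output, dims_a):
    """Reference model: output[i] = (input1[i] - input2[i]) ** pow.

    The argument order mirrors the kernel launch in the test program:
    input1, input2, pow, output, dims_a. Assumes pow is a non-negative
    integer (the name shadows Python's built-in pow inside this function).
    """
    for start in range(0, dims_a, CHUNK):
        n = min(CHUNK, dims_a - start)
        # Stage one chunk of each input, as a kernel would copy it to local memory.
        a = input1[start:start + n]
        b = input2[start:start + n]
        diff = [0.0] * n
        vec_sub(diff, a, b, n)
        # Raise to the power by repeated element-wise multiplication.
        acc = [1.0] * n
        for _ in range(pow):
            vec_mul(acc, acc, diff, n)
        output[start:start + n] = acc


x = [float(i) for i in range(100)]
y = [1.0] * 100
out = [0.0] * 100
power_difference_ref(x, y, 2, out, len(x))
print(out[:4])  # [1.0, 0.0, 1.0, 4.0]
```

A model like this is also handy as the expected result when checking the real kernel's output in the operator test.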

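Operator Test

In this section you test the PowerDifference operator on its own to make sure it works correctly.
Complete the cpp file that needs completing, then run ./make.sh
Usage notes and examples for the interfaces in this part can be found in the official CNRT manual, so they need no further explanation here.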

cnplugin Integration

This part wraps the PowerDifference operator through the PluginOp interface of the high-performance library. The main work is to complete plugin_power_difference_op.cc and cnplugin.h, then compile a new Cambricon-CNPlugin.
Three functions need to be completed in this section: cnmlCreatePluginPowerDifferenceOpParam, cnmlCreatePluginPowerDifferenceOp, and cnmlComputePluginPowerDifferenceOpForward.

cnmlCreatePluginPowerDifferenceOpParam

This function mainly defines the operator's input parameters. You can fill it in by following the corresponding part of the other operators under /opt/AICSE-demo-student/env/Cambricon-CNPlugin-MLU270/pluginops/.

cnmlCreatePluginPowerDifferenceOp

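This function creates the op for the operator. Again, follow the corresponding part of the other operators under /opt/AICSE-demo-student/env/Cambricon-CNPlugin-MLU270/pluginops/.
In addition, the file /opt/AICSE-demo-student/demo/style_transfer_bcl/src/tf-implementation/tf-add-power-diff/mlu_lib_ops.cc contains the code below, which calls the cnmlCreatePluginPowerDifferenceOp function you have to write. The way it is called there shows which parameters your function must accept.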

tensorflow::Status CreatePowerDifferenceOp(MLUBaseOp** op, MLUTensor* input1,
                                             MLUTensor* input2,
                                             int input3,
                                             MLUTensor* output, int len) {
  MLUTensor* inputs_ptr[2] = {input1, input2};
  MLUTensor* outputs_ptr[1] = {output};


  CNML_RETURN_STATUS(cnmlCreatePluginPowerDifferenceOp(op, inputs_ptr, input3, outputs_ptr, len));
}

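As for cnmlCreatePluginOp, which has to be called inside the implementation, I worked out the meaning of its parameters by trial and error.

cnmlCreatePluginOp(cnmlBaseOp**, const char*, void*, cnrtKernelParamsBuffer_t, cnmlTensor**, int, cnmlTensor**, int, cnmlTensor**, int)

The first parameter is the op, the second is the op name, the third is the kernel, the fourth is the kernel parameter buffer, the fifth is the head pointer of the input tensor array, the sixth is the number of input tensors (hence int), the seventh is the head pointer of the output tensor array, and the eighth is the number of output tensors. For the last two, passing a null pointer and 0 is enough.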

cnmlComputePluginPowerDifferenceOpForward

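This function implements the operator's forward computation. Follow the corresponding part of the other operators under /opt/AICSE-demo-student/env/Cambricon-CNPlugin-MLU270/pluginops/.
In addition, the file /opt/AICSE-demo-student/demo/style_transfer_bcl/src/tf-implementation/tf-add-power-diff/mlu_lib_ops.cc contains the code below, which calls the cnmlComputePluginPowerDifferenceOpForward function you have to write. The way it is called there shows which parameters your function must accept.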

tensorflow::Status ComputePowerDifferenceOp(MLUBaseOp* op,
                                              MLUCnrtQueue* queue, void* input1,
                                              void* input2, void* output) {
  void* inputs_ptr[2] = {input1, input2};
  void* outputs_ptr[1] = {output};
  CNML_RETURN_STATUS(cnmlComputePluginPowerDifferenceOpForward(
                                         op, inputs_ptr, outputs_ptr, queue));
}

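As for cnmlComputePluginOpForward_V3, which has to be called inside the implementation, I worked out the meaning of its parameters by trial and error.

cnmlComputePluginOpForward_V3(cnmlBaseOp_t, cnmlTensor**, int, cnmlTensor**, int, void*, cnrtQueue_t)

The first parameter is the op, the second is the head pointer of the input tensor array, the third is the number of input tensors, the fourth is the head pointer of the output tensor array, the fifth is the number of output tensors, the sixth is a null pointer, and the seventh is the task queue.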

TensorFlow Operator Integration

This part integrates the wrapped operator into the TensorFlow programming framework. The files under /opt/AICSE-demo-student/demo/style_transfer_bcl/src/tf-implementation/tf-add-power-diff/ are added one by one to the TensorFlow source tree (source path: /opt/AICSE-demo-student/env/tensorflow-v1.10/).
For the exact rule for adding each file, see /opt/AICSE-demo-student/demo/style_transfer_bcl/src/tf-implementation/tf-add-power-diff/readme.txt.
Once the files are added, the source must be recompiled. Before compiling, edit the build script /opt/AICSE-demo-student/env/tensorflow-v1.10/build_tensorflow-v1.10_mlu.sh and change jobs_num to 16. Otherwise compile errors are likely.

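Framework Operator Test

In this section you use the framework API to test the operator that was integrated into TensorFlow in the previous step and make sure it works correctly.
The main work is to complete .../src/online_mlu/power_difference_test_bcl.py and .../src/online_cpu/power_difference_test_cpu.py, then run: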

python power_difference_test_xxx.py

The most important piece of code to complete in this part is the call to tf.power_difference. Its usage is described in /opt/AICSE-demo-student/env/tensorflow-v1.10/virtualenv_mlu/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py:

def power_difference(x, y, pow, name=None):
  r"""TODO: add doc.

  Args:
    x: A `Tensor`. Must be one of the following types: `bfloat16`, `float32`, `half`, `float64`, `int32`, `int64`, `complex64`, `complex128`.
    y: A `Tensor`. Must have the same type as `x`.
    pow: A `Tensor`. Must have the same type as `x`.
    name: A name for the operation (optional).

  Returns:
    A `Tensor`. Has the same type as `x`.
  """
  _ctx = _context._context or _context.context()
  if _ctx is not None and _ctx._thread_local_data.is_eager:
    try:
      _result = _pywrap_tensorflow.TFE_Py_FastPathExecute(
        _ctx._context_handle, _ctx._thread_local_data.device_name,
        "PowerDifference", name, _ctx._post_execution_callbacks, x, y, pow)
      return _result
    except _core._FallbackException:
      try:
        return power_difference_eager_fallback(
            x, y, pow, name=name, ctx=_ctx)
      except _core._SymbolicException:
        pass  # Add nodes to the TensorFlow graph.
      except (TypeError, ValueError):
        result = _dispatch.dispatch(
              power_difference, x=x, y=y, pow=pow, name=name)
        if result is not _dispatch.OpDispatcher.NOT_SUPPORTED:
          return result
        raise
    except _core._NotOkStatusException as e:
      if name is not None:
        message = e.message + " name: " + name
      else:
        message = e.message
      _six.raise_from(_core._status_to_exception(e.code, message), None)
  # Add nodes to the TensorFlow graph.
  try:
    _, _, _op = _op_def_lib._apply_op_helper(
        "PowerDifference", x=x, y=y, pow=pow, name=name)
  except (TypeError, ValueError):
    result = _dispatch.dispatch(
          power_difference, x=x, y=y, pow=pow, name=name)
    if result is not _dispatch.OpDispatcher.NOT_SUPPORTED:
      return result
    raise
  _result = _op.outputs[:]
  _inputs_flat = _op.inputs
  _attrs = ("T", _op.get_attr("T"))
  _execute.record_gradient(
      "PowerDifference", _inputs_flat, _attrs, _result, name)
  _result, = _result
  return _result

This function invokes the operator that was integrated into TF in the previous step. If everything is correct, the final output looks like this:

comput BCL op cost 294.717073441ms
comput op cost 225.753068924ms
err rate= 5.8923983536261504e-06
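
The log above reports a wall-clock cost for each run and an error rate between the two results. The lab's own scripts define the exact formulas. As a rough sketch, assuming the error rate is the sum of absolute differences divided by the sum of absolute reference values, and assuming the operator computes (x - y) ** power element-wise, the check could look like this in plain Python. `error_rate`, `timed_ms`, and `cpu_power_difference` are hypothetical helpers, and the sample values are made up purely to exercise the code.

```python
import time


def error_rate(result, reference):
    """Relative L1 error: sum(|result - reference|) / sum(|reference|)."""
    diff = sum(abs(r - e) for r, e in zip(result, reference))
    total = sum(abs(e) for e in reference)
    return diff / total


def timed_ms(fn, *args):
    """Run fn(*args) once and return (value, elapsed milliseconds)."""
    start = time.time()
    value = fn(*args)
    return value, (time.time() - start) * 1000.0


def cpu_power_difference(x, y, power):
    """CPU reference, assuming the op computes (x - y) ** power element-wise."""
    return [(a - b) ** power for a, b in zip(x, y)]


x = [0.5, 1.5, 2.5, 3.5]
y = [1.0, 1.0, 1.0, 1.0]
reference, cost_ms = timed_ms(cpu_power_difference, x, y, 2)
# Stand-in for a device result: the reference with a small perturbation.
device_like = [v * (1.0 + 1e-6) for v in reference]
print("comput op cost %sms" % cost_ms)
print("err rate= %s" % error_rate(device_like, reference))
```

Note that this error measure divides by the sum of the reference values, so it is undefined when the reference is all zeros.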
