InfoNCE & Metric Learning


Noise Contrastive Estimation & InfoNCE

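Instance discrimination in the unsupervised setting
a. Even without labels, if #samples = N, treating each sample as its own class (N-way classification) can still yield meaningful features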
(Unsupervised Feature Learning via Non-parametric Instance Discrimination)
b. what if N is too large? → Noise Contrastive Estimation
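To make the N-way view concrete, here is a minimal sketch of a softmax over similarities between one feature and N stored instance features. The function name `instance_probabilities`, the toy features, and the temperature value 0.1 are illustrative assumptions, not taken from the referenced paper's code.

```python
import math

def instance_probabilities(query, memory, temperature=0.1):
    """Softmax over similarities between one feature and all N instance features."""
    logits = [sum(q * m for q, m in zip(query, row)) / temperature for row in memory]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Three toy L2-normalized instance features (N = 3), one "class" per sample.
memory = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
query = [0.8, 0.6]  # assumed feature of an augmented view of instance 0

probs = instance_probabilities(query, memory)
loss = -math.log(probs[0])  # N-way cross-entropy with instance 0 as the label
```

The denominator sums over all N instances, so the cost per sample grows linearly with N. That is the motivation for question b.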

Noise Contrastive Estimation for Unsupervised feature learning(NCE)
a. Sample positive and negative pairs and classify them with a binary classifier
ex1). For images $x_1, x_2$ with augmented views $x_1', x_2'$, the pair $(x_1, x_1')$ is a positive sample and $(x_1, x_2)$ is a negative sample
ex2). When training word2vec, the loss is computed with negative sampling or hierarchical softmax instead of a full softmax.
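The binary-classification idea can be sketched as below. Note that this is the negative-sampling form (as in word2vec), a simplified variant of NCE that omits the noise-distribution correction term. The function names and toy vectors are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def binary_contrastive_loss(anchor, positive, negatives):
    """Logistic loss: label 1 for the positive pair, label 0 for each negative pair."""
    loss = -math.log(sigmoid(dot(anchor, positive)))
    for neg in negatives:
        # 1 - sigmoid(z) == sigmoid(-z)
        loss += -math.log(sigmoid(-dot(anchor, neg)))
    return loss

# Toy embeddings: x1_aug plays the role of x_1', x2 is a different image.
x1, x1_aug, x2 = [1.0, 0.0], [0.9, 0.1], [-1.0, 0.2]

loss_correct = binary_contrastive_loss(x1, x1_aug, [x2])  # (x1, x1') positive
loss_swapped = binary_contrastive_loss(x1, x2, [x1_aug])  # labels swapped
```

With these toy vectors the correctly labeled pairs give the smaller loss, since the anchor is close to its augmented view and far from the other image.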

InfoNCE loss
a. Uses categorical cross-entropy to identify the positive sample among multiple negative samples
b. formulation

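c. $f(x)^T f(x^+)$ represents the similarity between the positive pair (it equals cosine similarity when the embeddings are L2-normalized)
d. Using this method, SimCLR and MoCo achieved SOTA on various downstream tasks through unsupervised image representation learning (BYOL belongs to the same family of methods but trains without negative samples)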
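As a minimal sketch of points a to c, the loss can be written as a cross-entropy over one positive and several negatives, with the positive placed at index 0. The temperature value 0.1 and the function names are assumptions for illustration. SimCLR and MoCo each differ in how they build the negatives.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Cross-entropy over [positive, *negatives] with the positive at index 0."""
    a = l2_normalize(anchor)
    candidates = [l2_normalize(positive)] + [l2_normalize(n) for n in negatives]
    # After normalization the dot product is the cosine similarity.
    logits = [dot(a, c) / temperature for c in candidates]
    peak = max(logits)  # log-sum-exp trick for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[0]

anchor = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
loss_aligned = info_nce(anchor, [2.0, 0.1], negatives)      # positive near anchor
loss_misaligned = info_nce(anchor, [0.0, -1.0], negatives)  # positive orthogonal
```

If every candidate is identical to the anchor, the loss equals the log of the number of candidates, which is the chance-level value.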

Mathematical view of NCE & InfoNCE

(reference : Contrastive Predictive Coding)

Intuition
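a. Encode the information shared between different parts of high-dimensional data
ex). Information shared between nearby words in an article / information shared between adjacent image patches
b. Learn from context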

  1. target : image → context : augmented image
  2. target : image patch → context : adjacent image patches or pixels
  3. target : word → context : adjacent or preceding words
  4. target : video frame → context : adjacent video frames
  5. target : video clip → context : concurrent video transcript (sentence)
  6. target : image (when paired with caption) → context : paired caption

Why does InfoNCE maximize mutual information between target and context?
a. Let $X$ be a data instance, with $x$ = target, $c$ = context, and $x, c \in X$
ex). $X$: a single sentence
$x$ (target): a token adjacent to the context
$c$ (context): the tokens surrounding $x$ (fixed)
b. Basic mathematical Intuition
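The positive sample $x_{pos}$ is drawn from $p(x \mid c)$, while the negative samples, which do not correspond to context $c$, are drawn from $p(x)$. In this setting, the probability of correctly identifying the positive sample $x_{pos} = x_i$ among $N$ (= batch size) samples is

$$p(\text{pos} = i \mid X, c) = \frac{p(x_i \mid c)\prod_{l \neq i} p(x_l)}{\sum_{j=1}^{N} p(x_j \mid c)\prod_{l \neq j} p(x_l)} = \frac{p(x_i \mid c)/p(x_i)}{\sum_{j=1}^{N} p(x_j \mid c)/p(x_j)}$$

and the goal is to increase this probability.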
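A small numeric check may help here. Under the sampling assumption above (one sample from $p(x \mid c)$, the rest from $p(x)$), Bayes' rule gives the probability that sample $i$ is the positive as its density ratio $p(x_i \mid c)/p(x_i)$ divided by the sum of all the ratios. The toy distributions below are made-up numbers used only for illustration.

```python
import math

# Hypothetical toy distributions over three tokens (illustrative numbers only).
p_x = {"cat": 0.5, "dog": 0.3, "car": 0.2}            # marginal p(x)
p_x_given_c = {"cat": 0.7, "dog": 0.25, "car": 0.05}  # conditional p(x | c)

def positive_posterior(samples):
    """P(sample i is the positive | samples, c) via normalized density ratios."""
    ratios = [p_x_given_c[s] / p_x[s] for s in samples]
    total = sum(ratios)
    return [r / total for r in ratios]

def positive_posterior_explicit(samples):
    """Same quantity from the joint: p(x_i | c) * prod_{l != i} p(x_l), normalized."""
    joint = []
    for i, s in enumerate(samples):
        others = math.prod(p_x[t] for j, t in enumerate(samples) if j != i)
        joint.append(p_x_given_c[s] * others)
    total = sum(joint)
    return [j / total for j in joint]

samples = ["cat", "dog", "car"]
posterior = positive_posterior(samples)
posterior_explicit = positive_posterior_explicit(samples)
```

Both functions return the same probabilities, which shows that only the density ratios matter, not the individual densities.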

InfoNCE loss function formulation
$$\mathcal{L} = -\mathbb{E}_{x}\left[\log \frac{f(x,c)}{\sum_{x' \in X} f(x',c)}\right]$$
where $f(x,c) = \exp(v_x^T v_c)$, which models the density ratio $\frac{p(x \mid c)}{p(x)}$

How does minimizing the loss function above correspond to maximizing the mutual information between $x_{pos}$ and $c$?

$MI(x;c) = \sum_{x,c} p(x,c)\log\frac{p(x \mid c)}{p(x)}$, i.e. the expectation under $p(x,c)$ of $\log\frac{p(x \mid c)}{p(x)}$, where $\frac{p(x \mid c)}{p(x)}$ is called the density ratio

Modeling $f(x,c)$ as $p(x \mid c)$ directly: generative modeling (e.g. GPT, BERT)
Modeling $f(x,c) \propto \frac{p(x \mid c)}{p(x)}$: density-ratio modeling

Maximizing $f(x_{pos}, c)$ implies maximizing the density ratio, which in turn implies maximizing the mutual information between $x_{pos}$ and $c$.
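The link to mutual information can be stated more precisely as a bound. As a sketch following the Contrastive Predictive Coding paper referenced above, with $N$ the number of samples in a batch and $\mathcal{L}_N$ the InfoNCE loss computed over those $N$ samples (the subscript $N$ is notation added here):

```latex
I(x_{pos}; c) \;\ge\; \log N - \mathcal{L}_N
```

So minimizing the loss raises a lower bound on the mutual information, and the bound can only become tight for large values when $N$ is large, since $\mathcal{L}_N \ge 0$ caps it at $\log N$.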

Contrastive Learning Applications

Image Representation Learning(Unsupervised Setting)
a. dataset: a dataset consisting only of images
b. context-target

target : image patches → context : adjacent image patches

target : image → context : augmented images

c. ex). SimCLR, BYOL, MoCo

Vision-Language representation learning
a. dataset: a dataset consisting of image-caption pairs
b. context-target

context : image → target : paired caption

context : caption → target : paired image

c. ex). CLIP, ALIGN, FLAVA, Florence
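Because both directions are used (image → caption and caption → image), the loss is usually applied symmetrically over a batch of matched pairs. Below is a minimal sketch of that idea in the style described for CLIP. It assumes the embeddings are already L2-normalized, and the fixed temperature 0.07 and all function names are illustrative assumptions rather than any library's API.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross_entropy_row(logits, target):
    """-log softmax(logits)[target], computed with the log-sum-exp trick."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]

def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Average of image->text and text->image InfoNCE over matched pairs (i, i)."""
    n = len(image_embs)
    # sim[i][j]: scaled similarity between image i and caption j
    sim = [[dot(img, txt) / temperature for txt in text_embs] for img in image_embs]
    image_to_text = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    text_to_image = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)

images = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = symmetric_info_nce(images, [[1.0, 0.0], [0.0, 1.0]])
loss_shuffled = symmetric_info_nce(images, [[0.0, 1.0], [1.0, 0.0]])
```

Every other caption in the batch acts as a negative for a given image and vice versa, so no separate negative sampling step is needed.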

Video Representation learning
a. dataset: data consisting of clip-transcript (text) pairs, such as HowTo100M
b. context-target

target : clip → context : paired transcript
(or transcripts from -k time steps to +k time steps) → the idea proposed by MIL-NCE

c. ex). MIL-NCE, UniVL, MERLOT
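The idea of accepting several temporally close transcripts as positives can be sketched as follows: the numerator sums over all candidate positives instead of a single one. This is a simplified illustration for one clip that uses text negatives only. The function name and toy vectors are assumptions, and the original method also draws negatives from other clips.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def multi_positive_nce(clip_emb, positive_texts, negative_texts):
    """-log( sum_pos exp(s) / (sum_pos exp(s) + sum_neg exp(s)) ) for one clip."""
    pos = [dot(clip_emb, t) for t in positive_texts]
    neg = [dot(clip_emb, t) for t in negative_texts]
    peak = max(pos + neg)  # shift scores for numerical stability
    pos_sum = sum(math.exp(s - peak) for s in pos)
    neg_sum = sum(math.exp(s - peak) for s in neg)
    return -math.log(pos_sum / (pos_sum + neg_sum))

clip = [1.0, 0.0]
negatives = [[-1.0, 0.0]]
# Only a poorly aligned transcript is available as the positive.
loss_single = multi_positive_nce(clip, [[0.0, 1.0]], negatives)
# A neighbouring time step contributes a better aligned transcript.
loss_multi = multi_positive_nce(clip, [[0.0, 1.0], [1.0, 0.0]], negatives)
```

Summing over candidates means the clip only needs to match at least one of the nearby transcripts well, which tolerates misalignment between speech and video.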

Reference

More information on this topic (InfoNCE & Metric Learning) can be found at https://velog.io/@dongdori/InfoNCE-Metric-Learning
