Improving BogoMIPS

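When I ported linux 3.0 and linux 3.4.2 to the S5PV210 platform, the reported BogoMIPS value
[ 0.000083] Calibrating delay loop... 997.78 BogoMIPS (lpj=2494464)
was close to 1000 in both cases.
However, the ports of linux-3.7, linux-3.9.7, linux-3.9.11 and linux-3.10.28 that I tested all report only a little over 600:
Calibrating delay loop... 663.55 BogoMIPS (lpj=1658880)
With a gap this large, I was worried that the clock port or some other problem was degrading performance.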
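To compare the two log lines it helps to see how the printed number follows from lpj. The sketch below mirrors the integer arithmetic the kernel uses when it prints the calibration result: lpj/(500000/HZ) for the integer part and (lpj/(5000/HZ)) % 100 for the fraction. HZ=200 is an assumption inferred from the numbers in the logs above, not read from the board configuration, and the function names are hypothetical.

```c
#include <stdio.h>

/* Assumed tick rate: HZ=200 reproduces both log lines above. */
#define ASSUMED_HZ 200UL

/* Integer part of the BogoMIPS value, computed from loops per jiffy. */
unsigned long bogomips_int(unsigned long lpj)
{
	return lpj / (500000UL / ASSUMED_HZ);
}

/* Two-digit fractional part of the BogoMIPS value. */
unsigned long bogomips_frac(unsigned long lpj)
{
	return (lpj / (5000UL / ASSUMED_HZ)) % 100UL;
}

/* Write "NNN.FF" into buf; returns what snprintf returns. */
int bogomips_format(unsigned long lpj, char *buf, size_t len)
{
	return snprintf(buf, len, "%lu.%02lu",
			bogomips_int(lpj), bogomips_frac(lpj));
}
```

With lpj=2494464 this gives 997.78 and with lpj=1658880 it gives 663.55, matching the two logs. So the drop in BogoMIPS is simply a drop in lpj, by a factor of about 1.5.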
Debugging:
The calibration function:

static unsigned long __cpuinit calibrate_delay_converge(void)
{
	/* First stage - slowly accelerate to find initial bounds */
	unsigned long lpj, lpj_base, ticks, loopadd, loopadd_base, chop_limit;
	int trials = 0, band = 0, trial_in_band = 0;

	lpj = (1<<12);

	/* wait for "start of" clock tick */
	ticks = jiffies;
	while (ticks == jiffies)
		; /* nothing */
	/* Go .. */
	ticks = jiffies;
	do {
		if (++trial_in_band == (1<<band)) {
			++band;
			trial_in_band = 0;
		}
		__delay(lpj * band);
		trials += band;
	} while (ticks == jiffies);
	/*
	 * We overshot, so retreat to a clear underestimate. Then estimate
	 * the largest likely undershoot. This defines our chop bounds.
	 */
	trials -= band;
	loopadd_base = lpj * band;
	lpj_base = lpj * trials;

recalibrate:
	lpj = lpj_base;
	loopadd = loopadd_base;

	/*
	 * Do a binary approximation to get lpj set to
	 * equal one clock (up to LPS_PREC bits)
	 */
	chop_limit = lpj >> LPS_PREC;
	while (loopadd > chop_limit) {
		lpj += loopadd;
		ticks = jiffies;
		while (ticks == jiffies)
			; /* nothing */
		ticks = jiffies;
		__delay(lpj);
		if (jiffies != ticks)	/* longer than 1 tick */
			lpj -= loopadd;
		loopadd >>= 1;
	}
	/*
	 * If we incremented every single time possible, presume we've
	 * massively underestimated initially, and retry with a higher
	 * start, and larger range. (Only seen on x86_64, due to SMIs)
	 */
	if (lpj + loopadd * 2 == lpj_base + loopadd_base * 2) {
		lpj_base = lpj;
		loopadd_base <<= 2;
		goto recalibrate;
	}

	return lpj;
}
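The second stage of this function is easier to follow in isolation. Below is a user-space sketch of the same successive approximation, with the timer replaced by a comparison against a known "true" loops-per-tick value. simulate_converge, longer_than_one_tick and true_lpj are hypothetical names, and LPS_PREC is assumed to be 8.

```c
/* Assumed precision in bits; the kernel defines its own LPS_PREC. */
#define LPS_PREC 8

/*
 * Stand-in for "__delay(lpj) took longer than one tick": in this
 * simulation a delay overshoots when lpj exceeds the true value.
 */
static int longer_than_one_tick(unsigned long lpj, unsigned long true_lpj)
{
	return lpj > true_lpj;
}

/*
 * Refine an underestimate lpj_base by adding halving steps, keeping
 * each step only if the delay still fits within one tick.
 */
unsigned long simulate_converge(unsigned long lpj_base,
				unsigned long loopadd_base,
				unsigned long true_lpj)
{
	unsigned long lpj = lpj_base;
	unsigned long loopadd = loopadd_base;
	unsigned long chop_limit = lpj >> LPS_PREC;

	while (loopadd > chop_limit) {
		lpj += loopadd;
		if (longer_than_one_tick(lpj, true_lpj))
			lpj -= loopadd;	/* overshoot: undo this step */
		loopadd >>= 1;
	}
	return lpj;
}
```

As long as true_lpj lies between lpj_base and lpj_base + 2 * loopadd_base, the result never exceeds true_lpj and ends up within the last applied step of it. The measured lpj therefore depends only on how fast __delay() runs relative to the tick.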

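Running
ticks = jiffies; while (ticks == jiffies)
gives the same tick behaviour on both kernels, which rules out the CPU clock and the cycle time of a single instruction.
That leaves only one suspect:
__delay(lpj * band);
In the linux 3.4.2 kernel it is implemented in arch/arm/lib/delay.S: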

/*
 * loops = r0 * HZ * loops_per_jiffy / 1000000
 *
 * Oh, if only we had a cycle counter...
 */

@ Delay routine
ENTRY(__delay)
		subs	r0, r0, #1
#if 0
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
#endif
		bhi	__delay
		mov	pc, lr

ENDPROC(__delay)

In linux 3.9.7 and linux 3.13 it is implemented in delay-loop.S (the linux 3.13 version is shown here):

/*
 * loops = r0 * HZ * loops_per_jiffy / 1000000
 */
		.align 3

@ Delay routine
ENTRY(__loop_delay)
		subs	r0, r0, #1
#if 0
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
#endif
		bhi	__loop_delay
		mov	pc, lr

ENDPROC(__loop_delay)

Comparing the two, I found that linux 3.13 adds an alignment directive:
.align 3
After adding the same alignment directive to linux 3.9.7, BogoMIPS went from 663 to 994.
I was overjoyed.
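For reference, in GNU as for ARM the operand of .align is a power of two, so .align 3 places the next instruction on a 2^3 = 8-byte boundary. A plausible explanation for the jump, which I have not measured on this board, is that the Cortex-A8 (CPU part 0xc08) can issue the subs/bhi pair together only when the pair starts on an 8-byte boundary; the lpj ratio 2494464/1658880, about 1.5, would be consistent with that. The sketch below only illustrates the address arithmetic, and both helper names are hypothetical.

```c
#include <stdint.h>

/* Round addr up to the next multiple of 2^log2_align bytes. */
uintptr_t align_up_pow2(uintptr_t addr, unsigned int log2_align)
{
	uintptr_t mask = ((uintptr_t)1 << log2_align) - 1;

	return (addr + mask) & ~mask;
}

/* Nonzero if a loop entry at entry_addr sits on an 8-byte boundary. */
int loop_is_8byte_aligned(uintptr_t entry_addr)
{
	return (entry_addr & 7u) == 0;
}
```

Without the directive, the entry of __loop_delay is only guaranteed to be 4-byte aligned, so whether it happens to land on an 8-byte boundary depends on whatever code precedes it in the image.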

[liujia@210]#cat /proc/cpuinfo
processor       : 0
model name      : ARMv7 Processor rev 2 (v7l)
BogoMIPS        : 997.78
Features        : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x2
CPU part        : 0xc08
CPU revision    : 2

Hardware        : SMDKV210
Revision        : 0000
Serial          : 0000000000000000

However, when I tested real applications, I found that actual performance had not improved.
I found the following explanation online (http://www.linux-mips.org/wiki?title=BogoMIPS&oldid=6231):
From LinuxMIPS
Revision as of 13:32, 4 April 2005 by Ralf
BogoMIPS used to be the infamous, prestigious benchmark for Linux machines over a decade. Unfortunately - or fortunately - depending of point of view - the BogoMIPS number of the favorite machine BogoMIPS have little to nothing to do with actual processor performance. Certain other microarchectural details will also very overproprotionally influence the benchmark. On the other side memory performance, I/O performance, cache size and speed and many other processor and system architecture feature that make a crucial difference for system performance will not influence BogoMIPS at all. The BogoMIPS number for any given processor architecture is basically proportional to the clock rate. On most processor architectures the BogoMIPS loop is compiled into just two instructions. Accordingly small is the aspects of a processor that are actually tested. And processors again are just a small part of an overall system which includes other hardware and software. To show the actual code on MIPS:

          .set    noreorder
  loop:   bnez    $reg, loop
          subu    $reg, 1
          .set    reorder

A typical modern machine with efficient branches or branch prediction can execute this loop at a rate of one instruction per cycle. Out of Order Execution which provides roughly a 50% speedup on real workloads provides no benefit. Not even second level caches or memory subsystems are exercised. The more surprising it is that BogoMIPS have become a benchmark for performance as important as extra inches in spam email. Having been a permanent annoynce over the years due to miss-interpretation by users and due to excessive output on multiprocessor machines Linux by default will no longer print the BogoMIPS number since 2.6.9-rc2.
The purpose of the BogoMIPS benchmark is to calibrate internal delay loops which are used for very short delays or in situations where a process can't sleep. This is done by calling the mdelay(), udelay() and ndelay() functions which take the time to delay as the argument in units of milliseconds, microseconds or nanoseconds, respectivly.
So it seems that BogoMIPS does not reflect actual performance.
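What lpj is actually used for can be sketched from the formula in the delay.S comment above (loops = usecs * HZ * loops_per_jiffy / 1000000). The function below is a simplified illustration with a hypothetical name; as far as I know the real ARM udelay() uses a precomputed fixed-point multiplier rather than a 64-bit division. HZ=200 in the usage note is again an assumption inferred from the boot logs.

```c
#include <stdint.h>

/* Delay-loop iterations needed to busy-wait for usecs microseconds. */
uint64_t udelay_loops(uint64_t usecs, uint64_t hz, uint64_t lpj)
{
	return usecs * hz * lpj / 1000000ULL;
}
```

For a 10 microsecond delay at HZ=200 this gives 4988 iterations with lpj=2494464 and 3317 iterations with lpj=1658880. Both busy-wait for roughly the same wall-clock time, because each iteration is correspondingly faster or slower. The alignment fix changes how fast this one loop runs, not how fast applications run.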
