MongoDB Replica Set Configuration Series, Part 11: How MongoDB Data Synchronization and Automatic Failover Work
1: How data synchronization works:
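Synchronization between the Primary and the Secondary:
1: Writes on the Primary are recorded in the oplog.rs collection of its local database.
2: The Secondary reads the new entries from the Primary's local oplog.rs.
3: The Secondary writes those entries to its own oplog.rs and applies them.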
2: Viewing replica set information
gechongrepl:PRIMARY> rs.status()
{
"set" : "gechongrepl",
"date" : ISODate("2015-07-02T02:38:15Z"),
"myState" : 1,
"members" : [
{
"_id" : 6,
"name" : "192.168.91.144:27017",
"health" : 1,
"state" : 7,
"stateStr" : "ARBITER",
"uptime" : 1678,
"lastHeartbeat" : ISODate("2015-07-02T02:38:14Z"),
"lastHeartbeatRecv" : ISODate("2015-07-02T02:38:14Z"),
"pingMs" : 1
},
{
"_id" : 10,
"name" : "192.168.91.135:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 1678,
"optime" : Timestamp(1435803750, 1),
"optimeDate" : ISODate("2015-07-02T02:22:30Z"),
"lastHeartbeat" : ISODate("2015-07-02T02:38:14Z"),
"lastHeartbeatRecv" : ISODate("2015-07-02T02:38:13Z"),
"pingMs" : 1,
"syncingTo" : "192.168.91.148:27017"
},
{
"_id" : 11,
"name" : "192.168.91.148:27017",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 1698,
"optime" : Timestamp(1435803750, 1),
"optimeDate" : ISODate("2015-07-02T02:22:30Z"),
"electionTime" : Timestamp(1435803023, 1),
"electionDate" : ISODate("2015-07-02T02:10:23Z"),
"self" : true
},
{
"_id" : 12,
"name" : "192.168.91.134:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 1655,
"optime" : Timestamp(1435803750, 1),
"optimeDate" : ISODate("2015-07-02T02:22:30Z"),
"lastHeartbeat" : ISODate("2015-07-02T02:38:14Z"),
"lastHeartbeatRecv" : ISODate("2015-07-02T02:38:14Z"),
"pingMs" : 1,
"syncingTo" : "192.168.91.135:27017"
}
],
"ok" : 1
}
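myState: 1 means the member that returned this output is the primary
state: 1 is primary; 2 is secondary; 7 is arbiter
uptime: number of seconds the member has been online
lastHeartbeat: time of the last heartbeat response received from that member
pingMs: round-trip time in milliseconds between this member and that member
optime: timestamp of the last oplog.rs entry that member has applied. Arbiters hold no data, so they report no optime.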
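The fields above can be read programmatically. The following is a minimal sketch that works on a plain object shaped like the rs.status() output; the function name summarizeStatus and the trimmed sample data are assumptions made for illustration, not part of the mongo shell.

```javascript
// Summarize an rs.status()-style document: who is primary, which members are
// unhealthy, and how many seconds each data-bearing member trails the primary.
function summarizeStatus(status) {
  const primary = status.members.find((m) => m.state === 1);
  const unhealthy = status.members
    .filter((m) => m.health !== 1)
    .map((m) => m.name);
  const lagSeconds = {};
  for (const m of status.members) {
    // Arbiters hold no data, so they have no optimeDate to compare.
    if (!primary || !m.optimeDate || m === primary) continue;
    lagSeconds[m.name] = (primary.optimeDate - m.optimeDate) / 1000;
  }
  return { primary: primary ? primary.name : null, unhealthy, lagSeconds };
}

// Sample trimmed from the output above, with dates as plain Date objects.
const sample = {
  set: "gechongrepl",
  members: [
    { name: "192.168.91.144:27017", health: 1, state: 7 },
    { name: "192.168.91.135:27017", health: 1, state: 2,
      optimeDate: new Date("2015-07-02T02:22:30Z") },
    { name: "192.168.91.148:27017", health: 1, state: 1,
      optimeDate: new Date("2015-07-02T02:22:30Z") },
    { name: "192.168.91.134:27017", health: 1, state: 2,
      optimeDate: new Date("2015-07-02T02:22:30Z") },
  ],
};

const summary = summarizeStatus(sample);
console.log(summary.primary, summary.unhealthy, summary.lagSeconds);
```

In the sample, every data-bearing member reports the same optimeDate, so the computed lag is zero for both secondaries.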
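MongoDB implements automatic failover through heartbeats, which rs.status() reports as lastHeartbeat.
Each mongod instance sends a heartbeat to the other members every 2 seconds, and the health field that rs.status() returns for each member shows whether that member is reachable. When the primary becomes unavailable, the secondaries in the replica set trigger an election to choose a new primary. If there are several secondaries, the one with the most recent oplog timestamp or the higher priority is chosen as primary. (Note: the oplog is written circularly, so if a secondary stays down for too long, the primary overwrites the oplog entries that the secondary still needs, and the secondary must then be resynchronized manually.)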
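The selection rule and the staleness note above can be sketched as two small functions. This is a deliberately simplified model, not the real election protocol, which involves voting among members. The ranking order (newest optime first, then priority), the numeric optime values, and the member names are all assumptions made for illustration.

```javascript
// Simplified model: choose the next primary among reachable secondaries.
// Assumption: rank by newest optime first, then by higher priority.
function pickNextPrimary(members) {
  const candidates = members.filter(
    (m) => m.health === 1 && m.stateStr === "SECONDARY" && m.priority > 0
  );
  candidates.sort((a, b) => b.optime - a.optime || b.priority - a.priority);
  return candidates.length > 0 ? candidates[0].name : null;
}

// A secondary that was down too long cannot catch up from the oplog:
// the oldest entry its sync source still holds is newer than the last
// entry the secondary applied, so there is a gap it cannot fill.
function isTooStale(lastAppliedTs, sourceOldestTs) {
  return lastAppliedTs < sourceOldestTs;
}

// Hypothetical members; optime is simplified to a plain number.
const members = [
  { name: "node-a:27017", health: 0, stateStr: "PRIMARY", optime: 120, priority: 1 },
  { name: "node-b:27017", health: 1, stateStr: "SECONDARY", optime: 100, priority: 1 },
  { name: "node-c:27017", health: 1, stateStr: "SECONDARY", optime: 120, priority: 1 },
  { name: "node-d:27017", health: 1, stateStr: "SECONDARY", optime: 120, priority: 2 },
  { name: "node-e:27017", health: 1, stateStr: "ARBITER", optime: 0, priority: 0 },
];

console.log(pickNextPrimary(members)); // node-d:27017
console.log(isTooStale(50, 80));       // true: entries 51..79 are already gone
```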
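The MongoDB data synchronization process:
1: Pull the oplog from the sync source. For example, a Secondary pulls the oplog from the Primary.
2: Write the pulled oplog entries to the node's own oplog. For example, the Secondary writes the entries it pulled from the Primary to its own oplog.
3: Request the next oplog entries, which also tells the sync source how far this node has synced. For example, the Secondary's request shows the Primary the point up to which it has synced.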
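The three steps above can be sketched with in-memory objects. This is an illustrative model only: the node structure, the integer timestamps, and the helper names makeNode, writeOnPrimary and syncOnce are assumptions, not MongoDB APIs.

```javascript
// In-memory stand-in for a member: an oplog (array of {ts, op}) plus its data.
function makeNode() {
  return { oplog: [], data: {} };
}

// Primary-side write: apply the change and log it with a timestamp.
function writeOnPrimary(primary, ts, key, value) {
  primary.data[key] = value;
  primary.oplog.push({ ts, op: { key, value } });
}

// One synchronization round, seen from the secondary.
function syncOnce(secondary, source) {
  // The position to continue from is the timestamp of our newest oplog entry.
  const lastTs = secondary.oplog.length
    ? secondary.oplog[secondary.oplog.length - 1].ts
    : 0;
  // Step 1: pull entries newer than that position from the sync source.
  const batch = source.oplog.filter((e) => e.ts > lastTs);
  for (const entry of batch) {
    secondary.data[entry.op.key] = entry.op.value; // apply the operation
    secondary.oplog.push(entry);                   // step 2: write to own oplog
  }
  // Step 3 happens implicitly: the next call asks for entries after the new lastTs.
  return batch.length;
}

const primary = makeNode();
const secondary = makeNode();
writeOnPrimary(primary, 1, "a", 1);
writeOnPrimary(primary, 2, "b", 2);
console.log(syncOnce(secondary, primary)); // 2
console.log(syncOnce(secondary, primary)); // 0, already up to date
```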
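How far a Secondary has synced:
1: Data is inserted on the Primary node
2: At the same time, the data is written to the Primary's oplog and a timestamp is recorded
3: When db.runCommand({getlasterror:1,w:2}) is called on the Primary, the Primary has completed the write and waits for the other non-arbiter nodes to sync the data
4: The Secondary queries the Primary's oplog and pulls the new entries
5: The Secondary applies the oplog entries in timestamp order
6: The Secondary requests oplog entries with a timestamp greater than that of its own oplog
7: The Primary updates the timestamp it has recorded for that Secondary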
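The bookkeeping behind steps 3, 6 and 7 can be sketched as follows. The getlasterror command with w:2 belongs to the MongoDB generation discussed here (the 2.6 series). The class ReplicationProgress, its method names, and the integer timestamps are assumptions made for illustration, not MongoDB internals.

```javascript
// Track how far each member has replicated, keyed by member name.
class ReplicationProgress {
  constructor(primaryName) {
    this.primaryName = primaryName;
    this.lastTs = new Map([[primaryName, 0]]);
  }

  // Steps 1-2: the primary applies a write and logs it with a timestamp.
  recordPrimaryWrite(ts) {
    this.lastTs.set(this.primaryName, ts);
  }

  // Steps 6-7: a secondary asks for entries newer than `ts`, which tells the
  // primary that everything up to `ts` is already on that secondary.
  recordSecondaryRequest(name, ts) {
    this.lastTs.set(name, ts);
  }

  // Step 3: w:N is satisfied once N members, the primary included,
  // hold the write with timestamp `writeTs`.
  isAcknowledged(writeTs, w) {
    let holders = 0;
    for (const ts of this.lastTs.values()) {
      if (ts >= writeTs) holders += 1;
    }
    return holders >= w;
  }
}

const progress = new ReplicationProgress("primary");
progress.recordPrimaryWrite(5);
console.log(progress.isAcknowledged(5, 2)); // false, only the primary has it
progress.recordSecondaryRequest("secondary-1", 5);
console.log(progress.isAcknowledged(5, 2)); // true
```

With w:2 the write is acknowledged as soon as one secondary reports that it has reached the write's timestamp, because the primary itself counts as the first holder.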
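Initial sync:
1: An initial sync is performed when a node is newly added, or when the oplog entries the node needs have already been overwritten on the source.
2: Take the latest oplog time from the source node and mark it as start.
3: Clone all data from the source node to the target node
4: Build indexes on the target node
5: Take the latest oplog time of the source node again and mark it as minValid.
6: On the target node, apply the oplog from start to minValid (this replays the operations that arrived during the clone and were not yet reflected in the copy, so that the data becomes consistent; it is the oplog replay phase)
7: Become a normal member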
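The need for the replay in step 6 shows up in a small simulation in which writes arrive on the source while the clone is running. This is a sketch under stated assumptions: the data model (idempotent "set key to value" operations), the integer clock, and the duringClone hook are hypothetical, and the index build in step 4 is left out.

```javascript
// Source node: data plus an oplog of idempotent "set key to value" operations.
function makeSource() {
  return { data: {}, oplog: [], clock: 0 };
}

function write(source, key, value) {
  source.clock += 1;
  source.data[key] = value;
  source.oplog.push({ ts: source.clock, key, value });
}

// `duringClone` is a hypothetical hook that simulates writes reaching the
// source while the copy is in progress.
function initialSync(source, duringClone) {
  const target = { data: {}, lastApplied: 0 };
  const start = source.clock;                  // step 2: mark start
  const keys = Object.keys(source.data);
  keys.forEach((key, i) => {                   // step 3: clone all data
    if (i === 1 && duringClone) duringClone(source);
    target.data[key] = source.data[key];
  });
  const minValid = source.clock;               // step 5: mark minValid
  for (const e of source.oplog) {              // step 6: replay start..minValid
    if (e.ts > start && e.ts <= minValid) target.data[e.key] = e.value;
  }
  target.lastApplied = minValid;
  return target;                               // step 7: ready as a normal member
}

const source = makeSource();
write(source, "a", 1);
write(source, "b", 2);
write(source, "c", 3);
const target = initialSync(source, (s) => {
  write(s, "a", 10); // changes a key that was already copied
  write(s, "d", 4);  // adds a key the clone never saw
});
console.log(target.data); // { a: 10, b: 2, c: 3, d: 4 }
```

Without the replay, the target would keep the outdated value of a and would not contain d at all.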
The three initial sync steps from the official documentation:
These run on the new node.
Initial Sync
Initial sync copies all the data from one member of the replica set to another member. A member uses initial sync when the member has no data, such as when the member is new, or when the member has data but is missing a history of the set’s replication.
When you perform an initial sync, MongoDB:
1:Clones all databases. To clone, the mongod queries every collection in each source database and inserts all data into its own copies of these collections. At this time, _id indexes are also built. The clone process only copies valid data, omitting invalid documents.
2:Applies all changes to the data set. Using the oplog from the source, the mongod updates its data set to reflect the current state of the replica set.
3:Builds all indexes on all collections (except _id indexes, which were already completed).
When the mongod finishes building all index builds, the member can transition to a normal state, i.e. secondary.
Which member to sync from (Who to sync from)
During an initial sync, MongoDB may sync from either the primary or a secondary. It chooses the nearest member, judged by ping time.
You can also specify the member to sync from.
db.adminCommand( { replSetSyncFrom: "[hostname]:[port]" } )
or
rs.syncFrom("[hostname]:[port]")
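The "nearest member" rule can be sketched as a function over member objects shaped like the rs.status() output. It is a reduced model of the selection logic: it ignores slave delay, hidden members, the veto list and the maximum-lag check. The function name chooseSyncSource, the numeric optime values and the host names are assumptions made for illustration.

```javascript
// Pick a sync source: the lowest-ping member that is up, readable,
// and ahead of our own last applied optime.
function chooseSyncSource(members, myOptime, forced) {
  // An explicit request, as expressed by rs.syncFrom(), takes precedence.
  if (forced) return forced;
  let closest = null;
  for (const m of members) {
    const readable = m.stateStr === "PRIMARY" || m.stateStr === "SECONDARY";
    if (m.health !== 1 || !readable) continue;
    // A secondary that is not ahead of us has nothing new to offer.
    if (m.stateStr === "SECONDARY" && m.optime <= myOptime) continue;
    if (closest === null || m.pingMs < closest.pingMs) closest = m;
  }
  return closest ? closest.name : null;
}

// Hypothetical members; optime is simplified to a plain number.
const candidates = [
  { name: "host-1:27017", health: 1, stateStr: "PRIMARY", optime: 200, pingMs: 12 },
  { name: "host-2:27017", health: 1, stateStr: "SECONDARY", optime: 200, pingMs: 1 },
  { name: "host-3:27017", health: 1, stateStr: "SECONDARY", optime: 90, pingMs: 0 },
  { name: "host-4:27017", health: 1, stateStr: "ARBITER", optime: 0, pingMs: 0 },
];

console.log(chooseSyncSource(candidates, 100));                 // host-2:27017
console.log(chooseSyncSource(candidates, 100, "host-1:27017")); // host-1:27017
```

In this example host-3 has the lowest ping but is behind the syncing node, so it is skipped, and the arbiter is never a candidate because it holds no data.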
Source code for initial sync: http://dl.mongodb.org/dl/src/
C:\Users\John\Desktop\mongodb-src-r2.6.3\src\mongo\db\repl\rs_initialsync.cpp
rs_initialsync.cpp
/**
* Copyright (C) 2008 10gen Inc.
*
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU Affero General Public License, version 3,
* as published by the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU Affero General Public License for more details.
*
* You should have received a copy of the GNU Affero General Public License
* along with this program. If not, see <http://www.gnu.org/licenses/>.
*
* As a special exception, the copyright holders give permission to link the
* code of portions of this program with the OpenSSL library under certain
* conditions as described in each individual source file and distribute
* linked combinations including the program with the OpenSSL library. You
* must comply with the GNU Affero General Public License in all respects for
* all of the code used other than as permitted herein. If you modify file(s)
* with this exception, you may extend this exception to your version of the
* file(s), but you are not obligated to do so. If you do not wish to do so,
* delete this exception statement from your version. If you delete this
* exception statement from all source files in the program, then also delete
* it in the license file.
*/
#include "mongo/pch.h"
#include "mongo/db/repl/rs.h"
#include "mongo/db/auth/authorization_manager.h"
#include "mongo/db/auth/authorization_manager_global.h"
#include "mongo/db/client.h"
#include "mongo/db/cloner.h"
#include "mongo/db/dbhelpers.h"
#include "mongo/db/repl/bgsync.h"
#include "mongo/db/repl/oplog.h"
#include "mongo/db/repl/oplogreader.h"
#include "mongo/bson/optime.h"
#include "mongo/db/repl/replication_server_status.h" // replSettings
#include "mongo/db/repl/rs_sync.h"
#include "mongo/util/mongoutils/str.h"
namespace mongo {
using namespace mongoutils;
using namespace bson;
void dropAllDatabasesExceptLocal();
// add try/catch with sleep
void isyncassert(const string& msg, bool expr) {
    if( !expr ) {
        string m = str::stream() << "initial sync " << msg;
        theReplSet->sethbmsg(m, 0);
        uasserted(13404, m);
    }
}
void ReplSetImpl::syncDoInitialSync() {
    static const int maxFailedAttempts = 10;
    createOplog();
    int failedAttempts = 0;
    while ( failedAttempts < maxFailedAttempts ) {
        try {
            _syncDoInitialSync();
            break;
        }
        catch(DBException& e) {
            failedAttempts++;
            str::stream msg;
            msg << "initial sync exception: ";
            msg << e.toString() << " " << (maxFailedAttempts - failedAttempts) << " attempts remaining" ;
            sethbmsg(msg, 0);
            sleepsecs(30);
        }
    }
    fassert( 16233, failedAttempts < maxFailedAttempts);
}
bool ReplSetImpl::_syncDoInitialSync_clone(Cloner& cloner, const char *master,
                                           const list<string>& dbs, bool dataPass) {
    for( list<string>::const_iterator i = dbs.begin(); i != dbs.end(); i++ ) {
        string db = *i;
        if( db == "local" )
            continue;
        if ( dataPass )
            sethbmsg( str::stream() << "initial sync cloning db: " << db , 0);
        else
            sethbmsg( str::stream() << "initial sync cloning indexes for : " << db , 0);
        Client::WriteContext ctx(db);
        string err;
        int errCode;
        CloneOptions options;
        options.fromDB = db;
        options.logForRepl = false;
        options.slaveOk = true;
        options.useReplAuth = true;
        options.snapshot = false;
        options.mayYield = true;
        options.mayBeInterrupted = false;
        options.syncData = dataPass;
        options.syncIndexes = ! dataPass;
        if (!cloner.go(ctx.ctx(), master, options, NULL, err, &errCode)) {
            sethbmsg(str::stream() << "initial sync: error while "
                                   << (dataPass ? "cloning " : "indexing ") << db
                                   << ". " << (err.empty() ? "" : err + ". ")
                                   << "sleeping 5 minutes" ,0);
            return false;
        }
    }
    return true;
}
void _logOpObjRS(const BSONObj& op);
static void emptyOplog() {
    Client::WriteContext ctx(rsoplog);
    Collection* collection = ctx.ctx().db()->getCollection(rsoplog);
    // temp
    if( collection->numRecords() == 0 )
        return; // already empty, ok.
    LOG(1) << "replSet empty oplog" << rsLog;
    collection->details()->emptyCappedCollection(rsoplog);
}
bool Member::syncable() const {
    bool buildIndexes = theReplSet ? theReplSet->buildIndexes() : true;
    return hbinfo().up() && (config().buildIndexes || !buildIndexes) && state().readable();
}
const Member* ReplSetImpl::getMemberToSyncTo() {
    lock lk(this);
    // if we have a target we've requested to sync from, use it
    if (_forceSyncTarget) {
        Member* target = _forceSyncTarget;
        _forceSyncTarget = 0;
        sethbmsg( str::stream() << "syncing to: " << target->fullName() << " by request", 0);
        return target;
    }
    const Member* primary = box.getPrimary();
    // wait for 2N pings before choosing a sync target
    if (_cfg) {
        int needMorePings = config().members.size()*2 - HeartbeatInfo::numPings;
        if (needMorePings > 0) {
            OCCASIONALLY log() << "waiting for " << needMorePings << " pings from other members before syncing" << endl;
            return NULL;
        }
        // If we are only allowed to sync from the primary, return that
        if (!_cfg->chainingAllowed()) {
            // Returns NULL if we cannot reach the primary
            return primary;
        }
    }
    // find the member with the lowest ping time that has more data than me
    // Find primary's oplog time. Reject sync candidates that are more than
    // maxSyncSourceLagSecs seconds behind.
    OpTime primaryOpTime;
    if (primary)
        primaryOpTime = primary->hbinfo().opTime;
    else
        // choose a time that will exclude no candidates, since we don't see a primary
        primaryOpTime = OpTime(maxSyncSourceLagSecs, 0);
    if (primaryOpTime.getSecs() < static_cast<unsigned int>(maxSyncSourceLagSecs)) {
        // erh - I think this means there was just a new election
        // and we don't yet know the new primary's optime
        primaryOpTime = OpTime(maxSyncSourceLagSecs, 0);
    }
    OpTime oldestSyncOpTime(primaryOpTime.getSecs() - maxSyncSourceLagSecs, 0);
    Member *closest = 0;
    time_t now = 0;
    // Make two attempts. The first attempt, we ignore those nodes with
    // slave delay higher than our own. The second attempt includes such
    // nodes, in case those are the only ones we can reach.
    // This loop attempts to set 'closest'.
    for (int attempts = 0; attempts < 2; ++attempts) {
        for (Member *m = _members.head(); m; m = m->next()) {
            if (!m->syncable())
                continue;
            if (m->state() == MemberState::RS_SECONDARY) {
                // only consider secondaries that are ahead of where we are
                if (m->hbinfo().opTime <= lastOpTimeWritten)
                    continue;
                // omit secondaries that are excessively behind, on the first attempt at least.
                if (attempts == 0 &&
                    m->hbinfo().opTime < oldestSyncOpTime)
                    continue;
            }
            // omit nodes that are more latent than anything we've already considered
            if (closest &&
                (m->hbinfo().ping > closest->hbinfo().ping))
                continue;
            if (attempts == 0 &&
                (myConfig().slaveDelay < m->config().slaveDelay || m->config().hidden)) {
                continue; // skip this one in the first attempt
            }
            map<string,time_t>::iterator vetoed = _veto.find(m->fullName());
            if (vetoed != _veto.end()) {
                // Do some veto housekeeping
                if (now == 0) {
                    now = time(0);
                }
                // if this was on the veto list, check if it was vetoed in the last "while".
                // if it was, skip.
                if (vetoed->second >= now) {
                    if (time(0) % 5 == 0) {
                        log() << "replSet not trying to sync from " << (*vetoed).first
                              << ", it is vetoed for " << ((*vetoed).second - now) << " more seconds" << rsLog;
                    }
                    continue;
                }
                _veto.erase(vetoed);
                // fall through, this is a valid candidate now
            }
            // This candidate has passed all tests; set 'closest'
            closest = m;
        }
        if (closest) break; // no need for second attempt
    }
    if (!closest) {
        return NULL;
    }
    sethbmsg( str::stream() << "syncing to: " << closest->fullName(), 0);
    return closest;
}
void ReplSetImpl::veto(const string& host, const unsigned secs) {
    lock lk(this);
    _veto[host] = time(0)+secs;
}
/**
* Replays the sync target's oplog from lastOp to the latest op on the sync target.
*
* @param syncer either initial sync (can reclone missing docs) or "normal" sync (no recloning)
* @param r the oplog reader
* @param source the sync target
* @param lastOp the op to start syncing at. replset::InitialSync writes this and then moves to
* the queue. replset::SyncTail does not write this, it moves directly to the
* queue.
* @param minValid populated by this function. The most recent op on the sync target's oplog,
* this function syncs to this value (inclusive)
* @return if applying the oplog succeeded
*/
bool ReplSetImpl::_syncDoInitialSync_applyToHead( replset::SyncTail& syncer, OplogReader* r,
                                                  const Member* source, const BSONObj& lastOp ,
                                                  BSONObj& minValid ) {
    /* our cloned copy will be strange until we apply oplog events that occurred
       through the process. we note that time point here. */
    try {
        // It may have been a long time since we last used this connection to
        // query the oplog, depending on the size of the databases we needed to clone.
        // A common problem is that TCP keepalives are set too infrequent, and thus
        // our connection here is terminated by a firewall due to inactivity.
        // Solution is to increase the TCP keepalive frequency.
        minValid = r->getLastOp(rsoplog);
    } catch ( SocketException & ) {
        log() << "connection lost to " << source->h().toString() << "; is your tcp keepalive interval set appropriately?";
        if( !r->connect(source->h().toString()) ) {
            sethbmsg( str::stream() << "initial sync couldn't connect to " << source->h().toString() , 0);
            throw;
        }
        // retry
        minValid = r->getLastOp(rsoplog);
    }
    isyncassert( "getLastOp is empty ", !minValid.isEmpty() );
    OpTime mvoptime = minValid["ts"]._opTime();
    verify( !mvoptime.isNull() );
    OpTime startingTS = lastOp["ts"]._opTime();
    verify( mvoptime >= startingTS );
    // apply startingTS..mvoptime portion of the oplog
    {
        try {
            minValid = syncer.oplogApplication(lastOp, minValid);
        }
        catch (const DBException&) {
            log() << "replSet initial sync failed during oplog application phase" << rsLog;
            emptyOplog(); // otherwise we'll be up!
            lastOpTimeWritten = OpTime();
            lastH = 0;
            log() << "replSet cleaning up [1]" << rsLog;
            {
                Client::WriteContext cx( "local." );
                cx.ctx().db()->flushFiles(true);
            }
            log() << "replSet cleaning up [2]" << rsLog;
            log() << "replSet initial sync failed will try again" << endl;
            sleepsecs(5);
            return false;
        }
    }
    return true;
}
/**
* Do the initial sync for this member. There are several steps to this process:
*
* 0. Add _initialSyncFlag to minValid to tell us to restart initial sync if we
* crash in the middle of this procedure
* 1. Record start time.
* 2. Clone.
* 3. Set minValid1 to sync target's latest op time.
* 4. Apply ops from start to minValid1, fetching missing docs as needed.
* 5. Set minValid2 to sync target's latest op time.
* 6. Apply ops from minValid1 to minValid2.
* 7. Build indexes.
* 8. Set minValid3 to sync target's latest op time.
* 9. Apply ops from minValid2 to minValid3.
10. Clean up minValid and remove _initialSyncFlag field
*
* At that point, initial sync is finished. Note that the oplog from the sync target is applied
* three times: step 4, 6, and 8. 4 may involve refetching, 6 should not. By the end of 6,
* this member should have consistent data. 8 is "cosmetic," it is only to get this member
* closer to the latest op time before it can transition to secondary state.
*/
void ReplSetImpl::_syncDoInitialSync() {
    replset::InitialSync init(replset::BackgroundSync::get());
    replset::SyncTail tail(replset::BackgroundSync::get());
    sethbmsg("initial sync pending",0);
    // if this is the first node, it may have already become primary
    if ( box.getState().primary() ) {
        sethbmsg("I'm already primary, no need for initial sync",0);
        return;
    }
    const Member *source = getMemberToSyncTo();
    if (!source) {
        sethbmsg("initial sync need a member to be primary or secondary to do our initial sync", 0);
        sleepsecs(15);
        return;
    }
    string sourceHostname = source->h().toString();
    init.setHostname(sourceHostname);
    OplogReader r;
    if( !r.connect(sourceHostname) ) {
        sethbmsg( str::stream() << "initial sync couldn't connect to " << source->h().toString() , 0);
        sleepsecs(15);
        return;
    }
    BSONObj lastOp = r.getLastOp(rsoplog);
    if( lastOp.isEmpty() ) {
        sethbmsg("initial sync couldn't read remote oplog", 0);
        sleepsecs(15);
        return;
    }
    // written by applyToHead calls
    BSONObj minValid;
    if (replSettings.fastsync) {
        log() << "fastsync: skipping database clone" << rsLog;
        // prime oplog
        init.oplogApplication(lastOp, lastOp);
        return;
    }
    else {
        // Add field to minvalid document to tell us to restart initial sync if we crash
        theReplSet->setInitialSyncFlag();
        sethbmsg("initial sync drop all databases", 0);
        dropAllDatabasesExceptLocal();
        sethbmsg("initial sync clone all databases", 0);
        list<string> dbs = r.conn()->getDatabaseNames();
        Cloner cloner;
        if (!_syncDoInitialSync_clone(cloner, sourceHostname.c_str(), dbs, true)) {
            veto(source->fullName(), 600);
            sleepsecs(300);
            return;
        }
        sethbmsg("initial sync data copy, starting syncup",0);
        log() << "oplog sync 1 of 3" << endl;
        if ( ! _syncDoInitialSync_applyToHead( init, &r , source , lastOp , minValid ) ) {
            return;
        }
        lastOp = minValid;
        // Now we sync to the latest op on the sync target _again_, as we may have recloned ops
        // that were "from the future" compared with minValid. During this second application,
        // nothing should need to be recloned.
        log() << "oplog sync 2 of 3" << endl;
        if (!_syncDoInitialSync_applyToHead(tail, &r , source , lastOp , minValid)) {
            return;
        }
        // data should now be consistent
        lastOp = minValid;
        sethbmsg("initial sync building indexes",0);
        if (!_syncDoInitialSync_clone(cloner, sourceHostname.c_str(), dbs, false)) {
            veto(source->fullName(), 600);
            sleepsecs(300);
            return;
        }
    }
    log() << "oplog sync 3 of 3" << endl;
    if (!_syncDoInitialSync_applyToHead(tail, &r, source, lastOp, minValid)) {
        return;
    }
    // ---------
    Status status = getGlobalAuthorizationManager()->initialize();
    if (!status.isOK()) {
        warning() << "Failed to reinitialize auth data after initial sync. " << status;
        return;
    }
    sethbmsg("initial sync finishing up",0);
    verify( !box.getState().primary() ); // wouldn't make sense if we were.
    {
        Client::WriteContext cx( "local." );
        cx.ctx().db()->flushFiles(true);
        try {
            log() << "replSet set minValid=" << minValid["ts"]._opTime().toString() << rsLog;
        }
        catch(...) { }
        // Initial sync is now complete. Flag this by setting minValid to the last thing
        // we synced.
        theReplSet->setMinValid(minValid);
        // Clear the initial sync flag.
        theReplSet->clearInitialSyncFlag();
        cx.ctx().db()->flushFiles(true);
    }
    {
        boost::unique_lock<boost::mutex> lock(theReplSet->initialSyncMutex);
        theReplSet->initialSyncRequested = false;
    }
    // If we just cloned & there were no ops applied, we still want the primary to know where
    // we're up to
    replset::BackgroundSync::notify();
    changeState(MemberState::RS_RECOVERING);
    sethbmsg("initial sync done",0);
}
}
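In the listing above, a sync source that fails during cloning is vetoed for 600 seconds, and getMemberToSyncTo() skips vetoed hosts until the veto expires. The following is a minimal sketch of that time-limited veto list, rewritten in JavaScript for brevity. The class name VetoList and the injectable clock are assumptions made for illustration and do not appear in the MongoDB source.

```javascript
// Time-limited veto list: a host that failed as a sync source is skipped
// until its veto expires.
class VetoList {
  // `now` returns the current time in seconds; it is injectable so the
  // behavior can be exercised without waiting for real time to pass.
  constructor(now = () => Math.floor(Date.now() / 1000)) {
    this.now = now;
    this.expiry = new Map(); // host -> time at which the veto ends
  }

  veto(host, secs) {
    this.expiry.set(host, this.now() + secs);
  }

  isVetoed(host) {
    const until = this.expiry.get(host);
    if (until === undefined) return false;
    if (until >= this.now()) return true; // still inside the veto window
    this.expiry.delete(host);             // expired: valid candidate again
    return false;
  }
}

// Usage with a fake clock so that the passage of time is explicit.
let fakeNow = 1000;
const vetoes = new VetoList(() => fakeNow);
vetoes.veto("192.168.91.135:27017", 600);
console.log(vetoes.isVetoed("192.168.91.135:27017")); // true
fakeNow = 1601;
console.log(vetoes.isVetoed("192.168.91.135:27017")); // false
```

Storing the expiry time rather than the veto start keeps the check to a single comparison, which mirrors how the C++ code stores time(0)+secs in its _veto map.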