Restoring a MySQL Replication with corrupted binlogs!
Posted on December 30th, 2005 by Basil
Major assumption: you have a safe, but halted replication you want to restore from.
Situation: your replication server is down, or the MySQL slave has simply stopped, possibly because the hard drive filled up and your binlogs got corrupted beyond repair. Some might have the luxury of taking a fresh snapshot from the main DB server (let's call it the Master DB server). But what if you can't afford to do a read lock on that server? (Which is most of the time for production servers with decent traffic.) Well, since you planned ahead, you should have multiple replication servers running; it helps reduce your read load with the round-robin method anyway.
Here are some things I did to get our primary MySQL replication slave up to date using the planned, redundant replication server, the secondary slave.
So, on a regular day, the Master DB replicates its binlog to the primary slave and to the secondary slave. The problem occurs when either the primary or the secondary slave stops. Let's assume it's the primary slave that stopped. The Master DB is no longer replicating to the primary slave, which is now out of sync with both the Master and the secondary slave. Your application that checks for lag should have done two things (a minimal check along these lines is sketched after the list):
* Alert you that the primary slave has stopped or at the very least, is really lagging behind.
* Switch its reading to the secondary slave server.
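A minimal sketch of such a check, assuming the slave is queried locally; USER, PASSWORD and the LAG_LIMIT threshold are placeholders of mine, and the alert/failover actions are just echoes you would replace with your own hooks:
//-- BEGIN //
#!/bin/sh
# Read Seconds_Behind_Master from SHOW SLAVE STATUS and complain if it's missing or too big
LAG_LIMIT=120
LAG=$(mysql -uUSER -pPASSWORD -e "SHOW SLAVE STATUS\G" | awk '/Seconds_Behind_Master/ {print $2}')

if [ -z "$LAG" ] || [ "$LAG" = "NULL" ]; then
  echo "Slave threads are not running - alert and switch reads to the secondary slave"
elif [ "$LAG" -gt "$LAG_LIMIT" ]; then
  echo "Slave is $LAG seconds behind - alert and consider switching reads"
fi
//-- END //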
A Solution
1. Have your application temporarily use your Master DB for reads, unless you have another slave you can use besides the secondary slave; we need the secondary slave to restore our primary slave in a moment.
2. Stop any latency checks and db monitoring you have for the secondary server to avoid the false alarms.
3. Stop the secondary slave’s slave threads with:
mysql> STOP SLAVE;
4. Copy the MySQL data directory of the secondary slave. (The whole thing.)
tar cf snapshot.tar /path/to/your/mysql/datadir
5. You can optionally compress the tarball, for example with gzip (recommended, especially if the plain tar file is just too big to transfer over your network).
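For example, a gzip-compressed variant of the step 4 command; archiving relative to the datadir with -C also makes the unpack on the primary slave in step 9 a little simpler:
# gzip-compressed snapshot, archived relative to the datadir
tar czf snapshot.tar.gz -C /path/to/your/mysql/datadir .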
6. As soon as you have the data directory copied, go ahead and start the secondary slave back up. Let it continue to catch up to the Master DB and you can even turn on the latency and db monitoring scripts for the secondary slave.
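Since step 3 only stopped the slave threads, bringing the secondary slave back is just the reverse; for example:
mysql> START SLAVE;
mysql> SHOW SLAVE STATUS \G
Watch Seconds_Behind_Master in the status output shrink back toward 0 as it catches up.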
7. SSH into your primary slave server; we need to remove the old data. If you have the storage space for both the old data AND the replacement data (uncompressed), you can rename the old data instead. But why? It's corrupt ;). I tend to just remove the primary slave server's MySQL data:
cd /path/to/your/mysql/datadir && rm -rf *   # -r because each database lives in its own subdirectory
8. Transport the snapshot from your secondary slave server to your primary slave server. (/tmp location should be fine, and any transport method such as rsync, ftp, etc. will work)
9. Decompress the snapshot on the primary slave server. If you get an out-of-disk-space error, did you delete the old datadir? You might need to if you haven't.
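One way to do steps 8 and 9, sketched with rsync and the placeholder paths used above; PRIMARY-SLAVE stands in for your primary slave's hostname, and the snapshot is assumed to have been created relative to the datadir as in the step 5 sketch:
//-- BEGIN //
# On the secondary slave: push the compressed snapshot to the primary slave
rsync -av snapshot.tar.gz PRIMARY-SLAVE:/tmp/

# On the primary slave: unpack into the emptied data directory
cd /path/to/your/mysql/datadir && tar xzf /tmp/snapshot.tar.gz
//-- END //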
10. Let’s clean up some files in the recently decompressed datadir. Here’s the list of files that are safe to be removed:
rm -f ib*                               # InnoDB data and log files. We don't use InnoDB on this setup, and MySQL will create new ones later if needed. (On our particular server these weren't used; if you do use InnoDB, don't delete these files.)
rm -f [secondary-slave-name]-bin.0000*  # Not needed; the primary slave server will create its own binlogs shortly.
rm -f [secondary-slave-name].err        # Remove the error log; it's specific to the other server.
rm -f [secondary-slave-name].pid        # Remove the process id file; again, specific to the other server.
rm -f [secondary-slave-name]-slow.log   # Safe to remove if you see it; it's the secondary slave's slow query log.
It is important to keep the master.info, relay-log.info, [secondary-slave-name]-relay-bin.index, and the [secondary-slave-name]-relay-bin.000*.
11. Now here's the important part; we'll need to carefully make a few edits (a command-line sketch follows this list):
* master.info: make sure all the data is correct; it shouldn't really need any changes unless your slaves connect to the Master with different login credentials.
* relay-log.info
– Change the relay log name from [secondary-slave-name]-relay-bin.0000* to [primary-slave-name]-relay-bin.0000*
* rename [secondary-slave-name]-relay-bin.index to [primary-slave-name]-relay-bin.index
* rename [secondary-slave-name]-relay-bin.000* to [primary-slave-name]-relay-bin.000*
* [primary-slave-name]-relay-bin.index
– Fix the one-line file pointer inside it so it refers to the newly renamed [primary-slave-name]-relay-bin.000* files.
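Here's a rough sketch of those renames and edits, run from inside the primary slave's data directory; SECONDARY and PRIMARY stand in for the two server names, and you should double-check every file by hand afterwards, since a mistake here will keep the slave from starting:
//-- BEGIN //
cd /path/to/your/mysql/datadir

# Rename the relay logs and the relay log index from the secondary's name to the primary's
for f in SECONDARY-relay-bin.*; do
  mv "$f" "PRIMARY${f#SECONDARY}"
done

# Point relay-log.info and the renamed .index file at the new file names
sed -i 's/SECONDARY-relay-bin/PRIMARY-relay-bin/g' relay-log.info PRIMARY-relay-bin.index
//-- END //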
12. If you turned off the MySQL daemon on the primary slave server (you should have, at the very least), make sure you start the daemon back up with the skip-slave-start option so the slave threads don't start on their own. (You can put this in /etc/my.cnf if you like and comment it out later.)
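For example (either one is enough; remember to remove the my.cnf line again once everything checks out):
//-- BEGIN //
# One-off start with the slave threads held back
mysqld_safe --skip-slave-start &

# ...or put it in /etc/my.cnf under the [mysqld] section:
# skip-slave-start
//-- END //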
13. Now check SHOW SLAVE STATUS \G on the primary slave server to make sure all files are pointing to the primary slave server's values and to the right MySQL Master DB values.
14. mysql> START SLAVE; SHOW SLAVE STATUS \G   (again, make sure Slave_IO_Running and Slave_SQL_Running are BOTH saying Yes)
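If you prefer checking from the shell, something along these lines pulls out just the interesting fields (same USER/PASSWORD placeholders as the watch script further down):
mysql -uUSER -pPASSWORD -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'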
15. Feel free to check the slave status as often as you like to watch the slave catch up to the master. You can also create a simple shell script like I have for latency checking; the source is shown below.
//-- BEGIN //
watch --interval=1 '
echo "--LocalHost--"; mysql -uUSER -pPASSWORD an -e "SELECT COUNT(id) FROM POPULAR_TABLE";
echo "--Master--";    mysql -uUSER -pPASSWORD an -hSERVER_A -e "SELECT COUNT(id) FROM POPULAR_TABLE";
echo "--LocalHost--"; mysql -uUSER -pPASSWORD an -e "SELECT * FROM REPLICATIONWATCHTABLE";
echo "--Master--";    mysql -uUSER -pPASSWORD an -hSERVER_A -e "SELECT * FROM REPLICATIONWATCHTABLE"'
//-- END //
Replacement notes for the above shell script:
REPLICATIONWATCHTABLE is a table designed only to hold a current timestamp. SERVER_A (the Master) should always show an up-to-date value (you can update it every 2 minutes or so), while the slave, still working through the relay log, shows an older one; the difference between the two timestamps tells you how far behind the slave really is. :) Nifty eh?
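A minimal version of that watch table and its updater might look like the sketch below; the table layout, the ts column, DBNAME, and the exact cron schedule are my own placeholders rather than anything from the original setup. The UPDATE runs on the Master only, and the fresh timestamp replicates down to the slaves:
//-- BEGIN //
# Run once on the Master (DBNAME is a placeholder; the table replicates to the slaves):
mysql -uUSER -pPASSWORD DBNAME -e "CREATE TABLE REPLICATIONWATCHTABLE (ts TIMESTAMP NOT NULL)"
mysql -uUSER -pPASSWORD DBNAME -e "INSERT INTO REPLICATIONWATCHTABLE VALUES (NOW())"

# Cron entry on the Master, refreshing the timestamp every 2 minutes:
*/2 * * * * mysql -uUSER -pPASSWORD DBNAME -e "UPDATE REPLICATIONWATCHTABLE SET ts = NOW()"
//-- END //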
Ok, you should now have all servers functioning fully again in their original slave roles. The bottleneck in this process is the transfer of the database data from the secondary slave to the primary slave.