MySQL Galera cluster crash

Привет,

Днес сутринта след рутинен update на пакетите на DB сървърите попаднах на MySQL Galera cluster crash.
За няколко минути cluster-а ми се счупи тотално и част от апликациите на root.bg не работиха.

MySQL Galera cluster crash

Ето как започна всичко :

След

apt-get update && apt-get upgrade -y
нещата се объркаха :

Setting up mariadb-server-10.1 (10.1.22+maria-1~xenial) ...
Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.
invoke-rc.d: initscript mysql, action "start" failed.
● mariadb.service - MariaDB database server
Loaded: loaded (/lib/systemd/system/mariadb.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/mariadb.service.d
└─migrated-from-my.cnf-settings.conf
Active: failed (Result: exit-code) since Thu 2017-03-16 10:35:22 EET; 4ms ago
Process: 13607 ExecStartPre=/bin/sh -c [ ! -e /usr/bin/galera_recovery ] && VAR= || VAR=`/usr/bin/galera_recovery`; [ $? -eq 0 ] && systemctl set-environment _WSREP_START_POSITION=$VAR || exit 1 (code=exited, status=1/FAILURE)
Process: 13602 ExecStartPre=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)
Process: 13597 ExecStartPre=/usr/bin/install -m 755 -o mysql -g root -d /var/run/mysqld (code=exited, status=0/SUCCESS)
Main PID: 19373 (code=exited, status=0/SUCCESS)

Mar 16 10:35:22 border mysqld[13607]: Segmentation fault (core dumped)

Веднага предприех опит да възтановя базите чрез вградения скрипт на galera за recovery :

galera_recovery

обаче резултата не беше такъв какъвто се надявах да бъде :

root@border:/var/lib/mysql# galera_recovery
2017-03-16 10:40:53 139680740075776 [Warning] options --log-slow-admin-statements, --log-queries-not-using-indexes and --log-slow-slave-statements have no effect if --log_slow_queries is not set
2017-03-16 10:40:53 139680740075776 [Note] /usr/sbin/mysqld (mysqld 10.1.22-MariaDB-1~xenial) starting as process 19000 ...
Segmentation fault (core dumped)
WSREP: Failed to start mysqld for wsrep recovery: '2017-03-16 10:40:53 139680740075776 [Note] Using unique option prefix 'myisam-recover' is error-prone and can break in the future. Please use the full name 'myisam-recover-options' instead.
2017-03-16 10:40:53 139680740075776 [Note] InnoDB: Using mutexes to ref count buffer pool pages
2017-03-16 10:40:53 139680740075776 [Note] InnoDB: The InnoDB memory heap is disabled
2017-03-16 10:40:53 139680740075776 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2017-03-16 10:40:53 139680740075776 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2017-03-16 10:40:53 139680740075776 [Note] InnoDB: Compressed tables use zlib 1.2.8
2017-03-16 10:40:53 139680740075776 [Note] InnoDB: Using Linux native AIO
2017-03-16 10:40:53 139680740075776 [Note] InnoDB: Using SSE crc32 instructions
2017-03-16 10:40:53 139680740075776 [Note] InnoDB: Initializing buffer pool, size = 256.0M
2017-03-16 10:40:54 139680740075776 [Note] InnoDB: Completed initialization of buffer pool
2017-03-16 10:40:54 139680740075776 [Note] InnoDB: Highest supported file format is Barracuda.
InnoDB: Transaction 7490983 was in the XA prepared state.
InnoDB: 1 transaction(s) which must be rolled back or cleaned up
InnoDB: in total 0 row operations to undo
InnoDB: Trx id counter is 7503616
InnoDB: Starting in background the rollback of uncommitted transactions
2017-03-16 10:40:54 7f062bffe700 InnoDB: Rollback of non-prepared transactions completed
2017-03-16 10:40:54 139680740075776 [Note] InnoDB: 128 rollback segment(s) are active.
2017-03-16 10:40:54 139680740075776 [Note] InnoDB: Waiting for purge to start
2017-03-16 10:40:54 139680740075776 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.35-80.0 started; log sequence number 9123984658
2017-03-16 10:40:54 139680740075776 [Warning] InnoDB: Skipping buffer pool dump/restore during wsrep recovery.
2017-03-16 10:40:54 139680740075776 [Note] Plugin 'FEEDBACK' is disabled.
2017-03-16 10:40:54 139680740075776 [Note] Recovering after a crash using tc.log
2017-03-16 10:40:54 139680740075776 [Note] Starting crash recovery...
2017-03-16 10:40:54 7f09f4e4e900 InnoDB: Starting recovery for XA transactions...
2017-03-16 10:40:54 7f09f4e4e900 InnoDB: Transaction 7490983 in prepared state after recovery
2017-03-16 10:40:54 7f09f4e4e900 InnoDB: Transaction contains changes to 1 rows
2017-03-16 10:40:54 7f09f4e4e900 InnoDB: 1 transactions in prepared state after recovery
2017-03-16 10:40:54 139680740075776 [Note] Found 1 prepared transaction(s) in InnoDB
170316 10:40:54 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.1.22-MariaDB-1~xenial
key_buffer_size=20971520000
read_buffer_size=1048576
max_used_connections=0
max_threads=602
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 21725239 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x30000
(my_addr_resolve failure: fork)
/usr/sbin/mysqld(my_print_stacktrace+0x2e) [0x562db325cd9e]
/usr/sbin/mysqld(handle_fatal_signal+0x2f5) [0x562db2daa2d5]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390) [0x7f09f34fa390]
/usr/sbin/mysqld(wsrep_on+0x18) [0x562db2d44458]
/usr/sbin/mysqld(+0x803842) [0x562db3016842]
/usr/sbin/mysqld(+0x803f11) [0x562db3016f11]
/usr/sbin/mysqld(+0x59897f) [0x562db2dab97f]
/usr/sbin/mysqld(plugin_foreach_with_mask(THD*, char (*)(THD*, st_plugin_int*, void*), int, unsigned int, void*)+0x2cf) [0x562db2c2a6af]
/usr/sbin/mysqld(ha_recover(st_hash*)+0xc6) [0x562db2dad886]
/usr/sbin/mysqld(TC_LOG_MMAP::recover()+0x148) [0x562db2e5e9a8]
/usr/sbin/mysqld(TC_LOG_MMAP::open(char const*)+0x620) [0x562db2e5f070]
/usr/sbin/mysqld(+0x3766d5) [0x562db2b896d5]
/usr/sbin/mysqld(mysqld_main(int, char**)+0xa52) [0x562db2b8d302]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f09f2ab5830]
/usr/sbin/mysqld(_start+0x29) [0x562db2b81079]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.'

Нещата започнаха да стават все по-лоши и по-лоши.. но накрая успях да овладея ситуацията.

Решението на проблема:

Благодарение на този пост в bugzilla.redhat.com успях да възтановя счупения DB сървър чрез:

mysqld_safe --user=mysql --datadir=/var/lib/mysql --wsrep-recover --tc-heuristic-recover rollback

Ето и резултата :

root@border:/var/lib/mysql# mysqld_safe --user=mysql --datadir=/var/lib/mysql --wsrep-recover --tc-heuristic-recover rollback
170316 10:55:06 mysqld_safe Logging to syslog.
170316 10:55:06 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql

Така успях да стартирам cluster-a наново чрез

galera_new_cluster
и всичко си тръгна по старо му!

Изводите

Първи извод : Както не веднъж съм споменавал в предишни мои постове, винаги ама винаги трябва да имаме бекъп!! При положение, че всичките ми опити за recovery на galera не бяха довели до възтановяване на базата, щях да бъда принуден да стартирам нова mysql инстанция другаде и да налея всички бази в нея. Добре че не се стига до там! 🙂

Втори извод : Нещата с редовното обновяване на пакети не винаги носят облаги. Аз обичам да update-вам и често си патя от това – дали не е време да спра да го правя?

Сподели чрез...