Coding and decoding crash dump handlers

All software has bugs. Even if you could write perfectly bug-free software, every layer beneath it has bugs, CPUs included, as the recent Meltdown and Spectre bugs showed. Unfortunately this means software will sometimes crash. When that happens it is useful to capture as much information as possible to help stop it happening again.

One of the first things I did when coming back to work from the holiday break was to code a new crash dump handler for MariaDB ColumnStore. Upon a crash it writes a stack trace for the current thread into a file. This is very useful for daemons, where you want to find the root cause of a problem without running under a debugger.

Compiler Options

The first thing you will want to do is compile your binaries with debugging symbols and frame pointers enabled. This may add a tiny overhead to execution, a few percent at most, but it is worth it to be able to run a postmortem on crashes. The useful options are “-g” and “-fno-omit-frame-pointer”.

Crash Handler

This is a basic crash handler. It dumps the crash data into a file in /tmp named after the PID of the process. You will likely want to expand on this to add more information and error handling. The important thing is to avoid mallocs as much as possible, since the heap may already be corrupted by the time the handler runs:

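#include <execinfo.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

void fatalHandler(int sig)
{
  char filename[128];
  void* addrs[128];

  // Name the dump after the PID, e.g. /tmp/1234.log
  snprintf(filename, sizeof(filename), "/tmp/%d.log", getpid());
  FILE* logfile = fopen(filename, "w");

  if (logfile != NULL)
  {
    // Header: timestamp and the signal that brought us here
    char s[30];
    struct tm tim;
    time_t now = time(NULL);
    localtime_r(&now, &tim);
    strftime(s, sizeof(s), "%F %T", &tim);
    fprintf(logfile, "Date/time: %s\n", s);
    fprintf(logfile, "Signal: %d\n\n", sig);
    // Flush stdio's buffer before writing to the raw file descriptor
    fflush(logfile);

    // backtrace_symbols_fd() writes straight to the fd without malloc
    int fd = fileno(logfile);
    int count = backtrace(addrs, sizeof(addrs) / sizeof(addrs[0]));
    backtrace_symbols_fd(addrs, count, fd);
    fclose(logfile);
  }

  // Restore the default action and re-raise so the process still dies
  // with the original signal
  struct sigaction sigact;
  memset(&sigact, 0, sizeof(sigact));
  sigact.sa_handler = SIG_DFL;
  sigaction(sig, &sigact, NULL);
  raise(sig);
}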

This opens the file and writes the current date/time into it along with the number of the signal that caused the crash. It then gets the backtrace and writes it into the file. Finally it resets the signal handler to the default and raises the signal again, so that the process still terminates in the usual way. execinfo.h, which is part of glibc, provides the backtrace functionality; the other calls need stdio.h, string.h, time.h, signal.h and unistd.h.
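
The handler above still goes through fopen() and fprintf(), which are not async-signal-safe and may allocate. As a rough sketch of how far the "avoid mallocs" idea can be pushed, here is a variant that only uses open() and write(). The names fatalHandlerRaw, formatUnsigned and writeAll are my own inventions for this illustration, and it drops the timestamp to keep things short. Note that with glibc the first call to backtrace() may itself load a library, so treat this as a starting point rather than a guarantee:

```cpp
#include <execinfo.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

// Hypothetical helper: write value in decimal into buf without using the
// printf family. Returns the number of characters written (excluding NUL).
static size_t formatUnsigned(unsigned long value, char* buf, size_t bufLen)
{
  char tmp[24];
  size_t n = 0;
  do
  {
    tmp[n++] = (char)('0' + value % 10);  // digits come out in reverse
    value /= 10;
  } while (value != 0 && n < sizeof(tmp));

  size_t written = 0;
  while (n > 0 && written + 1 < bufLen)
    buf[written++] = tmp[--n];
  buf[written] = '\0';
  return written;
}

// Hypothetical helper: keep calling write() until everything is out
static void writeAll(int fd, const char* data, size_t len)
{
  while (len > 0)
  {
    ssize_t done = write(fd, data, len);
    if (done <= 0)
      return;
    data += done;
    len -= (size_t)done;
  }
}

// Hypothetical variant of fatalHandler that avoids stdio entirely
void fatalHandlerRaw(int sig)
{
  // Build "/tmp/<pid>.log" by hand
  char path[64] = "/tmp/";
  size_t len = strlen(path);
  len += formatUnsigned((unsigned long)getpid(), path + len, sizeof(path) - len);
  memcpy(path + len, ".log", 5);  // 5 bytes: includes the terminating NUL

  int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
  if (fd >= 0)
  {
    char num[24];
    size_t numLen = formatUnsigned((unsigned long)sig, num, sizeof(num));
    writeAll(fd, "Signal: ", 8);
    writeAll(fd, num, numLen);
    writeAll(fd, "\n\n", 2);

    void* addrs[128];
    int count = backtrace(addrs, sizeof(addrs) / sizeof(addrs[0]));
    backtrace_symbols_fd(addrs, count, fd);
    close(fd);
  }

  // Same ending as before: default action, then re-raise
  struct sigaction sigact;
  memset(&sigact, 0, sizeof(sigact));
  sigact.sa_handler = SIG_DFL;
  sigaction(sig, &sigact, NULL);
  raise(sig);
}
```

It is installed in exactly the same way as fatalHandler. The trade-off is that you lose the convenience of formatted output, which is why the integer formatting has to be done by hand.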

Adding to Application

Somewhere near the beginning of your ‘main’ function you need to add the signal handler hooks. You’ll need to include ‘signal.h’ (and ‘string.h’ for memset) for this to work:

  struct sigaction crsh;
  memset(&crsh, 0, sizeof(crsh));
  crsh.sa_handler = fatalHandler;
  sigaction(SIGSEGV, &crsh, 0);
  sigaction(SIGABRT, &crsh, 0);
  sigaction(SIGFPE, &crsh, 0);
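
If you register more than a couple of signals it can be tidier to wrap this in a function that also checks the return value of sigaction(). The following is only a sketch: installCrashHandler is a hypothetical name, and SIGBUS and SIGILL are my additions to the three signals hooked above, so drop them if you don't want them:

```cpp
#include <signal.h>
#include <string.h>

// Hypothetical helper: install handler for a list of fatal signals.
// Returns true only if every sigaction() call succeeded.
bool installCrashHandler(void (*handler)(int))
{
  // SIGBUS and SIGILL are not in the original list (assumption)
  const int fatalSignals[] = {SIGSEGV, SIGABRT, SIGFPE, SIGBUS, SIGILL};

  struct sigaction crsh;
  memset(&crsh, 0, sizeof(crsh));
  crsh.sa_handler = handler;
  sigemptyset(&crsh.sa_mask);

  bool ok = true;
  for (int sig : fatalSignals)
  {
    if (sigaction(sig, &crsh, NULL) != 0)
      ok = false;
  }
  return ok;
}
```

In ‘main’ this becomes a single call, installCrashHandler(fatalHandler), and you can log a warning if it returns false.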

Testing

Once compiled and running, an easy way to test this is to send the application a signal that makes it believe it has crashed. You can do this with “kill -11 <PID>” (signal 11 is SIGSEGV on Linux). You should then find the crash dump in /tmp.
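
If you would rather have an automated check than a manual kill, one possible approach is to fork a child process, let it raise the signal on itself and look at how it died. This is a sketch under my own assumptions (signalThatKilledChild is a made-up name), not something taken from ColumnStore. Any handler installed before the call is inherited by the child, so the dump will be named after the child's PID:

```cpp
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical test helper: fork a child that raises sig on itself.
// Returns the signal that terminated the child, or -1 if the child
// was not killed by a signal (or fork/waitpid failed).
int signalThatKilledChild(int sig)
{
  pid_t pid = fork();
  if (pid < 0)
    return -1;

  if (pid == 0)
  {
    raise(sig);
    _exit(0);  // only reached if the signal did not terminate the child
  }

  int status = 0;
  if (waitpid(pid, &status, 0) != pid)
    return -1;

  return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

With the crash handler installed, signalThatKilledChild(SIGSEGV) should still report SIGSEGV, because the handler resets the default action and re-raises. If it returns -1 the handler has swallowed the signal.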

Analysing

The crash dump file will contain a list of function calls and address offsets. This may be useful on its own, but you can also use the same binaries to turn the offsets into source line numbers. The following is an example from a MariaDB ColumnStore binary:

Date/time: 2018-01-03 15:47:16
Signal: 6

[0x5573f0e18014]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fda7d43a390]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38)[0x7fda7b98b428]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7fda7b98d02a]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x16d)[0x7fda7c2ce84d]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8d6b6)[0x7fda7c2cc6b6]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8d701)[0x7fda7c2cc701]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8d919)[0x7fda7c2cc919]
/usr/local/mariadb/columnstore/lib/libmessageqcpp.so.1(_ZN11messageqcpp18MessageQueueClient5setupEb+0x194)[0x7fda7ea19e84]
/usr/local/mariadb/columnstore/lib/libmessageqcpp.so.1(_ZN11messageqcpp18MessageQueueClientC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPN6config6ConfigEb+0xd6)[0x7fda7ea1b566]
/usr/local/mariadb/columnstore/lib/libjoblist.so.1(_ZN7joblist21DistributedEngineComm5SetupEv+0x665)[0x7fda816663a5]
/usr/local/mariadb/columnstore/lib/libjoblist.so.1(_ZN7joblist21DistributedEngineCommC2EPNS_15ResourceManagerEb+0x1ed)[0x7fda8166807d]
/usr/local/mariadb/columnstore/lib/libjoblist.so.1(_ZN7joblist21DistributedEngineComm8instanceEPNS_15ResourceManagerEb+0x4a)[0x7fda816681fa]
[0x5573f0e01fdb]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7fda7b976830]
[0x5573f0e056d9]

The first useful line in this dump is:

/usr/local/mariadb/columnstore/lib/libmessageqcpp.so.1(_ZN11messageqcpp18MessageQueueClient5setupEb+0x194)[0x7fda7ea19e84]

We use the C++ mangled function name with the tool ‘nm’ to get the function’s base address:

nm /usr/local/mariadb/columnstore/lib/libmessageqcpp.so | grep _ZN11messageqcpp18MessageQueueClient5setupEb

0000000000011cf0 T _ZN11messageqcpp18MessageQueueClient5setupEb

Then in a hex calculator we add the offset from the stack dump (0x194) to 0x11cf0 which ‘nm’ provided above. This gives us 0x11e84. We can pass this to the utility ‘addr2line’ to get the line number:

addr2line -e /usr/local/mariadb/columnstore/lib/libmessageqcpp.so 0x11e84

/home/linuxjedi/Programming/Git/mariadb-columnstore-server/mariadb-columnstore-engine/utils/messageqcpp/messagequeue.cpp:170 (discriminator 2)
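
The hex calculator step above is easy to script if you have a lot of frames to go through. As a small sketch, the hypothetical parseFrame below pulls the symbol and offset out of one line of the dump. It assumes the "path(symbol+0xoffset)[address]" layout seen in the example and simply rejects frames without a symbol name. The base address still has to come from ‘nm’:

```cpp
#include <stdio.h>
#include <string.h>

// Hypothetical helper: extract "symbol" and "offset" from a frame like
//   /path/lib.so(symbol+0x194)[0x7fda7ea19e84]
// Returns false for frames with no symbol, such as "[0x5573f0e18014]"
// or "/path/lib.so(+0x11390)[0x7fda7d43a390]".
bool parseFrame(const char* line, char* symbol, size_t symbolLen,
                unsigned long* offset)
{
  const char* lparen = strrchr(line, '(');
  if (lparen == NULL)
    return false;

  const char* plus = strchr(lparen, '+');
  const char* rparen = strchr(lparen, ')');
  if (plus == NULL || rparen == NULL || plus > rparen)
    return false;

  size_t len = (size_t)(plus - lparen - 1);
  if (len == 0 || len >= symbolLen)
    return false;

  memcpy(symbol, lparen + 1, len);
  symbol[len] = '\0';

  // %lx accepts the leading "0x" and stops at the closing parenthesis
  return sscanf(plus + 1, "%lx", offset) == 1;
}

// Address to hand to addr2line: base from nm plus offset from the dump
unsigned long addressInLibrary(unsigned long symbolBase, unsigned long offset)
{
  return symbolBase + offset;
}
```

For the frame used in this post, parseFrame gives the symbol _ZN11messageqcpp18MessageQueueClient5setupEb and the offset 0x194, and addressInLibrary(0x11cf0, 0x194) gives the same 0x11e84 that was passed to ‘addr2line’.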

That line in the source is:

throw runtime_error(msg);

This uncaught exception is exactly what triggered this crash.
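
The mangled names in the dump are not the easiest thing to read. If you are post-processing dumps in C++ anyway, the ABI function abi::__cxa_demangle from cxxabi.h (available with GCC and Clang) can turn them back into readable signatures. The demangle wrapper here is my own hypothetical helper, and I would only call it from an offline tool, never from the crash handler itself, because it allocates memory:

```cpp
#include <cxxabi.h>
#include <cstdlib>
#include <string>

// Hypothetical helper: return the demangled form of a symbol, or the
// input unchanged if it cannot be demangled.
std::string demangle(const char* mangled)
{
  int status = 0;
  char* readable = abi::__cxa_demangle(mangled, NULL, NULL, &status);
  if (status != 0 || readable == NULL)
    return mangled;

  std::string result(readable);
  free(readable);  // the buffer was allocated with malloc
  return result;
}
```

For example, demangle("_ZN11messageqcpp18MessageQueueClient5setupEb") returns "messageqcpp::MessageQueueClient::setup(bool)". The ‘c++filt’ utility from binutils does the same job on the command line.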

Published by

LinuxJedi

I am a Lead Engineer for MariaDB ColumnStore at MariaDB Corporation
