Tuesday, May 31, 2005

Linux: Strace is your Friend!

When you're banging your head 'cause it doesn't work and you don't know why... here's a thing you can do to help ferret out a clue: strace

Strace is your friend. It has a lot of helpful options that allow you to watch a program in operation. It allows you to watch your program in operation by following its system calls.

What are system calls?

System calls are the "nuts and bolts" routines that the operating system makes available to programmers to interact with the system. There is another type of "call" (subroutine) that is not a system call which is called a "user call". System calls and well-known user level calls can be found in the UNIX manual sections 2 and 3 respectively. System calls are "atomic operations", which are specific, discrete operations that cannot easily be broken down any further into smaller segments (additional operating system calls). User calls on the other hand, are compound routines that consist of several system calls "glued together" to perform a larger, more abstract, function.

Why is this interesting?

It is interesting because by using strace you can monitor the progress (or the lack thereof) of your program while it is executing. This will allow you to watch all of the system calls the program makes, which will give you a general idea of what it is doing. By making some educated guesses, you can discern where it is in its execution, what parameters are being passed, what data is being processed, and see any relevant error codes being returned. To the uninitiated the output of strace looks like utter gibberish. To the trained eye it can reveal volumes about the operation of your program. In the following example, we'll trace the operation of the 'ls' command:


system prompt:/tmp#strace ls
execve("/bin/ls", ["ls"], [/* 101 vars */]) = 0
uname({sys="Linux", node="snotnose.wizard.org", ...}) = 0
brk(0) = 0x8059f08
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40013000
open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib/i686/mmx/libtermcap.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
stat64("/usr/lib/i686/mmx", 0xbfffe3c0) = -1 ENOENT (No such file or directory)
open("/usr/lib/i686/libtermcap.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)



Note that this is just a small sample for illustration. The actual output of strace is extremely verbose and will scroll through many screens. You will almost always want to capture the output in a file and review it later. Also, it is often helpful to give it additional options to include other details of your program, which will generally make it even more verbose! What part of the output is most helpful really depends on why you ran strace in the first place. If, for example, you have a program that keeps crashing for an unknown reason, you might run strace to get an idea of where/why it crashed. Run the program like so:


strace <path_to_your_program>


And capture the output. If you are using an X-terminal, or a terminal with a scrolling display, you can simply scroll up if its a short output. For capturing longer output, or if you need to let it run awhile to error out or get to the place where it bombs, you'll probably want the '-o' option. This allows you to specify an output file that will capture the strace output which you can browse later.


strace -o output_file <your_program>


Another thing to consider is that many programs "fork" a copy of themselves (clone) or spawn child processes. Ordinarily strace will only trace the execution of the original parent process, but you can tell it to capture the output of forked or spawned processes as well by using the '-f' option. Just add it to the command line and strace will add the output to the output stream:


strace -o output_file -f <your_program>


On some occasions, you might want to use strace to examine a process that's already running. This can be easily done by using the process's id number-- use the 'ps' command to find it-- with the '-p' option:


strace -p <process_id>


(And of course all the other options can be available as well)


Okay, so now you've learned how to fill up your screen and hard drive with all sorts of cryptic junk, what's it actually good for?

Here are some examples that I made up for discussion but illustrate typical real-world uses.

Your program bombs with a segfault. You don't know why. Run strace, look at the output, especially near the end, and find the place where it outputs the segfault message. Then go backwards just a little bit more and see what it was doing (or attempting to do) just before you got the segfault. You can often figure out why it is crapping out-- perhaps a library not found or the wrong version was found-- these are common issues.

Your program runs but doesn't do the right thing... you suspect there is a needed file or config file or device it may be looking for-- yet you don't know what it would be or where to start looking.... crank up strace, and grep the output for 'open' calls. I do something like this:


grep 'open(' <output_file> | more

grep 'open(' <output_file> | sort | uniq | more


This will give you a unique list of all attempts to open a file or device along with the outcome of the call. Note that many times open failures are normal and can be disregarded-- especially if it is an attempt to open a library-- a very common thing and one that will often generate many "failures" as the the program attempts to locate the library in the path-- just look and see if there is a chain of opens and follow it to be sure it ultimately found a suitable copy-- if so, then this particular group of failures can probably be safely ignored. Here's an example of what I mean:


open("/usr/lib/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/local/lib/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/X11R6/lib/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/pg/lib/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/mysql/lib/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/lib/i686/libc.so.6", O_RDONLY) = 3


Here the system is attempting to locate the 'libc.so.6' library and is checking all of the places the system says it might be found. It is finally found in the last place it looks.. isn't that always the way??? This type of failure can generally be safely ignored unless you have an idea that it might be locating the _wrong_ library. (See, there's something else useful you can do with strace)

Here's another example. I asked the system to show data from '/dev/noport', which I know doesn't exist and thus will generate an error:


system prompt:/tmp# cat /dev/noport
cat: /dev/noport: No such file or directory


Here is the relevant portion of the strace output, near the bottom, that would clue you on the problem-- that the file/device couldn't be located:


open("/dev/noport", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
write(2, "cat: ", 5cat: ) = 5
write(2, "/dev/noport", 11/dev/noport) = 11
open("/usr/share/locale/en_US/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
write(2, ": No such file or directory", 27: No such file or directory) = 27
write(2, "\n", 1) = 1
SYS_252(0x1, 0, 0, 0x40159e64, 0x1) = -1 ENOSYS (Function not implemented)
_exit(1) = ?


In the output above, look for the line "open("/dev/noport"...", which is conveniently located at the top of this listing, and discover that it FAILED to open with an "ENOENT" error, which means simply that the file could not be found and it says so: "No such file or directory". The program then soon exits making this an easy error to spot. But this could have been a missing configuration file, directory, etc.

Another common problem is when your progam attempts to access a resource it doesn't have permission to access- and generates a "permission denied error". Here is an example of an underpriviliged user attempting to access root's home directory:


/home/someuser: ls /root
lstat64("/root", {st_mode=S_IFDIR|0700, st_size=9216, ...}) = 0
open("/root", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = -1 EACCES (Permission denied)
write(2, "ls: ", 4ls: ) = 4
write(2, "/root", 5/root) = 5
open("/usr/share/locale/en_US/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
write(2, ": Permission denied", 19: Permission denied) = 19
write(2, "\n", 1) = 1
SYS_252(0x1, 0, 0, 0x4015de64, 0x1) = -1 ENOSYS (Function not implemented)
_exit(1) = ?


In the output above you will notice that the program attempted to access the root directory (eg. open("/root"...)) and got a "EACCES" error, which results in "Permission Denied". In this case the execution of the program ended. But it wouldn't necessarily have to in another program. And it might continue right on until some later point where it tried to do something assuming it had succeeded earlier but had instead gotten an EACCES error. This could be a subtle issue and hard to track down. But with strace, you could examine the output and see where it failed and go fix it.


There are lots of options to strace, and I don't mean to cover every one, nor explain how it can be useful in every possible circumstance. However, if you're sitting there stuck and can't figure out how to proceed, try strace'ing the program, you may gain valuable clues that can lead to a resolution. Also, don't forget to clean up all those multi-megabyte output files you're gonna create. They can get unweildy after awhile. I wouldn't recommend running strace on anything and walking away. The files can get very huge very quick.


There are lots of ways that strace can help you in your daily work, I have shown you a few. Feel free to experiment with strace on your own system so you can learn how troubleshoot your problems like a pro.

Labels:

0 Comments:

Post a Comment

<< Home