asked on

Process restarting using the core file

Hi,

I had a process which dumped core, not because of any "fault" (segmentation fault, illegal instruction etc.) but because I "told" it to do so by sending the SIGQUIT signal.

Now, I'm trying to "restart" the process from the state it was in when the core was dumped.

I try to do this as follows:
1) There is a process (say P) which reads in register values, memory dumps etc. from the core file.
2) P does a fork(), the child does a PTRACE_TRACEME before exec()ing the original executable
3) P then stops the child at every instruction (PTRACE_SINGLESTEP) and waits for the child to reach the address of main() [P checks the eip register of the child after every instruction].
4) Now that the child is stopped at main(), P modifies the address space of the child with values that were there in the core file. This includes the stack (which is also present in the core file)
5) P also sets the registers of the child to the values as found in the core file
6) P does a PTRACE_DETACH

In theory I believe this should have been sufficient to get the process to restart in the state it was in when core was dumped.

Unfortunately, when I try this the child executes many instructions but eventually encounters a segmentation fault and dumps core (this time it wasn't intentional!!). It seems that this fault is encountered the first time the child executes an instruction in its "own" memory area (address 0x804ba84). This address seems to be invalid.

Does anyone have any ideas about possible flaws in the method stated above? Or what could be causing the child to go to an invalid instruction??

I searched for information on restarting from the core but most people seem to suggest that you can't restart using the core file, just use the core to examine faults etc. There is a Solaris utility called undump() but that seems to restart from main() and thus only restores values of global variables.

A basic question is that considering that the core file DOES have the stack, WHAT prevents us from being able to restart the process from the exact point it was when the core was dumped?

I apologize for any vagueness in the description above, but any help will be greatly appreciated.

Sincerely,

-- Asim

ahoffmann

did you try to do it with dbx?

asim_shankar

ASKER

Unfortunately, I'm not very familiar with dbx. Is it possible to continue a process checkpointed by the core using dbx?? If so, could you please tell me how.

However, I'm trying to do this (continue a process checkpointed by the core file) programatically. So if it can be done by dbx, I'd like to know HOW dbx does it.

asim_shankar

ASKER

Oh, I'd like to clarify that I understand that there are issues with the "meaning" of restarting a process (what happens to open files/sockets etc.) - many of which are nicely put in http://sources.redhat.com/ml/gdb/2001-12/msg00130.html and are probably reasons why debuggers cannot restart a process from the state in the core file.

However, I'm looking to restart only simple compute processes. A process like:

main()
{
int i;
while(1)
printf("%d\n",i++);
}

IMHO, I should have sufficient information in the core file and with the approach outlined above should be able to make it. Opinions?

bryanh

I think even a program that simple has lots of state that you are not restoring (and Linux doesn't give you a way to restore). The GNU C library is fairly complex.

And maybe you just have a bug in your code that reloads registers. Maybe you're not restoring one, or restoring one that shouldn't be restored. Segment registers or such.

This shouldn't be too hard to debug. Unless you're doing something weird, address 0x804ba84 ought to contain instructions. In the dump of the 2nd crash, is there a 0x804ba84? Sounds like an address space problem to me -- your program exists in 0x804ba84 in one address space, and you are trying to branch to 0x804ba84 in another one.

asim_shankar

ASKER

Well, I was able to debug my program, and it was quite a stupid mistake. So the state of things now is that I CAN restart a process (like the baove mentioned above) without any trouble. So I guess there IS enough information in the core file to restart a process.

However, there is one problem with the method outlined above. I stop the child at main(), so any memory pages that were allocated during the original execution aren't part of the address space now. The core file contains the values of these addresses but since they are not part of the child's address space yet, I can't write to them.

Is there a way to add pages to the child's address space from the parent?

ASKER CERTIFIED SOLUTION

bryanh

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial