Process restarting using the core file

Hi,

I had a process which dumped core, not because of any "fault" (segmentation fault, illegal instruction etc.) but because I "told" it to do so by sending the SIGQUIT signal.

Now, I'm trying to "restart" the process from the state it was in when the core was dumped.

I try to do this as follows:
1) There is a process (say P) which reads in register values, memory dumps etc. from the core file.
2) P does a fork(), the child does a PTRACE_TRACEME before exec()ing the original executable
3) P then stops the child at every instruction (PTRACE_SINGLESTEP) and waits for the child to reach the address of main() [P checks the eip register of the child after every instruction].
4) Now that the child is stopped at main(), P modifies the address space of the child with values that were there in the core file. This includes the stack (which is also present in the core file)
5) P also sets the registers of the child to the values as found in the core file
6) P does a PTRACE_DETACH

In theory I believe this should have been sufficient to get the process to restart in the state it was in when core was dumped.

Unfortunately, when I try this the child executes many instructions but eventually encounters a segmentation fault and dumps core (this time it wasn't intentional!!). It seems that this fault is encountered the first time the child executes an instruction in its "own" memory area (address 0x804ba84). This address seems to be invalid.

Does anyone have any ideas about possible flaws in the method stated above? Or what could be causing the child to go to an invalid instruction??

I searched for information on restarting from the core but most people seem to suggest that you can't restart using the core file, just use the core to examine faults etc. There is a Solaris utility called undump() but that seems to restart from main() and thus only restores values of global variables.

A basic question is that considering that the core file DOES have the stack, WHAT prevents us from being able to restart the process from the exact point it was when the core was dumped?

I apologize for any vagueness in the description above, but any help will be greatly appreciated.

Sincerely,

-- Asim

asim_shankarAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

ahoffmannCommented:
did you try to do it with dbx?
0
asim_shankarAuthor Commented:
Unfortunately, I'm not very familiar with dbx. Is it possible to continue a process checkpointed by the core using dbx?? If so, could you please tell me how.

However, I'm trying to do this (continue a process checkpointed by the core file) programatically. So if it can be done by dbx, I'd like to know HOW dbx does it.
0
asim_shankarAuthor Commented:
Oh, I'd like to clarify that I understand that there are issues with the "meaning" of restarting a process (what happens to open files/sockets etc.) - many of which are nicely put in http://sources.redhat.com/ml/gdb/2001-12/msg00130.html and are probably reasons why debuggers cannot restart a process from the state in the core file.

However, I'm looking to restart only simple compute processes. A process like:

main()
{
   int i;
   while(1)
      printf("%d\n",i++);
}

IMHO, I should have sufficient information in the core file and with the approach outlined above should be able to make it. Opinions?
0
Cloud Class® Course: Microsoft Exchange Server

The MCTS: Microsoft Exchange Server 2010 certification validates your skills in supporting the maintenance and administration of the Exchange servers in an enterprise environment. Learn everything you need to know with this course.

bryanhCommented:
I think even a program that simple has lots of state that you are not restoring (and Linux doesn't give you a way to restore).  The GNU C library is fairly complex.

And maybe you just have a bug in your code that reloads registers.  Maybe you're not restoring one, or restoring one that shouldn't be restored.  Segment registers or such.

This shouldn't be too hard to debug.  Unless you're doing something weird, address 0x804ba84 ought to contain instructions.  In the dump of the 2nd crash, is there a 0x804ba84?  Sounds like an address space problem to me -- your program exists in 0x804ba84 in one address space, and you are trying to branch to 0x804ba84 in another one.
0
asim_shankarAuthor Commented:
Well, I was able to debug my program, and it was quite a stupid mistake. So the state of things now is that I CAN restart a process (like the baove mentioned above) without any trouble. So I guess there IS enough information in the core file to restart a process.

However, there is one problem with the method outlined above. I stop the child at main(), so any memory pages that were allocated during the original execution aren't part of the address space now. The core file contains the values of these addresses but since they are not part of the child's address space yet, I can't write to them.

Is there a way to add pages to the child's address space from the parent?
0
bryanhCommented:
I think what we know now is that there is enough information in the core file to restart a certain kind of process that failed in a certain way.  I still think the scope of processes/failures where this works is quite narrow.

>Is there a way to add pages to the child's address space
>from the parent?

I rather doubt it.  That would be a ptrace kind of thing, and of course ptrace doesn't have that function.
0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Linux OS Dev

From novice to tech pro — start learning today.

Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.