Solved

Process restarting using the core file

Posted on 2003-03-30
Last Modified: 2010-04-21
Hi,

I had a process which dumped core, not because of any "fault" (segmentation fault, illegal instruction etc.) but because I "told" it to do so by sending the SIGQUIT signal.

Now, I'm trying to "restart" the process from the state it was in when the core was dumped.

I try to do this as follows:
1) There is a process (say P) which reads in register values, memory dumps etc. from the core file.
2) P does a fork(), the child does a PTRACE_TRACEME before exec()ing the original executable
3) P then stops the child at every instruction (PTRACE_SINGLESTEP) and waits for the child to reach the address of main() [P checks the eip register of the child after every instruction].
4) Now that the child is stopped at main(), P modifies the address space of the child with values that were there in the core file. This includes the stack (which is also present in the core file)
5) P also sets the registers of the child to the values as found in the core file
6) P does a PTRACE_DETACH
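
Steps 2 and 3 above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the code used in this question: `single_step_child` is a hypothetical helper name, and instead of comparing the instruction pointer against the address of main() (which needs architecture-specific register reads) it simply stops after a fixed number of single steps and then kills the child.

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run `path` under ptrace and single-step it up to
 * max_steps times. Returns the number of steps taken, or -1 on error. */
long single_step_child(const char *path, long max_steps)
{
    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        /* Child: ask to be traced, then exec the target. With tracing on,
         * the kernel stops the child with SIGTRAP once exec succeeds. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
            _exit(126);
        execl(path, path, (char *)NULL);
        _exit(127);                 /* only reached if exec failed */
    }

    int status;
    if (waitpid(child, &status, 0) < 0)
        return -1;
    if (!WIFSTOPPED(status))
        return -1;                  /* child exited instead of stopping */

    long steps = 0;
    int alive = 1;
    while (steps < max_steps) {
        if (ptrace(PTRACE_SINGLESTEP, child, NULL, NULL) < 0)
            break;
        if (waitpid(child, &status, 0) < 0)
            break;
        if (!WIFSTOPPED(status)) {  /* child exited or was killed */
            alive = 0;
            break;
        }
        steps++;
        /* A real restarter would read the registers here (on x86, with
         * PTRACE_GETREGS) and leave the loop once the instruction pointer
         * equals the address of main(). */
    }

    if (alive) {
        /* This sketch does not restore any state; it just cleans up. */
        kill(child, SIGKILL);
        waitpid(child, &status, 0);
    }
    return steps;
}
```

Calling something like `single_step_child("/bin/sh", 50)` should return a positive count on a system where tracing a child is permitted. Steps 4 to 6 (writing memory, setting registers, detaching) would replace the cleanup at the end.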

In theory I believe this should have been sufficient to get the process to restart in the state it was in when core was dumped.

Unfortunately, when I try this the child executes many instructions but eventually encounters a segmentation fault and dumps core (this time it wasn't intentional!!). It seems that this fault is encountered the first time the child executes an instruction in its "own" memory area (address 0x804ba84). This address seems to be invalid.

Does anyone have any ideas about possible flaws in the method stated above? Or what could be causing the child to go to an invalid instruction??

I searched for information on restarting from the core but most people seem to suggest that you can't restart using the core file, just use the core to examine faults etc. There is a Solaris utility called undump() but that seems to restart from main() and thus only restores values of global variables.

A basic question: given that the core file DOES have the stack, WHAT prevents us from restarting the process from the exact point it was at when the core was dumped?

I apologize for any vagueness in the description above, but any help will be greatly appreciated.

Sincerely,

-- Asim

Question by:asim_shankar
6 Comments
 
LVL 51

Expert Comment

by:ahoffmann
ID: 8237176
did you try to do it with dbx?
 

Author Comment

by:asim_shankar
ID: 8237237
Unfortunately, I'm not very familiar with dbx. Is it possible to continue a process checkpointed by the core using dbx? If so, could you please tell me how?

However, I'm trying to do this (continue a process checkpointed by the core file) programmatically. So if it can be done by dbx, I'd like to know HOW dbx does it.
 

Author Comment

by:asim_shankar
ID: 8237317
Oh, I'd like to clarify that I understand that there are issues with the "meaning" of restarting a process (what happens to open files/sockets etc.) - many of which are nicely put in http://sources.redhat.com/ml/gdb/2001-12/msg00130.html and are probably reasons why debuggers cannot restart a process from the state in the core file.

However, I'm looking to restart only simple compute processes. A process like:

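#include <stdio.h>

int main(void)
{
   int i = 0;   /* initialized; the original left i indeterminate */
   while (1)
      printf("%d\n", i++);   /* print an ever-increasing counter */
}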

IMHO, I should have sufficient information in the core file and with the approach outlined above should be able to make it. Opinions?

 
LVL 5

Expert Comment

by:bryanh
ID: 8263139
I think even a program that simple has lots of state that you are not restoring (and Linux doesn't give you a way to restore).  The GNU C library is fairly complex.

And maybe you just have a bug in your code that reloads registers.  Maybe you're not restoring one, or restoring one that shouldn't be restored.  Segment registers or such.

This shouldn't be too hard to debug.  Unless you're doing something weird, address 0x804ba84 ought to contain instructions.  In the dump of the 2nd crash, is there a 0x804ba84?  Sounds like an address space problem to me -- your program exists in 0x804ba84 in one address space, and you are trying to branch to 0x804ba84 in another one.
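
One way to check whether an address such as 0x804ba84 is actually mapped in the child is to look it up in /proc/<pid>/maps while the child is stopped. The following is a minimal sketch assuming Linux with /proc mounted; `address_is_mapped` is a hypothetical helper name, not part of any library.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if addr lies inside one of the mappings
 * of process pid, 0 if it does not, and -1 if the maps file is unreadable. */
int address_is_mapped(pid_t pid, unsigned long addr)
{
    char path[64];
    snprintf(path, sizeof path, "/proc/%ld/maps", (long)pid);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    /* Each line starts with "start-end" in hex; the buffer is sized so a
     * line with a long path name still fits in one fgets() call. */
    char line[4352];
    int found = 0;
    while (fgets(line, sizeof line, f)) {
        unsigned long start, end;
        if (sscanf(line, "%lx-%lx", &start, &end) == 2 &&
            addr >= start && addr < end) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}
```

Running this against the child just before the detach, with the saved instruction pointer as `addr`, would show whether the branch target exists in the new address space at all.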
 

Author Comment

by:asim_shankar
ID: 8264108
Well, I was able to debug my program, and it was quite a stupid mistake. So the state of things now is that I CAN restart a process (like the one mentioned above) without any trouble. So I guess there IS enough information in the core file to restart a process.

However, there is one problem with the method outlined above. I stop the child at main(), so any memory pages that were allocated during the original execution aren't part of the address space now. The core file contains the values of these addresses but since they are not part of the child's address space yet, I can't write to them.

Is there a way to add pages to the child's address space from the parent?
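
To see which address ranges the core file expects to exist, the PT_LOAD program headers of the core (which is an ELF file) can be listed and compared against the child's current mappings. The following is a rough sketch only, assuming the core was written on the same architecture and word size as the program reading it; `list_load_segments` is a hypothetical helper name.

```c
#include <elf.h>
#include <link.h>    /* ElfW() selects the native 32- or 64-bit ELF types */
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: print every PT_LOAD segment of an ELF file and
 * return how many were found, or -1 on error. */
int list_load_segments(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    ElfW(Ehdr) eh;
    if (fread(&eh, sizeof eh, 1, f) != 1 ||
        memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_phentsize != sizeof(ElfW(Phdr))) {   /* wrong class or not ELF */
        fclose(f);
        return -1;
    }

    int count = 0;
    for (unsigned i = 0; i < eh.e_phnum; i++) {
        ElfW(Phdr) ph;
        long off = (long)eh.e_phoff + (long)i * (long)eh.e_phentsize;
        if (fseek(f, off, SEEK_SET) != 0 || fread(&ph, sizeof ph, 1, f) != 1) {
            fclose(f);
            return -1;
        }
        if (ph.p_type != PT_LOAD)
            continue;
        /* p_vaddr is where the segment lived in the dumped process;
         * p_filesz bytes of it are stored in the file at p_offset. */
        printf("vaddr 0x%llx  memsz 0x%llx  filesz 0x%llx\n",
               (unsigned long long)ph.p_vaddr,
               (unsigned long long)ph.p_memsz,
               (unsigned long long)ph.p_filesz);
        count++;
    }
    fclose(f);
    return count;
}
```

Any range printed here that has no counterpart in the child stopped at main() is one of the pages that cannot yet be written to.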
 
LVL 5

Accepted Solution

by:bryanh (earned 750 total points)
ID: 8270849
I think what we know now is that there is enough information in the core file to restart a certain kind of process that failed in a certain way.  I still think the scope of processes/failures where this works is quite narrow.

>Is there a way to add pages to the child's address space
>from the parent?

I rather doubt it.  That would be a ptrace kind of thing, and of course ptrace doesn't have that function.