Script to prevent a ZFS hang.

p0sreed
I have an OpenSolaris Sun machine whose ZFS file system hangs most of the time.

When we issue "zfs list", it hangs with no result. I have to execute "echo | format" or the "format" command to clear the hang.

I want to write a Perl script that automates this every time the ZFS file system hangs. The algorithm is below.

1. execute "zfs list | wc -l" and store the value in a scalar
2. check scalar is empty or not
3. if scalar value is empty, execute "echo | format " command
4. else exit

I want to set up this script as a cron job that runs every minute.
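
For reference, a crontab entry that runs a job every minute uses an asterisk in all five time fields. This is a minimal sketch: the script path /usr/local/bin/zfs_unhang.pl and the log path are hypothetical and not from the thread, and the script is assumed to be executable.

```
# minute hour day-of-month month day-of-week  command
* * * * * /usr/local/bin/zfs_unhang.pl >> /var/tmp/zfs_unhang.log 2>&1
```

cron normally mails any output a job produces to the owner of the crontab, so redirecting to a file keeps a once-a-minute job from generating mail.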
Can you show the output of your 'zfs list | wc -l' in your two scenarios (one where it's working, the other where it's hung)? With it, you can probably get a more definitive answer. Without it, maybe this:
#!/usr/bin/perl
system("echo | format") if `zfs list | wc -l`;


Author

Commented:
The problem with that script is that "zfs list | wc -l" will itself hang, so it never returns a result to the test and the format command never gets executed.
Sorry... meant:
#!/usr/bin/perl
system("echo | format") unless `zfs list | wc -l`;
 


Author

Commented:
Where will the output of "echo | format" go to?
So it's really not:

1. execute "zfs list | wc -l" and store the value in a scalar
2. check scalar is empty or not
3. if scalar value is empty, execute "echo | format " command
4. else exit

It's:

1. execute "zfs list | wc -l"
2. if it hangs, execute "echo | format " command
3. else exit

Correct?  How long of a wait before you consider it hung?  A few seconds?

Author

Commented:
Yes, correct. I would give it 10 seconds to produce output; otherwise it is hung. On the other hand, the point of using "wc -l" was to rule out a timeout: if "wc -l" doesn't produce a number, we can execute the format command.
Try:
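eval {
        # If zfs list has not returned within 10 seconds, treat it as hung:
        # run the workaround, then exit.
        local $SIG{ALRM} = sub { system("echo | format"); exit; };
        alarm 10;
        $data = `zfs list`;
        alarm 0;        # zfs list returned in time, so cancel the pending alarm
};
if ($@) {
        print "Error: $@\n";
}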


Author

Commented:
Thank you. I have no way of testing this script right now, as I don't have a system hang at the moment. If you do, please let me know. Otherwise, I will put this script in cron, monitor the system for a week, and get back to you.
I tested it with a command that sleeps for 60 seconds (rather than zfs) and it worked, but YMMV.  I'm fairly certain the timeout on the alarm will catch the hang, but I'm not so sure about the echo | format clearing your zfs problem via the perl script within a cron job.  Give it a try and let us know.

BTW, I forgot to include the shebang in the code I posted. If you're calling it directly, don't forget it like I did. :-)
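
The timeout idea discussed above can also be sketched without an alarm handler, by running the command in a child process and polling it. This is only an illustration under stated assumptions: finishes_within is a hypothetical helper name, not something from the thread, and the 10-second limit is the value the asker suggested. It has been reasoned through with sleep-style commands only, not with a real ZFS hang.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX ':sys_wait_h';    # provides WNOHANG

# Hypothetical helper (not from the thread): run a command in a child
# process and report whether it exited within $timeout seconds.
# Returns 1 if the command finished in time, 0 if it did not.
sub finishes_within {
    my ($timeout, @cmd) = @_;

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: discard output, we only care whether the command returns.
        open(STDOUT, '>', '/dev/null');
        open(STDERR, '>', '/dev/null');
        exec(@cmd) or POSIX::_exit(127);
    }

    # Parent: poll once per second without blocking on the child.
    for (1 .. $timeout) {
        return 1 if waitpid($pid, WNOHANG) == $pid;
        sleep 1;
    }
    return 1 if waitpid($pid, WNOHANG) == $pid;

    # Best effort only: a process stuck inside the kernel may not die
    # until the hang itself clears.
    kill 'KILL', $pid;
    return 0;
}

# Treat "zfs list" as hung after 10 seconds and run the workaround.
system('echo | format') unless finishes_within(10, 'zfs', 'list');
```

Because the parent never waits unconditionally on the child, it can still go on to run "echo | format" even if the hung process cannot be killed. Note that a command that fails to start also counts as "finished" in this sketch.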
Brian Utterback, Principal Software Engineer
Commented:
You should report this as it is a bug. The next time you have a hang in zfs list, run this command as root:

mdb -k <<EOF
::pgrep zfs | ::walk thread | ::findstack
EOF

Post the output along with the output of "uname -a".
