
LENGTH differences: UNIX vs Windows

Hi, I am using a spider written in Perl. All crawled pages are stored in one file, with headers indicating the URL and the content length.

e.g.
CONTENT-PATH: "some url"
CONTENT-LENGTH:  x
[blank line]
[start of content]
[last byte of content]CONTENT-PATH: etc etc

The next set of headers starts at the first byte following the length of the previous content.

The problem I am having is that Perl counts the length using UNIX line endings, and it also seems to count some other things differently. This makes it hard for me to parse the output on my Windows machine.

I thought I had it licked when I simply adjusted the count by treating each line ending encountered as two bytes instead of one (UNIX \n vs Windows \r\n). Performing that adjustment seemed to work, but at some point the parser would go off the rails and lose its place, overshooting or undershooting the next document. I suspect it might have something to do with the way it counts Unicode characters, but I am not sure. Perhaps Perl's length counts a Unicode character as 1 byte and Windows counts it as 2?

So my question is: is there some way to change the Perl program to determine a length that will agree with how Windows would count it?

This is the line in the spider that counts the length in bytes:
$total_length += length $content;
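From what I have read, Perl's length counts characters rather than bytes, so a true byte count might need the string encoded first. Something like this turned up in my searching (untested, and the UTF-8 choice is my assumption):

use Encode qw(encode);
# count the bytes the content occupies once encoded, not the characters
$total_length += length encode('UTF-8', $content);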

In my initial research I turned up an environment variable that sounded like it might be a solution. Some of what I read seemed to indicate that changing the IO library might change the way Perl determines length.

E.g.
$ENV{PERLIO} = 'perlio';

I have almost no experience with Perl, so any help is appreciated: either a way to update the Perl code to provide a count that agrees with Windows, or an idea that helps me translate the Perl length into the Windows length.

FishMongerCommented:
Sounds like you may need to use binmode on the filehandle.

perldoc -f binmode
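For example, roughly like this (a sketch with made-up names, not your spider's actual code):

open my $out, '>', 'crawl.dat' or die "open failed: $!";
binmode $out;             # write raw bytes: no \n -> \r\n translation on Windows
print {$out} $content;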
 
chisholmdAuthor Commented:
I'm sorry, I am not a Perl programmer, so I am not sure what you mean. The spider is fetching the content from the web into a variable, counting its length, then writing it to disk.

Where would I add the line you suggest? It sounds like it would affect how a file is read from disk, not how a string variable is 'read'.

Could you please explain further what that statement would do and where I would add it?

Thanks



 
chisholmdAuthor Commented:
I just realized that I was not very clear about one point. I am parsing the file with C#.NET, not Perl. So the file is written in Perl via spider.pl, but I am parsing it with .NET.

Your comment was probably suggesting that I open the file in binmode when parsing it.


 
FishMongerCommented:
The statement I gave is executed on the command line and outputs the documentation on the usage of the binmode command.  The binmode command can be used on either read or write filehandles.

I don't have a clear picture of what you're trying to accomplish, but I think you might have an easier time by using a unique character or string to separate the headers (records).  Then you can set the input record separator to that unique character or string when you need to parse the data on your windows system.  This will make it easier to loop through and parse the data (record sets).
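For example (a sketch; the separator string and filehandle name are made up):

local $/ = "\n===END-OF-RECORD===\n";   # input record separator
while ( my $record = <$fh> ) {
    chomp $record;                      # chomp strips the separator
    # parse one record here
}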

If you can post the related portion of your code with more details on what you're doing/needing, we should be able to provide a better answer.
 
FishMongerCommented:
I don't know C#.NET so I won't be of much help in that area.  However, when using Perl to gather the info and create the file, I think that using binmode on the output filehandle may help.
 
chisholmdAuthor Commented:
OK, here is the whole story.

I am switching from swish-e to dotLucene, but the spider that is bundled with swish-e is the best I have been able to find, so I am trying to keep using it. (swish-e and dotLucene are search indexers / search engines.)

Spider.pl is the spider from swish-e; it fetches about a GB worth of pages from 2,000 sites in about 5 hours, which is fast enough for me.

Spider.pl simply fetches the pages and stores them in one giant text file in the structure noted above.

I am parsing that file in .NET and indexing it with dotLucene.  

When I parse the file using the byte-count method it is very fast. As a workaround I tried parsing on the string "CONTENT-PATH:", but it is way too slow using that method. (A single character wouldn't work, since I am parsing documents from all over the web; there is no guarantee that a single character, or even "CONTENT-PATH:", would not appear somewhere in the content.) However, the speed difference trumps it in any event, so I really want to be able to translate between how Perl counts bytes and how .NET counts bytes.

The first problem is that Perl's length will count a new line as one byte (\n) but .NET figures it is two bytes. I successfully adjusted for this, but the parser is still encountering something in the file that throws the count off.

So I was hoping to adjust spider.pl to count the length in the same manner as Windows (.NET) does, or to be able to apply a formula as I parse the file without having to match any strings longer than 1 byte. What I mean by the latter is that currently, when I encounter a chr(13) followed by a chr(10), I add one to the number of bytes to read. This looked like it was going to work, but there must be another difference between the way Perl and .NET count the length of a string, because it sometimes loses its way. The last time, it only lost its place after 25,000 documents. I was pretty bummed :(


Family walk time, gotta go




 
chisholmdAuthor Commented:
OK, I'll read up on binmode when writing the file.
 
chisholmdAuthor Commented:
binmode sounds like it's the right thing, but the file is written by redirecting STDOUT to the file. Would binmode still apply?

Here is the output function:

sub output_content {
    my ( $server, $content, $uri, $response ) = @_;

    $server->{indexed}++;

    unless ( length $$content ) {
        print STDERR "Warning: document '", $response->request->uri, "' has no content\n";
        $$content = ' ';
    }


    $server->{counts}{'Total Bytes'} += length $$content;
    $server->{counts}{'Total Docs'}++;


    # ugly and maybe expensive, but perhaps more portable than "use bytes"
    my $bytecount = length pack 'C0a*', $$content;

    # Decode the URL
    my $path = $uri;
    $path =~ s/%([0-9a-fA-F]{2})/chr hex($1)/ge;


    # For Josh
    if ( my $fn = $server->{output_function} ) {
        eval {
            $fn->(  $server, $content, $uri, $response, $bytecount, $path);
        };
        die "output_function died for $uri: $@\n" if $@;
        return;
    }


    my $headers = join "\n",
        'Path-Name: ' .  $path,
        'Content-Length: ' . $bytecount,
        '';

    $headers .= 'Last-Mtime: ' . $response->last_modified . "\n"
        if $response->last_modified;

    # Set the parser type if specified by filtering
    if ( my $type = delete $server->{parser_type} ) {
        $headers .= "Document-Type: $type\n";

    } elsif ( $response->content_type =~ m!^text/(html|xml|plain)! ) {
        $type = $1 eq 'plain' ? 'txt' : $1;
        $headers .= "Document-Type: $type*\n";
    }


    $headers .= "No-Contents: 1\n" if $server->{no_contents};
    print "$headers\n$$content";

    die "$0: Max indexed files Reached\n"
        if $server->{max_indexed} && $server->{counts}{'Total Docs'} >= $server->{max_indexed};
}
 
FishMongerCommented:
Yes, you can use binmode on STDOUT.  However, typically programs of this type will open a separate filehandle and use binmode on that filehandle instead of redirecting stdout.
 
chisholmdAuthor Commented:
OK, so I guess the question is: how do I enable binmode for STDOUT in this script? I'll research this myself, but any pointers are welcome.

Dave

 
FishMongerCommented:
binmode STDOUT;

I'm not sure if this is the best location, but you could put it just before:

print "$headers\n$$content";

But it might be better (more efficient) to put it earlier in the script so that it gets called only once.
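You can also check which translation layers are active on a handle (a quick diagnostic sketch; on Windows you should see a crlf layer until binmode removes it):

# print the PerlIO layers on STDOUT, e.g. "unix crlf" on Windows
print join(' ', PerlIO::get_layers(STDOUT)), "\n";
binmode STDOUT;   # pops the crlf layer, leaving raw bytes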
 
chisholmdAuthor Commented:
I added
   binmode STDOUT;
to the script, directly under the const declarations. I ran the spider twice against two domains, once with and once without the binmode.

Both files report the exact same size for each document. So in hack-n-slash mode, maybe I should try the opposite? :) Assuming that it might be in binmode by default, how would I set it to, umm, "text mode"?

 
ps15Commented:
You _could_ just run the whole thing in Windows, I guess, if you want to get things done... ;)
 
chisholmdAuthor Commented:
Well, I am running both on Win2k3. Perl still counts a new line as one byte and .NET counts it as two.

I tried over a dozen different spiders and many indexing options. The best pair is spider.pl from swish-e and dotLucene. I am putting together something that I am going to use a lot, so I am kind of committed to getting this combo to work.

Any idea what other chars/bytes are interpreted differently in Perl? Perhaps it is only new lines, and I had some glitch in the way I was counting those in my loops. But it didn't lose track until after about 25,000 documents, so probably not; if I had a glitch in identifying and counting new lines, I would have seen a problem much earlier.

I'll have to sleep on it.

 
mjcoyneCommented:
Can you slurp the whole file into a scalar in Perl and just replace all line endings with what .NET expects? Then both programs should count them the same. Something like:

$my_data =~ s/(\r?\n|\r)/\n/g;

or perhaps:

$my_data =~ s/(\r?\n|\r)/\r\n/g;

to go the other way?
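If you do the normalization in spider.pl, it would have to happen in output_content before the byte count is taken, something like this (an untested sketch against the code posted above):

# normalize all line endings before the length is computed and printed
$$content =~ s/(\r?\n|\r)/\n/g;
my $bytecount = length pack 'C0a*', $$content;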
 
chisholmdAuthor Commented:
I'll give that a try, but I thought I had already solved the issue with line endings. Maybe I should do the same thing to any characters over 255? It doesn't make much sense to me, though, because you would think that both Perl and Windows count a double-byte character as, well, double-byte. Maybe not, though.

It's Sunday, so lots of family stuff, but I'll put all this together tonight, try both the crawler and parser again, and then try to close the question.
 
FishMongerCommented:
If you're going to use a regex to change the line endings, this one is a little cleaner and probably slightly more efficient:

$my_data =~ s/[\r\n]+/\n/g;
 
ahoffmannCommented:
> PERL still counts a new line as one byte and .NET as two.
This is wrong for the Perl part: Perl counts exactly what it finds, no +/-1. I don't know what .NET does.

That means Perl does not count "differently" in text mode versus binmode, or when it finds \r\n. I guess you need to dig into .NET and find out what it really does.

BTW, Perl also does not behave differently on UNIX or Windows with regard to line endings, unless you have configured it to do otherwise.
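E.g., a tiny check:

my $s = "a\r\nb";
print length $s;   # 4 on any platform; length never folds the \r\n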
 
kblack05Commented:
There's a WAY easier way to handle this.

Simply translate or remove the line endings:

s/\r//g;
s/\n//g;

Or, for a specific string instead of the default $_:

$string =~ s/\r//g;

Have you tried this??

Regards,

~K Black~
If Linux were my God, Slackware would be my religion.
 
chisholmdAuthor Commented:
Remember that I ran a test with and without [ binmode STDOUT; ] and reported that the content-length header in both versions reported the same length?

Well, there was a difference, and it was obvious when I got around to opening both versions in a hex editor: the version without binmode had all its 0a line endings replaced with 0d0a (LF with CRLF).

So ahoffmann was correct in that Perl is simply reading each byte as it is; the difference was introduced when it was writing to disk.
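A tiny script shows the same thing (file names made up):

open my $t, '>', 'text_mode.out' or die $!;
print {$t} "x\n";   # on Windows this lands on disk as "x\r\n" (3 bytes)
close $t;

open my $b, '>', 'bin_mode.out' or die $!;
binmode $b;
print {$b} "x\n";   # written as "x\n" (2 bytes) on any platform
close $b;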

My first attempt at a solution was simply to add 1 to the content length for each line ending I encountered as I parsed. This probably failed because some crawled pages had CRLFs while others didn't, and I had no way of knowing which, so my count was bound to fail eventually.

I am doing a test tonight on a 100-site crawl, but all indications are that this is fixed.

Thanks muchly.



