Solved

Removing duplicate URLs from search cache

Posted on 1998-08-26
Last Modified: 2010-03-04
I made a custom search script and I'd like to remove all the duplicate URLs that get mixed into the results.
My script returns via STDOUT:

Title:
Description:
URL:

I am thinking I had better combine the title and URL prior to sorting and dupe removal. I can sort it; I have just never seen anything on dupe removal in any of the Perl references I have, and I am not even sure what the proper method is for combining the title and URL into a one-line link.
Question by:Biffo
LVL 5

Expert Comment

by:b2pi
ID: 1204545
I'm a little unsure what you're asking for, as it seems that there are
two different questions here.

1.) Removing duplicates and combining

Throw each URL into a hash, using the URL as the key.  For instance, if
your search function comes up with the following URLs

HTTP://www.abc.com/a.html
HTTP://www.def.com/b.html
HTTP://www.abc.com/a.html
HTTP://www.abc.com/c.html
HTTP://www.def.com/a.html

then you say something like

$title = '' unless defined($title);
$URLS{$url} = $title;

then any duplicate URLs will disappear, because a hash key can only
occur once: a repeated URL simply overwrites its earlier entry.
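
As a rough illustration of the hash approach described above, here is a minimal, self-contained sketch. The dedupe_urls helper and the page titles are made up for the example; only the URLs come from the list above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical search results as [url, title] pairs. The URLs are the ones
# listed above; the titles are invented for illustration.
my @results = (
    [ 'HTTP://www.abc.com/a.html', 'Page A' ],
    [ 'HTTP://www.def.com/b.html', 'Page B' ],
    [ 'HTTP://www.abc.com/a.html', 'Page A' ],
    [ 'HTTP://www.abc.com/c.html', 'Page C' ],
    [ 'HTTP://www.def.com/a.html', undef    ],
);

# Return a hash ref mapping each distinct URL to its title.
# (dedupe_urls is a hypothetical helper name, not part of any library.)
sub dedupe_urls {
    my @pairs = @_;
    my %urls;
    for my $pair (@pairs) {
        my ($url, $title) = @$pair;
        $title = '' unless defined $title;
        $urls{$url} = $title;    # a repeated URL overwrites its earlier entry
    }
    return \%urls;
}

my $urls = dedupe_urls(@results);
print scalar(keys %$urls), " unique URLs\n";    # prints "4 unique URLs"
print "$_\n" for sort keys %$urls;
```

Hash keys are compared as exact strings, so two URLs that differ only in case or in a trailing slash would still count as different entries. If that matters, normalize the URL before using it as a key.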

2.) Sorting, combining, etc.
To sort these by url, you can just

foreach (sort keys %URLS) {
   ## Do whatever you want
}

If you need to do that case-insensitive...
foreach (sort {uc($a) cmp uc($b)} keys %URLS) {
   ## Do whatever you want
}

Finally, if you want to sort by title...
foreach (sort {$URLS{$a} cmp $URLS{$b}} keys %URLS ) {
   ## Do whatever you want
}

I'll let you figure out how to do that last one with no regard to case
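
For readers who want the case-insensitive title sort spelled out, one possible sketch follows. It uppercases both titles before comparing, mirroring the uc() comparison used for URLs above. The urls_by_title_nocase helper, the sample data, and the tie-break on URL are assumptions added for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical data: URL => title.
my %URLS = (
    'http://www.abc.com/a.html' => 'perl how-to',
    'http://www.def.com/b.html' => 'Apache notes',
    'http://www.abc.com/c.html' => 'Zope basics',
);

# Return the URLs ordered by title, ignoring case.
# Titles that compare equal fall back to ordering by URL.
# Call this in list context; sort is not meaningful in scalar context.
sub urls_by_title_nocase {
    my ($urls) = @_;
    return sort {
        uc($urls->{$a}) cmp uc($urls->{$b}) or $a cmp $b
    } keys %$urls;
}

for my $url (urls_by_title_nocase(\%URLS)) {
    print "$URLS{$url}: $url\n";
}
```

With a plain cmp on the titles, 'Zope basics' would sort ahead of 'perl how-to', because uppercase letters come before lowercase ones in ASCII order. Comparing the uppercased copies avoids that.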

LVL 5

Expert Comment

by:b2pi
ID: 1204546
Oh, well, that should have been an answer, nu?
LVL 2

Author Comment

by:Biffo
ID: 1204547
This is how my results appear below; notice the title, brief description and URL. As you can see, the results need a little formatting to make them presentable :-)


1. (title: modperl Archive: Re: Problem Compiling Mod-Perl -- description: Problems Compiling Mod-Perl -- httpd...
http://outside.organic.com/mail-rchives/modperl/

2. (title: Perl 5 How-To,
description: Perl 5 How-To. The Definitive Perl Programming Problem-Solver. Author: Aidan Humphreys Mike Glover Ed Weiss Publishing Information Publication Date: May...,
http://www.techexpo.com/bookfair/macmillan/perl5ht.html

LVL 5

Accepted Solution

by:
b2pi earned 190 total points
ID: 1204548
If what you're showing is what you want to parse, and those are each
coming in on one line....

 
#!/usr/bin/perl -w

use strict;

my($title, $desc, $url, %URLS);

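while (<>) {
    ## Skip lines that don't match, so stale captures from an
    ## earlier line are never reused
    next unless m/title: (.*) description: (.*)(http:.*)$/;
    ($title, $desc, $url) = ($1, $2, $3);
    ## A repeated URL overwrites its earlier entry, which removes dupes
    $URLS{$url}->{title} = $title;
    $URLS{$url}->{desc} = $desc;
}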

print "\n\n\nHere goes\n";
## This is sorted by url in a case sensitive way
my($i) = 0;
foreach (sort keys %URLS) {
    ## Pre-increment so the numbering starts at 1, as in the sample results
    print ++$i,".) URL:$_\n";
    print "\tTitle:\t$URLS{$_}->{title}\n";
    print "\tDescription:\t$URLS{$_}->{desc}\n";
}
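
The sample results posted earlier include a record that spans several lines, which the line-at-a-time loop above would skip. As a hedged sketch only, the variant below reads whole records in paragraph mode instead. It assumes records are separated by blank lines and that the URL is the first http: token after the description; neither is confirmed in the thread. The parse_record and collect_records helpers are hypothetical names, the second sample description is abbreviated, and the third sample record is a made-up repeat added to show a duplicate being dropped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample input modelled on the results posted above (see the assumptions
# in the lead-in: abbreviated description, invented duplicate record 3).
my $sample = <<'END';
1. (title: modperl Archive: Re: Problem Compiling Mod-Perl -- description: Problems Compiling Mod-Perl -- httpd...
http://outside.organic.com/mail-rchives/modperl/

2. (title: Perl 5 How-To,
description: Perl 5 How-To. The Definitive Perl Programming Problem-Solver.
http://www.techexpo.com/bookfair/macmillan/perl5ht.html

3. (title: Perl 5 How-To,
description: Perl 5 How-To. The Definitive Perl Programming Problem-Solver.
http://www.techexpo.com/bookfair/macmillan/perl5ht.html
END

# Parse one record into (title, description, url).
# Returns an empty list when the record does not fit the assumed layout.
sub parse_record {
    my ($record) = @_;
    # /s lets "." cross the line breaks inside a record; the optional
    # "--" or "," is the separator seen before "description:" in the samples.
    if ($record =~ m/title:\s*(.*?)\s*(?:--|,)?\s*description:\s*(.*?)\s*(http:\S+)/s) {
        return ($1, $2, $3);
    }
    return;
}

# Read records in paragraph mode and keep one entry per URL.
sub collect_records {
    my ($text) = @_;
    my %urls;
    open my $fh, '<', \$text or die "Cannot read sample: $!";
    local $/ = '';    # paragraph mode: each read returns one blank-line-separated record
    while (my $record = <$fh>) {
        my ($title, $desc, $url) = parse_record($record) or next;
        $urls{$url} = { title => $title, desc => $desc };
    }
    close $fh;
    return \%urls;
}

my $urls = collect_records($sample);
my $i = 0;
for my $url (sort keys %$urls) {
    printf "%d.) URL: %s\n\tTitle:\t%s\n\tDescription:\t%s\n",
        ++$i, $url, $urls->{$url}{title}, $urls->{$url}{desc};
}
```

To run this against real script output rather than the embedded sample, the same loop could read from STDIN in place of the in-memory filehandle. Trailing punctuation in a description is kept as-is.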

LVL 5

Expert Comment

by:b2pi
ID: 1204549
By the way, it's considered courteous to grade questions when you get
an answer.  (You have two ungraded questions right now, some effort
was put forth on your behalf because you requested it... it would seem
proper to at least acknowledge that effort)