Solved

Regular expression

Posted on 2002-06-05
Last Modified: 2013-12-25
Hi friends

I want to extract the name of the background image from a body tag. A typical tag may contain attributes such as text, vlink, bgcolor, background, marginheight, leftmargin, etc., or only some of them, or none of them, with varying amounts of whitespace and quoting. For example, I want to extract "tree.gif" from the tag below:  <BODY marginwidth="0" marginheight="0"  background= "tree.gif" leftmargin=0 topmargin="0" link="#330099" vlink="#330066" alink="#330066" >

Please advise...
Thanks
Question by:boolee
10 Comments
 
LVL 51

Expert Comment

by:ahoffmann
ID: 7055951
s/.*background="?([^" ]*)"?.*/$1/
0
 
LVL 15

Expert Comment

by:samri
ID: 7056244
ahoffmann,

and that is in Perl, right?
0
 
LVL 51

Expert Comment

by:ahoffmann
ID: 7056347
right
(or is somebody doing real/advanced regular expressions in another language *and* using it for CGI? ;-)
0
 
LVL 10

Expert Comment

by:rj2
ID: 7057283
boolee,
HTML may contain (almost) arbitrary whitespace.
The code below should also work if there is a line break before the background filename (as there is in the sample you posted).

#!/usr/bin/perl
use strict;

$_=<<ENDHTML;
<BODY marginwidth="0" marginheight="0"  background=
"tree.gif" leftmargin=0 topmargin="0" link="#330099" vlink="#330066" alink="#330066" >
ENDHTML


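# An unanchored match finds the attribute anywhere, so no leading (.|\s)* is needed.
# \s* also matches line breaks; /i accepts BACKGROUND as well as background.
if (/background\s*=\s*"([^"]*)"/i) {
    print "$1\n";
}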
0
 
LVL 51

Expert Comment

by:ahoffmann
ID: 7057646
The \n and \r problem is fixed too, but what about a missing ", or pathnames containing whitespace (which is not allowed according to the RFC, but M$ makes it work)?

I simply forgot the disclaimer in my very first comment:
   to be improved in many ways
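
For readers following along, here is a minimal sketch of one way to cover the cases raised above: missing quotes, single quotes, whitespace inside a quoted name, and line breaks around the "=". The helper name extract_background is made up for illustration, and the pattern assumes that no ">" appears inside an attribute value. It is a regex approximation, not a full HTML parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the background value of the first <body> tag,
# or nothing (undef in scalar context) if no such attribute is found.
sub extract_background {
    my ($html) = @_;

    # Capture the attribute text of the first <body ...> tag.
    # Assumption: no '>' occurs inside an attribute value.
    return unless $html =~ /<body\b([^>]*)>/i;
    my $attrs = $1;

    # The value may be "double quoted", 'single quoted', or bare.
    # (?<![\w-]) avoids matching the tail of a longer attribute name.
    if ($attrs =~ /(?<![\w-])background\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/i) {
        return defined $1 ? $1 : defined $2 ? $2 : $3;
    }
    return;
}

my $tag = qq{<BODY marginwidth="0" marginheight="0"  background=\n"tree.gif" leftmargin=0 topmargin="0">};
print extract_background($tag), "\n";    # prints: tree.gif
```

Quoted values may contain blanks here, while an unquoted value ends at the first whitespace, which is the best a pattern can do without guessing.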
0
 
LVL 1

Author Comment

by:boolee
ID: 7058508
Hi all

Let me explain the real problem. I am writing a mail-handling program and I want to display the embedded image or background image from an incoming mail in a table. I have an HTML template of the form

<HTML>
    <BODY>
    <TABLE>
    <TR> <TD BACKGROUND="MY FILENAME"></TD></TR>
    .............................
    .............................

The problem is that when I receive a mail from Outlook/Eudora/Netscape Messenger with a stationery background image, the mail text itself is an HTML file with a tag <BODY BACKGROUND="an image"....>, and this tag overrides my earlier declaration, so the image appears as the page background. I am not sure which body tag attributes the different mail programs use, so I want to extract the filename and put it into my table instead. There may be several body tags, depending on how many times the mail has been replied to or forwarded. Please suggest a solution.

ahoffmann's answer worked for most of my attribute combinations, except a few ... but since different mail programs send the body tag in different ways, I am looking for something robust...

Thanks
0
 
LVL 51

Expert Comment

by:ahoffmann
ID: 7058694
If you have a mail with several replies and forwards embedded, you probably also have more than one <BODY> and/or <HTML> tag.
In general, the whole thing is then no longer a valid HTML syntax tree, because you can never stop someone from writing their comments anywhere in between the previous text.

You need to identify the <BODY> you're interested in, or you need to change them all.
I suggest changing them all. But keep in mind that someone may send a mail that contains the string literal
   <body background=this-all-together-is-a-literal-string>
which must not be changed at all.
That means you cannot get a 100% solution.

s/.*<body\s+.*background="?([^" ]*)"?.*/$1/ig
performed on all your lines should give you most of the required images (keep the previous comments about \n, \r and blanks in mind)
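
As a rough sketch of the "look at every <BODY>" idea: rather than working line by line, one option is to hold the whole message in one string and walk over each body tag with a /g match, so that line breaks inside a tag do not matter. The name all_backgrounds is invented for this example, and it assumes no ">" appears inside an attribute value.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collect the background value of every <body> tag
# found in the message text, in order of appearance.
sub all_backgrounds {
    my ($html) = @_;
    my @images;

    # /g in a while loop steps through every <body ...> tag in turn.
    while ($html =~ /<body\b([^>]*)>/gi) {
        my $attrs = $1;
        # Accept double-quoted, single-quoted, or unquoted values.
        if ($attrs =~ /(?<![\w-])background\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/i) {
            push @images, defined $1 ? $1 : defined $2 ? $2 : $3;
        }
    }
    return @images;
}

my $mail = qq{<html><body background="a.gif">reply text\n}
         . qq{<BODY bgcolor=red>forwarded text\n}
         . qq{<body text=black\n background = b.jpg >original text};
print join(", ", all_backgrounds($mail)), "\n";    # prints: a.gif, b.jpg
```

A literal "<body background=...>" typed as text into a plain-text part would still be picked up, so the caveat above about not reaching 100% applies to this sketch as well.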
0
 
LVL 1

Author Comment

by:boolee
ID: 7061125
So u mean I have to replace all the inner BODY and HTML tags? Can I just remove the background property and preserve the rest of the tag? Do u think it is useful?
0
 
LVL 51

Accepted Solution

by:
ahoffmann earned 150 total points
ID: 7061847
> Can I just remove the background property and preserve the rest of the tag?
You can, it's up to you.

> Do u think it is useful?
If you never expect mails that contain the literal string background=some.gif (like this text here), then it might be useful for you in most cases.

How about replacing all occurrences of background=whatever
with, for example:
    "back--ground = whatever"
so that it is no longer a valid tag attribute (and hopefully will not become one in the future). You can still read the text, even if it was meant as a literal string rather than a tag attribute.
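
One possible way to write that replacement in Perl is sketched below. It is a slightly narrower variation on the suggestion: the attribute name is rewritten only inside <body ...> tags, so a literal background=some.gif elsewhere in the text is left alone. The function name disable_body_backgrounds is hypothetical, and the code again assumes no ">" inside attribute values.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: rename the background attribute to "back--ground"
# inside every <body> tag, leaving the rest of each tag untouched.
sub disable_body_backgrounds {
    my ($html) = @_;

    $html =~ s{(<body\b)([^>]*)(>)}{
        # Copy the captures first: the inner s/// resets $1, $2, $3.
        my ($open, $attrs, $close) = ($1, $2, $3);
        # Rename only a real attribute name, i.e. one followed by "=".
        $attrs =~ s/(?<![\w-])background(?=\s*=)/back--ground/gi;
        $open . $attrs . $close;
    }gie;

    return $html;
}

my $mail = qq{<BODY background="tree.gif" bgcolor="#fff">Text about background=some.gif</BODY>};
print disable_body_backgrounds($mail), "\n";
# prints: <BODY back--ground="tree.gif" bgcolor="#fff">Text about background=some.gif</BODY>
```

Because the value itself is kept, the image name can still be read from the renamed attribute afterwards if it is needed for the table.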
0
 
LVL 1

Author Comment

by:boolee
ID: 7077229
Thanks ahoffmann
0
