Solved

Regular expressions again

Posted on 2001-06-05
Last Modified: 2010-08-05
Hi,

I want to parse an HTML file and extract the following:

1. All anchor tags that contain text only, for example:

<a href="...">Welcome</a> --- Yes

<a href="..."><img src=".."></a> --- No

2. all image tags: <img ..... >

Any ideas?

Regards,
Question by:Zuhair070699
10 Comments
 
LVL 8

Expert Comment

by:us111
ID: 6156204
Pretty nice solution from LexZEUS:

1.
<?
$html = '<HTML>
             foo text0
             <a href = "http://www.1.com" ><img src="lklki"></a>foo text1
             <a href =
                       "http://www.2.com"
             >texta</a>
             foo text1
             <a   href  = http://www.3.com >texta</a>foo text1
             <a   href     =  \'http://www.4.com\'
             >texta</a>foo text1

             <a href="http://linka">texta</a>foo text1
             <a href="http://linkb">textb</a>foo text2
             foo text2
             </HTML>
             ';

     preg_match_all("/<a[[:space:]]+href[[:space:]]*=[[:space:]]*"."[\"']{0,1}([^\"'> ]+)/i",$html,$arr,PREG_SET_ORDER);
     preg_match_all("|>([^<]*)</a|i",$html,$arr_text,PREG_SET_ORDER);

     for ($i=0;$i<count($arr);$i++)
         print $arr[$i][1]."  ".$arr_text[$i][1]."\n";
?>

2. still to come
 
LVL 8

Expert Comment

by:us111
ID: 6156310
Would you like to get the whole tag, e.g. <img src="pictures/fsdfsf.gif">, or just the image link pictures/fsdfsf.gif?
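
For reference, the two options asked about here could look roughly like the sketch below. The sample HTML and the variable names ($tags, $srcs) are made up for illustration and are not from the thread. Both patterns assume that no attribute value contains a ">" character.

```php
<?php
// Hypothetical sample input, not taken from the thread.
$html = '<p><img src="pictures/a.gif" alt="A"></p>
<p><IMG alt=B src=\'pictures/b.gif\'></p>';

// Option 1: grab each complete <img ...> tag.
preg_match_all('/<img\b[^>]*>/i', $html, $tags);

// Option 2: grab only the value of the src attribute
// (double-quoted, single-quoted, or unquoted).
preg_match_all('/<img\b[^>]*?\bsrc\s*=\s*["\']?([^"\'\s>]+)/i', $html, $srcs);

print_r($tags[0]); // the whole tags
print_r($srcs[1]); // the image paths only
```

Option 1 is enough if the tags are going to be copied into another page as they are; option 2 is the one to use if only the file paths are needed.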
 

Author Comment

by:Zuhair070699
ID: 6158531
Hi,

The first one is 100% OK.

As for the image tags, could you please explain the two methods?

Thanks
 

Author Comment

by:Zuhair070699
ID: 6158650
Hi,
Regarding the first script, I tried the following example and there is a problem:

<?
$html = '<p><a href="http://www.link1.com">Link1</a></p>
<p><a href="http://www.link2.com">Link2</a></p>
<p><a class=z href="http://www.link3.com" >Link3</a></p>
<p><a href="http://www.link4.com"><img src="" alt=""></a></p>
<p><a href="http://www.link5.com">Link5</a></p>
<p><a href="http://www.link6.com">Link6</a></p>
<p><a href="http://www.link7.com">Link7</a></p>
<p><a href="http://www.link8.com">Link8</a></p>
<p><a href="http://www.link9.com">Link9</a></p>
<p><a href="http://www.link10.com">Link10</a></p>   ';

    preg_match_all("/<a[[:space:]]+href[[:space:]]*=[[:space:]]*"."[\"']{0,1}([^\"'> ]+)/i",$html,$arr,PREG_SET_ORDER);
    preg_match_all("|>([^<]*)</a|i",$html,$arr_text,PREG_SET_ORDER);

    for ($i=0;$i<count($arr);$i++)
        print "<a href=\"".$arr[$i][1]."\">".$arr_text[$i][1]."</a> + \n";

?>

===================

The output seems OK, but the links are wrong starting from Link3, which should point to http://www.link3.com rather than http://www.link4.com.

Any idea?

Regards,,
 
LVL 8

Expert Comment

by:us111
ID: 6159590
Oh sh... ;)) It's because you have <a class=z ...>, so href is not the first attribute after <a and the first pattern skips that link.
I need to modify the regular expression.
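
A small sketch of the mismatch, assuming a cut-down two-link input. $strict is the pattern from the first script. $tolerant is only one possible adjustment, not the fix posted later in this thread, and both variable names are made up here.

```php
<?php
// Cut-down input based on the failing example above.
$html = '<a href="http://www.link2.com">Link2</a>
<a class=z href="http://www.link3.com" >Link3</a>';

// Pattern from the first script: href must directly follow "<a".
$strict = "/<a[[:space:]]+href[[:space:]]*=[[:space:]]*[\"']{0,1}([^\"'> ]+)/i";

// Assumed alternative: allow other attributes between "<a" and href.
$tolerant = "/<a[[:space:]][^>]*?href[[:space:]]*=[[:space:]]*[\"']{0,1}([^\"'> ]+)/i";

preg_match_all($strict, $html, $a);
preg_match_all($tolerant, $html, $b);

print count($a[1]) . "\n"; // URLs found by the strict pattern
print count($b[1]) . "\n"; // URLs found by the tolerant pattern
```

Because the strict pattern drops the link3 URL while the link-text pattern still finds "Link3", the two result arrays go out of step from that point on, which is the shifted output reported above.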

Author Comment

by:Zuhair070699
ID: 6172735
waiting .....
 
LVL 8

Expert Comment

by:us111
ID: 6173755
in progress.... :)
 

Author Comment

by:Zuhair070699
ID: 6200941
waiting  :(
 
LVL 8

Accepted Solution

by:
us111 earned 50 total points
ID: 6201269
// For url
preg_match_all("/href[[:space:]]*=[[:space:]]*"."[\"']{0,1}([^\"'> ]+)/i",$html,$arr,PREG_SET_ORDER);
preg_match_all("|>([^<]*)</a|i",$html,$arr_text,PREG_SET_ORDER);

for ($i=0;$i<count($arr);$i++)
     print "<a href=\"".$arr[$i][1]."\">".$arr_text[$i][1]."</a> + \n";
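
The code above still pairs two separate result arrays by index. A possible alternative, sketched here and not part of the accepted answer, is to capture the URL and the link text in one pattern so they cannot get out of step. It also skips anchors whose content starts with a tag such as <img>. The names $pattern and $links and the sample input are assumptions for illustration.

```php
<?php
// Sample input adapted from the failing example earlier in the thread.
$html = '<p><a class=z href="http://www.link3.com" >Link3</a></p>
<p><a href="http://www.link4.com"><img src="" alt=""></a></p>
<p><a href=\'http://www.link5.com\'>Link5</a></p>';

// Group 1: href value (quoted or unquoted).
// Group 2: link text, which must not contain any "<", so anchors
// wrapping an <img> or other tag do not match at all.
$pattern = '~<a\s[^>]*?href\s*=\s*["\']?([^"\'\s>]+)[^>]*>([^<]+)</a>~i';

preg_match_all($pattern, $html, $m, PREG_SET_ORDER);

$links = array();
foreach ($m as $match) {
    $links[] = array('url' => $match[1], 'text' => trim($match[2]));
}

foreach ($links as $link) {
    print "<a href=\"" . $link['url'] . "\">" . $link['text'] . "</a>\n";
}
```

Like the other patterns in this thread, this is a regex approximation: it assumes attribute values contain no ">" and that the link text contains no nested markup.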
 
LVL 8

Expert Comment

by:us111
ID: 6201312
preg_match_all("/href[[:space:]]*=[[:space:]]*"."[\"']{0,1}([^\"'> ]+)/i",$html,$arr,PREG_SET_ORDER);
preg_match_all("|>(<[^<]*)</a|i",$html,$arr_text,PREG_SET_ORDER);
