Solved

Regular Expression to find search text between parentheses

Posted on 2011-09-12
Last Modified: 2012-08-14
I am looking for a regular expression to use in a PHP preg_match function call that will find a search text anywhere in the search string, provided the search text sits between open and close parentheses. For example:

$postVal = "find me";
$pdfdata = "This is a test of a string that (will find me between) parentheses.";

I have this as a starting point, but it is not quite correct.  I basically want to find whole words or phrases that are between the parentheses, as this is how the text is formatted in the PDF document that I am searching.

preg_match('/\([^\(\r\n]'.$postVal.'\)/i', $pdfdata)
Question by:sscotti
7 Comments
 
LVL 35

Expert Comment

by:Terry Woods
ID: 36526830
Something like this?

$postVal = "find me";
$pdfdata = "This is a test of a string that (will find me between) parentheses.";
preg_match('/\([^\(\r\n]*'.$postVal.'[^\(\r\n]*\)/i', $pdfdata, $matches);
print_r($matches);

Output:
Array
(
    [0] => (will find me between)
)
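
One thing the snippet above does not guard against is a search term that contains regex metacharacters. Here is a minimal sketch of the same idea wrapped in a helper, with the term passed through preg_quote() first. The function name findInParens is hypothetical, the character class is tightened to [^()\r\n] so a match cannot run past a closing paren (my own change, not part of the answer above), and the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper (the name is an assumption, not from the thread).
// Returns the first parenthesized span on a single line that contains
// $needle (case-insensitive), or null when there is none.
function findInParens(string $needle, string $haystack): ?string
{
    // Escape regex metacharacters in the term; '/' is the delimiter used below.
    $quoted  = preg_quote($needle, '/');
    $pattern = '/\([^()\r\n]*' . $quoted . '[^()\r\n]*\)/i';

    return preg_match($pattern, $haystack, $m) === 1 ? $m[0] : null;
}

$pdfdata = "This is a test of a string that (will find me between) parentheses.";
var_dump(findInParens("find me", $pdfdata));

// Without preg_quote() the dot would act as a wildcard and match the space.
var_dump(findInParens("find.me", $pdfdata));
```

With the term quoted, "find.me" is treated literally and no longer matches "find me".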
 
LVL 17

Expert Comment

by:sonawanekiran
ID: 36527426
If you want to do that with JavaScript, it is very simple:
var str = "This is test string (which you are looking for)";
alert(str.replace(/^.*\((.*)\).*$/m, '$1'));

Live Demo :

http://jsfiddle.net/R8WGt/
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 36528697
An interesting wrinkle on this question... What if the parenthetical expression contains a parenthetical expression?
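
For what it is worth, PCRE (the engine behind preg_match) supports recursive patterns, which is one way to cope with nesting. The sketch below is an assumption about how that could be applied here, not something proposed in this thread: it collects every balanced, outermost parenthetical group and keeps those that contain the search term. The function name balancedGroupsContaining is hypothetical.

```php
<?php
// Hypothetical helper: returns every balanced, outermost parenthetical
// group in $haystack that contains $needle (case-insensitive).
function balancedGroupsContaining(string $needle, string $haystack): array
{
    // (?R) recurses into the whole pattern, so a nested pair is consumed
    // as one unit inside its enclosing pair.
    $pattern = '/\((?:[^()]++|(?R))*\)/';
    preg_match_all($pattern, $haystack, $m);

    // Keep only the groups that actually contain the search term.
    return array_values(array_filter(
        $m[0],
        function ($group) use ($needle) {
            return stripos($group, $needle) !== false;
        }
    ));
}

print_r(balancedGroupsContaining("find me", "This ought to ((find me, too!)) work"));
print_r(balancedGroupsContaining("find me", "Here is (a layered (find me expression) string"));
```

In the second call the first open paren is never closed, so only the inner, balanced group is returned.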
 
LVL 108

Accepted Solution

by:
Ray Paseur earned 450 total points
ID: 36528882
This seems to work OK for at least some of the edge cases.  When you try to parse strings with REGEX the result almost always contains some surprises.  I often like to use explode() first to limit the exposure to unpredictable external data.

You can see the output of this script on my server here.  As you will see, the absence of a closing paren creates ambiguity about the string we want to find.  You might also want to decide whether the search should be case-sensitive or not; see line 39 of the script, where the 'i' modifier is appended.
http://www.laprbass.com/RAY_temp_sscotti.php

Best of luck with your project, ~Ray
<?php // RAY_temp_sscotti.php
error_reporting(E_ALL);
echo "<pre>";


// SOME TEST DATA
$pdfdatas = array
( "This is a test of a string that (will find me between) parentheses."
, "This has parentheses but (not the search argument) and also (find me here) the search arg."
, "Here is (a layered (parenthetical (find me expression) in an unbalanced) string"
, "(find me)"
, "This ought to ((find me, too!))"
, "This has multiples of (find me, one) and (find me, too)."
, "This has nothing."
, "This has nothing ()."
, "This has nothing (of value)."
, "This is interesting (I am curious - find me here at the end?"
)
;

// THE SEARCH STRING IS PROBABLY EXTERNAL DATA
$postVal = "find me";

// PREPARE THE EXTERNAL DATA
$s = preg_quote($postVal, '#');   // ALSO ESCAPE THE '#' DELIMITER USED BELOW


// CONSTRUCT A REGEX
$r
= '#'           // REGEX DELIMITER
. '[(]{1}'      // A CHARACTER CLASS OF OPEN PAREN
. '('           // START A GROUP
. '.*?'         // ANYTHING OR NOTHING
. $s            // THE PREPARED SEARCH STRING
. '.*?'         // ANYTHING OR NOTHING
. ')'           // ENDOF A GROUP
. '[)]{1}'      // A CHARACTER CLASS OF CLOSING PAREN
. '#'           // REGEX DELIMITER
. 'i'           // CASE-INSENSITIVE
;


// TEST EACH OF THE STRINGS
foreach ($pdfdatas as $p)
{
    // SHOW THE ORIGINAL STRING
    echo PHP_EOL;
    echo htmlentities($p);
    echo PHP_EOL;

    // MAKE THE MATCH
    preg_match_all($r, $p, $m);

    // IF THERE IS A FINDING IT IS IN THE GROUP AT $m[1]
    if (!empty($m[1]))
    {
        // THERE MIGHT BE MULTIPLES
        foreach ($m[1] as $n)
        {
            // HANDLES UNBALANCED PARENTHESES
            $a = explode('(', $n);
            $f = end($a);

            // SHOW THE STRING WE FOUND
            var_dump($f);
        }
    }
    else
    {
        echo "NO MATCH" . PHP_EOL;
    }
}

 
LVL 5

Author Comment

by:sscotti
ID: 36531873
Thanks for the input.  I will award points shortly.  One follow-up question: the application here is that I am searching for keywords or text in a converted PDF document that has been OCR'ed or saved with Adobe Acrobat from PowerPoint.  There is a lot of formatting data in the document, as well as other data that looks like the text I am searching for.  For example:


The following looks like text that appears on my PPT slides:


Q
BT
0.19 0.147 0.152 0 k
/TT0 1 Tf
24 0 0 24 195.873 160.2 Tm
[(Submitted by: Dr. Gay, M.D. )250( )]TJ
3.861 -1.167 Td
[(Professor of Radiology)250( )]TJ
3.333 -1.208 Td
(8/25/11)Tj
0.149 0.113 0.118 0 k
/TT1 1 Tf
2.806 0 Td
( )Tj
ET


......


BT
0 0 0 0 k
/TT0 1 Tf
44 0 0 44 99.125 491.7999 Tm
[(Based on these images, what is)233( )]TJ
1.05 -1.182 Td
[(the most likely diagnosis?)250( )]TJ
0.029 0 0.342 0 k
32 0 0 32 131.125 366.5 Tm
(1.)Tj
0.022 0 0.281 0 k
/C2_0 1 Tf
<0001>Tj
0 0 0 0 k
/TT0 1 Tf
1.361 0 Td
(Invasive ductal carcinoma )Tj

The following looks like text elsewhere in the document:

(6y´Î"†ú9¿•#Ngì´…°úécè-ïIüñyà·ÿPåÛEné]s˜x¿›/´ÚvªYï)Ò—lF°œp+Å¿¿€3SJ°Û,N¿¿F8F!J[LëÌŸ/ج Gäuuu9Ò˜µC¿p†Áz.MT»±oY<K*S°2o´ª]±Ï€°~=6¿Ä`Í2Ë!/§£›.~¿´=RJTòä*ja.®Ô#i[  ÿÓÏf‚¿‹ôL¶¿ {qÕ› WHpn€is”ƒQë/¥_7*ëKsÿgj¿¿¶5æåµ÷¿ÆOBz7Á•àS#ÁÇ–F Y»0é(óv‡/\g1õ^¿}º)ôœmgaY¿?.ãí4râ)


If I search for "gay" in this case, there is a match in the PPT text portion and another match in the "elsewhere" data shown above ("gaY", near the end of the string).

I am just wondering whether there is some standard method to search for words or phrases within a PDF document, which is what I am trying to do.  The solution above works for the most part, since what I am searching for (e.g. idiopathic, lipoma, etc.) is probably only going to show up in the text data, but some short words may appear in the encoded data as well.  I am sure there are tools or methods already out there for doing that sort of thing.
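
To make the question concrete: in the sample above, the visible text is always the operand of a Tj or TJ operator, while the encoded data is not. A rough sketch of restricting the search to those operands follows. It assumes the content stream is already uncompressed (as in the dump above), that the strings contain no escaped or nested parentheses or brackets, and it ignores hex strings such as <0001>. The function name extractShownText is hypothetical.

```php
<?php
// Hypothetical helper: collects the string operands of Tj and TJ operators
// from an uncompressed PDF content stream. Assumes no escaped or nested
// parentheses inside the strings and ignores hex strings such as <0001>.
function extractShownText(string $stream): array
{
    $texts   = [];
    $pattern = '/\[([^\]]*)\]\s*TJ|\(([^()]*)\)\s*Tj/';
    preg_match_all($pattern, $stream, $matches, PREG_SET_ORDER);

    foreach ($matches as $m) {
        if (isset($m[2])) {
            // Simple form: (text)Tj
            $texts[] = $m[2];
        } else {
            // Array form: [(text)250( )]TJ. Join the string pieces and
            // drop the numeric spacing adjustments between them.
            preg_match_all('/\(([^()]*)\)/', $m[1], $parts);
            $texts[] = implode('', $parts[1]);
        }
    }
    return $texts;
}

$stream = <<<'PDF'
BT
/TT0 1 Tf
[(Submitted by: Dr. Gay, M.D. )250( )]TJ
(8/25/11)Tj
ET
(6y...mgaY...)
PDF;

// Search only the extracted text, not the raw stream.
$hits = array_filter(extractShownText($stream), function ($t) {
    return stripos($t, 'gay') !== false;
});
print_r($hits);
```

The last line of the test stream stands in for the encoded data; it is not followed by Tj or TJ, so it never reaches the search.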
 
LVL 35

Assisted Solution

by:Terry Woods
Terry Woods earned 50 total points
ID: 36532714
You could use a positive lookahead to ensure that a given number of characters following the word you're looking for come from an expected set of characters. If you still want to find a particular word within parentheses, you can add a lookahead to the pattern like this:

preg_match('/\([^\(\r\n]*'.$postVal.'(?=[\w\s!@#$%^&*()\-=+\[\]{};\':",.\/<>?\\|`~]{3})[^\(\r\n]*\)/i', $pdfdata, $matches);

This bit was added:
(?=[\w\s!@#$%^&*()\-=+\[\]{};\':",.\/<>?\\|`~]{3})

The given pattern requires any 3 keyboard characters (also including tab or CR/LF) to follow the given value.
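
A small sketch of the effect, using strings modelled on the data posted above. To keep it short, the printable-ASCII class [\x20-\x7E] stands in for the longer character list; that substitution, the preg_quote() call, and the function name findWithLookahead are assumptions made for illustration.

```php
<?php
// Simplified variant of the lookahead idea: the class [\x20-\x7E]
// (printable ASCII) is an assumption standing in for the longer list.
function findWithLookahead(string $needle, string $haystack): ?string
{
    $quoted  = preg_quote($needle, '/');
    // The lookahead requires three printable ASCII bytes right after the term.
    $pattern = '/\([^(\r\n]*' . $quoted . '(?=[\x20-\x7E]{3})[^(\r\n]*\)/i';

    return preg_match($pattern, $haystack, $m) === 1 ? $m[0] : null;
}

// Readable text taken from the sample content stream.
$text = "[(Submitted by: Dr. Gay, M.D. )250( )]TJ";

// Made-up bytes imitating the encoded data: "gaY" followed by a non-ASCII byte.
$encoded = "(6y\xB4\xCE mgaY\xBF?.\xE3\xED4r\xE2)";

var_dump(findWithLookahead("gay", $text));
var_dump(findWithLookahead("gay", $encoded));
```

The first call finds the readable string; the second returns null because the byte after "gaY" falls outside the allowed set.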
 
LVL 108

Assisted Solution

by:Ray Paseur
Ray Paseur earned 450 total points
ID: 36535855
When you try to parse strings with REGEX the result almost always contains some surprises.  I often like to use explode() first to limit the exposure to unpredictable external data. ... and ... You might also want to decide whether the search should be case-sensitive or not.  See line 39.

Parsing text for meaning is difficult enough without the added complexity of PDF and PPT markup, formatting and layout.  That's why I rarely try to do everything in a single statement.  A few extra lines of code add a lot of power and flexibility.

Two suggestions... One, if you want help searching a PDF document, you will get better results if you post a representative PDF document (or three) for us to work with.  And Two, there are many prefabricated search machines that can search PDF documents very capably.  The Atomz engine is one that I have used successfully for many years.  The Wrensoft Zoom indexer does a good job, too.

Best of luck with your project, ~Ray