Solved

Perl's split function & regexp

Posted on 2003-03-06
Last Modified: 2013-12-25
I'm parsing data and need to split a large block of text (example below) at #.#, #.##, ##.##, (#), or (##). The number I need to split on may be set off by tabs or spaces, or it may sit right next to a word. It may also be referred to in the text that follows, and I don't want to split there. So my question: how do I use a regular expression in the split function to split at 4.3, for example, grab the text up to 4.4, and make note of the footnote (2)? I've got it grabbing the text up to 4.4, but it doesn't give me the 4.3 itself: because I'm splitting on it, it isn't retained in the line. Here is an example of how I'm splitting it (I'm a regexp newbie, so don't be too harsh):

********************
Example text:

4.3 Dated as of June 22, 2000, by and among the company, as Issuer, and Group, as Guarantor. (2) 4.4 Amended and Restated Rights Agreement, dated as of February 11, 1999, between the company. (5) 4.5 Specimen of Class A common stock certificate. (2) 10.1 Agreement, dated as of May 31, 1990, between the company, and Amendment thereto. (6)*
***************

# I'm splitting the docs into paragraphs, and if the paragraph contains a #.# (etc) I'll grab it then...
if ($paragraph =~ m/([\s\s|\t][\d+][.|-][\d+|A-Za-z][\s\s|\t])/) {

# then I'll push the split lines into an array for further breakdown
my (@newlines) = split(/\s\s\d+[.|-][\d+|A-Za-z]\s\s/, $paragraph);

****************************
From here, however, I'd receive:

Dated as of June 22, 2000, by and among the company, as Issuer, and Group, as Guarantor. (2)

...as output.

Any help is truly appreciated!
Question by:buzzbuzz
Accepted Solution

by:
ozo earned 200 total points
ID: 8084375
$paragraph='4.3 Dated as of June 22, 2000, by and among the company, as Issuer, and Group, as
Guarantor. (2) 4.4 Amended and Restated Rights Agreement, dated as of February 11,
1999, between the company. (5) 4.5 Specimen of Class A common stock certificate. (2)
10.1 Agreement, dated as of May 31, 1990, between the company, and Amendment
thereto. (6)*';
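# Capture each entry lazily, stopping just before the next item number (or at
# the end of the string); the lookahead leaves that number for the next match.
my (@newlines) = $paragraph =~  /(.+?)(?=$|\s+\d+[.-][\dA-Za-z]+\s+)/gs;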
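
For anyone who would rather stay with split: split discards whatever its pattern matches, so a pattern that includes the 4.3 will never hand it back. A minimal sketch, assuming every item number is followed by whitespace, is to let the pattern consume only the whitespace in front of the number and put the number itself in a zero-width lookahead. The second loop, and the @items array with its number / text / footnote keys, are illustrative names that are not part of the accepted answer; the sample text is the one from the question.

```perl
use strict;
use warnings;

my $paragraph = '4.3 Dated as of June 22, 2000, by and among the company, as Issuer, and Group, as Guarantor. (2) 4.4 Amended and Restated Rights Agreement, dated as of February 11, 1999, between the company. (5) 4.5 Specimen of Class A common stock certificate. (2) 10.1 Agreement, dated as of May 31, 1990, between the company, and Amendment thereto. (6)*';

# The pattern consumes only the whitespace before an item number.
# The number sits in a lookahead, so it stays at the front of the next piece.
my @newlines = split /\s+(?=\d+[.-][\dA-Za-z]+\s)/, $paragraph;

# Break each piece into item number, text, and an optional trailing
# footnote such as (2) or (6)*.
my @items;
for my $line (@newlines) {
    if ($line =~ /^(\d+[.-][\dA-Za-z]+)\s+(.*?)(?:\s*\((\d+)\)\*?)?\s*$/s) {
        push @items, { number => $1, text => $2, footnote => $3 };
    }
}

for my $item (@items) {
    my $note = defined $item->{footnote} ? $item->{footnote} : 'none';
    print "$item->{number} [footnote $note] $item->{text}\n";
}
```

This still splits on a number that is only mentioned in running text (for example "see 4.3 above"), so that case would need extra context in the pattern to rule out.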