Solved

python find position of string in entire file

Posted on 2012-04-05
Last Modified: 2012-04-23
Hi,

Below is a very simplified version of my code. I'm iterating over the lines of a file and trying a regex on every line. When the regex matches, I want the exact start position within the entire file. However, the code below doesn't give the correct position. Any ideas? Please note that my situation doesn't allow me to use a global regex on the entire file! I have to iterate over the lines.

import re

file = open('file.txt').read()
pos = 0
match_pos = 0
for line in file.splitlines():
    match = re.search('function [^{]+?{', line)
    if match:
        match_pos = pos + match.start() #exact pos
        print match_pos
        print file.count('\n', 0, match_pos) #lineno

    pos = pos + len(line)

Question by:Dennie
LVL 41

Expert Comment

by:HonorGod
ID: 37813917
What do you mean by "the exact start position"?

Might you be getting into trouble with your calculations because of the "newline" characters at the end of each line?

Might something like this be what you want?

import re
pat = re.compile( 'function [^{]+?{', re.MULTILINE )
fh = open( 'file.txt', 'rb' )
data = fh.read()
fh.close()
for item in re.finditer( pat, data ) :
  print '%5d..%5d' % ( item.start(), item.end() )
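
The newline concern raised above can be checked directly. The following is a minimal sketch in Python 3 syntax (the code in this thread is Python 2), run on a made-up in-memory string rather than file.txt; the helper name `line_based_offsets` is hypothetical. `str.splitlines()` drops the line terminators, so a running total of `len(line)` falls one character behind for every line already passed. With `keepends=True` the terminators are kept and the totals agree with whole-text offsets.

```python
import re

PATTERN = re.compile(r'function [^{]+?{')

def line_based_offsets(text, keepends):
    """Yield whole-text offsets of matches found one line at a time."""
    pos = 0
    for line in text.splitlines(keepends):
        m = PATTERN.search(line)
        if m:
            yield pos + m.start()
        # Only correct if len(line) still includes the line terminator.
        pos += len(line)

sample = "a\nb\nfunction f {\n"  # assumed sample data, not the asker's file

whole_text = [m.start() for m in PATTERN.finditer(sample)]
print(whole_text)                                        # offsets from a whole-text search
print(list(line_based_offsets(sample, keepends=False)))  # drifts: newlines not counted
print(list(line_based_offsets(sample, keepends=True)))   # agrees with the whole-text search
```

Iterating over an open file object (`for line in f:`) also yields lines with their terminators, so it behaves like the `keepends=True` case.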

 

Author Comment

by:Dennie
ID: 37813995
By exact position I mean the value I would get from:

m = re.search('searchsomething', file, flags=re.DOTALL)
m.start()

That is, m.start() in this code should equal the start position of the match found by the code in my first post (where I'm iterating over the lines).
Again, I have to iterate over all the lines!
 
LVL 41

Expert Comment

by:HonorGod
ID: 37814046
Are you looking for the offset from the beginning of the file (in bytes)?
Or are you looking for the offset at the beginning of each line?

Sorry for being confused.

The snippet of code that I supplied above displays the "exact" offset of the match, both the starting and ending position.
 

Author Comment

by:Dennie
ID: 37814076
"The snippet of code that I supplied above displays the "exact" offset of the match, both the starting and ending position."

Yes, but I have to iterate over the lines. Your example gives exactly the position I'm looking for, but I can't use a global regex on the file. I can only use a regex on each line.
 
LVL 41

Assisted Solution

by:HonorGod
HonorGod earned 250 total points
ID: 37814277
ok, so you want to iterate over the individual lines in the file, but obtain the offset from the beginning of the file.
Is that right?

Why do you need to iterate over the lines?

The regular expression can include the use of ^ to match the beginning of each line, so this may be the solution for which you are looking:

import re
pat = re.compile( '^function [^{]+?{', re.MULTILINE )
fh = open( 'file.txt', 'rb' )
data = fh.read()
fh.close()
for item in re.finditer( pat, data ) :
  print '%5d: "%s"' % ( item.start(), data[ item.start() : item.end() ] )
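
If a line number is needed alongside the whole-file offset, it can be derived from the offset afterwards. This is a rough sketch in Python 3 syntax; the helper `offset_to_line_col` and the sample string are illustrative assumptions, not part of the code above. Note that the `^`-anchored pattern only matches "function" at the start of a line, so it can find fewer matches than an unanchored per-line search would.

```python
import re

def offset_to_line_col(data, offset):
    """Convert a zero-based offset into (one-based line, zero-based column)."""
    line = data.count('\n', 0, offset) + 1
    # rfind returns -1 when there is no earlier newline, which gives line_start == 0.
    line_start = data.rfind('\n', 0, offset) + 1
    return line, offset - line_start

# Assumed sample data: one mid-line "function" and one at the start of a line.
data = "a\nb\nc\nxyz function klm { x }\nfunction zzz { y }\n"

pat = re.compile(r'^function [^{]+?{', re.MULTILINE)
for m in pat.finditer(data):
    print(m.start(), offset_to_line_col(data, m.start()))
```

On this sample the anchored pattern reports only the match on the fifth line; the mid-line "xyz function" on the fourth line is skipped.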

 
LVL 29

Accepted Solution

by:pepr
pepr earned 250 total points
ID: 37814547
I guess the code is meant for something like an editor, where only a "window of lines" is available to be searched. Is that correct?

Basically, I see no problem with your basic idea. Try the following code:

a.py
import re

rexFunctionPattern = r'function [^{]+?{'
rexFunction = re.compile(rexFunctionPattern)

def getStartsByLines(fname, rex=rexFunction):
    with open(fname) as f:
        pos = 0
        match_pos = 0
        for lineno, line in enumerate(f):
            m = rex.search(line)  # use the rex argument, not the module-level default
            if m:
                match_pos = pos + m.start() #exact pos
                yield (lineno + 1, m.start(), match_pos)

            pos = pos + len(line)


def getStartsBlockReading(fname, pattern=rexFunctionPattern):
    with open(fname) as f:
        data = f.read()

    for m in re.finditer(pattern, data, flags=re.DOTALL):
        yield m.start()


for t in getStartsByLines('file.txt'):
    print t

print '-' * 30

for pos in getStartsBlockReading('file.txt'):
    print pos



With my sample of file.txt...
a
b
c
xyz function klm { something here }
r
t  function zzz { something here }
z



It prints on my console:

c:\tmp\_Python\Dennie\Q_27664512>a.py
(4, 4, 10)
(6, 3, 47)
------------------------------
10
47



The first generator returns the line number (one-based), the position within the line (zero-based here, but easily adjusted), and the character offset from the beginning of the file (zero-based). The second generator returns the zero-based character offsets via re.finditer over a single block of data.

If the file is opened in text mode, it does not matter which newline convention the file uses.
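
The text-mode remark can be illustrated with a small experiment. This is a sketch in Python 3 syntax under stated assumptions: the temporary file and its contents are invented for the example, and it relies on Python 3's default text-mode behaviour of translating '\r\n' to '\n' on reading. It shows that the text-mode offset counts characters after newline translation, while a binary read counts the raw bytes on disk.

```python
import os
import re
import tempfile

PATTERN = r'function [^{]+?{'

# Write a small file with Windows-style line endings (assumed sample data).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b"a\r\nfunction f {\r\n")

try:
    # Text mode with default newline handling: '\r\n' is read as '\n'.
    with open(path, 'r', encoding='ascii') as f:
        text_offset = re.search(PATTERN, f.read()).start()

    # Binary mode: bytes are read unchanged, so each '\r' is counted too.
    with open(path, 'rb') as f:
        byte_offset = re.search(PATTERN.encode('ascii'), f.read()).start()
finally:
    os.remove(path)

print(text_offset, byte_offset)
```

Here the text-mode offset is 2 and the binary offset is 3. Either is fine as long as the line-by-line code and whatever consumes the offset read the file in the same mode.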
 
LVL 41

Expert Comment

by:HonorGod
ID: 37880328
Thanks for the assist, and the points.

Good luck & have a great day.