Solved

python find position of string in entire file

Posted on 2012-04-05
Last Modified: 2012-04-23
Hi,

Below is a very simplified version of my code. I'm iterating over the lines in a file and trying a regex on every line. If the regex matches, I want the exact start position of the match in the entire file. However, the code below doesn't give the correct position. Any ideas? Please note that my situation doesn't allow me to use a global regex on the entire file! I have to iterate over the lines.

import re

file = open('file.txt').read()
pos = 0
match_pos = 0
for line in file.splitlines():
  match = re.search('function [^{]+?{', line)
  if match:
     match_pos = pos + match.start() #exact pos
     print match_pos
     print file.count('\n', 0, match_pos) #lineno
  
  
  pos = pos + len(line)

Question by:Dennie
7 Comments
 
LVL 41

Expert Comment

by:HonorGod
ID: 37813917
What do you mean by "the exact start position"?

Might you be getting into trouble with your calculations because of the "newline" characters at the end of each line?

Might something like this be what you want?

import re
pat = re.compile( 'function [^{]+?{', re.MULTILINE )
fh = open( 'file.txt', 'rb' )
data = fh.read()
fh.close()
for item in re.finditer( pat, data ) :
  print '%5d..%5d' % ( item.start(), item.end() )
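
The newline suspicion can be checked directly. The following is a minimal sketch in Python 3 syntax (the thread's own code is Python 2), assuming the file content is already in a string. The names `find_offsets` and `text` are illustrative and not from the original post. `str.splitlines()` drops the line terminators, so summing `len(line)` undercounts by one character per `\n`; passing `keepends=True` keeps the terminators in the running total.

```python
import re

PATTERN = re.compile(r'function [^{]+?{')

def find_offsets(text, keepends):
    """Yield the absolute offset of the first match on each line of text."""
    pos = 0
    for line in text.splitlines(keepends):
        m = PATTERN.search(line)
        if m:
            yield pos + m.start()
        # Without keepends, len(line) excludes the line terminator.
        pos += len(line)

text = "a\nb\nfunction f {\n}\nfunction g {\n"
print(list(find_offsets(text, keepends=False)))  # drifts by one per earlier line
print(list(find_offsets(text, keepends=True)))   # agrees with a whole-text search
print([m.start() for m in PATTERN.finditer(text)])
```

With `keepends=True` the per-line offsets line up with what a whole-text `finditer` reports for this sample, which is the behaviour the question asks for.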

 

Author Comment

by:Dennie
ID: 37813995
By exact position I mean this: if I used

for m in re.finditer('searchsomething', file, flags=re.DOTALL):
    print m.start()

then each m.start() in this code should equal the start position of the corresponding match in the code of my first post (where I'm iterating over the lines).
Again, I have to iterate over all the lines!
 
LVL 41

Expert Comment

by:HonorGod
ID: 37814046
Are you looking for the offset from the beginning of the file (in bytes)?
Or are you looking for the offset at the beginning of each line?

Sorry for being confused.

The snippet of code that I supplied above displays the "exact" offset of the match, both the starting and ending position.
 

Author Comment

by:Dennie
ID: 37814076
"The snippet of code that I supplied above displays the "exact" offset of the match, both the starting and ending position."

Yes, but I have to iterate over the lines. Your example gives exactly the match position that I'm looking for, but I can't use a global regex on the file. I can only apply a regex to each line.
 
LVL 41

Assisted Solution

by:HonorGod
HonorGod earned 250 total points
ID: 37814277
ok, so you want to iterate over the individual lines in the file, but obtain the offset from the beginning of the file.
Is that right?

Why do you need to iterate over the lines?

The regular expression can include the use of ^ to match the beginning of each line, so this may be the solution for which you are looking:

import re
pat = re.compile( '^function [^{]+?{', re.MULTILINE )
fh = open( 'file.txt', 'rb' )
data = fh.read()
fh.close()
for item in re.finditer( pat, data ) :
  print '%5d: "%s"' % ( item.start(), data[ item.start() : item.end() ] )
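
To see what `re.MULTILINE` changes, here is a small sketch in Python 3 syntax on an in-memory string (the sample text is made up for illustration). With the flag set, `^` also matches right after each newline, so a whole-buffer search only accepts matches that begin a line, and `start()` is already the offset from the beginning of the data.

```python
import re

# Three lines; only the second one starts with "function".
data = "x function a {\nfunction b {\n  function c {\n"

anchored = re.compile(r'^function [^{]+?{', re.MULTILINE)
for m in anchored.finditer(data):
    print(m.start(), repr(m.group()))

# Without re.MULTILINE, ^ matches only at the very start of the data.
print(re.findall(r'^function [^{]+?{', data))
```

Note that anchoring with `^` skips lines where `function` is not the first thing on the line, which the unanchored pattern from the question would still match.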

 
LVL 29

Accepted Solution

by:pepr
pepr earned 250 total points
ID: 37814547
I guess the code is meant for something like an editor, where only a "window of lines" is available to be searched.  Is that correct?

Basically, I can see no problem with your basic idea. Try the following code:

a.py
import re

rexFunctionPattern = r'function [^{]+?{'
rexFunction = re.compile(rexFunctionPattern)

def getStartsByLines(fname, rex=rexFunction):
    with open(fname) as f:
        pos = 0
        match_pos = 0
        for lineno, line in enumerate(f):
            m = rex.search(line)  # use the rex argument, not the global
            if m:
                match_pos = pos + m.start() #exact pos
                yield (lineno + 1, m.start(), match_pos)

            pos = pos + len(line)


def getStartsBlockReading(fname, pattern=rexFunctionPattern):
    with open(fname) as f:
        data = f.read()

    for m in re.finditer(pattern, data, flags=re.DOTALL):
        yield m.start()


for t in getStartsByLines('file.txt'):
    print t

print '-' * 30

for pos in getStartsBlockReading('file.txt'):
    print pos



With my sample of file.txt...
a
b
c
xyz function klm { something here }
r
t  function zzz { something here }
z



It prints on my console:

c:\tmp\_Python\Dennie\Q_27664512>a.py
(4, 4, 10)
(6, 3, 47)
------------------------------
10
47



The first generator returns the line number (one-based), the position within the line (zero-based here, but easily adjusted), and the character offset from the beginning of the file (zero-based).  The second generator returns the zero-based character offsets via re.finditer over a single block of data.

If the file is opened in text mode, both approaches see the same translated newlines, so it does not matter which newline convention the file uses.
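
One caveat worth checking: in text mode the offsets count characters of the translated text, which can differ from positions in the file on disk when lines end in `\r\n`. The following is a sketch in Python 3 syntax, relying on Python 3's `open()`, where `newline=''` turns off newline translation. The temporary file and the helper name `line_offsets` are illustrative only and not part of the answer above.

```python
import os
import re
import tempfile

PATTERN = re.compile(r'function [^{]+?{')

def line_offsets(fname, newline):
    """Search line by line; return absolute offsets of the matches."""
    offsets = []
    pos = 0
    with open(fname, newline=newline) as f:
        for line in f:  # each line still carries its terminator
            m = PATTERN.search(line)
            if m:
                offsets.append(pos + m.start())
            pos += len(line)
    return offsets

# Write a small file with Windows-style line endings.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"a\r\nb\r\nfunction f {\r\n")
    path = tmp.name

translated = line_offsets(path, newline=None)  # '\r\n' is read as '\n'
untranslated = line_offsets(path, newline='')  # '\r\n' is kept as written
os.remove(path)
print(translated, untranslated)
```

For this sample the two results differ by one per preceding line, so which mode to use depends on whether the offsets are later applied to the translated text or to the raw file contents.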
 
LVL 41

Expert Comment

by:HonorGod
ID: 37880328
Thanks for the assist, and the points.

Good luck & have a great day.