Want to protect your cyber security and still get fast solutions? Ask a secure question today.Go Premium

x
?
Solved

python find position of string in entire file

Posted on 2012-04-05
7
Medium Priority
?
963 Views
Last Modified: 2012-04-23
Hi,

Below is a very simplified situation of my code. I'm iterating over lines in a file and on every line a regex is tried. I want the exact start position in the entire file if the regex matches. However the code below doesn't get the correct position. Any ideas? Please note, my situation doesn't allow to use a global regex on the entire file! I have to iterate over the lines.

file = open('file.txt').read()
pos = 0
match_pos = 0
for line in file.splitlines():
  match = re.search('function [^{]+?{', line)
  if match:
     match_pos = pos + match.start() #exact pos
     print match_pos
     print file.count('\n', 0, match_pos) #lineno
  
  
  pos = pos + len(line)

Open in new window

0
Comment
Question by:Dennie
  • 4
  • 2
7 Comments
 
LVL 41

Expert Comment

by:HonorGod
ID: 37813917
What do you mean by "the exact start position"?

Might you be getting into trouble with your calculations because of the "newline" characters a the end of each line?

Might something like this be what you want?

import re
pat = re.compile( 'function [^{]+?{', re.MULTILINE )
fh = open( 'file.txt', 'rb' )
data = fh.read()
fh.close()
for item in re.finditer( pat, data ) :
  print '%5d..%5d' % ( item.start(), item.end() )

Open in new window

0
 

Author Comment

by:Dennie
ID: 37813995
with exact position I mean that if I would use:

m = re.finditer('searchsomething', file, flags=re.DOTALL)
m.start()

That the m.start in this code would match the start position of a match in the code of my first post (where i'm iterating over the lines)
Again, I have to iterate over all the lines!
0
 
LVL 41

Expert Comment

by:HonorGod
ID: 37814046
Are you looking for the offset from the beginning of the file (in bytes)?
Or are you looking for the offset at the beginning of each line?

Sorry for being confused.

The snippet of code that I supplied above displays the "exact" offset of the match, both the staring and ending position.
0
Concerto Cloud for Software Providers & ISVs

Can Concerto Cloud Services help you focus on evolving your application offerings, while delivering the best cloud experience to your customers? From DevOps to revenue models and customer support, the answer is yes!

Learn how Concerto can help you.

 

Author Comment

by:Dennie
ID: 37814076
"The snippet of code that I supplied above displays the "exact" offset of the match, both the staring and ending position."

Yes but I have to iterate over the lines. your example is the exact match that I'm looking for, but I can't use a global regex in the file. I can only use a regex in the line.
0
 
LVL 41

Assisted Solution

by:HonorGod
HonorGod earned 1000 total points
ID: 37814277
ok, so you want to iterate over the individual lines in the file, but obtain the offset from the beginning of the file.
Is that right?

Why do you need to iterate over the lines?

The regular expression can include the use of ^ to match the beginning of each line, so this may be the solution for which you are looking:

import re
pat = re.compile( '^function [^{]+?{', re.MULTILINE )
fh = open( 'file.txt', 'rb' )
data = fh.read()
fh.close()
for item in re.finditer( pat, data ) :
  print '%5d: "%s"' % ( item.start(), data[ item.start() : item.end() ] )

Open in new window

0
 
LVL 29

Accepted Solution

by:
pepr earned 1000 total points
ID: 37814547
I guess that the code should be used for example when building an editor where you have only a "window of lines" to be searched.  Is it correct?

Basically, I can see no problem with your basic idea. Try the following code:

a.py
import re

rexFunctionPattern = r'function [^{]+?{'
rexFunction = re.compile(rexFunctionPattern)

def getStartsByLines(fname, rex=rexFunction):
    with open(fname) as f:
        pos = 0
        match_pos = 0
        for lineno, line in enumerate(f):
            m = rexFunction.search(line)
            if m:
                match_pos = pos + m.start() #exact pos
                yield (lineno + 1, m.start(), match_pos)

            pos = pos + len(line)


def getStartsBlockReading(fname, pattern=rexFunctionPattern):
    with open(fname) as f:
        data = f.read()

    for m in re.finditer(pattern, data, flags=re.DOTALL):
        yield m.start()


for t in getStartsByLines('file.txt'):
    print t

print '-' * 30

for pos in getStartsBlockReading('file.txt'):
    print pos

Open in new window


With my sample of file.txt...
a
b
c
xyz function klm { something here }
r
t  function zzz { something here }
z

Open in new window


It prints on my console:

c:\tmp\_Python\Dennie\Q_27664512>a.py
(4, 4, 10)
(6, 3, 47)
------------------------------
10
47

Open in new window


The earlier code returns the line number (one-based), line position (here zero-based, but can be adjusted easily), and the character offset from the beginning of the file (zero based).  The second generator retuns the zero-based character offsets via the find iter of a single block of data.

If the file is opened in a text mode, then it does not matter what newlines are used.
0
 
LVL 41

Expert Comment

by:HonorGod
ID: 37880328
Thanks for the assist, and the points.

Good luck & have a great day.
0

Featured Post

Become an Android App Developer

Ready to kick start your career in 2018? Learn how to build an Android app in January’s Course of the Month and open the door to new opportunities.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

When you see single cell contains number and text, and you have to get any date out of it seems like cracking our heads.
When you discover the power of the R programming language, you are going to wonder how you ever lived without it! Learn why the language merits a place in your programming arsenal.
The viewer will learn how to clear a vector as well as how to detect empty vectors in C++.
The viewer will be introduced to the member functions push_back and pop_back of the vector class. The video will teach the difference between the two as well as how to use each one along with its functionality.
Suggested Courses

578 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question