• Status: Solved

Count repeated consecutive two-word pairs in a text file

How do I count repeated pairs of consecutive words in a text file?

The input.txt file contains:

backend error oracle error insufficient space
oracle error
insufficient space insufficient space
complete order etc

The output should be:
backend error count 1
oracle error count 2
insufficient space count 3
complete order 1
etc 1
Thirupathi Lagishetti
1 Solution
 
gelonida commented:
Is this example representative enough?

Are you looking for predefined word pairs that you know before parsing the text,
or do you always group the first/second, third/fourth, fifth/sixth words of a line as pairs?

What should happen with punctuation characters? Can they occur, and should they be stripped off?

All of this might affect the best implementation for a robust solution in your context.
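
To make these questions concrete, here is a minimal Python 3 sketch (the snippets in this thread are Python 2) of the two pairing strategies, with punctuation dropped by keeping only `\w+` runs. The function names `words_of`, `fixed_pairs` and `sliding_pairs` and the sample line are made up for illustration; whether punctuation should be stripped at all is an assumption, not something the question states.

```python
import re

def words_of(line):
    # Keep only runs of letters, digits and underscores; punctuation is dropped.
    return re.findall(r"\w+", line)

def fixed_pairs(line):
    # Group words as (1st, 2nd), (3rd, 4th), ...; a trailing odd word is left out.
    words = words_of(line)
    return [" ".join(words[i:i + 2]) for i in range(0, len(words) - 1, 2)]

def sliding_pairs(line):
    # Every adjacent pair: (1st, 2nd), (2nd, 3rd), (3rd, 4th), ...
    words = words_of(line)
    return [" ".join(pair) for pair in zip(words, words[1:])]

line = "backend error, oracle error; insufficient space."
print(fixed_pairs(line))
print(sliding_pairs(line))
```

The fixed grouping yields three pairs for this line, while the sliding window also reports "error oracle" and "error insufficient".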
 
aikimark commented:
What about
error oracle
error insufficient
etc ?
 
aikimark commented:
import re
import collections

text = """backend error oracle error insufficient space
oracle error
insufficient space insufficient space
complete order"""

# Python 2 syntax. findall returns non-overlapping matches, so each word
# ends up in at most one "word whitespace word" pair.
print collections.Counter(re.findall(r'\b(\w+\s+\w+)\b', text))

produces the following output:
Counter({'insufficient space': 3, 'oracle error': 2, 'complete order': 1, 'backend error': 1})
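
Note that the string literal above leaves out the trailing "etc" from the last line of input.txt. Because `re.findall` only returns non-overlapping matches, one extra word shifts every pair that follows it. A small Python 3 sketch of that effect, using made-up sample strings `even` and `odd`:

```python
import re
from collections import Counter

PAIR = re.compile(r"\b(\w+\s+\w+)\b")

even = "oracle error insufficient space"
odd = "etc oracle error insufficient space"

# Four words: the pairs line up as expected.
print(Counter(PAIR.findall(even)))
# Five words: the leading word shifts the pairing and "space" is left over.
print(Counter(PAIR.findall(odd)))
```

So the one-line regex is only reliable when the pairs are known to start on the first, third, fifth, ... word.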

If this is sufficient, all you need to do is replace the string literal with a file read.
Hint: with ... as
 
aikimark commented:
...like this:
# Same imports as in the previous snippet
import re
import collections
with open('c:\users\mark\downloads\Q_29099534.txt') as f:
    text = f.read(-1)
    print collections.Counter(re.findall(r'\b(\w+\s+\w+)\b', text))
 
Thirupathi Lagishetti (author) commented:
Hi @aikimark,
Thank you for your reply. The content below is also needed; could you please update the script or give me a hint to achieve the result?

error oracle
error insufficient
 
gelonida commented:
Just a small comment:

Instead of
with open('c:\users\mark\downloads\Q_29099534.txt')

it's better to write one of these:
with open('c:\\users\\mark\\downloads\\Q_29099534.txt')

or
with open(r'c:\users\mark\downloads\Q_29099534.txt')

or
with open('c:/users/mark/downloads/Q_29099534.txt')

In your specific case there's no issue, but if you had

with open('c:\users\tom\new_downloads\Q_29099534.txt')

then \n and \t would have caused issues, as they would have been interpreted as a newline character and a tab character.
So, out of habit, it's best to escape all backslashes.
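
A short Python 3 sketch of the difference. The path `c:\temp\new.txt` is a made-up example. One thing worth knowing beyond the advice above: the snippets in this thread are Python 2, where `'c:\users\...'` happens to work, but in Python 3 `\u` inside a normal string literal starts a Unicode escape, so that same literal is a syntax error there and the raw, doubled or forward-slash form is required.

```python
# Doubled backslashes and a raw string describe the same path.
escaped = 'c:\\users\\tom\\new_downloads\\Q_29099534.txt'
raw = r'c:\users\tom\new_downloads\Q_29099534.txt'
print(escaped == raw)

# Hypothetical path written without escaping: \t becomes a tab
# and \n becomes a newline, so this is not the intended path.
broken = 'c:\temp\new.txt'
print('\t' in broken, '\n' in broken)
print(repr(broken))
```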
 
aikimark commented:
import re
import collections

with open(r'c:\users\mark\downloads\Q_29099534.txt') as f:
    text = f.read(-1)
    # Append the text again without its first word, so the non-overlapping
    # regex also pairs the 2nd/3rd, 4th/5th, ... words.
    text += ' ' + ' '.join(text.split(' ')[1:])
    print collections.Counter(re.findall(r'\b(\w+\s+\w+)\b', text))

produces
Counter({'insufficient space': 3, 'oracle error': 2, 'error insufficient': 1, 'complete order': 1, 'space\ncomplete': 1, 'space\noracle': 1, 'space insufficient': 1, 'backend error': 1, 'error\ninsufficient': 1, 'error oracle': 1})
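
The keys containing `\n` in this output are pairs that span a line break. If such cross-line pairs are not wanted (the question does not say either way, so this is an assumption), one alternative is to pair words line by line. This is a Python 3 sketch of a different technique, not the approach used in this thread:

```python
from collections import Counter

text = """backend error oracle error insufficient space
oracle error
insufficient space insufficient space
complete order"""

counts = Counter()
for line in text.splitlines():
    words = line.split()
    # zip pairs each word with its successor on the same line only.
    counts.update(" ".join(pair) for pair in zip(words, words[1:]))

for pair, n in counts.most_common():
    print(pair, "count", n)
```

For this sample it counts "insufficient space" three times and "oracle error" twice, includes "error oracle" and "error insufficient" once each, and produces no pair that crosses a line break.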

 
aikimark commented:
Since the text contains multiple lines, this tweak replaces the line breaks with spaces, so every pair is purely space-separated:
import re
import collections

with open(r'c:\users\mark\downloads\Q_29099534.txt') as f:
    text = f.read(-1)
    text += ' ' + ' '.join(text.split(' ')[1:])
    # Normalize Windows and Unix line breaks to single spaces
    text = re.sub(r'\r\n', ' ', text)
    text = re.sub(r'\n', ' ', text)
    print collections.Counter(re.findall(r'\b(\w+\s+\w+)\b', text))

 
aikimark commented:
Maybe using split() with a maxsplit argument, instead of split() plus join(), is more Pythonic.
import re
import collections

with open(r'c:\users\mark\downloads\Q_29099534.txt') as f:
    text = f.read(-1)
    # split(' ', 1)[1] is everything after the first space
    text += ' ' + text.split(' ',1)[1]
    text = re.sub(r'\r\n', ' ', text)
    text = re.sub(r'\n', ' ', text)
    print collections.Counter(re.findall(r'\b(\w+\s+\w+)\b', text))
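
One caveat, as far as I can tell: the append-and-rescan trick depends on the text having an even number of words. With an odd count (for example when the trailing "etc" from input.txt is present), the last word pairs up with the first word of the appended copy, and the second pass then repeats the first-pass pairs. The Python 3 sketch below compares a port of the approach above with a plain `zip` over the word list; the function names and sample strings are made up for illustration.

```python
import re
from collections import Counter

def pairs_by_append(text):
    # Python 3 port of the append-and-rescan approach shown above.
    text += ' ' + text.split(' ', 1)[1]
    text = re.sub(r'\r?\n', ' ', text)
    return Counter(re.findall(r'\b(\w+\s+\w+)\b', text))

def pairs_by_zip(text):
    # Every adjacent word pair, independent of how many words there are.
    words = text.split()
    return Counter(' '.join(pair) for pair in zip(words, words[1:]))

even = "oracle error insufficient space"
odd = even + " etc"

print(pairs_by_append(even) == pairs_by_zip(even))
print(pairs_by_append(odd) == pairs_by_zip(odd))
print(pairs_by_append(odd))
```

Like the thread's final version, `pairs_by_zip` treats the whole file as one word stream, so pairs that cross line breaks are counted too.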

 
Thirupathi Lagishetti (author) commented:
Thank you so much for your help, @aikimark. It really solved my problem. Cheers!
Question has a verified solution.
