axa9055

How to parse text log file for emails?

Hi Experts,

I need to write a ColdFusion function that reads a text log file (e.g. comments.txt) and extracts two things: the "email" and a short "comment" associated with that email. The text has a structure that should make it somewhat easier to parse. Each sentence in the file that carries the information I need looks like this:

Comment from example@example.com says "some words $%!"

As the template shows, the email is always between "Comment from " and " says ", and the comment itself is always between quotation marks. It can contain all kinds of characters, including spaces, but it is only a short sentence (no newlines to worry about).

So I need all the emails and comments in this txt file. I'm mostly stuck on how to loop through the text file and parse the text. I'm guessing each time I loop I would need to extract the two substrings and insert them in my database and then clean out what was parsed until there's nothing left to parse and finish the loop.

I'd really appreciate your help on this.
_agx_

Assuming CF8+ *and* that each line always ends with the "comment", a small loop and a regex should work:


<cfloop file="c:\path\to\yourLogFile.txt" index="line">
   <!--- trim once, so the match positions line up with the string passed to mid() --->
   <cfset entry = trim(line)>
   <cfset matches = reFindNoCase('^Comment from (.+) says "(.+)"$', entry, 1, true)>
   <!--- pos/len hold 3 elements (whole match + 2 groups) only when the pattern matched --->
   <cfif arrayLen(matches.len) eq 3>
      <!--- extract the 2 values ... --->
      <cfset email   = mid(entry, matches.pos[2], matches.len[2])>
      <cfset comment = mid(entry, matches.pos[3], matches.len[3])>
      <!--- do something with the values ... --->
      <cfoutput>
      debug:: <strong>email=</strong> #email# <strong>comment=</strong>#comment#<hr>
      </cfoutput>
   </cfif>
</cfloop>
axa9055

ASKER

Thanks agx, but how can this be modified to accommodate cases where the lines aren't separate or there's other junk text before or after the fixed structure I mentioned?
_agx_

When it comes to text parsing the "real" format always matters :) Can you provide an example?
What might work is to remove the starts/ends with requirements in this expression:
   <!--- ^ (starts with) and $ (ends with) --->
   <cfset matches = reFindNoCase('^Comment from (.+) says "(.+)"$', trim(line), 1, true)>

Instead, change it to:
    <cfset matches = reFindNoCase('Comment from (.+) says "(.+)"', trim(line), 1, true)>

That'll probably work. But like I said, it depends a lot on the "real" file format.
axa9055

ASKER

Thanks agx, removing the ^ and $ worked, but I'm not sure I understand what you mean by "real" format. It's a .txt file and the structure is what I described, but sometimes the entries are not on separate lines, and sometimes there are extra spaces between the words in the structure. If you could help me with these two issues I'll give you the points. I tested your code and it works on a sample txt I created, which had neither of these issues.

Thanks again,
_agx_

I just meant a better example of the actual contents. When you use regexes, you're searching for a specific pattern, and any exception to that pattern makes a world of difference. I think I understand the varying spaces, but I'm not sure what you mean by not on separate lines. Can you give an example?
axa9055

ASKER

Well, your loop reads one line at a time, correct? So when it hits something like
<!--- begin .txt --->
Comment from example1@example.com says "comment1"  junkjunk junnk Comment from example2@example.com says "comment2"

Comment from example3@example.com says "comment3"
<!--- end .txt --->

As you can see, the first two occurrences are on one line, so the loop only picks up the first and third instances and ignores the second.
Also, what if there are more spaces, as I mentioned previously? For example:

Comment from example1@example.com says "comment1"
Comment         from          example2@example.com says "comment2"

As you can see, the structure of the second line changes a bit because of the extra spaces.
_agx_

Yeah, <cfloop file> does read one line at a time. But reFindNoCase/reFind only detects the first instance of the pattern, so the reMatch function would probably work better.

If your file isn't too big, you could read the whole thing into a variable, then use reMatch to return an array of the matched instances:

     arr[1] = Comment from example1@example.com says "comment1"  
     arr[2] = Comment from example2@example.com says "comment2"
     arr[3] = Comment from example3@example.com says "comment3"

Then you could loop through the array and extract the email and comment like before. The \s+ means one or more whitespace characters, so that should handle the extra spaces you mentioned.

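<cfset data = FileRead("c:\path\to\yourFile.txt")>
<!--- reMatchNoCase keeps this search case-insensitive, like the reFindNoCase call below --->
<cfset matches = reMatchNoCase('(Comment\s+from\s+[^\s]+\s+says\s+"[^"]+"\s*)', trim(data))>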
<cfloop array="#matches#" index="line">
   <cfset results = reFindNoCase('Comment\s+from\s+([^\s]+)\s+says\s+"([^"]+)"', trim(line), 1, true)>
   <cfif arrayLen(results.len) gte 3>
      <cfset email   = mid(line, results.pos[2], results.len[2])>
      <cfset comment = mid(line, results.pos[3], results.len[3])>
      <!--- do something with the values ... --->   
      <cfoutput>
      debug:: <strong>email=</strong> #email# <strong>comment=</strong>#comment#<hr>
      </cfoutput>
   </cfif>
</cfloop>
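
For the "do something with the values" step, a parameterized insert is one option. This is only a minimal sketch: the datasource name myDatasource, the table comments, and the columns email and comment_text are all hypothetical, and the two sample values stand in for what the loop extracts.

```cfml
<!--- Sample values; inside the loop these come from the mid() calls --->
<cfset email   = "example@example.com">
<cfset comment = "some words $%!">

<!--- Hypothetical datasource, table and column names: adjust to your schema.
      cfqueryparam binds the values, so quotes or odd characters in the
      comment cannot break (or inject into) the SQL statement. --->
<cfquery datasource="myDatasource">
   INSERT INTO comments (email, comment_text)
   VALUES (
      <cfqueryparam value="#email#" cfsqltype="cf_sql_varchar">,
      <cfqueryparam value="#comment#" cfsqltype="cf_sql_varchar">
   )
</cfquery>
```

Placed inside the <cfif> block, this would run once per extracted email/comment pair.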

If the file is very big, you could do the same thing with <cfloop file>. I'd just need to change the loop a little.
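
One possible way to adapt the loop for a big file, as a sketch only: apply reMatchNoCase to each line instead of to the whole file, so several entries on one line are still picked up. The variable names pattern, found, item and parts are just placeholders. This assumes a single entry never wraps across two lines; an entry split by a line break would be missed here, whereas the whole-file version would catch it because \s+ also matches line breaks.

```cfml
<!--- Same pattern as before; the two groups capture the email and the comment --->
<cfset pattern = 'Comment\s+from\s+([^\s]+)\s+says\s+"([^"]+)"'>

<cfloop file="c:\path\to\yourLogFile.txt" index="line">
   <!--- reMatchNoCase returns every match on this line, not just the first --->
   <cfset found = reMatchNoCase(pattern, line)>
   <cfloop array="#found#" index="item">
      <!--- Re-run the pattern on the single match to get the group positions --->
      <cfset parts = reFindNoCase(pattern, item, 1, true)>
      <cfif arrayLen(parts.len) gte 3>
         <cfset email   = mid(item, parts.pos[2], parts.len[2])>
         <cfset comment = mid(item, parts.pos[3], parts.len[3])>
         <!--- do something with the values ... --->
         <cfoutput>
         debug:: <strong>email=</strong> #htmlEditFormat(email)# <strong>comment=</strong>#htmlEditFormat(comment)#<hr>
         </cfoutput>
      </cfif>
   </cfloop>
</cfloop>
```

Only one line is held in memory at a time, and junk text before, between, or after the entries is simply skipped.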
axa9055

ASKER

Thank you agx, you helped solve my issue.