Solved

Split Strings into Words at Spaces AND Punctuation

Posted on 2007-04-08
Last Modified: 2011-09-20
    This Python script ( http://www.aaronsw.com/2002/diff/diff.py ) apparently separates html source code into words by splitting the html at every space. Then it renders a nice html diff.
     However, defining words as "anything between spaces" leads to unwanted formatting in the diff.  I want to change it to split at punctuation as well as spaces. (Currently it only splits words outside of html tags, and that should stay the same.)

     Example of problems because of breaking only at spaces:
         old:   This is nice.
         new: This is nice, good, and great.

         Diff:  This is -nice.-  +nice, good, and great.+

It shows the word "nice" twice in the diff, because it treats "nice." (including the period) as one word.

    Instead, I would like the diff to be:
                This is nice-.- +, good, and great.+

(showing that the period was deleted, and a comma was added right after it, etc.; but showing the word "nice" only once).
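
A minimal sketch of why this happens, assuming the script compares the two word lists with Python's difflib.SequenceMatcher (the thread does not confirm this; `changed_spans` is a hypothetical helper). When the text is split only at spaces, `nice.` and `nice,` are different tokens, so the whole word is reported as replaced:

```python
import difflib

# Tokens as produced by splitting at spaces only.
old = "This is nice.".split()
new = "This is nice, good, and great.".split()

def changed_spans(a, b):
    """Return the non-equal opcodes as (tag, old_tokens, new_tokens)."""
    matcher = difflib.SequenceMatcher(None, a, b)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

spans = changed_spans(old, new)
print(spans)
```

Here `spans` holds a single replace: `nice.` on the old side against `nice,`, `good,`, `and`, `great.` on the new side, which corresponds to the diff shown above.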

     I'm not sure which line in the script splits the html at spaces.  Please help me identify it and show me how to modify the script to split at these characters also (in addition to spaces):
    .          ,             ;               :          '           "            !         -  
period  comma  semicolon  colon  apost. dblquote  excl. hyphen        

     For example, the desired effect is to treat this sentence as having about 13 words, rather than 5:

     This, he said, is "a well-made program!"

That way, the diff would be:
    This +is+, he said, -is- "a --well-- +better+ -made program!+!+"    
instead of:
    -This,- +This is,+ he said, "a --well-made-- +better-made+ -program!"- +program!!"+
Question by:Randall-B

Accepted Solution

by:
ramrom earned 250 total points
Looks like line 52:
                  elif c in string.whitespace: out.append(cur+c); cur = ''
Create a "punctuation list": pList = string.whitespace + """.,;:'"!-""" then change 52 to
                  elif c in pList: out.append(cur+c); cur = ''
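
Note that the modified line still appends `cur+c`, so a punctuation mark stays glued to the word before it: `nice.` and `nice,` remain different tokens. A sketch of one way around this follows, written for Python 3 and only modeled on the quoted line. `html_to_tokens` and `PUNCT` are hypothetical names, and the tag handling is an assumption about how the script behaves, not a copy of it. The word and the punctuation mark are emitted as separate tokens:

```python
import string

PUNCT = ".,;:'\"!-"  # the eight characters listed in the question

def html_to_tokens(html):
    """Split HTML into tokens: whole tags, words, and single punctuation marks."""
    tokens, cur, in_tag = [], "", False
    for c in html:
        if in_tag:
            cur += c
            if c == ">":              # tag complete: keep it as one token
                tokens.append(cur)
                cur, in_tag = "", False
        elif c == "<":                # a tag starts: flush the pending word
            if cur:
                tokens.append(cur)
            cur, in_tag = c, True
        elif c in string.whitespace:  # as in the quoted line: word + space
            tokens.append(cur + c)
            cur = ""
        elif c in PUNCT:              # new: word and punctuation separately
            if cur:
                tokens.append(cur)
            tokens.append(c)
            cur = ""
        else:
            cur += c
    if cur:
        tokens.append(cur)
    return tokens

sample = '<p align="justify">This is nice.</p>'
tokens = html_to_tokens(sample)
print(tokens)
```

Joining the tokens reproduces the input exactly, and the quotes inside the tag are left alone because punctuation is only split outside of tags.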

Author Comment

by:Randall-B
ramrom,
   Good. For the most part, that seems to be it.  I've been testing it and it seems to have the right effect much of the time, but not in every instance that I would have expected. I don't know if that's just how Python's  TextDiff works, or if there's another line that can be modified in the script. Do you see anything else that would appear to be relevant in the script?

Expert Comment

by:ramrom
Provide some html from which you don't get the desired result.

Author Comment

by:Randall-B
OK, here is an actual example of the output: http://216.92.61.99/pythondiff2.htm (I saved the .py script's results to an .htm file and added line numbers for reference).
   Some examples of unexpected results involving punctuation are around lines 6-8,  14-15,  20-21,  36-37,  42,  59, and 116.  One of those involved parentheses (which were not on the pList, but the results remained the same when I added them). In other cases, the problems involve things already on the list, like colons, periods, and commas.
   

Expert Comment

by:ramrom
I said "Provide some html from which you don't get the desired result."  You gave me the results! I want the input. Please try again.

Author Comment

by:Randall-B
OK. You can see the Original html here:  http://216.92.61.99/original.htm  and the Revised html here: http://216.92.61.99/revised.htm .   It created the diff results by comparing the Revised to the Original in those 2 inputs. Thanks.

Author Comment

by:Randall-B
ramrom,
   Were those html inputs suitable for testing?  
Here's a shorter example:

* Input 1 (original):

<p align="justify">This is a great web site. It outshines the competition. The experts are the best.</p>

* Input 2 (revised):

<p align="justify">This is a great web site. It outshines the competition, in my opinion, which is based on experience. The experts are the best.</p>

Thanks.

Expert Comment

by:ramrom
Thanks. That makes it a LOT easier. We are getting closer. Please post the result you want from those last 2 strings.

Author Comment

by:Randall-B
I would want:

<p align="justify">This is a great web site. It outshines the competition<s>.</s> <u>, in my opinion, which is based on experience.</u> The experts are the best.</p>

(I had modified the script to make <s> . . . </s> and <u> . . . </u> instead of <del . . .  and <ins . . . )

But currently it is outputting something like this (note the repetition of "the competition" [once stricken with a period, and then added back in with a comma]):

<p align="justify">This is a great web site. It outshines <s>the competition.</s><u>the competition, in my opinion, which is based on experience.</u> The experts are the best.</p>


That kind of thing is happening in various places with different punctuation, such as commas, colons, etc.
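
For the shorter example above, the repetition goes away once punctuation is a token of its own. The following is a self-contained sketch, not the script's code: it assumes a difflib-based comparison, uses a hypothetical regex tokenizer (`TOKEN`) and helper (`render_diff`), and wraps deletions in <s> and insertions in <u> as described above. It does not treat tags inside changed spans specially, which a real html diff would need to do.

```python
import difflib
import re

# Tokens: a whole tag, a word, a run of whitespace, or one punctuation mark.
TOKEN = re.compile(r"<[^>]*>|\w+|\s+|[^\w\s]")

def render_diff(old_html, new_html):
    """Mark deletions with <s>...</s> and insertions with <u>...</u>."""
    a = TOKEN.findall(old_html)
    b = TOKEN.findall(new_html)
    out = []
    matcher = difflib.SequenceMatcher(None, a, b)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append("".join(a[i1:i2]))
            continue
        if i2 > i1:  # something was removed from the old text
            out.append("<s>" + "".join(a[i1:i2]) + "</s>")
        if j2 > j1:  # something was added in the new text
            out.append("<u>" + "".join(b[j1:j2]) + "</u>")
    return "".join(out)

old_html = ('<p align="justify">This is a great web site. It outshines '
            'the competition. The experts are the best.</p>')
new_html = ('<p align="justify">This is a great web site. It outshines '
            'the competition, in my opinion, which is based on experience. '
            'The experts are the best.</p>')

result = render_diff(old_html, new_html)
print(result)
```

With these inputs the period is matched rather than deleted, so only `, in my opinion, which is based on experience` is marked as inserted. That differs slightly from the desired output above, but "the competition" appears only once.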

-------------

Here's another example:

* Input 1 (original):

<p align="justify">Which is better? PHP of Perl? This is a very subjective question. You ask this question to a PHP person you will hear PHP is better and if you ask this question to Perl person you will hear Perl is better.</p>

* Input 2 (revised):

<p align="justify">Which is better? PHP or Perl? This is a very subjective question. If you ask a PHP coder, you will hear that PHP is better; but if you ask a Perl person, you will hear that Perl is better.</p>


* Output (which could be improved to eliminate redundant deleting-and-re-adding of words, failure to treat a word and connected punctuation as separate, etc.):

<p align="justify">Which is better? PHP <s>of </s><u>or </u>Perl? This is a very subjective question. <s>You </s><u>If you </u>ask <s>this question to </s>a PHP <s>person </s><u>coder, </u>you <s>will </s><u>will </u>hear <s>PHP is better and </s><u>that PHP is better; but </u>if you ask <s>this question to </s><u>a </u>Perl <s>person you </s><u>person, you </u>will hear <u>that </u>Perl <s>is </s><u>is </u>better.</p>


** Desired output:

<p align="justify">Which is better? PHP <s>of </s><u>or </u>Perl? This is a very subjective question. <s>You </s><u>If you </u>ask <s>this question to </s>a PHP <s>person </s><u>coder, </u>you will hear <u>that</u> PHP is better<u>; </u> <s>and </s><u>but </u>if you ask <s>this question to </s>a Perl person<u>, </u> you will hear <u>that </u>Perl is better.</p>
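
The actual output above also strikes and re-adds identical-looking words (`will `, `is `). The thread never establishes why. One possibility, offered purely as an assumption, is that the two inputs differ in whitespace at those points (for example a line break in one and a space in the other): since each token keeps its trailing whitespace character, such tokens would compare as unequal. A sketch of collapsing whitespace before tokenizing, with `normalize_ws` as a hypothetical helper:

```python
import re

def normalize_ws(html):
    """Collapse every run of whitespace into a single space.

    Caution: this also alters text inside <pre> blocks.
    """
    return re.sub(r"\s+", " ", html)

wrapped = "you will\nhear"   # line break after "will"
unwrapped = "you will hear"  # plain space after "will"

differs_raw = wrapped != unwrapped
same_after = normalize_ws(wrapped) == normalize_ws(unwrapped)
print(differs_raw, same_after)
```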

Author Comment

by:Randall-B
ramrom,
   Any further suggestions?  Thanks.

Author Comment

by:Randall-B
It appears that no further suggestions are available, so I am accepting the first response, which did make some improvements as sought in my question.  Thanks.

Expert Comment

by:ramrom
Thanks. This was getting over my head.