Split Strings into Words at Spaces AND Punctuation

     This Python script ( http://www.aaronsw.com/2002/diff/diff.py ) apparently separates HTML source code into words by splitting the HTML at every space. Then it renders a nice HTML diff.
     However, defining words as "anything between spaces" leads to unwanted formatting in the diff.  I want to change it to split at punctuation as well as spaces. (Currently it only splits words outside of HTML tags, and that should stay the same.)

     Example of problems because of breaking only at spaces:
         old:   This is nice.
         new: This is nice, good, and great.

         Diff:  This is -nice.-  +nice, good, and great.+

It shows the word "nice" twice in the diff, because it treats "nice." (including the period) as one word.

    Instead, I would like the diff to be:
                This is nice-.- +, good, and great.+

(showing that the period was deleted, and a comma was added right after it, etc.; but showing the word "nice" only once).

     I'm not sure which line in the script splits the HTML at spaces.  Please help me identify it and show me how to modify the script to split at these characters also (in addition to spaces):
    .       ,      ;          :      '           "             !            -
    period  comma  semicolon  colon  apostrophe  double quote  exclamation  hyphen

     For example, the desired effect is to treat this sentence as having about 13 words, rather than 5:

     This, he said, is "a well-made program!"

That way, the diff would be:
    This +is+, he said, -is- "a --well-- +better+ -made program!+!+"    
instead of:
    -This,- +This is,+ he said, "a --well-made-- +better-made+ -program!"- +program!!"+
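
For illustration only, the tokenization described above could be sketched with a regular expression. This is not code from diff.py: the name `tokenize` and the pattern are assumptions made for this example, whitespace is simply dropped, and HTML tags are not handled.

```python
import re

# Either one of the listed punctuation marks, or a run of characters
# that are neither whitespace nor one of those marks.
TOKEN = re.compile(r"""[.,;:'"!-]|[^\s.,;:'"!-]+""")

def tokenize(text):
    """Return words and punctuation marks as separate tokens (whitespace dropped)."""
    return TOKEN.findall(text)

tokens = tokenize('This, he said, is "a well-made program!"')
print(tokens)
print(len(tokens))  # 14 with this particular pattern
```

Each punctuation mark becomes its own token here, so "well-made" counts as three tokens and the closing `!"` as two.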
Asked by: Randall-B

ramrom (consultant) Commented:
Looks like line 52:
                  elif c in string.whitespace: out.append(cur+c); cur = ''
Create a "punctuation list":
                  pList = string.whitespace + """.,;:'"!-"""
then change line 52 to:
                  elif c in pList: out.append(cur+c); cur = ''
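
As a rough illustration of what this change does, here is a minimal sketch of the character loop in isolation. It assumes the code around line 52 works roughly as the quoted line suggests; `split_attached` and `P_LIST` are made-up names, not from diff.py, and HTML tag handling is left out.

```python
import string

# Whitespace plus the punctuation marks listed in the question.
P_LIST = string.whitespace + """.,;:'"!-"""

def split_attached(text, separators=P_LIST):
    """Split text so that each separator stays attached to the word before it."""
    out, cur = [], ''
    for c in text:
        if c in separators:
            out.append(cur + c)  # word and separator become one token
            cur = ''
        else:
            cur += c
    out.append(cur)
    # Drop empty tokens.
    return [tok for tok in out if tok != '']

print(split_attached("This is nice."))
print(split_attached("This is nice, good"))
```

Because each separator is appended together with the word before it, "nice." and "nice," still come out as different tokens, so a diff built on these tokens may keep repeating the word.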
 
Randall-B (Author) Commented:
ramrom,
   Good. For the most part, that seems to be it.  I've been testing it and it seems to have the right effect much of the time, but not in every instance that I would have expected. I don't know if that's just how Python's  TextDiff works, or if there's another line that can be modified in the script. Do you see anything else that would appear to be relevant in the script?
 
ramrom (consultant) Commented:
Provide some html from which you don't get the desired result.
 
Randall-B (Author) Commented:
OK, here is an actual example of the output: http://216.92.61.99/pythondiff2.htm (I saved the .py script's results to an .htm file and added line numbers for reference).
   Some examples of unexpected results involving punctuation are around lines 6-8,  14-15,  20-21,  36-37,  42,  59, and 116.  One of those involved parentheses (which were not on the pList, but the results remained the same when I added them). In other cases, the problems involve things already on the list, like colons, periods, and commas.
   
 
ramrom (consultant) Commented:
I said "Provide some html from which you don't get the desired result."  You gave me the results! I want the input. Please try again.
 
Randall-B (Author) Commented:
OK. You can see the Original html here:  http://216.92.61.99/original.htm  and the Revised html here: http://216.92.61.99/revised.htm .   It created the diff results by comparing the Revised to the Original in those 2 inputs. Thanks.
 
Randall-B (Author) Commented:
ramrom,
   Were those html inputs suitable for testing?  
Here's a shorter example:

* Input 1 (original):

<p align="justify">This is a great web site. It outshines the competition. The experts are the best.</p>

* Input 2 (revised):

<p align="justify">This is a great web site. It outshines the competition, in my opinion, which is based on experience. The experts are the best.</p>

Thanks.
 
ramrom (consultant) Commented:
Thanks. That makes it a LOT easier. We are getting closer. Please post the result you want from those last 2 strings.
 
Randall-B (Author) Commented:
I would want:

<p align="justify">This is a great web site. It outshines the competition<s>.</s> <u>, in my opinion, which is based on experience.</u> The experts are the best.</p>

(I had modified the script to make <s> . . . </s> and <u> . . . </u> instead of <del . . .  and <ins . . . )

But currently it is outputting something like this (note the repetition of "the competition" [once stricken with a period, and then added back in with a comma]):

<p align="justify">This is a great web site. It outshines <s>the competition.</s><u>the competition, in my opinion, which is based on experience.</u> The experts are the best.</p>


That kind of thing is happening in various places with different punctuation, such as commas, colons, etc.
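
One possible way to avoid the repetition, offered only as a sketch and not as a tested patch to diff.py: emit each punctuation mark and each whitespace character as its own token, so that "competition" is the same token in both inputs. The names `split_separate` and `mark_diff` are hypothetical, HTML tags are not handled, and the use of the standard library's `difflib.SequenceMatcher` for the comparison is an assumption of this example.

```python
import difflib
import string

PUNCT = """.,;:'"!-"""  # punctuation marks listed in the question

def split_separate(text):
    """Split text into words, whitespace, and punctuation as separate tokens."""
    out, cur = [], ''
    for c in text:
        if c in PUNCT or c in string.whitespace:
            if cur:
                out.append(cur)  # the word so far, without the separator
            out.append(c)        # the separator as its own token
            cur = ''
        else:
            cur += c
    if cur:
        out.append(cur)
    return out

def mark_diff(old, new):
    """Wrap deleted tokens in <s>...</s> and inserted tokens in <u>...</u>."""
    a, b = split_separate(old), split_separate(new)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            parts.append(''.join(a[i1:i2]))
        if tag in ('delete', 'replace'):
            parts.append('<s>' + ''.join(a[i1:i2]) + '</s>')
        if tag in ('insert', 'replace'):
            parts.append('<u>' + ''.join(b[j1:j2]) + '</u>')
    return ''.join(parts)

old = "This is a great web site. It outshines the competition. The experts are the best."
new = ("This is a great web site. It outshines the competition, in my opinion, "
       "which is based on experience. The experts are the best.")
result = mark_diff(old, new)
print(result)
```

On this input the word "competition" appears only once in the result. The matcher reports the change as a single insertion placed before the existing period, rather than a deleted period followed by a new clause, so the markup differs slightly from the desired output shown above.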

-------------

Here's another example:

* Input 1 (original):

<p align="justify">Which is better? PHP of Perl? This is a very subjective question. You ask this question to a PHP person you will hear PHP is better and if you ask this question to Perl person you will hear Perl is better.</p>

* Input 2 (revised):

<p align="justify">Which is better? PHP or Perl? This is a very subjective question. If you ask a PHP coder, you will hear that PHP is better; but if you ask a Perl person, you will hear that Perl is better.</p>


* Output (which could be improved to eliminate redundant deleting-and-re-adding of words, failure to treat a word and connected punctuation as separate, etc.):

<p align="justify">Which is better? PHP <s>of </s><u>or </u>Perl? This is a very subjective question. <s>You </s><u>If you </u>ask <s>this question to </s>a PHP <s>person </s><u>coder, </u>you <s>will </s><u>will </u>hear <s>PHP is better and </s><u>that PHP is better; but </u>if you ask <s>this question to </s><u>a </u>Perl <s>person you </s><u>person, you </u>will hear <u>that </u>Perl <s>is </s><u>is </u>better.</p>


** Desired output:

<p align="justify">Which is better? PHP <s>of </s><u>or </u>Perl? This is a very subjective question. <s>You </s><u>If you </u>ask <s>this question to </s>a PHP <s>person </s><u>coder, </u>you will hear <u>that</u> PHP is better<u>;</u> <s>and </s><u>but </u>if you ask <s>this question to </s>a Perl person<u>, </u> you will hear <u>that </u>Perl is better.</p>
 
Randall-B (Author) Commented:
ramrom,
   Any further suggestions?  Thanks.
 
Randall-B (Author) Commented:
It appears that no further suggestions are available, so I am accepting the first response, which did make some improvements as sought in my question.  Thanks.
 
ramrom (consultant) Commented:
Thanks. This was getting over my head.
Question has a verified solution.
