This Python script (http://www.aaronsw.com/2002/diff/diff.py) apparently separates HTML source code into words by splitting the HTML at every space. Then it renders a nice HTML diff.
However, defining words as "anything between spaces" leads to unwanted formatting in the diff. I want to change it to split at punctuation as well as spaces. (Currently it only splits words outside of HTML tags, and that should stay the same.)
Example of the problem caused by breaking only at spaces:
old: This is nice.
new: This is nice, good, and great.
Diff: This is -nice.- +nice, good, and great.+
It shows the word "nice" twice in the diff, because it treats "nice." (including the period) as one word.
Instead, I would like the diff to be:
This is nice-.- +, good, and great.+
(showing that the period was deleted, and a comma was added right after it, etc.; but showing the word "nice" only once).
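To illustrate the desired behaviour, here is a minimal sketch that is independent of diff.py. It assumes a simplified tokenizer (`tokens`, a hypothetical name) that treats every punctuation character as its own word and ignores HTML tags entirely, and it compares the token lists with the standard-library `difflib.SequenceMatcher`. The `-...-` and `+...+` markers mimic the notation above; exactly where they land depends on the matching algorithm, so the output may differ slightly from the hand-written example.

```python
import difflib
import re

def tokens(text):
    # Simplified assumption: a word is a run of word characters (letters,
    # digits, underscore), a run of whitespace, or one punctuation character.
    # HTML tags are NOT handled here.
    return re.findall(r"\w+|\s+|[^\w\s]", text)

def render_diff(old, new):
    a, b = tokens(old), tokens(new)
    out = []
    matcher = difflib.SequenceMatcher(None, a, b)
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        if op == "equal":
            out.append("".join(a[a1:a2]))
            continue
        if a1 != a2:  # tokens removed from the old text
            out.append("-" + "".join(a[a1:a2]) + "-")
        if b1 != b2:  # tokens added in the new text
            out.append("+" + "".join(b[b1:b2]) + "+")
    return "".join(out)

print(render_diff("This is nice.", "This is nice, good, and great."))
```

Because "nice" is compared separately from the punctuation that follows it, it appears only once in the rendered diff.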
I'm not sure which line in the script splits the HTML at spaces. Please help me identify it and show me how to modify the script to split at these characters also (in addition to spaces):
. period
, comma
; semicolon
: colon
' apostrophe
" double quote
! exclamation mark
- hyphen
For example, the desired effect is to treat this sentence as having about 13 words, rather than 7:
This, he said, is "a well-made program!"
That way, the diff would be:
This +is+, he said, -is- "a --well-- +better+ -made program!+!+"
rather than:
-This,- +This is,+ he said, "a --well-made-- +better-made+ -program!"- +program!!"+
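As a rough sketch of the kind of splitting being asked for (this is not code from diff.py; `split_words`, `PUNCTUATION`, and `TOKEN_RE` are hypothetical names), a regular expression can keep each HTML tag whole while turning every listed punctuation mark into its own token. Whitespace runs are kept as tokens too, so joining the tokens reproduces the input.

```python
import re

# Hypothetical names, assumed for illustration only.
PUNCTUATION = ".,;:'\"!-"
_P = re.escape(PUNCTUATION)

TOKEN_RE = re.compile(
    r"<[^>]*>"                 # a complete HTML tag, kept as one token
    + r"|[" + _P + r"]"        # a single punctuation mark from the list
    + r"|\s+"                  # a run of whitespace
    + r"|[^<\s" + _P + r"]+"   # a run of ordinary characters (a word)
    + r"|<"                    # a stray "<" that never closes
)

def split_words(html):
    """Split text into tags, words, punctuation marks, and whitespace runs."""
    return TOKEN_RE.findall(html)

sentence = 'This, he said, is "a well-made program!"'
words = [t for t in split_words(sentence) if not t.isspace()]
print(len(words), words)
print(split_words('<a href="x.html">hi, there</a>'))
```

Punctuation inside a tag (such as the quotes and period in `href="x.html"`) is left alone, which matches the requirement that only text outside of tags is split.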