Solved

Split Strings into Words at Spaces AND Punctuation

Posted on 2007-04-08
Last Modified: 2011-09-20
    This Python script ( http://www.aaronsw.com/2002/diff/diff.py ) apparently separates HTML source code into words by splitting it at every space, then renders a nice HTML diff.
     However, defining a word as "anything between two spaces" leads to unwanted formatting in the diff.  I want to change the script to split at punctuation as well as spaces. (Currently it only splits words outside of HTML tags, and that should stay the same.)

     Example of problems because of breaking only at spaces:
         old:   This is nice.
         new: This is nice, good, and great.

         Diff:  This is -nice.-  +nice, good, and great.+

It shows the word "nice" twice in the diff, because it treats "nice." (including the period) as one word.

    Instead, I would like the diff to be:
                This is nice-.- +, good, and great.+

(showing that the period was deleted, and a comma was added right after it, etc.; but showing the word "nice" only once).

     I'm not sure which line in the script splits the HTML at spaces.  Please help me identify it and show me how to modify the script to split at these characters also (in addition to spaces):
    .       ,      ;          :      '       "         !      -
    period  comma  semicolon  colon  apost.  dblquote  excl.  hyphen

     For example, the desired effect is to treat this sentence as having about 13 words, rather than 5:

     This, he said, is "a well-made program!"

That way, the diff would be:
    This +is+, he said, -is- "a --well-- +better+ -made program!+!+"    
instead of:
    -This,- +This is,+ he said, "a --well-made-- +better-made+ -program!"- +program!!"+
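
As a rough illustration of the tokenization being asked for, here is a standalone sketch. It is not code from diff.py; the names `split_words`, `PUNCTUATION`, and `TOKEN_RE` are assumptions made up for this example, and it ignores HTML tags entirely. It treats each listed punctuation mark as a word of its own:

```python
import re

# Hypothetical helper, not part of diff.py.
# The punctuation characters listed in the question.
PUNCTUATION = ".,;:'\"!-"
_cls = re.escape(PUNCTUATION)

# A token is either a run of characters that are neither whitespace nor
# listed punctuation, or a single listed punctuation character.
TOKEN_RE = re.compile("[^\\s" + _cls + "]+|[" + _cls + "]")

def split_words(text):
    """Split text at whitespace and at each character in PUNCTUATION."""
    return TOKEN_RE.findall(text)

print(split_words('This, he said, is "a well-made program!"'))
```

On the sample sentence this sketch returns 14 items (8 words and 6 punctuation marks), which is in line with the "about 13" estimate.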
Question by:Randall-B
12 Comments
 
LVL 17

Accepted Solution

by:
ramrom earned 250 total points
ID: 18873309
Looks like line 52:
                  elif c in string.whitespace: out.append(cur+c); cur = ''
Create a "punctuation list":
                  pList = string.whitespace + """.,;:'"!-"""
then change line 52 to:
                  elif c in pList: out.append(cur+c); cur = ''
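
One caveat with this change: because the line appends `cur+c`, a punctuation mark stays glued to the word before it, so `nice.` and `nice,` would still compare as two different words. A minimal sketch of a splitter that emits the word and the punctuation mark as separate items follows. It is a simplified stand-in written for illustration, not the actual function from diff.py; `split_html` and `PUNCT` are hypothetical names, and whitespace is kept attached to the preceding word as in the line quoted above.

```python
import string

PUNCT = ".,;:'\"!-"  # the characters listed in the question

def split_html(x):
    """Split HTML into whole tags, words, and single punctuation marks."""
    out, cur, in_tag = [], '', False
    for c in x:
        if in_tag:
            cur += c
            if c == '>':                 # end of tag: emit the tag whole
                out.append(cur)
                cur, in_tag = '', False
        elif c == '<':                   # start of tag: flush pending word
            out.append(cur)
            cur, in_tag = c, True
        elif c in string.whitespace:     # whitespace stays with its word
            out.append(cur + c)
            cur = ''
        elif c in PUNCT:                 # word first, then the mark alone
            out.append(cur)
            out.append(c)
            cur = ''
        else:
            cur += c
    out.append(cur)
    return [tok for tok in out if tok != '']   # drop empty strings

print(split_html('<p align="justify">This is nice.</p>'))
```

With this approach a space that follows a punctuation mark becomes a token by itself, which may or may not matter for how the diff is rendered.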
 

Author Comment

by:Randall-B
ID: 18873598
ramrom,
   Good. For the most part, that seems to be it.  I've been testing it and it seems to have the right effect much of the time, but not in every instance that I would have expected. I don't know if that's just how Python's  TextDiff works, or if there's another line that can be modified in the script. Do you see anything else that would appear to be relevant in the script?
 
LVL 17

Expert Comment

by:ramrom
ID: 18873643
Provide some html from which you don't get the desired result.
 

Author Comment

by:Randall-B
ID: 18873745
OK, here is an actual example of the output: http://216.92.61.99/pythondiff2.htm (I saved the .py script's results to an .htm file and added line numbers for reference).
   Some examples of unexpected results involving punctuation are around lines 6-8, 14-15, 20-21, 36-37, 42, 59, and 116.  One of those involved parentheses (which were not in pList, but the results remained the same when I added them). In other cases, the problems involve characters already on the list, like colons, periods, and commas.
   
 
LVL 17

Expert Comment

by:ramrom
ID: 18873958
I said "Provide some html from which you don't get the desired result."  You gave me the results! I want the input. Please try again.
 

Author Comment

by:Randall-B
ID: 18874010
OK. You can see the Original html here:  http://216.92.61.99/original.htm  and the Revised html here: http://216.92.61.99/revised.htm .   It created the diff results by comparing the Revised to the Original in those 2 inputs. Thanks.

 

Author Comment

by:Randall-B
ID: 18880890
ramrom,
   Were those html inputs suitable for testing?  
Here's a shorter example:

* Input 1 (original):

<p align="justify">This is a great web site. It outshines the competition. The experts are the best.</p>

* Input 2 (revised):

<p align="justify">This is a great web site. It outshines the competition, in my opinion, which is based on experience. The experts are the best.</p>

Thanks.
 
LVL 17

Expert Comment

by:ramrom
ID: 18883887
Thanks. That makes it a LOT easier. We are getting closer. Please post the result you want from those last 2 strings.
 

Author Comment

by:Randall-B
ID: 18884137
I would want:

<p align="justify">This is a great web site. It outshines the competition<s>.</s> <u>, in my opinion, which is based on experience.</u> The experts are the best.</p>

(I had modified the script to make <s> . . . </s> and <u> . . . </u> instead of <del . . .  and <ins . . . )

But currently it is outputting something like this (note the repetition of "the competition" [once stricken with a period, and then added back in with a comma]):

<p align="justify">This is a great web site. It outshines <s>the competition.</s><u>the competition, in my opinion, which is based on experience.</u> The experts are the best.</p>


That kind of thing is happening in various places with different punctuation, such as commas, colons, etc.
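
The repetition can be reproduced outside the script. The sketch below assumes the comparison is done with `difflib.SequenceMatcher` over lists of tokens, which is an assumption about diff.py since its comparison code is not quoted in this thread. `changed_tokens` and `separate` are hypothetical helpers, and `str.split()` is only an approximation of splitting that leaves punctuation attached to the word:

```python
import difflib
import re

def changed_tokens(old_tokens, new_tokens):
    """Return (deleted, inserted) token lists according to SequenceMatcher."""
    deleted, inserted = [], []
    sm = difflib.SequenceMatcher(None, old_tokens, new_tokens)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag in ('replace', 'delete'):
            deleted.extend(old_tokens[i1:i2])
        if tag in ('replace', 'insert'):
            inserted.extend(new_tokens[j1:j2])
    return deleted, inserted

def separate(text):
    """Words and punctuation marks as separate tokens (whitespace dropped)."""
    return re.findall(r"\w+|[^\w\s]", text)

old = "It outshines the competition. The experts are the best."
new = ("It outshines the competition, in my opinion, which is based on "
       "experience. The experts are the best.")

# Punctuation glued to the word: "competition." and "competition," differ.
print(changed_tokens(old.split(), new.split()))

# Punctuation as its own token: "competition" matches on both sides.
print(changed_tokens(separate(old), separate(new)))
```

In the first call the word shows up on both the deleted side (`competition.`) and the inserted side (`competition,`). In the second call nothing is deleted and the inserted run starts at the comma, so the word is not repeated.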

-------------

Here's another example:

* Input 1 (original):

<p align="justify">Which is better? PHP of Perl? This is a very subjective question. You ask this question to a PHP person you will hear PHP is better and if you ask this question to Perl person you will hear Perl is better.</p>

* Input 2 (revised):

<p align="justify">Which is better? PHP or Perl? This is a very subjective question. If you ask a PHP coder, you will hear that PHP is better; but if you ask a Perl person, you will hear that Perl is better.</p>


* Output (which could be improved to eliminate redundant deleting-and-re-adding of words, failure to treat a word and connected punctuation as separate, etc.):

<p align="justify">Which is better? PHP <s>of </s><u>or </u>Perl? This is a very subjective question. <s>You </s><u>If you </u>ask <s>this question to </s>a PHP <s>person </s><u>coder, </u>you <s>will </s><u>will </u>hear <s>PHP is better and </s><u>that PHP is better; but </u>if you ask <s>this question to </s><u>a </u>Perl <s>person you </s><u>person, you </u>will hear <u>that </u>Perl <s>is </s><u>is </u>better.</p>


** Desired output:

<p align="justify">Which is better? PHP <s>of </s><u>or </u>Perl? This is a very subjective question. <s>You </s><u>If you </u>ask <s>this question to </s>a PHP <s>person </s><u>coder, </u>you will hear <u>that </u>PHP is better<u>; </u><s>and </s><u>but </u>if you ask <s>this question to </s>a Perl person<u>, </u>you will hear <u>that </u>Perl is better.</p>
 

Author Comment

by:Randall-B
ID: 18913722
ramrom,
   Any further suggestions?  Thanks.
 

Author Comment

by:Randall-B
ID: 18925957
It appears that no further suggestions are available, so I am accepting the first response, which did make some improvements as sought in my question.  Thanks.
 
LVL 17

Expert Comment

by:ramrom
ID: 18927718
Thanks. This was getting over my head.