Solved

Split Strings into Words at Spaces AND Punctuation

Posted on 2007-04-08
Last Modified: 2011-09-20
     This Python script ( http://www.aaronsw.com/2002/diff/diff.py ) apparently separates HTML source code into words by splitting it at every space. Then it renders a nice HTML diff.
     However, defining a word as "anything between two spaces" leads to unwanted formatting in the diff.  I want to change it to split at punctuation as well as at spaces. (Currently it only splits words outside of HTML tags, and that should stay the same.)

     Example of problems because of breaking only at spaces:
         old:   This is nice.
         new: This is nice, good, and great.

         Diff:  This is -nice.-  +nice, good, and great.+

It shows the word "nice" twice in the diff, because it treats "nice." (including the period) as one word.

    Instead, I would like the diff to be:
                This is nice-.- +, good, and great.+

(showing that the period was deleted, and a comma was added right after it, etc.; but showing the word "nice" only once).
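
A minimal sketch of why this happens, assuming the script compares lists of words with Python's difflib.SequenceMatcher. The helper name word_diff and the -...- / +...+ markers are illustrative and not taken from diff.py. When the text is split only at whitespace, "nice." and "nice," are two different tokens, so the matcher cannot line them up:

```python
import difflib

def word_diff(a, b):
    """Diff two token lists; deletions are shown as -...-, insertions as +...+."""
    out = []
    matcher = difflib.SequenceMatcher(None, a, b)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append(" ".join(a[i1:i2]))
        else:
            # "replace" yields both a deleted and an inserted run
            if i2 > i1:
                out.append("-" + " ".join(a[i1:i2]) + "-")
            if j2 > j1:
                out.append("+" + " ".join(b[j1:j2]) + "+")
    return " ".join(out)

# Splitting at whitespace only keeps the punctuation glued to the word.
old = "This is nice.".split()
new = "This is nice, good, and great.".split()

print(word_diff(old, new))
```

Here the whole token "nice." is reported as deleted and "nice," as inserted, which is the repetition described above.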

     I'm not sure which line in the script splits the HTML at spaces.  Please help me identify it and show me how to modify the script to split at these characters also (in addition to spaces):
    .       ,      ;          :      '           "             !            -
    period  comma  semicolon  colon  apostrophe  double quote  exclamation  hyphen

     For example, the desired effect is to treat this sentence as having about 13 words, rather than 7:

     This, he said, is "a well-made program!"

That way, the diff would be:
    This +is+, he said, -is- "a --well-- +better+ -made program!+!+"    
instead of:
    -This,- +This is,+ he said, "a --well-made-- +better-made+ -program!"- +program!!"+
Question by:Randall-B

Accepted Solution

by:
ramrom earned 250 total points
ID: 18873309
Looks like line 52:
    elif c in string.whitespace: out.append(cur+c); cur = ''
Create a "punctuation list":
    pList = string.whitespace + """.,;:'"!-"""
then change line 52 to:
    elif c in pList: out.append(cur+c); cur = ''
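
To see what this change does in isolation, here is a simplified sketch of a character-by-character splitter built around the same out.append(cur+c) pattern. The function name split_after and the constant SPLIT_CHARS are hypothetical, and HTML tag handling is left out, so this is not the script's actual code. Note that each split character stays attached to the token before it:

```python
import string

# Hypothetical constant; same characters as the suggested pList.
SPLIT_CHARS = string.whitespace + """.,;:'"!-"""

def split_after(text, split_chars=SPLIT_CHARS):
    """Split text after every character in split_chars.

    The split character is kept on the end of the preceding token,
    mirroring out.append(cur + c).
    """
    out, cur = [], ""
    for c in text:
        if c in split_chars:
            out.append(cur + c)
            cur = ""
        else:
            cur += c
    out.append(cur)
    # Drop empty tokens (e.g. after a trailing split character).
    return [tok for tok in out if tok != ""]

print(split_after("This is nice."))         # ['This ', 'is ', 'nice.']
print(split_after("a well-made program!"))  # ['a ', 'well-', 'made ', 'program!']
```

Under these assumptions "well-made" is now split in two, but "nice." is still a single token, so it would still differ from "nice," in a comparison.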

Author Comment

by:Randall-B
ID: 18873598
ramrom,
   Good. For the most part, that seems to be it.  I've been testing it and it seems to have the right effect much of the time, but not in every instance that I would have expected. I don't know if that's just how Python's  TextDiff works, or if there's another line that can be modified in the script. Do you see anything else that would appear to be relevant in the script?

Expert Comment

by:ramrom
ID: 18873643
Provide some html from which you don't get the desired result.

Author Comment

by:Randall-B
ID: 18873745
OK, here is an actual example of the output: http://216.92.61.99/pythondiff2.htm (I saved the .py script's results to an .htm file and added line numbers for reference).
   Some examples of unexpected results involving punctuation are around lines 6-8, 14-15, 20-21, 36-37, 42, 59, and 116.  One of those involved parentheses (which were not on the punctuation list, but the results remained the same when I added them). In other cases, the problems involve things already on the list, like colons, periods, and commas.

Expert Comment

by:ramrom
ID: 18873958
I said "Provide some html from which you don't get the desired result."  You gave me the results! I want the input. Please try again.

Author Comment

by:Randall-B
ID: 18874010
OK. You can see the Original html here:  http://216.92.61.99/original.htm  and the Revised html here: http://216.92.61.99/revised.htm .   It created the diff results by comparing the Revised to the Original in those 2 inputs. Thanks.

Author Comment

by:Randall-B
ID: 18880890
ramrom,
   Were those html inputs suitable for testing?  
Here's a shorter example:

* Input 1 (original):

<p align="justify">This is a great web site. It outshines the competition. The experts are the best.</p>

* Input 2 (revised):

<p align="justify">This is a great web site. It outshines the competition, in my opinion, which is based on experience. The experts are the best.</p>

Thanks.

Expert Comment

by:ramrom
ID: 18883887
Thanks. That makes it a LOT easier. We are getting closer. Please post the result you want from those last 2 strings.

Author Comment

by:Randall-B
ID: 18884137
I would want:

<p align="justify">This is a great web site. It outshines the competition<s>.</s> <u>, in my opinion, which is based on experience.</u> The experts are the best.</p>

(I had modified the script to make <s> . . . </s> and <u> . . . </u> instead of <del . . .  and <ins . . . )

But currently it is outputting something like this (note the repetition of "the competition" [once stricken with a period, and then added back in with a comma]):

<p align="justify">This is a great web site. It outshines <s>the competition.</s><u>the competition, in my opinion, which is based on experience.</u> The experts are the best.</p>


That kind of thing is happening in various places with different punctuation, such as commas, colons, etc.
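
One possible direction, sketched under assumptions rather than tested against diff.py: emit each punctuation character as its own token instead of appending it to the preceding word, so that "competition" matches on both sides regardless of what follows it. The names tokenize and html_diff are hypothetical, the <s>/<u> markers follow the modification mentioned above, and the comparison assumes difflib.SequenceMatcher:

```python
import difflib
import string

PUNCT = """.,;:'"!-"""  # characters from the question; extend as needed

def tokenize(html):
    """Split into whole tags, words (with trailing whitespace),
    and single punctuation characters."""
    out, cur, in_tag = [], "", False
    for c in html:
        if in_tag:
            cur += c
            if c == ">":
                out.append(cur)
                cur, in_tag = "", False
        elif c == "<":
            out.append(cur)
            cur, in_tag = c, True
        elif c in string.whitespace:
            out.append(cur + c)
            cur = ""
        elif c in PUNCT:
            out.append(cur)  # the word collected so far
            out.append(c)    # the punctuation mark as its own token
            cur = ""
        else:
            cur += c
    out.append(cur)
    return [tok for tok in out if tok]

def html_diff(old, new):
    """Mark deleted runs with <s>...</s> and inserted runs with <u>...</u>."""
    a, b = tokenize(old), tokenize(new)
    out = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if tag == "equal":
            out.append("".join(a[i1:i2]))
        else:
            if i2 > i1:
                out.append("<s>" + "".join(a[i1:i2]) + "</s>")
            if j2 > j1:
                out.append("<u>" + "".join(b[j1:j2]) + "</u>")
    return "".join(out)

old = ('<p align="justify">This is a great web site. It outshines '
       'the competition. The experts are the best.</p>')
new = ('<p align="justify">This is a great web site. It outshines '
       'the competition, in my opinion, which is based on experience. '
       'The experts are the best.</p>')

print(html_diff(old, new))
```

With the two inputs above, this sketch marks only ", in my opinion, which is based on experience" as inserted; "the competition" appears once and the following period is matched rather than deleted and re-added. Punctuation inside tags is left alone because a tag is kept as one token.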

-------------

Here's another example:

* Input 1 (original):

<p align="justify">Which is better? PHP of Perl? This is a very subjective question. You ask this question to a PHP person you will hear PHP is better and if you ask this question to Perl person you will hear Perl is better.</p>

* Input 2 (revised):

<p align="justify">Which is better? PHP or Perl? This is a very subjective question. If you ask a PHP coder, you will hear that PHP is better; but if you ask a Perl person, you will hear that Perl is better.</p>


* Output (which could be improved to eliminate redundant deleting-and-re-adding of words, failure to treat a word and connected punctuation as separate, etc.):

<p align="justify">Which is better? PHP <s>of </s><u>or </u>Perl? This is a very subjective question. <s>You </s><u>If you </u>ask <s>this question to </s>a PHP <s>person </s><u>coder, </u>you <s>will </s><u>will </u>hear <s>PHP is better and </s><u>that PHP is better; but </u>if you ask <s>this question to </s><u>a </u>Perl <s>person you </s><u>person, you </u>will hear <u>that </u>Perl <s>is </s><u>is </u>better.</p>


** Desired output:

<p align="justify">Which is better? PHP <s>of </s><u>or </u>Perl? This is a very subjective question. <s>You </s><u>If you </u>ask <s>this question to </s>a PHP <s>person </s><u>coder, </u>you will hear <u>that </u>PHP is better<u>; </u><s>and </s><u>but </u>if you ask <s>this question to </s>a Perl person<u>, </u>you will hear <u>that </u>Perl is better.</p>

Author Comment

by:Randall-B
ID: 18913722
ramrom,
   Any further suggestions?  Thanks.

Author Comment

by:Randall-B
ID: 18925957
It appears that no further suggestions are available, so I am accepting the first response, which did make some improvements as sought in my question.  Thanks.

Expert Comment

by:ramrom
ID: 18927718
Thanks. This was getting over my head.