Solved

Split Strings into Words at Spaces AND Punctuation

Posted on 2007-04-08
Last Modified: 2011-09-20
    This Python script ( http://www.aaronsw.com/2002/diff/diff.py ) apparently separates HTML source code into words by splitting the HTML at every space. Then it renders a nice HTML diff.
     However, defining a word as "anything between two spaces" leads to unwanted formatting in the diff.  I want to change it to split at punctuation as well as spaces. (Currently it only splits words outside of HTML tags, and that should stay the same.)

     Example of problems because of breaking only at spaces:
         old:   This is nice.
         new: This is nice, good, and great.

         Diff:  This is -nice.-  +nice, good, and great.+

It shows the word "nice" twice in the diff, because it treats "nice." (including the period) as one word.

    Instead, I would like the diff to be:
                This is nice-.- +, good, and great.+

(showing that the period was deleted, and a comma was added right after it, etc.; but showing the word "nice" only once).

     I'm not sure which line in the script splits the HTML at spaces.  Please help me identify it and show me how to modify the script to split at these characters also (in addition to spaces):
    .   period
    ,   comma
    ;   semicolon
    :   colon
    '   apostrophe
    "   double quote
    !   exclamation mark
    -   hyphen

     For example, the desired effect is to treat this sentence as having about 13 words, rather than 5:

     This, he said, is "a well-made program!"

That way, the diff would be:
    This +is+, he said, -is- "a --well-- +better+ -made program!+!+"    
instead of:
    -This,- +This is,+ he said, "a --well-made-- +better-made+ -program!"- +program!!"+
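
As an illustration of the splitting described above, here is a minimal sketch that tokenizes plain text at whitespace and at the listed punctuation. It is not taken from diff.py: the names PUNCT, TOKEN_RE and split_words are hypothetical, whitespace is discarded rather than kept, and HTML tags are not handled.

```python
import re

# Characters the question asks to split on, in addition to whitespace.
PUNCT = ".,;:'\"!-"

# A token is either a run of characters that are neither whitespace nor
# listed punctuation, or a single listed punctuation character.
# (Hypothetical pattern for illustration; not part of diff.py.)
TOKEN_RE = re.compile(r"[^\s%s]+|[%s]" % (re.escape(PUNCT), re.escape(PUNCT)))


def split_words(text):
    """Return words and punctuation marks as separate tokens."""
    return TOKEN_RE.findall(text)


tokens = split_words('This, he said, is "a well-made program!"')
print(tokens)
print(len(tokens))  # 14
```

With this rule the sample sentence yields 14 tokens, which is in line with the "about 13 words" estimate.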
Question by:Randall-B

Accepted Solution

by:
ramrom earned 250 total points
ID: 18873309
Looks like line 52:
                  elif c in string.whitespace: out.append(cur+c); cur = ''
Create a "punctuation list": pList = string.whitespace + """.,;:'"!-""" and then change line 52 to
                  elif c in pList: out.append(cur+c); cur = ''
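
To see what this change does to the word list, the loop can be approximated in isolation. This is a sketch only: split_text is a hypothetical stand-in for the script's function, it ignores HTML tags, and only the `elif c in pList` branch mirrors the line quoted above.

```python
import string

# The suggested set: whitespace plus the punctuation from the question.
pList = string.whitespace + """.,;:'"!-"""


def split_text(text):
    """Hypothetical plain-text stand-in for the script's character loop."""
    out, cur = [], ''
    for c in text:
        if c in pList:
            # The delimiter stays attached to the word before it.
            out.append(cur + c)
            cur = ''
        else:
            cur += c
    out.append(cur)
    return [tok for tok in out if tok != '']


print(split_text("This is nice."))        # ['This ', 'is ', 'nice.']
print(split_text("This is nice, good."))  # ['This ', 'is ', 'nice,', ' ', 'good.']
```

Because the delimiter is appended to the word before it, `nice.` and `nice,` are still distinct tokens, which may explain why a diff can keep repeating a word that sits next to changed punctuation.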

Author Comment

by:Randall-B
ID: 18873598
ramrom,
   Good. For the most part, that seems to be it.  I've been testing it and it seems to have the right effect much of the time, but not in every instance that I would have expected. I don't know if that's just how Python's  TextDiff works, or if there's another line that can be modified in the script. Do you see anything else that would appear to be relevant in the script?

Expert Comment

by:ramrom
ID: 18873643
Provide some html from which you don't get the desired result.

Author Comment

by:Randall-B
ID: 18873745
OK, here is an actual example of the output: http://216.92.61.99/pythondiff2.htm (I saved the .py script's results to an .htm file and added line numbers for reference).
   Some examples of unexpected results involving punctuation are around lines 6-8,  14-15,  20-21,  36-37,  42,  59, and 116.  One of those involved parentheses (which were not on the pList; but the results remained the same when I added them). In other cases, the problems involve things already on the list, like colons, periods, and commas.
   

Expert Comment

by:ramrom
ID: 18873958
I said "Provide some html from which you don't get the desired result."  You gave me the results! I want the input. Please try again.

Author Comment

by:Randall-B
ID: 18874010
OK. You can see the Original html here:  http://216.92.61.99/original.htm  and the Revised html here: http://216.92.61.99/revised.htm .   It created the diff results by comparing the Revised to the Original in those 2 inputs. Thanks.

Author Comment

by:Randall-B
ID: 18880890
ramrom,
   Were those html inputs suitable for testing?  
Here's a shorter example:

* Input 1 (original):

<p align="justify">This is a great web site. It outshines the competition. The experts are the best.</p>

* Input 2 (revised):

<p align="justify">This is a great web site. It outshines the competition, in my opinion, which is based on experience. The experts are the best.</p>

Thanks.

Expert Comment

by:ramrom
ID: 18883887
Thanks. That makes it a LOT easier. We are getting closer. Please post the result you want from those last 2 strings.

Author Comment

by:Randall-B
ID: 18884137
I would want:

<p align="justify">This is a great web site. It outshines the competition<s>.</s> <u>, in my opinion, which is based on experience.</u> The experts are the best.</p>

(I had modified the script to make <s> . . . </s> and <u> . . . </u> instead of <del . . .  and <ins . . . )

But currently it is outputting something like this (note the repetition of "the competition" [once stricken with a period, and then added back in with a comma]):

<p align="justify">This is a great web site. It outshines <s>the competition.</s><u>the competition, in my opinion, which is based on experience.</u> The experts are the best.</p>


That kind of thing is happening in various places with different punctuation, such as commas, colons, etc.
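
One possible way to avoid the repetition shown above is to emit each punctuation mark as its own token, so that the word before it compares equal in both inputs. The sketch below assumes plain text without HTML tags and uses difflib.SequenceMatcher from the standard library; split_text and mark_diff are hypothetical names, and whether the original script would behave identically after a similar change is untested.

```python
import difflib
import string

PUNCT = """.,;:'"!-"""


def split_text(text):
    """Split into tokens; punctuation marks become separate tokens."""
    out, cur = [], ''
    for c in text:
        if c in PUNCT:
            # Emit the pending word and the punctuation mark separately.
            out.append(cur)
            out.append(c)
            cur = ''
        elif c in string.whitespace:
            # Whitespace stays attached to the preceding word.
            out.append(cur + c)
            cur = ''
        else:
            cur += c
    out.append(cur)
    return [tok for tok in out if tok != '']


def mark_diff(old, new):
    """Wrap deleted tokens in <s>...</s> and inserted tokens in <u>...</u>."""
    a, b = split_text(old), split_text(new)
    matcher = difflib.SequenceMatcher(None, a, b)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            parts.append(''.join(a[i1:i2]))
            continue
        if i1 != i2:
            parts.append('<s>' + ''.join(a[i1:i2]) + '</s>')
        if j1 != j2:
            parts.append('<u>' + ''.join(b[j1:j2]) + '</u>')
    return ''.join(parts)


old = "It outshines the competition. The experts are the best."
new = ("It outshines the competition, in my opinion, which is based on "
       "experience. The experts are the best.")
print(mark_diff(old, new))
```

In this sketch "the competition" appears only once in the result, because "competition" is the same token in both inputs. The matcher may place the markers slightly differently from the hand-written desired output (for example, treating the existing period as unchanged and marking only the new clause as inserted).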

-------------

Here's another example:

* Input 1 (original):

<p align="justify">Which is better? PHP of Perl? This is a very subjective question. You ask this question to a PHP person you will hear PHP is better and if you ask this question to Perl person you will hear Perl is better.</p>

* Input 2 (revised):

<p align="justify">Which is better? PHP or Perl? This is a very subjective question. If you ask a PHP coder, you will hear that PHP is better; but if you ask a Perl person, you will hear that Perl is better.</p>


* Output (which could be improved to eliminate redundant deleting-and-re-adding of words, failure to treat a word and its attached punctuation as separate, etc.):

<p align="justify">Which is better? PHP <s>of </s><u>or </u>Perl? This is a very subjective question. <s>You </s><u>If you </u>ask <s>this question to </s>a PHP <s>person </s><u>coder, </u>you <s>will </s><u>will </u>hear <s>PHP is better and </s><u>that PHP is better; but </u>if you ask <s>this question to </s><u>a </u>Perl <s>person you </s><u>person, you </u>will hear <u>that </u>Perl <s>is </s><u>is </u>better.</p>


** Desired output:

<p align="justify">Which is better? PHP <s>of </s><u>or </u>Perl? This is a very subjective question. <s>You </s><u>If you </u>ask <s>this question to </s>a PHP <s>person </s><u>coder, </u>you will hear <u>that</u> PHP is better<u>; </u> <s>and </s><u>but </u>if you ask <s>this question to </s>a Perl person<u>, </u> you will hear <u>that </u>Perl is better.</p>

Author Comment

by:Randall-B
ID: 18913722
ramrom,
   Any further suggestions?  Thanks.

Author Comment

by:Randall-B
ID: 18925957
It appears that no further suggestions are available, so I am accepting the first response, which did make some improvements as sought in my question.  Thanks.

Expert Comment

by:ramrom
ID: 18927718
Thanks. This was getting over my head.