Sharath B


Tool or Word macro to find identical content in a Word file

Hi,

Is there any tool that can find identical lines or words within 1000+ words of content?

I don't want to check for plagiarism; I want to find identical content within a single Word file. Any help with a macro or an online tool would be appreciated.


Thanks

John Korchok

How many duplicate consecutive words do you need to find?

What do you want to do with the matches information?

ASKER

Maybe 10+.
Just coloring the content that is duplicated within the file would be great.
If 3 words match elsewhere in the file, they can be colored.
Should your range be 3-15 words, 3-20 words, or something else?

Just thought of this one...what if you have a block of words and find a different match of fewer words elsewhere?
Yes, even then, any identical run of 3 to 20 words can be colored.
To do this properly, I need to eliminate punctuation.  Should I make a copy of just the words in a new document or something different?
Attached sample file
Sample.docx
Thank you for the sample document.  Are there any duplicate runs of 3-20 consecutive words in this document?
Yes
The code I wrote identified these duplicate phrases.  Please confirm/correct before I continue.

Original start: 1018        word count: 11             Phrase: callback URL or publicly accessible UI consider using PIN based authorization
Duplicate start: 1321
Original start: 1031        word count: 9              Phrase: or publicly accessible UI consider using PIN based authorization
Duplicate start: 1334
Original start: 1043        word count: 7              Phrase: accessible UI consider using PIN based authorization
Duplicate start: 1346
Original start: 1054        word count: 6              Phrase: UI consider using PIN based authorization
Duplicate start: 1357
Original start: 1066        word count: 4              Phrase: using PIN based authorization
Duplicate start: 1369
Original start: 31          word count: 3              Phrase: legged OAuth flow
Duplicate start: 409
Original start: 68          word count: 3              Phrase: on behalf of
Duplicate start: 198
Original start: 98          word count: 3              Phrase: ll need to
Duplicate start: 245
Original start: 906         word count: 3              Phrase: ll need to
Duplicate start: 218
Original start: 364         word count: 3              Phrase: you to obtain
Duplicate start: 243
Original start: 904         word count: 3              Phrase: will need to
Duplicate start: 1072


I hope it colors the words in the Word file.
1. Your comment didn't respond to my comment.  Please check my work.
2. Coloring will be more challenging, especially if we have lots of duplicates.  There are a limited number of named colors and some of those are unusable.  I'm not limited to the use of those named colors, but then I'm faced with the problem of using visually distinct colors.
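As a rough illustration of the "visually distinct colors" problem, here is a minimal Python sketch that spaces hues evenly around the color wheel instead of relying on a fixed set of named colors. It is not the coloring method used in the macro; the function name `distinct_colors` and the saturation/value defaults are assumptions chosen to give pale, highlight-style backgrounds.

```python
import colorsys

def distinct_colors(count, saturation=0.45, value=1.0):
    """Return `count` RGB tuples (0-255) with evenly spaced hues.

    Hypothetical helper: low saturation keeps text readable on top
    of the color, and equal hue spacing keeps neighbors distinct.
    """
    colors = []
    for i in range(count):
        hue = i / count  # evenly spaced positions in [0, 1)
        r, g, b = colorsys.hsv_to_rgb(hue, saturation, value)
        colors.append((round(r * 255), round(g * 255), round(b * 255)))
    return colors

if __name__ == "__main__":
    for rgb in distinct_colors(6):
        print(rgb)
```

With many duplicate phrases the hues end up close together, so in practice it may still be necessary to cap the number of colors and reuse them.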

================
To help you understand my code's output and do your validation, the original/duplicate start value is the location of the string in a cleaned-up version of the text.  I extracted all the words from the original document and rejoined them with a space delimiter.  This eliminated all punctuation from the original document's text.  This was the underlying text used for searching.
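For anyone following along, the cleaning and searching steps described above can be sketched in Python roughly as follows. This is not the VBA in the attached document, only an illustration under some assumptions: words are runs of letters and digits (so "you'll" becomes "you" and "ll"), matching ignores case, start values are 0-based offsets into the space-joined text, and the names `clean_words` and `find_duplicate_phrases` are made up for the example.

```python
import re
from collections import defaultdict

def clean_words(text):
    """Extract the words only; punctuation is dropped."""
    return re.findall(r"[A-Za-z0-9]+", text)

def find_duplicate_phrases(text, min_words=3, max_words=20):
    """Report every run of min_words..max_words words that occurs more than once.

    Returns (phrase, word_count, starts) tuples, longest phrases first,
    where starts are offsets into the words rejoined with single spaces.
    """
    words = clean_words(text)
    # Offset of each word in the space-joined cleaned text.
    starts, pos = [], 0
    for w in words:
        starts.append(pos)
        pos += len(w) + 1
    results = []
    for n in range(max_words, min_words - 1, -1):
        seen = defaultdict(list)
        for i in range(len(words) - n + 1):
            key = tuple(w.lower() for w in words[i:i + n])
            seen[key].append(i)
        for idxs in seen.values():
            if len(idxs) > 1:
                first = idxs[0]
                phrase = " ".join(words[first:first + n])
                results.append((phrase, n, [starts[i] for i in idxs]))
    return results

if __name__ == "__main__":
    sample = ("Please use the PIN based authorization flow. "
              "If unsure, use the PIN based authorization option.")
    for phrase, count, where in find_duplicate_phrases(sample):
        print(count, where, phrase)
```

Note that this simple version also reports every shorter phrase contained in a longer match, and it does not filter out occurrences that overlap each other.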

MS Word treats its collection of "words" somewhat strangely.  Most of the time, a "word" includes the trailing space character(s).  When there is a punctuation character, that character is a "word".  So, I couldn't rely on or use the words collection to do the processing.

If my code has correctly identified duplicate phrases, then I have to work back through the original document's "words" collection to explore highlighting options.
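One way to picture the "work back" step is to record, for each cleaned word, where it sits in the original text, so a phrase found in the cleaned text can be mapped to a range in the original. The Python sketch below shows the idea only; it does not use Word's object model, and `words_with_spans` and `phrase_span` are hypothetical names.

```python
import re

def words_with_spans(text):
    """Return (word, start, end) for each word, with offsets into the ORIGINAL text."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"[A-Za-z0-9]+", text)]

def phrase_span(spans, first_word, word_count):
    """Character range in the original text covering word_count words
    starting at word index first_word (punctuation in between is included)."""
    start = spans[first_word][1]
    end = spans[first_word + word_count - 1][2]
    return start, end

if __name__ == "__main__":
    original = "Wait, you'll need to log in."
    spans = words_with_spans(original)
    start, end = phrase_span(spans, 2, 3)
    print(repr(original[start:end]))
```

In a macro, the equivalent bookkeeping would have to account for how Word counts its own "words" and ranges, which is the complication described above.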
Now I am clear about what the report shows. Yes, it looks promising, and this is what I need.

The code is in the macro-enabled document I've uploaded.  I found an acceptable coloring method.  You are welcome to change the order of the color constants.
Q_29229331.docm
ASKER CERTIFIED SOLUTION
aikimark
How is your testing going?