Sharath B
asked on
Tool or Word macro to find identical content in a Word file
Hi,
Is there any tool that can find identical lines or words within a document of 1000+ words?
I don't want to check for plagiarism; rather, I want to check for identical content within a Word file. Any help with a macro or an online tool would be appreciated.
Thanks
How many duplicate consecutive words do you need to find?
What do you want to do with the matches information?
ASKER
Maybe 10+
Just coloring the information that's duplicated within the file would be great.
If the same 3 words match elsewhere in the file, they can be colored.
Should your range be 3-15 words, 3-20 words, or something else?
Just thought of this one...what if you have a block of words and find a different match of fewer words elsewhere?
ASKER
Yes, even then it needs to be colored. Any identical run of 3 to 20 words can be colored.
To do this properly, I need to eliminate punctuation. Should I make a copy of just the words in a new document, or do something different?
ASKER
Attached sample file
Sample.docx
Thank you for the sample document. Are there any duplicate matches of 3-20 consecutive words in this document?
ASKER
Yes
The code I wrote identified these duplicate phrases. Please confirm/correct before I continue.
Original start: 1018 word count: 11 Phrase: callback URL or publicly accessible UI consider using PIN based authorization
Duplicate start: 1321
Original start: 1031 word count: 9 Phrase: or publicly accessible UI consider using PIN based authorization
Duplicate start: 1334
Original start: 1043 word count: 7 Phrase: accessible UI consider using PIN based authorization
Duplicate start: 1346
Original start: 1054 word count: 6 Phrase: UI consider using PIN based authorization
Duplicate start: 1357
Original start: 1066 word count: 4 Phrase: using PIN based authorization
Duplicate start: 1369
Original start: 31 word count: 3 Phrase: legged OAuth flow
Duplicate start: 409
Original start: 68 word count: 3 Phrase: on behalf of
Duplicate start: 198
Original start: 98 word count: 3 Phrase: ll need to
Duplicate start: 245
Original start: 906 word count: 3 Phrase: ll need to
Duplicate start: 218
Original start: 364 word count: 3 Phrase: you to obtain
Duplicate start: 243
Original start: 904 word count: 3 Phrase: will need to
Duplicate start: 1072
ASKER
I hope it colors the words in the Word file.
1. Your comment didn't respond to my comment. Please check my work.
2. Coloring will be more challenging, especially if we have lots of duplicates. There are a limited number of named colors and some of those are unusable. I'm not limited to the use of those named colors, but then I'm faced with the problem of using visually distinct colors.
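One common way around the limited palette of named colors is to generate colors by spacing hues evenly around the color wheel. The Python sketch below only illustrates that idea; it is not the method used in the macro, and the function name `distinct_colors` and the saturation and brightness defaults are assumptions chosen for this example.

```python
import colorsys


def distinct_colors(count, saturation=0.45, value=1.0):
    """Return `count` RGB tuples (0-255) whose hues are evenly spaced.

    A low saturation keeps the shades light enough to sit behind black text.
    """
    colors = []
    for i in range(count):
        hue = i / count  # fraction of the way around the color wheel
        r, g, b = colorsys.hsv_to_rgb(hue, saturation, value)
        colors.append((round(r * 255), round(g * 255), round(b * 255)))
    return colors
```

With many duplicates, neighbouring hues become hard to tell apart, so in practice the palette would probably need to be capped and reused.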
================
To help you understand my code's output and do your validation, the original/duplicate start value is the location of the string in a cleaned-up version of the text. I extracted all the words from the original document and rejoined them with a space delimiter. This eliminated all punctuation from the original document's text. This was the underlying text used for searching.
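The cleaning and searching steps described above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the VBA code from the macro: the function names are hypothetical, matching is assumed to be case-insensitive, apostrophes are treated as punctuation (which is consistent with "ll need to" appearing in the report), and the start offsets here are 0-based.

```python
import re


def clean_words(text):
    """Extract the words from the text, dropping all punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)


def find_duplicate_phrases(words, min_len=3, max_len=20):
    """Find runs of min_len..max_len words that occur more than once.

    Returns (original_start, duplicate_start, word_count, phrase) tuples.
    The starts are character offsets into " ".join(words).
    """
    # Character offset of each word in the space-joined string.
    offsets = []
    pos = 0
    for word in words:
        offsets.append(pos)
        pos += len(word) + 1  # +1 for the space delimiter

    results = []
    # Longest phrases first, so the report leads with the biggest matches.
    for n in range(max_len, min_len - 1, -1):
        first_seen = {}
        for i in range(len(words) - n + 1):
            key = tuple(w.lower() for w in words[i:i + n])
            if key not in first_seen:
                first_seen[key] = i
            elif i >= first_seen[key] + n:  # ignore overlapping occurrences
                j = first_seen[key]
                phrase = " ".join(words[j:j + n])
                results.append((offsets[j], offsets[i], n, phrase))
    return results
```

Like the report above, this sketch also lists the shorter phrases contained inside a longer match; filtering those out would be a separate step.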
MS Word treats its collection of "words" somewhat strangely. Most of the time, a "word" includes the trailing space character(s). When there is a punctuation character, that character is a "word". So, I couldn't rely on or use the words collection to do the processing.
If my code has correctly identified duplicate phrases, then I have to work back through the original document's "words" collection to explore highlighting options.
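Working back from a match in the cleaned text to the original document is easier if each word's original position is recorded while the words are extracted. The Python sketch below shows that bookkeeping on a plain string. It is an assumption about how the mapping could be done, not the macro's approach, and in Word the offsets would correspond to ranges in the document rather than string indexes. The names `words_with_spans` and `phrase_span` are hypothetical.

```python
import re


def words_with_spans(text):
    """Return (word, start, end) for each word, with offsets in the ORIGINAL text."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"[A-Za-z0-9]+", text)]


def phrase_span(spans, first_word_index, word_count):
    """Character range in the original text covered by a run of words.

    The range includes any punctuation that sits between the words.
    """
    start = spans[first_word_index][1]
    end = spans[first_word_index + word_count - 1][2]
    return start, end
```

A duplicate found at word index `i` with `n` words can then be highlighted from `phrase_span(spans, i, n)` without searching the original text again.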
ASKER
Now I am clear about what the report was showing. Yes, it looks promising, and this is what I need.
The code is in the macro-enabled document I've uploaded. I found an acceptable coloring method. You are welcome to change the order of the color constants.
Q_29229331.docm
How is your testing going?