Solved

VBA Word Find page number in a text where a word appears.

Posted on 2014-11-01
Last Modified: 2014-11-10
I have transcribed a ca. 400 page novel into an ordinary Word document. (Actually I have dozens of such novels transcribed.)
     I have a program that searches for occurrences of certain key words in the text. The program then records (a) the word (with punctuation), (b) a few preceding words and (c) a few following words for context. I would also like to record where those keywords appear in the printed book that is the source of the Word document. I have put marks (e.g., |p 2;) in the Word document to show the start of each printed page, but those marks could confuse the computer searching for strings that span more than one page.
     I have therefore written a program to go through the document to note the number of characters between the start of each new page and the top of the document. That program searches for each page mark, notes the page number, deletes the mark, and uses “Selection.Range.Start” to get the distance between the top of the document and the given page number. (Note: It also records the first few words on the given page just to confirm the data are correct, but that is a separate task from the one being described here.) I then have two pieces of information to store for each page of the original text: the page number and its location.
     I would then like to set up some sort of look-up table to locate any word on any given page. For example,
          If page 2 starts at 1,200 characters from the top of the document and
          Page 3 starts at 2,300 characters,

     Then if Selection.Range.Start tells me that a word I have begins at 2,000 characters from the top of the document then I would like the computer to be able to say the word can be found on page 2 of the original text. Its location is greater than 1,200 but less than 2,300 characters.
    I would like to avoid having to go through up to 400 searches for any given location just to find the correct page number.
     Thanks for any easy solution to the problem. If that is impossible, thanks for any moderately difficult solution.
JRA in Priddis
Question by:JohnRobinAllen
30 Comments
 
LVL 76

Expert Comment

by:GrahamSkan
ID: 40417968
Hi JRA,

It sounds as if you are fighting the Word design concept as opposed to using it.

Don't your page markers cause subsequent pages to overflow later and hence change the pagination, or do you work backwards?

Have you tried using the Range.Information property instead?

Sub GetKeyWordPages()
    Dim iPages() As Integer
    Dim p As Integer
    Dim rng As Range
    
    Set rng = ActiveDocument.Range
    With rng.Find
        .Text = "MyKeyword"
        .MatchCase = False
        .MatchWholeWord = True
        Do While .Execute
            ReDim Preserve iPages(p)
            iPages(p) = rng.Information(wdActiveEndPageNumber)
            p = p + 1
        Loop
    End With
    If p > 0 Then
        For p = 0 To UBound(iPages)
            Debug.Print iPages(p)
        Next p
    End If
End Sub

 
LVL 45

Expert Comment

by:aikimark
ID: 40418158
Use manual page breaks (Ctrl+Enter).  This will result in your Word document having the same pages as the original.

Here is a test I ran on a nine page document and the results in the immediate window:
For Each pg In ActiveWindow.Panes(1).Pages
    Debug.Print pg.Breaks(1).PageIndex, pg.Breaks(1).Range.Start
Next

 1             0 
 2             1780 
 3             2859 
 4             3814 
 5             5365 
 6             6302 
 7             7231 
 8             9145 
 9             11683 

 

Author Comment

by:JohnRobinAllen
ID: 40418407
Further details about my problem.
        I cannot put any page information inside the document. In my current text, “Bel-Ami” at the bottom of p. 54 shows “avec des concessions de terre accordées” but “concessions” is hyphenated so that a page marker would show “avec des conces|p 55 sions de terre accordées”. If the program then searched for “concessions” it would not find that occurrence.
        My solution, then, is to put in those page markers on a temporary basis. The start of a word (or portion of a word) that starts a new page is marked as “|p xx ” where xx is the page number in the source book that I’m indexing. I have written a program vaguely similar to Graham Skan’s program to (a) record the distance the first marker is from the top of the text; (b) record the original page number of that marker in the printed source text (in my standard edition of “Bel-Ami” the first page marker is “|p 29” since the text itself begins on page 29, after the title pages and introduction); (c) delete the entire first marker so its size will not affect the next marker; (d) repeat steps a through c with all the remaining markers. (My program also records the first few words after each page marker. I’ll use that later to reset the distances in the event that a minor editing correction changes all the recorded numbers.)
        For the present problem, then, that program has recorded each new page number and its distance from the top of the document. The program will then save that information as a document variable that I can retrieve whenever it is needed. With that variable, I can then, in theory, create a table of distances and corresponding page numbers. What I need is a method to use that table to look up the page number of any word in the document by measuring how far the word is from the top of the document to see that that distance is greater than the start of page x but less than the start of page x + 1.
        Of course I could go through each item in the table until I find the correct page number, but that would take a lot of time when I try to make an index of, say, a couple thousand words. I could do a binary search in the table to locate the appropriate value, but isn’t there some function that could tell me instantly the first value greater than a given value in a table? That would show the first page after the given word, so its page number would be that value minus 1.
     --John robin
 
LVL 45

Expert Comment

by:aikimark
ID: 40418455
I'm not asking you to put a page number in the Word document.  I'm recommending that you break your pages at the same places as the original material.

You could insert user fields that wouldn't be visible or have any user interactions.
 
LVL 76

Expert Comment

by:GrahamSkan
ID: 40418640
I am still struggling to understand the need for creating your own mark-up. Why not use the Range.Information property to record the page numbers at the same time as finding your text?

If it does have to be done in two passes, wouldn't bookmarks be a better option?
 

Author Comment

by:JohnRobinAllen
ID: 40419413
Aikimark’s suggestion that I break the text into pages that match the original means I would have a page break inside of some words. In the example I cited above, the word “concessions” would be split with a page break in the middle, as it is in the original text where the word appears as “conces-” at the bottom of one page and as “sions” at the top of the next page. I would subsequently not be able to see the word in a search for “concessions”.
        Aikimark’s further suggestion that I insert invisible fields is intriguing. If I insert a hidden string inside a word, the word does not appear to change. The string is invisible. However if I search for that word, the computer will not see the word because the computer sees the hidden text even if we do not. When you say I could insert “user fields that wouldn’t be visible or have any user interactions” inside a word, are you talking about something else? Something that a search would not see? I’m interested in the suggestion.
        Graham Skan’s suggestion that I use Range.Information to record page numbers is similarly intriguing. I would dearly love not to create my own mark-up, but how would I tell the computer that a word that appears on page 27 of my word document appears on page 53 in the original text I am indexing? It would be halcyon if you could solve that. Knowing your skill in the black arts of programming, I suspect that you could solve that with your left hand tied behind your back.
        Perhaps your suggestion to use bookmarks is the solution. I could easily insert a bookmark at the start of each new page in the original text, but is there some way that the computer can search for the unknown bookmark that precedes the selected word? If so, and if VBA can tell me the name of that preceding bookmark, then the problem is solved with elegance.
        Meanwhile I, as a mere mortal, have to store the original page numbering information somewhere. If I put it into the text as per Aikimark’s suggestion, that contaminates the text for searches of words or letters surrounding the inserted information (unless you know a way I can put it in the text and make it invisible to the computer when searching for words).
        My proposed solution is to store the information in a table or string variable. If you can think of a way I can store the information in a Range variable, that would be better, but I thought the range variables give information about the current document, not extraneous information that comes from a printed book and that is invisible in the document.
        If the bookmark option will not work, I fear that I will have to use a binary search in a table the same way one uses a binary search to alphabetize words. That would tell me that word x appears after the start of page y and before the start of page z. Before I do the work of programming that and then having the computer make binary searches, I hope that persons whose pay scale is above mine can point out a simpler solution.
        —john robin (allen)
 
LVL 76

Expert Comment

by:GrahamSkan
ID: 40419447
I've only just grasped that you are matching two documents, though I'm still not entirely clear about the mechanics or the objective.

Do you have a digital file that is paginated as the printed novel or is this a manual process?
 
LVL 45

Expert Comment

by:aikimark
ID: 40419504
If you insert hidden text, you would need to do one or two searches.  First, search on the whole word and then search for the word broken up by your hidden text.

Another possibility is to insert hidden text before the hyphen.  The hidden text would be the same as the word text after the hyphen.  That way, you should be able to find the whole word, but the document could still be paginated as necessary.
 
LVL 45

Expert Comment

by:aikimark
ID: 40419508
How many words are hyphenated across page breaks, anyway?
 

Author Comment

by:JohnRobinAllen
ID: 40419755
I’m matching by hand two documents. My objective is to have a Word document with a literary text that is a polished version of a printed book of the same text. One of the features of the polished version is that words that are hyphenated to stretch across two lines (or two pages) are closed up and appear without such added hyphens. Dozens of words are so affected.
        Since a single page in a Word document usually contains more words than a single page in a printed book, the text is formatted to fill each Word document page fully. My version of “Bel-Ami” in Word fills 213 pages, but the original, printed text fills up 387 pages. It would waste a lot of paper if I broke the text in the printed Word document so that it would match the pagination of the printed text.
        Other features in the Word document are that problem vocabulary words, including expressions containing several words, are character formatted as “Vocab” and appropriately highlighted. Names of persons and places mentioned are also marked by different character formatting.
        My code in the Word document can then create, among other things, separate documents of vocabulary lists and other lists that display in context, either alphabetically or in order of appearance, all the marked words or expressions in the Word document, along with a page and paragraph number on that page for each cited occurrence. To generate such lists, the computer must be able to locate every marked expression of from one to six or seven words, even when it may cross page boundaries.
        At the user’s choice, the page and paragraph numbers displayed in such lists will refer either to (a) the Word document or (b) the original printed book. I have done all the work to produce such lists with references to the Word document. References to a printed book would be more useful for most persons, so I’m working on that now. I know the problem can be solved, but I do not know if there is a simpler solution than binary searches in a table, as described.
        Thanks for all your attention to this tricky problem.
 
LVL 45

Expert Comment

by:aikimark
ID: 40419776
I think there is a virtual hyphen available.  I'll double check this.  Been a long time since I used or taught that feature.
 
LVL 45

Expert Comment

by:aikimark
ID: 40419867
Ctrl+- will insert an optional hyphen.

Potentially, you could replace hyphens with these optional hyphen characters that are only visible if a break is required.  In essence, you are telling Word that it is ok to break this word at this point, rather than starting on the next line.
 
LVL 76

Expert Comment

by:GrahamSkan
ID: 40420670
Re your comment: 40419413

You could run through all the bookmarks, checking the Range.End property for the largest value that is not greater than the starting point. The built-in list returns the bookmarks in .Name order. If performance is a problem, then a sorted list might be necessary.

Is this a multistage process where information is to be stored between sessions? If not, then you could probably store the information in variable arrays.
 

Author Comment

by:JohnRobinAllen
ID: 40420860
Your bookmark solution is promising. How can I go through all of the bookmarks in my document? It is late right now here in the Rocky Mountains, but perhaps I can find out how to go through the bookmarks when I get up tomorrow.
    jra
 
LVL 45

Expert Comment

by:aikimark
ID: 40420861
You can go to bookmarks.  In the dialog, you should be able to select a specific bookmark or click Next.

 
LVL 76

Expert Comment

by:GrahamSkan
ID: 40421048
Bookmarks are indexed in name order. Therefore if they are named with page numbers padded out with leading zeros so that they are all the same length, then the previous one in the document will also be the one with the previous index.
You may already have your own but here is a little function to do that:
Function CreateBookMarkName(p As Integer, totalpages As Integer) As String
    CreateBookMarkName = "P" & Format(p, String(Len(CStr(totalpages)), "0"))
End Function


If you need more than one bookmark per page, you could have suffixes (suffices?)  using the same technique.
Function CreateBookMarkNameSfx(p As Integer, totalpages As Integer, Suffix As Integer, MaxSuffixCount As Integer) As String
    CreateBookMarkNameSfx = "P" & Format(p, String(Len(CStr(totalpages)), "0")) & "_" & Format(Suffix, String(Len(CStr(MaxSuffixCount)), "0"))
End Function

 

Author Comment

by:JohnRobinAllen
ID: 40422535
I now have a bookmark at the start of each page. For easier identification, each bookmark name is "zz" plus the page number with preceding zeros where necessary. Since the first page begins on page 29, the bookmarks look like this:
     zz029
     zz030
     zz031
and so forth up to "zz415."
     My problem is when I have a word in the text selected, how can I find the preceding bookmark?
     If I know the first and last page in the printed book, I can go through a loop to locate each bookmark, see how far it is from the top of the document and put that value in a table. Then if I know the location of any given word, I can use a binary search on the table to find the location of the previous bookmark, and its name would give me the page number. That is the system I proposed earlier. The advantage of using bookmarks to hold page location info is that if I find an error in the transcription and correct it, the bookmarks will let me generate a new table of locations.
     I hope, however, that there is some easy way to know the name of a preceding bookmark that begins with the letters "zz". Then I wouldn't need a table or a binary search routine.
     Is the name of a preceding bookmark an "impossible dream"?
 
LVL 76

Expert Comment

by:GrahamSkan
ID: 40422733
If you aren't going from bookmark to bookmark so that decrementing the index directly is not an option, you would have to  search for the position of the previous bookmark. You could use the bookmark ranges directly without having a table. I don't know which would be faster.
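Function PreviousBookmarkIndex() As Integer
    ' Returns the index of the last bookmark that ends at or before the start
    ' of the selection, or 0 if there is none. Relies on the bookmarks being
    ' indexed in document order (zero-padded page-number names).
    Dim rng As Range
    Dim b As Integer
    
    Set rng = Selection.Range
    For b = ActiveDocument.Bookmarks.Count To 1 Step -1
        ' <= so that a selection starting exactly at a bookmark matches it
        If ActiveDocument.Bookmarks(b).Range.End <= rng.Start Then
            PreviousBookmarkIndex = b
            Exit Function
        End If
    Next b
End Function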



By binary search, do you mean a sort of bisecting search? I.e.  start in the middle, compare and go up or down according to the result, then start in the middle of that half and go on until the compare returns an exact match. That would be worth considering if better performance is needed.
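For illustration, here is a minimal sketch of the bisecting search described above. It assumes the bookmark start positions have already been copied into a Long array sorted in ascending order; the name LastIndexAtOrBefore and the demo values (taken from the example in the question, with page 1 assumed to start at 0) are hypothetical. Because a word rarely sits exactly on a page start, the search looks for the last page start at or before the position instead of an exact match.

```vba
' Hypothetical helper: index of the last element of starts() that is <= pos,
' or LBound(starts) - 1 if pos lies before every element.
' Assumes starts() is sorted in ascending order.
Function LastIndexAtOrBefore(starts() As Long, ByVal pos As Long) As Long
    Dim lo As Long, hi As Long, m As Long
    lo = LBound(starts)
    hi = UBound(starts)
    LastIndexAtOrBefore = lo - 1
    Do While lo <= hi
        m = lo + (hi - lo) \ 2          ' midpoint of the remaining slice
        If starts(m) <= pos Then
            LastIndexAtOrBefore = m     ' candidate; keep looking to the right
            lo = m + 1
        Else
            hi = m - 1                  ' too far; look to the left
        End If
    Loop
End Function

Sub DemoBinaryLookup()
    Dim starts(0 To 2) As Long, pages(0 To 2) As Long
    starts(0) = 0: pages(0) = 1         ' assumed start of page 1
    starts(1) = 1200: pages(1) = 2
    starts(2) = 2300: pages(2) = 3
    Debug.Print pages(LastIndexAtOrBefore(starts, 2000))   ' prints 2
End Sub
```

With roughly 400 pages, each lookup needs at most about nine comparisons, against up to 400 for a straight scan.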
 

Author Comment

by:JohnRobinAllen
ID: 40422910
I was afraid you would say that. I'll work on getting a binary search subroutine written (what you describe above), but first I have to polish the text. It may take me a day or two before I have the results, but I'll get them. I have to do it, or else I won't be able to have a true index for words in the program.
     --jra
 
LVL 45

Assisted Solution

by:aikimark
aikimark earned 500 total points
ID: 40423205
This iterates pretty fast.
for bm=1 to activedocument.Bookmarks.count
    debug.print activedocument.Bookmarks(bm).Start, activedocument.Bookmarks(bm).name
next



In addition to the binary search, you might want to try a proportional/interpolation search.  This is usually faster than binary search for regularly spaced values.
http://en.wikipedia.org/wiki/Interpolation_search
 
LVL 45

Expert Comment

by:aikimark
ID: 40423513
We should probably consider a merge algorithm as well.  You gather the words and their positions in one pass, (if necessary) sort the positions for each word, and then iterate the bookmarks once, updating the word-related data.

Note: There are some optimizations that can be applied to the merge process.
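A rough sketch of that merge pass follows, assuming the word positions and the bookmark start positions are both held in Long arrays sorted in ascending order. AssignPages and its parameter names are hypothetical, chosen only for this illustration. Each array is walked once, so the cost grows with the number of words plus the number of pages, not with their product.

```vba
' Hypothetical merge pass: for every word position, record the index of the
' page start that precedes it. Both input arrays must be sorted ascending.
' pageIdx() must be a dynamic array; it is resized to match wordPos().
' An entry of LBound(starts) - 1 means the word lies before the first page start.
Sub AssignPages(wordPos() As Long, starts() As Long, pageIdx() As Long)
    Dim w As Long, p As Long
    ReDim pageIdx(LBound(wordPos) To UBound(wordPos))
    p = LBound(starts) - 1
    For w = LBound(wordPos) To UBound(wordPos)
        ' Advance the page pointer while the next page starts at or before
        ' this word. The bounds test is separate because VBA's And operator
        ' does not short-circuit.
        Do While p < UBound(starts)
            If starts(p + 1) > wordPos(w) Then Exit Do
            p = p + 1
        Loop
        pageIdx(w) = p
    Next w
End Sub

Sub DemoMerge()
    Dim starts(0 To 2) As Long, words(0 To 3) As Long, idx() As Long
    Dim i As Long
    starts(0) = 0: starts(1) = 1200: starts(2) = 2300
    words(0) = 100: words(1) = 1500: words(2) = 2000: words(3) = 2500
    AssignPages words, starts, idx
    For i = 0 To 3
        Debug.Print words(i), idx(i)
    Next i
End Sub
```

The page pointer never moves backwards, which is why the word positions have to be sorted before the pass.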
 

Author Comment

by:JohnRobinAllen
ID: 40426424
I'm grateful for your suggestions and will work on them as quickly as possible. Some other commitments have forced me to leave this problem for a day or two, but I'll get back soon.
     jra
 

Author Comment

by:JohnRobinAllen
ID: 40428928
Graham Skan suggested that the fastest way to search for an item in a sorted list of bookmarks is not necessarily a binary search. That starts with determining whether the item is above or below the very middle of the list. If it’s above we see if it is above or below the middle of the middle of the sub section, and so forth until we locate exactly where the item is.
        That would be a silly way to see where the word “zebra” would fit into an alphabetized word list. Before we begin the search we know it will be nowhere near the middle but will be somewhere near the bottom. Similarly, if I know the location of my cursor inside a document, that can suggest where the nearest bookmark is.
        Here is my plan: My document has a bookmark at the start of each new page, and the bookmark name is “zz” plus the page number with preceding zeros. My text begins on page 29 and runs to page 415, so my bookmark names run from “zz029” to “zz415”. When a user opens up the document, that will trigger a routine to make and fill two integer arrays visible to all routines in the code:
       
        lngLoc() will give the “start” location of each bookmark.
        intPage() will give the corresponding page location of each bookmark.
       
        (See the GrahamSkan function attached.) I’ll write a function that, when called, will display or store the page and paragraph count in the original text of the current cursor position. Filling the two integer arrays gives us the number of pages in the original document, and if we divide the total number of characters in the document by the number of pages, we get the average number of characters in each page. If we integer divide the “start” location of the cursor by the average number of characters on each page, we have a likely location of the nearest lngLoc bookmark location in front of the current cursor location. The size of the difference between the “start” location and the first guess of where the preceding bookmark is will help with any necessary subsequent guess until we have the lngLoc value that precedes our current cursor position. If we then turn on Selection.Extend and jump to the location of that bookmark, we can then count the number of paragraph marks (in Word, a single Chr(13)) to get the number of the current paragraph. The function can then return a string to display something like “Page 53, ¶ 3”.
        I’ll work on writing that today, but if anyone has any suggestions where that could be improved, I’m open to anything that will make this task easier.
          --jra
GS.txt
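The array-filling step in the plan above might look something like the sketch below. It is not the attached code. It assumes the page bookmarks are named "zz" plus a zero-padded page number, as described in this thread, and that they come back in document order. The names pageStart, pageNum, FillPageArrays, PageFromBookmarkName, CountParagraphMarks and ParagraphOnPage are hypothetical (pageStart and pageNum play the roles of lngLoc and intPage), and both arrays are declared As Long so that positions beyond 32,767 do not overflow.

```vba
Public pageStart() As Long   ' start position of each page bookmark
Public pageNum() As Long     ' printed page number for the same index

' "zz029" -> 29. Assumes a two-letter prefix followed by digits only.
Function PageFromBookmarkName(ByVal bmName As String) As Long
    PageFromBookmarkName = CLng(Mid$(bmName, 3))
End Function

' Number of paragraph marks (Chr(13)) in a piece of text.
Function CountParagraphMarks(ByVal txt As String) As Long
    CountParagraphMarks = Len(txt) - Len(Replace(txt, vbCr, ""))
End Function

' Fill both arrays from the "zz" bookmarks of the active document.
Sub FillPageArrays()
    Dim bm As Bookmark
    Dim n As Long
    n = 0
    For Each bm In ActiveDocument.Bookmarks
        If Left$(bm.Name, 2) = "zz" Then
            ReDim Preserve pageStart(n)
            ReDim Preserve pageNum(n)
            pageStart(n) = bm.Start
            pageNum(n) = PageFromBookmarkName(bm.Name)
            n = n + 1
        End If
    Next bm
End Sub

' Paragraph number of position pos, counted from the page start at startPos.
' Convention assumed here: text before the first paragraph mark on the page
' counts as paragraph 1, even if it continues from the previous page.
Function ParagraphOnPage(ByVal startPos As Long, ByVal pos As Long) As Long
    Dim txt As String
    txt = ActiveDocument.Range(Start:=startPos, End:=pos).Text
    ParagraphOnPage = CountParagraphMarks(txt) + 1
End Function
```

Reading the text between the two positions avoids moving the selection, so the user's cursor stays where it was.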
 
LVL 76

Expert Comment

by:GrahamSkan
ID: 40428979
Thank you for the credit, but my point was only that a binary search was faster than stepping through the list.

I think your solution is nearer to aikimark's suggestion of an interpolation search.
 
LVL 45

Accepted Solution

by:
aikimark earned 500 total points
ID: 40429230
Interpolation search example:

Assume the .Start property of the first and last bookmarks:
zz029 = 54321
zz451= 1234567

Assume that all 423 bookmarks and their .Start property values are stored in some array-like data structure.

We find three occurrences of word "concessions" at Start property values (86422, 246800, 997755)

We calculate the first index to start searching, rounding the three results
12 = 423 * ((86422-54321) / (1234567-54321))
69 = 423 * ((246800-54321) / (1234567-54321))
338 = 423 * ((997755-54321) / (1234567-54321))

In some cases, these results get you so close that you only need to search the adjacent locations if the first item isn't a match.

If necessary, you could do another interpolation calculation from your current "page" to the end or beginning of the list.
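The calculation above could be sketched in code as follows. This is an illustration under assumptions, not a definitive implementation: the start positions are taken to be in a Long array sorted in strictly ascending order, the function name InterpolatedIndex is hypothetical, and the first guess scales by the index span (hi - lo) and rounds down, where the worked figures above multiply by the bookmark count and round to nearest. After the guess it simply steps to the adjacent entries until it reaches the last page start at or before the position.

```vba
' Hypothetical helper: index of the last element of starts() that is <= pos,
' found by one proportional guess followed by a short walk to the neighbours.
' Returns LBound(starts) - 1 if pos lies before the first element.
Function InterpolatedIndex(starts() As Long, ByVal pos As Long) As Long
    Dim lo As Long, hi As Long, i As Long
    lo = LBound(starts)
    hi = UBound(starts)
    If pos < starts(lo) Then
        InterpolatedIndex = lo - 1
        Exit Function
    End If
    If pos >= starts(hi) Then
        InterpolatedIndex = hi
        Exit Function
    End If
    ' First guess, proportional to how far pos lies between the two ends.
    ' CDbl avoids Long overflow in the multiplication.
    i = lo + CLng(Int(CDbl(hi - lo) * (pos - starts(lo)) / (starts(hi) - starts(lo))))
    ' Guess too high: step down.
    Do While starts(i) > pos
        i = i - 1
    Loop
    ' Guess too low: step up. starts(hi) > pos guarantees this stops in range.
    Do While starts(i + 1) <= pos
        i = i + 1
    Loop
    InterpolatedIndex = i
End Function
```

When the pages are of similar length the walk is only a step or two; for very uneven pages, the second interpolation suggested above (or a binary search over the remaining slice) would bound the worst case better than this linear walk.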
 

Author Comment

by:JohnRobinAllen
ID: 40429436
I've written (in theory) the code that implements the Skan and Aikimark suggestions. It just needs debugging. When done, if it works, I'll post it.
     Of course when something doesn't work, it's Microsoft's fault. When it does, it's EE's glory;
                  jra
 

Author Comment

by:JohnRobinAllen
ID: 40430611
The code below follows the two major suggestions and solves my problem. If anyone can suggest improvements, I would be grateful. Credit will go to both AiKiMark and GrahamSkan.
     Many thanks!
     john robin (allen)
Three-routines.txt
 

Author Closing Comment

by:JohnRobinAllen
ID: 40430624
Implementation of the AiKiMark and GrahamSkan suggestions is in the Three-routines.txt file I uploaded above. While this will close the question, if anyone finds something that could improve the code, I can revise it and post the code again.
     Solving the problem requires putting bookmarks into the text. I have written code that simplifies that task. If anyone would like me to post that code too, please let me know. However, I do not know if that would require me to put the code into a new question. Since I already have that code, I cannot logically pose it as an unsolved question.
    Thanks to both Graham Skan and AiKiMark, whose names appear in the comment to the uploaded code.
         --john robin (allen)
 

Author Comment

by:JohnRobinAllen
ID: 40432585
The file I loaded previously has two small bugs now corrected in a newer version attached to this message. When working with texts with more than ca. 32,000 characters, two integer values have to be changed to long integers.
     Sorry for my error. I had been testing the code on just the first chapter of a ca. 400 page novel. It crashed when applied to the full novel.
Three-routines.docx
      My apologies
      John Robin (Allen)
 
LVL 45

Expert Comment

by:aikimark
ID: 40432697
best to always use LONG instead of INTEGER data types (rule of thumb)