Link to home
Start Free TrialLog in
Avatar of SpencerSteel
SpencerSteel

asked on

Converting IMAGE .docs to TEXT for INDEXING in SQL SERVER 7.0

Hello again,

Here a little backgound.

I have inherited a DAO 'CV database application' which I'm trying to bring into the 21st century.

All the CV's are held in a CandidatesCV table as IMAGE objects. The first issue I had was updating these so they are readable by new ADO techniques. I managed to get all the solved via here which has been great - although the live system is still using the older store on CV's whilst I get the rest of the application up to speed and make the cross over.

Another key nightly task is a routine that takes all newly added CV and 'converts' them into plain text, so they may be indexed for searching on. I actually find MS Indexing very good for this and have no desire to change it - however the way for transforming the data is ugly to say the very least. This is how it is done ...

1. Query to get list of new CVs
2. Opens form
3. Binds a "word.object" class text box to the first CV
4. Run the SEND KEYS command (oh yes) to basically go into the document, select all, copy and then set the focus to another normal text box
5. SEND KEYS Paste the data (bluergh)
6. This 'pure text' is inserted into another column (varchar 8000) for indexing
7. Repeat


OK, as I said before I have a better CV storage routine now and have a testbed of about 200 CV's in a NewCV's table (see below for specifiactions)

What I would like to do is extract all the pure text from these IMAGE docs and put them into a NewCVText column for indexing. I am of the understanding you cannot index IMAGE datatypes. I assume this is why the above needs to be done (please confirm)

Ideally rather than a nightly bulk, I would like this to 'pure text extraction' happen when an Admin import ('suck in') a new CV document into the database - which I should be able to do no problem once I understand how it is done for one cv.

So, my table is something like this


TblNewCVs
---------

CandidateID
NewCandidateCV (IMAGE 16) <<< have about 200 cv's in here ready for playing with
NewCandidateText (varchar 8000) <<< is this the best datatype for indexing against ?

--------------------------------

If anyone has any ideas and demo on how to 'convert' my IMAGE/BLOB into pure text for indexing, please let me know. Obviosuly the slickest one wins - although I would still like it to be understandable !

There's 800 up for this one. I can split between helpers.

Thanks

Spencer Steel

Avatar of SpencerSteel
SpencerSteel

ASKER

Increasing points ...
..oh looks like I can't yet !
I think I will add that this is an Access front end with VBA to SQL Server using an ADO connection string.

Sorry I forgot to mention that.

S.S.
ASKER CERTIFIED SOLUTION
Avatar of ispaleny
ispaleny
Flag of Czechia image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
That isn't actually the issue - but thanks I will take that on board.

What I am looking for is an VBA/ADO routine to extract each CV from it's word document IMAGE field, extract the raw ASCII text and place it into a 'indexable' field (text/ntext)

Maybe I've posted in the wrong forum ... thinking about it this is more of a VB question. Eeek.

S.S.
Y
ou must have a primary or unique key on CandidateID.
Goto EM->"RightClick on YourTable"->"Full-Text Index Table"->"Define Full-Text indexing on a Table"
and follow instructions on a wizard.

A full-text index is populated off-line. Do no forget to schedule it.
OK VB - I will look at it.
Again, I am not explaining myself

I know HOW to full index - that isn't the problem - I have been using full indexing for a while but the routine uses the very horrible method for extracting from a word document the text from which to index.

What I want is a better way of converting the IMAGE to the raw text using ADO/VBA. Once the text is in a field from this IMAGE field I am home and dry !

Sorry for confusion

S.S.
Oh..its you ! ..nice to see you again by the way... you helped me last time - so you should roughly know what Im after ... a way to convert that IMAGE field to just plain text for indexing.

Thank you for your help last time - I have made some excellent routines from it.

S.S.
Dim a As Word.Application
 Dim d As Word.Document
 Dim str as String
 Set a = New Word.Application
 
 'Put a loop here. It is very simple extraction only.
 Set d = a.Documents.Open("c:\test.doc", True, True)
 str = d.Sections.First.Range.Text
 MsgBox str
 d.Close SaveChanges:=wdDoNotSaveChanges

 a.Quit
 Set d = Nothing
 Set a = Nothing
A better way to get a good representation of the text is to use Word Automation and take the text out and save to a text column using ADO's stream object similar to the code I posted in the other thread.  If you need to keep the original word document and space is not an issue, than you can save the native format in one column (like you are doing now) and save the text format in another with text data type, where you can have your Full Text Search set up.

Let me know if you need code to save a string (instead of a file) to a text column.

Anthony
Now we have a string, but if I remember, there was a little problem with Access in your last question.

Do you work with MSSQL database or Access now ?
well - i have an Access 2000 frontend ... and yes, I'm not 100 convinced how to get this into the column - here's what I got so far ... but it doesn't like poking the string in there - probaly too long - and it has many spaces - could be good to strip them out. Good stuff though !
------------------
I'm just using a static CandidateID as well (41153!)

Dim a As Word.Application
Dim d As Word.Document
Dim str As String, mySQL As String, myRS As ADODB.Recordset

DoCmd.Hourglass True


Set a = New Word.Application

Set d = a.Documents.Open("c:\temp\41153.doc", True, True)
str = d.Sections.First.Range.Text

mySQL = "INSERT INTO tblNewCVs (NewCandidateText) VALUES '" & str & "' WHERE CandidateID = 41153"

Set myRS = CallADO(mySQL, adOpenForwardOnly, dbPessimistic)

d.Close SaveChanges:=wdDoNotSaveChanges
a.Quit
Set d = Nothing
Set a = Nothing
DoCmd.Hourglass False
This will work for a small string up to 8000/16000 chars.
okies - well, we have broken the back of it - gotta shoot for the weekend now - but you'll get the points - I'll nail it down on Monday

Thanks again - you are very clever !

Take good care

S.S.
Modified procedure of acperkins (not tested)

Private Sub AddImage2Field(ByRef rsf As ADODB.Field, ByRef str As String)
 Dim stm As ADODB.Stream
 Set stm = New ADODB.Stream
 With stm
  .Type = adTypeText
  .Open
  .WriteText str
  rsf.Value = .Read
  .Close
 End With
 Set stm = Nothing
End Sub

Public Sub Cnv()
 Dim con As ADODB.Connection
 Dim rs As ADODB.Recordset
 Dim a As Word.Application
 Dim d As Word.Document
 Dim str As String
 
 ...
 ...
 rs.Open "select * from tblNewCVs", con, adOpenDynamic, adLockOptimistic
 Set a = New Word.Application
 Do While Not rs.EOF
  ...put your image extacting routine here for rs.Fields("NewCandidateCV").Value to "c:\test.doc"
  Set d = a.Documents.Open("c:\test.doc", True, True)
  str = d.Sections.First.Range.Text
  AddImage2Field rs.Fields("NewCandidateText"), str
  rs.Update
  d.Close SaveChanges:=wdDoNotSaveChanges
  rs.MoveNext
 Loop
 rs.Close
 a.Quit
 Set rs = Nothing
 Set d = Nothing
 Set a = Nothing
End Sub

See you next week :)
Hello again my friend !

Took me most of the day to get around to picking this up. Got an error on ...

----------------------

Function CVtoText(ByRef Rst As ADODB.Field, ByRef str As String)

Dim stm As ADODB.Stream
Set stm = New ADODB.Stream
With stm
    .type = adTypeText
    .Open
    .WriteText str
   
rsf.Value = .Read <<< ERROR operation is not allowed in this context

    .Close
End With

-------------------------

I've ?str and it contains the correct string - so, so far so good - it opens up the file from SQL, saves it to temp, opens and reads that into a string ... but fails on the line, which to be honest, I have no idea what it does !

Hope you can help

Many thanks !

S.S.

SpencerSteel,
I am leaving an office now and I will reply after several hours.
OK mate - tomorrow will be fine ! No rush ... have a good evening !

S.S.
There is a typo in your code (it is correct in ispaleny's)
Either this line:
Function CVtoText(ByRef Rst As ADODB.Field, ByRef str As String)
Should be:
Function CVtoText(ByRef Rsf As ADODB.Field, ByRef str As String)

Or this line:
rsf.Value = .Read
Should be:
rst.Value = .Read

Anthony
I am an idiot !

Thanks

S.S.
Changed - still errors ...

Function CVtoText(ByRef Rsf As ADODB.Field, ByRef str As String)

----------------------

Dim stm As ADODB.Stream
Set stm = New ADODB.Stream
With stm

    .type = adTypeText
    .Open
    .WriteText str

    Rsf.Value = .Read <<<<< operation in not allowed in this context

    .Close

End With

Set stm = Nothing

End Function

----------------------

if I put some text into the column (NewCandidateText) that is passed into this function (I use just one test candidate) it *is* being passed correct to the Function ...

?rst.value = "whatever is in that field"

So it's not that !

.ReadText instead of .Read doesn't error - nor does it write text back into the field

Thanks again

S.S.
I am not entirely convinced a Stream object is required here (even for values larger than 8K), but in any case add this line to your current code:

Function CVtoText(ByRef Rsf As ADODB.Field, ByRef str As String)

Dim stm As ADODB.Stream
Set stm = New ADODB.Stream
With stm

   .type = adTypeText
   .Open
   .WriteText str
   .Position = 0   'Add this line
   Rsf.Value = .Read

   .Close

End With

Set stm = Nothing

End Function

Explanation:
If you add:
Debug.Print .EOS
immediately after the line
.WriteText str
you will notice that the value is True.  This tells you that it is at the End Of Stream, hence you must reset it to read (.Position = 0).

Anthony
And you are right, it should be ReadText

Anthony
Taking liberty with ispaleny's code, try this (without using the Stream Object):

rs.Open "select * from tblNewCVs", con, adOpenDynamic, adLockOptimistic
Set a = New Word.Application
Do While Not rs.EOF
 ...put your image extacting routine here for rs.Fields("NewCandidateCV").Value to "c:\test.doc"
 Set d = a.Documents.Open("c:\test.doc", True, True)
 str = d.Sections.First.Range.Text
 'AddImage2Field rs.Fields("NewCandidateText"), str  'Comment this code
 rs.Fields("NewCandidateText").Value = str  'Add this line
 rs.Update
 d.Close SaveChanges:=wdDoNotSaveChanges
 rs.MoveNext
Loop
rs.Close
a.Quit
Set rs = Nothing
Set d = Nothing
Set a = Nothing
End Sub

Anthony
If you would rather use the Stream object, than here is ispaleny's revised code for text:

Private Sub AddImage2Field(ByRef rsf As ADODB.Field, ByRef str As String)
Dim stm As ADODB.Stream
Set stm = New ADODB.Stream
With stm
 .Type = adTypeText
 .Open
 .WriteText str
 .Position = 0           'Add this to reset or 0 bytes will be copied
 rsf.Value = .ReadText   'Change this from .Read or you will get the error "Operation is not allowed in this context" as it contradicts the line .Type = adTypeText
 .Close
End With
Set stm = Nothing
End Sub

Anthony
Hello mate - are you 2 the only people who answer questions on there then ?!

I feel another split coming on ... :)

I'll try all these methods shortly - I can 'read' the code fine just not experience enought to know the best methods to use (yet!)

I guess i'll find out in a moment - but 'srt' ends up a massive string with lots of spaces in, indeally these are all going to want to be stripped out - remember this is just for 'indexing'. What's the biggest field type I can index on ? Ntext ? Or varchar(8000)

Also, here's a biggie, what would we used streaming if your direct method work and what is 'best practice' ?

Cheers for the interest - i'll make sure you are rewarded.

S.S.

...OK, method 1 seems to work - expect the string has lots of spaces in - which is understandable. Looking for a Replace(" ", "") function type thing, but for more than one space (or tab!) ... will keep looking ... bit baffled as to what method to try for the best ...

Will continue ...

S.S.
As this is done really, and I'm just looking for a character manipulation routine - i've opened another question.

I've requested a spilt on this one again - as it's the same 2 guys as the last time - over the 2 question you both deserve equal.

I hope this is fine

S.S.
SpencerSteel,
I was out of an office longer than expected. It seams that everything is solved now.

The replace function to truncate spaces
-----------------------------------------------------
Most generic one is a loop with replacing of 2 spaces by one space with a test of LEN(str) change. But it is too slow for larger gaps. Add the same loop before it replacing 8 chars (or more) by one for speeding up.
Of course it can be done faster with an assembler, but REPLACE is a VB built-in function :)
I would recommend another approach.  Save the text without removing embedded spaces. After all, how are you going to search for words that are now joined, this will lead to false hits.  Besides disk space is cheap, processing (to remove empty spaces) is not. Instead of attempting to index, create a Full Text Search catalog on this field.  This is a very powerful feature in SQL Server as it lets you search on words, phrases, etc. using boolean logic.  Read up on it in BOL and you will see what I mean.

Anthony
Anthony ... I am using Full Text index ... which is what I thought I needed the plain text for ... I didn't think I could index on the BLOB object ?

(agreed about diskspace! just like it to look nice)

S.S.
I guess SpencerSteel wants to unformat the text

FROM

Nuclear tests<newline><many white spaces>i n<3spaces>t h e<3spaces>U. S. A.

TO

Nuclear tests in the U.S.A.


Some kind of an enhanced filter in Full Text Search can help, but if I have all the string in RAM now ...
Right *deep breath*

1. I am using FULL TEXT INDEX on the existing store of CV's - it's great ! It works on a 'plain text' version of the CV as FTI only works (so I've been told) on varchar types - won't run on the BLOBs

2.I've converted (a sample) of the old CV store into our New ADO friendly CV and want the best way to convert a CV to 'plain text' for searching on. The old method worked on a batch update every night. Now, with this streaming and you guys I can create the text column 'on the fly' as the CV is sucked into the system.

3. If however, there is a way of FTIndexing the actual Word documents, please, please, please let me know asap as this will save a ton of work !

Does this clear things up ?

S.S.
A1. I am not sure about MSSQL7, but in MSSQL2K you can.
A2. FTI SPs and user objects in the insert trigger
A3. Yes, it can be done by an enhanged filter in FTI (I heard about PDF) or by an MSOFFICE search database for files.
>>won't run on the BLOBs << Sure you can have a Full Text Search catalog on a Text, NText, Image, etc.

>> If however, there is a way of FTIndexing the actual Word documents, please, please, please let me know asap as this will save a ton of work !<<
See above.

Anthony
See BOL

Microsoft® SQL Server™ 2000 includes filters for these file extensions: .doc, .xls, .ppt, .txt, and .htm.

In the Full-Text Indexing Wizard, select the image column for indexing, and then specify a Binding column to hold the document type.


The sp_fulltext_column stored procedure accepts an argument for the column to contain the document types.
To view the document type, use the sp_help_fulltext_columns stored procedure to return the column name and column ID.

BUT YOU HAVE MSSQL7. I have had it installed for 3 days ... :)
I think this must be a SQL 7.0 thing then - because as I run the Full Index wizard, I do not have the option to index the NewCandidate (image 16) field.

Hence my need to create a text field of the CV.

Guess it's upgrade time.

Good excuse to blow some cash ...

S.S.
>>Sure you can have a Full Text Search catalog on a Text, NText, Image, etc.<<

This is not quite correct.  You cannot have a Full Text Search catalog on an Image column (even in SQL Server 2000).  You can however on a Text column.  So the solution is to place your CV's in Text format into a column of type Text (if you have not already done that).

Anthony
Anthony,

I think we are all talking the same language here

2 columns

NewCandidateCV = Word object = for extracting and viewing in all in prettiness

NewCandidateText = the raw text of the above for Indexing against.

This is how we have been running it for ages. It's just the methods of population have been crap which I am now sorting out - I'm nearly there I think.

Gotta run.

Points tomorrow - promise :)

S.S.
acperkins,
You can have an FTS index on a image column in MSSQL2K,
but you must have also one more char column to store a document type.
You are right, I stand corrected. From MSDN:
<quote>
Formatted text strings, such as Microsoft® Word™ document files or HTML files, cannot be stored in character string or Unicode columns because many of the bytes in these files contain data structures that do not form valid characters. Database applications may still have a need to access this data and apply full-text searches to it. Many sites store this type of data in image columns, because image columns do not require that each byte form a valid character. SQL Server 2000 introduces the ability to perform full-text searches against these types of data stored in image columns. SQL Server 2000 supplies filters that allow it to extract the textual data from Microsoft Office™ files (.doc, .xls, and .ppt files), text files (.txt files), and HTML files (.htm files). When you design the table, in addition to the image column that holds the data, you include a binding column to hold the file extension for the format of data stored in the image column. You can create a full-text index that references both the image column and the binding column to enable full-text searches on the textual information stored in the image column. The SQL Server 2000 full-text search engine uses the file extension information from the binding column to select the proper filter to extract the textual data from the column.
</quote>
Cool.

Has anyone got any idea what my problem was now :) ?

I've gotta set up a stupid 0 point question and split the points ...

Surely a 'allocated x of y to this user' would be better ?

Later !

S.S.
SpencerSteel

You asked to split points between ispaleny and acperkins.
I have reduced the points on this question from 500 to 250 as indicated by your request at Community Support. Please copy the URL and create a new question in this topic area for the other Experts to whom you wish to award points. The title of the question should read "Points for", followed by the Expert's name. In the question itself, you should paste the link to the original question and perhaps a comment stating that the points are for their help with that question. Once you have created the new questions, you can go back to the original, and accept the comment from the Expert for whom you did not create a new question. The Experts will  comment in your new "Points for" question(s), which you then accept and grade to close.
If you have any questions, please don't hesitate to ask.
Thank you.

** Mindphaser - Community Support Moderator **
Clear as mud mate ...

Jesssssssus - any chance of looking into a

>click Accept As Answer Link<

"YOU HAVE ALLOCATED 500 POINTS TO THIS QUESTION. HOW MANY DO YOU WISH TO GIVE FOR THIS ANSWER?" >default is maximum<

Question remains open until all points are allocated.

Seems a lot simplier to little old me.

I'll wade through that later.

Thanks for the instructions.

S.S.
Spencer

I am sorry to put you through that. The engineers are working on the rewrite and we were promised that split will be much easier.

Let's keep the fingers crossed.

acperkins there is a question for you at https://www.experts-exchange.com/questions/20572105/Points-for-acperkins.html

... done ....

** Mindphaser - Community Support Moderator **
Thank you.