simonsabin

Convert text to Sentence Case

I need to be able to convert text to proper sentence case, e.g.
MY NAME IS FRED I AM A NERD. I HAVE NO FRIENDS IN LONDON.
to
My name is Fred I am a nerd. I have no friends in London.

The key factor is identifying the keywords that must be capitalised, e.g. Fred, I and London.

A combination that would work is using Word to convert the case and then running a spell check. However, you cannot programmatically control the spell checker in Word (except for starting it).
Suggestions please.

Loads of points available.

P.S. I have already considered creating a list of keywords, but rejected it because of the time involved.

I will also be working with disease names and non-standard words, so a global spell check and replace may not work.
Vbmaster

How else can the program know that a word is to be capitalized, except by using a list of keywords? Guessing?

ASKER

The key point is that I don't want to build up the keyword list. Word already has it in its dictionary. How can I use it?

I need to be able to spell check and know whether a flagged word is marked as wrong only because it should be capitalised.
Try this

' #VBIDEUtils#************************************************************
' * Programmer Name  : Waty Thierry
' * Web Site         : www.geocities.com/ResearchTriangle/6311/
' * E-Mail           : waty.thierry@usa.net
' * Date             : 29/06/99
' * Time             : 13:54
' **********************************************************************
' * Comments         : Changing Strings to Title Case
' *
' *
' **********************************************************************

Function TitleCaps(InString As String) As String
   ' *** Changing Strings to Title Case
   ' *** Useful for certain type of applications (eg for names and addresses etc),
   ' *** being able to Title Caps (ie capitalise first letter of each word) is achieved
   ' *** with the following function.
   ' *** However, be careful with names such as McDonalds as it will become Mcdonalds.
   ' *** Perhaps it would be a good idea to edit the code to cope with this...

   Dim OutString        As String
   Dim CurrentLetter    As String
   Dim CurrentWord      As String
   Dim TCaps            As String
   Dim StrCount         As Long   ' Long, so strings over 32,767 characters do not overflow

   ' *** Converts [instring] to Title Caps (as best it can!)
   OutString = ""
   If InString = "" Then
      TitleCaps = ""
      Exit Function
   End If

   CurrentWord = ""
   For StrCount = 1 To Len(InString)
      CurrentLetter = Mid(InString, StrCount, 1)
      CurrentWord = CurrentWord + CurrentLetter
      If InStr(" .,/\;:-!?[]()#", CurrentLetter) <> 0 Or _
         StrCount = Len(InString) Then
         TCaps = UCase(Left(CurrentWord, 1)) + _
            LCase(Right(CurrentWord, Len(CurrentWord) - 1))
         OutString = OutString + TCaps
         CurrentWord = ""
      End If
   Next
   TitleCaps = OutString

End Function

Waty,

Not what I am after; I need sentence case, not proper case.

P.S. You can use StrConv(string, vbProperCase)
to do what you have done.
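For reference, a minimal sketch of the StrConv call mentioned above, assuming VB6 or VBA. The procedure name and the sample text are only illustrative. vbProperCase capitalises the first letter of every word, so it gives proper (title) case rather than sentence case.

```vb
Option Explicit

' Illustrative only: DemoProperCase and the sample string are not from the thread.
Public Sub DemoProperCase()
    Dim sName As String
    sName = "FRED BLOGGS"
    ' vbProperCase upper-cases the first letter of each word and
    ' lower-cases the remaining letters, much like TitleCaps above.
    Debug.Print StrConv(sName, vbProperCase)
End Sub
```

Like TitleCaps, this cannot restore mixed-case names such as McDonalds.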
Forget it...

How do you know you're not dealing with a Language such as German, where all Nouns are capitalised?

Or with dutch where we do not capitalise anything, for example the fact that I am dutch or my first language is dutch.

Or with Scottish names like McDonalds where capitals appear in the middle of a word?

Or with some language that you've never seen before where you haven't got a clue whether a word is a name or any other type of word.

Even if the language is English, who's to say that the user is not talking about something that requires the inclusion of foreign language words? Such as "The Dutch word for difficult is moeilijk"

Etc... etc... etc...
In French, month names are not capitalized as they are in English:

January = janvier
February = février
....
In which case the word would not be capitalised. If you run the Word spell checker on your sentence after you have changed it from all capitals to sentence case, the words dutch and moeilijk are both underlined as misspelt. The difference, however, is that the suggestion for dutch is Dutch (i.e. capitalise it), whereas for moeilijk there is either no suggestion or a different word.

I want to automatically replace the words that should be capitalised and leave the rest. But you have no control over the Word spell checker.
P.S. It is all English.

Also, I know that whatever solution is chosen, it won't be perfect.
>you can not programatically control the spell checker
You can retrieve the errors.

I pasted your sentence (converted to lowercase) into Word 97, created a 'Custom' writing style (which checks only Capitalization) in Options, and ran this VBA:
    Dim i As Integer, pr1 As ProofreadingErrors, msg As String
    Set pr1 = Selection.Range.SpellingErrors
    For i = 1 To pr1.Count
        msg = msg & pr1.Item(i).Text & vbCr
    Next
    MsgBox msg
It returned:
fred
i
london

Maybe this can be a start?
Thanks ameba, just the trigger I needed.

Here is the code I have used; any comments are welcome.

  Dim w As Word.Application, d As Word.Document, s As Object, lng As Long
  Dim sugg As Word.SpellingSuggestions, r As Range
  Set w = GetObject("", "Word.Application")
 
  Set d = w.Documents.Add
 
  lng = Len(Text1)
  w.ActiveDocument.Content = Text1
 
  w.ActiveDocument.Content.Case = wdLowerCase
  w.ActiveDocument.Content.Case = wdTitleSentence
 
  For Each s In w.ActiveDocument.SpellingErrors
    Set sugg = s.GetSpellingSuggestions
    If sugg.Count > 0 Then
      If StrComp(sugg(1), s.Text, vbTextCompare) = 0 Then
        s.Text = sugg(1)
      End If
    End If
  Next
  Text1 = Left$(w.ActiveDocument.Content.Text, lng)
 
  d.Close False
  w.Quit
  Set w = Nothing
Sounds good, I'll check it later.
Must go now and vote for President of my country.
(Also, I know that whatever candidate is chosen, it won't be perfect. :)
' I hope it is OK.
' Text1(0) multiline=true, Text1(1)
Option Explicit

Private Sub Form_Load()
    Text1(0).Text = "MY NAME IS FRED I AM A NERD. I HAVE NO FRIENDS IN LONDON, GREAT Britain." _
        & vbCrLf & "Mcdonalds is in NEW york."
End Sub

Private Sub Command1_Click()
    Text1(1).Text = XCase(Text1(0).Text)
End Sub

Public Function XCase(sInput As String) As String
    Dim w As Word.Application, d As Word.Document
    Dim r As Word.Range
    Dim sugg As Word.SpellingSuggestions
   
    Set w = GetObject("", "Word.Application")
    Set d = w.Documents.Add
    d.Range.LanguageID = wdEnglishUK
    d.Content.Text = sInput
   
    d.Content.Case = wdLowerCase
    d.Content.Case = wdTitleSentence
   
    ' Use the document created above rather than whichever document is active
    For Each r In d.SpellingErrors
        Set sugg = r.GetSpellingSuggestions
        If sugg.Count > 0 Then
            If StrComp(sugg(1), r.Text, vbTextCompare) = 0 Then
                r.Text = sugg(1)
            End If
        End If
    Next
 
    ' Word will use vbCr instead of vbCrLf, and append one extra vbCr
    '    reverse this
    XCase = Replace(d.Content.Text, vbCr, vbCrLf)
    If Right$(XCase, 2) = vbCrLf Then
        XCase = Left$(XCase, Len(XCase) - 2)
    End If
   
    d.Close False
    w.Quit
    Set w = Nothing
End Function
Yes, I noticed the extra character appearing when I put the text back into VB.

Just one more note: this will probably be called more than 1 million times. Getting the Word application and creating the document can be done once, but is

d.Content.Text = sInput
     
    d.Content.Case = wdLowerCase
    d.Content.Case = wdTitleSentence
     
    For Each r In w.ActiveDocument.SpellingErrors
        Set sugg = r.GetSpellingSuggestions
        If sugg.Count > 0 Then
            If StrComp(sugg(1), r.Text, vbTextCompare) = 0 Then
                r.Text = sugg(1)
            End If
        End If
    Next
   
    ' Word will use vbCr instead of vbCrLf, and append one extra vbCr
    '    reverse this
    XCase = Replace(w.ActiveDocument.Range.Text, vbCr, vbCrLf)
    If Right$(XCase, 2) = vbCrLf Then
        XCase = Left$(XCase, Len(XCase) - 2)
    End If


the quickest code?
I think you don't need:
    d.Content.Case = wdLowerCase

In Options, define a 'Custom' writing style, tick only the Capitalization check box, and ignore other spelling errors:

    If d Is Nothing Then
        Set w = GetObject("", "Word.Application")
        Set d = w.Documents.Add
       
        d.Range.LanguageID = wdEnglishUK
        w.Languages(wdEnglishUK).SpellingDictionaryType = wdSpelling
        w.Languages(wdEnglishUK).DefaultWritingStyle = "Custom"
        w.ActiveDocument.ActiveWritingStyle(wdEnglishUK) = "Custom"
    End If
------------

It is slow: only a few KB per minute.

1. What is the average size of your records (100-200 characters or many KB)?
2. Can you ignore MixedCase words (is Mcdonalds OK)?
I found that on a second call the conversion to sentence case did not work; that is why I added the conversion to lower case.

Average size about 30-40 words.
If this is a batch job, it will run 20 days (3K/minute, 200MB).

If Word 97 automation is too slow, creating a word list is not a big problem.
You'll need:
; days of the week
; months
; holidays
; planets, stars, zodiac
; computers, programs, and languages
; names from Alice in Wonderland
; famous persons, deities and related adjectives
; misc. unique entities
; nationalities, languages, religions
; places and related adjectives
; some abbreviations

see this collection:
ftp://ftp.ox.ac.uk/pub/wordlists/
dirs: places, names, science, databases
40 million words of text contain about 40,000 distinct words.

Maybe you can pass each distinct word to Word only once, instead of about 1,000 times each
(1 hour instead of 500-1000 hours of processing in MS Word).

Of course, this requires some programming:
Create Table with only one field 'Capword'
Initially, it will contain all words (lowercased), but after getting info from Word, only words with Capitals (e.g. only 10,000 words).
Hm, maybe table needs two fields 'Capword' and 'lcasedword' ...
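A minimal sketch of how such a word list could be applied once it exists, assuming VB6 and a late-bound Scripting.Dictionary. The names mCapWords, LoadCapWords and SentenceCaseWithList are hypothetical, and the three entries are sample data standing in for the 'Capword' / 'lcasedword' table. It treats ".", "!" and "?" as sentence ends and does not call Word at all.

```vb
Option Explicit

' Hypothetical lookup: key = lowercased word, item = correctly cased word.
Private mCapWords As Object

Public Sub LoadCapWords()
    Set mCapWords = CreateObject("Scripting.Dictionary")
    ' Sample data only; in practice load the pairs from the table.
    mCapWords.Add "fred", "Fred"
    mCapWords.Add "i", "I"
    mCapWords.Add "london", "London"
End Sub

Public Function SentenceCaseWithList(ByVal sInput As String) As String
    Dim i As Long, ch As String, sWord As String, sOut As String
    Dim bStartOfSentence As Boolean

    bStartOfSentence = True
    For i = 1 To Len(sInput) + 1
        ' A trailing space acts as a sentinel so the last word is flushed
        If i <= Len(sInput) Then ch = Mid$(sInput, i, 1) Else ch = " "
        If ch Like "[A-Za-z']" Then
            ' Build the current word in lower case
            sWord = sWord & LCase$(ch)
        Else
            If Len(sWord) > 0 Then
                ' Restore capitals for words found in the list
                If mCapWords.Exists(sWord) Then sWord = mCapWords(sWord)
                ' Capitalise the first word of each sentence
                If bStartOfSentence Then
                    sWord = UCase$(Left$(sWord, 1)) & Mid$(sWord, 2)
                    bStartOfSentence = False
                End If
                sOut = sOut & sWord
                sWord = ""
            End If
            ' Copy the separator itself, but not the sentinel
            If i <= Len(sInput) Then sOut = sOut & ch
            If InStr(".!?", ch) > 0 Then bStartOfSentence = True
        End If
    Next
    SentenceCaseWithList = sOut
End Function
```

Call LoadCapWords once, then SentenceCaseWithList for each record. With the sample entries, "MY NAME IS FRED I AM A NERD. I HAVE NO FRIENDS IN LONDON." should come back as "My name is Fred I am a nerd. I have no friends in London."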
I have just found out that the free-format text doesn't need to be converted to sentence case. Only the names do, and that can be done with StrConv.

Anyway, it was interesting. Thanks for the help.

How many points do you want?
ASKER CERTIFIED SOLUTION
ameba
Adjusted points to 500
Honest chap
Wow! Thanks!
Well, it's almost Christmas!!!
:-)
It is for me now. Thanks!