Link to home
Start Free TrialLog in
Avatar of bsharath
bsharathFlag for India

asked on

Detect a websites language

Hi,

Detect a websites language

I have a excel with 100's of website names and any way to check each and find if the website is English or what other language it is?
Avatar of Tapan Pattanaik
Tapan Pattanaik
Flag of India image

Hi bsharath,

Web Page Language Identification Based on URLs:

http://www.vldb.org/pvldb/1/1453880.pdf
Avatar of bsharath

ASKER

Any straight answer please?
Hi bsharath,

Here is tool developed by microsoft . Type/copy the characther (eg. chinse or japanese ) from the left side of box and click translate. it will automatically show which language it is in Translate from drop down. Please let me know if you have any issue.

http://www.bing.com/translator/?ref=SALL&br=ro&mkt=en-US
Thanks but as mentioned i want to input website names and need the results on what language are those sites
Avatar of GrahamSkan
No other volunteers, so here is my tuppence-worth
In principle, this could be done using a WebBrowser object to navigate to each URL, extracting the  text and placing it in a blank Word document.

I can get the Word part working, but not the browser part. For some reason, it never finishes loading the web page, so cannot go on to test the text. In case the problem is an obscure setting on my system, here is the code anyway. It is in an Excel User form. References are set to the Microsoft Word library, the HTML object library and the Internet controls library
 
Option Explicit
Dim WithEvents Web1 As InternetExplorer
Dim wdApp As Word.Application


Private Sub CommandButton1_Click()
    Dim strHTML As String
    Dim strURL As String
    
    Set wdApp = CreateObject("Word.Application")
    wdApp.Visible = True
    Set Web1 = CreateObject("InternetExplorer.Application")
    Web1.Visible = True
    
    strURL = "http://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Accueil_principal"
    Web1.navigate strURL
    
End Sub


Function GetLanguage(strText As String) As String
    Dim doc As Word.Document
    wdApp.CheckLanguage = True
    
    Set doc = Documents.Add
    doc.Range.Paste
    Select Case doc.Range.LanguageID
        Case 2057
            GetLanguage = "English"
        Case 1036
            GetLanguage = "French"
        'Case 1033 ...
    End Select
    doc.Close wdDoNotSaveChanges
End Function

Private Sub Web1_DocumentComplete(ByVal pDisp As Object, URL As Variant)
   WriteLog "DocumentComplete"
   If (pDisp Is Web1) Then
      MsgBox GetLanguage(Web1.Document.body.innerText)
   End If
End Sub

Private Sub Web1_StatusTextChange(ByVal Text As String)
    WriteLog "StatusTextChange: " & Text
End Sub

Sub WriteLog(ByVal Text As String)
    Dim f As Integer
    Dim strFileName As String
    strFileName = "web" & Format$(Now, "MMy") & ".log"
    Text = Format$(Now, "HH:nn:ss") & " " & Text
    Debug.Print Text
    f = FreeFile
    Open ActiveWorkbook.Path & "\" & strFileName For Append As #f
        Print #f, Text
    Close #f
End Sub

Open in new window

Thanks
is this a Word macro?

I tried as word macro and this line shows red
Dim WithEvents Web1 As InternetExplorer

Should i have the website names in word?
Thanks for the reply

I have a website and a lot of activity from China who place all sorts of links for Nike/Clothes etc

Rather than me checking 100's of links i have

I have extracted them from my DB and if i find a NON English language URl then i can manually check it to find spam content

its basically help you make the job easier and i am not involved in any hacking :-)
The code is hosted in an Excel Userform. You also need to set the references that I mentioned in my last comment.

It only tests one site at the moment. Walking through a column of URLs on an Excel Sheet is the easy bit.
I added the refrence and inserted a new user form and when i run the macro i get this

---------------------------
Microsoft Visual Basic
---------------------------
Compile error:

User-defined type not defined
---------------------------
OK   Help  
---------------------------

Please see attached file
Language.xlsm
You need to set references to the he Microsoft Word library, the Microsoft HTML object library and the Microsoft Internet controls library.

Here is an updated version that works a bit better. It still needs the same references

Private Sub CommandButton1_Click()
    Dim strURL As String
    
    If wdApp Is Nothing Then
        On Error Resume Next
        Set wdApp = GetObject(, "Word.Application")
        On Error GoTo 0
        If wdApp Is Nothing Then
            Set wdApp = CreateObject("Word.Application")
            wdApp.Visible = True
        End If
    End If
    
    Set Web1 = CreateObject("InternetExplorer.Application")
    Web1.Visible = True
    
    strURL = "http://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Accueil_principal"
    Web1.navigate strURL
    
End Sub


Function GetLanguage(strText As String) As String
    Dim doc As Word.Document
    Dim p As Integer
    Dim para As Paragraph
    Dim i As Integer
    Dim L As Integer
    Dim LID As WdLanguageID
    Dim Languages() As WdLanguageID
    Dim Paracounts() As Integer
    Dim iMax As Integer
    Dim iMaxID As WdLanguageID
    L = -1
    wdApp.CheckLanguage = True
    WriteLog "Length: " & Len(strText)
    Set doc = Documents.Add
    doc.Range.Text = strText
    For Each para In doc.Paragraphs
        LID = para.Range.LanguageID
        For i = 0 To L
            If LID = Languages(i) Then
                Paracounts(i) = Paracounts(i) + 1
                Exit For
            End If
        Next i
        If i > L Then
            ReDim Preserve Languages(i)
            ReDim Preserve Paracounts(i)
            Paracounts(i) = 1
            Languages(i) = LID
            L = i
        End If
    Next para
    For i = 0 To L
        If iMax < Paracounts(i) Then
            iMax = Paracounts(i)
            iMaxID = Languages(i)
        End If
    Next i
    WriteLog "Count: " & iMax & ", ID: " & iMaxID
        Select Case iMaxID
            Case 2057
                GetLanguage = "English"
            Case 1036
                GetLanguage = "French"
            'Case 1033 ...
        End Select
    doc.Close wdDoNotSaveChanges
End Function


Private Sub Web1_DocumentComplete(ByVal pDisp As Object, URL As Variant)
   WriteLog "DocumentComplete"
   If (pDisp Is Web1) Then
      MsgBox GetLanguage(Web1.Document.body.innerText)
   End If
End Sub

Private Sub Web1_StatusTextChange(ByVal Text As String)
    WriteLog "StatusTextChange: " & Text
End Sub

Sub WriteLog(ByVal Text As String)
    Dim f As Integer
    Dim strFileName As String
    strFileName = "web" & Format$(Now, "MMy") & ".log"
    Text = Format$(Now, "HH:nn:ss") & " " & Text
    Debug.Print Text
    f = FreeFile
    Open ActiveWorkbook.Path & "\" & strFileName For Append As #f
        Print #f, Text
    Close #f
End Sub

Open in new window

Column "B" i get -1's and 0's

Is that right?
In that version, there is no attempt to read or write from the sheet, unless you have coded it yourself, which I am sure that you able to do.  As I posted it, the code should merely report the result from a single hard-coded URL via a message box.

There still seem to be some timing issues, which I am currently trying to resolve.
This version seems to work every time.

The apparent timing issues have been resolved, but you might think that the performance is not very good. Most non-English sites have a bit of English somewhere, so that the language is returned as 'mixed' (language ID = 999999). For this reason the code examines every paragraph, keeping a count of the languages used. The returned language is the one with the maximum count of paragraphs in that language. This slows the application down.

It now uses a list of sites taken from column1 of Sheet 3 of the current workbook, the results being entered into column 2. Note that Word recognises over 200 languages and variations, so only a few are in the code (see wdlanguageID Enumeration in the Word VBA help.)
Option Explicit
Dim WithEvents Web1 As InternetExplorer
Dim wdApp As Word.Application
Dim r As Integer
Dim sh As Worksheet
Dim bNewWordApp As Boolean


Private Sub CommandButton1_Click()
    Dim strURL As String
    
    If wdApp Is Nothing Then
        On Error Resume Next
        Set wdApp = GetObject(, "Word.Application")
        On Error GoTo 0
        If wdApp Is Nothing Then
            Set wdApp = CreateObject("Word.Application")
            bNewWordApp = True
        End If
    End If
    wdApp.Visible = True
    
    Set Web1 = CreateObject("InternetExplorer.Application")
    Web1.Visible = True
    Set sh = ActiveWorkbook.Worksheets(3)
    r = 2
    strURL = sh.Cells(r, 1).value
    Web1.navigate strURL
    WriteLog "Navigating to " & strURL
    
End Sub


Function GetLanguage(strText As String) As String
    Dim doc As Word.Document
    Dim p As Integer
    Dim para As Paragraph
    Dim i As Integer
    Dim L As Integer
    Dim LID As WdLanguageID
    Dim Languages() As WdLanguageID
    Dim Paracounts() As Integer
    Dim iMax As Integer
    Dim iMaxID As WdLanguageID
    
    WriteLog "GetLanguage. Text length: " & Len(strText)
    L = -1
    wdApp.CheckLanguage = True
    Set doc = Documents.Add
    doc.Range.Text = strText
    doc.Range.CheckSpelling
    doc.Range.Select
    WriteLog "GetLanguage. Paragraph count: " & doc.Paragraphs.Count
    For Each para In doc.Paragraphs
        para.Range.Select 'seems necessary to provoke automatic language detecion
        DoEvents
        LID = para.Range.LanguageID
        For i = 0 To L
            If LID = Languages(i) Then
                Paracounts(i) = Paracounts(i) + 1
                Exit For
            End If
        Next i
        If i > L Then
            ReDim Preserve Languages(i)
            ReDim Preserve Paracounts(i)
            Paracounts(i) = 1
            Languages(i) = LID
            L = i
        End If
    Next para
    For i = 0 To L
        If iMax < Paracounts(i) Then
            iMax = Paracounts(i)
            iMaxID = Languages(i)
        End If
    Next i
    WriteLog "Max language count: " & iMax & ", ID: " & iMaxID
    'End If
    Select Case iMaxID
    'Select Case doc.Range.LanguageID
        Case 2057, 1033, 3081
            GetLanguage = "English"
        Case 1036, 3084
            GetLanguage = "French"
        Case 1034, 3082
            GetLanguage = "Spanish"
        Case Else
            GetLanguage = "Language ID: " & iMaxID & " not yet encoded in this macro"
    End Select
    doc.Close wdDoNotSaveChanges
End Function


Private Sub Web1_DocumentComplete(ByVal pDisp As Object, URL As Variant)
    Dim strURL As String
   WriteLog "DocumentComplete: " & URL
   
   If (pDisp Is Web1) Then
        sh.Cells(r, 2).value = GetLanguage(Web1.Document.body.innerText)
        r = r + 1
        If sh.Cells(r, 1).value <> "" Then
            strURL = sh.Cells(r, 1).value
            WriteLog "Navigating to " & strURL
            Web1.navigate strURL
        Else
            If bNewWordApp Then
                wdApp.Quit
            End If
        End If
   End If
End Sub

Private Sub Web1_StatusTextChange(ByVal Text As String)
    WriteLog "StatusTextChange: " & Text
End Sub

Sub WriteLog(ByVal Text As String)
    Dim f As Integer
    Dim strFileName As String
    strFileName = "web" & Format$(Now, "MMy") & ".log"
    Text = Format$(Now, "HH:nn:ss") & " " & Text
    Debug.Print Text
    f = FreeFile
    Open ActiveWorkbook.Path & "\" & strFileName For Append As #f
        Print #f, Text
    Close #f
End Sub

Open in new window

In excel i create a userform and paste this code?

If you have a working file can you please attach
My file has other things in it.

You need a Userform with one command button with a default name. Copy and paste the code into the Userform's code. Set the references, add your URLs to the sheets, and there you go.
I get system call failed
        sh.Cells(r, 2).Value = GetLanguage(Web1.Document.body.innerText)
That's a new one on me. Does the error message have a number?
First a word blank is opened and then IE and site is loaded and then i get

Run time error '-2147417856 (80010100);: Automation error System Call failed
ASKER CERTIFIED SOLUTION
Avatar of GrahamSkan
GrahamSkan
Flag of United Kingdom of Great Britain and Northern Ireland image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
I got this

---------------------------
Microsoft Visual Basic
---------------------------
Compile error:

Sub or Function not defined
---------------------------
OK   Help  
---------------------------

log has

16:36:06 Navigating to https://www.experts-exchange.com
16:36:06 StatusTextChange: Looking up https://www.experts-exchange.com
16:36:06 StatusTextChange: Connecting to site 64.156.132.150
16:36:07 StatusTextChange: Website found. Waiting for reply...
16:36:20 StatusTextChange: Start downloading from site: https://www.experts-exchange.com/
16:36:20 StatusTextChange: Downloading from site: https://www.experts-exchange.com/
16:36:20 StatusTextChange:
16:36:20 StatusTextChange:
16:36:20 StatusTextChange:
16:36:20 StatusTextChange: Waiting for https://www.experts-exchange.com/...
16:36:20 StatusTextChange: Waiting for https://www.experts-exchange.com/...
16:38:00 StatusTextChange: Done
16:38:04 DocumentComplete: https://www.experts-exchange.com/
17:41:19 Navigating to https://www.experts-exchange.com
17:41:19 StatusTextChange: Looking up https://www.experts-exchange.com
17:41:23 StatusTextChange: Connecting to site 64.156.132.150
17:41:23 StatusTextChange: Website found. Waiting for reply...
17:41:23 StatusTextChange: Start downloading from site: https://www.experts-exchange.com/
17:41:23 StatusTextChange: Downloading from site: https://www.experts-exchange.com/
17:41:24 StatusTextChange:
17:41:24 StatusTextChange:
17:41:24 StatusTextChange:
17:41:24 StatusTextChange: Waiting for https://www.experts-exchange.com/...
17:41:28 StatusTextChange: Waiting for https://www.experts-exchange.com/...
17:41:28 StatusTextChange: Waiting for https://www.experts-exchange.com/...
17:43:09 StatusTextChange: Done
17:43:13 DocumentComplete: https://www.experts-exchange.com/
You guys are mad.... And I mean that in a happy positive way, but this is not the right way to go about checking for spammers.

Sure, there are a few countries out there that are well known, but, spammers can happen from anywhere. Just as legitimate users can happen from anywhere.

What you should be doing is checking IP addresses and checking for multiple occurances against known members (known members being OK). Some free code I have used before is : http://www.nirsoft.net/utils/ipnetinfo.html

That whole messaging of "connecting to site" then "starting download" looks very very suspiscous and not just checking the country of origin.

You indicated that you were "manually" checking, but now you have a bot to log in and download and check - you will want to be very careful because detection and use of "bots" to access sites can be considered hacking or using techniques to avoid legitimite user access.

Be very careful where this thread leads - it will get shut down in a heart beat - already running very close to infringing Terms Of Use. Specifically :
5.2.The Services and Site may not be used by scripts, machines or automated Services without express written consent of the Site ownership.
Which is exactly what you are doing above...

The pity is you are using GrahamSkan to help build such a system, and one could argue for the commercial benefit of another (competitive) site.

Exactly what is your relationship with techyv ?
Hi Mark,

Thanks for the message. I definitely agree with your comments but it so difficult to find such spammers and banning ip ranges and ip's have been done before by me but it does not stop them.

I have a spam module that blocks 90% of such comments and the 10% is what i end up missing few from deleting them while we do manual moderation. 99% is from China and Kenya and 99% are from the same people with 100's of same URL's from Nike etc.

My intentions are clear and no way to hacking etc as i am against the same as well.

Finding a websites PR or finding a websites rank with alexa is also something that you may feel is like hacking as we retrieve similar kind of information from such sites. Or something close.

I just need help with a script that can find if a website is English Yes/NO

As my website is English a NON English link will make no sense in my site and this is one way a helpful way to find if that comment that has this link is a spammers comment.

This will identify a way for me to delete them.

I am no way close to argument with you Experts and would love to hear a better approach if any

I have tried different ways
1.  Banning ip's and a huge range from China
2. Different spam modules
3. Manual Moderation
4. Admin moderation for each comment before approval
5. Admin moderation if a link or email id if found etc

But still some escapes such techniques and get posted
Sorry, been tied up with other things.

What were the messages that you posted before the log text?

If you are reporting a new error, can you say where it occurs, please?
Thanks Graham for getting back
I am getting the error as in
ID: 38734016

Can you please tell me how i can check where it errors
When there is a compile error, the failing code should be highlighted in dark blue.
Attached file
can you tell me if i set it up right?

The references and the button click etc

As for me now nothing changes
bSharath,

You have asked more that 7000 questions on this site. Many of them had VBA macro code as an answer. It is hard to know what difficulty someone with that level of experience can be having.
 
The file that you posted seems OK except that the references to the Microsoft Internet Controls and the Microsoft HTML Object Library are not set.