Solved

Detect a websites language

Posted on 2012-12-30
28
431 Views
Last Modified: 2013-01-18
Hi,

Detect a websites language

I have a excel with 100's of website names and any way to check each and find if the website is English or what other language it is?
0
Comment
Question by:bsharath
  • 13
  • 11
  • 2
  • +1
28 Comments
 
LVL 21

Expert Comment

by:Tapan Pattanaik
ID: 38730608
Hi bsharath,

Web Page Language Identification Based on URLs:

http://www.vldb.org/pvldb/1/1453880.pdf
0
 
LVL 11

Author Comment

by:bsharath
ID: 38730636
Any straight answer please?
0
 
LVL 21

Expert Comment

by:Tapan Pattanaik
ID: 38731070
Hi bsharath,

Here is tool developed by microsoft . Type/copy the characther (eg. chinse or japanese ) from the left side of box and click translate. it will automatically show which language it is in Translate from drop down. Please let me know if you have any issue.

http://www.bing.com/translator/?ref=SALL&br=ro&mkt=en-US
0
 
LVL 11

Author Comment

by:bsharath
ID: 38731074
Thanks but as mentioned i want to input website names and need the results on what language are those sites
0
 
LVL 76

Expert Comment

by:GrahamSkan
ID: 38731097
No other volunteers, so here is my tuppence-worth
In principle, this could be done using a WebBrowser object to navigate to each URL, extracting the  text and placing it in a blank Word document.

I can get the Word part working, but not the browser part. For some reason, it never finishes loading the web page, so cannot go on to test the text. In case the problem is an obscure setting on my system, here is the code anyway. It is in an Excel User form. References are set to the Microsoft Word library, the HTML object library and the Internet controls library
 
Option Explicit
Dim WithEvents Web1 As InternetExplorer
Dim wdApp As Word.Application


Private Sub CommandButton1_Click()
    Dim strHTML As String
    Dim strURL As String
    
    Set wdApp = CreateObject("Word.Application")
    wdApp.Visible = True
    Set Web1 = CreateObject("InternetExplorer.Application")
    Web1.Visible = True
    
    strURL = "http://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Accueil_principal"
    Web1.navigate strURL
    
End Sub


Function GetLanguage(strText As String) As String
    Dim doc As Word.Document
    wdApp.CheckLanguage = True
    
    Set doc = Documents.Add
    doc.Range.Paste
    Select Case doc.Range.LanguageID
        Case 2057
            GetLanguage = "English"
        Case 1036
            GetLanguage = "French"
        'Case 1033 ...
    End Select
    doc.Close wdDoNotSaveChanges
End Function

Private Sub Web1_DocumentComplete(ByVal pDisp As Object, URL As Variant)
   WriteLog "DocumentComplete"
   If (pDisp Is Web1) Then
      MsgBox GetLanguage(Web1.Document.body.innerText)
   End If
End Sub

Private Sub Web1_StatusTextChange(ByVal Text As String)
    WriteLog "StatusTextChange: " & Text
End Sub

Sub WriteLog(ByVal Text As String)
    Dim f As Integer
    Dim strFileName As String
    strFileName = "web" & Format$(Now, "MMy") & ".log"
    Text = Format$(Now, "HH:nn:ss") & " " & Text
    Debug.Print Text
    f = FreeFile
    Open ActiveWorkbook.Path & "\" & strFileName For Append As #f
        Print #f, Text
    Close #f
End Sub

Open in new window

0
 
LVL 11

Author Comment

by:bsharath
ID: 38731850
Thanks
is this a Word macro?

I tried as word macro and this line shows red
Dim WithEvents Web1 As InternetExplorer

Should i have the website names in word?
0
 
LVL 11

Author Comment

by:bsharath
ID: 38731967
Thanks for the reply

I have a website and a lot of activity from China who place all sorts of links for Nike/Clothes etc

Rather than me checking 100's of links i have

I have extracted them from my DB and if i find a NON English language URl then i can manually check it to find spam content

its basically help you make the job easier and i am not involved in any hacking :-)
0
 
LVL 76

Expert Comment

by:GrahamSkan
ID: 38732007
The code is hosted in an Excel Userform. You also need to set the references that I mentioned in my last comment.

It only tests one site at the moment. Walking through a column of URLs on an Excel Sheet is the easy bit.
0
 
LVL 11

Author Comment

by:bsharath
ID: 38732012
I added the refrence and inserted a new user form and when i run the macro i get this

---------------------------
Microsoft Visual Basic
---------------------------
Compile error:

User-defined type not defined
---------------------------
OK   Help  
---------------------------

Please see attached file
Language.xlsm
0
 
LVL 76

Expert Comment

by:GrahamSkan
ID: 38732107
You need to set references to the he Microsoft Word library, the Microsoft HTML object library and the Microsoft Internet controls library.

Here is an updated version that works a bit better. It still needs the same references

Private Sub CommandButton1_Click()
    Dim strURL As String
    
    If wdApp Is Nothing Then
        On Error Resume Next
        Set wdApp = GetObject(, "Word.Application")
        On Error GoTo 0
        If wdApp Is Nothing Then
            Set wdApp = CreateObject("Word.Application")
            wdApp.Visible = True
        End If
    End If
    
    Set Web1 = CreateObject("InternetExplorer.Application")
    Web1.Visible = True
    
    strURL = "http://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Accueil_principal"
    Web1.navigate strURL
    
End Sub


Function GetLanguage(strText As String) As String
    Dim doc As Word.Document
    Dim p As Integer
    Dim para As Paragraph
    Dim i As Integer
    Dim L As Integer
    Dim LID As WdLanguageID
    Dim Languages() As WdLanguageID
    Dim Paracounts() As Integer
    Dim iMax As Integer
    Dim iMaxID As WdLanguageID
    L = -1
    wdApp.CheckLanguage = True
    WriteLog "Length: " & Len(strText)
    Set doc = Documents.Add
    doc.Range.Text = strText
    For Each para In doc.Paragraphs
        LID = para.Range.LanguageID
        For i = 0 To L
            If LID = Languages(i) Then
                Paracounts(i) = Paracounts(i) + 1
                Exit For
            End If
        Next i
        If i > L Then
            ReDim Preserve Languages(i)
            ReDim Preserve Paracounts(i)
            Paracounts(i) = 1
            Languages(i) = LID
            L = i
        End If
    Next para
    For i = 0 To L
        If iMax < Paracounts(i) Then
            iMax = Paracounts(i)
            iMaxID = Languages(i)
        End If
    Next i
    WriteLog "Count: " & iMax & ", ID: " & iMaxID
        Select Case iMaxID
            Case 2057
                GetLanguage = "English"
            Case 1036
                GetLanguage = "French"
            'Case 1033 ...
        End Select
    doc.Close wdDoNotSaveChanges
End Function


Private Sub Web1_DocumentComplete(ByVal pDisp As Object, URL As Variant)
   WriteLog "DocumentComplete"
   If (pDisp Is Web1) Then
      MsgBox GetLanguage(Web1.Document.body.innerText)
   End If
End Sub

Private Sub Web1_StatusTextChange(ByVal Text As String)
    WriteLog "StatusTextChange: " & Text
End Sub

Sub WriteLog(ByVal Text As String)
    Dim f As Integer
    Dim strFileName As String
    strFileName = "web" & Format$(Now, "MMy") & ".log"
    Text = Format$(Now, "HH:nn:ss") & " " & Text
    Debug.Print Text
    f = FreeFile
    Open ActiveWorkbook.Path & "\" & strFileName For Append As #f
        Print #f, Text
    Close #f
End Sub

Open in new window

0
 
LVL 11

Author Comment

by:bsharath
ID: 38732133
Column "B" i get -1's and 0's

Is that right?
0
 
LVL 76

Expert Comment

by:GrahamSkan
ID: 38732141
In that version, there is no attempt to read or write from the sheet, unless you have coded it yourself, which I am sure that you able to do.  As I posted it, the code should merely report the result from a single hard-coded URL via a message box.

There still seem to be some timing issues, which I am currently trying to resolve.
0
 
LVL 76

Expert Comment

by:GrahamSkan
ID: 38732250
This version seems to work every time.

The apparent timing issues have been resolved, but you might think that the performance is not very good. Most non-English sites have a bit of English somewhere, so that the language is returned as 'mixed' (language ID = 999999). For this reason the code examines every paragraph, keeping a count of the languages used. The returned language is the one with the maximum count of paragraphs in that language. This slows the application down.

It now uses a list of sites taken from column1 of Sheet 3 of the current workbook, the results being entered into column 2. Note that Word recognises over 200 languages and variations, so only a few are in the code (see wdlanguageID Enumeration in the Word VBA help.)
Option Explicit
Dim WithEvents Web1 As InternetExplorer
Dim wdApp As Word.Application
Dim r As Integer
Dim sh As Worksheet
Dim bNewWordApp As Boolean


Private Sub CommandButton1_Click()
    Dim strURL As String
    
    If wdApp Is Nothing Then
        On Error Resume Next
        Set wdApp = GetObject(, "Word.Application")
        On Error GoTo 0
        If wdApp Is Nothing Then
            Set wdApp = CreateObject("Word.Application")
            bNewWordApp = True
        End If
    End If
    wdApp.Visible = True
    
    Set Web1 = CreateObject("InternetExplorer.Application")
    Web1.Visible = True
    Set sh = ActiveWorkbook.Worksheets(3)
    r = 2
    strURL = sh.Cells(r, 1).value
    Web1.navigate strURL
    WriteLog "Navigating to " & strURL
    
End Sub


Function GetLanguage(strText As String) As String
    Dim doc As Word.Document
    Dim p As Integer
    Dim para As Paragraph
    Dim i As Integer
    Dim L As Integer
    Dim LID As WdLanguageID
    Dim Languages() As WdLanguageID
    Dim Paracounts() As Integer
    Dim iMax As Integer
    Dim iMaxID As WdLanguageID
    
    WriteLog "GetLanguage. Text length: " & Len(strText)
    L = -1
    wdApp.CheckLanguage = True
    Set doc = Documents.Add
    doc.Range.Text = strText
    doc.Range.CheckSpelling
    doc.Range.Select
    WriteLog "GetLanguage. Paragraph count: " & doc.Paragraphs.Count
    For Each para In doc.Paragraphs
        para.Range.Select 'seems necessary to provoke automatic language detecion
        DoEvents
        LID = para.Range.LanguageID
        For i = 0 To L
            If LID = Languages(i) Then
                Paracounts(i) = Paracounts(i) + 1
                Exit For
            End If
        Next i
        If i > L Then
            ReDim Preserve Languages(i)
            ReDim Preserve Paracounts(i)
            Paracounts(i) = 1
            Languages(i) = LID
            L = i
        End If
    Next para
    For i = 0 To L
        If iMax < Paracounts(i) Then
            iMax = Paracounts(i)
            iMaxID = Languages(i)
        End If
    Next i
    WriteLog "Max language count: " & iMax & ", ID: " & iMaxID
    'End If
    Select Case iMaxID
    'Select Case doc.Range.LanguageID
        Case 2057, 1033, 3081
            GetLanguage = "English"
        Case 1036, 3084
            GetLanguage = "French"
        Case 1034, 3082
            GetLanguage = "Spanish"
        Case Else
            GetLanguage = "Language ID: " & iMaxID & " not yet encoded in this macro"
    End Select
    doc.Close wdDoNotSaveChanges
End Function


Private Sub Web1_DocumentComplete(ByVal pDisp As Object, URL As Variant)
    Dim strURL As String
   WriteLog "DocumentComplete: " & URL
   
   If (pDisp Is Web1) Then
        sh.Cells(r, 2).value = GetLanguage(Web1.Document.body.innerText)
        r = r + 1
        If sh.Cells(r, 1).value <> "" Then
            strURL = sh.Cells(r, 1).value
            WriteLog "Navigating to " & strURL
            Web1.navigate strURL
        Else
            If bNewWordApp Then
                wdApp.Quit
            End If
        End If
   End If
End Sub

Private Sub Web1_StatusTextChange(ByVal Text As String)
    WriteLog "StatusTextChange: " & Text
End Sub

Sub WriteLog(ByVal Text As String)
    Dim f As Integer
    Dim strFileName As String
    strFileName = "web" & Format$(Now, "MMy") & ".log"
    Text = Format$(Now, "HH:nn:ss") & " " & Text
    Debug.Print Text
    f = FreeFile
    Open ActiveWorkbook.Path & "\" & strFileName For Append As #f
        Print #f, Text
    Close #f
End Sub

Open in new window

0
Threat Intelligence Starter Resources

Integrating threat intelligence can be challenging, and not all companies are ready. These resources can help you build awareness and prepare for defense.

 
LVL 11

Author Comment

by:bsharath
ID: 38732255
In excel i create a userform and paste this code?

If you have a working file can you please attach
0
 
LVL 76

Expert Comment

by:GrahamSkan
ID: 38732291
My file has other things in it.

You need a Userform with one command button with a default name. Copy and paste the code into the Userform's code. Set the references, add your URLs to the sheets, and there you go.
0
 
LVL 11

Author Comment

by:bsharath
ID: 38732296
I get system call failed
        sh.Cells(r, 2).Value = GetLanguage(Web1.Document.body.innerText)
0
 
LVL 76

Expert Comment

by:GrahamSkan
ID: 38732329
That's a new one on me. Does the error message have a number?
0
 
LVL 11

Author Comment

by:bsharath
ID: 38732357
First a word blank is opened and then IE and site is loaded and then i get

Run time error '-2147417856 (80010100);: Automation error System Call failed
0
 
LVL 76

Accepted Solution

by:
GrahamSkan earned 500 total points
ID: 38732503
The only consistent think that I can find out about this error is that it happens when an object has disappeared for some reason.

You may have noticed some calls to a Writelog procedure. This creates a trace file and puts it into the same folder as the workbook. I have added a few more calls to it in this Sub, so I would be grateful if you could replace the current procedure (Web1_DocumentComplete) with this version, run the code, and send the log to me when it aborts. Today the file is called 'web12366.log'
Private Sub Web1_DocumentComplete(ByVal pDisp As Object, URL As Variant)
    Dim strURL As String
   WriteLog "DocumentComplete: " & URL
   
   If (pDisp Is Web1) Then
        WriteLog "DocumentComplete. Web1: " & Not (Web1 Is Nothing)
        WriteLog "DocumentComplete. Web1.Document: " & Not (Web1.Document Is Nothing)
        WriteLog "DocumentComplete. Web1.Document.body: " & Not (Web1.Document.body Is Nothing)
        sh.Cells(r, 2).value = GetLanguage(Web1.Document.body.innerText)
        r = r + 1
        If sh.Cells(r, 1).value <> "" Then
            strURL = sh.Cells(r, 1).value
            WriteLog "Navigating to " & strURL
            Web1.navigate strURL
        Else
            If bNewWordApp Then
                wdApp.Quit
            End If
        End If
   End If
End Sub

Open in new window

0
 
LVL 11

Author Comment

by:bsharath
ID: 38734016
I got this

---------------------------
Microsoft Visual Basic
---------------------------
Compile error:

Sub or Function not defined
---------------------------
OK   Help  
---------------------------

log has

16:36:06 Navigating to www.experts-exchange.com
16:36:06 StatusTextChange: Looking up www.experts-exchange.com
16:36:06 StatusTextChange: Connecting to site 64.156.132.150
16:36:07 StatusTextChange: Website found. Waiting for reply...
16:36:20 StatusTextChange: Start downloading from site: http://www.experts-exchange.com/
16:36:20 StatusTextChange: Downloading from site: http://www.experts-exchange.com/
16:36:20 StatusTextChange:
16:36:20 StatusTextChange:
16:36:20 StatusTextChange:
16:36:20 StatusTextChange: Waiting for http://www.experts-exchange.com/...
16:36:20 StatusTextChange: Waiting for http://www.experts-exchange.com/...
16:38:00 StatusTextChange: Done
16:38:04 DocumentComplete: http://www.experts-exchange.com/
17:41:19 Navigating to www.experts-exchange.com
17:41:19 StatusTextChange: Looking up www.experts-exchange.com
17:41:23 StatusTextChange: Connecting to site 64.156.132.150
17:41:23 StatusTextChange: Website found. Waiting for reply...
17:41:23 StatusTextChange: Start downloading from site: http://www.experts-exchange.com/
17:41:23 StatusTextChange: Downloading from site: http://www.experts-exchange.com/
17:41:24 StatusTextChange:
17:41:24 StatusTextChange:
17:41:24 StatusTextChange:
17:41:24 StatusTextChange: Waiting for http://www.experts-exchange.com/...
17:41:28 StatusTextChange: Waiting for http://www.experts-exchange.com/...
17:41:28 StatusTextChange: Waiting for http://www.experts-exchange.com/...
17:43:09 StatusTextChange: Done
17:43:13 DocumentComplete: http://www.experts-exchange.com/
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 38734152
You guys are mad.... And I mean that in a happy positive way, but this is not the right way to go about checking for spammers.

Sure, there are a few countries out there that are well known, but, spammers can happen from anywhere. Just as legitimate users can happen from anywhere.

What you should be doing is checking IP addresses and checking for multiple occurances against known members (known members being OK). Some free code I have used before is : http://www.nirsoft.net/utils/ipnetinfo.html

That whole messaging of "connecting to site" then "starting download" looks very very suspiscous and not just checking the country of origin.

You indicated that you were "manually" checking, but now you have a bot to log in and download and check - you will want to be very careful because detection and use of "bots" to access sites can be considered hacking or using techniques to avoid legitimite user access.

Be very careful where this thread leads - it will get shut down in a heart beat - already running very close to infringing Terms Of Use. Specifically :
5.2.The Services and Site may not be used by scripts, machines or automated Services without express written consent of the Site ownership.
Which is exactly what you are doing above...

The pity is you are using GrahamSkan to help build such a system, and one could argue for the commercial benefit of another (competitive) site.

Exactly what is your relationship with techyv ?
0
 
LVL 11

Author Comment

by:bsharath
ID: 38734426
Hi Mark,

Thanks for the message. I definitely agree with your comments but it so difficult to find such spammers and banning ip ranges and ip's have been done before by me but it does not stop them.

I have a spam module that blocks 90% of such comments and the 10% is what i end up missing few from deleting them while we do manual moderation. 99% is from China and Kenya and 99% are from the same people with 100's of same URL's from Nike etc.

My intentions are clear and no way to hacking etc as i am against the same as well.

Finding a websites PR or finding a websites rank with alexa is also something that you may feel is like hacking as we retrieve similar kind of information from such sites. Or something close.

I just need help with a script that can find if a website is English Yes/NO

As my website is English a NON English link will make no sense in my site and this is one way a helpful way to find if that comment that has this link is a spammers comment.

This will identify a way for me to delete them.

I am no way close to argument with you Experts and would love to hear a better approach if any

I have tried different ways
1.  Banning ip's and a huge range from China
2. Different spam modules
3. Manual Moderation
4. Admin moderation for each comment before approval
5. Admin moderation if a link or email id if found etc

But still some escapes such techniques and get posted
0
 
LVL 76

Expert Comment

by:GrahamSkan
ID: 38740882
Sorry, been tied up with other things.

What were the messages that you posted before the log text?

If you are reporting a new error, can you say where it occurs, please?
0
 
LVL 11

Author Comment

by:bsharath
ID: 38740898
Thanks Graham for getting back
I am getting the error as in
ID: 38734016

Can you please tell me how i can check where it errors
0
 
LVL 76

Expert Comment

by:GrahamSkan
ID: 38741603
When there is a compile error, the failing code should be highlighted in dark blue.
0
 
LVL 11

Author Comment

by:bsharath
ID: 38741777
Attached file
can you tell me if i set it up right?

The references and the button click etc

As for me now nothing changes
0
 
LVL 76

Expert Comment

by:GrahamSkan
ID: 38742137
bSharath,

You have asked more that 7000 questions on this site. Many of them had VBA macro code as an answer. It is hard to know what difficulty someone with that level of experience can be having.
 
The file that you posted seems OK except that the references to the Microsoft Internet Controls and the Microsoft HTML Object Library are not set.
0

Featured Post

How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

Join & Write a Comment

This tutorial explains how to create a series of drop-down lists that are dependent upon prior selections to guide (“force”) the user to make the correct selection and reduce data errors within Microsoft Excel. Excel 2010 was used for this tutorial;…
This article will guide you to convert a grid from a picture into Excel format using Microsoft OneNote and no other 3rd party application.
Graphs within dashboards are meant to be dynamic, representing data from a period of time that will change each time the dashboard is updated with new data. Rather than update each graph to point to a different set within a static set of data, t…
Learn how to make your own table of contents in Microsoft Word using paragraph styles and the automatic table of contents tool. We'll be using the paragraph styles in Word’s Home toolbar to help you create a table of contents. Type out your initial …

757 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

23 Experts available now in Live!

Get 1:1 Help Now