  • Status: Solved

jQuery: How to count words on a page.

Dear Experts,

1. The HTML code below should count the number of words in the body of a page.

2. The screenshot shows that if I remove all of the whitespace from the HTML body, words in adjacent elements are combined (e.g. "Herearesomedivtagsrow1col1row1col2row2col1row2col2").

3. How could I correct this?


<!DOCTYPE html>
<html>
<head>
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js">
</script>
<script>
$(document).ready(function(){
        
        var numberOfMatches = $("body").text().match(/\w+/ig).length;
        console.log(numberOfMatches);
        
        var bodyText = $("body").text().match(/\w+/ig);
        console.log(bodyText);
        
});
</script>
</head>
<body>
<h1>Here is a heading.</h1><p>This is a paragraph.</p><p>This is another paragraph.</p><ul><li>Here is a bullet.</li><li>Here is another bullet.</li><li>Here is the last bullet.</li></ul><div>Here</div><div>are</div><div>some</div><div>div</div><div>tags</div><table width="200" border="0" cellspacing="0" cellpadding="0"><tr><td>row1col1</td><td>row1col2</td></tr><tr><td>row2col1</td><td>row2col2</td></tr></table>
</body>
</html>


Notice - Herearesomedivtagsrow1col1row1col2row2col1row2col2
Asked by: AdrianSmithUK
 
Big Monty (Senior Web Developer / CEO of ExchangeTree.org) Commented:
 
Steve Krile Commented:
This seemed to do the trick for me:

        // Replace every HTML tag with a space. This guarantees whitespace between
        // the contents of adjacent elements; the extra spaces are ignored by match().
        var bodyHTML = $("body").html().replace(/<(.|\n)+?>/ig, " ");

        // match() returns null when nothing matches, so fall back to an empty array.
        var bodyText = bodyHTML.match(/\w+/ig) || [];

        console.log(bodyText);
        console.log(bodyText.length);


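The key is to remember that jQuery's .text() function ignores all HTML tags and joins the text of every element inside BODY into one string, without adding any separator between elements. Words in adjacent elements therefore run together unless the markup itself contains whitespace. Instead, use the .html() function and then a regex replace to swap every HTML tag for a space.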
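
The same idea can be tried without a browser. The following is only a minimal sketch: `countWords` is a hypothetical helper name (not from the thread), and the `sample` string stands in for whatever `$("body").html()` would return.

```javascript
// Hypothetical helper (not from the thread): counts words in an HTML string.
function countWords(html) {
    // Replace each tag with a space so words in adjacent elements stay apart.
    var text = html.replace(/<(.|\n)+?>/ig, " ");
    // match() returns null when there are no matches, so fall back to [].
    var words = text.match(/\w+/g) || [];
    return words.length;
}

// Stand-in for the markup of a page body (an assumption for this demo).
var sample = "<div>Here</div><div>are</div><table><tr><td>row1col1</td><td>row1col2</td></tr></table>";

console.log(countWords(sample));    // 4
console.log(countWords("<p></p>")); // 0
```

One caveat with this approach: because it works on raw markup, anything that is not inside a tag is counted, so an entity such as `&amp;` contributes the word "amp", and inline script or style text inside the body would be counted too.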
 
AdrianSmithUK (Author) Commented:
Many thanks chaps.
Kind Regards,
Adrian
 
AdrianSmithUK (Author) Commented:
PS: Out of interest, in the end I solved the issue by appending a space to selected elements (div, td and a).

<!DOCTYPE html>
<html>
<head>
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js"></script>

<script>
$(document).ready(function(){

    // Append a space to selected elements so their text does not run together.
    appendSpaces();

    var numberOfMatches = $("body").text().match(/\w+/ig).length;
    console.log(numberOfMatches);

    var bodyText = $("body").text().match(/\w+/ig);
    console.log(bodyText);

    console.log( $("p").text() );

});

function appendSpaces(){
    $("div").append(" ");
    $("td").append(" ");
    $("a").append(" ");
}

</script>
</head>
<body>

<h1>Here is a heading.</h1><a href="#">Link1</a><a href="#">Link2</a><a href="#">Link3</a><p>This is a paragraph.</p><p>This is another paragraph.</p><ul><li>Here is a bullet.</li><li>Here is another bullet.</li><li>Here is the last bullet.</li></ul><div>Here</div><div>are</div><div>some</div><div>div</div><div>tags</div><table width="200" border="0" cellspacing="0" cellpadding="0"><tr><td>row1col1</td><td>row1col2</td></tr><tr><td>row2col1</td><td>row2col2</td></tr></table>

</body>
</html>

 
Steve Krile Commented:
This line of my solution does the same thing but for ALL html elements:

var bodyHTML = $("body").html().replace(/<(.|\n)+?>/ig, " ");
 
AdrianSmithUK (Author) Commented:
Does it not destroy all the HTML elements and replace them with a space?
 
Steve Krile Commented:
Well, it copies the body's markup into a string variable (using the .html() command), replaces every HTML tag in that copy with a space, and then counts what is left, so there is always white space between the contents of the former HTML elements. The page itself is never modified, so it doesn't "destroy" the HTML for the viewer.
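
A small illustration of that point, assuming a plain string in place of the value returned by `.html()`: JavaScript strings are immutable, so `replace()` returns a new string and leaves its source untouched.

```javascript
// Stand-in for the string that $("body").html() would return (an assumption for this demo).
var markup = "<div>Here</div><div>are</div>";

// replace() builds a new string; neither `markup` nor the page is changed.
var stripped = markup.replace(/<(.|\n)+?>/ig, " ");

console.log(markup);   // <div>Here</div><div>are</div>
console.log(stripped); // " Here  are " (tags swapped for spaces)
```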
 
AdrianSmithUK (Author) Commented:
I see. Definitely a good snippet. Many thanks. Adrian
 
AdrianSmithUK (Author) Commented:
Skrile

I refactored the solution to use pure JavaScript, and your solution is beautiful. Here is the code.

<!DOCTYPE html>
<html>
<head>

<script>

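window.onload = function(){

   // Copy the body's markup into a string and replace every tag with a space.
   var bodyHtml = document.getElementsByTagName('body')[0].innerHTML.replace(/<(.|\n)+?>/ig, " ");
   // match() returns null when nothing matches, so fall back to an empty array.
   var bodyText = bodyHtml.match(/\w+/ig) || [];

   console.log(bodyText);
   console.log(bodyText.length);
};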

</script>
</head>
<body>

<h1>Here is a heading.</h1><a href="#">Link1</a><a href="#">Link2</a><a href="#">Link3</a><p>This is a paragraph.</p><p>This is another paragraph.</p><ul><li>Here is a bullet.</li><li>Here is another bullet.</li><li>Here is the last bullet.</li></ul><div>Here</div><div>are</div><div>some</div><div>div</div><div>tags</div><table width="200" border="0" cellspacing="0" cellpadding="0"><tr><td>row1col1</td><td>row1col2</td></tr><tr><td>row2col1</td><td>row2col2</td></tr></table>

</body>
</html>

 
Steve Krile Commented:
Nice.

Also, there is a good discussion of the troubles with window.onload here:

http://stackoverflow.com/questions/6352789/cross-browser-compatible-way-to-bind-events-on-page-load
 
AdrianSmithUK (Author) Commented:
Very interesting and many thanks.

I'm developing a plugin for Firefox and the DOMContentLoaded event will be much more suitable than the window load event, which only fires once every resource has finished loading. Some websites take ages to load their Flash movies and banners.

https://developer.mozilla.org/en-US/docs/Mozilla_event_reference/DOMContentLoaded_(event)

Thanks Again :)
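
One thing to watch with DOMContentLoaded is that a listener registered after the event has already fired never runs. The following is a minimal sketch of one way to handle that, assuming the standard `document.readyState` property; `onDomReady` is a hypothetical helper name, not something from the thread.

```javascript
// Hypothetical helper: run `callback` once the DOM of `doc` has been parsed.
function onDomReady(doc, callback) {
    if (doc.readyState === "loading") {
        // Still parsing: wait for DOMContentLoaded, then detach the listener.
        doc.addEventListener("DOMContentLoaded", function handler() {
            doc.removeEventListener("DOMContentLoaded", handler, false);
            callback();
        }, false);
    } else {
        // "interactive" or "complete": the DOM is already available.
        callback();
    }
}

// Usage in a page (skipped when no document exists, e.g. outside a browser).
if (typeof document !== "undefined") {
    onDomReady(document, function () {
        console.log("DOM is ready");
    });
}
```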
 
AdrianSmithUK (Author) Commented:
Much faster!

<!DOCTYPE html>
<html>
<head>

<script>

var listener = function(e)
{
    // Run once: detach the listener as soon as it fires.
    window.removeEventListener("DOMContentLoaded", listener, false);

    var bodyHtml = document.getElementsByTagName('body')[0].innerHTML.replace(/<(.|\n)+?>/ig, " ");
    var bodyText = bodyHtml.match(/\w+/ig);

    console.log(bodyText);
    console.log(bodyText.length);
};

window.addEventListener("DOMContentLoaded", listener, false);

</script>
</head>
<body>

<h1>Here is a heading.</h1><a href="#">Link1</a><a href="#">Link2</a><a href="#">Link3</a><p>This is a paragraph.</p><p>This is another paragraph.</p><ul><li>Here is a bullet.</li><li>Here is another bullet.</li><li>Here is the last bullet.</li></ul><div>Here</div><div>are</div><div>some</div><div>div</div><div>tags</div><table width="200" border="0" cellspacing="0" cellpadding="0"><tr><td>row1col1</td><td>row1col2</td></tr><tr><td>row2col1</td><td>row2col2</td></tr></table>

</body>
</html>
