Chris Andrews

asked on

php - get number of characters outside of <>

I have some code that gets the number of characters in a post:

$num_chars = strlen(utf8_decode($content));

However, I'm having some trouble, because this count includes the characters inside HTML tags (<img src="...">, <a href="...">, etc.).

I use the character count to choose a particular layout based on how long the post is, so I need the count to be the number of characters displayed to a reader, not including the HTML code.

Any suggestions on the most efficient way to do that?

Thanks,

Chris
Gary

strip_tags()

$num_chars = strlen(strip_tags(utf8_decode($content))); 


Is the data UTF-8? If so, PHP has mb_strlen(), which might give a more accurate count. strlen() assumes that one byte equals one character, and that isn't true with UTF-8. strip_tags() is probably OK, but it has its quirks and is notoriously unreliable with malformed (or even some well-formed) tags. Your exact results may depend on the PHP release. You might want to check the user notes on the online manual page. If you set up a test case with some representative data, I can show you how to test it.
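To make the byte-versus-character point concrete, here is a minimal sketch of a visible-character count. It assumes the content is UTF-8 and that the mbstring extension is available; the helper name visible_char_count() is made up for illustration and is not from this thread. Decoding entities is an extra step, added so that something like &eacute; counts as one character.

```php
<?php
// Hypothetical helper (not from the thread): count the characters a reader sees.
function visible_char_count($html)
{
    // Remove tags first, so attribute values such as src="..." are not counted.
    $text = strip_tags($html);
    // Turn entities such as &eacute; back into single characters.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // Count characters rather than bytes; this assumes the input is UTF-8.
    return mb_strlen($text, 'UTF-8');
}

$content = '<p>Caf&eacute; <a href="http://example.com/">menu</a></p>';
echo visible_char_count($content), "\n"; // counts "Café menu", not the markup
```

strip_tags() still has the quirks mentioned above, so treat the result as an estimate on malformed markup.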
It's not in UTF-8
utf8_decode
Gary: I didn't overlook that. PHP is just getting into the 21st century with respect to multi-byte character sets, and utf8_decode() does not always work the way we wish it would. Please see the note here:
http://php.net/manual/en/function.utf8-decode.php#104907

Some suggest using iconv. I don't have much experience with it.
Even if the string is not decoded properly it would (should) still be the same length.
(though I may need to double check that)

Edit: But granted, for use elsewhere it may not be the best method.
Chris Andrews

ASKER

Is it UTF-8? Is the content from a post using WordPress 3.9.1 UTF-8? I'm not sure.

I have PHP 5.3.3 on the server.

Testing...

Chris
is the content from a post using wordpress 3.9.1 utf-8?
I'm not sure either.  Some part of the answer may lie in what/whether the client copied and pasted from Word for Windows (thanks again, Obama).
Yes (or should be)
http://codex.wordpress.org/Converting_Database_Character_Sets

Why are you decoding the string to start with?
Ok, well I feel stupid, but I found I am asking the wrong question.

It's actually this function that is adjusting the layout, based on word count, not character count:

//for getting word count in single.php
function wcount(){
 ob_start();
 the_content();
 $content = ob_get_clean();
 return sizeof(explode(" ", $content));
}
 
Now... I tried changing the $content to this:

 $content = strip_tags(ob_get_clean());

That caused a slight drop in the word count, but it is still counting a lot of what is inside the HTML tags as words.
Post an example of the string.
str_word_count()

Splitting a string by a space isn't likely to give you a real word count
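As a rough illustration of why splitting on a single space miscounts, here is a sketch that strips the tags and then splits on runs of whitespace. The helper name visible_word_count() and the sample string are assumptions made for this example, not code from the thread. str_word_count() on the stripped text is the other option mentioned above; it mainly recognises ASCII letters, so it may miscount accented words.

```php
<?php
// Hypothetical helper (not from the thread): count words in rendered HTML.
function visible_word_count($html)
{
    // Put a space before each tag so "</p><p>" does not glue two words together.
    $spaced = str_replace('<', ' <', $html);
    $text = strip_tags($spaced);
    // Split on runs of whitespace (spaces, tabs, newlines) and drop empty pieces.
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return count($words);
}

$content = "<p>Hello <a href=\"http://example.com/\">big   world</a></p>\n<p>Bye</p>";
echo visible_word_count($content), "\n";      // tag-free, whitespace-aware count
echo count(explode(" ", $content)), "\n";     // the single-space split, for comparison
```

Inside wcount(), the same idea would apply to the buffered output of the_content() in place of the explode() call.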
ASKER CERTIFIED SOLUTION
Gary

Thank you both very much for your help on this!