Thursday, April 14, 2011

Fast case counting

I have a large number of strings to process in php. I want to "fix" them to be title case (using ucwords(strtolower($str))) but only if they are all upper or all lower case already. If they are already mixed case, I'd just rather just leave them as they are.

What is the fastest way to check for this? It seems like foring through the string would be a rather slow way to go about it.

Here's what I have, which I think will be too slow:

function fixCase($str)
{
 $uc = 0;
 $lc = 0;
 for($i=0;$i<strlen($str);$i++)
 {
  if ($str[$i] >= 'a' && $str[$i] <= 'z')
   $lc++;
  else if ($str[$i] >= 'A' && $str[$i] <= 'Z')
   $uc++;
 }

 if ($uc == 0 || $lc == 0)
 {
  return ucwords(strtolower($str));
 }
}
From stackoverflow
  • just use a string compare (case sensitive)

    function fixCase($str)
    {
      if ( 
           (strcmp($str, strtolower($str)) === 0) || 
           (strcmp($str, strtoupper($str)) === 0) ) 
      {
        $str = ucwords(strtolower($str));
      }
    
      return $str;
    }
    
  • Wouldn't it be easier to check if the string = lowercase(string) or string = uppercase(string) and if so then leave it. Otherwise perform your operation.

  • There's not going to be any amazing optimization, because by the nature of the problem you need to look at every character.

    Personally, I would just loop over the characters of the string with this sort of algorithm:

    • Look at the first character in the string, set a variable indicating whether it was upper or lowercase.
    • Now examine each character sequentially. If you get to the end of the string and they've all been the same case as the first character, fix the string's case as you like.
    • If any character is a different case than the first character was, break the loop and return the string.


    Edit: actual code, I think this is about as good as you're going to get.

    // returns 0 if non-alphabetic char, 1 if uppercase, 2 if lowercase
    function getCharType($char)
    {
        if ($char >= 'A' && $char <= 'Z')
        {
            return 1;
        }
        else if ($char >= 'a' && $char <= 'z')
        {
            return 2;
        }
        else
        {
            return 0;
        }
    }
    
    function fixCase($str)
    {
        for ($i = 0; $i < strlen($str); $i++)
        {
            $charType = getCharType($str[$i]);
            if ($charType != 0)
            {
                $firstCharType = $charType;
                break;
            }
        }
    
        for ($i = $i + 1; $i < strlen($str); $i++)
        {
            $charType = getCharType($str[$i]);
            if ($charType != $firstCharType && $charType != 0)
            {
                return $str;
            }
        }
    
        if ($firstCharType == 1) // uppercase, need to convert to lower first
        {
            return ucwords(strtolower($str));
        }
        else if ($firstCharType == 2) // lowercase, can just ucwords() it
        {
            return ucwords($str);
        }
        else // there were no letters at all in the string, just return it
        {
            return $str;
        }
    }
    
  • Well I decided to do a test of the 2 proposed answers thus far and my original solution. I wouldn't have thought the results would turn out this way, but I guess native methods are /that/ much faster over all.

    Code:

    function method1($str)
    {
     if (strcmp($str, strtolower($str)) == 0)
     {
      return ucwords($str);
     }
     else if (strcmp($str, strtoupper($str)) == 0)
     {
      return ucwords(strtolower($str));
     }
     else
     {
      return $str;
     }
    }
    
    // returns 0 if non-alphabetic char, 1 if uppercase, 2 if lowercase
    function getCharType($char)
    {
     if ($char >= 'A' && $char <= 'Z')
     {
         return 1;
     }
     else if ($char >= 'a' && $char <= 'z')
     {
         return 2;
     }
     else
     {
         return 0;
     }
    }
    
    function method2($str)
    {
     for ($i = 0; $i < strlen($str); $i++)
     {
         $charType = getCharType($str[$i]);
         if ($charType != 0)
         {
             $firstCharType = $charType;
             break;
         }
     }
    
     for ($i = $i + 1; $i < strlen($str); $i++)
     {
         $charType = getCharType($str[$i]);
         if ($charType != $firstCharType && $charType != 0)
         {
             return $str;
         }
     }
    
     if ($firstCharType == 1) // uppercase, need to convert to lower first
     {
         return ucwords(strtolower($str));
     }
     else if ($firstCharType == 2) // lowercase, can just ucwords() it
     {
         return ucwords($str);
     }
     else // there were no letters at all in the string, just return it
     {
         return $str;
     }
    }
    
    function method0($str)
    {
         $uc = 0;
         $lc = 0;
         for($i=0;$i<strlen($str);$i++)
         {
                 if ($str[$i] >= 'a' && $str[$i] <= 'z')
                         $lc++;
                 else if ($str[$i] >= 'A' && $str[$i] <= 'Z')
                         $uc++;
         }
    
         if ($uc == 0 || $lc == 0)
         {
                 return ucwords(strtolower($str));
         }
    }
    
    
    function test($func,$s)
    {
     $start = gettimeofday(true);
     for($i = 0; $i < 1000000; $i++)
     {
      $s4 = $func($s);
     }
     $end = gettimeofday(true);
     echo "$func Time: " . ($end-$start) . " - Avg: ".sprintf("%.09f",(($end-$start)/1000000))."\n";
    }
    
    
    $s1 = "first String";
    $s2 = "second string";
    $s3 = "THIRD STRING";
    
    test("method0",$s1);
    test("method0",$s2);
    test("method0",$s3);
    
    test("method1",$s1);
    test("method1",$s2);
    test("method1",$s3);
    
    test("method2",$s1);
    test("method2",$s2);
    test("method2",$s3);
    

    Results:

    method0 Time: 19.2899270058 - Avg: 0.000019290
    method0 Time: 20.8679389954 - Avg: 0.000020868
    method0 Time: 24.8917310238 - Avg: 0.00002489 
    method1 Time: 3.07466816902 - Avg: 0.000003075
    method1 Time: 2.52559089661 - Avg: 0.000002526
    method1 Time: 4.06261897087 - Avg: 0.000004063
    method2 Time: 19.2718701363 - Avg: 0.000019272
    method2 Time: 35.2485661507 - Avg: 0.000035249
    method2 Time: 29.3357679844 - Avg: 0.000029336
    
    Chad Birch : Interesting, good to know. I'll take that note out of my answer about how it should be faster than the others. :b
    Sean Bright : FWIW, while method1 is the best way to go, you could have sped up your method (method0) by short-circuting out of your for loop once you knew you were mixed-case. (putting 'if ($uc > 0 && $lc > 0) break;' after '$uc++')
    jcinacio : on a side note, it's the first time i see gettimeofday() for benchmarking purposes, as opposed to microtime(), but guess it works out the same though...
  • Note that anything that looks only at [A-Z] will be incorrect as soon as there are accented or umlaut characters. Optimizing for speed is meaningless if the result is incorrect (hey, if the result doesn't have to be correct, it can write you a REALLY fast implementation...)

  • You could try the string case test function I posted here

    function getStringCase($subject)
    {
        if (!empty($subject))
        {
            if (preg_match('/^[^A-Za-z]+$/', $subject))
                return 0;   // no alphabetic characters
            else if (preg_match('/^[^A-Z]+$/', $subject))
                return 1;   // lowercase
            else if (preg_match('/^[^a-z]+$/', $subject))
                return 2;   // uppercase
            else
                return 3;   // mixed-case
        }
        else
        {
            return 0;   // empty
        }
    }
    
  • If the reason you want to avoid fixing already mixed-case strings is for efficiency then you are likely wasting your time, convert every string no matter its current condition:

    function fixCase($str)
    {
        return ucwords(strtolower($str));
    }
    

    I would be very surprised if it ran any slower than the accepted answer for strings the length of those you would generally want to title case, and it's one less condition you need to worry about.

    If, however, there is good reason to avoid converting already mixed-case strings, for example you want to preserve some intended meaning in the casing, then yes, jcinacio's answer is certainly the simplest and very efficient.

    SoapBox : Yeah I'm assuming if there is any existing mixed case then it was intended and I want to keep it. That is the case for most of the data.

0 comments:

Post a Comment