mb_substitute_character - PHP code example

The mb_substitute_character() Function in PHP

PHP's native string type has no notion of encodings or Unicode. A string is simply a sequence of bytes, and the standard string functions treat every byte as one character, which is only correct for single-byte encodings such as ASCII. This means that when a programmer wants to work with text in a multi-byte encoding such as UTF-8, they have to track the encoding themselves and handle each character either as a short sequence of bytes or as an integer code point. This is complicated, slow and error-prone.
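As a rough illustration of this byte-oriented view, the sketch below uses only core functions. The sample character is an assumption chosen for the example, not something taken from the text above.

```php
<?php
// "é" (U+00E9) written with a Unicode escape; PHP stores it as UTF-8 bytes.
$text = "\u{00E9}";

// strlen() counts bytes, not characters.
$byteCount = strlen($text);

// bin2hex() exposes the raw bytes behind the single visible character.
$hex = bin2hex($text);

// String offsets are byte offsets too, so $text[0] is only half of "é".
$firstByte = bin2hex($text[0]);

echo $byteCount, ' ', $hex, ' ', $firstByte, PHP_EOL; // prints "2 c3a9 c3"
```

One visible character occupies two bytes here, which is why byte-based functions give misleading results on UTF-8 text.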

Fortunately, an extension called mbstring was developed to handle the situation. It provides encoding-aware counterparts of many of PHP's standard string functions (mb_substr(), mb_strlen(), the mb_ereg() family etc). It also provides mb_convert_encoding() to convert between different encodings, a task that other extensions such as iconv or GNU Recode also address.
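A minimal sketch of the difference, assuming the mbstring extension is loaded and the input is UTF-8 (the sample string is made up for the example):

```php
<?php
$text = "h\u{00E9}llo"; // "héllo" as UTF-8: 5 characters, 6 bytes

$bytes = strlen($text);                    // byte count
$chars = mb_strlen($text, 'UTF-8');        // character count
$head  = mb_substr($text, 0, 2, 'UTF-8');  // first two characters, "hé"

// Convert UTF-8 to ISO-8859-1, where "é" is the single byte 0xE9.
$latin1 = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');

echo $bytes, ' ', $chars, ' ', $head, ' ', bin2hex($latin1), PHP_EOL;
```

Passing the encoding explicitly, as done here, avoids depending on whatever internal encoding the surrounding application has configured.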

However, the mbstring extension is quite large and it is not enabled by default, so it may be missing from a typical PHP install. Furthermore, its functions do not form a clean and consistent API: they all carry the mb_ prefix, but they inherit the inconsistent naming and argument order of the standard string functions they mirror (compare mb_strpos() with mb_str_split()). This is exacerbated by encoding defaults. Older PHP versions assumed ISO-8859-1 in several places (default_charset has defaulted to UTF-8 only since PHP 5.6), so the intended encoding often has to be configured or passed explicitly for multi-byte text to be handled properly.
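Whether mbstring is available, and which defaults are in effect, can be checked at runtime. The following is only a sketch; the variable names are illustrative and the values it prints depend on the local configuration.

```php
<?php
// True when the mbstring extension is compiled in or loaded.
$hasMbstring = extension_loaded('mbstring');

// The charset PHP assumes by default (UTF-8 on current versions).
$defaultCharset = ini_get('default_charset');

// The encoding the mb_* functions use when none is passed explicitly.
$internalEncoding = $hasMbstring ? mb_internal_encoding() : null;

var_dump($hasMbstring, $defaultCharset, $internalEncoding);
```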

What is needed is a separate Unicode text string type and a new API which offers a complete and consistent solution for dealing with encodings and Unicode in PHP. The mb_substitute_character() function, which gets or sets the replacement that mbstring uses for characters it cannot convert, is an example of the sort of functionality such an API should provide.
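To make the behaviour concrete, here is a small sketch, assuming mbstring is loaded. The input string and the chosen substitutes are illustrative only. Called without arguments the function returns the current setting; called with a code point or with 'none' it changes it.

```php
<?php
$previous = mb_substitute_character();  // remember the current setting

$input = "a\u{2192}b";                  // "a→b"; U+2192 has no ASCII equivalent

mb_substitute_character(0x3F);          // substitute with "?"
$withQuestion = mb_convert_encoding($input, 'ASCII', 'UTF-8');

mb_substitute_character(0x5F);          // substitute with "_"
$withUnderscore = mb_convert_encoding($input, 'ASCII', 'UTF-8');
$current = mb_substitute_character();   // the code point as an integer

mb_substitute_character('none');        // drop unconvertible characters
$dropped = mb_convert_encoding($input, 'ASCII', 'UTF-8');

mb_substitute_character($previous);     // put the original setting back

echo $withQuestion, ' ', $withUnderscore, ' ', $dropped, PHP_EOL;
```

The setting is global to the request, which is why the sketch saves and restores it rather than leaving it changed.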

mb_substitute_character example

        // details.
        if (function_exists('mb_convert_encoding')) {
            // mb library has the following behaviors:
            // - UTF-16 surrogates result in false.
            // - Overlongs and outside Plane 16 result in empty strings.

            // Before we run mb_convert_encoding we need to tell it what to do with
            // characters it does not know. This could be different than the parent
            // application executing this library so we store the value, change it
            // to our needs, and then change it back when we are done. This feels
            // a little excessive and it would be great if there was a better way.
            $save = mb_substitute_character();
            mb_substitute_character('none');
            $data = mb_convert_encoding($data, 'UTF-8', $encoding);
            mb_substitute_character($save);
        }
        // @todo Get iconv running in at least some environments if that is possible.
        elseif (function_exists('iconv') && 'auto' !== $encoding) {
            // fprintf(STDOUT, "iconv found\n");
            // iconv has the following behaviors:
            // - Overlong representations are ignored.
            // - Beyond Plane 16 is replaced with a lower char.
            // - Incomplete sequences generate a warning.
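The save, change and restore sequence in the excerpt above can be wrapped so that the caller's setting is restored even if the conversion throws. This is one possible sketch, not the excerpt author's code: the function name convert_dropping_invalid() is hypothetical, and mbstring is assumed to be loaded.

```php
<?php
/**
 * Hypothetical helper: converts $data to UTF-8, dropping anything that
 * cannot be converted, and always restores the previous substitute setting.
 */
function convert_dropping_invalid(string $data, string $encoding): string
{
    $saved = mb_substitute_character();
    mb_substitute_character('none');
    try {
        return mb_convert_encoding($data, 'UTF-8', $encoding);
    } finally {
        // Runs on normal return and on exceptions alike.
        mb_substitute_character($saved);
    }
}

mb_substitute_character(0x3F);                       // the "application" setting
$clean = convert_dropping_invalid("ok\xFF", 'UTF-8'); // 0xFF is never valid UTF-8
$after = mb_substitute_character();                  // still the "?" code point
```

Using finally keeps the global setting from leaking out of the helper, which addresses the concern voiced in the excerpt's own comment about interfering with the parent application.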

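// Excerpt from a polyfill bootstrap file with untyped (PHP 7 style) signatures.
// `p` is a namespace alias declared earlier in that file, outside this excerpt;
// the code appears to come from symfony/polyfill-mbstring, where it reads
// `use Symfony\Polyfill\Mbstring as p;`.
if (!function_exists('mb_strpos')) {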
    function mb_strpos($haystack, $needle, $offset = 0, $encoding = null) { return p\Mbstring::mb_strpos($haystack, $needle, $offset, $encoding); }
}
if (!function_exists('mb_strtolower')) {
    function mb_strtolower($string, $encoding = null) { return p\Mbstring::mb_strtolower($string, $encoding); }
}
if (!function_exists('mb_strtoupper')) {
    function mb_strtoupper($string, $encoding = null) { return p\Mbstring::mb_strtoupper($string, $encoding); }
}
if (!function_exists('mb_substitute_character')) {
    function mb_substitute_character($substitute_character = null) { return p\Mbstring::mb_substitute_character($substitute_character); }
}
if (!function_exists('mb_substr')) {
    function mb_substr($string, $start, $length = 2147483647, $encoding = null) { return p\Mbstring::mb_substr($string, $start, $length, $encoding); }
}
if (!function_exists('mb_stripos')) {
    function mb_stripos($haystack, $needle, $offset = 0, $encoding = null) { return p\Mbstring::mb_stripos($haystack, $needle, $offset, $encoding); }
}
if (!function_exists('mb_stristr')) {
    function mb_stristr($haystack, $needle, $before_needle = false, $encoding = null) { return p\Mbstring::mb_stristr($haystack, $needle, $before_needle, $encoding); }
}
// ... (excerpt ends here; the mb_strrchr guard is cut off in the source)

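// The same guards with PHP 8 style signatures (nullable parameters, union types).
// As before, `p` is a namespace alias declared outside this excerpt.
if (!function_exists('mb_strpos')) {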
    function mb_strpos(?string $haystack, ?string $needle, ?int $offset = 0, ?string $encoding = null): int|false { return p\Mbstring::mb_strpos((string) $haystack, (string) $needle, (int) $offset, $encoding); }
}
if (!function_exists('mb_strtolower')) {
    function mb_strtolower(?string $string, ?string $encoding = null): string { return p\Mbstring::mb_strtolower((string) $string, $encoding); }
}
if (!function_exists('mb_strtoupper')) {
    function mb_strtoupper(?string $string, ?string $encoding = null): string { return p\Mbstring::mb_strtoupper((string) $string, $encoding); }
}
if (!function_exists('mb_substitute_character')) {
    function mb_substitute_character(string|int|null $substitute_character = null): string|int|bool { return p\Mbstring::mb_substitute_character($substitute_character); }
}
if (!function_exists('mb_substr')) {
    function mb_substr(?string $string, ?int $start, ?int $length = null, ?string $encoding = null): string { return p\Mbstring::mb_substr((string) $string, (int) $start, $length, $encoding); }
}
if (!function_exists('mb_stripos')) {
    function mb_stripos(?string $haystack, ?string $needle, ?int $offset = 0, ?string $encoding = null): int|false { return p\Mbstring::mb_stripos((string) $haystack, (string) $needle, (int) $offset, $encoding); }
}
if (!function_exists('mb_stristr')) {
    function mb_stristr(?string $haystack, ?string $needle, ?bool $before_needle = false, ?string $encoding = null): string|false { return p\Mbstring::mb_stristr((string) $haystack, (string) $needle, (bool) $before_needle, $encoding); }
}
// ... (excerpt ends here; the mb_strrchr guard is cut off in the source)
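The guard used throughout these excerpts defines a global function only when no function of that name exists yet, so a native implementation always takes precedence over the fallback. The sketch below shows the pattern in isolation. DemoPolyfill and demo_shout() are hypothetical names invented for the example and are not part of any real polyfill.

```php
<?php
// Hypothetical stand-in for a polyfill class such as p\Mbstring.
final class DemoPolyfill
{
    public static function shout(string $text): string
    {
        return strtoupper($text) . '!';
    }
}

// Nothing else defines demo_shout(), so the guard installs the fallback.
if (!function_exists('demo_shout')) {
    function demo_shout(string $text): string
    {
        return DemoPolyfill::shout($text);
    }
}

// strlen() is built in, so an equivalent guard for it is simply skipped.
$nativeKept = true;
if (!function_exists('strlen')) {
    $nativeKept = false;
}

$result = demo_shout('hi'); // "HI!"
```

Because the check happens at load time, the same bootstrap file works whether or not the real extension is installed.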