utf8CountCodepoints

Returns the number of Unicode codepoints in the supplied UTF-8 string. The input may be any byte slice, not only a string literal, and an error is returned if it is not valid UTF-8.
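A minimal usage sketch, written as a Zig test, may help show the difference between byte length and codepoint count. The sample string is an illustration only and is not taken from the library's own tests.

```zig
const std = @import("std");

test "utf8CountCodepoints counts codepoints, not bytes" {
    // "héllo": 'é' (U+00E9) occupies two bytes in UTF-8, so the slice
    // holds 6 bytes but only 5 codepoints.
    const s: []const u8 = "h\xc3\xa9llo";
    try std.testing.expectEqual(@as(usize, 6), s.len);
    try std.testing.expectEqual(@as(usize, 5), try std.unicode.utf8CountCodepoints(s));
}
```

Because the return type is an error union, the call has to be unwrapped with `try` or `catch`.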

Parameters

s: []const u8
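Since `s` is an arbitrary byte slice, it may contain invalid UTF-8, which is reported through the error union. The sketch below assumes the error names `TruncatedInput` (visible in the source listing on this page) and `Utf8InvalidStartByte` (returned by `utf8ByteSequenceLength`, not shown here, so treat it as an assumption about the std version in use). `codepointCountOrNull` is a hypothetical helper, not part of `std.unicode`.

```zig
const std = @import("std");

/// Hypothetical helper (not part of std.unicode): returns the codepoint
/// count, or null when `s` is not valid UTF-8.
fn codepointCountOrNull(s: []const u8) ?usize {
    return std.unicode.utf8CountCodepoints(s) catch null;
}

test "invalid input is reported through the error union" {
    // 0xC3 starts a two-byte sequence, but the continuation byte is missing.
    try std.testing.expectError(error.TruncatedInput, std.unicode.utf8CountCodepoints("\xc3"));
    // 0xFF can never start a UTF-8 sequence.
    try std.testing.expectError(error.Utf8InvalidStartByte, std.unicode.utf8CountCodepoints("\xff"));
    // The helper collapses every failure into null.
    try std.testing.expectEqual(@as(?usize, null), codepointCountOrNull("\xff"));
}
```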

Types

Utf8View: Utf8View iterates the code points of a utf-8 encoded string.
Utf8Iterator
Utf16LeIterator
Wtf8View: Wtf8View iterates the code points of a WTF-8 encoded string,
Wtf8Iterator: Asserts that `bytes` is valid WTF-8
Wtf16LeIterator
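As a rough illustration of how `Utf8View` relates to `utf8CountCodepoints`, the sketch below counts codepoints by iterating a view. `countWithView` is a hypothetical helper introduced for this example. It assumes `Utf8View.init` validates its input and that the iterator's `nextCodepoint` returns an optional codepoint.

```zig
const std = @import("std");

/// Hypothetical helper: counts codepoints by walking a Utf8View.
/// `init` validates the whole slice up front and fails on invalid UTF-8.
fn countWithView(s: []const u8) !usize {
    const view = try std.unicode.Utf8View.init(s);
    var it = view.iterator();
    var count: usize = 0;
    while (it.nextCodepoint()) |_| count += 1;
    return count;
}

test "Utf8View iteration agrees with utf8CountCodepoints" {
    const s = "h\xc3\xa9llo";
    try std.testing.expectEqual(try std.unicode.utf8CountCodepoints(s), try countWithView(s));
}
```

When only the count is needed, `utf8CountCodepoints` avoids building a view; the iterator is the better fit when each codepoint has to be inspected.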

Functions

utf8CodepointSequenceLength: Returns how many bytes the UTF-8 representation would require
utf8ByteSequenceLength: Given the first byte of a UTF-8 codepoint,
utf8Encode: Encodes the given codepoint into a UTF-8 byte sequence.
utf8EncodeComptime
utf8Decode: Deprecated.
utf8Decode2
utf8Decode3
utf8Decode3AllowSurrogateHalf
utf8Decode4
utf8ValidCodepoint: Returns true if the given unicode codepoint can be encoded in UTF-8.
utf8CountCodepoints: Returns the number of Unicode codepoints in the supplied UTF-8 string.
utf8ValidateSlice: Returns true if the input consists entirely of UTF-8 codepoints
utf16IsHighSurrogate
utf16IsLowSurrogate
utf16CodepointSequenceLength: Returns how many code units the UTF-16 representation would require
utf16CodeUnitSequenceLength: Given the first code unit of a UTF-16 codepoint, returns a number 1-2
utf16DecodeSurrogatePair: Decodes the codepoint encoded in the given pair of UTF-16 code units.
utf16CountCodepoints: Returns the number of Unicode codepoints in the supplied UTF-16 string.
fmtUtf8: Return a Formatter for a (potentially ill-formed) UTF-8 string.
utf16LeToUtf8ArrayList
utf16LeToUtf8Alloc: Caller owns returned memory.
utf16LeToUtf8AllocZ: Caller owns returned memory.
utf16LeToUtf8
utf8ToUtf16LeArrayList
utf8ToUtf16LeAlloc
utf8ToUtf16LeAllocZ
utf8ToUtf16Le: Returns index of next character.
utf8ToUtf16LeImpl
utf8ToUtf16LeStringLiteral: Converts a UTF-8 string literal into a UTF-16LE string literal.
wtf8ToWtf16LeStringLiteral: Converts a WTF-8 string literal into a WTF-16LE string literal.
calcUtf16LeLenImpl
calcUtf16LeLen: Returns length in UTF-16LE of UTF-8 slice as length of []u16.
calcWtf16LeLen: Returns length in WTF-16LE of WTF-8 slice as length of []u16.
fmtUtf16Le: Return a Formatter for a (potentially ill-formed) UTF-16 LE string,
isSurrogateCodepoint: Returns true if the codepoint is a surrogate (U+D800 to U+DFFF)
wtf8Encode: Encodes the given codepoint into a WTF-8 byte sequence.
wtf8Decode: Deprecated.
wtf8ValidateSlice: Returns true if the input consists entirely of WTF-8 codepoints
wtf16LeToWtf8ArrayList
wtf16LeToWtf8Alloc: Caller must free returned memory.
wtf16LeToWtf8AllocZ: Caller must free returned memory.
wtf16LeToWtf8
wtf8ToWtf16LeArrayList
wtf8ToWtf16LeAlloc
wtf8ToWtf16LeAllocZ
wtf8ToWtf16Le: Returns index of next character.
checkUtf8ToUtf16LeOverflow: Checks if calling `utf8ToUtf16Le` would overflow.
checkWtf8ToWtf16LeOverflow: Checks if calling `wtf8ToWtf16Le` would overflow.
wtf8ToUtf8Lossy: Surrogate codepoints (U+D800 to U+DFFF) are replaced by the Unicode replacement
wtf8ToUtf8LossyAlloc
wtf8ToUtf8LossyAllocZ
calcWtf8Len: Returns the length, in bytes, that would be necessary to encode the

Error Sets

Utf16LeToUtf8AllocError
Utf16LeToUtf8Error

Values

replacement_character: Use this to replace an unknown, unrecognized, or unrepresentable character.
replacement_character_utf8: = utf8EncodeComptime(replacement_character)
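For reference, the Unicode replacement character is U+FFFD, which encodes to the three bytes EF BF BD in UTF-8. The sketch below checks both values, assuming `replacement_character` is a `u21` and `replacement_character_utf8` is a byte array in the std version in use.

```zig
const std = @import("std");

test "replacement character values" {
    // U+FFFD REPLACEMENT CHARACTER.
    try std.testing.expectEqual(@as(u21, 0xFFFD), std.unicode.replacement_character);
    // Its UTF-8 encoding is the three-byte sequence EF BF BD.
    try std.testing.expectEqualSlices(u8, "\xef\xbf\xbd", &std.unicode.replacement_character_utf8);
}
```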

Implementation

pub fn utf8CountCodepoints(s: []const u8) !usize {
    var len: usize = 0;

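    // Number of bytes examined per fast-path iteration: one machine word.
    const N = @sizeOf(usize);
    // 0x80 repeated in every byte of a usize (0x8080808080808080 on 64-bit
    // targets); a nonzero AND means the word contains a non-ASCII byte.
    const MASK = 0x80 * (std.math.maxInt(usize) / 0xff);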

    var i: usize = 0;
    while (i < s.len) {
        // Fast path for ASCII sequences
        while (i + N <= s.len) : (i += N) {
            const v = mem.readInt(usize, s[i..][0..N], native_endian);
            if (v & MASK != 0) break;
            len += N;
        }

        if (i < s.len) {
            const n = try utf8ByteSequenceLength(s[i]);
            if (i + n > s.len) return error.TruncatedInput;

            switch (n) {
                1 => {}, // ASCII, no validation needed
                else => _ = try utf8Decode(s[i..][0..n]),
            }

            i += n;
            len += 1;
        }
    }

    return len;
}
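The fast path above relies on a word-at-a-time check: every ASCII byte has its high bit clear, so ANDing a whole machine word with a mask of repeated 0x80 bytes detects any non-ASCII byte in a single operation. The following sketch isolates that check. `wordHasNonAscii` is a hypothetical helper written for this illustration, not a function in `std.unicode`.

```zig
const std = @import("std");
const native_endian = @import("builtin").cpu.arch.endian();

/// Hypothetical helper: true if any byte of `word` has its high bit set,
/// meaning the word contains at least one non-ASCII byte.
fn wordHasNonAscii(word: usize) bool {
    // maxInt(usize) / 0xff is 0x01 repeated in every byte, so multiplying
    // by 0x80 gives 0x80 repeated in every byte.
    const mask: usize = 0x80 * (std.math.maxInt(usize) / 0xff);
    return word & mask != 0;
}

test "word-at-a-time ASCII check" {
    const N = @sizeOf(usize);
    const ascii = [_]u8{'a'} ** N;
    var mixed = ascii;
    mixed[N - 1] = 0xc3; // lead byte of a two-byte sequence

    try std.testing.expect(!wordHasNonAscii(std.mem.readInt(usize, &ascii, native_endian)));
    try std.testing.expect(wordHasNonAscii(std.mem.readInt(usize, &mixed, native_endian)));
}
```

When a word contains only ASCII, the function adds `N` to the count without decoding anything; otherwise it falls back to decoding one sequence at a time and then retries the fast path.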