wtf8ToUtf8Lossy

Surrogate codepoints (U+D800 to U+DFFF) are replaced by the Unicode replacement character (U+FFFD). Surrogate codepoints and the replacement character are all encoded as three bytes, so the converted output is always exactly as long as the input; utf8 must be at least as long as wtf8. In-place conversion is supported when utf8 and wtf8 refer to the same slice.

Note: If wtf8 is entirely composed of well-formed UTF-8, then no conversion is necessary. utf8ValidateSlice can be used to check whether lossy conversion is needed.

If wtf8 is not valid WTF-8, then error.InvalidWtf8 is returned.
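As an illustration of the behaviour described above, here is a minimal sketch that converts into a separate output buffer. The input bytes and test names are made up for the example; only `std.unicode.wtf8ToUtf8Lossy` and the `std.testing` helpers come from the standard library.

```zig
const std = @import("std");

test "wtf8ToUtf8Lossy replaces an unpaired surrogate with U+FFFD" {
    // "a", then the WTF-8 encoding of the unpaired high surrogate U+D800, then "b".
    const wtf8 = "a\xED\xA0\x80b";

    // The output buffer must be at least as long as the input.
    var utf8: [wtf8.len]u8 = undefined;
    try std.unicode.wtf8ToUtf8Lossy(&utf8, wtf8);

    // U+FFFD is encoded in UTF-8 as EF BF BD, which is also three bytes.
    try std.testing.expectEqualStrings("a\xEF\xBF\xBDb", &utf8);
}

test "wtf8ToUtf8Lossy rejects input that is not valid WTF-8" {
    var utf8: [1]u8 = undefined;
    // 0xFF can never appear in WTF-8 (or UTF-8).
    try std.testing.expectError(
        error.InvalidWtf8,
        std.unicode.wtf8ToUtf8Lossy(&utf8, "\xFF"),
    );
}
```

Running the file with `zig test` should exercise both cases.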

Parameters

utf8: []u8
wtf8: []const u8
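Because both parameters may refer to the same slice, a buffer can be sanitized in place. The sketch below assumes a hypothetical helper, `sanitizeInPlace`, which is not part of `std.unicode`. It first uses `utf8ValidateSlice` to skip the conversion when the buffer is already well-formed UTF-8.

```zig
const std = @import("std");

/// Hypothetical helper (not part of std.unicode): makes `buf` valid UTF-8
/// in place, skipping the conversion when `buf` is already well-formed.
fn sanitizeInPlace(buf: []u8) error{InvalidWtf8}!void {
    if (std.unicode.utf8ValidateSlice(buf)) return;
    // Passing the same slice for both parameters selects in-place conversion.
    try std.unicode.wtf8ToUtf8Lossy(buf, buf);
}

test "in-place lossy conversion" {
    // Mutable copy of the literal; ED BF BF encodes the unpaired low surrogate U+DFFF.
    var buf = "x\xED\xBF\xBFy".*;
    try sanitizeInPlace(&buf);
    try std.testing.expectEqualStrings("x\xEF\xBF\xBDy", &buf);
}
```

The pre-check is optional: calling `wtf8ToUtf8Lossy` on well-formed UTF-8 leaves the bytes unchanged, so the check only avoids redundant work.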

Types

Utf8View: Utf8View iterates the code points of a utf-8 encoded string.
Utf8Iterator
Utf16LeIterator
Wtf8View: Wtf8View iterates the code points of a WTF-8 encoded string,
Wtf8Iterator: Asserts that `bytes` is valid WTF-8
Wtf16LeIterator

Functions

utf8CodepointSequenceLength: Returns how many bytes the UTF-8 representation would require
utf8ByteSequenceLength: Given the first byte of a UTF-8 codepoint,
utf8Encode: Encodes the given codepoint into a UTF-8 byte sequence.
utf8EncodeComptime
utf8Decode: Deprecated.
utf8Decode2
utf8Decode3
utf8Decode3AllowSurrogateHalf
utf8Decode4
utf8ValidCodepoint: Returns true if the given unicode codepoint can be encoded in UTF-8.
utf8CountCodepoints: Returns the length of a supplied UTF-8 string literal in terms of unicode
utf8ValidateSlice: Returns true if the input consists entirely of UTF-8 codepoints
utf16IsHighSurrogate
utf16IsLowSurrogate
utf16CodepointSequenceLength: Returns how many code units the UTF-16 representation would require
utf16CodeUnitSequenceLength: Given the first code unit of a UTF-16 codepoint, returns a number 1-2
utf16DecodeSurrogatePair: Decodes the codepoint encoded in the given pair of UTF-16 code units.
utf16CountCodepoints: Returns the length of a supplied UTF-16 string literal in terms of unicode
fmtUtf8: Return a Formatter for a (potentially ill-formed) UTF-8 string.
utf16LeToUtf8ArrayList
utf16LeToUtf8Alloc: Caller owns returned memory.
utf16LeToUtf8AllocZ: Caller owns returned memory.
utf16LeToUtf8
utf8ToUtf16LeArrayList
utf8ToUtf16LeAlloc
utf8ToUtf16LeAllocZ
utf8ToUtf16Le: Returns index of next character.
utf8ToUtf16LeImpl
utf8ToUtf16LeStringLiteral: Converts a UTF-8 string literal into a UTF-16LE string literal.
wtf8ToWtf16LeStringLiteral: Converts a WTF-8 string literal into a WTF-16LE string literal.
calcUtf16LeLenImpl
calcUtf16LeLen: Returns length in UTF-16LE of UTF-8 slice as length of []u16.
calcWtf16LeLen: Returns length in WTF-16LE of WTF-8 slice as length of []u16.
fmtUtf16Le: Return a Formatter for a (potentially ill-formed) UTF-16 LE string,
isSurrogateCodepoint: Returns true if the codepoint is a surrogate (U+D800 to U+DFFF)
wtf8Encode: Encodes the given codepoint into a WTF-8 byte sequence.
wtf8Decode: Deprecated.
wtf8ValidateSlice: Returns true if the input consists entirely of WTF-8 codepoints
wtf16LeToWtf8ArrayList
wtf16LeToWtf8Alloc: Caller must free returned memory.
wtf16LeToWtf8AllocZ: Caller must free returned memory.
wtf16LeToWtf8
wtf8ToWtf16LeArrayList
wtf8ToWtf16LeAlloc
wtf8ToWtf16LeAllocZ
wtf8ToWtf16Le: Returns index of next character.
checkUtf8ToUtf16LeOverflow: Checks if calling `utf8ToUtf16Le` would overflow.
checkWtf8ToWtf16LeOverflow: Checks if calling `wtf8ToWtf16Le` would overflow.
wtf8ToUtf8Lossy: Surrogate codepoints (U+D800 to U+DFFF) are replaced by the Unicode replacement
wtf8ToUtf8LossyAlloc
wtf8ToUtf8LossyAllocZ
calcWtf8Len: Returns the length, in bytes, that would be necessary to encode the

Error Sets

Utf16LeToUtf8AllocError
Utf16LeToUtf8Error

Values

replacement_character: Use this to replace an unknown, unrecognized, or unrepresentable character.
replacement_character_utf8: = utf8EncodeComptime(replacement_character)

Implementation

pub fn wtf8ToUtf8Lossy(utf8: []u8, wtf8: []const u8) error{InvalidWtf8}!void {
    // The output is exactly as long as the input, so `utf8` must have room for it.
    assert(utf8.len >= wtf8.len);

    // The same starting address means the caller requested in-place conversion.
    const in_place = utf8.ptr == wtf8.ptr;
    const replacement_char_bytes = comptime blk: {
        var buf: [3]u8 = undefined;
        assert((utf8Encode(replacement_character, &buf) catch unreachable) == 3);
        break :blk buf;
    };

    var dest_i: usize = 0;
    const view = try Wtf8View.init(wtf8);
    var it = view.iterator();
    while (it.nextCodepointSlice()) |codepoint_slice| {
        // All surrogate codepoints are encoded as 3 bytes
        if (codepoint_slice.len == 3) {
            const codepoint = wtf8Decode(codepoint_slice) catch unreachable;
            if (isSurrogateCodepoint(codepoint)) {
                @memcpy(utf8[dest_i..][0..replacement_char_bytes.len], &replacement_char_bytes);
                dest_i += replacement_char_bytes.len;
                continue;
            }
        }
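        // Non-surrogate sequences are copied verbatim; when converting in place
        // they are already in the right position, so the copy is skipped.
        if (!in_place) {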
            @memcpy(utf8[dest_i..][0..codepoint_slice.len], codepoint_slice);
        }
        dest_i += codepoint_slice.len;
    }
}
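The implementation relies on surrogate codepoints and U+FFFD sharing the same encoded length. The following is a small sketch that checks that property using `wtf8Encode` and `replacement_character_utf8` from the listings above; the test name and scratch buffer are illustrative only.

```zig
const std = @import("std");

test "surrogates and U+FFFD both occupy three bytes" {
    var buf: [4]u8 = undefined;

    // Lowest and highest surrogate codepoints, encoded as WTF-8.
    try std.testing.expect((try std.unicode.wtf8Encode(0xD800, &buf)) == 3);
    try std.testing.expect((try std.unicode.wtf8Encode(0xDFFF, &buf)) == 3);

    // The UTF-8 encoding of the replacement character.
    try std.testing.expect(std.unicode.replacement_character_utf8.len == 3);
}
```

Since each replacement swaps three bytes for three bytes, the write position never runs ahead of the read position, which is what makes the in-place mode safe.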