Rust from_utf8_lossy Can Break Byte Offsets in Stream Readers

2026-06-13

rust, utf8, debugging, file-watcher, parsing, streaming

String::from_utf8_lossy is useful when you want to display imperfect bytes without failing the whole operation. It is dangerous when you use the resulting string to compute byte offsets back into the original buffer.

The rule is simple: if the offset will be used to seek, tail, resume, slice raw bytes, or update a file cursor, compute that offset on the raw bytes first. Decode only after the byte boundary is known.

The bug

This looks reasonable:

fn read_complete_lines(raw: &[u8], file_offset: &mut u64) {
    let text = String::from_utf8_lossy(raw);

    if let Some(last_newline) = text.rfind('\n') {
        let consumed = last_newline + 1;
        *file_offset += consumed as u64;

        let complete = &text[..consumed];
        process_lines(complete);
    }
}

The bug is that last_newline is an index in the lossy string, not a byte offset into raw.

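When invalid or incomplete UTF-8 appears, from_utf8_lossy replaces each invalid sequence with the Unicode replacement character U+FFFD, which takes three bytes in UTF-8. The sequence it replaces can be shorter than that (a single stray byte such as 0xFF, for example), so the lossy string can be longer than the raw input. Any index after the replacement is shifted relative to raw, and the cursor lands on the wrong byte, typically too far ahead.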
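A small sketch makes the mismatch concrete. The two helper names are hypothetical and exist only for this comparison; the library calls are String::from_utf8_lossy, str::rfind, and Iterator::rposition.

```rust
/// Length of the "consumed" prefix measured in the lossy string (the buggy way).
fn consumed_via_lossy(raw: &[u8]) -> Option<usize> {
    let text = String::from_utf8_lossy(raw);
    text.rfind('\n').map(|index| index + 1)
}

/// Length of the "consumed" prefix measured in the raw bytes (the safe way).
fn consumed_via_bytes(raw: &[u8]) -> Option<usize> {
    raw.iter()
        .rposition(|&byte| byte == b'\n')
        .map(|index| index + 1)
}
```

For the four-byte input b"a\xFFb\n", the byte-based helper reports 4 while the lossy one reports 6, which is past the end of the buffer. On fully valid UTF-8 the two agree, which is why the bug tends to stay hidden until a bad byte shows up.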

Correct pattern: find byte boundaries first

Find the delimiter in the original bytes:

fn read_complete_lines(raw: &[u8], file_offset: &mut u64) {
    let Some(last_newline_byte_index) = raw.iter().rposition(|&byte| byte == b'\n') else {
        return;
    };

    let complete_byte_len = last_newline_byte_index + 1;
    let complete = String::from_utf8_lossy(&raw[..complete_byte_len]);

    process_lines(&complete);
    *file_offset += complete_byte_len as u64;
}

Now the offset always matches the bytes you actually consumed.

Why this matters for tail readers

Tail readers often work like this:

remember file_offset
read bytes from file_offset
process complete records
advance file_offset
keep incomplete trailing bytes for next read
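
The steps above can be sketched roughly as follows. This is a minimal illustration rather than a production tailer: poll_complete_lines is a hypothetical name, the source is assumed to be anything implementing Read + Seek, and instead of buffering the trailing bytes it simply leaves the cursor in front of them so they are read again on the next poll.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Reads everything after `file_offset`, returns the complete lines as lossy
/// text, and advances `file_offset` by the number of raw bytes consumed.
/// Hypothetical helper: the name and signature are illustrative only.
fn poll_complete_lines<R: Read + Seek>(
    source: &mut R,
    file_offset: &mut u64,
) -> io::Result<String> {
    source.seek(SeekFrom::Start(*file_offset))?;

    let mut raw = Vec::new();
    source.read_to_end(&mut raw)?;

    // Find the record boundary on raw bytes, never on decoded text.
    let Some(last_newline) = raw.iter().rposition(|&byte| byte == b'\n') else {
        // No complete record yet: leave the cursor where it is.
        return Ok(String::new());
    };

    let consumed = last_newline + 1;
    *file_offset += consumed as u64;

    // Decode only the complete slice. The bytes after it are read again on the
    // next poll because the cursor did not move past them.
    let text = String::from_utf8_lossy(&raw[..consumed]).into_owned();
    Ok(text)
}
```

The returned string may be longer than the number of bytes consumed when replacement characters were inserted. That is harmless here, because the cursor was advanced from the raw length and not from the string.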

If file_offset is wrong, the reader can:

skip bytes
re-read bytes
corrupt one record at a chunk boundary
parse duplicated lines
drift farther from the true file position over time

This is especially easy to miss because from_utf8_lossy never returns an error. It always produces a displayable string, so the bug surfaces later and looks like a parser issue.

JSONL and delimiter parsing

The same rule applies to JSON Lines and delimiter-based formats:

fn split_complete_jsonl(raw: &[u8]) -> (&[u8], &[u8]) {
    match raw.iter().rposition(|&byte| byte == b'\n') {
        Some(index) => raw.split_at(index + 1),
        None => (&[], raw),
    }
}

Then decode only the complete part:

let (complete, trailing) = split_complete_jsonl(&buffer);
let text = String::from_utf8_lossy(complete);

for line in text.lines() {
    process_json_line(line);
}

buffer = trailing.to_vec();

The trailing bytes may contain an incomplete UTF-8 character, incomplete JSON, or both. Keep them as bytes until the next read.
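
To illustrate why the carry-over has to stay as bytes, here is a sketch of a small accumulator. LineBuffer and its push method are hypothetical names invented for this example, and it assumes newline-terminated records as in the rest of the article.

```rust
/// Accumulates raw chunks and yields only complete, newline-terminated lines.
/// Hypothetical type for illustration; not part of any library.
#[derive(Default)]
struct LineBuffer {
    /// Bytes received so far that do not yet end in a newline.
    pending: Vec<u8>,
}

impl LineBuffer {
    fn push(&mut self, chunk: &[u8]) -> Vec<String> {
        self.pending.extend_from_slice(chunk);

        // Locate the boundary on raw bytes.
        let Some(last_newline) = self.pending.iter().rposition(|&byte| byte == b'\n') else {
            return Vec::new();
        };

        // Remove the complete part; whatever remains stays in `pending` as bytes.
        let complete: Vec<u8> = self.pending.drain(..=last_newline).collect();

        // Decode only now, after the byte boundary is known.
        let text = String::from_utf8_lossy(&complete);
        let lines: Vec<String> = text.lines().map(str::to_owned).collect();
        lines
    }
}
```

If one chunk ends with the first byte of a two-byte character (for example 0xC3 of "é") and the next chunk starts with the second byte (0xA9), the character is reassembled in pending before any decoding happens. Decoding each chunk separately would instead turn both halves into replacement characters.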

Prefer strict UTF-8 when invalid bytes are not acceptable

from_utf8_lossy is for tolerant display or best-effort parsing. If invalid UTF-8 should be treated as corrupt input, use strict decoding:

let text = std::str::from_utf8(complete)
    .map_err(|error| format!("Invalid UTF-8 in log chunk: {error}"))?;

For logs and external tools, lossy decoding can be acceptable. For protocols, indexes, caches, and signed data, it usually is not.
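
When strict decoding fails, the error itself carries byte positions that are safe to use against the original slice. The sketch below is one possible way to surface them; InvalidUtf8 and decode_strict are hypothetical names, while valid_up_to and error_len are methods on the standard library's Utf8Error.

```rust
/// Where strict decoding failed, expressed as byte positions in the input slice.
/// Hypothetical error type for illustration.
#[derive(Debug, PartialEq)]
struct InvalidUtf8 {
    /// Number of leading bytes that are valid UTF-8.
    valid_up_to: usize,
    /// Length of the invalid sequence, or `None` if the input ended mid-character.
    invalid_len: Option<usize>,
}

/// Decodes a complete slice strictly, reporting the failing byte position on error.
fn decode_strict(complete: &[u8]) -> Result<&str, InvalidUtf8> {
    std::str::from_utf8(complete).map_err(|error| InvalidUtf8 {
        valid_up_to: error.valid_up_to(),
        invalid_len: error.error_len(),
    })
}
```

An invalid_len of None means the slice ended in the middle of a character, which in a stream reader usually signals a chunk that was cut too early rather than corrupt data.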

Verification test

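Test a chunk that contains an incomplete multi-byte sequence:

#[test]
fn offset_is_computed_on_raw_bytes() {
    // \xE3\x81 is a truncated three-byte character, cut off by the newline.
    let raw = b"ok\nbad:\xE3\x81\nnext\n";
    let newline = raw.iter().rposition(|&byte| byte == b'\n').unwrap();
    let complete_len = newline + 1;

    assert_eq!(&raw[..3], b"ok\n");
    assert_eq!(complete_len, raw.len());

    // The lossy string is one byte longer than the input, so an index taken
    // from it would point past the end of the raw buffer.
    let lossy = String::from_utf8_lossy(raw);
    assert_eq!(lossy.rfind('\n').unwrap() + 1, raw.len() + 1);
}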

Then add a test where the final chunk has no newline and must not advance the cursor.
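
One possible shape for that second test is sketched below. advance_past_complete_lines is a hypothetical stand-in for read_complete_lines, reduced to the cursor logic so the test can run on its own; the starting offset of 128 is arbitrary.

```rust
/// Stand-in for `read_complete_lines`, reduced to the cursor logic.
/// Returns how many raw bytes were consumed. Hypothetical helper.
fn advance_past_complete_lines(raw: &[u8], file_offset: &mut u64) -> usize {
    let Some(last_newline) = raw.iter().rposition(|&byte| byte == b'\n') else {
        // No newline: nothing is complete, so the cursor must not move.
        return 0;
    };

    let consumed = last_newline + 1;
    *file_offset += consumed as u64;
    consumed
}

#[test]
fn chunk_without_newline_does_not_advance_cursor() {
    let mut file_offset = 128_u64;

    // The chunk ends in a truncated character and has no newline.
    let consumed = advance_past_complete_lines(b"partial \xE3\x81", &mut file_offset);

    assert_eq!(consumed, 0);
    assert_eq!(file_offset, 128);
}
```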

Summary

Lossy UTF-8 decoding is fine for display, but byte offsets belong to bytes. In stream readers, file tailers, and JSONL parsers, find delimiters on raw bytes, advance cursors by raw byte lengths, and decode only complete slices.