Rust from_utf8_lossy Can Break Byte Offsets in Stream Readers
String::from_utf8_lossy is useful when you want to display imperfect bytes without failing the whole operation. It is dangerous when you use the resulting string to compute byte offsets back into the original buffer.
The rule is simple: if the offset will be used to seek, tail, resume, slice raw bytes, or update a file cursor, compute that offset on the raw bytes first. Decode only after the byte boundary is known.
The bug
This looks reasonable:
fn read_complete_lines(raw: &[u8], file_offset: &mut u64) {
    let text = String::from_utf8_lossy(raw);
    // BUG: this index is measured in the lossy string, not in `raw`.
    if let Some(last_newline) = text.rfind('\n') {
        let consumed = last_newline + 1;
        *file_offset += consumed as u64;
        let complete = &text[..consumed];
        process_lines(complete);
    }
}
The bug is that last_newline is a byte index into the lossy string, not a byte offset into raw.
When invalid or incomplete UTF-8 appears, from_utf8_lossy replaces each invalid sequence with the Unicode replacement character U+FFFD, which is three bytes long in UTF-8. The sequence it replaces is often shorter (a single stray byte, for example), so indices in the lossy string no longer line up with indices in raw. The cursor ends up in the wrong place, typically too far ahead.
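To make the mismatch concrete, here is a minimal sketch that computes the "consumed" length both ways for the same buffer. The helper name consumed_lengths is hypothetical and exists only for this illustration.

```rust
/// Returns (consumed length measured on the lossy string,
///          consumed length measured on the raw bytes),
/// where "consumed" means everything up to and including the last newline.
/// Hypothetical helper for illustration only.
fn consumed_lengths(raw: &[u8]) -> (Option<usize>, Option<usize>) {
    let text = String::from_utf8_lossy(raw);
    // Byte index into the lossy string, which may contain inserted U+FFFD.
    let from_lossy = text.rfind('\n').map(|index| index + 1);
    // Byte index into the original buffer.
    let from_raw = raw.iter().rposition(|&byte| byte == b'\n').map(|index| index + 1);
    (from_lossy, from_raw)
}
```

For the input b"a\xFFb\n", the raw length is 4 bytes, but the lossy string is 6 bytes long because the single byte 0xFF becomes a three-byte U+FFFD. A cursor advanced by the lossy value would overshoot by 2 bytes.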
Correct pattern: find byte boundaries first
Find the delimiter in the original bytes:
fn read_complete_lines(raw: &[u8], file_offset: &mut u64) {
    // Locate the last newline in the raw bytes, not in decoded text.
    let Some(last_newline_byte_index) = raw.iter().rposition(|&byte| byte == b'\n') else {
        return;
    };
    let complete_byte_len = last_newline_byte_index + 1;
    // Decode only the slice that ends on a known byte boundary.
    let complete = String::from_utf8_lossy(&raw[..complete_byte_len]);
    process_lines(&complete);
    *file_offset += complete_byte_len as u64;
}
Now the offset always matches the bytes you actually consumed.
Why this matters for tail readers
Tail readers often work like this:
1. remember file_offset
2. read bytes from file_offset
3. process complete records
4. advance file_offset
5. keep incomplete trailing bytes for next read
If file_offset is wrong, the reader can:
- skip bytes
- re-read bytes
- corrupt one record at a chunk boundary
- parse duplicated lines
- drift farther from the true file position over time
This is especially easy to miss because from_utf8_lossy never returns an error. It always hands back a displayable string, so the bug surfaces later and looks like a parser issue.
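The read loop described in this section can be sketched as a single polling function. This is a minimal sketch under assumptions, not a production tailer: the name poll_tail is hypothetical, the file is reopened on every poll, and incomplete trailing bytes are "kept" simply by not advancing the cursor past them, so they are re-read from the file next time.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;

/// Reads everything after `file_offset`, returns the complete lines, and
/// advances `file_offset` by the number of raw bytes consumed.
/// `poll_tail` is a hypothetical name used for illustration.
fn poll_tail(path: &Path, file_offset: &mut u64) -> io::Result<Vec<String>> {
    let mut file = File::open(path)?;
    file.seek(SeekFrom::Start(*file_offset))?;

    let mut raw = Vec::new();
    file.read_to_end(&mut raw)?;

    // Find the record boundary on the raw bytes, never on decoded text.
    let Some(last_newline) = raw.iter().rposition(|&byte| byte == b'\n') else {
        return Ok(Vec::new()); // no complete line yet; the cursor stays put
    };
    let consumed = last_newline + 1;

    // Decode only the complete part; invalid bytes become U+FFFD for display.
    let text = String::from_utf8_lossy(&raw[..consumed]);
    let lines: Vec<String> = text.lines().map(str::to_owned).collect();

    // Bytes after `consumed` stay in the file and are re-read on the next poll.
    *file_offset += consumed as u64;
    Ok(lines)
}
```

Because the cursor only ever moves by a raw byte count, an invalid byte in one line cannot shift the starting position of the next read.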
JSONL and delimiter parsing
The same rule applies to JSON Lines and delimiter-based formats:
fn split_complete_jsonl(raw: &[u8]) -> (&[u8], &[u8]) {
    match raw.iter().rposition(|&byte| byte == b'\n') {
        Some(index) => raw.split_at(index + 1),
        None => (&[], raw),
    }
}
Then decode only the complete part:
let (complete, trailing) = split_complete_jsonl(&buffer);
let text = String::from_utf8_lossy(complete);
for line in text.lines() {
    process_json_line(line);
}
buffer = trailing.to_vec();
The trailing bytes may contain an incomplete UTF-8 character, incomplete JSON, or both. Keep them as bytes until the next read.
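One way to carry the trailing bytes across reads is a small buffer type that accepts raw chunks and returns only complete lines. This is a sketch under assumptions: ChunkBuffer and its push method are hypothetical names, and lossy decoding is assumed to be acceptable for the data.

```rust
/// Accumulates raw chunks and hands back only complete, newline-terminated data.
/// `ChunkBuffer` is a hypothetical type for illustration.
#[derive(Default)]
struct ChunkBuffer {
    pending: Vec<u8>,
}

impl ChunkBuffer {
    /// Appends a chunk and returns the decoded complete lines, if any.
    fn push(&mut self, chunk: &[u8]) -> Vec<String> {
        self.pending.extend_from_slice(chunk);

        let Some(last_newline) = self.pending.iter().rposition(|&byte| byte == b'\n') else {
            return Vec::new(); // nothing complete yet; keep everything as bytes
        };

        // Split on the raw byte boundary; the tail stays undecoded.
        let trailing = self.pending.split_off(last_newline + 1);
        let complete = std::mem::replace(&mut self.pending, trailing);

        let text = String::from_utf8_lossy(&complete);
        let lines: Vec<String> = text.lines().map(str::to_owned).collect();
        lines
    }
}
```

If one chunk ends with b"caf\xC3" and the next starts with b"\xA9\n", the two halves of "é" are joined as bytes before decoding, so the line comes out as "café" rather than as text with replacement characters.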
Prefer strict UTF-8 when invalid bytes are not acceptable
from_utf8_lossy is for tolerant display or best-effort parsing. If invalid UTF-8 should be treated as corrupt input, use strict decoding:
let text = std::str::from_utf8(complete)
    .map_err(|error| format!("Invalid UTF-8 in log chunk: {error}"))?;
For logs and external tools, lossy decoding can be acceptable. For protocols, indexes, caches, and signed data, it usually is not.
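Strict decoding also makes it possible to report where the corruption is, because Utf8Error::valid_up_to() returns a byte index into the input slice. The sketch below assumes the caller knows the file offset of the first byte of the chunk; the names decode_strict and chunk_start are hypothetical.

```rust
/// Strictly decodes `complete`, reporting the file offset of the first invalid byte.
/// `chunk_start` is assumed to be the file offset of `complete[0]`.
/// Both names are hypothetical and used for illustration.
fn decode_strict(complete: &[u8], chunk_start: u64) -> Result<&str, String> {
    std::str::from_utf8(complete).map_err(|error| {
        // valid_up_to() is a byte index into `complete`, so it is safe to add
        // to a file offset, unlike an index into a lossy string.
        let bad_at = chunk_start + error.valid_up_to() as u64;
        format!("Invalid UTF-8 at file offset {bad_at}: {error}")
    })
}
```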
Verification test
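Test a chunk that contains a truncated multi-byte character before its final newline:
#[test]
fn offset_is_computed_on_raw_bytes() {
    // "\xE3\x81" is a truncated three-byte character in the middle of the chunk.
    let raw = b"ok\nbad:\xE3\x81\nnext\n";
    let newline = raw.iter().rposition(|&byte| byte == b'\n').unwrap();
    let complete_len = newline + 1;
    assert_eq!(&raw[..3], b"ok\n");
    assert_eq!(complete_len, raw.len());
    // The lossy string has a different length, so its indices are not raw offsets.
    let lossy = String::from_utf8_lossy(raw);
    assert_ne!(lossy.rfind('\n').unwrap() + 1, raw.len());
}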
Then add a test where the final chunk has no newline and must not advance the cursor.
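A possible shape for that second test is sketched below. It assumes the boundary logic is factored into a small helper so the cursor arithmetic can be checked without any I/O; complete_byte_len is a hypothetical name, not part of the code above.

```rust
/// Returns how many raw bytes form complete lines (0 if there is no newline).
/// Hypothetical helper mirroring the boundary logic in `read_complete_lines`.
fn complete_byte_len(raw: &[u8]) -> usize {
    raw.iter()
        .rposition(|&byte| byte == b'\n')
        .map_or(0, |index| index + 1)
}

#[test]
fn chunk_without_newline_does_not_advance_cursor() {
    let mut file_offset: u64 = 42;
    // Ends mid-character and contains no newline at all.
    let raw = b"partial \xE3\x81";
    file_offset += complete_byte_len(raw) as u64;
    assert_eq!(file_offset, 42);
}
```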
Summary
Lossy UTF-8 decoding is fine for display, but byte offsets belong to bytes. In stream readers, file tailers, and JSONL parsers, find delimiters on raw bytes, advance cursors by raw byte lengths, and decode only complete slices.