If a Source Parser Strips Comments, Do Not Count Lines on the Stripped Text

2026-05-30

parser, regex, c-language, rust, debugging

Regex-based source parsers often remove comments before matching functions.

That can reduce false positives, but it introduces a subtle bug: offsets in the cleaned text are not offsets in the original source. If you count line numbers on the stripped text, multiline comments collapse the file and every function below them can point to the wrong line.

Use cleaned text for matching, but original text for locations.

The symptom: source viewer jumps to the wrong function

This usually appears in tools that connect parsed symbols back to source:

the function list shows the correct name
clicking the function opens the wrong line
errors appear only below multiline comments
single-line fixtures pass
real embedded C files fail

The parser found the function. The location mapping is wrong.

Destructive comment stripping changes offsets

Consider this source:

/* Header
 * spanning
 * several lines
 */
void init(void) {
}

If comment removal replaces the whole block with one space, the cleaned text has three fewer newline characters, and init appears to start on line 2 instead of line 5. A match offset from the cleaned string no longer points to the same line in the original string.

Counting lines here is wrong:

let cleaned = remove_comments(source);
let offset = regex.find(&cleaned).unwrap().start();
let line = byte_offset_to_line(&cleaned, offset);

The UI needs line numbers in the original file, not the cleaned working string.
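The collapse can be reproduced in a few lines. This is a minimal sketch, not the code of any particular tool: strip_comments_destructive and line_at are hypothetical names, and the stripper assumes only /* */ comments matter.

```rust
/// Hypothetical destructive stripper: every /* ... */ block becomes one space,
/// so the newlines inside the comment are lost.
fn strip_comments_destructive(src: &str) -> String {
    let mut out = String::new();
    let mut rest = src;
    while let Some(start) = rest.find("/*") {
        out.push_str(&rest[..start]);
        out.push(' ');
        // Continue after the closing "*/"; an unterminated comment eats the rest.
        rest = match rest[start + 2..].find("*/") {
            Some(end) => &rest[start + 2 + end + 2..],
            None => "",
        };
    }
    out.push_str(rest);
    out
}

/// 1-based line number of a byte offset, counted in whichever text is passed in.
fn line_at(text: &str, offset: usize) -> u32 {
    let end = offset.min(text.len());
    text.as_bytes()[..end].iter().filter(|&&b| b == b'\n').count() as u32 + 1
}
```

For the header example above, line_at on the cleaned text gives 2 for the offset of "void init", while line_at on the original text gives 5.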

Option 1: preserve newlines while stripping comments

At minimum, keep newline characters inside comments:

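/// Removes /* ... */ comments but keeps every newline inside them,
/// so line numbers in the result match the original.
fn strip_comments_preserve_newlines(src: &str) -> String {
    let mut out = String::with_capacity(src.len());
    let mut chars = src.chars().peekable();

    while let Some(ch) = chars.next() {
        if ch == '/' && chars.peek() == Some(&'*') {
            chars.next(); // consume the '*' of the opener
            // C treats a comment as one space; emit it so `a/**/b` does not become `ab`.
            out.push(' ');

            // Skip to the closing "*/", copying only newlines.
            // An unterminated comment runs to the end of the input.
            while let Some(c) = chars.next() {
                if c == '\n' {
                    out.push('\n');
                } else if c == '*' && chars.peek() == Some(&'/') {
                    chars.next();
                    break;
                }
            }
        } else {
            out.push(ch);
        }
    }

    out
}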

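This keeps line counts stable, so a line number counted on the cleaned text matches the original. Byte offsets and columns still shift, because the rest of the comment text is removed. The function also handles only /* */ block comments and knows nothing about string literals.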
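If byte offsets must stay valid too, one variant is to overwrite comment bytes with spaces instead of deleting them. This is a sketch of a different technique than the function above, under the assumptions that the input is UTF-8 and that only /* */ comments matter; blank_comments is a hypothetical name.

```rust
/// Hypothetical variant: blanks out /* ... */ comments byte for byte.
/// The result has the same length as the input, so any match offset found in
/// it is also a valid offset in the original.
fn blank_comments(src: &str) -> String {
    let bytes = src.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;

    while i < bytes.len() {
        if bytes[i] == b'/' && bytes.get(i + 1) == Some(&b'*') {
            out.extend_from_slice(b"  "); // blank the "/*"
            i += 2;
            while i < bytes.len() {
                if bytes[i] == b'*' && bytes.get(i + 1) == Some(&b'/') {
                    out.extend_from_slice(b"  "); // blank the "*/"
                    i += 2;
                    break;
                }
                // Keep line endings, replace every other byte with a space.
                let b = bytes[i];
                out.push(if b == b'\n' || b == b'\r' { b } else { b' ' });
                i += 1;
            }
        } else {
            out.push(bytes[i]);
            i += 1;
        }
    }

    // Comment delimiters are ASCII, so only whole characters were overwritten
    // and the buffer is still valid UTF-8.
    String::from_utf8(out).expect("blanking whole characters keeps UTF-8 valid")
}
```

With this variant, an offset from a regex match on the blanked text can be passed straight to a line counter that reads the original source.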

Option 2: map matches back to the original source

A safer pattern is:

use cleaned text to avoid matching inside comments
extract the function name from the match
find the real location in the original source
count lines on the original source

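fn byte_offset_to_line(src: &str, offset: usize) -> u32 {
    // Slice the bytes, not the &str: a &str slice panics when `offset`
    // falls inside a multi-byte UTF-8 character.
    let end = offset.min(src.len());
    let newlines = src.as_bytes()[..end].iter().filter(|&&b| b == b'\n').count();
    newlines as u32 + 1
}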

fn line_for_function(original: &str, function_name: &str) -> Option<u32> {
    let offset = original.find(function_name)?;
    Some(byte_offset_to_line(original, offset))
}

For production parsers, avoid a naive find if the same name can appear in comments or declarations above the definition. Use nearby tokens, file context, or a small scanner to locate the definition in the original text.
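A small scanner of that kind might look like the sketch below. It is an illustration under stated assumptions, not a complete C parser: definition_line and its helpers are hypothetical names, only /* */ comments are skipped, and a "definition" is taken to mean the name as a whole identifier, followed by a parenthesized parameter list and then an opening brace.

```rust
/// True for bytes that can appear in a C identifier.
fn is_ident_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Rejects matches inside longer names, such as `init` inside `reinit`.
fn is_whole_identifier(bytes: &[u8], start: usize, len: usize) -> bool {
    let before_ok = start == 0 || !is_ident_byte(bytes[start - 1]);
    let after_ok = bytes.get(start + len).map_or(true, |&b| !is_ident_byte(b));
    before_ok && after_ok
}

/// Checks that what follows the name is `( ... )` and then `{`.
/// A prototype ends in `;` instead and is rejected.
fn looks_like_definition(rest: &[u8]) -> bool {
    let mut it = rest.iter().copied().skip_while(|b| b.is_ascii_whitespace());
    if it.next() != Some(b'(') {
        return false;
    }
    let mut depth = 1;
    for b in it.by_ref() {
        match b {
            b'(' => depth += 1,
            b')' => {
                depth -= 1;
                if depth == 0 {
                    break;
                }
            }
            _ => {}
        }
    }
    depth == 0 && it.skip_while(|b| b.is_ascii_whitespace()).next() == Some(b'{')
}

/// Hypothetical helper: 1-based line, in the original text, of the first
/// definition of `name` found outside block comments.
fn definition_line(original: &str, name: &str) -> Option<u32> {
    if name.is_empty() {
        return None;
    }
    let bytes = original.as_bytes();
    let mut i = 0;
    let mut line = 1u32;
    let mut in_comment = false;

    while i < bytes.len() {
        if in_comment {
            if bytes[i..].starts_with(b"*/") {
                in_comment = false;
                i += 2;
                continue;
            }
        } else if bytes[i..].starts_with(b"/*") {
            in_comment = true;
            i += 2;
            continue;
        } else if bytes[i..].starts_with(name.as_bytes())
            && is_whole_identifier(bytes, i, name.len())
            && looks_like_definition(&bytes[i + name.len()..])
        {
            return Some(line);
        }
        // Count newlines everywhere, including inside comments.
        if bytes[i] == b'\n' {
            line += 1;
        }
        i += 1;
    }
    None
}
```

Because the scanner walks the original text and counts its newlines directly, no offset from a cleaned copy is involved. It skips mentions in comments, prototypes, and longer identifiers that contain the name, which a naive find would return.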

Add a multiline comment fixture

The regression test should include the exact failure shape:

#[test]
fn function_line_after_multiline_comment_is_original_line() {
    let source = "/* a\n * b\n * c\n */\nvoid init(void) {\n}\n";
    let cleaned = strip_comments_preserve_newlines(source);

    assert!(cleaned.contains("void init"));

    let line = line_for_function(source, "init").unwrap();
    assert_eq!(line, 5);
}

Do not test only one-line comments. That misses the offset collapse.

Watch out for strings and encodings

Comment stripping has more traps:

/* not a comment */ inside a string literal
CRLF vs LF line endings
Shift-JIS or other legacy encodings
duplicated function names in different files
declarations above definitions

If those matter, a real lexer or tree-sitter parser may be a better long-term solution. But even with regex parsing, line numbers must come from the original source.
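The string-literal trap can be handled without a full lexer. The sketch below extends the newline-preserving idea with one extra rule: text inside double-quoted or single-quoted literals is copied through untouched. strip_comments_keep_strings is a hypothetical name, and the code assumes C-style backslash escapes and /* */ comments only.

```rust
/// Hypothetical variant of a newline-preserving stripper that does not treat
/// "/* ... */" inside a string or character literal as a comment.
fn strip_comments_keep_strings(src: &str) -> String {
    let mut out = String::with_capacity(src.len());
    let mut chars = src.chars().peekable();

    while let Some(ch) = chars.next() {
        match ch {
            // Literal: copy through the matching closing quote.
            '"' | '\'' => {
                out.push(ch);
                while let Some(c) = chars.next() {
                    out.push(c);
                    if c == '\\' {
                        // Copy the escaped character so \" does not end the literal.
                        if let Some(escaped) = chars.next() {
                            out.push(escaped);
                        }
                    } else if c == ch || c == '\n' {
                        // Closing quote, or an unterminated literal ending at the line break.
                        break;
                    }
                }
            }
            // Block comment: drop the text, keep the newlines.
            '/' if chars.peek() == Some(&'*') => {
                chars.next();
                while let Some(c) = chars.next() {
                    if c == '\n' {
                        out.push('\n');
                    } else if c == '*' && chars.peek() == Some(&'/') {
                        chars.next();
                        break;
                    }
                }
            }
            _ => out.push(ch),
        }
    }

    out
}
```

Single quotes are tracked as well because a C character literal such as '"' would otherwise look like the start of a string.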

Verification checklist

Check that:

multiline comments above a function do not change reported line numbers
a function on line 1 reports line 1, not 0
CRLF files pass
the source viewer highlights the function selected in the list
tests use realistic source snippets, not only tiny one-line fixtures
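The line 1 and CRLF items can be covered with one more fixture. This is a sketch: it repeats a byte-based byte_offset_to_line so the block stands alone, and the source snippet and test name are made up for illustration.

```rust
/// 1-based line number for a byte offset in the original source.
/// Counting '\n' bytes works for both LF and CRLF files, because every
/// CRLF pair contains exactly one '\n'.
fn byte_offset_to_line(src: &str, offset: usize) -> u32 {
    let end = offset.min(src.len());
    src.as_bytes()[..end].iter().filter(|&&b| b == b'\n').count() as u32 + 1
}

#[test]
fn crlf_source_reports_original_lines() {
    let source =
        "void first(void) {\r\n}\r\n/* a\r\n * b\r\n */\r\nvoid second(void) {\r\n}\r\n";

    // A function at the very start of the file is on line 1, not 0.
    let first = source.find("void first").unwrap();
    assert_eq!(byte_offset_to_line(source, first), 1);

    // Two code lines plus a three-line comment come before `second`.
    let second = source.find("void second").unwrap();
    assert_eq!(byte_offset_to_line(source, second), 6);
}
```

Files that use a bare carriage return as the line ending would need a different counter; this one only counts '\n'.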