If a Source Parser Strips Comments, Do Not Count Lines on the Stripped Text
Regex-based source parsers often remove comments before matching functions.
That can reduce false positives, but it introduces a subtle bug: offsets in the cleaned text are not offsets in the original source. If you count line numbers on the stripped text, multiline comments collapse the file and every function below them can point to the wrong line.
Use cleaned text for matching, but original text for locations.
The symptom: source viewer jumps to the wrong function
This usually appears in tools that connect parsed symbols back to source:
- function list shows a correct name
- clicking the function opens the wrong line
- errors appear only below multiline comments
- single-line fixtures pass
- real embedded C files fail
The parser found the function. The location mapping is wrong.
Destructive comment stripping changes offsets
Consider this source:
/* Header
* spanning
* several lines
*/
void init(void) {
}
If comment removal replaces the whole block with one space, the cleaned text has fewer newline characters. A match offset from the cleaned string no longer points to the same line in the original string.
Counting lines here is wrong:
let cleaned = remove_comments(source);
let offset = regex.find(&cleaned).unwrap().start();
let line = byte_offset_to_line(&cleaned, offset);
The UI needs line numbers in the original file, not the cleaned working string.
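To make the collapse concrete, here is a minimal sketch that runs the header example through a destructive stripper and compares the two line counts. The names `strip_block_comments_destructive`, `line_at`, and `compare_lines` are hypothetical helpers written for this illustration, and the stripper deliberately ignores `//` comments and string literals.

```rust
/// Hypothetical destructive stripper: replaces each `/* ... */` block with one space.
/// It ignores `//` comments and string literals to keep the demo short.
fn strip_block_comments_destructive(src: &str) -> String {
    let mut out = String::with_capacity(src.len());
    let mut rest = src;
    while let Some(start) = rest.find("/*") {
        out.push_str(&rest[..start]);
        out.push(' ');
        // Skip past the closing `*/`, or drop the tail if the comment never ends.
        rest = match rest[start + 2..].find("*/") {
            Some(end) => &rest[start + 2 + end + 2..],
            None => "",
        };
    }
    out.push_str(rest);
    out
}

/// 1-based line number of the byte at `offset` within `text`.
fn line_at(text: &str, offset: usize) -> usize {
    text.as_bytes()[..offset.min(text.len())]
        .iter()
        .filter(|&&b| b == b'\n')
        .count()
        + 1
}

/// Returns (line counted on the cleaned text, line counted on the original text).
fn compare_lines(source: &str, needle: &str) -> Option<(usize, usize)> {
    let cleaned = strip_block_comments_destructive(source);
    let on_cleaned = line_at(&cleaned, cleaned.find(needle)?);
    let on_original = line_at(source, source.find(needle)?);
    Some((on_cleaned, on_original))
}
```

For the four-line header above, `compare_lines(source, "void init")` returns `Some((2, 5))`: the cleaned text places the function on line 2, while the file the user sees has it on line 5.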
Option 1: preserve newlines while stripping comments
At minimum, keep newline characters inside comments:
fn strip_comments_preserve_newlines(src: &str) -> String {
    let mut out = String::new();
    let mut chars = src.chars().peekable();
    while let Some(ch) = chars.next() {
        if ch == '/' && chars.peek() == Some(&'*') {
            chars.next();
            while let Some(c) = chars.next() {
                if c == '\n' {
                    out.push('\n');
                } else if c == '*' && chars.peek() == Some(&'/') {
                    chars.next();
                    break;
                }
            }
        } else {
            out.push(ch);
        }
    }
    out
}
This keeps line counts stable, so a line number counted on the cleaned text matches the original. Byte and column offsets still shift, because the comment text itself is removed.
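If byte offsets also need to line up, one possible variation (an assumption of this sketch, not something the approach above requires) is to blank comments in place: every comment byte becomes a space, newlines are kept, and the output has exactly the same length as the input. The function name `blank_block_comments` is hypothetical, and only `/* ... */` comments are handled.

```rust
/// Replaces every byte of a `/* ... */` comment with a space, except line breaks.
/// The output has the same byte length as the input, so a byte offset found in
/// the blanked text is also a valid offset in the original.
/// Assumption: `//` comments and string literals are not recognized.
fn blank_block_comments(src: &str) -> String {
    let bytes = src.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    let mut in_comment = false;
    while i < bytes.len() {
        if !in_comment && bytes[i..].starts_with(b"/*") {
            in_comment = true;
            out.extend_from_slice(b"  ");
            i += 2;
        } else if in_comment && bytes[i..].starts_with(b"*/") {
            in_comment = false;
            out.extend_from_slice(b"  ");
            i += 2;
        } else if in_comment {
            // Keep line breaks; blank everything else, one space per byte.
            let b = bytes[i];
            out.push(if b == b'\n' || b == b'\r' { b } else { b' ' });
            i += 1;
        } else {
            out.push(bytes[i]);
            i += 1;
        }
    }
    // Bytes outside comments are copied unchanged and comment bytes become ASCII,
    // so the result is still valid UTF-8.
    String::from_utf8(out).expect("blanking preserves UTF-8 validity")
}
```

The trade-off is that the working string is no smaller than the source, but a regex match offset can then be used against the original text directly.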
Option 2: map matches back to the original source
A safer pattern is:
- use cleaned text to avoid matching inside comments
- extract the function name from the match
- find the real location in the original source
- count lines on the original source
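fn byte_offset_to_line(src: &str, offset: usize) -> u32 {
    // Slice the bytes, not the str: a str slice panics if `offset`
    // falls inside a multi-byte character.
    src.as_bytes()[..offset.min(src.len())]
        .iter()
        .filter(|&&b| b == b'\n')
        .count() as u32 + 1
}
fn line_for_function(original: &str, function_name: &str) -> Option<u32> {
    // Naive lookup: returns the first textual occurrence of the name.
    let offset = original.find(function_name)?;
    Some(byte_offset_to_line(original, offset))
}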
For production parsers, avoid a naive find if the same name can appear in comments or declarations above the definition. Use nearby tokens, file context, or a small scanner to locate the definition in the original text.
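As a rough illustration of what a "small scanner" could look like, the sketch below walks the original text, skips `/* ... */` comments, and accepts a name only when it is a whole identifier followed by a parenthesized parameter list and an opening brace. The names `find_definition_offset`, `is_definition_tail`, and `line_for_definition` are hypothetical, and the heuristics are assumptions: `//` comments, string literals, macros, and K&R-style definitions are not handled.

```rust
/// True if `rest` (the text right after the name) looks like `( ... ) {`.
fn is_definition_tail(rest: &str) -> bool {
    let rest = rest.trim_start();
    if !rest.starts_with('(') {
        return false;
    }
    let mut depth = 0usize;
    for (idx, ch) in rest.char_indices() {
        match ch {
            '(' => depth += 1,
            ')' => {
                depth -= 1;
                if depth == 0 {
                    // A declaration ends with `;` here; a definition opens a body.
                    return rest[idx + 1..].trim_start().starts_with('{');
                }
            }
            _ => {}
        }
    }
    false
}

/// Byte offset in `src` of the first occurrence of `name` that looks like a definition.
fn find_definition_offset(src: &str, name: &str) -> Option<usize> {
    if name.is_empty() {
        return None;
    }
    let bytes = src.as_bytes();
    let is_ident = |b: u8| b.is_ascii_alphanumeric() || b == b'_';
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i..].starts_with(b"/*") {
            // Skip the whole block comment, or the rest of the file if unterminated.
            i = match src[i + 2..].find("*/") {
                Some(end) => i + 2 + end + 2,
                None => bytes.len(),
            };
            continue;
        }
        if bytes[i..].starts_with(name.as_bytes()) && (i == 0 || !is_ident(bytes[i - 1])) {
            let after = i + name.len();
            let ends_cleanly = bytes.get(after).map_or(true, |&b| !is_ident(b));
            if ends_cleanly && is_definition_tail(&src[after..]) {
                return Some(i);
            }
        }
        i += 1;
    }
    None
}

/// 1-based line of the definition, counted on the original source.
fn line_for_definition(original: &str, name: &str) -> Option<u32> {
    let offset = find_definition_offset(original, name)?;
    let newlines = original.as_bytes()[..offset]
        .iter()
        .filter(|&&b| b == b'\n')
        .count();
    Some(newlines as u32 + 1)
}
```

With a file that mentions `init` in a header comment, declares it on line 2, and defines it on line 7, a plain `find` reports line 1, while this scanner reports line 7.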
Add a multiline comment fixture
The regression test should include the exact failure shape:
#[test]
fn function_line_after_multiline_comment_is_original_line() {
    let source = "/* a\n * b\n * c\n */\nvoid init(void) {\n}\n";
    let cleaned = strip_comments_preserve_newlines(source);
    assert!(cleaned.contains("void init"));
    let line = line_for_function(source, "init").unwrap();
    assert_eq!(line, 5);
}
Do not test only one-line comments. That misses the offset collapse.
Watch out for strings and encodings
Comment stripping has more traps:
- /* not a comment */ inside a string literal
- CRLF vs LF line endings
- Shift-JIS or other legacy encodings
- duplicated function names in different files
- declarations above definitions
If those matter, a real lexer or tree-sitter parser may be a better long-term solution. But even with regex parsing, line numbers must come from the original source.
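For the string-literal trap specifically, a stripper can track whether it is inside a double-quoted literal and leave comment-like text there alone. The following is a minimal sketch under stated assumptions: C-style backslash escapes, no character literals such as `'"'`, no `//` comments, and a hypothetical function name `strip_comments_string_aware`.

```rust
/// Strips `/* ... */` comments but leaves comment-like text inside
/// double-quoted string literals untouched. Newlines inside comments are kept.
/// Assumptions: C-style `\"` escapes; no char literals, `//` comments, or raw strings.
fn strip_comments_string_aware(src: &str) -> String {
    let mut out = String::with_capacity(src.len());
    let mut chars = src.chars().peekable();
    let mut in_string = false;
    while let Some(ch) = chars.next() {
        if in_string {
            out.push(ch);
            if ch == '\\' {
                // Copy the escaped character verbatim so `\"` does not end the literal.
                if let Some(escaped) = chars.next() {
                    out.push(escaped);
                }
            } else if ch == '"' {
                in_string = false;
            }
        } else if ch == '"' {
            in_string = true;
            out.push(ch);
        } else if ch == '/' && chars.peek() == Some(&'*') {
            chars.next(); // consume the '*' that opens the comment
            let mut prev = '\0';
            for c in chars.by_ref() {
                if c == '\n' {
                    out.push('\n');
                } else if prev == '*' && c == '/' {
                    break;
                }
                prev = c;
            }
        } else {
            out.push(ch);
        }
    }
    out
}
```

This handles one trap at a time; once several of them apply, the hand-written state machine starts to approximate a lexer, which is the point at which a real one is likely the better investment.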
Verification checklist
Check that:
- multiline comments above a function do not change reported line numbers
- line 1 functions report line 1, not 0
- CRLF files pass
- source viewer highlights the same function the list selected
- tests use realistic source snippets, not only tiny one-line fixtures
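Two of these checks, the line-1 case and CRLF input, can be sketched as unit tests. The block repeats a byte-based version of `byte_offset_to_line` so that it compiles on its own; the test names and fixtures are hypothetical.

```rust
/// 1-based line number of the byte at `offset`, counted on the original source.
fn byte_offset_to_line(src: &str, offset: usize) -> u32 {
    src.as_bytes()[..offset.min(src.len())]
        .iter()
        .filter(|&&b| b == b'\n')
        .count() as u32
        + 1
}

#[cfg(test)]
mod checklist_tests {
    use super::*;

    #[test]
    fn function_on_first_line_reports_line_one() {
        let source = "void init(void) {\n}\n";
        let offset = source.find("void init").unwrap();
        assert_eq!(byte_offset_to_line(source, offset), 1);
    }

    #[test]
    fn crlf_line_endings_count_once_per_line() {
        // Each CRLF pair contains exactly one '\n', so counting '\n' stays correct.
        let source = "void a(void) {\r\n}\r\n\r\nvoid b(void) {\r\n}\r\n";
        let offset = source.find("void b").unwrap();
        assert_eq!(byte_offset_to_line(source, offset), 4);
    }
}
```

Counting only `\n` covers LF and CRLF files; input that uses a bare `\r` as its line terminator would need separate handling.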