
Thoughts on a Technical Approach for Audio-Text Synchronization

So I ran into a problem while enhancing my project.

The goal was simple: Highlight the text in the browser as the audio plays.

Initially, I was quite proud of my idea: use binary search to map the audio's playback position onto the start/end times of the generated text segments. But I soon ran into inaccurate matches. Why? Because the Markdown I render is the raw content returned by the AI, while the text I send to the TTS service is a regex-cleaned plain-text version of it. The audio's currentTime therefore corresponds to the cleaned plain text, not the rendered Markdown the user actually sees, which makes matching with includes or indexOf quite fragile.
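The binary-search part of the idea is sound on its own. Here is a sketch of that lookup; the segment shape (start/end in seconds plus the cleaned text) is my assumption, not the project's actual data model:

```typescript
interface Segment {
  start: number; // segment start time in seconds
  end: number;   // segment end time in seconds
  text: string;  // cleaned plain text that was sent to the TTS service
}

// Returns the index of the segment covering `time`, or -1 if none does.
// Assumes segments are sorted by start time and non-overlapping.
function findSegmentIndex(segments: Segment[], time: number): number {
  let lo = 0;
  let hi = segments.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (time < segments[mid].start) {
      hi = mid - 1;
    } else if (time >= segments[mid].end) {
      lo = mid + 1;
    } else {
      return mid;
    }
  }
  return -1;
}
```

In practice this would be called from the audio element's timeupdate listener with audio.currentTime. The lookup itself was never the weak point; the weak point was turning the found segment's text back into a DOM location.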

Then I learned how other people did it.

Large companies would absolutely not use string matching (indexOf or includes) on the front end for highlighting, because it’s extremely unreliable (as I’ve discovered, punctuation, formatting, and code blocks can all break the matching).
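A toy illustration of that failure mode (the strings are made up, but the mismatch is exactly the one described above):

```typescript
// The TTS service received cleaned plain text...
const cleanedSegment = "Use the map function to transform arrays.";

// ...but the string available on the frontend is the raw Markdown,
// which still carries inline markers such as backticks.
const rawMarkdown = "Use the `map` function to transform arrays.";

// indexOf fails purely because of the formatting characters:
const pos = rawMarkdown.indexOf(cleanedSegment); // -1, no match
```

One stray backtick, asterisk, or piece of punctuation is enough to break every match downstream, which is why the matching has to move off of strings entirely.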

The general solution is "ID mapping based on an AST (Abstract Syntax Tree)".

Core idea: don't match text, match IDs.

When processing the Markdown on the backend, don't clean the text directly with regexes. Instead, use a Markdown parser to break the content into nodes and give each node a unique id; when calling the TTS engine to generate audio for a node, keep that id.
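A minimal stand-in for that backend step. A real implementation would walk the AST of a proper Markdown parser (remark/mdast, for example); here I split on blank lines just to keep the sketch self-contained, and all names are my own:

```typescript
interface TtsNode {
  id: string;       // stable id shared by the backend, the TTS audio, and the frontend
  markdown: string; // original Markdown block, kept for rendering
  ttsText: string;  // cleaned text actually sent to the TTS engine
}

function buildTtsNodes(markdown: string): TtsNode[] {
  return markdown
    .split(/\n\s*\n/)                          // crude block split; an AST parser does this properly
    .filter((block) => block.trim().length > 0)
    .map((block, i) => ({
      id: `seg-${i}`,
      markdown: block,
      // naive cleanup stripping a few inline markers; a real pipeline would
      // derive the plain text from the AST nodes instead of regexes
      ttsText: block.replace(/[`*_#>]/g, "").trim(),
    }));
}
```

The point of the shape is that each node's id travels with its audio: the TTS timings come back keyed by the same id, so nothing downstream ever has to match on text again.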

When rendering the Markdown on the frontend, use a custom renderer to inject that backend-generated id into a data-id attribute on the corresponding HTML tag.
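Here is the id-injection step sketched as a plain string renderer, so the idea is visible without React; with react-markdown the same injection would live in a custom component. The node shape is assumed, and HTML escaping is omitted for brevity:

```typescript
interface RenderNode {
  id: string;   // the backend-assigned id
  html: string; // the node's rendered inner HTML (assumed already safe)
}

// Wraps each node so its id survives into the DOM as a data attribute.
function renderWithIds(nodes: RenderNode[]): string {
  return nodes
    .map((n) => `<p data-id="${n.id}">${n.html}</p>`)
    .join("\n");
}
```

At playback time, highlighting becomes a lookup instead of a text match: find the active id from the audio timings, then toggle a class via document.querySelector(`[data-id="${id}"]`).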

But since I'm using react-markdown, it may be more robust to have the backend return HTML/JSON that already carries the IDs and render that directly on the frontend.