diffWords crashes with intlSegmenter when text contains orphaned combining marks #664

@pravbeseda

Description

diffWords throws an error in replaceSuffix when both conditions are met:

  1. intlSegmenter option is provided
  2. Input text contains an orphaned combining mark (e.g. U+0300) preceded by whitespace

Minimal reproduction

import { diffWords } from 'diff';

const segmenter = new Intl.Segmenter(undefined, { granularity: 'word' });

diffWords(
  '* BHG, N 2029; \r\n* Λόγος εἰς τὸν ... \u0300Α next words',
  '* BHG, N 2029; \r\n* Λόγος εἰς τὸν ... \u0300Α changed text',
  { intlSegmenter: segmenter }
);
// Error: string "* BHG, N 2029; \r\n* Λόγος εἰς τὸν ... ̀"
//   doesn't end with suffix " "; this is a bug

Even simpler:

diffWords(
  'abc \u0300X def',
  'abc \u0300Y ghi',
  { intlSegmenter: segmenter }
);
// Error: string "abc ̀" doesn't end with suffix " "; this is a bug

Without intlSegmenter, both examples work fine.

Root cause

Intl.Segmenter treats " \u0300" (space + combining grave accent) as a single non-word segment:

const segments = [...segmenter.segment('abc \u0300X def')];
// "abc"  — isWordLike: true
// " ̀"   — isWordLike: false  ← space + combining mark merged
// "X"    — isWordLike: true
// " "    — isWordLike: false
// "def"  — isWordLike: true

This causes dedupeWhitespaceInChangeObjects in postProcess to fail. It calls replaceSuffix(startKeep.value, newWsPrefix, commonWsPrefix), which expects startKeep.value to end with a space. Because the space was merged with the combining mark into a single token, the keep-chunk ends with " \u0300" instead of " ", and replaceSuffix throws.

Without intlSegmenter, the regex-based word splitter treats space and combining mark separately, so the bug doesn't occur.
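
To check whether a given input hits this case before diffing, the merged segments can be listed with Intl.Segmenter alone. This is only a sketch: findMergedWhitespaceSegments is a hypothetical helper name, not part of diff, and it assumes that "whitespace immediately followed by a Unicode mark inside a non-word segment" is the pattern that matters.

```javascript
// Hypothetical helper (not part of the diff package): returns the non-word
// segments in which whitespace and a combining mark ended up in one token.
function findMergedWhitespaceSegments(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  const merged = [];
  for (const { segment, index, isWordLike } of segmenter.segment(text)) {
    // \p{M} matches any Unicode mark; the `u` flag is required for \p{...}.
    if (!isWordLike && /\s\p{M}/u.test(segment)) {
      merged.push({ index, segment });
    }
  }
  return merged;
}

// Usage: a non-empty result means the input contains the pattern described above.
const suspicious = findMergedWhitespaceSegments('abc \u0300X def');
console.log(suspicious.map(s => JSON.stringify(s.segment)));
```

An empty result does not prove that diffWords will succeed; it only rules out this particular pattern.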

Versions

  • diff: 7.0.0, 8.0.0–8.0.3 (all affected)
  • Node.js: v22.19.0
  • Also reproduced in Chrome 133

Workaround

Strip orphaned combining marks (a combining mark preceded by whitespace) before diffing. The regex below covers only the ranges U+0300 to U+036F and U+0483 to U+0489:

const clean = text => text.replace(/(\s)[\u0300-\u036F\u0483-\u0489]+/g, '$1');
diffWords(clean(oldStr), clean(newStr), { intlSegmenter: segmenter });
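
A possibly broader variant of the same idea uses the Unicode property escape \p{M} instead of hard-coded ranges. This is a sketch under assumptions: stripOrphanedMarks is a hypothetical name, it treats every character in the Mark category as strippable when it follows whitespace, and it also strips a mark at the very start of the string, which may or may not trigger the same error.

```javascript
// Hypothetical variant of the workaround: remove any run of Unicode marks
// that directly follows whitespace or sits at the start of the string.
// Marks attached to a base character (e.g. "e\u0301") are left untouched.
const stripOrphanedMarks = text => text.replace(/(^|\s)\p{M}+/gu, '$1');

console.log(JSON.stringify(stripOrphanedMarks('abc \u0300X def'))); // "abc X def"
```

Like the original workaround, this alters the input, so the stripped marks will not appear in the resulting diff.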
