Add -c option to split-sentences.perl by jelmervdl · Pull Request #25 · kpu/preprocess

jelmervdl · 2021-02-17T10:07:45Z

Some documents contain extremely long lines of generated text (most often links to search result pages) that take forever to parse with the regular expressions in split-sentences.perl. With the -c option, these lines can be skipped entirely.
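
To illustrate the idea, here is a minimal sketch of a length guard placed in front of an expensive regex-based splitter. It assumes the limit is a maximum character count per line and that over-long lines are dropped; the PR description does not spell out either detail. The names `within_limit`, `filter_lines` and `$max_chars` are hypothetical and not taken from split-sentences.perl.

```perl
use strict;
use warnings;

# Hypothetical helper: true when a line is short enough to hand to the
# regex-based splitter. A limit of 0 is treated as "no limit" (an
# assumption made for this sketch, not documented behaviour of -c).
sub within_limit {
    my ($line, $max_chars) = @_;
    return 1 if !$max_chars;
    return length($line) <= $max_chars;
}

# Keep only the lines that fit within the limit; the rest are skipped
# before any sentence-splitting regex ever sees them.
sub filter_lines {
    my ($max_chars, @lines) = @_;
    return grep { within_limit($_, $max_chars) } @lines;
}

my @input = ("A short sentence. Another one.", "x" x 5000);
my @kept  = filter_lines(1000, @input);
print scalar(@kept), " of ", scalar(@input), " lines kept\n";
```

The check costs a single `length` call per line, so it stays cheap even for the pathological inputs it is meant to filter out.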

jelmervdl · 2021-02-17T10:08:50Z

moses/ems/support/split-sentences.perl

-	$text = $text.$words[$i];
+	if (scalar(@words) > 0) {
+		$text = $text.$words[$i];
+	}


This change apparently also fixes warnings triggered by blank (or whitespace-only) lines in the input.
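
A small sketch of why the guard helps, under the assumption that the word list comes from splitting the line on whitespace: a blank line yields an empty list, so reading the last element returns `undef`, and concatenating it triggers an "uninitialized value" warning under `use warnings`. The function `append_last_word` is a hypothetical stand-in, not code from the script.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the end of the word loop: append the final
# token only when the line produced at least one token.
sub append_last_word {
    my ($text, @words) = @_;
    my $i = @words ? $#words : 0;    # index of the last token, if any
    if (scalar(@words) > 0) {
        $text = $text . $words[$i];
    }
    return $text;
}

my @words = split(' ', "   ");       # whitespace-only line gives an empty list
print "tokens: ", scalar(@words), "\n";
print "result: '", append_last_word("", @words), "'\n";
```

Without the `if`, the same call would concatenate `$words[0]`, which is undefined for an empty list.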

kpu · 2021-02-17T20:45:08Z

Ideally we'd replace buffering then splitting with splitting on the fly. Then if there's something long and no split we throw it out. Here I'm a bit concerned we're throwing out stuff that would correctly split. I understand your immediate need though.
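
A rough sketch of what splitting on the fly could look like: input arrives in chunks, sentences are emitted as soon as a boundary appears, and pending text that grows past a limit without any boundary is thrown away. Everything here is an assumption for illustration: `make_splitter`, `$max_pending` and the naive boundary regex are hypothetical and much simpler than the rules in split-sentences.perl.

```perl
use strict;
use warnings;

# Hypothetical streaming splitter. Returns a closure that accepts chunks
# of text; calling it with undef signals end of input and flushes.
sub make_splitter {
    my ($max_pending, $emit) = @_;
    my ($pending, $discarding) = ('', 0);
    return sub {
        my ($chunk) = @_;
        if (!defined $chunk) {
            # End of input: emit whatever is left unless it is the tail
            # of a run that was already thrown out.
            $emit->($pending) if !$discarding && length($pending);
            ($pending, $discarding) = ('', 0);
            return;
        }
        $pending .= $chunk;
        # Naive boundary: ., ! or ? followed by whitespace.
        while ($pending =~ s/\A(.*?[.!?])\s+//s) {
            my $sentence = $1;
            if ($discarding) { $discarding = 0 }       # tail of a dropped run
            else             { $emit->($sentence) }
        }
        # Nothing split and the buffer is too long: drop it and keep
        # dropping until the next boundary shows up.
        if (length($pending) > $max_pending) {
            ($pending, $discarding) = ('', 1);
        }
    };
}

my @out;
my $feed = make_splitter(20, sub { push @out, $_[0] });
$feed->($_) for ("Hello wor", "ld. Next one", "! ", "a" x 50, "b" x 50, ". Tail.");
$feed->(undef);
print "$_\n" for @out;
```

Unlike a per-line character limit, this only discards text that never reaches a split point, so a long line made of many ordinary sentences would still be split and kept.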

Add -c option to split-sentences.perl

e968869

jelmervdl wants to merge 1 commit into kpu:master from jelmervdl:split-sentence-char-limit
