Bowing to the Masters

Better Mixed Typesetting of East Asian and Western Text

Pandoc does have a relevant extension:

ignore_line_breaks: Causes newlines within a paragraph to be ignored, rather than being treated as spaces or as hard line breaks. This option is intended for use with East Asian languages where spaces are not used between words, but text is divided into lines for readability.

But this extension does not work in practice, because when writing in an East Asian language we almost always use some English as well. If the extension is off, joined lines of East Asian text (Chinese, for example) pick up many extra spaces, which looks ugly. If the extension is on, joined lines of Western text (English, for example) run together (e.g. the last word of one line fuses with the first word of the next), which is not only ugly but actually changes the content.

For example, take a demo file demo.md with the following content:

## Case 1: only East Asian Characters

我能吞下玻璃，
而不伤身体。我能吞下玻璃
而不伤身体。

## Case 2: Only Western Characters

The quick brown fox,
jumps over the lazy dog. The quick brown fox
jumps over the lazy dog.

## Case 3: Blended

我能吞下玻璃而不伤身体，
the quick brown fox jumps over the lazy dog.

The quick brown fox jumps over the lazy dog,
我能吞下玻璃而不伤身体。

中文和
English 混合排版。

English blended with
中文.

Use pandoc to convert it to HTML:

pandoc -f markdown -s -S demo.md -o demo-ext-off.html
pandoc -f markdown+ignore_line_breaks -s -S demo.md -o demo-ext-on.html

Without the extension (red marks point out the pitfalls; I highlighted the spaces in the browser simply with Control+F):

With the extension:

I think Pandoc should be more intelligent and insert a space only

between two Western characters, e.g. apple\n + pie → apple pie,
between an East Asian character and a Western character, e.g. 豆瓣\n + FM → 豆瓣 FM

and add no extra space in all other cases.

Or, to put it more simply:

Always add a space when joining lines, except when the previous line ends with an East Asian character and the current line starts with another.
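The simplified rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated, not how Pandoc implements anything. The helper names is_east_asian and join_lines are made up for this example, and treating Unicode "wide" and "fullwidth" characters as "East Asian" is an assumption that may not cover every script or punctuation mark.

```python
import unicodedata


def is_east_asian(ch):
    # Assumption: "East Asian character" means Unicode East Asian Width
    # class W (wide) or F (fullwidth). This covers CJK ideographs, kana,
    # and fullwidth punctuation such as "，" and "。".
    return unicodedata.east_asian_width(ch) in ("W", "F")


def join_lines(lines):
    # Join the lines of one paragraph. A space is inserted at each line
    # break unless the break sits between two East Asian characters.
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if out and not (is_east_asian(out[-1]) and is_east_asian(line[0])):
            out += " "
        out += line
    return out


print(join_lines(["我能吞下玻璃，", "而不伤身体。"]))
print(join_lines(["The quick brown fox,", "jumps over the lazy dog."]))
print(join_lines(["中文和", "English 混合排版。"]))
```

With this rule the Chinese lines join without a space, the English lines keep their space, and a line break between Chinese and English also becomes a space.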

Reply from Pandoc's author, jgm (John MacFarlane):

One approach would be to implement this option using an AST filter (internal to pandoc), instead of in the Markdown parser. The AST contains Space elements for spaces and soft line breaks (though it doesn’t currently distinguish between the two—that may change soon). The filter could look for and remove Space elements when they occur between two Chinese characters. Note that (unlike the current approach) this would also affect line-internal spaces – they would be collapsed too. Let me know if that’s not desirable.

Are spaces ever used between two Chinese characters, or would it be safe for pandoc to avoid this by default?

Me:

It would be better not to “affect line-internal spaces”.

Spaces are never used between two Chinese characters.

Of course, some people will occasionally write “注 意 ！ ！” (A T T E N T I O N ! ! !), but that’s not normal. And I recommend they use a fullwidth space (i.e. “　”) instead of an ordinary space (i.e. “ ”): 注 意 ！ ！ → 注　意　！　！.

So it would be safe for pandoc to avoid this by default.

For your information, adding a space between a Chinese character and a Western character is not a practice adopted by everyone; it's more of a common rule among those who care about typesetting (see https://github.com/sparanoid/chinese-copywriting-guidelines/blob/master/README.en.md#place-one-space-before--after-english-words).

But everyone should agree that fox\n + jumps → foxjumps is bad.
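To make the filter idea from jgm's reply concrete, here is a rough sketch that works on a flat list of inline elements in the shape of Pandoc's JSON AST. It assumes a Pandoc version whose AST has a separate SoftBreak element (which was not yet the case when the reply was written), so that line-internal Space elements can be left alone as I asked. The function name drop_east_asian_soft_breaks is hypothetical, the East Asian test via Unicode width classes is an assumption, and this is not the code from jgm's actual commit.

```python
import unicodedata


def is_east_asian(ch):
    # Assumption: wide (W) and fullwidth (F) characters count as East Asian.
    return unicodedata.east_asian_width(ch) in ("W", "F")


def drop_east_asian_soft_breaks(inlines):
    # `inlines` is a list of dicts such as {"t": "Str", "c": "中文"},
    # {"t": "Space"} or {"t": "SoftBreak"}.
    # A SoftBreak is dropped only when the Str before it ends with an
    # East Asian character and the Str after it starts with one.
    # Space elements inside a line are never touched.
    result = []
    for i, node in enumerate(inlines):
        if node.get("t") == "SoftBreak" and 0 < i < len(inlines) - 1:
            prev, nxt = inlines[i - 1], inlines[i + 1]
            if (prev.get("t") == "Str" and nxt.get("t") == "Str"
                    and prev["c"] and nxt["c"]
                    and is_east_asian(prev["c"][-1])
                    and is_east_asian(nxt["c"][0])):
                continue
        result.append(node)
    return result


paragraph = [
    {"t": "Str", "c": "我能吞下玻璃，"},
    {"t": "SoftBreak"},
    {"t": "Str", "c": "而不伤身体。"},
]
print(drop_east_asian_soft_breaks(paragraph))
```

A real filter would also have to walk nested elements (emphasis, links, and so on); the sketch only handles one flat list to keep the rule itself visible.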

Er... I just went off to get some sleep, and jgm had already fixed the problem: Implemented east_asian_line_breaks extension. · jgm/pandoc@44120ea

P.S. Emacs Org-mode has the same problem when exporting to HTML, and Coldnew has provided code to fix it:¹

¹ Coldnew is also the author of pangu-spacing.

Kudos from my inner Virgo.


2015-12-13
