Totally Spaced Out

Newsgroups: comp.std.internat
From: ojarnef@admin.kth.se (Olle Jarnefors)
Subject: Definitions of SP, NBSP, and SHY in ISO 8859 (and 10646?)
Content-Type: Text/Plain; charset=us-ascii
Summary: Current and proposed definitions of SP, NBSP, and SHY
 in ISO 8859 are criticized and better definitions proposed.
 Also a misuse of NBSP in sorting standards is exposed, and
 problems of representing hyphenation in plain text are
 described.
Content-Transfer-Encoding: 7bit
Organization: Royal Institute of Technology, Stockholm (Kungl Tekniska Hogskolan)
X-Also-Sent-By: smtp to: iso10646@JHUVM.HCF.JHU.EDU, iso8859@JHUVM.HCF.JHU.EDU, SC2WG3@dkuug.dk
Mime-Version: 1.0
Date: 02 Mar 1995 23:24:02 MET

(For widest distribution I have sent this message to three mailing lists for character set issues, iso10646@JHUVM.HCF.JHU.EDU, iso8859@JHUVM.HCF.JHU.EDU, and SC2WG3@dkuug.dk, for consideration by experts, and to the newsgroup comp.std.internat for general discussion. I would prefer that expert discussion be confined to the iso10646 list.)

I quote here from Johan van Wingen's Text of the Final Draft of the Revised ISO/IEC 8859-1, available at

ftp://dkuug.dk/i18n/iso8859-1.jvw

The following comments aim at refining the definitions/descriptions of SP, NBSP, and SHY, so that they also become fully relevant in the context of the much wider character repertoire of UCS (ISO 10646). (Personally, I wouldn't mind seeing them included in ISO 10646 as well, in a forthcoming amendment or revision.)

> 5.3.1   SPACE (SP)                                                      0221
>         A graphic character the visual representation of which consists 0222
>    of the absence of a graphic symbol. It causes the active position to 0223
>    be advanced by one character position.                               0224

Compared to the corresponding clause 6.3.1 in ISO 8859-1:1987, this text removes the ambiguity that SPACE can be regarded as a graphic character, a control character, or both. This is an improvement in my opinion: graphic characters and control characters should be two disjoint sets.

However, in UCS we have no fewer than 12 different characters that fit this description, not all of them with "SPACE" as part of the character name:

0020 SPACE
2000 EN QUAD
2001 EM QUAD
2002 EN SPACE
2003 EM SPACE
2004 THREE-PER-EM SPACE
2005 FOUR-PER-EM SPACE
2006 SIX-PER-EM SPACE
2007 FIGURE SPACE
2008 PUNCTUATION SPACE
2009 THIN SPACE
200A HAIR SPACE

A legitimate question is what distinguishes SPACE from the others.

There is also a degenerate case of spacing character in UCS; perhaps it should not be regarded as a spacing character at all:

200B ZERO WIDTH SPACE
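
As a rough illustration of the question, the sketch below lists the characters above together with ZERO WIDTH SPACE, using the character database shipped with Python's standard `unicodedata` module. This reflects present-day Unicode data, not the 1995 text of ISO 10646; the helper name `describe` is made up for this example.

```python
import unicodedata

# The twelve spacing characters listed above, plus 200B ZERO WIDTH SPACE.
SPACES = [0x0020] + list(range(0x2000, 0x200C))

def describe(cp):
    """Return (code point in hex, character name, general category)."""
    ch = chr(cp)
    return (f"{cp:04X}", unicodedata.name(ch), unicodedata.category(ch))

for row in (describe(cp) for cp in SPACES):
    print(*row)
```

In that database, 0020 and 2000-200A all carry the category Zs (space separator), while 200B is classified as Cf (format character), which fits the view that it is not really a spacing character.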

Another problematic part of the description is in the second sentence: "causes the active position to be advanced by one character position". Interpreted literally, this is a property shared by all graphic characters of ISO 8859, so why emphasize it for SPACE? It is not mentioned for NBSP and SHY.

The following is my own view on the function(s) of the SPACE character. Objections welcome!

+ 5.3.1   SPACE (SP)
+         A character with the dual function
+
+  a) to represent a suitable non-zero amount of unfilled space
+     between the surrounding characters when displaying them
+
+  b) to hold a place in a CC-data-element intended for replacement
+     by one graphic character.

(I was about to write "white space" instead of "unfilled space", but I remembered the cases of light text on a dark background, and text on a colored background. Case a avoids referring to "spacing between successive words", since that is not the only use of SPACE in displayed text and because some languages don't use spacing between words.)

Case b is relevant for fixed-length fields in database records and for tables in plain text intended to be displayed in a monospaced font.
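
A minimal sketch of function b, assuming a record made of fixed-length fields in which every unused position holds a SPACE. The function name `fixed_field` and the field widths are invented for the example.

```python
def fixed_field(value, width):
    """Left-justify value and pad it with SPACE to exactly `width` characters."""
    if len(value) > width:
        raise ValueError("value does not fit the field")
    return value.ljust(width, " ")

# Two fields of 6 and 12 character positions; each SPACE holds a place
# that a graphic character could occupy later.
record = fixed_field("KTH", 6) + fixed_field("Stockholm", 12)
print(repr(record))
```

Here the SPACE characters carry no typographical meaning; they only keep each field at its declared length.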

For the other spacing characters 2000-200A I expect that definitions of the following type can be given:

-- to represent an amount of unfilled space between the
   surrounding characters normally corresponding to the width of x

where x for 2003 EM SPACE, for example, could be "a LATIN CAPITAL LETTER M in the current font".

I also believe that these typographical spaces are inappropriate for function b of SPACE.

I'm not very knowledgeable about typographical matters so I can't propose good width measures for the different typographical space characters of UCS. I definitely would like to see these properties stated in an amendment to ISO 10646, though.

The enumeration of different typographical amounts of spacing embodied by the UCS characters 2000-200A seems to be heavily reliant on Anglo-American typographical conventions. In a Swedish project aimed at providing Swedish character names for a wide subset of UCS, we have consulted with Swedish typographical experts, who, however, have been unable to give translations for these terms:

2002   EN SPACE
2003   EM SPACE
2007   FIGURE SPACE
2008   PUNCTUATION SPACE

They evidently have no exact correspondences in Swedish lead typography (which was heavily influenced by German typography and had very little to do with British typography). See the comment by Clive Feather.

> 5.3.2   NO-BREAK SPACE (NBSP)                                           0226
>         A graphic character the visual representation of which consists 0227
>    of the absence of a graphic symbol, for use when a line break is to  0228
>    be prevented in the text as presented.                               0229

This text is identical to the corresponding clause 6.3.2 in ISO 8859-1:1987. I see two weaknesses here:

-- A graphic symbol is defined as a visual representation of (among other things) a graphic character, which includes a SPACE. The wording thus seems to indicate that the visual representation of NBSP is a _zero-width_ character, like 200B ZERO WIDTH SPACE in UCS.

-- The definition seems to _forbid_ the replacement of an NBSP by a line break when presenting the CC-data-element. It must be interpreted only as a strong recommendation, though. The column or window in which the text is to be displayed may be so narrow that the program _must_ split a word containing an NBSP across two lines. In that case it is of course better to split it at the NBSP than at any other arbitrary place in the word.

A better description of the NBSP might be:

+ 5.3.2   NO-BREAK SPACE (NBSP)
+         The graphic character with the same visual representation
+    as a SPACE character and with the additional property that
+    replacing it with a line break when rearranging the
+    CC-data-element for display purposes should be avoided.

The formulation "rearranging for display purposes" here is intended also to cover the case of auto-wrapping of body text by word-processing programs.
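
The "should be avoided" reading can be sketched as a toy line filler: it breaks at SPACE whenever possible and falls back to an NBSP only when a unit does not fit on a line by itself. This is an illustration under simplifying assumptions (one character equals one column, as in a monospaced display), and the function name `wrap` is invented here.

```python
NBSP = "\u00a0"

def wrap(text, width):
    """Greedy line filling: break at SPACE; break at NBSP only as a last resort."""
    lines, current = [], ""
    for word in text.split(" "):      # split on SPACE only, never on NBSP
        if not word:
            continue
        candidate = word if not current else current + " " + word
        if len(candidate) <= width:
            current = candidate
            continue
        if current:
            lines.append(current)
        # The unit alone is too wide: prefer an NBSP to an arbitrary position.
        while len(word) > width:
            cut = word.rfind(NBSP, 0, width + 1)
            if cut > 0:
                lines.append(word[:cut])
                word = word[cut + 1:]
            else:
                lines.append(word[:width])
                word = word[width:]
        current = word
    if current:
        lines.append(current)
    return lines

print(wrap("10" + NBSP + "km from here", 8))   # NBSP is kept within one line
print(wrap("10" + NBSP + "km", 3))             # too narrow: split at the NBSP
```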

A digression, especially aimed at those working with standardization of sorting orders:

Some existing and draft sorting standards specify or suggest that NBSP can be used to force a space to be treated as a character at the first level of sorting (as the first letter, before A), while the normal SPACE character has almost no sorting effect, being weighted only at the last sorting level together with punctuation such as "," and "'".

This usage effectively makes the normal SPACE a character within a word -- "file name" is for example sorted almost exactly like "filename" -- while the NBSP acts as a separator between two different words. We get the following sorted order:
      file
      file<NBSP>name
      filemot
      filename
      file name
      filet
As I understand it, this is exactly the _opposite_ interpretation of SPACE and NBSP to the one prescribed in the character set standards: in ordinary text NBSP can be used when you don't want two consecutive words to be split between two lines, that is, when they are more _closely_ connected than usual.
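
The sorting behavior criticized here can be reproduced with a toy two-level sort key. This is only a sketch of the idea, not the algorithm of any particular sorting standard; the names `sort_key` and `IGNORED` are made up for the illustration.

```python
NBSP = "\u00a0"
IGNORED = {" ", ",", "'"}   # characters weighted only at the last level

def sort_key(s):
    """Two-level key: letters first, ignored characters as a tie-breaker."""
    primary = []   # first level: letters, with NBSP ordered before "a"
    last = []      # last level: position and identity of ignored characters
    for pos, ch in enumerate(s.lower()):
        if ch in IGNORED:
            last.append((pos, ch))
        elif ch == NBSP:
            primary.append("\x00")   # sorts before every letter
        else:
            primary.append(ch)
    return ("".join(primary), last)

words = ["filet", "file name", "filename", "filemot", "file" + NBSP + "name", "file"]
for w in sorted(words, key=sort_key):
    print(w.replace(NBSP, "<NBSP>"))
```

With this key, "file name" and "filename" differ only at the last level, while "file<NBSP>name" is placed directly after "file", matching the order shown above.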

> 5.3.3   SOFT HYPHEN (SHY)                                               0231
>         A graphic character that is imaged by a graphic symbol identical0232
>    with, or similar to, that representing HYPHEN, for use when a line   0233
>    break has been established within a word.                            0234

This text is identical to the corresponding clause 6.3.3 in ISO 8859-1:1987.

Here the simplest correction might seem to be to change "HYPHEN" to "HYPHEN-MINUS", if we are only considering ISO 8859. But the definition of SHY ideally should be the same in all coded character set standards, and in ISO 10646 we also have

2010 HYPHEN
2027 HYPHENATION POINT
2043 HYPHEN BULLET

(There is even a 2011 NON-BREAKING HYPHEN.)

UCS also has a 2212 MINUS SIGN as well as several DASHes and a 2015 HORIZONTAL BAR. The reason for retaining the 002D HYPHEN-MINUS in addition to these better-defined characters must be that it is a _compatibility_ character. When you write new text directly coded in UCS, you have no need of the HYPHEN-MINUS. When existing text in 7-bit or 8-bit character sets is imported into a system using UCS, however, it is difficult or impossible for a program to decide whether a certain 2D HYPHEN-MINUS in the original text should be interpreted as a HYPHEN, a MINUS SIGN, or an EN DASH, so it is converted to the UCS character 002D HYPHEN-MINUS.

A SOFT HYPHEN doesn't have this kind of ambiguity, though. Normally, when not hidden, it should be displayed as a 2010 HYPHEN, although in some contexts perhaps a 2027 HYPHENATION POINT is better. Probably a more general term "hyphenation mark" could be used to cover both of these possibilities.

Another problem with the current definition of SHY is that it says nothing about how to image the character when it is within a word on a line and not at the end of the line (which should be more frequent than the other situation).

I would like to propose this description of SHY:

+ 5.3.3   SOFT HYPHEN (SHY)
+         A graphic character that is imaged by a hyphenation
+    mark when at the end of a line and is not shown otherwise.
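
A minimal sketch of the behavior this description asks for, assuming a monospaced display and 2010 HYPHEN as the hyphenation mark. The function name `render_word` is invented, and the sketch does not handle a segment that is itself wider than the line.

```python
SHY = "\u00ad"
HYPHEN = "\u2010"   # assumed choice of hyphenation mark

def render_word(word, width):
    """Lay out one word; a SHY is shown only where a line break is taken."""
    segments = word.split(SHY)
    lines, current = [], segments[0]
    for i, seg in enumerate(segments[1:], start=1):
        is_last = (i == len(segments) - 1)
        # A non-final line must leave room for the hyphenation mark.
        needed = len(current) + len(seg) + (0 if is_last else 1)
        if needed <= width:
            current += seg                      # SHY stays invisible
        else:
            lines.append(current + HYPHEN)      # SHY imaged at end of line
            current = seg
    lines.append(current)
    return lines

word = "hy" + SHY + "phen" + SHY + "ation"
print(render_word(word, 20))
print(render_word(word, 7))
```

When the whole word fits on the line, none of the soft hyphens is shown; on the narrow line, only the one at which the break is taken appears.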

A second digression, about the inadequacy of plain text for representing hyphenation behavior:

In general, hyphenation behavior is not adequately supported by only the NO-BREAK SPACE, SOFT HYPHEN and NON-BREAKING HYPHEN characters of UCS. Non-breaking versions of other characters at which hyphenation may occur, such as "/", "&" and the different DASHes, are needed. More importantly, in many languages irregular hyphenation patterns exist in which letters appear, change or disappear when a word is hyphenated. In Swedish the word "tillagning" is hyphenated
   till-
   lagning
while for example "tillerka%nna" is hyphenated regularly
   till-
   erka%nna
("a%" here stands for LATIN SMALL LETTER A WITH DIAERESIS).
This has to do with the abhorrence of triple consonants in Swedish orthography, even when they would result from combining a word ending in a double consonant, such as "till", with a word starting with the same consonant, such as "lagning". The compound is therefore written with only two l's, and the third reappears when the word is broken. Strangely, triple vowels are acceptable.
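
To make the limitation concrete: a SOFT HYPHEN can only add a mark at a break point, whereas the "tillagning" case needs different text depending on whether the break is taken. One way to model this, outside plain text, is a break point carrying three strings, similar in spirit to TeX's \discretionary. The sketch below is only an illustration; the class `Break` and the helper functions are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Break:
    pre: str       # text ending the first line if the break is taken
    post: str      # text starting the next line if the break is taken
    nobreak: str   # text shown if the break is not taken

def unbroken(parts):
    """Text of the word when no break is taken."""
    return "".join(p.nobreak if isinstance(p, Break) else p for p in parts)

def broken_at(parts, index):
    """Return the two lines produced by taking the Break at parts[index]."""
    b = parts[index]
    head = unbroken(parts[:index]) + b.pre
    tail = b.post + unbroken(parts[index + 1:])
    return head, tail

# Irregular case: the dropped "l" reappears when the word is broken.
tillagning = ["til", Break(pre="l-", post="l", nobreak="l"), "agning"]
# Regular case, equivalent to a SOFT HYPHEN ("a%" transcribed as in the text).
tillerkanna = ["till", Break(pre="-", post="", nobreak=""), "erka%nna"]

print(unbroken(tillagning), broken_at(tillagning, 1))
print(unbroken(tillerkanna), broken_at(tillerkanna, 1))
```

Only the second word can be represented with the single SHY character; the first needs the extra information carried by `pre`, `post` and `nobreak`.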

--
Olle Jarnefors, Royal Institute of Technology, Stockholm <ojarnef@admin.kth.se>

Followup by Clive D.W. Feather

This page was last changed on Mar 03 1995, 12:19 by mfx@pobox.com. Comments and corrections welcome.