From: Erik Naggum <erik@naggum.no>
Newsgroups: comp.text.sgml
Subject: Re: PCDATA vs RCDATA
Date: 18 Mar 1995 04:51:08 MET
Organization: Naggum Software; +47 2295 0313

[kendall thomason shaw]

|   Can someone help me understand the difference between these two statements:
|   
|   <!ELEMENT lmnop - - (#PCDATA)>
|   
|   <!ELEMENT lmnop - - RCDATA>

Joe English has answered well, but a few minor points still need emphasis.

seen from a validation viewpoint, PCDATA (Parsed Character DATA) means that non-markup characters (a.k.a. data characters) are valid in the content, i.e., this is where SGML ceases to be concerned with the validity of the structure of the document.

seen from an application viewpoint, PCDATA is the meat on the markup bones.

if SGML is getting a space shuttle safely into orbit and back, PCDATA is the payload, the ultimate reason we're doing the exercise, but which is nothing without the space shuttle (structure) to support it.

now, there are still a few hairy points to consider. PCDATA is the kind of payload where you are allowed to hook straps around it and weld it to the shuttle and whatnot to keep it there -- i.e., markup within it will be recognized by the SGML parser as "its business". this applies to end-tags, entity references, processing instructions, comments, the works. RCDATA (and CDATA) is the kind of payload that you aren't allowed to touch, and you store it in a special container that protects it from shocks and such.

the trouble is that the RCDATA (and CDATA) containers are broken. Joe shows the proper way to package frail goods in space: marked sections. you never take those RCDATA/CDATA containers with you on real flights. they are there because sometimes you have to do dangerous things, such as releasing that cord that is, technically speaking, keeping you from discovering DS9 on your own.

even if RCDATA and CDATA were fixed, you would not want to use them, because they look just like other boxes, and if they really are special, you should always use special marking tape. it's not sufficient to trust the guys who stuff things into the payload bay to know that containers from Frobozz, Inc, are fragile. you mark it with "FRAGILE" all over the place. SGML users and parsers can use the same redundancy to keep from clobbering important data.
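
to make the packaging concrete, here is a minimal sketch of the two approaches. the sample content is made up for illustration; only the element name lmnop comes from the question. first, the declared-content container:

    <!ELEMENT lmnop - - RCDATA>

    <lmnop>here <b> is just data, but </b> is not</lmnop>

under the recognition rules of ISO 8879, "<b>" should pass through as data, but "</" followed by a name start character is still recognized, so "</b" is taken as the start of an end-tag and the container breaks open early. nothing in the instance warns whoever is editing it. the marked section, by contrast, carries its own tape:

    <!ELEMENT lmnop - - (#PCDATA)>

    <lmnop>ordinary data, then <![ CDATA [ <b>stays data</b>, and so does &this; ]]> and ordinary data again</lmnop>

everything between "<![ CDATA [" and "]]>" is data, and only "]]>" ends it. with RCDATA in place of CDATA as the keyword, entity and character references would still be recognized inside the section.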

then there are inclusions, which are like packing a sledgehammer with your test tubes. you don't want to use inclusions. people who use inclusions are likely to pack their dry clothes where a broken thermos will do maximum damage. in space, and in SGML, that's the difference between getting a regular "welcome home" or your very own entry in the history books.
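
a minimal sketch of the sledgehammer, with element names (doc, para, note) invented for illustration:

    <!ELEMENT doc  - - (para+) +(note)>
    <!ELEMENT para - - (#PCDATA)>
    <!ELEMENT note - - (#PCDATA)>

    <doc><para>test tubes <note>sledgehammer</note> and more test tubes</para></doc>

the content model of para says nothing about note, yet the inclusion on doc lets a note turn up anywhere inside doc, including in the middle of para's data.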

after having thought long and hard about advantages and disadvantages, I think it is preferable to have elements whose content is (#PCDATA) and _nothing_ else. HyTime talks about pseudo-elements that, among a few other things, are unnamed elements that contain only one piece of data. instead of an element containing data and sub-elements, it contains pseudo-elements among the sub-elements. thus, the "mixed content" metaphor breaks down, and there will be counter-intuitive results, to compound the counter-intuitive ways that #PCDATA interferes with parsing, especially in the treatment of whitespace. the ability to use #PCDATA in a content model is only a (very) convenient short-hand, and it is sometimes necessary, but should be viewed as a temporary hack. (it is so convenient that you will be taught several ways to abuse it in almost any book on SGML, but most of these books only tell you what's possible, not what's good practice, because the SGML community didn't have the experience to know the difference. this is changing.)
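
as a rough sketch of the two styles, with hypothetical element names (para, emph, text) that are not taken from any real DTD: the convenient short-hand mixes data and sub-elements in one content model,

    <!ELEMENT para - - (#PCDATA | emph)*>
    <!ELEMENT emph - - (#PCDATA)>

    <para>this is <emph>very</emph> important</para>

while the stricter alternative gives every piece of data an element of its own, so that para has pure element content:

    <!ELEMENT para - - (text | emph)*>
    <!ELEMENT text - - (#PCDATA)>
    <!ELEMENT emph - - (#PCDATA)>

    <para><text>this is </text><emph>very</emph><text> important</text></para>

the two declarations of para are alternatives, not parts of one DTD. in the second form, whitespace and record ends between the sub-elements of para are mere separators and not data, which removes one source of the whitespace surprises mentioned above, at the price of more markup.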

#<Erik>
-- 
the greatest obstacle to communication
is the illusion that it has already taken place