HTML, SGML, And All That Stuff
Everyone who publishes WWW pages must have a working knowledge of
HTML. It is even better to have a firm grasp of the standards that
currently exist, and the directions of development the Web will take in the
next months and years.
This task is not made simpler by the fact that HTML is a rapidly moving
target. While HTML 2.0 is now out as an Internet Draft (i.e., it is fairly
fixed), many issues (tables, math, character sets) aren't part of it, and
HTML 3.0 is still pretty much in flux.
This collection of links grew out of my attempts to hunt down adequate
documentation. In the course of this undertaking, I got interested in SGML,
too. HTML is based on (or, an application of) SGML. Some knowledge of
SGML is needed to read and understand the DTDs (document type definitions)
that are used to describe all the different HTML variations.
Some Web developers want to make HTML-browsers more SGML-savvy, so that
advanced SGML techniques (e.g., DSSSL transformations) can be used to
extend HTML. HTML might even be the ``killer application'' that sells SGML
to the world. Other extension ideas like style sheets are orthogonal to
The following list of resources somehow related to HTML and/or SGML is
divided into the following parts:
Each link is tagged with the date I last verified it.
Starting out from humble, pragmatic beginnings, HTML has been standardized
(or is in the process of being standardized) as HTML 2.0. Its successor
will be called HTML 3.0 (what a surprise!). People have been working on
extensions to HTML, called HTML+. Some ideas from HTML+ have found their
way into HTML 2.0, some (tables, formulas) will be deferred until 3.0. More
vague or radical ideas (Unicode!) may have to wait until kingdom come. A
prototypical browser for 3.0, called Arena, exists (and crashes in
the most interesting ways).
Since the standards are mostly still under discussion, it is a good idea to
go to the source when looking for authoritative answers. The source, in
this case, is a half-dozen organizations and the people working there.
- Overview (26.01.95)
- The following text was snarfed in toto from the net. It
sums things up nicely:
There are three HTML definitions, none of which are RFCs. HTML 1.0 was the
original conception by Tim Berners-Lee et al. It has been superseded by
HTML 2.0, which was frozen and submitted as a draft RFC on or about 29
November 1994, and which (by intention) reflects the then-current usage of
HTML by WWW browsers. HTML 3.0 is currently under development; the Arena
browser, published by the Worldwide Web Organization (W3O), supports all of
its features. The
home page for all three definitions is at CERN
(09.07.95). (Which is about the most ``canonical'' WWW info site
- HTML 2.0 Spec (10.07.95)
Pointers to the latest versions of the HTML
2.0 spec (versions of May 31 and June 16).
- HTML 2.0 DTD (13.1. / 8.2. / 10.07.95)
Earl Hood's hypertext version of the HTML 2.0
- URLs (05.02.95 / 25.03. / 10.07.95)
There are two RFCs that describe URLs: RFC 1630 "Universal Resource
Identifiers in WWW" and RFC 1738 "Uniform Resource Locators".
- Spaces (03.03.95)
HTML is based on the ISO 8859-1 character set, but has - or will have -
some extra character entities. Chief amongst them, different spaces. Now,
spaces are mysterious things - spaces and tabs and newlines are mostly
interchangeable, and multiple "spaces" are collapsed by most browsers. Olle
Jarnefors has written an article that summarizes and criticizes the current
status of spaces and hyphens in HTML. (Includes 2 follow-ups on width details.)
Apropos characters: Roman Czyborra has created a quite definitive page on the
different Latin-[1-10] character sets.
- Draft HTML 3.0 (26.03. / 10.07.95)
The Draft HTML 3.0
courtesy of Dave Raggett
- Style Sheets (11.03. / 10.07.95)
The next big thing to come are style sheets,
which allow the definition of new logical tags and their default visual
interpretation. The details are still very much in the discussion stage,
with half a dozen proposals floating around. (DSSSL can be seen as a meta-stylesheet-language.)
The current (0.96) Arena supports some kind
- Mozillisms (13.1.95)
The currently very popular WWW browser Netscape (formerly
Mozilla) is notorious for its non-HTML-2.0/3.0-compatible
extensions (<center>, <blink>), fondly called
(08.03.95) Netscape Communications got quite a bit of flak on these
issues, and has installed a page on ``Questions And
Answers About Netscape And Open Standards'' to define their
stance. It's mostly about a security protocol called SSL, which creates new
problems of its own (export licenses etc.).
Eventually, variants of many Mozillisms will become standard. This is one
of the reasons Netscape is scorned by standardizers: why invent new things
when generalized variants of said features are already considered in the
(26.08.95) Microsoft, not to be outdone by a start-up like Netscape,
includes its own set of HTML extensions in its Internet Explorer browser.
Amongst them: a <font> tag with colour (oops, color) and font attributes,
and client-side imagemaps (no idea whether they want to use <FIG> or
something else; they refer to a December '94 paper, written by a Spyglass
employee, that uses the same idea, but calls it <MAP>. Might be a
precursor). Of course, nowhere on that page is HTML
(11.10.95) Netscape proposes client-side Cookies.
With these, state-guided dynamic menus and such can finally be
implemented. No idea whether they are part of the 2.0 browser yet.
- Dan Connolly
- maintains the HTML Working and
Background Materials page at w3.org. See also his HTML
Design Notebook and his WWW
Research Notebook, both excellent starting points.
- Daniel LaLiberte's
- archive of
HTML resources. Daniel is the author of HyperNews, a mechanism for
adding annotations to HTML pages.
- Thomas Boutell
- has just (14.02.95) posted an RFD for a split of
comp.infosystems.www (of which he is the FAQ maintainer) into a dozen
subgroups. Currently under hefty discussion, but it should come through in
one form or the other.
- the mother of all standards, has a WWW server (the text version is much less glitzy)
which has a wide range of information on Standards and related sources of
information. The site even has details of committee meetings of the
various working groups as well as a search capability over their whole catalog
(the interface is in English or French - your choice).
- has some pages on SGML resources and web resources. Both
are quite extensive; the SGML page is IMHO authoritative.
- Albert Lunde
- maintains a page with many pointers to WWW and internet
- Jutta Degener's
- page on
WWW resources is something for imagemap fans.
- The "HTML Writers Guild"
- maintains a list of HTML
resources with comments.
Ever so slowly, style rules for HTML evolve. The classical form such
rules take is the "Style Guide", of which there are a few.
- Earle Goodman's index of html markup tags gives
a handy, if unstructured, overview.
(Local copy made in early '94. I remember this with fondness, it
was my first HTML reference.)
Quick Reference is included as part of the htmlchek
HTML syntax and cross-reference checker. I heartily recommend the use of
this checker (or one of its relatives); too many documents are sprinkled
with casual HTML errors.
- (19.11.95) Webcraft
is not so much a style guide as a "communication 101". It is a bit heavily
hyperlinked, and the introductory parts especially are a bit on the abstract
side. The "style guide" section is OK.
readability is not so much a critique of current web pages, as an essay
on what hypertext could/should be like. (One of the examples he cites is a
paragraph from Finnegans Wake. I am awe-struck by the amount of work
that went into this piece of exegesis, even if I am not really convinced
that this kind of hypertext annotation adds much clarity to the
annotated text. Maybe he should have tried something simpler.)
Jutta's page on good hypertext
writing has recently been restructured. It contains links to more
guides / rants / essays on hypertext and writing in general.
- (13.1.95, 27.3.95)
The rather short HTML Bad Style
Page gives a list of "HTML Dont's" (or is that "don'ts"?) and links to
Tips for Web Authoring offers basically sound advice. (They advocate
imagemaps and don't mention b/w monitors when they warn against too much
color, but that's minor). The markup itself is rather horrible (all bold,
too many small documents cut up into page-sized chunks, ping-pong
navigation due to missing "next" links). The style guide list
is quite comprehensive and also covers things like DSSSL lite and style
- (16.02 / 10.07.95: URL changed)
I am not sure whether he wants to be funny, but Justin Hall's page on Hacking HTML
contains a lot of just plain bad advice. (the <H>-tags are supposed
to describe sizes, <li> can be used to create single bullets, and
- (09.09.95) David Siegel's Severe Tire
Damage is a rant/essay on the lack of graphical fine tuning on the
net. Coming from the typographical tradition, he has a point; but
w.r.t. net page design, he is IMHO totally wrong. When style sheets arrive
for the masses (i.e., for his Quadra), he will probably convert to HTML 3.0
pretty fast, though.
Abigail's paper on
The Myth of
Netscape and HTML 3.0 nicely explains the differences between
Netscape's dialect of HTML and HTML 3.0. Somewhat ironically, you should
use a Netscape >=1.1 browser to read the paper; otherwise the examples
- (04.08.95) Patrick Lynch's Yale C/AIM WWW
Style Manual, and a scathing review of it.
Most HTML documents are transmitted using the HTTP protocol (although these
are independent issues; you can also use FTP or gopher or plain e-mail to
transmit HTML documents). HTTP was introduced as a small, stateless,
low-overhead protocol to enable the efficient transmission of huge numbers
of relatively small documents. (Contrast this with session-based FTP, where
documents tend to be large and overall throughput is more important
than latency.) Within HTTP, many document types are supported: HTTP
transports MIME documents, and HTML is just one MIME type.
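To make the "HTML is just one MIME type" point concrete, here is a minimal sketch of what one stateless HTTP reply looks like on the wire. The helper name build_response and the sample bodies are my own inventions for illustration; only the status line, the Content-Type and Content-Length headers, and the blank line separating headers from body follow the usual HTTP/1.0 message layout.

```python
def build_response(body: bytes, mime_type: str) -> bytes:
    """Assemble one self-contained HTTP/1.0 reply (hypothetical helper).

    Nothing about earlier requests is needed to build it, which is what
    "stateless" amounts to in practice.
    """
    header_lines = [
        "HTTP/1.0 200 OK",
        f"Content-Type: {mime_type}",      # the MIME type labels the payload
        f"Content-Length: {len(body)}",
    ]
    # Headers are separated from the body by one empty line.
    head = "\r\n".join(header_lines) + "\r\n\r\n"
    return head.encode("ascii") + body


# The same transport carries HTML or plain text; only the label changes.
html_reply = build_response(b"<title>Hi</title><p>Hello, Web.</p>", "text/html")
text_reply = build_response(b"Hello, Web.", "text/plain")
```

A client decides how to treat the body by looking at the Content-Type header, not at the protocol that delivered it.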
- (27.03.95) There is also a Summary
page of WWW'94 workshops assembled by Bertrand Ibrahim.
- A lot of ideas about HTML+ can be found in the HTML+
Workshop Notes (09.07.95), also from WWW'94 (Notes taken by
Murray Maloney (email@example.com)). There, we find - amongst other pearls of
wisdom - the definitive answers to the Gordian knot of HTML naming
To avoid any further confusion over which "HTML" everyone is really talking
about, Dave Raggett and Dan Connolly explain the new numbering scheme that
has been adopted to describe the earliest HTML (1.0) which was not valid
SGML, the HTML DTD (2.0) which is being written and tested by Dan Connolly,
and the so-called HTML+ which is being written by Dave Raggett and will be
referred to as HTML 3.0. In future, all claims of compliance with HTML will
require reference to a version number(s).
The most fun aspect of HTML, and the thing that makes it useful in the long
run, is the fact that HTML organizes information, not mere
text. Thus, tools can be created to gather, filter, generate, or
transform HTML documents based on their information structure.
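As a small illustration of such a tool, the following sketch gathers the headings and link targets of a document by following its markup structure rather than its text. It uses Python's standard html.parser module; the class name OutlineParser, the SAMPLE document and the example.org URL are made up for this example.

```python
from html.parser import HTMLParser

HEADING_TAGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class OutlineParser(HTMLParser):
    """Collect headings and link targets from an HTML document."""

    def __init__(self):
        super().__init__()
        self.headings = []          # list of (tag, text) pairs
        self.links = []             # list of href values
        self._open_heading = None   # heading tag currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag names already lower-cased.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)
        elif tag in HEADING_TAGS:
            self._open_heading = tag
            self._text = []

    def handle_data(self, data):
        if self._open_heading is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_heading:
            # Collapse runs of whitespace, much as a browser would.
            text = " ".join("".join(self._text).split())
            self.headings.append((tag, text))
            self._open_heading = None


SAMPLE = """<H1>HTML, SGML, And All That Stuff</H1>
<p>See the <a href="http://example.org/html2.html">2.0 spec</a>.</p>
<h2>Style   Guides</h2>"""

parser = OutlineParser()
parser.feed(SAMPLE)
parser.close()
```

After the run, parser.headings holds the document outline and parser.links the link targets; a table-of-contents generator or a link checker could start from exactly this.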
The canonical page on HTTP.
- (27.03.95) Local
- (27.03.95) A local (German-only) MIME-Overview
HTTP and HTML have no built-in facilities for the indexing of and searching
for documents. Existing search engines, ranging in sophistication from
grep to WAIS and proprietary database engines, are adapted to WWW
sites by means of CGI scripts, hacked servers, or hacked
clients. Especially the WAIS servers and indexing engines are sometimes
hard to install and expensive in the upkeep (_large_ indices that take
_long_ to generate), but still provide faster search and more options
(phonetic search, proximities...) than their much simpler cousins.
- (31.07.95) NCSA maintains a CGI
archive. CGI, the Common Gateway Interface, is the mechanism by which
scripts can generate web documents on the fly. CGI scripts can be written
in any language, though they are mostly done in C or various scripting
languages (perl, sh). The archive contains programs submitted by the
- (31.07.95) w3.org's Tools directory contains
sub-pages on filters (xxx-to-html and html-to-xxx, with xxx any major text
format), HTML editing modes, stand-alone HTML editors, log file analyzers,
robots and spiders.
- (31.07.95) libwww-perl ``is
a library of Perl4 packages which provides a simple and consistent
programming interface to the World-Wide Web.'' Current version is 0.40.
- (28.09.95) Marc Andreessen's Mosaic
And WAIS Tutorial
sources and release
notes. The current version is 0.3.something, but those are to be
found at the same site.
- (28.09.95) SFgate
is one of the better-known WAIS-to-HTML CGI gatewais. IMHO, too complex
for its own good.
- (28.09.95) Glimpse is not as fast as
WAIS, but creates much smaller indices, and should be easier to use.
- (28.09.95) SWISH looks even simpler than
Glimpse. Installation is a charm, and the sources are delightful in
their simplicity - especially if you have just spent an evening trying to
understand the freeWAIS sources.
- (28.09.95) wwwwais is a
gateway that works with WAIS and SWISH. If SFgate is an elephant,
wwwwais is a mouse. Source consists of _one_ C file. Works more or less
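To show how little is needed to bolt a search onto a web site via CGI, here is a toy gateway, sketched under a few assumptions: the query arrives in the QUERY_STRING environment variable as q=term, the "index" is a tiny in-memory dictionary standing in for grep or a WAIS index, and the names DOCUMENTS, search and cgi_response are invented for this example. It is not how SFgate or wwwwais work internally.

```python
import html
import os
import sys
from urllib.parse import parse_qs

# Hypothetical corpus; a real gateway would ask grep, Glimpse, SWISH or WAIS.
DOCUMENTS = {
    "/html2.html": "HTML 2.0 specification and DTD",
    "/sgml.html": "SGML introductions and parsers",
    "/dsssl.html": "DSSSL and DSSSL Lite style sheets",
}


def search(term):
    """Return the paths of all documents whose text contains the term."""
    term = term.lower()
    return sorted(p for p, text in DOCUMENTS.items() if term in text.lower())


def cgi_response(query_string):
    """Build the complete CGI output: a header, a blank line, then HTML."""
    term = parse_qs(query_string).get("q", [""])[0]
    hits = search(term) if term else []
    items = "".join(
        f'<li><a href="{html.escape(p)}">{html.escape(p)}</a></li>\n'
        for p in hits
    )
    body = (
        "<title>Search</title>\n"
        f"<h1>Results for {html.escape(term)}</h1>\n"
        f"<ul>\n{items}</ul>\n"
    )
    return "Content-Type: text/html\r\n\r\n" + body


if __name__ == "__main__":
    # The server passes the part of the URL after '?' in QUERY_STRING.
    sys.stdout.write(cgi_response(os.environ.get("QUERY_STRING", "")))
```

The script generates its page on the fly for each request; swapping the dictionary lookup for a call to a real indexing engine is where the actual work (and the large indices) would come in.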
- On 08.02.95, Erik Naggum explained the basics of SGML to a newbie in
this news article.
- (27.03.95/09.07.95) John Klensin's Micro-Introduction,
aka, SGML In Less Than Two Pages!
- (27.03.95/09.07.95) Tim BL's biased review of
- (09.07.95) A most excellent introduction to SGML
declarations (actually, three nicely HTMLd articles from
<TAG>: The SGML Newsletter's tutorial series)
- (27.03.95/09.07.95: seems to be gone) Terry Allen's SGML and
the Internet also gives a good intro, and even goes so far as to define
a small DTD. Also contains a few good links.
- A not very helpful FAQ, but the only
one. Seems it has been in the process of being replaced by something
bigger, better, etc. for quite some time now.
- The SGML Open Consortium's home
page features an embarrassingly large (aka slooow) imagemap. And the "text
only" button didn't work.
- The SGML Repository
(maintained by Erik Naggum)
- The Darmstadt SGML
Archive (mostly a subset of the previous item)
- The SGML Bible: Charles F. Goldfarb: The SGML Handbook; Oxford
University Press, 1990, ISBN 0-19-853737-9. This book also contains the
text of the SGML standard (ISO 8879:1986). Suggested by the FAQ. Have
browsed through it; it is a huge book, indicating that SGML is a monster of
a standard. I have the impression that 95% of the SGML standard is
concerned with writing DTDs for legacy databases, and with introducing
shortcuts for data entry purposes; such are totally unimportant wrt the
'web, as HTML is fairly orthogonal and doesn't use much of the advanced
- Also from the FAQ: Robin Cover, et alia, produced the huge, 312-page
"Bibliography on SGML" (Tech Report 91-299, Queen's University, Kingston,
Ontario, Canada), an incredibly useful work. 312 pages of bibliography
for a standard that is barely 10 years old? Oof!
- (15.08.95) Information about: SGML, DSSSL,
HyTime from the ZGDV Web
SGML is fine and dandy to mark up texts, but what does the mark-up
mean? No-one can tell, 'cause there ain't no semantics. A standard
is needed, and this standard is called Document Style Semantics and
Specification Language. Since nobody can remember this name, everyone
uses its acronym DSSSL. Of course, DSSSL is a large standard, so
there is also a ``DSSSL Lite''. Also of course, DSSSL is a late standard,
so there exists an entirely different standard (FOSI), that does mostly the
same. DSSSL Lite and FOSI are mostly concerned with defining how an SGML
document is to be displayed, while full DSSSL has a more general
scope (transformation rules, with SGML->layout being just one application
of such rules).
- Need an SGML parser? Try James
Clark's SP! (Locally held:
the announcement of version 0.3.) Written in C++;
you can either download the source or binaries for most popular
architectures and OSes. (24.04.95) Now up to version 0.4.
- (16.03.95) Announcement of the latest
ERCS proposals' online availability. (``The proposed Extended Reference
Concrete Syntaxes (ERCS) address the issues of native-language tagging and
"highest-common-denominator" tagging for interchange between different
(04.01.96) There is now a December 1995
draft of the proposal.
- (13.04.95) Fred ``is an
ongoing research project at the Online
Computer Library Center, Inc. (OCLC) studying the manipulation of
tagged text. Fred includes tools to translate tagged text (SGML) to other
formats. Currently, OCLC uses Fred to translate from SGML to HTML, TeX
(PostScript), and ASCII.''
- The biggest problem in reading comp.lang.sgml is attaching any meaning to
all those acronyms. I learned a lot from reading this article on the history of DSSSL and FOSI.
- People are also thinking about using Adobe's PDF (Portable
Document Format) or Microsoft's RTF (Rich
Text Format) as an "alternative" to HTML. AFAIK, PDF is a kind of
editable PostScript (= a Good Thing the world has waited for), while RTF is
a Bletcherous Hack.
To The Max is ``A Manifesto for Adding SGML Intelligence to the
- If you want to look at a real-life example of SGML document types, you
might be interested in the Davenport Group
archive. It is mostly about DocBook, a DTD for software
- The Text Encoding Initiative (TEI) has an archive at the Electronic Text
Center at Univ. of Virginia. The TEI has created an SGML DTD that can
be customized to a great extent, for the purpose of encoding old written
texts. Amongst the docs that explain this DTD is a so-called gentle
introduction to SGML (also available with fewer errors as text, (local copy)), a document on concurrent
hierarchies (a much-discussed optional feature of SGML), and a chapter
on characters and
- (17.01.95) On the current situation,
quoted from SGML Year
1994, written by Tommie Usdin and Yuri Rubinsky:
DSSSL, the Document Style Semantics and Specification Language, the
companion to SGML for formatting and transformation, has been largely
re-written, and is out for balloting.
Also, more relevant to the web:
Since the SGML Conference, an SGML Open technical committee, including
experts from the ISO DSSSL committee, has begun work on defining a minimal
subset of the formatting part of DSSSL such as would be appropriate for
online delivery including World Wide Web SGML and HTML browsers. That work
will be submitted to the HTML IETF Working Group and
relevant lists for discussion.
- (18.01.95) DSSSL
- (16.1.95) James Clark recommends: A collection of examples of
transformations using DSSSL, a more
complex formatting example and a corresponding
example input SGML document. This specifies some more complex kinds of
formatting, such as footnotes, figures and multiple columns, which are not
handled by DSSSL Lite.
- (17.01.95) DSSSL Lite Page,
maintained by Steve Pepper.
- (22.01.95) The DSSSL Lite Specification
Preliminary Draft isn't a semantically ``closed'' document; it refers
to the DSSSL DIS (no idea what DIS is ;-) in defining a subset of
it. Totally dense prose, IMHO.
- (22.01.95) An example is worth more than a thousand pages of
documentation: Example DSSSL
Lite style sheet for HTML. (I don't get it -- I recognize many
details, but the whole concept doesn't get any clearer.)
- (16.1.95) you will find a PostScript version of the committee
copy of the DIS in Erik Naggum's SGML archive. The main
distribution point is infosrv1.ctd.ornl.gov:/pub/sgml/WG8/DSSSL,
run by Jim Mason, convenor of ISO/IEC JTC 1/SC 18/WG 8. These documents are
the official documents (sez Erik).
- (23.01.95) The Whirlwind
Guide to SGML Tools
- (23.01.95) Navy
- (23.01.95) Electronic Book Technologies
- (24.01.95) yet another dsssl link
>Do you know of anyplace to get a decent explanation of DSSSL?
The DIS is the best explanation I've found so far.
It's heavy on formal specifications but light
on examples and rationale, which may be off-putting.
The whole thing is available in SGML and PostScript
(formatted for A4, but prints OK on US letter) at:
James Clark's home page has some good introductory
tutorial information, some STFP and STTP examples
(*very* useful) and documentation about DSSSL-Lite.
Speaking of DSSSL Lite, there's the archives at
The discussions on the comp-std-sgml mailing list
have focussed on DSSSL lately; James Clark has
posted several helpful messages to the list.
The list is archived at
The usual SGML repositories have some data:
(this is mostly a list of links to other sites)
(a more comprehensive list of links)
(a mirror of the DIS)
I have to mention
I don't know of any other network-accessible resources.
There are undoubtedly back-issues of <TAG> (to which
I really ought to subscribe one of these days)
with more comprehensive information.
I don't think any books on DSSSL have been published yet.
The Virtual Reality Markup Language is not a dialect of
HTML. The current spec (1.0) describes 3-D still lifes with links,
i.e. scenes composed of 3-D objects (complete with all information needed for
successful rendering), some of which can be selected as ordinary (URL-type)
hyperlinks. VRML is a pure ASCII format, derived from SGI's Open Inventor
format. There are no interaction features yet; that can of worms has been
deferred to a version 2.0, along with scripts and reactive objects.
Supposedly, there are already browsers for VRML out there (from SGI, no
wonder). (Hey, let's piggyback Java onto it! 3-D animation and local
browser interaction specification for free!)
- (07.09.95) Tom Boutell's PNG Spec, in HTML, tenth
draft, last revised on 1995-5-5. PNG is a non-copyrighted successor to GIF,
i.e. a machine-independent lossless compressed image format. Supposed to
improve on GIF in a lot of aspects: compression factor, error checking,
dithering, colour management (true colours, gamma correction), transparency
(a full alpha channel). Cool features: extensibility, two-dimensional
interlacing. Free reference implementations and the very latest standard
can be found at the canonical
- (27.08.95) Top 10 Netscape Anagrams
- (17.07.95) One of the best arguments I've yet seen in the
contents-vs-presentation debate: A Dream is a short
story about the why of the 'web.
- (18.08.95) The Bandwidth Conservation
Society offers on-line tutorials on how to reduce the size of
- (18.05.95) NCSA, Synex and SoftQuad have announced Panorama, a ``freeware viewer
for full SGML on the Web. In addition, Panorama employs some of the linking
capabilities made possible through HyTime [..]. It also supports TEI
pointers for linking, CALS tables, and use of an SGML DTD for
stylesheets.'' Currently for Windows only. This could be the first step
towards SGML on the net. Hopefully they release Mac and X versions soon,
otherwise the whole thing might die in the cradle.
- (24.01.95) snarfed from the net: a visit report on a meeting
on Portable documents: Acrobat, SGML and
TeX. Quite readable.
- (23.01.95) Interleaf
- (25.01.95) At the totally far-out end of markup languages is
the multimedia crowd. To get an idea of how multimedia and SGML might
interact, take a look at the brief overview of the proposed Standard Multimedia Scripting Language (SMSL).
- (31.03.95) Once upon a time, there was a project called Xanadu, created by Ted Nelson. In his
newsletter, he writes about Xanadu's current status (in the limbo of
projects whose sponsoring has run out).
Watch out for those
.wrl files and x-world/x-vrml MIME messages!
Last changed on 1996-01-06 18:52. Comments&corrections to firstname.lastname@example.org.