HTML, SGML, And All That Stuff

Everyone who publishes WWW pages must have a working knowledge of HTML. It is even better to have a firm grasp of the standards that currently exist, and of the directions the Web's development will take in the coming months and years.

This task is not made simpler by the fact that HTML is a rapidly moving target. While HTML 2.0 is now out as an Internet Draft (i.e., it is fairly fixed), many issues (tables, math, character sets) aren't part of it, and HTML 3.0 is still pretty much in flux.

This collection of links grew out of my attempts to hunt down adequate documentation. In the course of this undertaking, I got interested in SGML, too. HTML is based on (or, an application of) SGML. Some knowledge of SGML is needed to read and understand the DTDs (document type definitions) that are used to describe all the different HTML variations.

Some Web developers want to make HTML-browsers more SGML-savvy, so that advanced SGML techniques (e.g., DSSSL transformations) can be used to extend HTML. HTML might even be the ``killer application'' that sells SGML to the world. Other extension ideas like style sheets are orthogonal to SGML.

The list of resources below, all related in some way to HTML and/or SGML, is divided into the following parts:

Each link is tagged with the date I last verified it.

HTML Definitions

Starting out from humble, pragmatic beginnings, HTML has been standardized (or is in the process of being standardized) as HTML 2.0. Its successor will be called HTML 3.0 (what a surprise!). People have been working on extensions to HTML, called HTML+. Some ideas from HTML+ have found their way into HTML 2.0; others (tables, formulas) will be deferred until 3.0. More vague or radical ideas (Unicode!) may have to wait until kingdom come. A prototypical browser for 3.0, called Arena, exists (and crashes in the most interesting ways).
Overview (26.01.95)
The following text was snarfed in toto from the net. It sums things up nicely:

There are three HTML definitions, none of which are RFCs. HTML 1.0 was the original conception by Tim Berners-Lee et al. It has been superseded by HTML 2.0, which was frozen and submitted as a draft RFC on or about 29 November 1994, and which (by intention) reflects the then-current usage of HTML by WWW browsers. HTML 3.0 is currently under development; the Arena browser, published by the Worldwide Web Organization (W3O) supports all of its features. The home page for all three definitions is at CERN (09.07.95). (Which is about the most ``canonical'' WWW info site that exists.)

HTML 2.0 Spec (10.07.95)

Pointers to the latest versions of the HTML 2.0 spec (versions of May 31 and June 16).

HTML 2.0 DTD (13.1. / 8.2. / 10.07.95)

Earl Hood's hypertext version of the HTML 2.0 DTD.

URLs (05.02.95 / 25.03. / 10.07.95)

There are two RFCs that describe URLs: RFC 1630 "Universal Resource Identifiers in WWW" (Ohio, local) and RFC 1738 "Uniform Resource Locators" (Ohio, local, local marked-up version)
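To make the structure described in RFC 1738 a little more concrete, here is a minimal sketch that takes a URL apart into scheme, host, port, path, and fragment. It uses Python's standard urllib.parse module; the URL itself is made up for illustration and does not point to any of the documents listed here.

```python
from urllib.parse import urlsplit

# Hypothetical URL, used only for illustration.
url = "http://www.example.org:8080/docs/html2/spec.html#anchors"

parts = urlsplit(url)

print(parts.scheme)    # the access method (http, ftp, gopher, ...)
print(parts.hostname)  # the server to contact
print(parts.port)      # optional port number, None if absent
print(parts.path)      # the document's location on that server
print(parts.fragment)  # an anchor inside the document, handled by the client
```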

Spaces (03.03.95)

HTML is based on the ISO 8859-1 character set, but has - or will have - some extra character entities. Chief amongst them: different kinds of spaces. Now, spaces are mysterious things - spaces and tabs and newlines are mostly interchangeable, and multiple "spaces" are collapsed by most browsers. Olle Jarnefors has written an article that summarizes and criticizes the current status of spaces and hyphens in HTML. (Includes 2 follow-ups on width details.)
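The collapsing behaviour mentioned above can be approximated in a few lines. This is only a sketch of what most browsers do with ordinary running text, not a rule quoted from any spec; the function name collapse_whitespace is my own invention. Note that the no-break space (character 160 in ISO 8859-1) is deliberately left alone, which is what makes it useful as a "different" space.

```python
import re

def collapse_whitespace(text: str) -> str:
    """Replace every run of spaces, tabs, and line breaks with one space."""
    # Only plain space, tab, CR, and LF are treated as interchangeable here;
    # the no-break space (\xa0) is not in the character class and survives.
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")

sample = "HTML   is\tbased on\n\n  ISO 8859-1"
print(collapse_whitespace(sample))
```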

Apropos characters: Roman Czyborra has created a quite definitive page on the different Latin-[1-10] character sets.

Draft HTML 3.0 (26.03. / 10.07.95)

The Draft HTML 3.0 specification and DTD, courtesy of Dave Raggett (local copy).

Style Sheets (11.03. / 10.07.95)

Style sheets are the next big thing to come. They allow the definition of new logical tags and their default visual interpretation. The details are still very much in the discussion stage, with half a dozen proposals floating around. (DSSSL can be seen as a meta-stylesheet-language.)

The current (0.96) Arena supports some kind of style sheets.

Mozillisms (13.1.95)

The currently very popular WWW browser Netscape (formerly Mozilla) is notorious for its non-HTML-2.0/3.0-compatible extensions (<center>, <blink>), fondly called Mozillisms.

(08.03.95) Netscape Communications got quite a bit of flak on these issues, and has installed a page on ``Questions And Answers About Netscape And Open Standards'' to define their stance. It's mostly about a security protocol called SSL, which creates new problems of its own (export licenses etc.).

Eventually, variants of many Mozillisms will become standard. This is one of the reasons Netscape is scorned by standardizers: why invent new things when generalized variants of said features are already being considered for the standard?

(26.08.95) Microsoft, not to be outdone by a start-up like Netscape, includes its own set of HTML extensions in its Internet Explorer browser. Amongst them: a <font> tag with colour (oops, color) and font attributes, and client-side imagemaps. (No idea whether they want to use <FIG> or something else; they refer to a December '94 paper, written by a Spyglass employee, that uses the same idea but calls it <MAP>. Might be a precursor.) Of course, nowhere on that page is HTML 3 mentioned.

(11.10.95) Netscape proposes client-side Cookies. With these, state-guided dynamic menus and such can finally be implemented. No idea whether they are part of the 2.0 browser yet.
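How a cookie carries state between otherwise unrelated requests can be sketched with Python's standard http.cookies module. The cookie name menu_state and its value are hypothetical, chosen only to echo the "state-guided menu" idea; this is an illustration of the general mechanism, not of any particular browser's implementation.

```python
from http.cookies import SimpleCookie

# Server side: attach a piece of state to the response.
outgoing = SimpleCookie()
outgoing["menu_state"] = "expanded"   # hypothetical name and value
set_cookie_header = outgoing.output()
print(set_cookie_header)

# Client side: on later requests the browser sends the pair back
# in a "Cookie:" header, which the server parses again.
incoming = SimpleCookie()
incoming.load("menu_state=expanded")
print(incoming["menu_state"].value)
```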

Persons, Organizations, Archives

Since the standards are mostly still under discussion, it is a good idea to go to the source when looking for authoritative answers. The source, in this case, being a half-dozen organizations and people working there.
Dan Connolly
maintains the HTML Working and Background Materials page at w3.org. See also his HTML Design Notebook and his WWW Research Notebook, both excellent starting points.
Daniel LaLiberte's
archive of HTML resources. Daniel is the author of HyperNews, a mechanism for adding annotations to HTML pages. (08.02.95)
Thomas Boutell
has just (14.02.95) posted an RFD for a split of comp.infosystems.www (of which he is the FAQ maintainer) into a dozen subgroups. Currently under hefty discussion, but it should come through in one form or another.
ISO,
the mother of all standards, has a WWW server (the text version is much less glitzy) which has a wide range of information on standards and related sources of information. The site even has details of committee meetings of the various working groups as well as a search capability over their whole catalog. (The interface is in English or French - your choice.) (15.02.95)
SIL
has some pages on SGML resources and web resources. Both are quite extensive; the SGML page is IMHO authoritative.
Albert Lunde
maintains a page with many pointers to WWW and internet standards. (25.01.95)
Jutta Degener's
page on WWW resources is something for imagemap fans.
The "HTML Writers Guild"
maintains a list of HTML resources with comments.

HTML Quick References

The Good, The Bad, and The Ugly

Ever so slowly, style rules for HTML evolve. The classical form such rules take is the "Style Guide", of which there are a few.

Workshop Notes

HTTP

Most HTML documents are transmitted using the HTTP protocol (although these are independent issues; you can also use FTP or gopher or plain e-mail to transmit HTML documents). HTTP was introduced as a small, stateless, low-overhead protocol to enable the efficient transmission of huge numbers of relatively small documents. (Contrast this with session-based FTP, where documents tend to be large and the overall throughput is more important than latency.) Within HTTP, many document types are supported: HTTP transports MIME documents, and HTML is just one MIME type.
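A small sketch may help to show what "stateless" and "HTML is just one MIME type" mean in practice. Each request is a self-contained piece of text, and the response announces its document type in a Content-Type header. The code below does not touch the network: the response is canned, and the function names build_request and parse_response are my own, not part of any library.

```python
def build_request(path: str) -> str:
    """A complete HTTP/1.0 request: one request line, then an empty line.

    Nothing in it refers to any earlier request; that is the stateless part.
    """
    return f"GET {path} HTTP/1.0\r\n\r\n"

def parse_response(raw: str):
    """Split a response into status line, header dictionary, and body."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, _, header_text = head.partition("\r\n")
    headers = {}
    for line in header_text.split("\r\n"):
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body

# Canned response for illustration; a real one would arrive over TCP.
raw_response = (
    "HTTP/1.0 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<title>Hello, Web!</title>\n"
)

status, headers, body = parse_response(raw_response)
print(status)
print(headers["content-type"])  # the MIME type tells the client what it got
```

A GIF image or a plain text file would travel in exactly the same envelope; only the Content-Type value and the body would differ.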

Web Tools

The most fun aspect of HTML, and the thing that makes it useful in the long run, is the fact that HTML organizes information, not mere text. Thus, tools can be created to gather, filter, generate, or transform HTML documents based on their information structure.
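As one small illustration of a structure-based tool, the following sketch gathers all hyperlink targets from a page. It relies on Python's standard html.parser module; the class name LinkCollector and the sample page are made up for this example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the HREF targets of all <A> elements in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser hands over tag and attribute names in lower case,
        # so <A HREF=...> and <a href=...> are treated alike.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical page fragment.
page = (
    '<h1>Resources</h1>'
    '<p><A HREF="spec.html">HTML 2.0 Spec</A> and '
    '<a href="dtd.html">DTD</a>.</p>'
)

collector = LinkCollector()
collector.feed(page)
print(collector.links)
```

The point is that the tool never looks at the visible text at all; it works purely from the markup.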

Searching and Sorting

HTTP and HTML have no built-in facilities for indexing and searching documents. Existing search engines, ranging in sophistication from grep to WAIS and proprietary database engines, are adapted to WWW sites by means of CGI scripts, hacked servers, or hacked clients. The WAIS servers and indexing engines in particular are sometimes hard to install and expensive in the upkeep (_large_ indices that take _long_ to generate), but still provide faster search and more options (phonetic search, proximities...) than their much simpler cousins.
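The grep end of that spectrum can be sketched as follows: a search function of the kind a CGI script might wrap, taking the query string the server passes along and returning a list of matching documents as HTML. Everything here is hypothetical (the document collection, the parameter name q, the function names); a real script would read its query from the environment and its documents from disk.

```python
import html
from urllib.parse import parse_qs

# Hypothetical document collection: file name -> text content.
DOCUMENTS = {
    "html2.html": "The HTML 2.0 specification and its DTD.",
    "sgml.html": "An introduction to SGML and document type definitions.",
    "vrml.html": "VRML describes 3-D scenes with hyperlinks.",
}

def search(query_string: str, documents: dict) -> list:
    """Case-insensitive substring search, in the spirit of grep -il."""
    term = parse_qs(query_string).get("q", [""])[0].lower()
    if not term:
        return []
    return sorted(name for name, text in documents.items()
                  if term in text.lower())

def render(hits: list) -> str:
    """Format the hits as a CGI response: header, empty line, HTML body."""
    items = "".join(
        f'<li><a href="{html.escape(name)}">{html.escape(name)}</a></li>'
        for name in hits
    )
    return "Content-Type: text/html\r\n\r\n<ul>" + items + "</ul>"

print(render(search("q=sgml", DOCUMENTS)))
```

This scans every document on every query, which is exactly why it is simple to set up and slow on large sites; an indexing engine trades that simplicity for speed.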

SGML: Introductory Material

SGML: FAQs and Archives

SGML: Miscellanea

DSSSL

SGML is fine and dandy to mark up texts, but what does the mark-up mean? No-one can tell, 'cause there ain't no semantics. A standard is needed, and this standard is called Document Style Semantics and Specification Language. Since nobody can remember this name, everyone uses its acronym DSSSL. Of course, DSSSL is a large standard, so there is also a ``DSSSL Lite''. Also of course, DSSSL is a late standard, so there exists an entirely different standard (FOSI) that does mostly the same thing. DSSSL Lite and FOSI are mostly concerned with defining how an SGML document is to be displayed, while full DSSSL has a more general scope (transformation rules, with SGML->layout being just one application of such rules).

Etc.

VRML

The Virtual Reality Markup Language is not a dialect of HTML. The current spec (1.0) describes 3-D still lifes with links, i.e. scenes composed of 3-D objects (complete with all information needed for successful rendering), some of which can be selected as ordinary (URL-type) hyperlinks. VRML is a pure ASCII format, derived from SGI's Open Inventor format. There are no interaction features yet; that can of worms has been deferred to a version 2.0, along with scripts and reactive objects. Supposedly, there are already browsers for VRML out there (from SGI, no wonder). (Hey, let's piggyback Java onto it! 3-D animation and local browser interaction specification for free!)

Watch out for those .wrl files and x-world/x-vrml MIME messages!


Last changed on 1996-01-06 18:52. Comments&corrections to mfx@cs.tu-berlin.de.