\documenttype{article}
% Use gellmu-latex-faq
% This is GELLMU source for the didactic "article" document type.
% Revised from the submitted preprint
\surtitle{W. F. Hammond: Presentations: Bridge from LaTeX to XML}
\latexcommand{\bsl;hyphenation\{gell-mu mark-up new-com-mand\}}
\newcommand{\gellmu}{\abbr{GELLMU}}
\newcommand{\gnu}{\abbr{GNU}}
\newcommand{\info}{\softw{Info}}
\newcommand{\html}{\abbr{HTML}}
\newcommand{\pdf}{\abbr{PDF}}
\newcommand{\sgml}{\abbr{SGML}}
\newcommand{\self}{%
http://math.albany.edu:8000/math/pers/hammond/Presen/tug2001}
\newcommand{\xml}{\abbr{XML}}
\newcommand{\txi}{\softw{Texinfo}}
\newcommand{\href}[2]{\anch[href="#1"]{#2}}
\title{GELLMU: A Bridge for Authors\\ from \latex; to XML}
\subtitle{Presentation at TUG 2001, University of Delaware}
\author{William F. Hammond}
\address{Department of Mathematics \& Statistics\\
University at Albany\\
Albany, New York 12222 (USA)\\
\eaddr{hammond@math.albany.edu}\\
\urlanch{http://www.albany.edu/\tld;hammond/}
}
\date{August 2001\\ (slightly revised: January 2003)}
\nobanner
\begin{document}
\begin{abstract}
\gellmu, which stands for \quophrase{Generalized Extensible \latex;-Like Markup}, is a system for using \latex;-like markup, though not \latex; itself, to write consciously for a markup language in the \sgml category or in its popular \xml subcategory. The \emph{basic} level of \gellmu offers a way to use \latex;-like notation together with a \latex;-like \emph{newcommand} (with arguments) macro facility to write web pages. The \emph{advanced} level of \gellmu additionally enables one to incorporate certain \latex;-like features, such as the use of a blank line for a new paragraph, in writing for an \sgml language. The didactic \gellmu production system provides an \quophrase{article} \xml language, with some resemblance to \latex; itself, that is a rigorous domain for translation to other formats.
\end{abstract}
\tableofcontents

\section{Author Level Markup}

Inasmuch as the World Wide Web is becoming an important library resource, one wants one's publications to be accessible online, and one wants web-crawling robots to be able to catalogue them properly. Despite the popularity of Adobe's Portable Document Format (\pdf), the distribution of \pdf reading software is not as widespread as the distribution of web browsing software, and web-crawling robots often do not scan the contents of \pdf documents. What is available for the \latex author toward this end? More specifically, consider the following situations:
\begin{description}
\item[Online publication archives] I would like to cite the \tex;/\latex;-based e-print archive begun at Los Alamos in the early 1990s by Paul Ginsparg, now known as \href{http://www.arxiv.org/}{\quophrase{ArXiv}}, which is now a participant in the \href{http://www.openarchives.org/}{Open Archives Initiative}. While in its early days the term \emph{e-print} was understood to mean \quophrase{electronic pre-print}, ArXiv has more recently become a repository for established journals including, for example, the highly regarded \href{http://www.math.princeton.edu/\tld;annals/}{\emph{Annals of Mathematics}}, which was founded in 1884 by Ormond Stone of the University of Virginia. Ginsparg now tells us that the term \emph{e-print} denotes \quophrase{self-archiving by the author} under a new overall academic publication\footnote{ Paul Ginsparg, \quophrase{Electronic Clones vs. the Global Research Archive}, \urlanch{http://arXiv.org/blurb/pg00bmc.html}. } design.
\item[Course handouts] How can a college teacher prepare course handouts for both paper and online distribution? If the teacher writes \latex, some manual intervention will likely be needed in order to obtain correct \html.
If the teacher writes \html, then the paper distribution\footnote{If the \html is correctly written, then robust translation to \latex is possible.} will be limited by what can be expressed in \html, which is not as rich a markup as \latex.
\item[TUG articles] Before preparing a TUG 2001 article an author is asked to read \href{http://www.tug.org/TUGboat/Contents/contents20-4.html}{\emph{Preparation of documents for multiple modes of delivery}} by Ross Moore, which is available on the web only as a two-column \pdf printed page image. From this article one might conclude that carefully prepared \latex; may be suitable for translation to \html, although no \html version of the article seems to be available.
\item[GNU documentation] While working for TUG on the \tex; Directory Structure (\abbr{TDS}) guidelines --- see \href{http://ctan.tug.org/tex-archive/tds/standard/}{\path{/tds/standard/}} at \abbr{CTAN} --- in January 1998, Ulrik Vieth produced a \latex; document and a tailored program for translating that document into \txi, the language of the \href{http://www.gnu.org/}{GNU} documentation system. \txi is a \tex;-based system that pre-dates \html. Its original purpose was to provide both print and (early online hypertext) \info versions of \gnu software project documentation. When \html came along, it was possible to provide fairly reliable translation from \txi to \html because \txi is a well-structured markup. In fact, \txi is very nearly equivalent to an \sgml language, and in August 2000 Daniele Giacomini made an effort in that direction: \href{http://master.swlibero.org/\tld;daniele/software/sgmltexi/}{\softw{Sgmltexi}}.
\end{description}
Although programs are available for translating carefully structured \latex; into \html and sometimes into \xml extensions of \html, this method of generating online content for the basic level of the web sometimes requires manual intervention.
A more direct approach to the world of \sgml offers better prospects for long-term access to new web formats without sacrificing access to the quality of print typesetting that is available through \latex;.

\section{The basics of \emph{basic} \gellmu}

In looking over Vieth's set-up for the \abbr{TDS} document in the late spring of 1998, I arrived at the idea of using \latex;-like notation for conscious writing in document languages under \sgml, and I have written a program in the GNU Emacs Lisp language, the \gellmu syntactic translator, for converting this \latex;-like markup to \sgml markup. The advantage of \sgml markup is that each markup language (formally, document type) under the \sgml umbrella constitutes a structured domain for the application of automatic processors that are easy to create under any of a number of structured processing frameworks. There are frameworks accessible in standard computing languages, and there is also a recent framework \href{http://www.dcarlisle.demon.co.uk/xmltex/}{\softw{xmltex}} by David Carlisle for writing \tex; typesetting routines for \xml document types. The root idea in using \latex;-like markup for the conscious writing of markup under \sgml is the simple syntactic correspondence between markup such as
\begin{verbatim}
some \em{emphasized} text
\end{verbatim}
on the one hand, and the markup
\begin{verbatim}
some <em>emphasized</em> text
\end{verbatim}
on the other. Most \latex; commands are analogous to \sgml elements. Moreover, the attribute list associated with an \sgml element can be made to correspond with a \latex; command option.
For example,
\begin{verbatim}
\a[href="http://foo.dom/"]{The Foo Domain}
\end{verbatim}
matches
\begin{verbatim}
<a href="http://foo.dom/">The Foo Domain</a>
\end{verbatim}

\section{Basic \gellmu enhanced with \emph{\bsl;newcommand}}

The idea of using \latex;-like syntax for conscious writing under an \sgml document type gains power when one realizes that although the notion of \sgml entity provides, among other things, simple macro expansions, there is no provision under \sgml for macros that take arguments. Moreover, there is no obvious method of extending \sgml systems to accommodate macros with arguments apart from the idea of extending a document type\footnote{ While document type extensions require enough work that they cannot be spontaneous, they provide a sound way to avoid the tangles that can arise working with \tex; or \latex; when attempting the simultaneous use of conflicting macro packages. }. \gellmu provides a \latex;-like meta-command\footnote{ In \gellmu while a \emph{command} corresponds to an \sgml element, a \emph{meta-command} is something having the same syntax as a command that does not correspond to an \sgml element and instead receives resolution into other \sgml markup under the syntactic translator. } called \emph{newcommand} that may be invoked with arguments. For example, if one writes
\begin{verbatim}
\newcommand{\afoo}[2]{%
\a[href="http://www.foo.org/#1"]{#2}}
\end{verbatim}
then a subsequent invocation
\begin{verbatim}
\afoo{tex-archive/tds/}{TDS at Foo}
\end{verbatim}
will yield (without line breaks):\footnote{ The syntactic translator maintains line number alignment between its input and its \sgml output so that line numbers used by \sgml parsers in flagging errors match those in source markup. }
\begin{verbatim}
<a href="http://www.foo.org/tex-archive/tds/">TDS at Foo</a>
\end{verbatim}
This \emph{newcommand} markup differs from that of \latex; in that it is classical macro substitution rather than vocabulary expansion.
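The preamble of this document offers a second small case of the same substitution. What follows is only a sketch: it assumes that the syntactic translator treats \emph{anch} as an ordinary element with an attribute list, exactly as it treats \emph{a}, and the output of an actual run may differ in whitespace. The preamble contains the definition
\begin{verbatim}
\newcommand{\href}[2]{\anch[href="#1"]{#2}}
\end{verbatim}
so that the invocation
\begin{verbatim}
\href{http://www.openarchives.org/}{Open Archives Initiative}
\end{verbatim}
is first rewritten, by textual substitution of the two arguments, as
\begin{verbatim}
\anch[href="http://www.openarchives.org/"]{Open Archives Initiative}
\end{verbatim}
which, under the assumption just stated, corresponds to the \sgml markup
\begin{verbatim}
<anch href="http://www.openarchives.org/">Open Archives Initiative</anch>
\end{verbatim}
Because the substitution is purely textual, the name \emph{href} never appears as an element name in the \sgml output; only \emph{anch} does.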
Since the syntax of a \emph{newcommand} invocation is very similar to that of an \sgml element, the use of \emph{newcommand} can, apart from its on-the-fly convenience, be a help in the development of \sgml document type extensions. A new name in a test document can be moved from being that of a macro to being that of an element simply with the removal of a \emph{newcommand} definition.

\section{\sgml vs. \xml}

Basic \gellmu as enhanced by its macro facility is as far as one can sensibly go toward conscious writing under a language in the restricted subfamily of \sgml document types known as \xml. From one viewpoint the differences between \sgml and \xml are not very important since most documents under the larger category can, if correct, be automatically translated into equivalent documents under \xml. For example, classical \html that passes validation can be translated into the newer \xml form of \html using either James Clark's classical \sgml library \href{http://www.jclark.com/sp/}{\softw{SP}} or Dave Raggett's program \href{http://www.w3.org/People/Raggett/tidy/}{\softw{tidy}}. However, the rules for \xml were designed to make things easy for processors rather than for humans, and for that reason an author writing toward an ultimate \xml document type is usually well advised to write for a version of the document type under more author-friendly \sgml rules. For example, if in an \sgml language a forced linebreak is represented by the defined-empty element \emph{brk}, then the markup \quostr{<brk>} is sufficient, whereas under the more restrictive \xml version of the same language, either the markup \quostr{<brk></brk>} or its abbreviated form \quostr{<brk/>}\footnote{ Moreover, some confusion may arise from the fact that under the \sgml syntax (formally \sgml declaration) specified for \html neither of these \xml forms of markup would be permitted. } must be used.
For \gellmu this means that the markup \quostr{\bsl;brk} and the markup \quostr{\bsl;brk\{\}} are interchangeable under \sgml, except for the case of \quostr{\bsl;brk} abutting a following character without intervening whitespace, but not equivalent under \xml. \gellmu provides the form \quostr{\bsl;brk;} to represent the abbreviated form \quostr{<brk/>} of an element that is defined as empty.

\section{Advanced \gellmu}

Basic \gellmu deals with markup languages more or less at the level of syntax without getting to the level of grammar. Advanced \gellmu may be used to roll language-independent grammatical concepts into the picture. The first of these is \latex;-like multiple argument/option syntax. For example, under advanced \gellmu the markup
\begin{verbatim}
\frac{a x + b}{c x + d}
\end{verbatim}
is converted in syntactic translation to
\begin{verbatim}
<frac><ag0>a x + b</ag0><ag0>c x + d</ag0></frac>
\end{verbatim}
That is, a chain of \quochar{\{}, \quochar{\}} pairs and \quochar{\lsb}, \quochar{\rsb} pairs following a command, with no intervening white space between the command name and the first delimiter or between a close delimiter and the next open delimiter in the chain, constitutes an \sgml element whose content begins with a sequence of generic positional arguments (tag name \emph{ag0}) and options (tag name \emph{op0}). Without knowledge of the document type it cannot be determined if a name used with multiple argument/option syntax has only \emph{ag0}, \emph{op0} content. The syntactic translator provides a list variable consisting of names that have only this type of content and that, therefore, should be given close tags after the sequence of arguments and options.
Absent that, the author must provide a close tag unless an \sgml parser can infer it, and even in that case, if the element can appear in the mixed content model of another element such as, for example, a paragraph, then the parser's automatic placement of a close tag could lead to the unwanted collapse of a word boundary similar to that which occurs in \latex; when an author's careless markup \display{|\TeX benchmark|} gets typeset as \quophrase{\tex;benchmark} instead of as \quophrase{\tex; benchmark}. If multiple argument/option syntax is used, then there is ambiguity on the nature of the first pair of chained delimiters if it is the pair \quochar{\lsb}, \quochar{\rsb} --- whether it represents an \emph{op0} or an attribute list. Therefore, in this case it is required that it is an attribute list if the first character after its \quochar{\lsb} opening delimiter is a colon (\quochar{:}). In basic \gellmu the following four of the ten \latex;-special characters are special: \display{\quostr{\bsl\ \ \lbr\ \ \rbr\ \ \pct}\ .} Additionally, the character \quochar{\hsh} is special when used in the definition of a \emph{newcommand}, the characters \quochar{\lsb} and \quochar{\rsb} are special when used for \latex;-like option syntax, and the character \quochar{\amp} is special when followed immediately by a letter since then it is the introducer for \sgml entity invocations. Advanced \gellmu provides for the possibility of giving traditional \latex; meaning to \quochar{\amp} when not followed immediately by a letter and also to the other four \latex;-special characters, which are \display{\quostr{\und;\ \ \crt;\ \ \dol;\ \ \tld;}\ .} Additionally, it provides for the possibility of giving traditional \latex;-like meaning to other short forms of markup such as \qquostr{\bsl;( . . . \bsl;)} for inline math, \qquostr{\bsl;\lsb; . . . 
\bsl;\rsb;} for displayed math, \qquostr{\hyp;\hyp;} for a range-dash, \qquostr{\hyp;\hyp;\hyp;} for a punctuation-dash, \qquostr{\bsl;\spc;} for an inter-word space, \qquostr{\bsl;,} for a short space, and others including also, if desired, the use of blank lines, as appropriate, for paragraph boundaries.

\section{The \gellmu didactic production system}

The conversion of both \emph{basic} and \emph{advanced} \gellmu source markup to \sgml is performed by my program called the \emph{syntactic translator}. If one wishes to write consciously for a public document type such as \html or the \href{http://www.tei-c.org}{Text Encoding Initiative}'s \abbr{TEI} using \gellmu's \latex;-like syntax, the syntactic translator is the only part of \gellmu that will be of interest. The optional features of advanced \gellmu described above can only be used when one is writing for a document type that provides markup in which the corresponding concepts have representation. For example, \latex;-like use of the character \quochar{\tld;} for non-breaking space requires a markup that provides non-breaking space. Moreover, if blank lines are going to be paragraph boundaries, then the syntactic translator will need a list of element names before which a new paragraph does not make sense, and, since there is no separate provision for a list of names after which a paragraph must end, the document type cannot be \xml. The \gellmu didactic production system provides such a document type and also provides tools, which can be used as inter-changeable components, for working with that document type. The didactic production system consists of the syntactic translator and the following additional components:
\begin{enumerate}
\item An \sgml document type called \quophrase{article}.
\item Its \xml cousin, also called \quophrase{article}.
\item A program for translating \sgml article to \xml article.
\item A program for translating \xml article to \html.
\item A program for translating \xml article to \latex. \end{enumerate} The document type is intended to be comfortable for authors with past experience in \latex;. The document type and the components are didactic. They are intended to illustrate how such a system can be assembled from inter-changeable components. They are not finished in any sense, and each has shortcomings. They do serve, I hope, to demonstrate to the community of \latex; authors that it will find no limitations in this approach to document production. At the same time it is intended to provide a whole new way of thinking about the subjects of package design and class design. Its unfinished nature is intended to make it relatively easy for those who are so inclined to move in various ways to finish such a system that fits their needs. \section{Production of this Document} This document and the slides used during its presentation were prepared with the \gellmu didactic production system. Pre-publication versions of the sources and various automatic formattings are available in \href{\self/}{the author's web}. Subsequent to the \gellmu run on this document a copy of its \latex output was manually modified for conformance with TUG guidelines. If I were going to submit a number of such TUG articles, it would be worthwhile to make another variant of the \latex formatter for TUG. \end{document}