Using SGML for EDI: An answer

From: aray@nmds.com (Arjun Ray)
Newsgroups: comp.text.sgml
Subject: Re: A novice needs help or at least pointers
Date: 1 May 1999 02:06:02 -0500
Organization: FUDGE Dispersal Systems
Lines: 90
Message-ID: <3767a036.1228868880@news1.newscene.com>
References: <7gd3tj$dfg$1@nnrp1.dejanews.com>
Reply-To: aray@interactrx.com
X-Newsreader: Forte Agent 1.5/32.452
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

In <7gd3tj$dfg$1@nnrp1.dejanews.com>, DOlivastro@ChiResearch.Com wrote:


| A data supplier used to send us data in a proprietary format. [and 
| now] has decided to change to SGML. [...] For example, in the old
| days, the file said:
| 
| Number:    05704062
| Author:    Olivastro; Dominic
| 
| and so on.  Now I get something like this:
| 
| <ENTDOC><SDOBI>
|  <B100><DNUM><PDAT>05704062</PDAT></DNUM></B100>
|  <B200><AUT><NAM>
|    <FNM>Dominic</FNM>
|    <SNM>Olivastro</SNM></NAM></AUT></B200>
| 
| and so on (and on and on and on).


Yep, looks nasty... (Is there some EDI directory lurking in the background?)

Basically, the data in this format represents a tree-like hierarchy of named containers, with text at the "leaf" level.


| Are there tools for this? 


Not directly for conversion. But it may not be too difficult to roll something for your particular needs.


| Ideally, I want a program that will just take this file and change
| it to something like the first file.


Makes sense. So what you need is something that will parse this format and use a set of rewriting rules to generate the old familiar format.

1. If the data supplier is serious about this being SGML (as opposed to just looking like it), he'll have a DTD (Document Type Definition).

You will need this as a plain text file: it will have an accurate description of the structure of the data.

2. <URL:http://www.jclark.com/sp/> has the source and some binaries for the pre-eminent SGML parser in the free software world. The 'nsgmls' program will take the DTD and the data file and convert the data to another format, called ESIS (Element Structure Information Set), which will look like this:


(ENTDOC
(SDOBI
(B100
(DNUM
(PDAT
-05704062
)PDAT
)DNUM
)B100
(B200
...


Not much of an improvement, except that now the format is strictly line-oriented, and thus suitable for practically any text processing tool. (The first character in each line corresponds to a "parse event" such as "start element", "text data", "end element", etc.)
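
To make the "first character is the event" idea concrete, here is a minimal sketch in Python (my choice for illustration only; the tools discussed in this post are Perl-based). It assumes just the three line types visible in the sample above: "(" for start element, ")" for end element, and "-" for text data. Anything else is passed through as "other", and backslash escapes in data lines are not decoded. The function name esis_events is hypothetical.

```python
from typing import Iterable, Iterator, Tuple

# Only the event codes that appear in the sample output are named here;
# every other leading character is reported as "other".
EVENT_NAMES = {
    "(": "start",  # start of an element, payload is the element name
    ")": "end",    # end of an element, payload is the element name
    "-": "data",   # text data, payload is the text (escapes left as-is)
}


def esis_events(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event kind, payload) pairs, one per non-empty input line."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        kind = EVENT_NAMES.get(line[0], "other")
        yield kind, line[1:]


sample = """(ENTDOC
(SDOBI
(B100
(DNUM
(PDAT
-05704062
)PDAT
)DNUM
)B100
""".splitlines()

events = list(esis_events(sample))
for kind, payload in events:
    print(kind, payload)
```

Because each line stands alone, the same loop could just as well read from a pipe fed by the parser.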

3. Depending on how regular/predictable the structure is, you might get away with a simple Perl script that just "waits" for a list of interesting lines and does the right thing each time. A better approach might be to get the SGMLS.pm Perl module from CPAN and use its 'sgmlspl' driver script to control the translation in a more structured fashion: you write Perl subroutines for the "events" you're interested in, and the driver arranges to invoke them at the correct points. The subroutines can do anything you like: print text directly, save stuff to variables for later processing, and so on.

The main argument against a one-off Perl script in a while(<>) loop is that it's already clear that some buffering will be needed. For instance, in the original format, the last name preceded the first name, and now the order is reversed. So you need logic like this:


 On {firstname} -> Save {firstname}
 On {lastname}  -> Print "Author: " {lastname} "; " {firstname}
    

sgmlspl is good for things like that.
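
For a feel of what that buffering looks like, here is a rough sketch of the same rule in Python. It is a stand-in for illustration, not sgmlspl and not its API. Assumptions on my part: the ESIS lines for the B200 branch are extrapolated from the SGML sample (the output above stops at "..."), the whitespace between tags is not reported as data, and the column layout of the old format is 11 characters for the label. One small variation from the rule as written: the author line is emitted when NAM closes, so it works whichever of FNM/SNM comes first. The function name esis_to_flat is hypothetical.

```python
from typing import Dict, Iterable, List


def esis_to_flat(lines: Iterable[str]) -> List[str]:
    """Turn ESIS lines into 'Label:    value' lines in the old style."""
    out: List[str] = []
    stack: List[str] = []      # names of the currently open elements
    name: Dict[str, str] = {}  # buffered FNM/SNM text for the current NAM

    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        code, rest = line[0], line[1:]

        if code == "(":                      # start element
            stack.append(rest)
            if rest == "NAM":
                name = {}                    # fresh buffer for each name
        elif code == "-":                    # text data
            leaf = stack[-1] if stack else ""
            if leaf in ("FNM", "SNM"):
                name[leaf] = name.get(leaf, "") + rest   # save for later
            elif leaf == "PDAT" and "DNUM" in stack:
                label = "Number:"
                out.append(f"{label:<11}{rest}")
        elif code == ")":                    # end element
            if stack:
                stack.pop()
            if rest == "NAM":                # both parts are known by now
                label = "Author:"
                last = name.get("SNM", "")
                first = name.get("FNM", "")
                out.append(f"{label:<11}{last}; {first}")
    return out


# Hypothetical ESIS input: the B200 part is extrapolated from the SGML sample.
sample = """(ENTDOC
(SDOBI
(B100
(DNUM
(PDAT
-05704062
)PDAT
)DNUM
)B100
(B200
(AUT
(NAM
(FNM
-Dominic
)FNM
(SNM
-Olivastro
)SNM
)NAM
)AUT
)B200
)SDOBI
)ENTDOC
""".splitlines()

result = esis_to_flat(sample)
print("\n".join(result))
```

The element stack is what a plain "wait for interesting lines" loop lacks: PDAT presumably shows up under other containers too, so the data line only means "Number" when DNUM is among its ancestors.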

Good luck.


:ar