Old features

Here's a link to some info at CERN: cern stuff

New features

Here's a picture: --attachment Content-id: HTTP://info.cern.ch/hypertext/WWW/TheProject.html Content-type: message/external-body ;access-type=X-HTTP ;host="info.cern.ch" ;name="/hypertext/WWW/TheProject.html" Content-Type: text/X-HTML --attachment Content-id: part3 Content-type: image/gif Content-transfer-encoding: base64 R0lGODdhdQAvAIQAAL9/v3+ff39/f/+/f/+ff/9/f3+fv///v39/v//fv/+/v/+fv/9/v7+/ f7+ff79/f//////f/7/fv7+/v7+fvwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA ACwAAAAAdQAvAAAF/uARHQkkiuMoQmZUnixUvkfNHjKbivM75riEMIa7xYiQXYvH/CVIMuhz 5gK6nsuabUYawqamcCQJvOGKQZnz9oues9rge/wOy4asWdQpj6r1EWNVRSVKT3RLYXWEbIot jjw6J21jZEdaOycleVCTQYc9PXQ0nSMvXC4qg46QPzA4bWdKNXpvtFpPK4g+d4W9vUIKXHk7 p0m2KSYJV0JrejmNrJp7dkLNwtbLPKmXLc3DUd8JCuTl2MzZh37ariuHnmVAUCZFpndXRi48 2MvLKdauikWyBsyauXIUFEwgN65cQyFSfHhyZ+RFHlaPSFCEJgtWM45hapUhEWjIpn4N/oUp VEBhAAUCAwgsaKCA5sIJFG4qGLOpkx0Vyqz4PHkCVtBOIm040iNxEDeSmnz463dlwoKFORU4 WKlw5oKtE6Z4wYelip0iZ6GdFIO27TGeOVyFzNiOjj01Upb0c5gym06rDVbhQRrU0MW5i5Ag MiJtLq83hfLeeZTr3x9xKJlhU9il54rPSNLY8JkmIxpCtogo4ZiGhpF5RSMXY4oys8FxFFIB HC3nSzXVae1AE8721cUaldrClT0pRw/nk9/ppjrB6gCNASNLYmURdOq0trQfOU4GeVBIqV85 +SeqC3aDCgjEPzbxVEd7HefA+u72LO/XZ40CHVyGrOFJMTxd/rENVgoM8E5ZjyTiVizHAFiG RcGJIdx+QEGiiEVS4SWRP51dRoICCwxwlR9u5GJcLvRU5OGGMrYSIy0WMreDXVTkJ6IK7EBE RUMyOcCDdG8hh9YmErIhmlJoIMaYDq/JlVhHWeiD1xc7bpPNSgkJ0weM2mmkDGmJ/FQljegh AxlbUgKyhWXNHPlgkAmkqNUC0WmixYXdGKfknyKxiKFbUe6SRkC2UHRGT2Bcho5VFDjQkgPt VTFIJLck8ligH+boKYBbvIYlPCGGs9wU+qRCQaUvvYqpfVxwOlx3WWLnnRngNdYTIrKMtFQW JgVV7CEKTUCApcwye4U+AGWjV16H/oYWbJSjMZZtRo29WQsWRFVhEjsrESArrLAaKaQJKFpa XVilZYtKKRk1VaFo5GXbh3429lBUbMoksJC7lprrwAMUPGBpOuM4gJO7K42DGT1AFdWhLONI AA8MHu7nJKOZKOkGxaxilVPBsCr8KsJ9yVRpdczilFN1yQ4k7pH2CPHuTVigGa9cuHbHcRyP WuTCw+mmvLLCCDfoNLqWNoA0sy/Di9SJEryrCtIUNEABROAx2jHQzrkYEooviSnwAklbirAD cK+csDkGc02wrEizw0N1FFw1E82VKuQ13z2j5uG8cqiyhdksQV3pV0knXCnCABwMt8IDKDv1 yefGTfXL/sHgxrnMr5KeU+kJ/Qfjf1Iq2U9nOLBkaeSOU90sAJI33XnblRpAQQAUCHDu6TjJ 
7O7LXb/ssAMzvUwOLl5gzJAxKOnE8OxUK9xsyg4AgPADlTsgQPcBPDCAVo6nW3nllTpggO1J TzA7uvQTD6uYLqCoQOzk8L3QMhErHur6Foy2+Q5q7Gtf+QLgAAR072DB413cHvC+CiLMgQAA nqXeF7m4tc948nscTqRmDpqhDXtVW96r+gY/k1EtAOH7HaxgKIAHIIB93useBR8APqXNTng6 lJzjfGcp4B1wfkjcnfzmF0JYTcBrJxPg/CQoK/g9MGEG+F7CPNe98TXwAeUD/iMYPRc8ZgmP gb9DIdyKWEXgyfBzn6Pf5wY3Oyh2MH27U58G35i0ytkQbgCoIQQF4L1Aeq+MwXsb7oRnQAo4 EIu4S5oR4WjF4U0RfrWrIhwrlcGEZfCQ32ug+PwYSB0igIfiCyMiEfa7DKrsbZy8ovvWeMZ0 MYuIllwhCtu3Sypu8FwMrBwG3Za9B76NgoAU4xdPaUOVCW98fsQiGWdJQ08iIJadFGbc0IjG 9qXxm7ubpi7f98AsUqBy5oTbI2uITMkdLIuBBKP3Tmk5dhLScq7M4vuaCUMdhg+Z6mQjGGG1 PsmFz3dunJ3ChIcw4T2SlQzEYOX6Wal+ds+Bf6TA/vu8Vz6GRpOH5fsiIeUpgGdCkGl/dFv5 3nfKLNqQoeVjX0oxSk7fKSyHNJVVJyf4PoqykX3Q/N4+J+c9kK7vAeMjZPlOGc8dEpKdBzvl +FoaVXkmbHyT46A8F7rTg17wpuaUXBYP9smLihJ3DwRl+XL3O6YlMosRjar4PGlP8fEwizW0 5x+dejBBjlQAwEPqU+PJUbSCT56n3KKlxoewiKqskwpzIAQP1lD33bStZbzgR78HPsZWTpB7 DWQikeq9GkoVqZ1N7D1hyEOGXm6WMFWqDi8HQ3MeM4ZbJSbu5DlUhY21sgAY61zHh9cyltK0 zcwrakva2uX+0ZVVHaU8/jsaSGYmN4el/aIB/OhMZDbznAEFIjQvV6lhljV86wviTQfpSaYx NpFF7SxqmancZlZVnmSdnDq3ODkHCg+90y2lJzXoRQwmDIMMdSA92SdLoJLPrPoEpPscqdLK PhW1RWXnXW+KRmO6koKHPKkiiXlK1qZ1h/784hbXy1/wVVim+nzlA8mrxdemNWHr5eFV7dpM t57UdkB2p051O1nyNfam2luwVcUYQe2prFI1HKtMNRpZAPzzqmjl6EKv+sca4ni0hy3yiitV ADIr9sprbB8pYfjlTt7ze/1cMgTZ7FJWTtbFUrZcA10JZaTqcKzDPCxyUVliPx/WXMtiAHkn KchJNbO1sf/87YjLqsXLehmVS0urA0yMSmEyjZQPZOxhF2rovuZWACEAADs= --attachment-- --cut-here And here's the DTD for WEB-NODE documents: --cut-here "> "> --cut-here And here's a perl script to convert an HTML document into a multipart/X-HYPERTEXT MIME body part: --cut-here #!/usr/local/bin/perl $boundary = "attachment"; print "Content-Type: multipart/X-HYPERTEXT; boundary=$boundary\n\n"; print "--$boundary\n"; print "Content-Type: text/SGML\n\n"; print "; # read whole file $_ = join('', @html); $out = ''; sub fix_anchor{ local($name, $href, $type); # What exactly is the syntax of an SGML attribute value? 
while(s/^(\w+)\s*=\s*((\"[^\"]*\")|([^\s>]+))\s*//){ local($v) = ($3 || $4); local($a) = $1; $href = $v if $a =~ /^href$/i; $name = $v if $a =~ /^name$/i; $type = $v if $a =~ /^type$/i; } s/[^>]*>//; $out .= "//){ $out .= ""; }elsif(s/^H(\d)>//){ local($n) = $1; while($n<=$header){ $out .= "

"; $header--; } while($n>$header){ $out .= "

"; $header++; } $out .= ""; }else{ $out .= '<'; } }
$out .= $_;

foreach(keys %anchor){
    local($ent) = $anchor{$_};
    print "\n";
}
print "]>\n", $out;

foreach(keys %anchor){
    local($access_type);
    print "\n\n--$boundary\n";
    print "Content-id: $_\n";
    print "Content-type: message/external-body\n";
    $access_type = $1 if s/^(\w+)://;
    if(s/#([^#]+)$//){
        print "\t;x-anchor=\"$1\"\n";
    }
    if($access_type =~ /file/i){
        if(&hostport){
            &param('access-type', "ANON-FTP");
        }else{
            &param('access-type', 'LOCAL-FILE');
        }
        &param('name', $_);
        print "\nContent-Type: application/octet-stream\n\n";
    }elsif($access_type =~ /http/i){
        &param('access-type', 'X-HTTP');
        &hostport;
        &unescape;
        &param('name', $_);
        print "\nContent-Type: text/X-HTML\n\n";
    }elsif($access_type =~ /news/i){
        &param('access-type', 'X-NEWS');
        &unescape;
        if(/@/){
            &param('message-id', $_);
        }else{
            &param('group', $_);
        }
        print "\nContent-Type: message\n\n";
    }elsif($access_type =~ /telnet/i){
        &param('access-type', 'x-telnet');
        &unescape;
        &param('user', $1) if s/^(.*)@//;
        &param('port', $1) if s/:(.*)$//;
        &param('site', $_);
        print "\nContent-Type: X-TELNET\n\n";
    }elsif($access_type =~ /gopher/i){
        &param('access-type', 'x-gopher');
        &hostport;
        &param('type', $1) if s-^/(\d+)/--;
        &unescape;
        &param('selector', $_);
        print "\nContent-Type: @@@@\n\n";
    }elsif($access_type =~ /wais/i){
        &param('access-type', 'x-wais');
        &hostport;
        if(m-^/-){
            &param('type', $1) if s-^/(\w+)--;
            &param('size', $1) if s-^/(\d+)--;
            &unescape;
            &param('path', $_);
        }else{
            &unescape;
            &param('words', $1) if /\?(.*)/;
        }
        $type = "image/$type" if $type =~ /gif|tiff/i;
        $type = "application/postscript" if $type =~ /PS/i;
        print "\nContent-Type: $type\n\n";
    }elsif($access_type eq ""){
        &param('access-type', 'x-relative');
        &unescape;
        &param('name', $_);
        print "\nContent-Type: message\n\n";
    }else{
        warn "unknown access type: $access_type in $_";
    }
}
print "--$boundary--\n";

sub unescape{
    s/%(\w\w)/sprintf("%c",hex($1))/ge;
}

sub param{
    local($p, $v) = @_;
    # quote tspecials in parameter values
    $v = '"'.$v.'"' if $v =~ m-[\s()<>@,;:\\\"\/\[\]?\.=]-;
    print "\t;$p=$v\n";
}

sub hostport{
    if(s-//([^:/]+)--){
        &param('host', $1);
        &param('port', $1) if s/:(\d+)//;
        1;
    }else{
        0;
    }
}
--cut-here--

======================================================================
From: Dan Connolly
To: www-talk@nxoc01.cern.ch
Subject: HTML is not SMGL
Date: Sun, 07 Jun 92 00:12:55 CDT

My grandiose scheme to convert HTML to MIME and SGML works fine.

Now I'm going back to the idea of writing a DTD for the existing HTML
format. I can't seem to do it. HTML has so little rigid structure that
I'm running into mixed content problems (I have to allow #PCDATA almost
anywhere, hence mixed content, which screws up everything).

How much extant HTML is really out there? And how much of it is
generated on the fly by gateways and servers?

This MIME/SGML stuff sure seems like the way to go. Now if I make it
possible to create such documents with FrameMaker and a perl script, I
bet it will catch on.

I suspect I'll get some resistance against abandoning UDI's, but I
don't think they work.

Dan

======================================================================
From: jfg@dxcern.cern.ch (Jean Francois Groff)
To: www-talk@nxoc01.cern.ch
Subject: Re: HTML is not SMGL
Date: Mon, 8 Jun 92 01:01:02 +0200

Dan asked:

> How much extant HTML is really out there? And how much of it is
> generated on the fly by gateways and servers?

Our hypertext documentation is certainly the largest quantity of HTML
you can find in the world. Besides, we know all the people who have
produced their own, so making the Big Change would be relatively simple
for them (esp. given your impressive perl script). Gateways can be
changed easily too. But all the browsers must be updated first, and
that will take more time!!! (There are thousands of copies
installed...)

> I suspect I'll get some resistance against abandoning UDI's, but I
> don't think they work.

Well, you still use them internally, don't you?
;^)

Jean-Francois

======================================================================
From: Edward Vielmetti
To: jfg@dxcern.cern.ch (Jean Francois Groff)
Cc: www-talk@nxoc01.cern.ch
Subject: Re: HTML is not SMGL
Date: Sun, 07 Jun 92 20:26:48 EDT

The UDI vs. MIME argument is a non-argument. MIME is sufficiently
flexible that if you construct an appropriate Content-type and define
its semantics appropriately it will accept UDI's and work accordingly.
"Simple matter of programming" :).

Explicit "attribute=value" tags are more flexible than the W3 approach
of turning the entire document ID into a big long string. I guess it
depends on whether you believe you are dealing with a big database
or a big file system. Both approaches have their place. Again, as
a simplified case you have "udi=//host:port/path" as a MIME identifier
and all is well.

I expect that MIME will be available in many e-mail products over the
next 3-5 years. Since the only application that has anywhere near
universal appeal on the net is e-mail, it strikes me as only
appropriate that hypertext systems try to get as much leverage from
mail as they possibly can.

--Ed

======================================================================
From: Dan Connolly
To: Edward Vielmetti
Cc: jfg@dxcern.cern.ch (Jean Francois Groff), www-talk@nxoc01.cern.ch
Subject: Re: HTML is not SMGL
Date: Sun, 07 Jun 92 22:29:44 CDT

>The UDI vs. MIME argument is a non-argument. MIME is sufficiently
>flexible that if you construct an appropriate Content-type and define
>its semantics appropriately it will accept UDI's and work accordingly.
>"Simple matter of programming" :).
>
>Explicit "attribute=value" tags are more flexible than the W3 approach
>of turning the entire document ID into a big long string. I guess it
>depends on whether you believe you are dealing with a big database
>or a big file system. Both approaches have their place.
>Again, as a simplified case you have "udi=//host:port/path" as a MIME
>identifier and all is well.
>

The problem is that the syntax of a UDI doesn't fit into the syntax
of a MIME parameter (or an SGML attribute value...) because a UDI might
be arbitrarily long, and it cannot contain any whitespace (so it can't
be split across lines).

So these 200+ character UDI's for WAIS documents can't be mailed around
safely (even SGML has limits on the length of an attribute value).
Heck, my WWW client (perhaps it's not the latest version, but still...)
can't even retrieve WAIS documents due to these problems.

Dan

======================================================================
From: Dan Connolly
To: www-talk@nxoc01.cern.ch, wais-talk@think.com
Subject: MIME for global hypertext
Date: Sun, 07 Jun 92 22:49:51 CDT

[This was posted to several newsgroups, but someone from wais-talk
suggested I forward it there also.]

The WAIS, gopher, and world-wide-web projects are all client/server
information retrieval systems. All three deliver plain text information
quite well, and they each have evolving mechanisms for delivering other
forms of information. The MIME RFC defines a system for processing
multi-part, multimedia messages on the internet. I would like to see
these systems, along with USENET news and internet mail, interoperate
with MIME as the substrate.

The clients for these systems go something like this:

    0 user invokes client (and chooses a starting point)
    1 client displays user's request
    2 user reads page, chooses a reference to more info
    3 user informs client of choice (e.g. "show me item #1," or
      "search for googoo")
    4 go to step 1

These systems often consist of a hierarchy of menus with text files
at the leaf nodes. The system allows the user to interactively navigate
the menus and browse leaf nodes.

But 1) the format of the menus is particular to the system (USENET
newsgroups/articles, unix directories/files, WAIS
source/database/document).
And 2) once a user is at a leaf node, the system can no longer
interactively follow references.

The novel aspect of hypertext is that the distinction between the menu
pages and the text pages disappears. In the world-wide-web, text
documents have machine-readable links inside them, and all menus are
represented as hypertext documents.

The WWW format works well, but it would benefit from use of MIME's
features.

For a common hypertext document format, I propose we define a subtype
of the MIME multipart message: X-HYPERTEXT. The first part of a
multipart/X-HYPERTEXT message is the content of the document, and the
remaining parts are multimedia attachments and links to other
documents. The content part contains references (by Content-ID) to the
attachments and links. The client software allows the user to
interactively choose references to display/follow.

The remaining parts may be attached image/audio/video using MIME's
various types and transfer encodings (text attachments would work too)
or they may be references to information accessible elsewhere using
MIME's message/external-body type. The parameters to the external-body
content-type provide the same information as WWW's Universal Document
Identifier. (MIME only defines ANON-FTP, FTP, TFTP, LOCAL-FILE and AFS.
The remaining access-types (WAIS, gopher, etc.) would be experimental
(X-WAIS, X-GOPHER) until standardized.)

The emerging standard for structured, platform-independent text is
SGML. The WWW project defines an SGML document type with traditional
elements (title, heading, paragraph, list) and new hypertext elements
(anchor). Soon it will have multimedia elements (image, audio).

The current design places external document references (to files, WWW
servers, WAIS documents, gophers, etc.) inside the SGML as attributes.
There are lexical incompatibilities, and the design is under strain. I
suggest that we implement references as SGML entities that identify
message/external-body parts by content-id.
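To make the proposed layout concrete, here is a minimal sketch in Python of how such a message could be put together: one content part, followed by one message/external-body part per link, each labelled with a Content-ID. The function name `build_hypertext`, the `<A ENTITY=...>` markup in the sample content, and the choice to quote every parameter value are assumptions made for illustration only; the host and name values are the ones used in the sample attachment earlier in this thread.

```python
def build_hypertext(content_sgml, links, boundary="attachment"):
    """Assemble a multipart/X-HYPERTEXT message as plain text.

    content_sgml: the content part; it refers to links by Content-ID.
    links: dict mapping a Content-ID to a pair
           (external-body parameters, content type of the target).
    """
    lines = [
        "MIME-Version: 1.0",
        f"Content-Type: multipart/X-HYPERTEXT; boundary={boundary}",
        "",
        f"--{boundary}",
        "Content-Type: text/SGML",
        "",
        content_sgml,
    ]
    for content_id, (params, target_type) in links.items():
        # One folded header line per parameter, all values quoted.
        param_text = "".join(f';\n\t{k}="{v}"' for k, v in params.items())
        lines += [
            f"--{boundary}",
            f"Content-ID: {content_id}",
            f"Content-Type: message/external-body{param_text}",
            "",
            # The body of an external-body part describes the target.
            f"Content-Type: {target_type}",
            "",
        ]
    lines.append(f"--{boundary}--")
    return "\n".join(lines) + "\n"


# Hypothetical content markup; only the Content-ID "project" matters here.
example = build_hypertext(
    'Here is a link to some info at CERN: <A ENTITY="project">cern stuff</A>',
    {
        "project": (
            {
                "access-type": "X-HTTP",
                "host": "info.cern.ch",
                "name": "/hypertext/WWW/TheProject.html",
            },
            "text/X-HTML",
        )
    },
)
print(example)
```

The content part never carries the address itself, only the Content-ID, so long addresses live in folded header parameters rather than inside an SGML attribute value.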
Representing document content in SGML allows the same information to be
accessed using different user interface paradigms (e.g. dumb terminals
vs. curses style vs. X Windows point-and-click).

Short of full SGML parsing, we could adopt the MIME text/richtext
format, with the addition of a ... tag. In fact, any representation
that allows the user to interactively indicate one of the attached body
parts by content-id will do. For example, plain text with one-line
descriptions would do. The Andrew ez data stream would also work, but
only Andrew sites could parse it.

This brings up the issue of format negotiation. No one format is
optimal for all information. Clients are likely to be able to process
information in several formats, and servers are likely to be able to
provide different representations. The various formats can be enclosed
in a MIME multipart/alternative message. And rather than including the
data for all formats in the message, the data could be in
message/external-body parts. The client chooses the type of data it
likes and retrieves the corresponding external-body. This (modified)
example from the MIME RFC may help explain:

    MIME-Version: 1.0
    Content-Type: multipart/alternative; boundary=42

    --42
    Content-Type: message/external-body;
         name="BodyFormats.ps";
         site="thumper.bellcore.com";
         access-type=ANON-FTP;
         directory="pub";
         mode="image";

    Content-type: application/postscript

    --42
    Content-Type: message/external-body;
         name="/u/nsb/writing/rfcs/RFC-XXXX.ez";
         site="thumper.bellcore.com";
         access-type=AFS;

    Content-type: application/x-ez

    --42
    Content-Type: message/external-body;
         name="BodyFormats.txt";
         site="thumper.bellcore.com";
         access-type=ANON-FTP;
         directory="pub";

    Content-type: text/plain

    --42--

The client can choose between postscript, ez, and plain text, and
retrieve the corresponding message body.

The question then becomes: how do these systems interoperate? By making
information available as multipart/X-HYPERTEXT MIME messages.
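The format negotiation step described above (the client picks one body from a multipart/alternative message) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three alternatives are transcribed by hand from the example message rather than parsed from it, and the names `ALTERNATIVES` and `choose_alternative` are made up for this sketch.

```python
# The three alternatives from the multipart/alternative example, written
# out by hand as (content type of the data, external-body parameters).
ALTERNATIVES = [
    ("application/postscript", {
        "access-type": "ANON-FTP", "site": "thumper.bellcore.com",
        "directory": "pub", "name": "BodyFormats.ps", "mode": "image",
    }),
    ("application/x-ez", {
        "access-type": "AFS", "site": "thumper.bellcore.com",
        "name": "/u/nsb/writing/rfcs/RFC-XXXX.ez",
    }),
    ("text/plain", {
        "access-type": "ANON-FTP", "site": "thumper.bellcore.com",
        "directory": "pub", "name": "BodyFormats.txt",
    }),
]


def choose_alternative(alternatives, preferred_types):
    """Return the alternative the client likes best, or None.

    preferred_types lists the content types the client can display,
    most preferred first. Comparison ignores case, as MIME types do.
    """
    for wanted in preferred_types:
        for content_type, params in alternatives:
            if content_type.lower() == wanted.lower():
                return content_type, params
    return None


# A dumb-terminal client that can show GIFs and plain text, but not
# PostScript or ez, ends up with the plain text version.
choice = choose_alternative(ALTERNATIVES, ["image/gif", "text/plain"])
print(choice)
```

Only the chosen external body then has to be fetched; the other representations cost nothing beyond their headers.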
The WWW client interfaces to the other systems by defining "addressing
schemes", implementing the various protocols, and translating the data
into HTML. Gopher has a similar typing scheme -- one character is
reserved to indicate the access type and the data type. WAIS clients
have yet another method of resolving types, though they only support
one protocol. The NewsGrazer application has its own encapsulation
mechanism.

This is becoming a mess. In the short term, global hypertext viewers
will have to support the access-type and content-type of each system
with which they interoperate (so we have X-WAIS, X-HTTP, X-GOPHER,
X-NNTP, as well as X-WAIS-SRC, X-HTML, X-GOPHER-1 thru X-GOPHER-9).
Some of the access types will become standard, and some will die out.
But all the data types should be encapsulated in MIME messages.

Any data that has machine-readable pointers to other data should be
made into a multipart/X-HYPERTEXT message. For example, a WAIS question
should have attachments for each of the result documents (the content
part can stay application/x-wais-question, or it could be converted to
a text type, or both), at least in the case where those documents are
available by some standard access method. [I wrote a perl script that
will change an HTML document into a MIME message with attachments.]

Leaf documents, i.e. documents with no external links, can stay in
single part types. E.g. plain text files become MIME messages by simply
adding a blank line at the beginning (to separate the headers (none)
from the body).

Under this model, a mail message can point to a news article which
references a WAIS document which contains several drawings and pointers
to several more available by FTP, and a user could just point-and-click
between them. The only need for protocols like gopher and HTTP is to
encapsulate data that's not already MIME compliant.

This is clearly a pipe dream, but it's the kind of thing we can work
towards today.
Dan

======================================================================
From: mitra@pandora.sf.ca.us ()
To: connolly@pixel.convex.com, www-talk@nxoc01.cern.ch, wais-talk@think.com
Subject: MIME for global hypertext
Date: Mon, 8 Jun 92 13:11:15 PDT

Dan,

Thanks for that proposal. I must admit to not having read the MIME RFC,
being mostly concerned with text rather than multimedia, so I wasn't
aware of the hypertext implications of it.

My question is on a fairly minor point of your document. You mention
that a MIME document typically consists of a content and then the
pointers, with the hypertext links being references to the pointers. In
WAIS, it is quite possible to return part of a document (by byte
position), and if the pointers are part of the document itself then
they may not have been returned at the time the user chooses to try and
follow a link.

My concerns are around doing these things for users on low-speed (2400
baud) modems. For them, protocols need to be easy to handle at slow
speed, and need to be meaningful BEFORE the whole document has been
received. As the Internet extends out to more and more users beyond the
high-speed links currently assumed, the need for protocol designers to
consider those users becomes more important.

- Mitra

------------------------------------------------------------------
Mitra - technical director, Pandora Systems    mitra@pandora.sf.ca.us

======================================================================
From: Dan Connolly
To: mitra@pandora.sf.ca.us ()
Cc: wais-talk@think.com, www-talk@nxoc01.cern.ch
Subject: Re: MIME for global hypertext
Date: Mon, 08 Jun 92 15:50:17 CDT

>My question is on a fairly minor point of your document, you mention that
>a MIME document typically consists of a content and then the pointers,
>with the hypertext links being references to the pointers.

Well, this is not typical, but it's the model I'm proposing for hypertext.
Typically MIME message bodies are either single part text/image/audio,
or multipart. The standard multipart types are mixed, meaning "show
these one after the other," parallel, meaning "show these at the same
time," or alternative, meaning "these all represent the same info. Take
your pick." The "content and then list of pointers [or attachments]"
model is my own proposed format for hypertext.

> In Wais, it
>is quite possible to return part of a document (by byte position), and
>if the pointers are part of the document itself then they may not be
>returned at the time the user chooses to try and follow a link?
>

I would suggest that the WAIS server interpret the byte positions as
offsets into the content part of the hypertext, so the structure
remains intact.

Byte offsets into a MIME multipart message don't mean much. Transport
systems may mess with the headers and trailing whitespace on body
lines. Line offsets may be meaningful inside text body parts, as long
as none of the lines have to be split due to line length constraints.

Keep in mind that this multipart structure is only necessary for
hypertext (i.e. contains links) and hypermedia (i.e. contains
multimedia attachments) documents. Traditional documents can be simple
single part bodies. For example, a plain text file starting with a
new-line will be interpreted as a body part with no headers, which
defaults to the type "text/plain; charset=US-ASCII", i.e. plain old
text.

>My concerns are around doing these things for users on low-speed (2400 baud)
>modems....

======================================================================
From: connolly@pixel.convex.com (Dan Connolly)
To: www-talk@nxoc01.cern.ch
Cc: enag@ifi.uio.no
Cc:
Subject: Re: using NOTATIONs inline
Date: Mon, 8 Jun 92 00:17:48 -0500

In article <23177A@erik.naggum.no> you write:
>Dan Connolly writes:
>|
>| The WWW group is attempting to define a multimedia interchange
>| format called HTML. . . .
>
>Why not use HyTime?
>

Erik:

Partly because of ignorance (we've heard of HyTime, but we don't know
the details). I'd expect a HyTime engine to be quite a bit of work to
implement. And partly because, as I understand it, HyTime doesn't go as
far as to prescribe a DTD. The WWW project needs one particular
language, not a whole architecture.

I'd certainly like to know more about HyTime's techniques for
addressing documents, esp. elements of documents.

Now for the WWW gang:

>:
>| That is, is it possible to put an arbitrary 8 bit binary stream
>| _inside_ an SGML document? My guess is: no. But if we use
>| CDATA, can we include anything that doesn't contain the closing
>| tag in full?
>
>If you by "the closing tag in full" mean the entire end-tag, complete
>with etago, generic identifier, and tagc, as in "", this is not
>the way SGML does it. CDATA and SDATA are terminated by an etago
>"delimiter-in-context", which is an etago (end-tag open, "</")
>followed by a name start character, or a grpo (group open, "(")
>delimiter if concurrent document types are allowed. In the reference
>concrete syntax, this means that the regular expression "matches the end of CDATA and SDATA elements.
>
>You can also use marked sections, with a CDATA status keyword, in which
>case the CDATA is terminated by the mse delimiter (marked section end,
>"]]>").
>
>:
>| Someone made the point that an SGML document is only allowed to
>| include SGML characters as specified by the SGML declaration, and if
>| we're going to use the default SGML declaration, we have to stick to
>| the characters blessed by it.
>
>Blessed and blessed. The SGML declaration is supposed to reflect the
>reality of the document, not enforce arbitrary limits on them. So you
>write an SGML declaration which fits the document.
>
>| That's not my understanding. I thought that inside CDATA (or SDATA,
>| I think) you could put _anything_ but the closing tag in full.
>
>As said above, the etago delimiter-in-context terminates the data,
>regardless of whether it's a legal end-tag in that context.
>
>You should be aware that the SGML parser will parse the contents of the
>"binary" content, and ignore record start, and treat record ends
>different from other characters. In addition, it's an error for an SGML
>entity to contain characters with any of the numbers listed in the
>SHUNCHAR part of the SYNTAX declaration. This is _not_ what you want
>with binary data.
>
>| What's the scoop? Do we have to use external entities for raw data?
>
>Yes. An external entity that is not an SGML text entity requires a
>notation identifier, so you only need to list the entities in the DTD,
>with notation, and refer to them by name in the document instance.
>
>If this is not satisfactory, you should declare the objects to be CDATA,
>and use a binary to text-only transformation scheme. There are several
>such schemes. Among them, base64 is the preferred encoding in my view,
>since it's available as part of the new Multipurpose Internet Mail
>Extensions (MIME) RFC-to-be. (The latest draft is available for
>anonymous FTP as ftp.ifi.uio.no:/pub/SGML/MIME.6.ps and MIME.6.txt for
>two weeks from today. Section 5.2 which concerns the base64 encoding is
>also available as ftp.ifi.uio.no:/pub/SGML/base64.txt.) Transformation
>back to the binary form from the text-only form may be done on the fly
>by the application before sending the data to the notation interpreter.
>

My idea is to use MIME encodings, but put these attachments _outside_
the SGML text, in an attached (or external) body part.

>In addition to being much easier to deal with in SGML, this also makes
>SGML documents containing such content robust with respect to file
>transfer, etc.
>
>Hope this helps,
>

Thanks. Mostly it confirms my suspicions, but it should also provide a
somewhat authoritative answer (no references to ISO 8879 here :-) to
the WWW project.
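As a small illustration of the binary-to-text option Erik describes, the Python sketch below base64-encodes arbitrary bytes and checks that the result contains no "<" at all, so the "</" delimiter that would end CDATA cannot turn up by accident. The helper names `encode_for_cdata` and `decode_from_cdata` are hypothetical; only the standard `base64` module is used.

```python
import base64


def encode_for_cdata(raw: bytes) -> str:
    """Turn arbitrary bytes into base64 text, wrapped in short lines."""
    return base64.encodebytes(raw).decode("ascii")


def decode_from_cdata(text: str) -> bytes:
    """Recover the original bytes from the base64 text."""
    return base64.decodebytes(text.encode("ascii"))


# Every possible byte value, including ones a parser would reject raw.
raw = bytes(range(256))
encoded = encode_for_cdata(raw)

# The base64 alphabet is A-Z, a-z, 0-9, "+", "/" and "=", plus line
# breaks, so "<" never appears in the encoded text.
print("<" in encoded)
print(decode_from_cdata(encoded) == raw)
```

The same encoded text works equally well inside the SGML or in a separate MIME body part, which is the placement proposed in this message.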
>--
>Erik Naggum        |  +47-295-0313  |  ISO 8879 SGML     |  Memento,
>Naggum Software    |  "fuzzface"    |  ISO 10744 HyTime  |  terrigena.
>Boks 1570, Vika    |                |  JTC 1/SC 18/WG 8  |  Memento,
>0118 OSLO, NORWAY  |                |  SGML UG SIGhyper  |  vita brevis.

======================================================================
From: davis@willow.tc.cornell.edu (Jim Davis)
To: www-talk@nxoc01.cern.ch
Subject: HTML terseness/verbosity
Date: Mon, 8 Jun 92 09:28:20 EDT

Re the recent comments on terseness of UDIs and the extra verbosity in
Dan Connolly's proposal to use MIME for WWW documents:

My understanding is that nobody should have to type "naked" SGML (or
HTML or MIME-language) anyway. We should have programs like WYSIWYG
editors manipulating the markup for us. (Now of course at present we do
have to type HTML, at least I do here, but hopefully this will not
persist.)

If that's right, then the more explicit and simple the document
structure is, and the easier it is for programs to parse and
manipulate, the better off we are.

One thing I like about Dan's proposal: it makes it possible to collect
a hyperdocument into a single file (by embedding the docs within one
MIME file), which will make transporting easier.

======================================================================
From: timbl@zippy.lcs.mit.edu (Tim Berners-Lee)
To: connolly@pixel.convex.com, enag@ifi.uio.no, www-talk@nxoc01.cern.ch
Cc: timbl@zippy.lcs.mit.edu
Subject: MIME, SGML, UDIs, HTML and W3
Date: Thu, 11 Jun 92 12:22:56 -0400

I have printed off the recent discussion on the new HTTP, HTML, MIME
and UDIs and done what I can to disentangle it all in my mind. I will
reply in one message, because many of the points are linked. I know
this should be hypertext, with references, but (a) I am away from home
and (b) we don't yet have a universal mail/news archive server running
to link to.

HTTP and HTML

First of all, Jean-Francois points out very properly that the enhanced
HTTP protocol and the enhanced HTML spec are quite separate things, and
should be specified separately. I agree wholeheartedly about all this,
and I apologize for muddling the levels up till now.

(As a small aside, I would point out that whereas a HTERR file is not
very useful, a HTFWD file IS. It is like a hypertext soft link. But I
am happy to leave that as a separate type of file. It should certainly
get a different extension so that it gets a different icon.)

HTTP: SGML vs ASN.1

Let's look at the HTTP protocol first. Carl is mapping out the
requirements for this, and assuming that SGML would be a reasonable
representation for it in practice. And so it is. When the requirements
are clear, it would certainly be interesting to look at mapping them
onto a Z39.50-style ASN.1 implementation. This would be useful for two
reasons. First, the comparison would point out to us things in Z39.50
which we might not have thought of which would be useful for HTTP.
Second, the comparison might give a nice short, or at least
well-defined, list of things which the WAIS guys might like to take
into account in the next version of their protocol. (I demoed W3 to
Brewster, who hadn't seen it live before, and he was very keen that
WAIS and W3 should merge, changing the WAIS protocol if necessary.)

There is no reason why we shouldn't try both protocols. If they map
well onto each other, it's just a question of having two separate
parsers at the low level, building the same internal structures.

When we're talking about an SGML representation, and describe a file to
come later down the link, I don't think we have to use the NOTATION=
attribute with a notation type, because we won't in fact be talking
about the notation of an SGML element. The format in this case is not
something which the SGML parser is aware of.

I must admit I was disappointed to learn that SGML didn't allow for any
way of including 8 bit data.
Thanks, Erik, for your explanations.

MIME and SGML

Dan rightly points out the relevance of the coming MIME standards.
There are several things which we must separate here, though:

    1. The MIME classification of data formats
    2. The MIME format for multi-part messages
    3. The MIME format for rich text.
    4. The MIME format for external document addresses (MIME UDIs)

1. MIME classification of data formats

We must do the same disentangling job on MIME which JF did on HTML.
First of all, the MIME job of classifying data formats is a useful job
which is ideally done by just one bunch of people. There has been some
suggestion that the MIME classifications are not well enough defined,
but they seem to be the best effort yet and one can only assume they
will evolve in the right direction. So I'd back the use of these for
W3.

2. The MIME format for multi-part messages

This is necessary for sending a multi-part document over a mail link.
We have to ask ourselves whether it is reasonable to use over a binary
link. Personally, my initial impression is that the MIME stuff, using
as it does terminators such as --xxx-- separated by blank lines, looks
more horrible to work with in this respect than SGML! Still we have the
problem of restrictions on the content: it must not contain delimiters,
it is limited to a 7 bit character set, it is line oriented; in fact,
all the things which email carries as a restriction. This is really
taking on board a legacy of all the mail which has evolved over the
years. Do we need that for our new ultra-fast hypertext access
protocol?

[Compare the MIME format with the rather cleaner NeXT Mail format,
which is, as far as I understand, simply a uuencoded compressed tar
file of all the bits, where uuencoding is designed as an optimal way of
getting over mail transport restrictions, compress does what it says,
and tar is a multipart wrapper designed for that only.
Not standard outside unix, perhaps, but cleaner in that the mail
formatting is done at the last minute and doesn't affect the other
operations.]

Of course, with HTTP2, multipart/alternative shouldn't be needed.

Multipart for hypertext?

Now, Dan not only suggests the use of this for multipart messages, but
also suggests that a hypertext document should necessarily contain many
parts, one in SGML and one for each link as a MIME external document.
This means that an SGML hypertext document can never stand on its own!
An SGML parser will always need to have a MIME parser sitting just
outside. I don't like this: I feel we have to separate these two
things.

Suppose that an SGML document does want to be sent in a MIME message
and does want to refer to other parts of that MIME message. In that
case, it seems reasonable to have a format for that. However, when an
SGML document is seen by itself, and refers to a news message for
example, then there is no reason for it not to be able to contain a
complete reference within itself. When SGML documents include other
files, then the SYSTEM value is typically a file name. It is a
reference to something outside. The precedent is set that SGML
documents are allowed to refer to things outside.

I think part of your objection, Dan, is based on a dislike of the UDI
syntax -- which I'll come to later.

3. The MIME format for rich text.

Here, I am not so impressed. Basically, the MIME people are at the same
level that we were at before we started this cleanup: they have
SGML-LIKE stuff which isn't SGML. As it's not difficult to make it
SGML, they should do that.

Comparing MIME's rich text and HTML, I see that we lack the character
formatting attributes BOLD and ITALIC, but on the other hand I feel
that our treatment of logical heading levels and other structures is
much more powerful and has turned out to provide more flexible
formatting on different platforms than explicit semi-references to font
sizes.
This is borne out by all the systems which use named styles in preference to explicit formatting, LaTeX or other macros instead of TeX, etc etc.

So technically, HTML has some things to give MIME's rich text. Are the MIME people still open to additions? If not, I would suggest we add BOLD and ITALIC (or two emphasis styles for characters), and keep HTML separate from MIME's rich text, proposing it as a MIME text standard. (HP0 and HP1 were in the HTML spec but as unimplemented)

4. The MIME format for external document addresses (MIME UDIs)

As Ed says, this is a bit of a non-issue, as MIME addresses and current style UDIs map onto each other. However, we have to agree on a "concrete syntax" (or two... :-) in the end.

It's like the difference between an x400 style mail address generated from an internet address, and that internet address. Which do you prefer

    timbl@zippy.lcs.mit.edu

where the sections of the domain name are defined to have no semantics at all, or

    S=timbl; HO=zippy; OU=lcs; O=MIT; SECTOR=edu

(this is not real x400 - don't use it!) or

    user=timbl host=zippy group=lcs organization=mit sector=education

You say, Dan, that you "don't think [UDIs] work". Do you mean people don't use them in all correspondence? Well, what DO they use? They use ange-ftp addresses for FTP (like info.cern.ch:/pub/www/doc/*.ps), which are even more terse than UDIs! They use news message-ids which are UDIs.

Let me say that I personally don't much care about the arbitrary punctuation. There are a few things, though, which are important:

- The thing should be printable 7-bit ASCII. Unlike arbitrary document formats, UDIs must be sendable in the mail

- White space should not be significant. I would accept the presence of some arbitrary white space as a delimiter, but one cannot distinguish between different forms and quantities of white space. This is because things get wrapped and unwrapped.

Dan, you object to UDIs because they don't contain white space.
But that is purely so that you CAN wrap them onto several lines and still recuperate them. You can put white space in but it shouldn't mean anything. (This is not possible in W3 as is but it is in the UDI document)

I don't see why you say they can't be put as an SGML attribute. They are just text strings. They will be quoted of course (Yes, I know the old NeXT browser doesn't quote them). Is that not allowed? What are the problem characters? If there are SGML problem characters in the UDI spec, they probably are ruled out of SGML for a reason.

(I recently saw a galley proof of an article in which our mail address had been hyphenated! UDIs must be squeezable into 2 inch columns.)

There is a semantic difference between a tagged list and a punctuation-divided set, and that is that the former has defined semantics but the latter doesn't and can therefore be extended more easily.

I suggest that tagging could be used for the four bits of an address that must be separable by all sides, which are limited in number (4). Within those bits, the string should be transparent as the protocol does not require every party to understand the innards. The bits are

                        MIME          Used by

    name space:         ACCESS        client
    server details:     HOST, PORT    client, protocol-dependent
    local doc id:       PATH          server only
    anchor id:          (none)        presentation application only

It seems useful to maintain the ability to work out which bits are seen by whom.

I only used punctuation to separate these parts in the W3 UDI because people like internet addresses and mail addresses and filenames and telephone numbers and message-ids and room numbers and zip codes which don't have tags and do make do with punctuation. If the groundswell of opinion on this list is that tags are better, then let's use tags!

Whatever we use, it should be as quotable in an SGML attribute as in a MIME external reference as in a scribbled note or a link-pasteboard or whatever. (The U is for Universal, NOT Unique!)
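To make the four separable bits concrete, here is a minimal sketch that takes a punctuation-style address and returns it as a tagged set. It rests on several assumptions that are not in the message above: present-day URL syntax stands in for the W3 UDI syntax, "#" is taken as the anchor separator, Python's standard urllib.parse does the splitting, and the function name split_udi is purely illustrative. White space is discarded first, so a wrapped address gives the same result as an unwrapped one.

```python
from urllib.parse import urlsplit


def split_udi(udi: str) -> dict:
    """Split a punctuation-style address into tagged parts (illustrative only)."""
    # White space is not significant: drop all of it before parsing,
    # so an address wrapped over several lines parses identically.
    compact = "".join(udi.split())
    parts = urlsplit(compact)
    return {
        "ACCESS": parts.scheme,            # name space, used by the client
        "HOST": parts.hostname,            # server details, used by the client
        "PORT": parts.port,                # None when no port is given
        "PATH": parts.path,                # local doc id, used by the server only
        "ANCHOR": parts.fragment or None,  # used by the presentation application
    }


if __name__ == "__main__":
    wrapped = "http://info.cern.ch/hypertext/\n    WWW/TheProject.html#intro"
    for tag, value in split_udi(wrapped).items():
        print(tag, "=", value)
```

The two concrete syntaxes carry the same information; the tagged form only makes explicit which party needs to look at which part.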
PHILOSOPHY

In the W3 world, the model is of a dynamic world of documents which generally have some "home" (or several), which can be found, given the UDI, using sufficient intelligence and the help of one's friends.

A mail message has no home, and so in principle the parts of it have no home. When a hypertext multipart message (really consisting of multiple hypertext documents) has links between its parts they refer to each other within a completely isolated context.

There are now two possibilities when the message is in fact archived and made readable. One is we say that the parts are then addressed as parts of the message, wherever it may be.

The other is to say that the parts of the message are very likely things which had some original home. In that case, the message is just giving the receiver a copy to save him the (perhaps insurmountable) trouble of retrieving it. In this case the parts should be identified with their original UDIs so that the receiver is not confused with multiple documents which are in fact the same thing.

I think that's all the comments I have on what I've read so far..

Tim
________________________________________________________________
Tim Berners-Lee
World-Wide Web initiative
CERN, 1211 Geneva 23, Switzerland
timbl@info.cern.ch
Visiting MIT: NE43-513, (617)234 6016
timbl@zippy.lcs.mit.edu

======================================================================

From: Dan Connolly
To: timbl@zippy.lcs.mit.edu (Tim Berners-Lee)
Cc: enag@ifi.uio.no, www-talk@nxoc01.cern.ch
Subject: Re: MIME, SGML, UDIs, HTML and W3
Date: Thu, 11 Jun 92 20:31:08 CDT

Now my comments on your comments:

>There is no reason why we shouldn't try both protocols.
>If they map well onto each other, it's just a question
>of having two separate parsers at the low level, building
>the same internal structures.
>

On the other hand, I'd like to keep a telnet based protocol around -- maybe gopher is good enough.
>When we're talking about an SGML representation,
>and describe a file to come later down the link,
>I don't think we have to use the NOTATION= attribute with a notation
>type, because we won't in fact be talking about
>the notation of an SGML element.
>The format in this case is not something which the SGML
>parser is aware of.
>

I don't believe this is true. From the horse's mouth (Erik Naggum, that is):

----
| What's the scoop? Do we have to use external entities for raw data?

Yes. An external entity that is not an SGML text entity requires a notation identifier, so you only need to list the entities in the DTD, with notation, and refer to them by name in the document instance.
----

>1. MIME classification of data formats
>
> So I'd
> back the use of these for W3.
>

Yeah!!

>
>2. The MIME format for multi-part messages
>
> This is necessary for sending a multi-part
> document over a mail link. We have to ask ourselves
> whether it is reasonable to use over a binary link.
> Personally, my initial impression is that the MIME
> stuff, using as it does terminators such as
> --xxx-- separated by blank lines, looks more horrible
> to work with in this respect than SGML!

The algorithm to separate a MIME multipart message into its parts is simply: search the data stream for CRLF--boundary--CRLF. It can be done by a finite state machine. Even the simplest SGML documents require a pushdown automaton to parse.

> Still we have
> the problem of restrictions on the content:
> Must not contain delimiters, limited 7 bit character set,
> line orientation, in fact all the things which email
> carries as a restriction. This is really taking on board
> a legacy of all the mail which has evolved over the years.
> Do we need that for our new ultra-fast hypertext access
> protocol?
>

No, we don't. MIME _allows_ transfer of data over 7 bit ASCII channels, but it hardly requires it.
The Content-transfer-encoding can be:

    7 bit (default):   line oriented 7 bit data
    8 bit:             line oriented 8 bit data
    binary:            raw 8 bit data, no CRLF's required
    base64:            uuencode standardized
    quoted-printable:  text with escape sequences

The MIME standard explicitly supports expansion to 8 bit transport mechanisms.

> [Compare the MIME format with the rather cleaner NeXT
> Mail format which is as far as I understand simply
> a uuencoded compressed tar file of all the bits, where
> uuencoding is designed as an optimal way of getting over
> mail transport restrictions, compress does what it says
> and tar is a multipart wrapper designed for that only. Not
> standard outside unix, perhaps, but cleaner in that the
> mail formatting is done at the last minute and doesn't
> affect the other operations]
>

It was a requirement of MIME that the structure of the document be accessible without decoding or uncompressing data, especially since MIME messages are recursive and complex messages might otherwise go through more than one encoding. Compression was not addressed by the MIME standard, and uuencode doesn't make it through some gateways.

> Of course, with HTTP2, multipart/alternative shouldn't
> be needed.
>

What does HTTP2 define that obviates the multipart/alternative type?

> Multipart for hypertext?
>
> Now, Dan not only suggests the use of this for
> multipart messages, but also suggests that a hypertext
> document should necessarily contain many parts,
> one on SGML and one for each link as a MIME external document.
> This means that an SGML hypertext document can never stand
> on its own!

That's exactly the point. Anything besides text should be handled as an external entity to be resolved by the parsing system. I just suggested that a portable way to resolve SGML external entities is to refer to MIME attachments.

> An SGML parser will always need to have
> a MIME parser sitting just outside. I don't like
> this: I feel we have to separate these two things.
>

Well, it has to have something sitting outside. The SGML parsers I've seen resolve system entities using the file system. I proposed we use a MIME message like a mini file system, with links to other file systems.

> Suppose that an SGML document does want to
> be sent in a MIME message and does want to
> refer to other parts of that MIME message. In that case,
> it seems reasonable to have a format for that.
> However, when an SGML document is seen by itself, and
> refers to a news message for example, then there is
> no reason for it not to be able to contain a
> complete reference within itself.
>

OK, I can see that we should be able to resolve the lexical issues and put the whole UDI/MIME access specification inside the SGML document.

But what about multimedia web nodes? SGML describes text and references to other texts just fine. But if we want a format that can include more than just text, I don't think we should try to fit it _inside_ SGML.

I think SGML should be used to convey text and document structure. But I still like the idea of wrapping it in a MIME message for multimedia interoperability.

>3. The MIME format for rich text.
>
> Here, I am not so impressed.

Nor am I.

>4. The MIME format for external document addresses (MIME UDIs)
>
> As Ed says, this is a bit of a non-issue,
> as MIME addresses and current style UDIs map onto
> each other. However, we have to agree on a "concrete
> syntax" (or two... :-) in the end.
>

Exactly. And why not the MIME concrete syntax?

> Let me say that I personally don't much care about the
> arbitrary punctuation. There are a few things, though,
> which are important:
>
> - The thing should be printable 7-bit ASCII.
>

MIME: check.

> Unlike arbitrary document formats,
> UDIs must be sendable in the mail
>

MIME: check.

> - White space should not be significant. I would
> accept the presence of some arbitrary white space
> as a delimiter, but one cannot distinguish between
> different forms and quantities of white space.
> This is because things get wrapped and unwrapped.
>

MIME: check.

> Dan, you object to UDIs because they don't
> contain white space. But that is purely so that
> you CAN wrap them onto several lines and still
> recuperate them. You can put white space
> in but it shouldn't mean anything. (This is not possible
> in W3 as is but it is in the UDI document)
>

I must not have read the UDI document closely. I certainly got the impression that a UDI should look like one word when "written on the back of an envelope."

> I don't see why you say they
> can't be put as an SGML attribute. They are just
> text strings.

The WAIS UDIs are huge. An SGML declaration defines a maximum for the length of an attribute value. The default value is ... oh. ahem. it's 960. I think the MIME 72 character line length is a little more restrictive than that :-)

> They will be quoted of course
> (Yes, I know the old NeXT browser doesn't quote them)
> Is that not allowed? What are the problem characters?
> If there are SGML problem characters in the UDI spec, they
> probably are ruled out of SGML for a reason.
>

Good question. These are the things we should research before we go _any_ further implementing this stuff.

> Whatever we use, it should be as quotable in an SGML
> attribute as in a MIME external reference as in a
> scribbled note or a link-pasteboard or whatever.
> (The U is for Universal, NOT Unique!)
>

Here's an idea for a quoting strategy for the four parts: Either

a) it's a quoted string delimited by "" with \" allowed in the middle, or
b) it's a base-64 representation of an arbitrary binary stream.

Just an idea. I'm late for an appointment. Gotta go.

Dan

======================================================================
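The two-option quoting strategy at the end of Dan's message could look roughly like the sketch below. It is only one reading of the idea, not an agreed format: the function names are hypothetical, escaping the backslash itself is an added assumption (the message only mentions \"), and Python's standard base64 module stands in for whatever encoder an implementation would use. The base-64 decoder ignores white space, which fits the requirement that wrapping must not change the meaning.

```python
import base64


def quote_part(value: str) -> str:
    """Option (a): a string delimited by "" with \\" allowed in the middle."""
    # Assumption: the backslash is escaped too, so that decoding is unambiguous.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def unquote_part(quoted: str) -> str:
    """Reverse quote_part()."""
    if len(quoted) < 2 or quoted[0] != '"' or quoted[-1] != '"':
        raise ValueError("not a quoted string")
    body = quoted[1:-1]
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            i += 1          # skip the backslash, keep the character after it
            ch = body[i]
        out.append(ch)
        i += 1
    return "".join(out)


def encode_part(data: bytes) -> str:
    """Option (b): base-64 representation of an arbitrary binary stream."""
    return base64.b64encode(data).decode("ascii")


def decode_part(text: str) -> bytes:
    """Reverse encode_part(); white space (e.g. from line wrapping) is ignored."""
    return base64.b64decode("".join(text.split()))


if __name__ == "__main__":
    print(quote_part('/pub/www/doc/"draft".ps'))
    print(encode_part(b"\x00\xff"))
```

Option (a) stays readable in a scribbled note; option (b) gives up readability but accepts any byte value.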
