Journal Archiving and Interchange Tag Suite

Frequently Asked Questions

What is the relationship between the NLM DTDs and the NISO JATS?
Will the NLM DTD website remain available?
Are there XSL style sheets for the suite?
What is this tag suite all about?
Why not use W3C XML Schema?

What is the relationship between the NLM DTDs and the NISO JATS?

The Journal Article Tag Suite (JATS) project is a continuation of the NLM DTD project. JATS version 1.0 is an update to and is fully backward-compatible with version 3.0 of the NLM Archiving and Interchange Tag Suite (NLM DTDs).

For more information, please see JATS and the NLM DTDs.

Will the NLM DTD website remain available?

Yes. This website (dtd.nlm.nih.gov) and all of the documentation and resources herein will continue to be available.

Are there XSL style sheets for the suite?

XSL style sheets are available for previewing content in the Journal Publishing DTD. There are separate style sheets for content tagged in version 3.0 and content tagged in all previous versions. All are available on the FTP site and through the Tools section of this documentation.

What is this tag suite all about? (2)

WHAT IS XML (Extensible Markup Language)

XML, in the narrowest sense, is a data format that uses "tags" (machine- and human-readable codes) to identify pieces of content in a datastream. That datastream may be a journal article, a letter to the editor, an invoice, an email message, a novel, or almost any other document.

A tag is a word prefixed by the symbol "<" (or "</" for an end tag) and followed by the symbol ">" or "/>". Tags normally come in pairs: a start tag tells where the piece of content starts and an end tag says where it ends. A tag that closes with "/>" stands alone and marks an element with no content. Here are a few tags with their text content.
```
<phone-number>911</phone-number>
<surname>Jones</surname>
<article> ... full text of article here ... </article>
```
The XML hype calls this "self-describing data".

Once the component pieces in your data have been "tagged", computer programs can be written to do things with the data: format it for printing on paper, publish it to a webpage, count things, sort things, list things, extract subsets, rearrange the order, etc.
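
As a minimal sketch of what such a program might look like, the Python below uses the standard library's xml.etree.ElementTree module to count, list, and extract from a small tagged datastream. The <directory> and <entry> wrapper tags and the sample values are invented for illustration; only <surname> and <phone-number> come from the examples above.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged data: <directory> and <entry> are made-up wrapper tags.
DATA = """<directory>
  <entry><surname>Jones</surname><phone-number>911</phone-number></entry>
  <entry><surname>Adams</surname><phone-number>555-0100</phone-number></entry>
  <entry><surname>Jones</surname><phone-number>555-0199</phone-number></entry>
</directory>"""

root = ET.fromstring(DATA)

# Count things: how many entries are there?
entry_count = len(root.findall("entry"))

# Sort and list things: the distinct surnames, in alphabetical order.
surnames = sorted({entry.findtext("surname") for entry in root.iter("entry")})

# Extract a subset: only the phone numbers that belong to a Jones.
jones_numbers = [
    entry.findtext("phone-number")
    for entry in root.iter("entry")
    if entry.findtext("surname") == "Jones"
]

print(entry_count, surnames, jones_numbers)
```

None of this code needs to know what a surname or a phone number looks like; it relies only on the tags.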

SHARED TAG SETS

Groups of people (industries, vertical markets, standards associations, etc.) make up sets of XML tags to describe their data. There are hundreds of these. The eBusiness/eCommerce folks have tags for invoices, quantity-ordered, and trust-levels. Pharmaceutical folks have tags for dosage, side-effect, and drug-name. The scientific and technical journal publishers have tags for genus-species, footnote, and authors-affiliation. The recipe folks have tags for ingredient, cooking-time, and cooking-method. The smoke stack emissions folks have tags for CO2-level, mean-temperature, and particle-size. The travel agents have smoking-room-preference, elite-membership-status, and surname.

DTDs (Document Type Definitions) AND SCHEMAS

With everyone making up the tags they need to describe their business rules, this could be chaos. But it's not. When you devise a tag set, you also devise a set of "rules" that a document must follow. You don't just say "I have 7 tags: novel, price, chapter, paragraph, title, authors-name, and ISBN"; you make rules for how those tags may be used -- for the relationships among them. You would say:
- A novel must contain the following tags in order:
  1. title
  2. authors-name
  3. ISBN
  4. price (optional)
  5. chapter (which may repeat)
- A chapter must contain
  1. title
  2. paragraph (which may repeat)
- title, authors-name, ISBN, price, and paragraph contain text.

Such sets of rules are written down formally in DTDs and schemas (these are different styles of expressing such rules).
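
To make the idea of "rules" concrete, here is a rough Python sketch that checks a document against the novel rules listed above. It is not a DTD or a schema, and it is not how any real validator is implemented; it simply writes each rule as a regular expression over the names of an element's children, which is close in spirit to a DTD content model. The CONTENT_MODELS table, the validate function, and the sample documents are all invented for this illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content models for the novel example. Each one is a regular
# expression matched against the child element names, each followed by a comma.
CONTENT_MODELS = {
    "novel": r"title,authors-name,ISBN,(price,)?(chapter,)+",
    "chapter": r"title,(paragraph,)+",
}
# Tags that may contain text but no child elements.
TEXT_ONLY = {"title", "authors-name", "ISBN", "price", "paragraph"}


def validate(element):
    """Return a list of rule violations found in element and its descendants."""
    errors = []
    children = "".join(child.tag + "," for child in element)
    if element.tag in CONTENT_MODELS:
        if not re.fullmatch(CONTENT_MODELS[element.tag], children):
            errors.append(f"<{element.tag}> has invalid children: {children or '(none)'}")
    elif element.tag in TEXT_ONLY:
        if len(element) > 0:
            errors.append(f"<{element.tag}> must contain only text")
    else:
        errors.append(f"<{element.tag}> is not a known tag")
    for child in element:
        errors.extend(validate(child))
    return errors


GOOD = """<novel>
  <title>Example Novel</title>
  <authors-name>A. Writer</authors-name>
  <ISBN>0-000-00000-0</ISBN>
  <chapter><title>One</title><paragraph>It began.</paragraph></chapter>
</novel>"""

# Same tags, but ISBN precedes authors-name and the chapter has no paragraph.
BAD = """<novel>
  <title>Example Novel</title>
  <ISBN>0-000-00000-0</ISBN>
  <authors-name>A. Writer</authors-name>
  <chapter><title>One</title></chapter>
</novel>"""

good_errors = validate(ET.fromstring(GOOD))
bad_errors = validate(ET.fromstring(BAD))
print(good_errors)
print(bad_errors)
```

Both documents use exactly the same seven tags; only the second one breaks the rules about order and required content, and that is the kind of thing a DTD or schema lets software detect.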

Why make such rules? To communicate. I can now tell you what all my tags are and how they are used correctly. You can now know how to search and how to write software to process my tagged data to extract just the parts you need (for example, the book name, price, and ISBN for your online book catalog). I can use the single tagged XML source to make web pages, printed articles, and other output products.

If all the journal articles in an online repository such as PubMed Central use the same tags and the same rules, you can search them consistently, write ONE set of software to deal with them, and preserve the intellectual content of the journal articles for years to come.

JOURNAL PUBLISHING DTDS

Journal publishers were among the first to realize the usefulness of SGML (XML's parent language) and XML. Some of them have been tagging articles since the late 1980s. DTDs are currently the schema language of choice in journal publishing, for a variety of technical and historical reasons. There are several sources for tag sets and DTDs for journal publishers:
1. publisher-written (Elsevier, Wiley, Blackwell, et al.)
2. consortium-written (AAP, ISO 12083, DocBook-Lite, et al.)
3. repository and aggregator-written

These tag sets usually describe the structure of a journal article (sections, titles, paragraphs, footnotes, sidebars, etc.) and the metadata (data *about* the article) such as the author's name, the name of the journal in which it was published, the issue number, the starting page for print journals, the publication date, etc. The exact name of each component, what the content can be, and the order of components vary WIDELY from one tag set to another. (Paragraph could be named <paragraph>, <p>, <para>, <text-unit>, or other, and a paragraph may include within its text figures, lists, genus-names, tables, or none of the above. Journal A may identify genes, Journal B map coordinates.)

The New Public DTDs

Many DTDs are proprietary. Some of the ones available for public use (e.g., ISO 12083) have not been updated to meet modern practice. There are dozens that a publisher could choose from.

The intent of the new Archiving and Interchange DTD is to make a single DTD that reflects the current practices of the XML journal publishing community. The idea is that it be easy to convert from the XML and SGML that journal publishers have now into a single repository format. (1)
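
As a toy illustration of what "converting into a single repository format" involves, the Python sketch below renames several of the paragraph tag names mentioned earlier to one common name. The alias table, the normalize function, and the two sample articles are assumptions made up for this example; a real conversion into the Archiving and Interchange tag set involves far more than renaming tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from publisher-specific paragraph names to one common name.
PARAGRAPH_ALIASES = {"paragraph": "p", "para": "p", "text-unit": "p"}


def normalize(xml_text):
    """Rename every aliased paragraph element and return the document as a string."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = PARAGRAPH_ALIASES.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


# Two made-up articles from publishers with different tag sets.
publisher_a = "<article><paragraph>Some text.</paragraph></article>"
publisher_b = "<article><para>Some text.</para></article>"

normalized_a = normalize(publisher_a)
normalized_b = normalize(publisher_b)
print(normalized_a)
print(normalized_b)
```

Once both articles use the same tag, one set of search and processing software can handle either of them.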

Why not use W3C XML Schema?

W3C Schema versions of the Archiving and Interchange DTD and the Journal Publishing DTD are now available.

The DTD authors considered writing the "master" version of these tag sets in W3C XML Schema before we wrote the DTD. We understand that many people want to use XML schemas, and that there are XML tools that use only W3C XML Schema.

We chose to express the "master" version of these tag sets in DTD form for several reasons (described below). This does not mean that there won't be versions in W3C XML Schema, and in RELAX NG, and perhaps in other schema languages as well.

Remember: Our double mandate for the archival DTD was
- to maintain the intellectual content of extant XML- and SGML-tagged documents and
- to make it easy for publishers and archives to transform their documents from XML and SGML formats into our XML tagset.

Among the reasons we chose to express the model in DTD syntax first are:

MODULARITY - We modularized the DTD in ways that allow, for example, multiple table models to be used. The table models are in self-contained modules and may be swapped in and out: you may use the CALS table model, the XHTML table model, or any third table model by adding the appropriate module. Although the CALS table, for example, exists as a W3C XML Schema, it cannot be modularized for swap-out in a similar way in W3C Schema syntax.
FACILITATING DEVELOPMENT OF RELATED TAG SETS - the archiving tag set was designed with the expectation that other tag sets (such as the publishing tag set) would be created from it. Our experience is that this is easier in DTD form than in any other form of model description.
LITTLE NEED FOR DATA TYPES OR TYPING - while one can certainly express the models in these DTDs in most (perhaps all) of the extant schema languages, the real strengths of schema languages are not relevant to this data:
- Data Types - For b2b or transaction data, both the data types and the data typing of schemas are useful. There is very little in a journal textual DTD that can benefit from either. There are almost no small, type-identifiable components, even in the metadata.
- Data Typing - Most journal structures are textual, with highly diverse content models. W3C Schema can only type when the content models are the same or can be derived.
- Enforcement - Our goal is preservation, not enforcement; types and typing enforce and help to exclude through error-checking, which we do not wish to do.
Thus, while we do provide schema versions, there are no advantages to using a schema as the primary format for this model.
PRESERVING EXTANT STRUCTURES - For example, we added the modules for MathML, since some journals want tagged, executable math. MathML is currently only available as DTD modules, not as schema modules. If we had written a schema, we would have needed to exclude math (or convert it to a schema ourselves, a very complex endeavor that requires extensive mathematical knowledge).
USER COMMUNITY - DTDs and schemas have been called opposite ends of a continuum, with power at one end and ease of use, reading, and maintenance at the other. We chose our end deliberately. Our target audience, including journal publishers, conversion vendors, and aggregation houses, does not use schemas now; they use DTDs. They understand and can read DTDs. This DTD is for them.

Some of the reasons they remain with DTDs can be found in Alex Brown's excellent article on his experiences trying schemas in journal publishing: https://ssl.bnt.com/idealliance/papers/xmle02/dx_xmle02/papers/03-01-02/03-01-02.html.

References
1. Lapeyre, DA. What This Public DTD Suite Is All About. [email] Message to: Archive-dtd Mailing List. 2003 June 17, 4:26 pm.
2. Lapeyre, DA. W3C XML Schema versus DTD. [email] Message to: Archive-dtd Mailing List. 2003 June 2, 2:09 pm.

National Center for Biotechnology Information
U.S. National Library of Medicine
8600 Rockville Pike, Bethesda, MD 20894