NLM Journal Archiving and Interchange Tag Suite Version 2.0

Version 2.0

[Updated versions of the Tag Suite have been released. Current version information is available here.]

There are extensive changes between Version 2.0 of the NLM Journal Archiving and Interchange Tag Suite (hereafter Tag Suite) and earlier Tag Suite versions 1.0 and 1.1. For this reason, all modules in the Tag Suite advance to Version 2.0 dated 08/30/2004.

Although the changes are fully backwards compatible for XML documents (document instances), the new Tag Suite may not be backwards compatible for all previous customizations.

The major changes to the content models and attribute lists in version 2.0 include:

Incorporation of AIT Working Group suggestions from the June 2004 meeting and June/July 2004 follow-up list discussions; and
General loosening of the Tag Suite to further the goals of preservationist archives.

Major changes have also been made in the Parameter Entities and modularization. The Tag Suite Version 2.0 permits more modifications than previous versions and changes the way in which customizations are made. The changes to the modularization include:

Division into three base Tag Sets instead of two: a preservation-oriented Archiving Tag Set, a regularizing Publishing Tag Set, and a smaller strict Authoring Tag Set;
Remodularization of the Tag Suite to meet new (and newly articulated) customization requirements and to make it more obvious how to make a new Tag Set from the Suite. As part of this modularization, all the default classes have been moved out of the individual element-definition modules, new Suite default class and default mix modules have been set up, and single-module-customization is no longer considered best practice.

Changes to the Base Suite

The section below lists the changes to the base Suite which is used to build the Archiving, Publishing, and Authoring Tag Sets. Most of the old class modules (%list.ent;, %references.ent;, etc.) look much the way they used to, although the default classes and mixes have been moved out of the individual modules into the new class-specific and mix-specific modules. A few elements have moved from one module to another, particularly to the common module, as their usage increased. There are more Parameter Entities to make it possible to over-ride even more content models and attribute lists.

Global changes include:

Changed the version number of every module to reflect a Version of 2.0 and a date of 08/30/2004;
Changed the formal public identifier of each module to that version and date; and
Changed the #FIXED “dtd-version” attribute in all the DTDs to “2.0”.
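
As a minimal sketch of the “dtd-version” change, assuming the attribute is declared on the top-level <article> element, the declaration might look like the following. The real attribute list carries more attributes than are shown here.

```
<!-- Sketch only: other attributes of <article> are omitted -->
<!ATTLIST article
  dtd-version CDATA #FIXED "2.0" >
```

Because the value is #FIXED, a document instance may either omit the attribute or give it as “2.0”; any other value is a validity error.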

Rationale and Philosophy for Changes

Regularizing Versus Strict Preservation

Two base Tag Sets were originally constructed from the NLM Journal Archiving and Interchange Tag Suite: the Journal Archiving and Interchange Tag Set (nicknamed Green) and the Journal Publishing Tag Set (nicknamed Blue). The Publishing Tag Set was to be prescriptive, to facilitate authoring. The Archiving Tag Set was to be moderately loose, to serve as a basis for interchange and repositories.

An understated, but very real, goal of the Archiving Tag Set was regularization of the archive itself. At the time of conversion from a publisher’s original DTD to the Archiving Tag Set, a number of changes would be made to make all articles more alike. An alternate goal for the Tag Set might have been (but was not) preservation of a publisher’s content as exactly as possible. (As a small illustration of the difference such a design makes in a Tag Set, consider the “article-type” attribute values for the <article> element. An archive-regularizing Tag Set would make this a closed list and would change the “research article” of one publisher, the “research paper” of another, and the “letter” of a journal such as Nature into a single value during conversion. In contrast, a preservationist approach might make this attribute a CDATA open list, to capture whatever the publisher had called the article.)
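
The two approaches to “article-type” can be sketched as attribute declarations. This is an illustration only: the enumerated values and the #IMPLIED default are assumptions, not the values used in the delivered Tag Sets.

```
<!-- Regularizing: a closed list; the values shown are hypothetical.
     Enumerated values must be name tokens, so a publisher's
     "research article" or "research paper" would be converted to a
     single token such as research-article. -->
<!ATTLIST article
  article-type (research-article | letter | review) #IMPLIED >

<!-- Preservationist alternative (used instead of the declaration above,
     not in addition to it): any string the publisher used is accepted.
     ATTLIST article
       article-type CDATA #IMPLIED                                     -->
```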

Three DTDs From Two

Many requests from the AIT Working Group were from preservationists asking that the Archiving Tag Set loosen up to allow representing a wider range of publishers’ input. As the Archiving Tag Set loosened, archives wanting to regularize the archive migrated to the Publishing Tag Set. They found it a little too tight for their needs and requested loosening changes. At the same time, vendors wanting to establish an authoring environment felt that Blue needed tightening, to make it easier for authors. The solution has been to create three base Tag Sets from the Suite:

Journal Archiving (Green) Tag Set	a preservationist archival Tag Set (the current Archiving Tag Set, made even more flexible and non-enforcing)
Journal Publishing (Blue) Tag Set	an archive regularization and interchange Tag Set (the current Publishing Tag Set loosened as necessary)
Authoring Tag Set (Not available in 2.0)	a tight, small subset that concentrates on best practice as an aid to authoring

Remodularization

New Customization Requirements

When the Archiving and Interchange Tag Suite was written, it was assumed that the major use of the Suite would be to make entirely new and distinct Tag Sets, so the modularization was done to make that convenient. All customization was concentrated in one module, and Parameter Entities were defined just before they were used. But very few original Tag Sets have been developed in the last year and a half. New Tag Sets are being created from the Suite not by development but by modification. Most organizations seem to want most of their models to be the same as the Suite, plus a few changes. People modifying the Tag Suite have also requested more guidance in how to modify it properly. In light of actual Suite usage, we have observed the following Tag Suite customization requirements.

The published modules should act as a more complete model of how to build a modified DTD from the Suite. The modularization of the base Archiving and Publishing DTDs should provide a sample of best practice for making and modularizing new DTDs.
Customization modules should be small and list only what has changed from the Suite defaults.
It should be relatively easy to compare new Tag Sets developed from the Suite.
Parameter Entities that are not over-ridden by a new Tag Set should not need to be declared by it (so that everything in the customization module is a change from the base).
It is still useful to developers to have all the element classes listed in one place, grouped functionally (in functions that match the Suite modules) so that it is easy to see what in the Suite is designed to be changed easily and how to change it.

New Class and Mix Modules

Best practice for how to make a new Tag Set from the Suite has changed, and the Suite has been remodularized for the new style. Two Suite modules have been created to hold the declarations that were formerly part of each Tag Set’s customization. In place of the former single customization module, there are smaller function-specific modules:

Default definition of element classes (the module %default-classes.ent;); and
Default definition of element mixes (the module %default-mixes.ent;).
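
One way to picture the new arrangement is the top of a Tag Set DTD that calls in the modules. The sketch below is an assumption about layout, not a copy of the delivered DTDs: the module entity names %default-classes.ent; and %default-mixes.ent; come from the list above, while %my-classes.ent;, %my-mixes.ent; and all file names are hypothetical. It relies on the XML rule that the first declaration of an entity is the one that binds, so over-rides have to be read before the defaults.

```
<!-- Hypothetical Tag Set-specific class over-rides, read first so that
     they take precedence over the Suite defaults -->
<!ENTITY % my-classes.ent SYSTEM "my-classes.ent" >
%my-classes.ent;

<!-- Suite default classes (file name assumed) -->
<!ENTITY % default-classes.ent SYSTEM "default-classes.ent" >
%default-classes.ent;

<!-- Hypothetical Tag Set-specific mix over-rides; mixes follow classes
     because mixes are built from classes -->
<!ENTITY % my-mixes.ent SYSTEM "my-mixes.ent" >
%my-mixes.ent;

<!-- Suite default mixes (file name assumed) -->
<!ENTITY % default-mixes.ent SYSTEM "default-mixes.ent" >
%default-mixes.ent;
```

With this layout a customization module lists only what differs from the Suite; every Parameter Entity it leaves out falls through to the default declaration.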

Modeling Conventions and PE Naming

These Tag Set and Suite modules have used a series of design and naming conventions consistently. While parsing software cannot enforce these Parameter Entity usage or naming conventions, these conventions can make it much easier for a person to know how the content models work. Version 2.0 of the Archiving Tag Set (and the entire Version 2.0 Archiving and Interchange Tag Suite) uses the following usage and naming conventions.

Classes: Classes are functional OR-groups of elements. All class names end in the suffix “.class”. For example:
```
<!ENTITY % list.class "def-list | list">
```
Classes cannot be made empty; the class should just be removed from all models where you do not want the elements included.
Mixes: Mixes are OR-groups of classes. All mixes must be declared after all classes, since mixes are composed of classes. (Mixes should never contain element names directly.) Mix names have no set suffix. Some mixes are inline, to be intermingled with #PCDATA, and some mixes are groupings of block-level elements. All inline mixes begin with an OR bar. For example:
```
<!ENTITY % rendition-plus                             
"| %emphasis.class;  | %subsup.class;" >
```
Content: Content models and content model over-rides use mixes and classes for all OR groups. Only sequences are made up of element names directly. Content model over-rides are of two types, defined separately to preserve the mixed-content or element-content nature of the models as an aid to interchange.
- -models: The over-ride of a complete content model is named with the suffix “-model”. The over-ride includes the entire content model, including the enclosing parentheses. For example:
```
<!ENTITY % kwd-group-model
"(title?, (%kwd.class; | %x.class;)+ )" >
```
- -elements: A grouping of elements to be mixed with #PCDATA inside a content model is named with the suffix “-elements”. For example, “access-date-elements” would be used in the model for the element <access-date>. All “-elements” over-rides begin with an OR bar, so that a model may exclude all elements and be reduced to #PCDATA. For example:
```
<!ENTITY % access-date-elements 
"| %date-parts.class; | %x.class;" >
```
  This could be replaced by:
```
<!ENTITY % access-date-elements ""              >
```
Attribute lists: Attribute lists for a particular element are named with the name of the element followed by the suffix “-atts”, so, for example, the attributes for the <abstract> element would be named “abstract-atts”. Such lists are not reused as frequently as they might be in many DTDs, to provide maximum flexibility; attribute lists for different elements were rarely tied together. The Parameter Entities contain at least one complete line of an attribute list, not including the ATTLIST declaration.
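
The conventions above can be seen working together in a short external DTD fragment. This is a sketch under stated assumptions: the contents given for the three classes and the single attribute in “abstract-atts” are hypothetical, chosen only so that the fragment is self-contained. (Parameter Entity references inside declarations, as used here, are permitted only in the external DTD subset.)

```
<!-- Class contents below are assumed for illustration -->
<!ENTITY % date-parts.class "day | month | season | year" >
<!ENTITY % kwd.class        "kwd" >
<!ENTITY % x.class          "x" >

<!-- "-elements": begins with an OR bar and is mixed with #PCDATA -->
<!ENTITY % access-date-elements "| %date-parts.class; | %x.class;" >
<!ELEMENT access-date (#PCDATA %access-date-elements;)* >

<!-- "-model": the whole content model, parentheses included -->
<!ENTITY % kwd-group-model "(title?, (%kwd.class; | %x.class;)+ )" >
<!ELEMENT kwd-group %kwd-group-model; >

<!-- "-atts": complete attribute-definition lines, without the ATTLIST
     keyword; the attribute shown is hypothetical -->
<!ENTITY % abstract-atts "abstract-type CDATA #IMPLIED" >
<!ATTLIST abstract %abstract-atts; >
```

Each element declaration names only a Parameter Entity, so a Tag Set can change the model or the attributes by declaring the same entity earlier, without touching the element-definition module.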

New Classes

The ideal situation in a Tag Set is that mix OR-groups and OR-groups within content models do not name elements; they name classes. This makes Tag Set-customization easier and makes maintenance over time significantly easier. A few new classes were created to facilitate this:

%app.class;
%back.class;
%caption.class;
%corresp.class;
%date.class;
%date-parts.class;
%def.class; (used in both <def-item> and <abbrev>)
%degree.class;
%fig-display.class;
%fn-link.class;
%front.class;
%front-back.class;
%id.class;
%just-base-display.class;
%just-para.class; (used in, for example, <author-comment>, <bio>, <def>, <caption>, <statement>, <fig>)
%just-table.class;
%kwd.class;
%name.class;
%ref-list.class;
%sec-back.class;
%table-foot.class; and
%tbody.class;.

Content models were rewritten to use the newly created classes. This rewriting did not lead to Tag Set changes, except in the following elements:

Modified %doc-back-matter-mix; (formerly named %doc-back-matter-elements;) to correct the historical error that had this Parameter Entity calling a mix (%sec-level;) and not a class (%sec.class;). Since there was nothing in %sec-level; but <sec>, this has no effect on the Tag Sets as delivered, but it may change existing customizations.
%front-matter-model; rewritten to use new class Parameter Entity %front-back.class; and to use %list.class; rather than just <def-list>. This widens the model of <front-matter> by adding <list>.
Paragraph-related changes:
- %inside-para; was renamed %p-elements;
- Deleted %para.class;. In the definition of the Paragraph <p> element, its place is taken by %p-elements;. In other mixes, such as %para-level;, %para.class; was replaced by the combination of %just-para.class; and %rest-of-para.class;. (No Tag Set Changes)
- Inside the content model for Paragraph <p>, %rest-of-para.class; was replaced by %p-elements;. The content model for <named-content> and the definition of %p-elements; itself still use %rest-of-para.class;.
In %named-content-elements;, replaced the Parameter Entity %emphasized-text; with its constituent classes
In %copyright-statement-elements;, replaced the mix %rendition-plus; with its constituent classes
In %citation-elements;, replaced the mix %simple-text; with its constituent classes. This causes the apparent, but not real, deletion of %address-link.class;. (These links are also in %references.class;, so there were no Tag Set changes.)

Link Classes

The link classes were reorganized to make future modification easier. Three classes were deleted; new link classes were added for a total of four, and everywhere the link classes were used was modified as follows:

All occurrences of %ext-links.class; were replaced with the new class %address-link.class;.
All occurrences of %link.class; were replaced with some combination of the new link classes named below.

The following link-related classes were deleted:

%link.class;
%inpara-address;
%ext-links.class;

The new link classes are:

%address-link.class; (external links used in addresses)
%fn-link.class; (footnote alone)
%simple-link.class; (the internal links, just as it used to be)
%article-link.class; (links for journal articles)

Specific changes this encompassed include:

Replaced the link PEs in %emphasized-text;, %inside-cell;, %p-elements;, %product-elements;, %simple-phrase;
In %aff-elements;, replaced %link.class; with %address-link.class;, %simple-link.class;, %article-link.class; (directly in the Suite base module and via the use of %all-phrase; in the Archiving customization) (No Tag Set Change)
In <author-notes>, “(corresp | fn)+” was replaced with the %fn-link.class; and the new class %corresp.class;
In %collab-elements;, replaced %ext-links.class; with %address-link.class; and deleted %inpara-address; (directly in the Suite base module and via the use of %all-phrase; in the Archiving customization) (No Tag Set Change)
In %copyright-statement-elements;, replaced %inpara-address; with %address-link.class; (directly in the Suite base module and via the use of %all-phrase; in the Archiving customization)
<email> is considered to be just another type of external link, as <ext-link> is, so it was added to: %collab-elements; and %copyright-statement-elements;.
In %inside-para; (which had been modified and renamed %p-elements;), deleted the PE %inpara-address;. (No Tag Set Change, since %address-link.class; covers it.)
In %named-content-elements;, replaced %link.class; with %address-link.class;, %article-link.class;, and %simple-link.class; (directly in the Suite base module and via the use of %all-phrase; in the Archiving customization) (No Tag Set Change)
In <related-article>, deleted %ext-links.class; because %references.class; has the needed links. (No Tag Set Change)
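
The <author-notes> change is a compact example of replacing element names with classes. In the sketch below, the earlier model is quoted from the list above; the contents assigned to the two classes are assumptions (the text describes %fn-link.class; as “footnote alone”, and %corresp.class; is assumed to hold only <corresp>).

```
<!-- Earlier model, shown for comparison only: (corresp | fn)+ -->

<!-- Class contents assumed for illustration -->
<!ENTITY % corresp.class "corresp" >
<!ENTITY % fn-link.class "fn" >

<!-- Same set of permitted elements, now expressed through classes -->
<!ELEMENT author-notes (%corresp.class; | %fn-link.class;)+ >
```

The set of elements allowed is unchanged, which is why edits of this kind produce no Tag Set change, but a customization can now add to or trim either class in one place.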

Rename Existing Parameter Entities

In order to make customization and maintenance easier, the names of several existing Parameter Entities were changed to bring them in line with the naming practices. This entailed changing the PE declaration and every mix or content model that used the PE.

%address-elements; ==> %address.class;
%author-notes-elements; ==> %author-notes-model;
%block-math; ==> %block-math.class;
%citation-model; ==> %citation-elements;
%contrib-info; ==> %contrib-info.class;
%copyright-statement-model; ==> %copyright-statement-elements;
%def-item-elements; ==> %def-item-model;
%display-back-matter; ==> %display-back-matter.class;
%doc-back-matter-elements; ==> %doc-back-matter-mix;
%inline-math; ==> %inline-math.class;
%list-item-elements; ==> %list-item-model;
%related-article-model; ==> %related-article-elements;
%sec-back-matter-elements; ==> %sec-back-matter-mix;
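
These renames are one reason an existing customization may need updating. A Parameter Entity declared under an old name is still legal, but nothing in the Version 2.0 modules refers to it, so the over-ride has no effect and no parser reports the problem. A minimal sketch, with hypothetical replacement text:

```
<!-- Written against Version 1.x: under Version 2.0 no module references
     %inline-math; so this over-ride is silently ignored -->
<!ENTITY % inline-math "inline-formula" >

<!-- The same over-ride under the Version 2.0 name -->
<!ENTITY % inline-math.class "inline-formula" >
```

Searching a customization module for each old name in the list above is a quick way to find over-rides that need renaming.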

Inline Mix OR-Bars

All inline mixes begin with an OR bar. Although this is not strictly conformant, all modern XML parsers tested allow this variant. This technique allows the PE to be set to the null string, cancelling out any element inclusions and leaving a model of #PCDATA. This could also have been accomplished by over-riding the entire content model with a PE. The disadvantage of that method is that it makes it very easy to change mixed-content models to block-level element-content models. Since that is a major barrier to interchange, keeping the level the same is the one area where this Tag Suite attempts to enforce consistency.

Changed the following inline-mix Parameter Entities to use the OR-bar-first mechanism. This requires changing not only the Parameter Entity, to add the OR bar, but also all content models that use the entity, to remove the OR bar: %all-phrase;, %emphasized-text;, %inside-para; (now renamed %p-elements; and used only inside the Paragraph element <p>), %just-rendition;, %preformat-elements;, %related-article-elements;, %rendition-plus;, %simple-phrase;, %simple-text;, and all the Parameter Entities with the suffix “-elements”, if they did not already start that way.
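
A short sketch shows why the OR bar has to move into the entity. The class contents, and the choice of %rendition-plus; as the default for %label-elements;, are assumptions made for illustration.

```
<!-- Class contents are assumed for illustration -->
<!ENTITY % emphasis.class "bold | italic" >
<!ENTITY % subsup.class   "sub | sup" >

<!-- OR-bar-first style: the bar is the first character of the mix -->
<!ENTITY % rendition-plus "| %emphasis.class; | %subsup.class;" >
<!ENTITY % label-elements "%rendition-plus;" >

<!-- Expands to (#PCDATA | bold | italic | sub | sup)*.
     Over-riding label-elements with "" leaves (#PCDATA)*. -->
<!ELEMENT label (#PCDATA %label-elements;)* >

<!-- Earlier style, for comparison only: with the bar written in the model,
     as in (#PCDATA | %some-mix;)*, an empty over-ride would leave
     (#PCDATA | )*, which is not a legal content model. -->
```

Either way the model stays mixed content, so an over-ride can add or remove inline elements but cannot turn the element into a block-level container.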

Model Over-rides Permitted

To make the Tag Sets more flexible and allow additional over-riding, the following new PEs were added:

%access-date-elements;
%chem-struct-elements;
%copyright-statement-elements;
%degrees-elements;
%display-formula-elements; and %display-formula-model;
%edition-elements;
%etal-elements;
%ext-link-elements;
%fax-elements;
%font-elements;
%given-names-elements;
%gov-elements;
%history-model;
%issn-elements;
%issue-elements;
%issue-title-elements;
%just-para.class;
%kwd-group-model;
%label-elements;
%long-desc-elements;
%on-behalf-of-elements;
%p-elements;
%patent-elements;
%phone-elements;
%prefix-elements;
%publisher-name-elements;
%publisher-loc-elements;
%role-elements;
%self-uri-elements;
%series-text-elements;
%series-title-elements;
%std-elements;
%string-date-elements;
%string-name-elements;
%suffix-elements;
%surname-elements;
%time-stamp-elements;
%uri-elements;
%verse-line-elements;
%volume-elements; and
%volume-id-elements;.

New attribute Parameter Entities were added as well:

%article-id-atts; for <article-id>
%date-atts; for <date>
%object-id-atts; and “object-id-type”
%pub-date-atts; for <pub-date>
%pub-id-atts; for <pub-id>
%sub-article-atts; for <sub-article>
%volume-id-atts; for <volume-id>

Tag Sets

These Tag Sets are available in version 2.0:

National Center for Biotechnology Information
U.S. National Library of Medicine
8600 Rockville Pike, Bethesda, MD 20894

Last updated: September 14, 2012