Journal Articles

Crossref allows you to perform batch registration for journals and the volumes, issues, and articles being registered within the journal. Within a journal instance you may register articles from a single issue by adding details to journal_issue. Crossref allows you to register items from more than one issue but to do so you must use multiple journal instances within the XML file you submit to Crossref.

The documentation below describes what Crossref expects and our approach to batch registration of DOIs for journal articles.

Anatomy of Crossref Journal Metadata

Crossref provides excellent documentation about the anatomy of a batch registration in its schema documentation.

This section describes details regarding what Crossref expects for journals so that it is easy to understand why we create the various components in our generated XML described in later sections. Please note that the details here may only include details about the sections we currently use. See schema documentation for more details about unused tags.

Full Description as XML

Below is a description of the anatomy of a Journal object down to its ground children tags. Please note that the cardinality attributes in children tags are not a part of the Crossref schema but instead included to make it easier to understand the cardinality expectations of each tag. Please see the schema documentation for details on the anatomy of grand children tags and nodes.

<journal xmlns="http://www.crossref.org/schema/5.3.1">
  <journal_metadata language="" reference_distribution_opts="none" cardinality="{1, 1}">
      <full_title>{1,10}</full_title>
      <abbrev_title>{0,10}</abbrev_title>
      <issn media_type="print">{0,6}</issn>
      <coden>{0,1}</coden>
      <archive_locations>{0,1}</archive_locations>
      <doi_data>{0,1}</doi_data>
  </journal_metadata>
  <journal_issue cardinality="{0, 1}">
      <contributors>{0,1}</contributors>
      <titles>{0,1}</titles>
      <publication_date media_type="print">{1,10}</publication_date>
      <journal_volume>{0,1}</journal_volume>
      <issue>{0,1}</issue>
      <special_numbering>{0,1}</special_numbering>
      <archive_locations>{0,1}</archive_locations>
      <doi_data>{0,1}</doi_data>
  </journal_issue>
  <journal_article language="" publication_type="full_text" reference_distribution_opts="none" cardinality="{0,unbounded}">
      <titles>{1,20}</titles>
      <contributors>{0,1}</contributors>
      <jats:abstract abstract-type="" xml:base="" id="" xml:lang="" specific-use="">{0,unbounded}</jats:abstract>
      <publication_date media_type="print">{1,10}</publication_date>
      <acceptance_date media_type="print">{0,1}</acceptance_date>
      <pages>{0,1}</pages>
      <publisher_item>{0,1}</publisher_item>
      <crossmark>{0,1}</crossmark>
      <fr:program name="fundref">{0,1}</fr:program>
      <ai:program name="AccessIndicators">{0,1}</ai:program>
      <ct:program>{0,1}</ct:program>
      <rel:program name="relations">{0,1}</rel:program>
      <archive_locations>{0,1}</archive_locations>
      <scn_policies>{0,1}</scn_policies>
      <doi_data>{1,1}</doi_data>
      <citation_list>{0,1}</citation_list>
      <component_list>{0,1}</component_list>
  </journal_article>
</journal>

Journal Metadata

A journal_metadata tag is required for each journal in a batch registration request. The journal_metadata element allows one attribute, @language. @language is an optional attribute but must be one of the language codes listed in the schema in ISO 639 format. The docs don’t mention this, but it looks like the schema technically expects ISO 639-1.

While most sub-elements are optional, a journal_metadata tag must always have 1-10 full_title tags. The contents of this tag should be a full title by which a journal is commonly known or cited.

A journal_metadata tag may also have 0-10 abbrev_title tags. These should contain common abbreviation or abbreviations used when citing a journal. It is recommended that periods be included after abbreviated words within the title.

A journal_metadata tag may also have 0-6 issn tags that describe the ISSN(s) assigned to the title being registered. The @media_type attribute is optional and used to describe whether the ISSN is for the electronic or print. If not included, Crossref will assume the ISSN refers to the print.

The doi_data section includes information related to the DOI that refers to the full journal. The doi tag includes the DOI for the entity being registered with Crossref, while resource includes the URI associated with a DOI.

A completed journal_metadata section may have other components but may look like this:

<journal_metadata>
    <full_title>National Quail Symposium Proceedings</full_title>
    <full_title>Quail</full_title>
    <full_title>National Quail Symposium proceedings</full_title>
    <full_title>Proceedings of the ... National Quail Symposium</full_title>
    <full_title>Proceedings of the National Quail Symposia</full_title>
    <full_title>Gamebird : a joint conference of Quail and Perdix</full_title>
    <full_title>NQSP</full_title>
    <abbrev_title>NQSP</abbrev_title>
    <issn media_type="print">2573-5667</issn>
    <issn media_type="electronic">2573-5683</issn>
    <doi_data>
      <doi>10.7290/nqsp</doi>
      <resource>https://trace.tennessee.edu/nqsp/</resource>
    </doi_data>
</journal_metadata>

Journal Issue

A journal_issue tag is required for each journal in a batch registration request.

While there are many allowed sub-elements, a journal_issue must always have 1-10 publication_date tags that describe the date of publication. Multiple dates are allowed to allow for different dates of publication for online and print versions. If you have separate dates, you must use a @media-type attribute to describe whether the date refers to the print or electronic. Each publication_date must have exactly one year but can also have 0-1 month or day tags. Only use the optional tags if you know the exact date.

At UTK, we also try to describe known editors and reviewers in the contributors section. Each contributor must have one of the following roles: author, editor, chair, reviewer, review-assistant, stats-reviewer, reviewer-external, reader, translator. We do not put authors in this section but instead in the articles section. Each contributor can have various metadata elements. See schema docs for more information.

Each journal_issue can have 0-1 titles tag which acts as a container for the title and original language title elements. Only title is required here unless it is a translation in which original_language_title also becomes required.

Finally, a journal_issue can have 0-1 journal_volume tags which acts as a ontainer for the journal volume and DOI assigned to an entire journal volume. You may register a DOI for an entire volume by including doi_data in journal_volume. If included, this element must have 0, 1 volume tags which include the volume number.

A completed journal_issue section may have other components but may look like this:

<journal_issue>
    <contributors>
      <person_name sequence="first" contributor_role="editor">
        <given_name>Frank R.</given_name>
        <surname>Thompson</surname>
        <suffix>III</suffix>
        <affiliations>
          <institution>
            <institution_name>USDA Forest Service</institution_name>
          </institution>
        </affiliations>
      </person_name>
      <person_name sequence="first" contributor_role="editor">
        <given_name>Roger D.</given_name>
        <surname>Applegate</surname>
        <affiliations>
          <institution>
            <institution_name>Tennessee Wildlife Resources Agency</institution_name>
          </institution>
        </affiliations>
      </person_name>
      <person_name sequence="first" contributor_role="editor">
        <given_name>Leonard A.</given_name>
        <surname>Brennan</surname>
        <affiliations>
          <institution>
            <institution_name>Texas A&amp;M University-Kingsville</institution_name>
            <institution_department>Caesar Kleberg Wildlife Research Institute</institution_department>
          </institution>
        </affiliations>
      </person_name>
      <person_name sequence="first" contributor_role="editor">
        <given_name>C. Brad</given_name>
        <surname>Dabbert</surname>
        <affiliations>
          <institution>
            <institution_name>Texas Tech University</institution_name>
          </institution>
        </affiliations>
      </person_name>
      <person_name sequence="first" contributor_role="editor">
        <given_name>Stephen J.</given_name>
        <surname>DeMaso</surname>
        <affiliations>
          <institution>
            <institution_name>U.S. Fish and Wildlife Service</institution_name>
          </institution>
        </affiliations>
      </person_name>
      <person_name sequence="first" contributor_role="editor">
        <given_name>Kenneth</given_name>
        <surname>Duren</surname>
        <affiliations>
          <institution>
            <institution_name>Pennsylvania Game Commission</institution_name>
          </institution>
        </affiliations>
      </person_name>
      <person_name sequence="first" contributor_role="editor">
        <given_name>James A.</given_name>
        <surname>Martin</surname>
        <affiliations>
          <institution>
            <institution_name>University of Georgia</institution_name>
          </institution>
        </affiliations>
      </person_name>
      <person_name sequence="first" contributor_role="editor">
        <given_name>Kelly S.</given_name>
        <surname>Reyna</surname>
        <affiliations>
          <institution>
            <institution_name>Texas A&amp;M University-Commerce</institution_name>
          </institution>
        </affiliations>
      </person_name>
      <person_name sequence="first" contributor_role="editor">
        <given_name>Evan P.</given_name>
        <surname>Tanner</surname>
        <affiliations>
          <institution>
            <institution_name>Texas A&amp;M University-Kingsville</institution_name>
            <institution_department>Caesar Kleberg Wildlife Research Institute</institution_department>
          </institution>
        </affiliations>
      </person_name>
      <person_name sequence="first" contributor_role="editor">
        <given_name>Theron M.</given_name>
        <surname>Terhune II</surname>
        <affiliations>
          <institution>
            <institution_name>Orton Plantation</institution_name>
          </institution>
        </affiliations>
      </person_name>
      <person_name sequence="first" contributor_role="editor">
        <given_name>Molly K.</given_name>
        <surname>Foley</surname>
        <affiliations>
          <institution>
            <institution_name>National Bobwhite &amp; Grassland Initiative</institution_name>
          </institution>
        </affiliations>
      </person_name>
    </contributors>
    <titles>
      <title>Quail 9: National Quail Symposium</title>
    </titles>
    <publication_date>
      <year>2022</year>
    </publication_date>
    <journal_volume>
      <volume>9</volume>
    </journal_volume>
</journal_issue>

Journal Article

The journal tag can have 0 - “unbounded” journal_article tags that acts as a container for all information about a single journal article. Each journal_article must have 1-20 titles, 1-10 publication_date, and 1-1 doi_data tags.

The rules for each of these are the same as described in previous elements above, and we use them in the same way here.

In addition to the required elements, we also add authors using the contributors tag. Each person_name in this section is assigned the author role.

A completed journal article should look something like this:

<journal_article publication_type="full_text">
    <titles>
      <title>Northern Bobwhite and Fire: A Review and Synthesis</title>
    </titles>
    <contributors>
      <person_name sequence="first" contributor_role="author">
        <given_name>David A</given_name>
        <surname>Weber</surname>
        <affiliations>
          <institution>
            <institution_name>University of Georgia</institution_name>
          </institution>
        </affiliations>
      </person_name>
      <person_name sequence="additional" contributor_role="author">
        <given_name>Evan P</given_name>
        <surname>Tanner</surname>
        <affiliations>
          <institution>
            <institution_name>Caesar Kleberg Wildlife Research Institute</institution_name>
          </institution>
        </affiliations>
      </person_name>
      <person_name sequence="additional" contributor_role="author">
        <given_name>Theron M.</given_name>
        <surname>Terhune</surname>
        <suffix>II</suffix>
        <affiliations>
          <institution>
            <institution_name>Tall Timbers</institution_name>
          </institution>
        </affiliations>
      </person_name>
      <person_name sequence="additional" contributor_role="author">
        <given_name>J. Morgan</given_name>
        <surname>Varner</surname>
        <affiliations>
          <institution>
            <institution_name>Tall Timbers</institution_name>
          </institution>
        </affiliations>
      </person_name>
      <person_name sequence="additional" contributor_role="author">
        <given_name>James A.</given_name>
        <surname>Martin</surname>
        <affiliations>
          <institution>
            <institution_name>University of Georgia</institution_name>
          </institution>
        </affiliations>
      </person_name>
    </contributors>
    <publication_date>
      <year>2022</year>
    </publication_date>
    <doi_data>
      <doi>10.7290/nqsp09V0ju</doi>
      <resource>https://trace.tennessee.edu/nqsp/vol9/iss1/63</resource>
    </doi_data>
</journal_article>

Using CSV as the DOI Source

By default, it is assumed that the DOIs we intend to mint are included in the metadata record for each work. Optionally, a CSV can be supplied as the source (see below for more details).

If you are supplying a CSV, the CSV must have 2 columns: url and doi. These columns can appear anywhere, and there can be n number of additional columns. The only requirement is that url and doi appear in row one and the headings be lowercase.

The value of the url column fields should be the url to the work in Digital Commons (e.g. https://trace.tennessee.edu/nqsp/vol8/iss1/24).

The value of the doi column fields can be a DOI that starts with https://doi.org/ or 10.7290. Crossref expects the value to be formatted as 10.7290/xxxxxx so code exists in the scripts to remove https://doi.org/ if it is included:

def __build_doi_object(self):
    if self.doi:
        return {
            "doi": self.doi.replace("https://doi.org/", ""),
            "resource": self.coverpage,
            "timestamp": str(arrow.utcnow().format("YYYYMMDDHHmmss"))
        }
    else:
        return None

Creating Metadata about the Journal, Issue, and Deposit

Additional metadata beyond what is found in the article level metadata is needed for deposit and DOI registration.

This metadata is added in a human-readable way using yaml. These yaml files should include everything needed to generate the missing elements for deposit.

The path property describes where the XML containing article level metadata can be found.

path: "metadata/output/vol9"

The contributors property describes the editors and reviewers of the volume or issue:

contributors:
  - given: Frank R.
    surname: Thompson
    suffix: III
    role: editor
    sequence: first
    institution:
      institution_name: USDA Forest Service
  - given: Roger D.
    surname: Applegate
    role: editor
    sequence: additional
    institution:
      institution_name: Tennessee Wildlife Resources Agency
  - given: Leonard A.
    surname: Brennan
    role: editor
    sequence: additional
    institution:
      institution_name: Texas A&M University-Kingsville
      institution_department: Caesar Kleberg Wildlife Research Institute
  - given: C. Brad
    surname: Dabbert
    role: editor
    sequence: additional
    institution:
      institution_name: Texas Tech University
  - given: Stephen J.
    surname: DeMaso
    role: editor
    sequence: additional
    institution:
      institution_name: U.S. Fish and Wildlife Service
  - given: Kenneth
    surname: Duren
    role: editor
    sequence: additional
    institution:
      institution_name: Pennsylvania Game Commission
  - given: James A.
    surname: Martin
    role: editor
    sequence: additional
    institution:
      institution_name: University of Georgia
  - given: Kelly S.
    surname: Reyna
    role: editor
    sequence: additional
    institution:
      institution_name: Texas A&M University-Commerce
  - given: Evan P.
    surname: Tanner
    role: editor
    sequence: additional
    institution:
      institution_name: Texas A&M University-Kingsville
      institution_department: Caesar Kleberg Wildlife Research Institute
  - given: Theron M.
    surname: Terhune II
    role: editor
    sequence: additional
    institution:
      institution_name: Orton Plantation
  - given: Molly K.
    surname: Foley
    role: editor
    sequence: additional
    institution:
      institution_name: National Bobwhite & Grassland Initiative

The journal_metadata property includes metadata about the journal overall.

journal_metadata:
  full_title:
    - National Quail Symposium Proceedings
    - Quail
    - National Quail Symposium proceedings
    - Proceedings of the ... National Quail Symposium
    - Proceedings of the National Quail Symposia
    - "Gamebird : a joint conference of Quail and Perdix"
    - NQSP
  abbrev_title:
    - NQSP
  issn_data:
    - issn: 2573-5667
      type: print
    - issn: 2573-5683
      type: electronic
  doi_data:
    doi: "10.7290/nqsp"
    resource: "https://trace.tennessee.edu/nqsp/"

The journal_issue property includes other metadata about the issue.

journal_issue:
  publication_date:
    year: "2022"
  journal_volume:
    volume: "9"
  titles:
    title: "Quail 9: National Quail Symposium"

Finally, the head property includes metadata required for deposit.

head:
  doi_batch_id: utk_nqsp_9_10_2022
  timestamp: "20221021080808"
  depositor:
    depositor_name: Mark Baggett
    email_address: mbagget1@utk.edu
  registrant: University of Tennessee

DOIJournalBatchWriter and Other Classes Used for Registration Generation

Crossref batch registration of DOIs for journals and journal articles is handled primarily by the crawl_papers.py script in this repository. While there are a few classes here, DOIJournalBatchWriter is primarily used.

DOIJournalBatchWriter also includes an optional argument, csv_path. By default, this is an empty string. If the string is not empty, it signifies to the DOIJournalBatchWriter instance that the relevant DOIs in this registration is found in an attached CSV rather than the metadata record for individual works.

On initialization, DOIJournalBatchWriter requires one argument: yaml_config or the yaml file that contains additional metadata beyond what is found in the article level metadata that is needed for deposit and DOI registration. The yaml_config is read by the class, converted to a dictionary, and stored as an attribute. If csv_path is included, it is stored in an attribute as well.

Also during initialization, initial namespaces are declared for later use. This is important to know in case there are future efforts to make use of other namespaces listed in the Crossref documentation and examples above as they are not declared currently and will need to be before they can be used. A special attribute called doi_location is also defined. This attribute determines whether the source of the DOIs should be the metadata record or a CSV.

def __find_doi_location(self):
    if self.csv_path != "":
        return "csv"
    else:
        return "metadata"

Next, the path to the metadata files that is declared in the yaml_config is crawled. Each metadata file found in the path is passed to the Article class which builds relevant metadata and determines whether the article should have a DOI registered. This method sends the path to the current file, where to look for the DOI, a path to the CSV to use for lookup if the source is CSV.

def __crawl_journal_articles(self):
    """Crawl a directory of Journal Articles and find a list files with DOIs. Ignore any journal content where no
    DOI is present."""
    valid_articles = []
    for path, directories, files in os.walk(self.path_to_articles):
        for file in files:
            article = Article(f"{path}/{file}", self.doi_location, self.csv_path)
            if article.doi:
                valid_articles.append(article.metadata)
    return valid_articles

The Article class decodes the binary XML file and converts it to an ElementTree. It then passes relevant metadata to other defined classes to build the title, date, doi, and contributor information that is expected in the registration. Note that if future imports expect more information in the article that additional classes will need to be added and added to Article appropriately.

class Article(BaseProperty):
    def __init__(self, path, doi_location='metadata', doi_csv=None):
        super().__init__(path)
        self.doi_location = doi_location
        self.contributors = Contributors(path).contributors
        self.title = Title(path).titles[0]
        self.doi = DOI(path, doi_location, doi_csv).doi_data
        self.publication_date = PublicationDate(path).publication_date
        self.metadata = self.__get_relevant_metadata()

    def __get_relevant_metadata(self):
        return {
            "contributors": self.contributors,
            "title": self.title,
            "doi": self.doi,
            "date": self.publication_date
        }

The DOI class includes the code for find the relevant DOI based on the DOI location information passed to it during initialiization. If the source is the metadata file, it looks at /documents/document/fields/field[@name="doi"]/value. If the source is a CSV, it crawls the CSV for each row in the CSV looking for a match.

class DOI(BaseProperty):
    def __init__(self, path, doi_source='metadata', doi_csv=''):
        super().__init__(path)
        self.coverpage = self.__get_resource()
        self.doi_csv = doi_csv
        self.doi = self.__get_doi(doi_source)
        self.doi_data = self.__build_doi_object()

    def __get_doi(self, source):
        if source == 'metadata':
            matches = [doi.text for doi in self.root.xpath('/documents/document/fields/field[@name="doi"]/value')]
            if len(matches) > 0:
                return matches[0]
            else:
                return None
        else:
            return self.__find_if_doi_in_csv()

    def __find_if_doi_in_csv(self):
        with open(self.doi_csv, 'r') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                if row['url'] == self.coverpage:
                    return row['doi']
        return None

    def __get_resource(self):
        return [url.text for url in self.root.xpath('/documents/document/coverpage-url')][0]

    def __build_doi_object(self):
        if self.doi:
            return {
                "doi": self.doi.replace("https://doi.org/", ""),
                "resource": self.coverpage,
                "timestamp": str(arrow.utcnow().format("YYYYMMDDHHmmss"))
            }
        else:
            return None

Finally, the output XML file is generated. Each section of the outgoing XML is defined in a private method in DOIBatchJournalWriter like in the examples below:

def __build_journal_issue(self):
    return self.cr.journal_issue(
        self.__build_contributors(),
        self.cr.titles(
            self.cr.title(
                self.proceedings_metadata['journal_issue']['titles']['title']
            )
        ),
        self.cr.publication_date(
            self.cr.year(
                self.proceedings_metadata['journal_issue']['publication_date']['year']
            )
        ),
        self.cr.journal_volume(
            self.cr.volume(
                self.proceedings_metadata['journal_issue']['journal_volume']['volume']
            )
        )
    )

def __build_journal_metadata(self):
    return self.cr.journal_metadata(
        *self.__get_full_titles(),
        *self.__get_abrev_titles(),
        *self.__get_issns(),
        self.__get_doi()
    )

Ultimately, this data is passed up appropriately to methods representing parent nodes and ultimately converted to one XML file.

def __build_response(self):
    return etree.tostring(
        self.__build_xml(),
        pretty_print=True,
        xml_declaration=True,
        encoding='iso-8859-1'
    )

def __build_xml(self):
    begin = self.cr.doi_batch(
        self.__build_head(),
        self.__build_body()
    )
    begin.attrib['{http://www.w3.org/2001/XMLSchema-instance}schemaLocation'] = "http://www.crossref.org/schema/5.3.1 http://www.crossref.org/schemas/crossref5.3.1.xsd"
    begin.attrib['version'] = '5.3.1'
    return begin

In order to pass information accordingly, file and path names are added for each registration at the bottom of the file as so:

if __name__ == "__main__":
    path_to_proceedings_metadata = "data/quail_journal.yml"
    x = DoiJournalBatchWriter('test.xml', path_to_proceedings_metadata).response
    with open('example_journal.xml', 'wb') as example:
        example.write(x)

Crawling Papers

Crawling papers and generating an XML upload can be done with the script found here. The script iterates over all XML files in a directory and creates an XML file according to the Crossref 5.3.1 XML schema definition. The script needs a yml file with the parts described above including a path to the metadata files.

To generate an initial XML registration where DOIs are located in the metadata record, you can run the script like this:

python utilities/crawl_papers.py -y data/quail_journal -o quail_8.xml

To generate an initial XML registration where DOIs are found in an attached CSV, you can run the script like this:

python utilities/crawl_papers.py -y data/quail_journal.yml -d csv -c nqsp_8.csv -o nqsp8.xml

This will generate a registration file that includes metadata from the supplied yaml and any articles in the path that have a DOI. Once the XML file is generated, it may need to be cleaned. The following section describes this process.

Finalizing XML Deposit

Finally, run lxml_transform.py to remove blank elements and perform other required steps for finalizing the XML output.

python utilities/lxml_transform.py -i quail_8.xml -o quail_8_clean.xml

Then, take that XML file and upload it to Crossref for testing.

First, check that your XML is wellformed and valid by uploading here.

Next, upload your XML file to the test system for processing and to insure there are no major issues.

Finally, if all is good, upload to the production system. After deposit, you will receive an email stating whether your upload was successful.