Module RelaxNG Scanner

General Information

This module validates incoming XML documents obtained from publishers. Each XML document can be validated against several RelaxNG schemas.

Selecting RelaxNG schemas for a journal in the config file

Module logic

The module uses git to obtain the latest RelaxNG schemas. Then, for each article, it loads the single XML document from the xml folder and tries to validate it against the RelaxNG schemas in the order specified in the configuration file.

  • Validation is performed with the Jing-Trang validation engine because the lxml output was often too unspecific to reveal why validation failed.
  • The module only works with a single XML document in the xml folder.
  • Validation stops as soon as the document validates successfully against a schema. The name of that RelaxNG schema is saved.
  • If validation fails for all RelaxNG schemas in the config file, a stop-processing folder is created and the errors are logged.
  • The module contains a mechanism to resolve stop-processing folders by attempting another round of validation, depending on the parameters. See details below.
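The "first schema that validates wins" behaviour described above can be sketched as follows. This is only an illustration: the function name, the `Validator` callable interface, and the returned error dictionary are assumptions made for the example, not the module's actual API.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical interface: a validator takes (schema_name, xml_path) and
# returns a list of error messages; an empty list means "valid".
Validator = Callable[[str, str], List[str]]


def first_matching_schema(
    xml_path: str,
    schema_names: Iterable[str],
    validate: Validator,
) -> Tuple[Optional[str], Dict[str, List[str]]]:
    """Try the schemas in config order and stop at the first that validates.

    Returns the name of the matching schema (or None if all failed) and the
    errors collected from every schema that was tried and rejected.
    """
    errors_by_schema: Dict[str, List[str]] = {}
    for name in schema_names:
        errors = validate(name, xml_path)
        if not errors:
            # First success ends validation; later schemas are not tried.
            return name, errors_by_schema
        errors_by_schema[name] = errors
    return None, errors_by_schema
```

A return value of `None` for the schema name corresponds to the case in which a stop-processing folder is created and the collected errors are logged.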

Processing steps

Update of the RelaxNG schemas

If the folder in the mounted file system that is supposed to contain the RelaxNG schemas does not exist, the module performs a git clone of the repository that contains the schemas. Otherwise, it performs a git pull.
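A minimal sketch of this clone-or-pull decision, assuming git is available on the PATH. The function names and the repository URL used by callers are placeholders, not taken from the module.

```python
import subprocess
from pathlib import Path
from typing import List


def git_update_command(repo_url: str, schema_dir: Path) -> List[str]:
    """Return the git command that brings schema_dir up to date."""
    if schema_dir.exists():
        # Folder already present: update the existing checkout in place.
        return ["git", "-C", str(schema_dir), "pull"]
    # Folder missing: create it by cloning the schema repository.
    return ["git", "clone", repo_url, str(schema_dir)]


def update_schemas(repo_url: str, schema_dir: Path) -> None:
    """Run the clone or pull; raises CalledProcessError if git fails."""
    subprocess.run(git_update_command(repo_url, schema_dir), check=True)
```

Separating the command construction from its execution keeps the decision testable without network access.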

Matching config and RelaxNG schemas

QUESTION: Do we want to change the logic here before going to production? See also #116.

Stop folder reprocessing

Reprocessing configuration

On a per-journal basis, this module can reprocess previously created stop-processing folders. This feature is controlled by the following flow parameters:

  • remove_stop_folders: True (stop folders will be reprocessed for the ISSNs specified in limit_stop_folder_removal_list) or False (no reprocessing, default value)
  • limit_stop_folder_removal_list: List of ISSNs for which reprocessing should be performed
    • default: a list containing only the None element, which leads to no reprocessing
    • a list containing specific ISSNs limits reprocessing to those ISSNs
    • an empty list leads to reprocessing of all ISSNs
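The interplay of the two parameters can be summarised in a small decision function. The function name `should_reprocess` is hypothetical; only the two parameter names and their documented defaults come from the description above.

```python
from typing import Optional, Sequence


def should_reprocess(
    issn: str,
    remove_stop_folders: bool = False,
    limit_stop_folder_removal_list: Sequence[Optional[str]] = (None,),
) -> bool:
    """Decide whether stop folders of the given ISSN are reprocessed."""
    if not remove_stop_folders:
        return False  # reprocessing is switched off entirely
    if len(limit_stop_folder_removal_list) == 0:
        return True  # empty list: reprocess all ISSNs
    # The default list [None] matches no real ISSN, so nothing is reprocessed.
    return issn in limit_stop_folder_removal_list
```

For example, with an illustrative ISSN, `should_reprocess("1234-5678", True, [])` is true, while `should_reprocess("1234-5678", True)` is false because the default list contains only `None`.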

Reprocessing logic

In reprocessing mode, articles for which no stop folder was previously created are processed normally, using the previous module's output as input. If a stop-processing- folder exists for the RelaxNG scanner, the module treats that folder as if it were the folder of the previous module:

  • create a temp- folder as copy of the stop-processing- folder
  • process the contents of this temp- folder
  • rename old stop-processing- folder to clear-stop-processing-
    • this is done to keep the data safe even if something goes wrong during the following file operations
  • rename temp directory based on outcome
    • if validation was successful, rename so that it will be processed by the following module
    • if validation failed again, rename to a new stop-processing folder
  • delete the clear-stop-processing- folder and all of its contents
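The folder operations listed above could look roughly like this. The three prefixes are taken from the text; everything else is an assumption for illustration: the folder name is assumed to be the prefix followed by a suffix, `process` stands in for the validation run, and `success_prefix` is a hypothetical name for whatever prefix the following module expects.

```python
import shutil
from pathlib import Path
from typing import Callable

STOP = "stop-processing-"
TEMP = "temp-"
CLEAR = "clear-stop-processing-"


def reprocess_stop_folder(
    stop_dir: Path,
    process: Callable[[Path], bool],
    success_prefix: str,
) -> Path:
    """Re-run validation on a stop folder and rename it by outcome."""
    suffix = stop_dir.name[len(STOP):]
    parent = stop_dir.parent
    temp_dir = parent / (TEMP + suffix)
    clear_dir = parent / (CLEAR + suffix)

    shutil.copytree(stop_dir, temp_dir)  # work on a copy of the stop folder
    ok = process(temp_dir)               # validate the contents of the copy
    stop_dir.rename(clear_dir)           # keep the original as a backup
    final_prefix = success_prefix if ok else STOP
    final_dir = parent / (final_prefix + suffix)
    temp_dir.rename(final_dir)           # hand over, or create a new stop folder
    shutil.rmtree(clear_dir)             # backup no longer needed
    return final_dir
```

Because the original is only deleted in the last step, a failure during any rename still leaves a complete copy of the data on disk.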

XML pre-processing to remove DTD reference

An XML document is automatically reported as invalid (or possibly as not well-formed) if the DTD referenced in the file is not available. This reference therefore has to be removed so that documents can be processed without having all DTDs available. To do this, the module creates a copy of the XML document, opens it with the lxml etree library, removes the docinfo, and writes the contents back to the temporary copy. This temporary copy is then used for RelaxNG validation.
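The idea of writing a DOCTYPE-free copy can be illustrated with the sketch below. Note the swap: the module itself uses lxml, whereas this example uses the standard library's `xml.etree.ElementTree` as a dependency-free stand-in, since it does not fetch the external DTD when parsing and does not write a DOCTYPE when serializing. The function name and file names are hypothetical, and documents that rely on entities declared only in the DTD may need extra handling.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def write_copy_without_doctype(xml_path: Path, copy_path: Path) -> Path:
    """Write a copy of xml_path to copy_path that carries no DOCTYPE."""
    # Parsing does not load the external DTD, so a missing DTD file
    # does not prevent the document from being read.
    tree = ET.parse(str(xml_path))
    # ElementTree serializes only the element tree; the DOCTYPE
    # declaration of the source file is not written back.
    tree.write(str(copy_path), encoding="utf-8", xml_declaration=True)
    return copy_path
```

The original file is left untouched; only the temporary copy is passed on to validation.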

RelaxNG validation

A subprocess is then started that calls the Jing-Trang validator on the temporary file from the previous processing step. Based on the output of this subprocess, the XML file is deemed either valid or invalid.
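A hedged sketch of such a subprocess call follows. It assumes the validator is invoked as `<command> <schema> <xml>`, which matches Jing's usual command line (for example `java -jar jing.jar schema.rng document.xml`, with the jar path being a placeholder). Treating a zero exit code with no output as "valid" is an assumption of this example, not a documented rule of the module.

```python
import subprocess
from typing import List, Sequence, Tuple


def run_validator(
    command: Sequence[str], schema_path: str, xml_path: str
) -> Tuple[bool, List[str]]:
    """Run an external validator and return (is_valid, output_lines)."""
    result = subprocess.run(
        [*command, schema_path, xml_path],
        capture_output=True,
        text=True,
    )
    # Collect everything the validator reported, ignoring blank lines.
    output = result.stdout + result.stderr
    lines = [line for line in output.splitlines() if line.strip()]
    # Assumption: success means exit code 0 and no reported messages.
    is_valid = result.returncode == 0 and not lines
    return is_valid, lines
```

Passing the command as a parameter makes it possible to swap in a different launcher (or a stub in tests) without changing the validation logic.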

Logging results

  • For validated XML files, the name of the RelaxNG schema against which the XML file was valid is written to a text file, together with the HEAD commit hash of the git repo containing the RelaxNG schema files.
  • For invalid XML files, the Jing-Trang output containing the validation errors is written to a CSV file for manual intervention.
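The two logging cases might be sketched as below. File names, the tab-separated text layout, and the CSV column names are all assumptions for illustration; the commit hash is passed in as a plain string and could, for instance, be obtained with `git rev-parse HEAD` in the schema repository.

```python
import csv
from pathlib import Path
from typing import Iterable, Tuple


def log_valid(result_file: Path, schema_name: str, commit_hash: str) -> None:
    """Record which schema matched and the schema repo's HEAD commit."""
    result_file.write_text(f"{schema_name}\t{commit_hash}\n", encoding="utf-8")


def log_invalid(csv_file: Path, errors: Iterable[Tuple[str, str]]) -> None:
    """Write (schema, validator output line) pairs to a CSV file."""
    with csv_file.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["schema", "validator_output"])  # hypothetical header
        writer.writerows(errors)
```

Storing the commit hash next to the schema name makes it possible to tell later which version of a schema a document was validated against.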

Writing the processing history

Final processing

If validation was successful for at least one of the RelaxNG schemas in the config file, the module folder will be renamed for processing by the next module. If the XML file was invalid for all RelaxNG schemas, a stop-processing folder is created.