
Overview

The validate_metadata module contains two primary Python files responsible for metadata validation:

  1. flow.py
  2. tasks.py

flow.py

This file defines the main metadata validation flow using the Prefect library. It uses Prefect's get_run_logger to log each stage, including the module context and information about the previous module. Before validation starts, the configuration list is converted to an ISSN list.

validate_metadata()

  • Parameters (both have defaults):

  • name_of_config_file (str): Name of the configuration file. Default is "config-test_crossrefcrawler.json"

  • test_copy (bool): Flag to indicate if it is a test run. Default is False

  • Context for Use:

  • This function is the main entry point for the metadata validation flow. It initializes the module-specific context and determines the previous module; if no previous module exists, it logs that the current module is the first module. It then invokes the validation with the current and previous module names, the ISSN list, the current date-time, and the test-run flag.

  • Example Usage:

  • validate_metadata('config-test_crossrefcrawler.json', test_copy=False)

  • Example Output:

  • INFO: Module name: wdm_module
    INFO: Set base path to: W:/wdm-test
    INFO: Module context: {...}
    INFO: Previous module: previous_wdm_module
    INFO: ISSN list: ['1234-5678', '8765-4321']
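
The steps described for validate_metadata() can be sketched without Prefect, using only the standard library. This is a minimal illustration, not the actual flow: the config entry layout (a dict with an "issn" key), the module_order list, and the helper names config_to_issns and find_previous_module are assumptions made for the example. Only the 'date-str' and 'timestamp' keys are taken from the usage example in this document.

```python
import logging
from datetime import datetime

logger = logging.getLogger("validate_metadata_sketch")


def config_to_issns(config_entries):
    # Assumption: each configuration entry is a dict with an "issn" key.
    return [entry["issn"] for entry in config_entries]


def find_previous_module(module_name, module_order):
    # Assumption: modules run in a fixed order given by module_order.
    index = module_order.index(module_name)
    return module_order[index - 1] if index > 0 else None


def validate_metadata_sketch(config_entries, module_name, module_order, test_copy=False):
    logger.info("Module name: %s", module_name)
    previous_module = find_previous_module(module_name, module_order)
    if previous_module is None:
        logger.info("%s is the first module", module_name)
    else:
        logger.info("Previous module: %s", previous_module)

    issns = config_to_issns(config_entries)
    logger.info("ISSN list: %s", issns)

    now = datetime.now()
    current_datetime = {
        "date-str": now.strftime("%Y-%m-%d"),
        "timestamp": int(now.timestamp()),
    }
    # The real flow would call validate(...) here; the sketch just returns
    # the arguments it would pass on.
    return module_name, previous_module, issns, current_datetime, test_copy
```

Returning the collected arguments instead of calling the task keeps the sketch easy to inspect in isolation.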


tasks.py

This file contains the task definitions used in the validation flow. The tasks iterate over the ISSNs, validate the journal directories, check the articles for missing metadata, and log information about any metadata that is missing.

validate()

  • Required Parameters:

  • module_name (str): Name of the current module

  • previous_module (str or None): Name of the previous module, if any
  • issns (list): List of ISSNs to process
  • current_datetime (dict): Dictionary containing the current date and timestamp
  • test_copy (bool): Flag to indicate if it is a test run

  • Context for Use:

  • This function processes each ISSN (journal), validating journal directories and checking for missing metadata. If missing metadata is found and test_copy is False, an exception is raised.

  • Example Usage:

  • validate('wdm_module', 'previous_wdm_module', ['1234-5678', '8765-4321'], {'date-str': '2024-05-17', 'timestamp': 1716000000}, False)

  • Example Output:

  • INFO: Previous module: previous_wdm_module
    INFO: Have to work on 2 journals
    INFO: Validating journal: 1234-5678
    INFO: Validating journal: 8765-4321
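
The control flow described for validate() (loop over the ISSNs, collect missing metadata, raise unless it is a test run) might look roughly like the sketch below. The exception class MissingMetadataError and the find_missing callback are hypothetical stand-ins; the source only states that an exception is raised, not which type or at which point in the loop.

```python
import logging

logger = logging.getLogger("validate_sketch")


class MissingMetadataError(Exception):
    """Hypothetical exception type used only in this sketch."""


def validate_sketch(issns, find_missing, test_copy=False):
    """find_missing maps an ISSN to a list of missing field names (stand-in)."""
    logger.info("Have to work on %d journals", len(issns))
    problems = {}
    for issn in issns:
        logger.info("Validating journal: %s", issn)
        missing = find_missing(issn)
        if missing:
            problems[issn] = missing

    # Assumption: the error is raised once, after all journals were checked.
    if problems and not test_copy:
        raise MissingMetadataError(f"Missing metadata in: {sorted(problems)}")
    return problems
```

With test_copy=True the sketch returns the collected problems instead of raising, which mirrors the stated behaviour that only non-test runs fail.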


mark_missing_metadata()

  • Required Parameters:

  • module_name (str): Name of the current module

  • latest_previous_module (pathlib.Path): Path to the latest previous module directory
  • missing_metadata (list): List of missing metadata fields

  • Context for Use:

  • This function logs and marks articles with missing metadata, creating a file to record these issues.

  • Example Usage:

  • latest_previous_module = utils.get_previous_module_latest(previous_module, article_dir)
    mark_missing_metadata('wdm_module', latest_previous_module, ['module_title', 'module_author'])

  • Example Output:

  • If metadata is missing, a file is created at the path //missing_metadata.txt containing an error message in the format:
    journal_issn journal_doi, module_title, module_author
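
As an illustration of how such a report file could be written, the sketch below appends one line per article to missing_metadata.txt in a given directory. The function name, its parameters, and the DOI placeholder are assumptions for the example; the line layout follows the format quoted above, and the real target path is not reproduced here.

```python
import tempfile
from pathlib import Path


def mark_missing_metadata_sketch(target_dir, article_label, missing_metadata):
    """Append one report line for an article; return the report path or None."""
    if not missing_metadata:
        return None  # nothing to record
    report = Path(target_dir) / "missing_metadata.txt"
    line = f"{article_label}, {', '.join(missing_metadata)}\n"
    with report.open("a", encoding="utf-8") as handle:
        handle.write(line)
    return report


# Usage with a throwaway directory; "10.1000/example" is a placeholder DOI.
with tempfile.TemporaryDirectory() as tmp:
    path = mark_missing_metadata_sketch(
        tmp, "1234-5678 10.1000/example", ["module_title", "module_author"]
    )
    report_text = path.read_text(encoding="utf-8")
```

Opening the file in append mode means repeated calls for different articles accumulate in the same report rather than overwriting it.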


The lower-level functions in tasks.py are listed below.

get_missing_metadata()

  • Required Parameters:

  • module_name (str): Name of the current module

  • latest_previous_module (pathlib.Path): Path to the latest previous module directory

  • Context for Use:

  • This function checks whether important metadata components like author, title, license, issued year, publisher and description are present and returns the missing metadata.

  • Example Usage:

  • latest_previous_module = utils.get_previous_module_latest(previous_module, article_dir)
    get_missing_metadata('wdm_module', latest_previous_module)
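
The presence check itself can be sketched on a plain dictionary. This assumes the article metadata has already been loaded into a dict and that the keys are named as in REQUIRED_FIELDS below; both are assumptions for illustration, since the real function works on the previous module's directory.

```python
# Assumed key names for the components listed above.
REQUIRED_FIELDS = ("author", "title", "license", "issued", "publisher", "description")


def get_missing_metadata_sketch(metadata):
    """Return the required fields that are absent or empty in metadata."""
    return [field for field in REQUIRED_FIELDS if not metadata.get(field)]
```

Treating empty values as missing is a design choice of the sketch; a stricter or looser rule may apply in the real code.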


should_mark_stop()

  • Required Parameters:

  • missing_metadata (list): List of missing metadata fields

  • Context for Use:

  • This function checks whether there is any missing metadata, and returns True if metadata is missing, otherwise it returns False.
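
Given the behaviour described, the check reduces to testing whether the list is non-empty. A minimal sketch (the name should_mark_stop_sketch is illustrative, not the actual implementation):

```python
def should_mark_stop_sketch(missing_metadata):
    """Return True if at least one metadata field is missing."""
    return len(missing_metadata) > 0
```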