Overview

The MDPI module contains two primary Python files responsible for metadata validation:

  1. flow.py
  2. tasks.py

flow.py

This file is designed to automate the process of validating and synchronizing metadata from MDPI (Multidisciplinary Digital Publishing Institute) FTP servers. This includes fetching and processing metadata, synchronizing full-text files, and handling any missing metadata issues for a list of journals specified in the configuration file.

mdpi()

  • Required Parameters:

  • name_of_config_file (str): Name of the configuration file. Default is "config-test_crossrefcrawler.json"

  • test_copy (bool): Flag to indicate if it is a test run. Default is False

  • Context for Use:

  • This function is the main entry point for the MDPI metadata validation and processing flow. It initializes the necessary context and runs the validation for the current and previous modules; if no previous module exists, it logs that the current module is the first one. It then synchronizes metadata from the MDPI FTP server and processes articles for missing metadata.

  • Key Components:

  • Initialization

    • module_name = context.get_run_context().flow.name: Retrieves the name of the current flow
  • Context and Configuration
    • data_root = utils.get_data_root(): Gets the root directory for data storage
    • module_context = configuration.get_module_context(data_root.joinpath(name_of_config_file), module_name): Retrieves the context for the current module based on the configuration file
  • Metadata Synchronisation
    • tasks.sync_tocs_from_mdpi_ftp(): Synchronizes the table of contents (TOCs) from the MDPI FTP server
    • abbrev_mapping = tasks.mdpi_isbn_abbreviation_mapping(tasks.get_mdpi_toc_file_path(), tasks.get_abbrevcache_path(), issn_list): Creates a mapping from ISSNs to journal abbreviations
    • journals_to_be_harvested = tasks.select_journals_to_harvest(issn_list, abbrev_mapping): Filters the mapping to include only journals listed in the configuration file
    • tasks.sync_mdpi_files_from_ftp_server(journals_to_be_harvested): Synchronizes full-text files from the MDPI FTP server for the specified journals
  • Journal Processing (Iterates over each ISSN and processes the corresponding journal directory)

    • The directory path for the journal is obtained. If test_copy is set to True, processing runs on a copy of the journal folder.
    • Each item in the journal path is iterated over and processed in the following manner:
    • latest_previous_module = utils.get_previous_module_latest(previous_module, article_dir): Obtains the latest previous module directory and checks that it is updated and not marked to be skipped
    • is_harvested = tasks.is_full_text_already_harvested(module_name, previous_module, article_dir): Checks if the full text is already harvested. If it is, this is logged and the item is skipped.
    • utils.copy_content_from_previous_module(module_name, latest_previous_module): Copies content from the previous module and then creates the temporary directory for the current module. From this information and the abbreviation mapping, a tar_ball_info is created and its files are extracted to the same directory
    • documentation.add_documentation(documentation_data, module_name=module_name, latest_previous_module=latest_previous_module): Upon successful completion, documentation is added for the processed article
    • Renames the temporary directory for the current module, deletes the previous module, and marks the previous module as processed.
    • Logging is done throughout these steps to identify various stages and outcomes, including the number of articles processed, skipped, or missing required files.
  • Example Usage:

  • python mdpi('config-test_crossrefcrawler.json', test_copy=False)

  • Example Output:

  • log
    INFO: Module name: mdpi
    INFO: Set base path to: W:/wdm-test
    INFO: Module context: {...}
    INFO: Previous module: previous_wdm_module
    INFO: ISSN list: ['1234-5678', '8765-4321']
    INFO: Have to work on 1 journals
    INFO: Journal path: /path/to/journal
    INFO: Full text was previously harvested for article: article_name. Skipping.
    INFO: Fetched full text for 2 out of 3 articles in journal: /path/to/journal.
    INFO: 1 articles had updated metadata but full text already exists.
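
The per-article loop described under Journal Processing can be pictured with the minimal sketch below. It is not the flow's actual code: process_journal and the two callables it receives are hypothetical stand-ins for tasks.is_full_text_already_harvested and the copy/extract steps. Only the skip-or-fetch control flow and the counters behind the log messages are illustrated.

```python
import logging
from pathlib import Path
from typing import Callable, Dict, Iterable

logger = logging.getLogger("mdpi_flow_sketch")


def process_journal(
    article_dirs: Iterable[Path],
    is_harvested: Callable[[Path], bool],      # hypothetical stand-in for the harvested check
    fetch_full_text: Callable[[Path], bool],   # hypothetical stand-in for copy + extract
) -> Dict[str, int]:
    """Walk the article directories of one journal and tally the outcomes."""
    counts = {"fetched": 0, "skipped": 0, "failed": 0}
    for article_dir in article_dirs:
        if is_harvested(article_dir):
            logger.info(
                "Full text was previously harvested for article: %s. Skipping.",
                article_dir.name,
            )
            counts["skipped"] += 1
            continue
        if fetch_full_text(article_dir):
            counts["fetched"] += 1
        else:
            counts["failed"] += 1
    return counts
```

Passing the checks in as callables keeps the loop testable without touching the NAS or the FTP server.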


tasks.py

The file contains tasks that are designed to handle various operations related to the synchronization, processing, and extraction of metadata and full-text files from MDPI (Multidisciplinary Digital Publishing Institute) FTP servers to local NAS folders. These tasks are essential for maintaining up-to-date and accurate metadata records.

The lower-level functions in tasks.py are listed below.

get_mdpi_download_root()

  • Required Parameters:

  • None

  • Context for Use:

  • This function returns the root directory path for MDPI downloads, constructed from the data root directory and a predefined subdirectory for downloaded data.

  • Example Usage:

  • python mdpi_download_root = get_mdpi_download_root()

  • Example Output:

  • log /path/to/data_root/downloaded_data/mdpi


get_mdpi_toc_file_path()

  • Required Parameters:

  • None

  • Context for Use:

  • This function returns the file path for the MDPI table of contents (TOC) XML file within the MDPI download root directory.

  • Example Usage:

  • python toc_file_path = get_mdpi_toc_file_path()

  • Example Output:

  • log /path/to/data_root/downloaded_data/mdpi/tocs/all_journal_toc.xml


get_abbrevcache_path()

  • Required Parameters:

  • None

  • Context for Use:

  • This function returns the file path for the abbreviation cache JSON file within the MDPI download root directory.

  • Example Usage:

  • python abbrevcache_path = get_abbrevcache_path()

  • Example Output:

  • log /path/to/data_root/downloaded_data/mdpi/abbrevcache.json
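
The three path helpers above can be sketched as plain pathlib joins. This is an illustration only: the sub-paths are taken from the example outputs, DATA_ROOT is a placeholder for whatever utils.get_data_root() returns, and the data_root parameter is an assumption added here so the sketch runs on its own (the documented functions take no parameters).

```python
from pathlib import Path

# Assumption: placeholder for the value returned by utils.get_data_root().
DATA_ROOT = Path("/path/to/data_root")


def get_mdpi_download_root(data_root: Path = DATA_ROOT) -> Path:
    """Root folder for everything downloaded from MDPI."""
    return data_root / "downloaded_data" / "mdpi"


def get_mdpi_toc_file_path(data_root: Path = DATA_ROOT) -> Path:
    """TOC XML file inside the MDPI download root."""
    return get_mdpi_download_root(data_root) / "tocs" / "all_journal_toc.xml"


def get_abbrevcache_path(data_root: Path = DATA_ROOT) -> Path:
    """Abbreviation cache JSON file inside the MDPI download root."""
    return get_mdpi_download_root(data_root) / "abbrevcache.json"
```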


is_full_text_already_harvested(module_name, previous_module, article_dir_path)

  • Required Parameters:

  • module_name (str): Name of the current module

  • previous_module (str): Name of the previous module
  • article_dir_path (pathlib.Path): Path to the article directory

  • Context for Use:

  • This function checks if the full text for a given article has already been harvested by verifying the existence of processing indicators in the previous module directories.

  • Example Usage:

  • python is_harvested = is_full_text_already_harvested('module1', 'module0', pathlib.Path('/path/to/article'))

  • Example Output:

  • log True


sync_tocs_from_mdpi_ftp()

  • Required Parameters:

  • None

  • Context for Use:

  • This function uses the rclone command-line tool to synchronize the tables of contents (TOCs) from the MDPI FTP server to the local MDPI download root directory (NAS folder), and returns the path to the local MDPI TOCs.

  • Example Usage:

  • python sync_tocs_from_mdpi_ftp()

  • Example Output:

  • log
    INFO: Synchronizing TOCs from MDPI FTP.
    INFO: The following is the command line output from Rclone: ...
    INFO: ...
    INFO: The following is debug information from Rclone: ...
    INFO: ...


sync_mdpi_files_from_ftp_server(journal_dictionary)

  • Required Parameters:

  • journal_dictionary (dict): Dictionary mapping ISSNs to journal folder names on the MDPI FTP server

  • Context for Use:

  • This function synchronizes full-text files from the MDPI FTP server for the specified journals using the rclone command-line tool.

  • Example Usage:

  • python
    sample_dictionary = {'1234-5678': 'Journal Folder Name'}
    sync_mdpi_files_from_ftp_server(sample_dictionary)

  • Example Output:

  • log
    INFO: Copy journal Journal Folder Name (ISSN: 1234-5678)
    INFO: The following is the command line output from Rclone: ...
    INFO: ...
    INFO: The following is debug information from Rclone: ...
    INFO: ...
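
Both sync functions shell out to rclone. The sketch below shows one way such a call could be assembled and run with subprocess. It rests on assumptions: the remote name "mdpi-ftp", the helper names, and the choice of the `rclone copy` subcommand are illustrative and not confirmed by the documented code. Building the command is kept separate from running it so the former can be checked without rclone installed.

```python
import logging
import subprocess
from pathlib import Path
from typing import Dict, List

logger = logging.getLogger("mdpi_sync_sketch")

# Assumption: "mdpi-ftp" is a hypothetical rclone remote configured outside this code.
RCLONE_REMOTE = "mdpi-ftp"


def build_copy_commands(
    journal_dictionary: Dict[str, str], download_root: Path
) -> List[List[str]]:
    """Build one `rclone copy` command per journal folder (nothing is executed here)."""
    commands = []
    for issn, folder in journal_dictionary.items():
        logger.info("Copy journal %s (ISSN: %s)", folder, issn)
        source = f"{RCLONE_REMOTE}:{folder}"
        destination = str(download_root / folder)
        commands.append(["rclone", "copy", source, destination, "--verbose"])
    return commands


def run_rclone(command: List[str]) -> subprocess.CompletedProcess:
    """Run a command and capture stdout and stderr so both can be logged."""
    return subprocess.run(command, capture_output=True, text=True, check=False)
```

Passing the command as a list rather than a shell string means folder names containing spaces need no quoting.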


parse_mdpi_abbreviation_mapping_xml(toc_file, logger)

  • Required Parameters:

  • toc_file (str): Path to the MDPI TOC XML file

  • logger (Logger): Logger for logging information

  • Context for Use:

  • This function parses the MDPI TOC XML file to create a mapping of ISSNs to journal abbreviations.

  • Example Usage:

  • python folders = parse_mdpi_abbreviation_mapping_xml('path/to/toc_file', logger)

  • Example Output:

  • Returns a dictionary mapping ISSNs to journal abbreviations, e.g.:

  • log { '1234-5678': 'Journal Abbreviation' }
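
As a rough illustration of the parsing step, the sketch below builds an ISSN-to-abbreviation dictionary with xml.etree.ElementTree. The layout of the real MDPI TOC file is not described in this document, so the element names journal, issn, and abbreviation are purely hypothetical, and the function takes the XML text directly instead of a file path and logger.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def parse_abbreviation_mapping(xml_text: str) -> Dict[str, str]:
    """Map ISSNs to journal abbreviations.

    Assumption: each journal is a <journal> element with <issn> and
    <abbreviation> children; the real TOC schema may differ.
    """
    root = ET.fromstring(xml_text)
    mapping: Dict[str, str] = {}
    for journal in root.iter("journal"):
        issn = journal.findtext("issn")
        abbreviation = journal.findtext("abbreviation")
        # Skip incomplete entries instead of storing None values.
        if issn and abbreviation:
            mapping[issn.strip()] = abbreviation.strip()
    return mapping


SAMPLE_TOC = """
<journals>
  <journal><issn>1234-5678</issn><abbreviation>Journal Abbreviation</abbreviation></journal>
  <journal><issn>9999-0000</issn></journal>
</journals>
"""
```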

all_issns_in_mdpi_abbrev_list(issn_list, abbrev_dict)

  • Required Parameters:

  • issn_list (list): List of ISSNs to check

  • abbrev_dict (dict): Dictionary mapping ISSNs to journal abbreviations

  • Context for Use:

  • This function checks if all ISSNs in the provided list are present in the abbreviation dictionary.

  • Example Usage:

  • python
    issn_list = ['1234-5678', '9876-5432']
    abbreviation_dict = {'1234-5678': 'Journal Abbreviation'}
    result = all_issns_in_mdpi_abbrev_list(issn_list, abbreviation_dict)

  • Example Output:

  • log False
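
The check described above amounts to a membership test over the list. A minimal sketch follows; the function names are chosen here for illustration, and missing_issns is an extra helper that is not part of tasks.py.

```python
from typing import Dict, Iterable, List


def all_issns_present(issn_list: Iterable[str], abbrev_dict: Dict[str, str]) -> bool:
    """True only if every ISSN is a key of the abbreviation dictionary."""
    return all(issn in abbrev_dict for issn in issn_list)


def missing_issns(issn_list: Iterable[str], abbrev_dict: Dict[str, str]) -> List[str]:
    """Hypothetical helper: list the ISSNs that caused the check to fail."""
    return [issn for issn in issn_list if issn not in abbrev_dict]
```

With the example input above, the second ISSN has no entry in the dictionary, which is why the result is False.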


mdpi_isbn_abbreviation_mapping(toc_file, dump_file=None, issn_list=None, recreate=False)

  • Required Parameters:

  • toc_file (str): Path to the MDPI TOC XML file

  • dump_file (str, optional): Path to the abbreviation cache file
  • issn_list (list, optional): List of ISSNs to be included in the mapping
  • recreate (bool, optional): Flag to indicate if the abbreviation cache should be recreated. Default is False

  • Context for Use:

  • This function creates a mapping from ISSNs to journal abbreviations from the MDPI TOC XML file and optionally caches the mapping in a JSON file.

  • Example Usage:

  • python mapping = mdpi_isbn_abbreviation_mapping('path/to/toc_file', 'path/to/cache', ['1234-5678'], recreate=True)

  • Example Output:

  • Returns a dictionary mapping ISSNs to abbreviations

  • log { '1234-5678': 'Journal Abbreviation' }
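
The optional caching can be sketched as a read-through JSON cache. The logic below is an assumption about how dump_file, issn_list, and recreate might interact, not a description of the real function: the cache is reused only when it exists, recreate is False, and it covers every requested ISSN. build_mapping is a hypothetical callable standing in for the XML parsing step.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List, Optional


def cached_mapping(
    build_mapping: Callable[[], Dict[str, str]],  # hypothetical stand-in for XML parsing
    dump_file: Optional[Path] = None,
    issn_list: Optional[List[str]] = None,
    recreate: bool = False,
) -> Dict[str, str]:
    """Return the mapping, reading from or writing to a JSON cache when one is given."""
    if dump_file is not None and dump_file.exists() and not recreate:
        cached = json.loads(dump_file.read_text(encoding="utf-8"))
        # Reuse the cache only if it covers every requested ISSN.
        if all(issn in cached for issn in (issn_list or [])):
            return cached
    mapping = build_mapping()
    if dump_file is not None:
        dump_file.write_text(json.dumps(mapping, indent=2), encoding="utf-8")
    return mapping
```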

calculate_tarball_file_location(article_list, abbrev_mapping, mdpi_base_folder)

  • Required Parameters:

  • article_list (list): List of articles with metadata

  • abbrev_mapping (dict): Dictionary mapping ISSNs to abbreviations
  • mdpi_base_folder (str): Base folder path for MDPI data

  • Context for Use:

  • This function calculates the file locations for tarballs, XML, and PDF files for each article based on the provided metadata and abbreviation mapping.

  • Example Usage:

  • python new_article_list = calculate_tarball_file_location(article_list, abbrev_mapping, '/path/to/mdpi/base_folder')

  • Example Output:

  • Returns a list of articles with updated file locations, e.g.:

  • log [ { 'tar_file': '/path/to/tarfile', 'path_to_xml_in_tarfile': '/path/to/xml', 'path_to_pdf_in_tarfile': '/path/to/pdf', 'xml_filename': 'file.xml', 'pdf_filename': 'file.pdf' } ]

unpack_and_move_files_from_tarball(article_list)

  • Required Parameters:

  • article_list (list): List of articles with metadata

  • Context for Use:

  • This function unpacks XML and PDF files from tarballs and moves them to the appropriate directories.

  • Example Usage:

  • python new_article_list = unpack_and_move_files_from_tarball(article_list)

  • Example Output:

  • Returns a list of articles with updated status, e.g.:

  • log [ { 'status': 0, 'tar_file': '/path/to/tarfile', 'path_to_xml_in_tarfile': '/path/to/xml', 'path_to_pdf_in_tarfile': '/path/to/pdf', 'xml_filename': 'file.xml', 'pdf_filename': 'file.pdf' } ]

select_journals_to_harvest(issn_list, issn_folder_dictionary)

  • Required Parameters:

  • issn_list (list): List of ISSNs to be processed

  • issn_folder_dictionary (dict): Dictionary mapping ISSNs to journal folder names on the MDPI FTP server

  • Context for Use:

  • This function selects the journals to be harvested based on the provided ISSN list and abbreviation mapping.

  • Example Usage:

  • python
    issn_list = ['1234-5678']
    issn_folder_dictionary = {'1234-5678': 'Journal Folder Name'}
    journals_to_be_harvested = select_journals_to_harvest(issn_list, issn_folder_dictionary)

  • Example Output:

  • Returns a dictionary of journals to be harvested, e.g.:

  • log { '1234-5678': 'Journal Folder Name' }
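
One plausible reading of this selection is a dictionary filter that keeps only the configured ISSNs. In the sketch below, the function name is illustrative, and silently dropping ISSNs that have no folder entry is an assumption; the real function may log or raise instead.

```python
from typing import Dict, Iterable


def select_journals(
    issn_list: Iterable[str], issn_folder_dictionary: Dict[str, str]
) -> Dict[str, str]:
    """Keep only the entries whose ISSN appears in issn_list."""
    return {
        issn: issn_folder_dictionary[issn]
        for issn in issn_list
        if issn in issn_folder_dictionary  # assumption: unknown ISSNs are skipped
    }
```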

extract_files_from_tar(tar_info, current_module_temp_dir)

  • Required Parameters:

  • tar_info (TarballInfo): Information about the tarball

  • current_module_temp_dir (pathlib.Path): Path to the temporary directory for the current module

  • Context for Use:

  • This function extracts XML and PDF files from the tarball to the specified temporary directory.

  • Example Usage:

  • python result = extract_files_from_tar(tar_info, pathlib.Path('/path/to/temp/dir'))

  • Example Output:

  • Returns a tuple indicating success and error code, e.g.:

  • log (True, ErrorCode.NoError)
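
The extraction step can be sketched with the standard tarfile module. Several names here are assumptions: the TarballInfo fields are borrowed from the keys shown in the calculate_tarball_file_location example output, and every ErrorCode member other than NoError is hypothetical. The sketch writes each member under its bare file name, so archive-internal directory paths are never recreated on disk.

```python
import tarfile
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Tuple


class ErrorCode(Enum):
    NoError = 0
    TarballMissing = 1  # hypothetical member
    MemberMissing = 2   # hypothetical member


@dataclass
class TarballInfo:
    # Assumption: field names mirror the keys in the example output above.
    tar_file: Path
    path_to_xml_in_tarfile: str
    path_to_pdf_in_tarfile: str


def extract_files(tar_info: TarballInfo, target_dir: Path) -> Tuple[bool, ErrorCode]:
    """Copy the XML and PDF members of the tarball into target_dir."""
    if not tar_info.tar_file.is_file():
        return False, ErrorCode.TarballMissing
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_info.tar_file) as archive:
        for member_name in (
            tar_info.path_to_xml_in_tarfile,
            tar_info.path_to_pdf_in_tarfile,
        ):
            try:
                member = archive.extractfile(member_name)
            except KeyError:  # raised when the name is not in the archive
                return False, ErrorCode.MemberMissing
            if member is None:  # e.g. the name refers to a directory
                return False, ErrorCode.MemberMissing
            (target_dir / Path(member_name).name).write_bytes(member.read())
    return True, ErrorCode.NoError
```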