Overview

The MDPI module contains two primary Python files responsible for metadata validation.

flow.py

This file is designed to automate the process of validating and synchronizing metadata from MDPI (Multidisciplinary Digital Publishing Institute) FTP servers. This includes fetching and processing metadata, synchronizing full-text files, and handling any missing metadata issues for a list of journals specified in the configuration file.

mdpi()

Parameters:
name_of_config_file (str): Name of the configuration file. Default is "config-test_crossrefcrawler.json"
test_copy (bool): Flag to indicate if it is a test run. Default is False
Context for Use:
This function is the main entry point for the MDPI metadata validation and processing flow. It initializes the necessary context, validates the current and previous modules (if no previous module exists, it logs that the current module is the first module), synchronizes metadata from the MDPI FTP server, and processes articles for missing metadata.
Key Components:
Initialization
- module_name = context.get_run_context().flow.name: Retrieves the name of the current flow
Context and Configuration
- data_root = utils.get_data_root(): Gets the root directory for data storage
- module_context = configuration.get_module_context(data_root.joinpath(name_of_config_file), module_name): Retrieves the context for the current module based on the configuration file
Metadata Synchronization
- tasks.sync_tocs_from_mdpi_ftp(): Synchronizes the table of contents (TOCs) from the MDPI FTP server
- abbrev_mapping = tasks.mdpi_isbn_abbreviation_mapping(tasks.get_mdpi_toc_file_path(), tasks.get_abbrevcache_path(), issn_list): Creates a mapping from ISSNs to journal abbreviations
- journals_to_be_harvested = tasks.select_journals_to_harvest(issn_list, abbrev_mapping): Filters the ISSN-to-abbreviation mapping to include only journals listed in the configuration file
- tasks.sync_mdpi_files_from_ftp_server(journals_to_be_harvested): Synchronizes full-text files from the MDPI FTP server for the specified journals
Journal Processing (Iterates over each ISSN and processes the corresponding journal directory)
- The directory path for the journal is obtained. If test_copy is set to True, processing works on a copy of the journal folder.
- Each item in the journal path is processed as follows:
  - latest_previous_module = utils.get_previous_module_latest(previous_module, article_dir): Obtains the latest previous module directory and checks that it is updated and not marked to be skipped
  - is_harvested = tasks.is_full_text_already_harvested(module_name, previous_module, article_dir): Checks if the full text is already harvested. If so, this is logged and the item is skipped.
  - utils.copy_content_from_previous_module(module_name, latest_previous_module): Copies content from the previous module and then creates the temporary directory for the current module. From this information and the abbreviation mapping, a tar_ball_info is created and its files are extracted to the same directory
  - documentation.add_documentation(documentation_data, module_name=module_name, latest_previous_module=latest_previous_module): Upon successful completion, documentation is added for the processed article
  - Renames the temporary directory for the current module, deletes the previous module, and marks the previous module as processed.
- Logging is done throughout these steps to identify various stages and outcomes, including the number of articles processed, skipped, or missing required files.
Example Usage:
```python
mdpi('config-test_crossrefcrawler.json', test_copy=False)
```
Example Output:
```log
INFO: Module name: mdpi
INFO: Set base path to: W:/wdm-test
INFO: Module context: {...}
INFO: Previous module: previous_wdm_module
INFO: ISSN list: ['1234-5678', '8765-4321']
INFO: Have to work on 1 journals
INFO: Journal path: /path/to/journal
INFO: Full text was previously harvested for article: article_name. Skipping.
INFO: Fetched full text for 2 out of 3 articles in journal: /path/to/journal.
INFO: 1 articles had updated metadata but full text already exists.
```
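
The skip-and-count logic of the journal processing loop can be sketched as follows. This is a minimal illustration rather than the flow's actual code: `process_journal` and the two callables `is_harvested` and `fetch` are hypothetical stand-ins for the `tasks` and `utils` calls listed above.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable


def process_journal(
    article_dirs: Iterable[Path],
    is_harvested: Callable[[Path], bool],
    fetch: Callable[[Path], bool],
) -> Dict[str, int]:
    """Count fetched, skipped and failed articles for one journal (illustrative only)."""
    counts = {"fetched": 0, "skipped": 0, "failed": 0}
    for article_dir in article_dirs:
        # Articles whose full text was harvested earlier are skipped.
        if is_harvested(article_dir):
            counts["skipped"] += 1
            continue
        # `fetch` stands in for copying, extracting and documenting one article.
        if fetch(article_dir):
            counts["fetched"] += 1
        else:
            counts["failed"] += 1
    return counts
```

The returned counts correspond to the kind of summary the flow logs at the end of each journal.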

tasks.py

This file contains tasks for synchronizing, processing, and extracting metadata and full-text files from MDPI (Multidisciplinary Digital Publishing Institute) FTP servers to local NAS folders. These tasks keep the metadata records accurate and up to date.

The lower-level functions in tasks.py are listed below.

get_mdpi_download_root()

Required Parameters:
None
Context for Use:
This function returns the root directory path for MDPI downloads, constructed from the data root directory and a predefined subdirectory for downloaded data.
Example Usage:
```python
mdpi_download_root = get_mdpi_download_root()
```
Example Output:
```log
/path/to/data_root/downloaded_data/mdpi
```

get_mdpi_toc_file_path()

Required Parameters:
None
Context for Use:
This function returns the file path for the MDPI table of contents (TOC) XML file within the MDPI download root directory.
Example Usage:
```python
toc_file_path = get_mdpi_toc_file_path()
```
Example Output:
```log
/path/to/data_root/downloaded_data/mdpi/tocs/all_journal_toc.xml
```

get_abbrevcache_path()

Required Parameters:
None
Context for Use:
This function returns the file path for the abbreviation cache JSON file within the MDPI download root directory.
Example Usage:
```python
abbrevcache_path = get_abbrevcache_path()
```
Example Output:
```log
/path/to/data_root/downloaded_data/mdpi/abbrevcache.json
```
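
The three path helpers above can be reproduced with `pathlib`, using the directory layout shown in the example outputs. This is a sketch under assumptions: the real functions take no arguments and read the data root from `utils.get_data_root()`, whereas the hypothetical versions below accept it as a parameter so the example is self-contained.

```python
from pathlib import Path

# Hypothetical stand-in for utils.get_data_root(); the real value comes from the project setup.
DATA_ROOT = Path("/path/to/data_root")


def mdpi_download_root(data_root: Path = DATA_ROOT) -> Path:
    """Root folder for everything downloaded from MDPI."""
    return data_root / "downloaded_data" / "mdpi"


def mdpi_toc_file_path(data_root: Path = DATA_ROOT) -> Path:
    """Location of the TOC XML file below the download root."""
    return mdpi_download_root(data_root) / "tocs" / "all_journal_toc.xml"


def abbrevcache_path(data_root: Path = DATA_ROOT) -> Path:
    """Location of the abbreviation cache JSON file below the download root."""
    return mdpi_download_root(data_root) / "abbrevcache.json"
```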

is_full_text_already_harvested(module_name, previous_module, article_dir_path)

Required Parameters:
module_name (str): Name of the current module
previous_module (str): Name of the previous module
article_dir_path (pathlib.Path): Path to the article directory
Context for Use:
This function checks if the full text for a given article has already been harvested by verifying the existence of processing indicators in the previous module directories.
Example Usage:
```python
is_harvested = is_full_text_already_harvested('module1', 'module0', pathlib.Path('/path/to/article'))
```
Example Output:
```log
True
```
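
One way such a check could work is to look for a marker file inside the previous module's directories. The sketch below is purely illustrative: the marker name `processed_by_<module>` and the assumption that previous-module directories start with the module name are hypothetical, and the real processing indicators may look different.

```python
from pathlib import Path


def has_processing_marker(article_dir: Path, previous_module: str, module_name: str) -> bool:
    """Return True if any previous-module directory carries a marker for module_name."""
    marker_name = f"processed_by_{module_name}"  # hypothetical marker naming scheme
    # Assumption: previous-module directories are named "<previous_module>...".
    for module_dir in article_dir.glob(f"{previous_module}*"):
        if module_dir.is_dir() and (module_dir / marker_name).exists():
            return True
    return False
```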

sync_tocs_from_mdpi_ftp()

Required Parameters:
None
Context for Use:
This function synchronizes the table of contents (TOCs) from the MDPI FTP server to the local MDPI download root directory (NAS folder) using the rclone command-line tool and returns the path to the local MDPI TOCs.
Example Usage:
```python
sync_tocs_from_mdpi_ftp()
```
Example Output:
```log
INFO: Synchronizing TOCs from MDPI FTP.
INFO: The following is the command line output from Rclone: ...
INFO: ...
INFO: The following is debug information from Rclone: ...
INFO: ...
```

sync_mdpi_files_from_ftp_server(journal_dictionary)

Required Parameters:
journal_dictionary (dict): Dictionary mapping ISSNs to journal folder names on the MDPI FTP server
Context for Use:
This function synchronizes full-text files from the MDPI FTP server for the specified journals using the rclone command-line tool.
Example Usage:
```python
sample_dictionary = {'1234-5678': 'Journal Folder Name'}
sync_mdpi_files_from_ftp_server(sample_dictionary)
```
Example Output:
```log
INFO: Copy journal Journal Folder Name (ISSN: 1234-5678)
INFO: The following is the command line output from Rclone: ...
INFO: ...
INFO: The following is debug information from Rclone: ...
INFO: ...
```
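
Driving rclone from Python typically means building an argument list and running it with `subprocess`. The sketch below assumes an `rclone copy` invocation, a configured remote named by the caller, and a folder layout that mirrors the remote; the helper names are hypothetical and the flags actually used by tasks.py are not documented here.

```python
import subprocess
from typing import List, Tuple


def build_rclone_copy_command(remote: str, journal_folder: str, local_root: str) -> List[str]:
    """Build an `rclone copy` command for one journal folder (illustrative only)."""
    source = f"{remote}:{journal_folder}"
    destination = f"{local_root}/{journal_folder}"
    # -v asks rclone for verbose log output.
    return ["rclone", "copy", source, destination, "-v"]


def run_rclone(command: List[str]) -> Tuple[str, str]:
    """Run the command and return (stdout, stderr); rclone writes its log messages to stderr."""
    completed = subprocess.run(command, capture_output=True, text=True, check=False)
    return completed.stdout, completed.stderr
```

Passing the command as a list avoids shell quoting problems with folder names that contain spaces, such as 'Journal Folder Name'.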

parse_mdpi_abbreviation_mapping_xml(toc_file, logger)

Required Parameters:
toc_file (str): Path to the MDPI TOC XML file
logger (Logger): Logger for logging information
Context for Use:
This function parses the MDPI TOC XML file to create a mapping of ISSNs to journal abbreviations.
Example Usage:
```python
folders = parse_mdpi_abbreviation_mapping_xml('path/to/toc_file', logger)
```
Example Output:
Returns a dictionary mapping ISSNs to journal abbreviations, e.g.:
```log
{ '1234-5678': 'Journal Abbreviation' }
```
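
Parsing an XML file into an ISSN-to-abbreviation dictionary can be sketched with the standard library. The element names `journal`, `issn`, and `abbrev` below are invented for illustration; the structure of the real MDPI TOC file is not described in this document and is likely different.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Hypothetical sample document; the real TOC schema is an assumption here.
SAMPLE_TOC = """\
<journals>
  <journal><issn>1234-5678</issn><abbrev>Journal Abbreviation</abbrev></journal>
  <journal><issn></issn><abbrev>no-issn</abbrev></journal>
</journals>
"""


def parse_issn_abbreviations(xml_text: str) -> Dict[str, str]:
    """Map each non-empty ISSN to its journal abbreviation."""
    mapping: Dict[str, str] = {}
    root = ET.fromstring(xml_text)
    for journal in root.iter("journal"):
        issn = (journal.findtext("issn") or "").strip()
        abbrev = (journal.findtext("abbrev") or "").strip()
        # Entries without an ISSN or abbreviation cannot be mapped and are dropped.
        if issn and abbrev:
            mapping[issn] = abbrev
    return mapping
```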

all_issns_in_mdpi_abbrev_list(issn_list, abbrev_dict)

Required Parameters:
issn_list (list): List of ISSNs to check
abbrev_dict (dict): Dictionary mapping ISSNs to journal abbreviations
Context for Use:
This function checks if all ISSNs in the provided list are present in the abbreviation dictionary.
Example Usage:
```python
issn_list = ['1234-5678', '9876-5432']
abbreviation_dict = {'1234-5678': 'Journal Abbreviation'}
result = all_issns_in_mdpi_abbrev_list(issn_list, abbreviation_dict)
```
Example Output:
```log
False
```
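
The membership check described here amounts to a single `all()` expression. The following is a minimal sketch with hypothetical function names, not the code in tasks.py; the second helper is an illustrative addition showing how the missing ISSNs could be reported.

```python
from typing import Dict, List


def all_issns_present(issn_list: List[str], abbrev_dict: Dict[str, str]) -> bool:
    """True only if every ISSN in the list is a key of the abbreviation dictionary."""
    return all(issn in abbrev_dict for issn in issn_list)


def missing_issns(issn_list: List[str], abbrev_dict: Dict[str, str]) -> List[str]:
    """List the ISSNs that have no abbreviation entry, in their original order."""
    return [issn for issn in issn_list if issn not in abbrev_dict]
```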

mdpi_isbn_abbreviation_mapping(toc_file, dump_file=None, issn_list=None, recreate=False)

Parameters:
toc_file (str): Path to the MDPI TOC XML file
dump_file (str, optional): Path to the abbreviation cache file
issn_list (list, optional): List of ISSNs to be included in the mapping
recreate (bool, optional): Flag to indicate if the abbreviation cache should be recreated. Default is False
Context for Use:
This function creates a mapping from ISSNs to journal abbreviations from the MDPI TOC XML file and optionally caches the mapping in a JSON file.
Example Usage:
```python
mapping = mdpi_isbn_abbreviation_mapping('path/to/toc_file', 'path/to/cache', ['1234-5678'], recreate=True)
```
Example Output:
Returns a dictionary mapping ISSNs to abbreviations, e.g.:
```log
{ '1234-5678': 'Journal Abbreviation' }
```
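
A load-or-rebuild cache of this kind can be sketched as follows. The function name, the `build` callable (standing in for the XML parsing step), and the rule that a cache missing any requested ISSN counts as stale are all assumptions made for illustration; the real function may treat `issn_list` differently.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List, Optional


def load_or_build_mapping(
    build: Callable[[], Dict[str, str]],
    dump_file: Optional[Path] = None,
    issn_list: Optional[List[str]] = None,
    recreate: bool = False,
) -> Dict[str, str]:
    """Return the cached mapping if usable, otherwise rebuild it and refresh the cache."""
    mapping = None
    if dump_file is not None and dump_file.is_file() and not recreate:
        mapping = json.loads(dump_file.read_text(encoding="utf-8"))
        # Assumed behaviour: a cache that lacks a requested ISSN is treated as stale.
        if issn_list and any(issn not in mapping for issn in issn_list):
            mapping = None
    if mapping is None:
        mapping = build()
        if dump_file is not None:
            dump_file.write_text(json.dumps(mapping, indent=2), encoding="utf-8")
    return mapping
```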

calculate_tarball_file_location(article_list, abbrev_mapping, mdpi_base_folder)

Required Parameters:
article_list (list): List of articles with metadata
abbrev_mapping (dict): Dictionary mapping ISSNs to abbreviations
mdpi_base_folder (str): Base folder path for MDPI data
Context for Use:
This function calculates the file locations for tarballs, XML, and PDF files for each article based on the provided metadata and abbreviation mapping.
Example Usage:
```python
new_article_list = calculate_tarball_file_location(article_list, abbrev_mapping, '/path/to/mdpi/base_folder')
```
Example Output:
Returns a list of articles with updated file locations, e.g.:
```log
[
    {
        'tar_file': '/path/to/tarfile',
        'path_to_xml_in_tarfile': '/path/to/xml',
        'path_to_pdf_in_tarfile': '/path/to/pdf',
        'xml_filename': 'file.xml',
        'pdf_filename': 'file.pdf'
    }
]
```

unpack_and_move_files_from_tarball(article_list)

Required Parameters:
article_list (list): List of articles with metadata
Context for Use:
This function unpacks XML and PDF files from tarballs and moves them to the appropriate directories.
Example Usage:
```python
new_article_list = unpack_and_move_files_from_tarball(article_list)
```
Example Output:
Returns a list of articles with updated status, e.g.:
```log
[
    {
        'status': 0,
        'tar_file': '/path/to/tarfile',
        'path_to_xml_in_tarfile': '/path/to/xml',
        'path_to_pdf_in_tarfile': '/path/to/pdf',
        'xml_filename': 'file.xml',
        'pdf_filename': 'file.pdf'
    }
]
```

select_journals_to_harvest(issn_list, issn_folder_dictionary)

Required Parameters:
issn_list (list): List of ISSNs to be processed
issn_folder_dictionary (dict): Dictionary mapping ISSNs to journal folder names on the MDPI FTP server
Context for Use:
This function selects the journals to be harvested based on the provided ISSN list and abbreviation mapping.
Example Usage:
```python
issn_list = ['1234-5678']
issn_folder_dictionary = {'1234-5678': 'Journal Folder Name'}
journals_to_be_harvested = select_journals_to_harvest(issn_list, issn_folder_dictionary)
```
Example Output:
Returns a dictionary of journals to be harvested, e.g.:
```log
{ '1234-5678': 'Journal Folder Name' }
```
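
The selection is essentially a filter of the ISSN-to-folder dictionary by the configured ISSN list. A minimal sketch, with a hypothetical function name and assuming that ISSNs without a folder entry are simply left out:

```python
from typing import Dict, List


def select_by_issn(issn_list: List[str], issn_folder_dictionary: Dict[str, str]) -> Dict[str, str]:
    """Keep only the dictionary entries whose ISSN appears in issn_list."""
    return {
        issn: issn_folder_dictionary[issn]
        for issn in issn_list
        if issn in issn_folder_dictionary  # assumption: unknown ISSNs are skipped, not errors
    }
```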

extract_files_from_tar(tar_info, current_module_temp_dir)

Required Parameters:
tar_info (TarballInfo): Information about the tarball
current_module_temp_dir (pathlib.Path): Path to the temporary directory for the current module
Context for Use:
This function extracts XML and PDF files from the tarball to the specified temporary directory.
Example Usage:
```python
result = extract_files_from_tar(tar_info, pathlib.Path('/path/to/temp/dir'))
```
Example Output:
Returns a tuple indicating success and error code, e.g.:
```log
(True, ErrorCode.NoError)
```
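
Extracting selected members from a tarball and reporting a (success, error code) pair can be sketched with the standard `tarfile` module. Only `ErrorCode.NoError` appears in this document; the other enum members, the function name, and the plain-argument signature (instead of a `TarballInfo` object) are hypothetical.

```python
import enum
import shutil
import tarfile
from pathlib import Path
from typing import List, Tuple


class ErrorCode(enum.Enum):
    NoError = 0
    TarballMissing = 1  # hypothetical member
    MemberMissing = 2   # hypothetical member


def extract_members(tar_path: Path, member_names: List[str], target_dir: Path) -> Tuple[bool, ErrorCode]:
    """Copy the named archive members into target_dir, flattening their paths."""
    if not tar_path.is_file():
        return False, ErrorCode.TarballMissing
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path) as tar:
        for name in member_names:
            try:
                member = tar.getmember(name)
            except KeyError:
                return False, ErrorCode.MemberMissing
            source = tar.extractfile(member)
            if source is None:  # member is not a regular file
                return False, ErrorCode.MemberMissing
            # Write under the base name only, so archive paths cannot escape target_dir.
            with source, open(target_dir / Path(name).name, "wb") as out:
                shutil.copyfileobj(source, out)
    return True, ErrorCode.NoError
```

Copying through `extractfile` rather than calling `extractall` keeps control over where each file lands.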