Module DSpace Article Ingest
General Information
This module extracts all necessary information from a DIP that contains files in a folder structure as created by the harvesting full text modules. The metadata are extracted from the archivematica METS file. It then creates and crosslinks the following entities in DSpace:
- a dataset/catalog entry
- several distributions containing the previously harvested files
- a distribution for the converted XML with a dummy file to store rights and role information
Those entities are then crosslinked according to the DSpace entity model
Module logic
The central entity in the module is an instance of the class articleDIP. articleDIP provides methods to load all data required for DSpace into instance variables and to convert it into the structure the DSpace API expects, which is again stored in instance variables. It thus represents the DIP in a form that allows interaction with the DSpace API.
The module has further functions to perform the actual API calls such as create communities/collections/items/bundles/bitstreams in DSpace and create relationships.
Unexpected events in the API communication cause the module to stop processing the respective article and to try to remove all entities from DSpace that have already been created for that article. Failing that, it will stop flow execution.
DSpace API connection
The module uses the dspace_rest_client library. The connection itself is wrapped in a self-defined connectionHandler class, mainly to allow re-using the same connection across multiple articles, but also to re-authenticate in DSpace every few minutes so that API calls do not fail due to session termination.
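The wrapper can be sketched roughly as below. This is a minimal illustration, not the module's actual implementation: the class and attribute names are assumptions, and `client` stands for any object exposing an `authenticate()` method.

```python
import time

class ConnectionHandler:
    """Sketch of the connection wrapper described above: reuse one client
    across articles, but re-authenticate after a fixed interval so calls
    do not fail on expired sessions. Names are illustrative."""

    def __init__(self, client, reauth_seconds=300):
        self.client = client                  # e.g. a dspace_rest_client client
        self.reauth_seconds = reauth_seconds  # re-login interval
        self._last_auth = 0.0                 # epoch time of last login

    def get(self):
        # Log in again if the last authentication is too long ago.
        if time.time() - self._last_auth > self.reauth_seconds:
            self.client.authenticate()
            self._last_auth = time.time()
        return self.client
```

Callers fetch the client through `get()` before every batch of API calls, so re-authentication happens transparently.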
Processing steps
DIP Preparation and information extraction
- The module will work on a single folder in the DIP folder. If the DIP folder contains multiple folders, the module skips the article.
- The module will create a temporary unzipped copy of the DIP contents in the same folder where the zipped contents are
DIP preparation is mainly performed by the function prepare_dip_contents in dip_functions.py. It will call, in the correct order, all articleDIP methods that are necessary for loading DIP contents. The DIP path is determined beforehand by the function get_dip_folderpath from dip_functions.py
dip_functions.get_dip_folderpath(module_path)
Examines the DIP folder of the respective article and returns the path to the single folder inside it if there is exactly one folder, otherwise False.
Parameters
- module_path: The processing folder for the respective article (the temporary folder created by copying the contents from the previous module)
Return values
- processing as expected
- Path to DIP as string
- error during processing
- False
dip_functions.prepare_dip_contents(dip_path, dummy_file)
Parameters
- dip_path: Path on the file system to the DIP folder that will be loaded
- dummy_file: Path on the file system to a file that will be uploaded as dummy_file for the distribution that will later contain the converted document
Return values
- processing as expected
- the newly created articleDIP instance with all data loaded
- None
- error during processing
- None
- String containing error messages
Called articleDIP methods
- constructor __init__(self, dip_path):
- defines instance variables
- The class articleDIP is initialized directly for a specific DIP whose location is passed via dip_path
- load_metadata(self)
- Extracts metadata values from the DIP METS files and saves them as object variables
- create_hierarchy(self)
- Creates a tuple describing the article's position in the DSpace hierarchy, with the collection as the last tuple element, and saves it in an instance variable
- unpack_zip(self)
- Unpacks ZIP file contained in the DIP into the same folder
- create_dataset_metadata_dict(self):
- Sorts the metadata loaded from the DIP into the dict according to DSpace API format and saves the resulting dict as instance variable
- check_distribution_type_existence(self)
- Walks through a decision tree to decide which distributions should be created. No data may be left out, but reuse should be minimal and never occur for main texts
- creates a list of distributions with the structure shown in the addendum
- Compatible file formats to create distributions are defined in distribution_types.py
- create_distribution_metadata_dicts(self):
- Based on the metadata available in the DIP object, adds an item metadata dict for every type of distribution in the distribution list
- uses a generic method for all distributions (create_distribution_metadata_dict_generic(self)) and adds distribution-type specific data
- create_distribution_file_list(self)
- analyses the folder contents of each distribution and adds the list of files to be uploaded to each distribution
- create_bitstream_metadata(self)
- adds type information to all files to be uploaded as bitstreams
- create_tei_dummy_distribution(self, dummy_file):
- expands the distribution list with a "pre-reserved" distribution for the converted document so that rights and roles can already be adjusted
- registers the file from the path in the variable dummy_file to be uploaded as dummy bitstream
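The loading sequence performed by prepare_dip_contents can be sketched as below, following the method list above. The `articleDIP` class shown here is a no-op stand-in so the sketch runs on its own; the real class and the exact call order live in the module.

```python
class articleDIP:
    """No-op stand-in for the module's class, so this sketch is
    self-contained; it merely records which methods were invoked."""
    def __init__(self, dip_path):
        self.dip_path = dip_path
        self.calls = []
    def __getattr__(self, name):          # accept any method call
        self.calls.append(name)
        return lambda *args: None

def prepare_dip_contents(dip_path, dummy_file):
    """Sketch of the loading sequence; error handling is simplified.
    Returns (articleDIP instance, None) or (None, error string)."""
    try:
        dip = articleDIP(dip_path)
        dip.load_metadata()
        dip.create_hierarchy()
        dip.unpack_zip()
        dip.create_dataset_metadata_dict()
        dip.check_distribution_type_existence()
        dip.create_distribution_metadata_dicts()
        dip.create_distribution_file_list()
        dip.create_bitstream_metadata()
        dip.create_tei_dummy_distribution(dummy_file)
    except Exception as err:
        return None, str(err)
    return dip, None
```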
Check for previous article ingest
The module will check based on the toolbox ID (part of the metadata extracted during DIP preparation) if the article is already in DSpace. If it is, processing is stopped for this article by creating a stop-processing-folder and the module will continue with the next article.
item_functions.toolbox_id_already_in_dspace(connection, toolbox_id, catalog_top_uuid)
Parameters
- connection: a dspace_rest_python API connection
- toolbox_id: the toolbox ID of the current article
- catalog_top_uuid: The top uuid of the DSpace catalog/datasets community, used to distinguish between test and productive communities
Return values
- processing as expected
- False
- Empty list
- unexpected behaviour: previous ingest
- True
- List of DSpace objects
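The duplicate check might look like the following sketch. The query field name is hypothetical, and `connection` is assumed to expose a `search_objects()` call of this shape; the real function's query details may differ.

```python
def toolbox_id_already_in_dspace(connection, toolbox_id, catalog_top_uuid):
    """Sketch of the duplicate check described above. The search call
    shape and the query field name are assumptions."""
    hits = connection.search_objects(
        query=f"toolbox_id:{toolbox_id}",   # hypothetical query field
        scope=catalog_top_uuid)             # restrict to one community tree
    if hits:
        return True, list(hits)             # previous ingest found
    return False, []                        # article not yet in DSpace
```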
Creating and identifying DSpace communities and collections
Based on the information in the articleDIP instance (publisher, journal, volume, issue), the module decides into which collection to sort an article. First, it queries the DSpace API to check whether fitting communities and collections already exist. If not, they are created. Information on existing and newly created DSpace communities and collections is cached during a module run to reduce the number of API calls.
Data structure and cache
The hierarchy for an article is stored in a tuple with the structure (publisher, "Periodika", journal, volume), with volume being the collection. Information during each run is cached in a dict, separately for the catalog and the distributions hierarchy. The cache keys themselves are tuples of the following structure: (top_uuid, (part_of_hierarchy)), e.g. (top_uuid, (publisher, "Periodika", journal)) for the community of the respective journal.
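As a small, concrete illustration of this cache layout (all names and UUIDs are placeholders):

```python
# Illustrative cache content; all names and UUIDs are placeholders.
top_uuid = "0a1b2c3d"  # top community of the catalog hierarchy

hierarchy = ("Example Publisher", "Periodika", "Example Journal", "Vol. 12")

cache = {
    # community of the respective journal
    (top_uuid, hierarchy[:3]): "uuid-journal-community",
    # collection of the respective volume
    (top_uuid, hierarchy): "uuid-volume-collection",
}
```

Tuples are hashable, so each prefix of the hierarchy can serve directly as a dict key.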
hierarchy_functions.get_collection_uuid(connection, top_community_uuid, hierarchy, url)
Parses and creates the necessary hierarchy entries below the community top_community_uuid. The function is called separately for the catalog and the distribution collection.
Parameters
- connection: dspace_rest_client connection object
- top_community_uuid: uuid of community under which hierarchy is parsed (either the top community for catalog or distributions)
- hierarchy: list of element names in hierarchy, last being the collection name as created when preparing the articleDIP object
- url: base url of the DSpace REST-API
Return value
Returns uuid for the collection that is the last entry in hierarchy list
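The lookup-or-create walk through the hierarchy can be sketched as below. The real function talks to the REST API; here `lookup` and `create` are hypothetical stand-ins for those calls, and the explicit cache parameter makes the per-run caching visible.

```python
def get_collection_uuid(lookup, create, top_uuid, hierarchy, cache):
    """Illustrative hierarchy walk: resolve each level under its parent,
    communities first, the collection last. lookup(parent, name) returns
    an existing uuid or None; create(parent, name, is_collection) makes
    a new community/collection and returns its uuid. Both are stand-ins
    for the real API calls."""
    parent = top_uuid
    for depth, name in enumerate(hierarchy):
        key = (top_uuid, tuple(hierarchy[:depth + 1]))
        if key not in cache:
            is_collection = depth == len(hierarchy) - 1
            found = lookup(parent, name)
            cache[key] = found if found else create(parent, name, is_collection)
        parent = cache[key]
    return parent  # uuid of the collection (last hierarchy element)
```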
Item creation in DSpace
The module will create new items for datasets and distributions in the previously identified collections and, for distributions, also create bundles and upload files.
item_functions.create_dspace_objects(connection, dip, dataset_collection, distribution_collection, url)
This function makes the necessary API calls to create all DSpace objects needed for the articleDIP object and updates the articleDIP object with the information retrieved from the respective endpoints. It also hides the ulbxml dummy distribution by deleting all READ policies. Currently, no access rights information is provided for any distribution or bitstream, as only open access documents are handled.
Parameters
- connection: dspace_rest_client connection object
- dip: Instance of articleDIP containing all the necessary information for DSpace object creation (will be modified with information from the endpoints)
- dataset_collection: uuid of collection that will contain the dataset item
- distribution_collection: uuid of collection that will contain the distribution items
- url: base url of the DSpace REST-API
Return values
- processing as expected
- True
- String containing processing log
- Empty string
- error during processing
- False
- String containing processing log up to the point where the error occurred
- String containing error description
Entity crosslinking in DSpace
The script then crosslinks the dataset item with the distribution items by creating DSpace entity relationships.
item_functions.create_relationships(connection, dip, url)
Parameters
- connection: dspace_rest_client connection object
- dip: Instance of articleDIP containing all the necessary information for DSpace object creation
- url: base url of the DSpace REST-API
Return value
- processing as expected: True
- error during processing: False
Exception handling
For item creation, data upload and crosslinking, the script catches most exceptions and tries to handle them in a way that does not compromise DSpace data integrity and allows the same article to be ingested again without manual intervention. In most cases this means the module keeps track of newly created DSpace objects within the articleDIP object and attempts to delete those items (and thus also the attached data) if the article ingest cannot be finalized.
Issues with the deletion of DSpace objects due to unexpected behaviour will result in script termination.
item_functions.revert_item_creation(connection, dip, url):
Deletes newly created items. Needed in case some items were already created before an error occurred.
Parameters
- connection: dspace_rest_client connection object
- dip: articleDIP object whose created entities should be deleted
- url: base url of the DSpace REST-API
Return values
- processing as expected:
- True
- empty list
- error during processing
- False
- List of items which could not be deleted
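Putting the documented return values together, the overall ingest-or-revert flow might look like this sketch, where `create_fn` and `revert_fn` stand in for create_dspace_objects and revert_item_creation (called with their real arguments in the module):

```python
def ingest_or_revert(create_fn, revert_fn):
    """Sketch of the documented error-handling flow: on a failed ingest,
    delete the partially created items; if that also fails, stop the
    whole flow. create_fn/revert_fn are stand-ins for the real calls."""
    ok, log, error = create_fn()        # (bool, processing log, error string)
    if ok:
        return True                     # article fully ingested
    reverted, leftovers = revert_fn()   # (bool, items that resisted deletion)
    if not reverted:
        # unexpected deletion failure: terminate flow execution
        raise RuntimeError(f"could not delete: {leftovers}")
    return False                        # ingest reverted cleanly
```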
Processing documentation
The script writes one line to history.csv for each created object of the following types: dataset item, distribution item, bitstream. The input information always includes the archivematica DIP filename (which contains the AIP UUID) and, for bitstreams, a specific path within the DIP. The output includes the type of the created entity followed by the DSpace object UUID. A link to a previously created entity is specified before the entity type for distributions (dataset item UUID) and bitstreams (distribution item UUID). For distributions, the type is given in brackets as a semicolon-separated list.
Addendum
Structure of distribution list
{
"aux": [list of other distribution elements, currently just images],
"dso": DSO_object (after ingest) or None,
"dspace_metadata": { dict of DSpace metadata according to DSpace REST API format },
"files": [ list of files, see below ],
"main": "main type, could be pdf/supplierxml/ulbxml or others",
"uuid": DSpace_item_object UUID as string after ingest or None
}
Structure of file list element
{
"dso": dspace_rest_client Item Object after creation, otherwise None,
"dspace_metadata": { DSpace REST API format metadata (generated by articleDIP method) },
"function": file function, e.g. "main text" or "image",
"path": path on NAS,
"uuid": DSpace object UUID after creation, otherwise None
}
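A concrete instance of the two structures above might look like this; all paths, titles and type names are placeholders for illustration:

```python
# One file-list element, nested into one distribution-list element.
# All paths, titles and type names are placeholders.
main_pdf = {
    "dso": None,                        # filled after bitstream creation
    "dspace_metadata": {"metadata": {"dc.title": [{"value": "main.pdf"}]}},
    "function": "main text",
    "path": "/nas/example-dip/objects/main.pdf",
    "uuid": None,
}

pdf_distribution = {
    "aux": [],                          # e.g. image distribution elements
    "dso": None,                        # filled after item creation
    "dspace_metadata": {"metadata": {"dc.title": [{"value": "Example (PDF)"}]}},
    "files": [main_pdf],
    "main": "pdf",
    "uuid": None,
}
```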