Module DSpace Article Ingest
General Information
This module extracts all necessary information from a DIP that contains files in a folder structure as created by the harvesting full text modules. The metadata are extracted from the archivematica METS file. It then creates and crosslinks the following entities in DSpace:
- a dataset/catalog entry
- several distributions containing the previously harvested files
- a distribution for the converted XML with a dummy file to store rights and role information
Those entities are then crosslinked according to the DSpace entity model
Module logic
The central entity in the module is an instance of the class articleDIP. articleDIP provides methods to load all data required for DSpace into instance variables and to convert it into the structure the DSpace API expects, which is again stored in instance variables. It thus represents the DIP in a form that allows interaction with the DSpace API.
The module has further functions to perform the actual API calls such as create communities/collections/items/bundles/bitstreams in DSpace and create relationships.
Unexpected events in the API communication cause the module to stop processing the respective article and to try to remove all entities from DSpace that have already been created for that article. Failing that, it will stop flow execution.
DSpace API connection
The module uses the dspace_rest_client library. The connection itself is wrapped in a self-defined connectionHandler class, mainly to allow re-using the same connection across multiple articles, but also to re-authenticate in DSpace every few minutes so that API calls do not fail due to session termination.
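The wrapper can be sketched roughly as below. This is a minimal illustration, not the module's actual implementation: the class and attribute names are assumptions, and `client` stands for any object exposing an `authenticate()` method.

```python
import time

class ConnectionHandler:
    """Sketch of the connection wrapper described above: reuse one client
    across articles, but re-authenticate after a fixed interval so calls
    do not fail on expired sessions. Names are illustrative."""

    def __init__(self, client, reauth_seconds=300):
        self.client = client                  # e.g. a dspace_rest_client client
        self.reauth_seconds = reauth_seconds  # re-login interval
        self._last_auth = 0.0                 # epoch time of last login

    def get(self):
        # Log in again if the last authentication is too long ago.
        if time.time() - self._last_auth > self.reauth_seconds:
            self.client.authenticate()
            self._last_auth = time.time()
        return self.client
```

Callers fetch the client through `get()` before every batch of API calls, so re-authentication happens transparently.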
Processing steps
DIP Preparation and information extraction
- The module will work on a single folder in the DIP folder. If the DIP folder contains multiple folders, the module skips the article.
- The module will create a temporary unzipped copy of the DIP contents in the same folder where the zipped contents are
DIP preparation is mainly performed by the function prepare_dip_contents in dip_functions.py. It will call, in the correct order, all articleDIP methods that are necessary for loading DIP contents. The DIP path is determined beforehand by the function get_dip_folderpath from dip_functions.py
dip_functions.get_dip_folderpath(module_path)
Examines the DIP folder of the respective article and returns the path to the single folder inside it if there is exactly one folder, otherwise False.
Parameters
- module_path: The processing folder for the respective article (the temporary folder created by copying the contents from the previous module)
Return values
- processing as expected
- Path to DIP as string
- error during processing
- False
dip_functions.prepare_dip_contents(dip_path, dummy_file)
Parameters
- dip_path: Path on the file system to the DIP folder that will be loaded
- dummy_file: Path on the file system to a file that will be uploaded as dummy_file for the distribution that will later contain the converted document
Return values
- processing as expected
- the newly created articleDIP instance with all data loaded
- None
- error during processing
- None
- String containing error messages
Called articleDIP methods
- constructor __init__(self, dip_path):
- defines instance variables
- The class articleDIP is initialized directly for a specific DIP whose location is passed via dip_path
- load_metadata(self)
- Extracts metadata values from the DIP METS files and saves them as object variables
- create_hierarchy(self)
- Creates a tuple describing the article's position in the DSpace hierarchy, with the collection as the last tuple element, and saves it in an instance variable
- unpack_zip(self)
- Unpacks ZIP file contained in the DIP into the same folder
- create_dataset_metadata_dict(self):
- Sorts the metadata loaded from the DIP into the dict according to DSpace API format and saves the resulting dict as instance variable
- check_distribution_type_existence(self)
- Walks through a decision tree to decide which distributions should be created. No data may be left out, but reuse should be minimal and never occur for main texts
- creates a list of distributions with the structure shown in the addendum
- Compatible file formats to create distributions are defined in distribution_types.py
- create_distribution_metadata_dicts(self):
- Based on the metadata available in the DIP object, adds an item metadata dict for every type of distribution in the distribution list
- uses a generic method for all distributions (create_distribution_metadata_dict_generic(self)) and adds distribution-type specific data
- create_distribution_file_list(self)
- analyses the folder contents of each distribution and adds the list of files to be uploaded to each distribution
- create_bitstream_metadata(self)
- adds type information to all files to be uploaded as bitstreams
- create_tei_dummy_distribution(self, dummy_file):
- expands the distribution list with a "pre-reserved" distribution for the converted document so that rights and roles can already be adjusted
- registers the file from the path in the variable dummy_file to be uploaded as dummy bitstream
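The loading sequence performed by prepare_dip_contents can be sketched as below, following the method list above. The `articleDIP` class shown here is a no-op stand-in so the sketch runs on its own; the real class and the exact call order live in the module.

```python
class articleDIP:
    """No-op stand-in for the module's class, so this sketch is
    self-contained; it merely records which methods were invoked."""
    def __init__(self, dip_path):
        self.dip_path = dip_path
        self.calls = []
    def __getattr__(self, name):          # accept any method call
        self.calls.append(name)
        return lambda *args: None

def prepare_dip_contents(dip_path, dummy_file):
    """Sketch of the loading sequence; error handling is simplified.
    Returns (articleDIP instance, None) or (None, error string)."""
    try:
        dip = articleDIP(dip_path)
        dip.load_metadata()
        dip.create_hierarchy()
        dip.unpack_zip()
        dip.create_dataset_metadata_dict()
        dip.check_distribution_type_existence()
        dip.create_distribution_metadata_dicts()
        dip.create_distribution_file_list()
        dip.create_bitstream_metadata()
        dip.create_tei_dummy_distribution(dummy_file)
    except Exception as err:
        return None, str(err)
    return dip, None
```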
Check for previous article ingest
The module will check based on the toolbox ID (part of the metadata extracted during DIP preparation) if the article is already in DSpace. If it is, processing is stopped for this article by creating a stop-processing-folder and the module will continue with the next article.
item_functions.toolbox_id_already_in_dspace(connection, toolbox_id, catalog_top_uuid)
Parameters
- connection: a dspace_rest_python API connection
- toolbox_id: the toolbox ID of the current article
- catalog_top_uuid: The top uuid of the DSpace catalog/datasets community, used to distinguish between test and productive communities
Return values
- processing as expected
- False
- Empty list
- unexpected behaviour: previous ingest
- True
- List of DSpace objects
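The duplicate check might look like the following sketch. The query field name is hypothetical, and `connection` is assumed to expose a `search_objects()` call of this shape; the real function's query details may differ.

```python
def toolbox_id_already_in_dspace(connection, toolbox_id, catalog_top_uuid):
    """Sketch of the duplicate check described above. The search call
    shape and the query field name are assumptions."""
    hits = connection.search_objects(
        query=f"toolbox_id:{toolbox_id}",   # hypothetical query field
        scope=catalog_top_uuid)             # restrict to one community tree
    if hits:
        return True, list(hits)             # previous ingest found
    return False, []                        # article not yet in DSpace
```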
Creating and identifying DSpace communities and collections
Based on the information in the articleDIP instance (publisher, journal, volume, issue), the module decides into which collection to sort an article. First, it queries the DSpace API to check whether fitting communities and collections already exist. If not, they are created. Information on existing and newly created DSpace communities and collections is cached during a module run to reduce the number of API calls.
Data structure and cache
The hierarchy for an article is stored in a tuple with the structure (publisher, "Periodika", journal, volume), with volume being the collection. Information during each run is cached in a dict, separately for the catalog and the distributions hierarchy. The cache keys themselves are tuples of the following structure: (top_uuid, (part_of_hierarchy)), e.g. (top_uuid, (publisher, "Periodika", journal)) for the community of the respective journal.
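As a small, concrete illustration of this cache layout (all names and UUIDs are placeholders):

```python
# Illustrative cache content; all names and UUIDs are placeholders.
top_uuid = "0a1b2c3d"  # top community of the catalog hierarchy

hierarchy = ("Example Publisher", "Periodika", "Example Journal", "Vol. 12")

cache = {
    # community of the respective journal
    (top_uuid, hierarchy[:3]): "uuid-journal-community",
    # collection of the respective volume
    (top_uuid, hierarchy): "uuid-volume-collection",
}
```

Tuples are hashable, so each prefix of the hierarchy can serve directly as a dict key.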
hierarchy_functions.get_collection_uuid(connection, top_community_uuid, hierarchy, url)
Parses and creates the necessary hierarchy entries below the community top_community_uuid. The function is called separately for the catalog and the distribution collection.
Parameters
- connection: dspace_rest_client connection object
- top_community_uuid: uuid of community under which hierarchy is parsed (either the top community for catalog or distributions)
- hierarchy: list of element names in hierarchy, last being the collection name as created when preparing the articleDIP object
- url: base url of the DSpace REST-API
Return value
Returns uuid for the collection that is the last entry in hierarchy list
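The lookup-or-create walk through the hierarchy can be sketched as below. The real function talks to the REST API; here `lookup` and `create` are hypothetical stand-ins for those calls, and the explicit cache parameter makes the per-run caching visible.

```python
def get_collection_uuid(lookup, create, top_uuid, hierarchy, cache):
    """Illustrative hierarchy walk: resolve each level under its parent,
    communities first, the collection last. lookup(parent, name) returns
    an existing uuid or None; create(parent, name, is_collection) makes
    a new community/collection and returns its uuid. Both are stand-ins
    for the real API calls."""
    parent = top_uuid
    for depth, name in enumerate(hierarchy):
        key = (top_uuid, tuple(hierarchy[:depth + 1]))
        if key not in cache:
            is_collection = depth == len(hierarchy) - 1
            found = lookup(parent, name)
            cache[key] = found if found else create(parent, name, is_collection)
        parent = cache[key]
    return parent  # uuid of the collection (last hierarchy element)
```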
Item creation in DSpace
The module will create new items for datasets and distributions in the previously identified collections and, for distributions, also create bundles and upload files.
item_functions.create_dspace_objects(connection, dip, dataset_collection, distribution_collection, url)
This function makes the necessary API calls to create all DSpace objects needed for the articleDIP object and updates the articleDIP object with the information retrieved from the respective endpoints. It also hides the ulbxml dummy distribution by deleting all READ policies. Currently, no access rights information is provided for any distribution or bitstream, as only open access documents are handled.
Parameters
- connection: dspace_rest_client connection object
- dip: Instance of articleDIP containing all the necessary information for DSpace object creation (will be modified with information from the endpoints)
- dataset_collection: uuid of collection that will contain the dataset item
- distribution_collection: uuid of collection that will contain the distribution items
- url: base url of the DSpace REST-API
Return values
- processing as expected
- True
- String containing processing log
- Empty string
- error during processing
- False
- String containing processing log up to the point where the error occurred
- String containing error description
Entity crosslinking in DSpace
The script then crosslinks the dataset item with the distribution items by creating DSpace entity relationships.
item_functions.create_relationships(connection, dip, url)
Parameters
- connection: dspace_rest_client connection object
- dip: Instance of articleDIP containing all the necessary information for DSpace object creation
- url: base url of the DSpace REST-API
Return value
- processing as expected: True
- error during processing: False
Exception handling
For item creation, data upload and crosslinking, the script catches most exceptions and tries to handle them in a way that does not compromise DSpace data integrity and allows the same article to be ingested again without manual intervention. In most cases this means the module keeps track of newly created DSpace objects within the articleDIP object and attempts to delete those items (and thus also the attached data) if the article ingest cannot be finalized.
Issues with the deletion of DSpace objects due to unexpected behaviour will result in script termination.
item_functions.revert_item_creation(connection, dip, url):
Deletes newly created items. Needed in case some items were already created before an error occurred.
Parameters
- connection: dspace_rest_client connection object
- dip: articleDIP object whose created entities should be deleted
- url: base url of the DSpace REST-API
Return values
- processing as expected:
- True
- empty list
- error during processing
- False
- List of items which could not be deleted
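Putting the documented return values together, the overall ingest-or-revert flow might look like this sketch, where `create_fn` and `revert_fn` stand in for create_dspace_objects and revert_item_creation (called with their real arguments in the module):

```python
def ingest_or_revert(create_fn, revert_fn):
    """Sketch of the documented error-handling flow: on a failed ingest,
    delete the partially created items; if that also fails, stop the
    whole flow. create_fn/revert_fn are stand-ins for the real calls."""
    ok, log, error = create_fn()        # (bool, processing log, error string)
    if ok:
        return True                     # article fully ingested
    reverted, leftovers = revert_fn()   # (bool, items that resisted deletion)
    if not reverted:
        # unexpected deletion failure: terminate flow execution
        raise RuntimeError(f"could not delete: {leftovers}")
    return False                        # ingest reverted cleanly
```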
Processing documentation
The script writes one line to history.csv for each created object of the following types: dataset item, distribution item, bitstream. The input information always includes the archivematica DIP filename (which contains the AIP UUID) and, for bitstreams, a specific path within the DIP. The output includes the type of the created entity followed by the DSpace object UUID. A link to a previously created entity is specified before the entity type for distributions (dataset item UUID) and bitstreams (distribution item UUID). For distributions, the type is given in brackets as a semicolon-separated list.
Addendum
Structure of distribution list
{
"aux": [list of other distribution elements, currently just images],
"dso": DSO_object (after ingest) or None,
"dspace_metadata": { dict of DSpace metadata according to DSpace REST API format },
"files": [ list of files, see below ],
"main": "main type, could be pdf/supplierxml/ulbxml or others",
"uuid": DSpace_item_object UUID as string after ingest or None
}
Structure of file list element
{
"dso": dspace_rest_client Item Object after creation, otherwise None,
"dspace_metadata": { DSpace REST API format metadata (generated by articleDIP method) },
"function": file function, e.g. "main text" or "image",
"path": path on NAS,
"uuid": DSpace object UUID after creation, otherwise None
}
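A concrete instance of the two structures above might look like this; all paths, titles and type names are placeholders for illustration:

```python
# One file-list element, nested into one distribution-list element.
# All paths, titles and type names are placeholders.
main_pdf = {
    "dso": None,                        # filled after bitstream creation
    "dspace_metadata": {"metadata": {"dc.title": [{"value": "main.pdf"}]}},
    "function": "main text",
    "path": "/nas/example-dip/objects/main.pdf",
    "uuid": None,
}

pdf_distribution = {
    "aux": [],                          # e.g. image distribution elements
    "dso": None,                        # filled after item creation
    "dspace_metadata": {"metadata": {"dc.title": [{"value": "Example (PDF)"}]}},
    "files": [main_pdf],
    "main": "pdf",
    "uuid": None,
}
```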