OAI-PMH Harvester and LOM Parser

 

Author and version

Author: Paul Hollands - < p.j.hollands@ncl.ac.uk >

Modification date: 2005/04/29 14:21:41.134 GMT+1

Version: 0.3 - DRAFT

Introduction

As part of my work for the RDN / LTSN interoperability project I have been attempting to write an OAI-PMH harvester which would parse and upload data from the IEEE LOM IMS XML binding.

I started this work in Python but switched to Perl when I encountered problems with the XPath implementation in the Python libxml2 bindings.

I managed to develop a functioning harvester in Perl using the Net::OAI::Harvester and LWP::Simple modules as starting points. However, having really struggled to manipulate such a complex XML tree with Perl's baroque data structures, and having figured out why my XPath wasn't working in Python, I've now stopped development on the Perl version and switched back to Python. Both the Perl and the Python versions are described below.

The Python Harvester

What the harvester does and doesn't do

The harvester negotiates with an OAI repository, gets a list of identifiers, then iterates through the list doing a GetRecord request for each identifier and parsing the resulting XML. (The XML is also stored in the xml_store table for each GetRecord request.) The data is parsed out of the XML into a data structure using a series of XPath queries. This is then passed to a set of handler methods to be loaded into a staging database (ltsn01_harvester).
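The request loop described above can be sketched roughly as follows. This is only an illustration and not the code in harvester.py: the base URL, the lom metadata prefix and the function names are assumptions, and a canned ListIdentifiers response is parsed instead of fetching one over HTTP.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Canned response standing in for a real ListIdentifiers request.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:example.org:7</identifier><datestamp>2005-04-01</datestamp></header>
    <header><identifier>oai:example.org:8</identifier><datestamp>2005-04-02</datestamp></header>
  </ListIdentifiers>
</OAI-PMH>"""


def extract_identifiers(list_identifiers_xml):
    """Return the record identifiers found in a ListIdentifiers response."""
    root = ET.fromstring(list_identifiers_xml)
    nodes = root.findall(".//oai:header/oai:identifier", OAI_NS)
    return [node.text.strip() for node in nodes]


def get_record_url(base_url, identifier, metadata_prefix="lom"):
    """Build the GetRecord request URL for one identifier (prefix is assumed)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return base_url + "?" + query


identifiers = extract_identifiers(SAMPLE_RESPONSE)
# One GetRecord request per identifier; each response would then be stored and parsed.
record_urls = [get_record_url("http://example.org/oai", i) for i in identifiers]
```

In a real harvest each of these URLs would be fetched in turn and the response handed to the LOM parser.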

The harvester does not check at this stage whether each record is an update or a new addition. This is done later using the methods in the normalize_harvest.py module file. This is why the data is loaded into a staging area: the comparison is much easier once the data is in MySQL.

The harvester has two modes of operation corresponding to daily or full harvests. The type is signified as a command line argument passed to the Python script. For full harvests the ListIdentifiers request has no date constraint. For daily harvests the script grabs the last harvest date from the database and uses it as a from constraint on the harvest.
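As a rough sketch of how the two modes differ, the ListIdentifiers arguments might be built like this. The function name and the lom metadata prefix are assumptions for illustration; in the real script the last harvest date comes from the database, here it is simply passed in.

```python
def list_identifiers_params(harvest_type, last_harvest_date=None, metadata_prefix="lom"):
    """Return the OAI-PMH arguments for a ListIdentifiers request.

    A full harvest carries no date constraint; a daily harvest uses the
    date of the previous harvest as the 'from' argument.
    """
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    if harvest_type == "daily":
        if last_harvest_date is None:
            raise ValueError("a daily harvest needs the date of the previous harvest")
        params["from"] = last_harvest_date
    elif harvest_type != "full":
        raise ValueError("harvest type must be 'daily' or 'full'")
    return params


full_params = list_identifiers_params("full")
daily_params = list_identifiers_params("daily", "2005-04-28")
```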

The harvester.py script should be run on the command line as follows:

  • $ python harvester.py LTSN-01 full

The two command line arguments are the handle of the repository you wish to harvest from (as defined in the repository_definitions.ini configuration file) and the type of harvest, either daily or full.
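A minimal sketch of the argument handling, assuming the configuration file has one section per repository handle. The option names base_url and metadata_prefix are hypothetical, as the actual layout of repository_definitions.ini is not shown here, and the file contents are read from a string to keep the example self-contained.

```python
import configparser

# Hypothetical contents of repository_definitions.ini.
SAMPLE_INI = """
[LTSN-01]
base_url = http://example.org/oai
metadata_prefix = lom
"""


def parse_arguments(argv):
    """Return (repository handle, harvest type) from the command line."""
    if len(argv) != 3 or argv[2] not in ("daily", "full"):
        raise SystemExit("usage: python harvester.py <repository-handle> <daily|full>")
    return argv[1], argv[2]


def load_repository(handle, ini_text):
    """Look up the settings for one repository handle."""
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    if not config.has_section(handle):
        raise KeyError("no repository definition for handle %r" % handle)
    return dict(config[handle])


handle, harvest_type = parse_arguments(["harvester.py", "LTSN-01", "full"])
repository = load_repository(handle, SAMPLE_INI)
```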

In addition to these two harvest types, the script always issues a separate ListIdentifiers request without a date constraint. The resulting list is concatenated into a single string with :: as the separator and stored in the harvests table. This ensures that any deleted records can be determined by comparing the current tally of all the identifiers with the results from the previous harvest. Again this is handled by a function in the normalize_harvest.py module file.
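The comparison of tallies could look something like the sketch below. The function names are illustrative only and are not taken from normalize_harvest.py; the sketch also assumes that no identifier itself contains the :: separator.

```python
SEPARATOR = "::"


def serialize_identifiers(identifiers):
    """Concatenate a list of identifiers into the single string that is stored."""
    return SEPARATOR.join(identifiers)


def find_deleted(previous_tally, current_tally):
    """Return identifiers present in the previous harvest but missing now."""
    previous = previous_tally.split(SEPARATOR) if previous_tally else []
    current = set(current_tally.split(SEPARATOR)) if current_tally else set()
    return [identifier for identifier in previous if identifier not in current]


previous_tally = serialize_identifiers(
    ["oai:example.org:7", "oai:example.org:8", "oai:example.org:9"]
)
current_tally = serialize_identifiers(["oai:example.org:7", "oai:example.org:9"])
deleted = find_deleted(previous_tally, current_tally)
```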

The Python LOM Parser

The Python parser works in a similar way to the Perl version in that it pulls a list of XPath statements from a database table and uses these to parse out individual sections of the LOM XML binding. There are in fact three parser functions defined in the script, one for the main bulk of the LOM XML binding and one each for the about and classification sections respectively. The about section is heavily namespaced and needs different treatment and the classification section is very nested and needs to be broken out with separate parsing runs to set narrower and narrower contexts.
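To give a feel for the map-driven approach, here is a simplified sketch of the main parser. It is not the real parser: it uses the standard library's xml.etree.ElementTree rather than libxml2, so only a limited XPath subset is available and the attribute step is handled by hand. The XPath list stands in for the rows of the database table, and the sample record is deliberately not namespaced and has lom as its root element.

```python
import xml.etree.ElementTree as ET

# Stand-in for the XPath statements held in the database table.
XPATH_MAP = [
    "//lom/general/title/string",
    "//lom/general/title/string/@language",
    "//lom/general/keyword/string",
]

SAMPLE_LOM = """<lom>
  <general>
    <title><string language="en-GB">A case of chest pain</string></title>
    <keyword><string language="en">Unstable angina</string></keyword>
    <keyword><string language="en">Chest pain</string></keyword>
  </general>
</lom>"""


def key_for(xpath):
    """Turn an XPath statement into a dictionary key."""
    return xpath.lstrip("/").replace("/@", "__").replace("/", "_")


def parse_record(xml_text, xpath_map):
    """Run every XPath in the map and collect all matches for each key in a list."""
    root = ET.fromstring(xml_text)
    record = {}
    for xpath in xpath_map:
        element_path, _, attribute = xpath.partition("/@")
        # Drop the leading '//' and the root element name ('lom').
        steps = element_path.lstrip("/").split("/")[1:]
        matches = root.findall("/".join(steps))
        if attribute:
            values = [m.get(attribute) for m in matches if m.get(attribute) is not None]
        else:
            values = [(m.text or "").strip() for m in matches]
        if values:
            record[key_for(xpath)] = values
    return record


record = parse_record(SAMPLE_LOM, XPATH_MAP)
```

Each key ends up holding a list, so repeating elements such as keywords need no special treatment.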

What the parsers return

Again the result of the harvest is tokenized into a Python data structure as follows:

  • a harvest list of records is produced
  • each record is a dictionary (hash) with keys named in the same way as those produced by the Perl scripts described below, except that there is no version information encoded in the keys (no -v0, -v1, -v2 etc). This is because instead of containing a single value for each key, the dictionaries contain a list to accommodate repeating values, giving entries such as:

lom_general_keyword_string: [ "Unstable angina","Chest pain" ]

lom_general_title_string__language: [ "en-GB" ]

See the following for a full example of what the main parser produces:

What the classification parser returns

The XML binding for the classification section is much more nested than other areas of the LOM. The parser needs to take account of the following:

  • each record can have multiple classification entries (characterised by one purpose container each)
  • each classification section can hold multiple taxonPaths which in turn can contain multiple taxon nodes

Because of this the parser has to make three XPath parse passes for each classification node. This is to ensure that the right taxon nodes appear in the parsed data structure still in their correct containers.

This could probably be done by manipulating XPath contexts in libxml2. However, I couldn't find any documentation on how to do this, so instead I grab each classification node as a list, then iterate through them, and for each one re-serialize it back into a new XML document and run further XPath on that. I then do the same for each taxonPath node within these classification documents.
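The three passes can be sketched as below. Again this is an illustration under assumptions rather than the actual parser: it uses xml.etree.ElementTree instead of libxml2, the sample record is not namespaced, and the taxonomy and its ids are made up. With ElementTree the relative queries could be run on each node directly, but the sketch re-serializes each node to mirror the approach described above.

```python
import xml.etree.ElementTree as ET

SAMPLE_LOM = """<lom>
  <classification>
    <purpose><source>LOMv1.0</source><value>discipline</value></purpose>
    <taxonPath>
      <source><string language="en">Example scheme</string></source>
      <taxon><id>A</id><entry><string language="en">Medicine</string></entry></taxon>
      <taxon><id>A.1</id><entry><string language="en">Cardiology</string></entry></taxon>
    </taxonPath>
  </classification>
</lom>"""


def reparse(node):
    """Re-serialize a node and parse it again as a standalone document."""
    return ET.fromstring(ET.tostring(node))


def parse_classifications(lom_xml):
    """Return the classification nodes with taxons kept inside their taxonPaths."""
    root = ET.fromstring(lom_xml)
    classifications = []
    for class_node in root.findall("classification"):      # pass 1
        class_doc = reparse(class_node)
        entry = {
            "purpose": class_doc.findtext("purpose/value", default=""),
            "taxon_paths": [],
        }
        for path_node in class_doc.findall("taxonPath"):    # pass 2
            path_doc = reparse(path_node)
            taxons = [
                {
                    "id": taxon.findtext("id", default=""),
                    "entry": taxon.findtext("entry/string", default=""),
                }
                for taxon in path_doc.findall("taxon")      # pass 3
            ]
            entry["taxon_paths"].append({
                "source": path_doc.findtext("source/string", default=""),
                "taxons": taxons,
            })
        classifications.append(entry)
    return classifications


classifications = parse_classifications(SAMPLE_LOM)
```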

The parser then produces an exact replica of the classification nodes as outlined in the following example:

The storage of the classification data

Storing the classification section once parsed in a relational database is problematic for the same reasons that it is problematic to parse. If you can guarantee that there will only ever be one taxon node in each taxonPath then this is not too big a deal as the relationships are pretty much one to one and a simple table structure like that in the following example will work:

Classification purpose and source information are duplicated and associated with each taxon. Further there is no representation of separate taxonPaths (since each taxonPath only contains one taxon node such a representation is redundant).

If there's more than one taxon node, however, the taxonPath representation becomes vital; otherwise there is no way to tell which taxon belongs in which path.

Defining a suitable database structure in these circumstances becomes very difficult. I have therefore taken a pragmatic approach. Taxons are concatenated into a single string using the :: separator and inserted into the class_taxon_entry column in the existing table. This means that each row now represents a taxonPath node.

Developers will have to split the string back out into a list for display purposes but this is not onerous.
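A small sketch of the round trip, assuming one row per taxonPath. Only the class_taxon_entry column name comes from the description above; the other column names, the function names and the sample taxonomy are hypothetical.

```python
SEPARATOR = "::"


def taxon_path_rows(record_id, classification):
    """Build one row per taxonPath, with its taxons joined into a single string."""
    rows = []
    for path in classification["taxon_paths"]:
        rows.append({
            "record_id": record_id,                         # hypothetical column
            "class_purpose": classification["purpose"],     # hypothetical column
            "class_source": path["source"],                 # hypothetical column
            "class_taxon_entry": SEPARATOR.join(path["taxons"]),
        })
    return rows


def split_taxons(class_taxon_entry):
    """Split a stored class_taxon_entry value back into a list for display."""
    return class_taxon_entry.split(SEPARATOR) if class_taxon_entry else []


classification = {
    "purpose": "discipline",
    "taxon_paths": [
        {"source": "Example scheme", "taxons": ["Medicine", "Cardiology"]},
        {"source": "Example scheme", "taxons": ["Medicine", "Emergency care"]},
    ],
}
rows = taxon_path_rows("oai:example.org:7", classification)
display_list = split_taxons(rows[0]["class_taxon_entry"])
```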

The LOM marshalling handlers

Once parsed, the data is loaded into the MySQL database by a number of handler functions defined in the lom_handlers.py script.

Each handler is passed the record dictionary and builds and runs either MySQL insert or update statements to upload the data into the database.
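A handler along these lines might look like the following sketch. It is not taken from lom_handlers.py: the table and column names are made up, and an in-memory SQLite database stands in for MySQL so the example runs on its own. A MySQL driver would typically use %s rather than ? as the placeholder.

```python
import sqlite3


def handle_general_title(connection, record_id, record):
    """Insert one row per title string found in the parsed record dictionary."""
    titles = record.get("lom_general_title_string", [])
    languages = record.get("lom_general_title_string__language", [])
    for position, title in enumerate(titles):
        # Pair each title with the language at the same position, if there is one.
        language = languages[position] if position < len(languages) else None
        connection.execute(
            "INSERT INTO general_title (record_id, title, language) VALUES (?, ?, ?)",
            (record_id, title, language),
        )
    connection.commit()


# Stand-in for the staging database.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE general_title (record_id TEXT, title TEXT, language TEXT)"
)
handle_general_title(connection, "oai:example.org:7", {
    "lom_general_title_string": ["A case of chest pain"],
    "lom_general_title_string__language": ["en-GB"],
})
stored = connection.execute(
    "SELECT record_id, title, language FROM general_title"
).fetchall()
```

Passing the values as parameters rather than building the SQL string by hand avoids quoting problems with titles that contain apostrophes.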

Post-harvest data normalization

Data is only retained in the ltsn01_harvester database until the next harvest. The ltsn01_oai_live database is intended to be the main data store for harvested records. The normalize_harvest.py script contains functions for managing new and updated records in the live database and also record deletions post-harvest. Records from multiple target repositories can be stored and managed using these functions, which are called from the harvester.py script as the final part of the harvest process.
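One way the new and updated records could be told apart is sketched below. How normalize_harvest.py actually makes the comparison is not described here, so this is purely an assumption: both databases are represented as dictionaries mapping an identifier to the stored XML, and a record counts as updated when its XML differs from the live copy.

```python
def classify_staged_records(staged, live):
    """Split staged records into new and updated identifiers.

    Both arguments map an OAI identifier to the stored XML for that record.
    """
    new, updated = [], []
    for identifier, xml_text in staged.items():
        if identifier not in live:
            new.append(identifier)
        elif live[identifier] != xml_text:
            updated.append(identifier)
    return new, updated


staged = {
    "oai:example.org:7": "<lom>first version</lom>",
    "oai:example.org:8": "<lom>amended</lom>",
    "oai:example.org:9": "<lom>unchanged</lom>",
}
live = {
    "oai:example.org:8": "<lom>original</lom>",
    "oai:example.org:9": "<lom>unchanged</lom>",
}
new_records, updated_records = classify_staged_records(staged, live)
```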

What is required to run the Python harvester

N.B. Future updates for these scripts will be available through the LTSN-01 CVS server.

--------------------

The Perl harvester

What the harvester does and doesn't do

N.B. This was work in progress, but is no longer in progress since I've switched to Python. Please feel free to finish it :)

The harvester negotiates with an OAI repository, gets a list of identifiers, then iterates through the list doing a GetRecord request for each identifier and parsing the resulting XML. (The XML is also stored in the xml_store table for each GetRecord request.) The data is parsed out of the XML into a data structure using a series of XPath queries. This is then passed to a set of handler methods to be loaded into a staging database.

The harvester does not do any checks to see whether each record is an update or a new addition. This is why the data is loaded into a staging area, as this comparison would be much easier once the data is in MySQL.

Functionality that needs to be added

  • The list of XPath expressions stored in the xpath_map table of the ltsn01_harvester database needs to be completed so the full LOM XML binding is parsed (or those bits you want from it)
  • Some sort of recursive parser needs to be included to deal effectively with taxonPaths
  • There's definitely some redundant code in the Perl parser after the parseXML method returns a list of nodes and this needs to be tidied up
  • There are only handler scripts to deal with general.title nodes, other handlers will need to be written
  • Some sort of tally of when harvests were performed will need to be kept so that the next harvest only grabs records amended since the last time
    • This could probably be inferred from the time stamps in the xml_store table, which is where copies of the XML for each GetRecord response are kept

What the parser returns

The parseXML method returns a list which should be hashed into key / value pairs in the following format:

lom_general_description_string-v0: Interactive case based patient simulation of chest pain in a 63 year old male.

lom_general_identifier_catalog-v0: URI

lom_general_identifier_entry-v0: http://purl.org/poi/ltsn-01.ac.uk/7

lom_general_keyword_string-v0: Unstable angina

lom_general_keyword_string-v1: Chest pain

lom_general_language-v0: en-US

lom_general_title_string-v0: Acute coronary syndromes interactive case : 'A case of chest pain in a 63 year old male'

lom_general_title_string__language-v0: en-GB

Note that keys that arise from an XPath expression which selects an attribute, such as //lom/general/title/string/@language, get two underscores before the attribute name, i.e. lom_general_title_string__language-v0

Note also the variant numbering convention employed such that lom_general_title_string__language-v1 is the language code for the title string in lom_general_title_string-v1 and so forth.
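The naming convention can be illustrated with a couple of small functions. These are written in Python for consistency with the rest of this document and are not a translation of the Perl code; the function names are made up.

```python
def key_for(xpath):
    """Derive the base key from an XPath expression.

    Path steps are joined with single underscores; an attribute step
    ('/@name') is joined with two underscores.
    """
    return xpath.lstrip("/").replace("/@", "__").replace("/", "_")


def with_variants(key, values):
    """Number repeated values for one key as -v0, -v1, -v2 and so on."""
    return {"%s-v%d" % (key, number): value for number, value in enumerate(values)}


title_language_key = key_for("//lom/general/title/string/@language")
keywords = with_variants(
    key_for("//lom/general/keyword/string"),
    ["Unstable angina", "Chest pain"],
)
```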

What is required to run the Perl harvester

  • Perl and MySQL
  • the following modules available from CPAN:
    • Net::OAI::Harvester
    • LWP::Simple
    • DBI
    • DBD::mysql
    • XML::XPath
    • Data::Dumper for debugging
  • You are free to use our database structure for your staging area should you wish. Note the harvester requires the xpath_map and xml_store tables
  • The harvester consists of the following files
