OAI-PMH Harvester and LOM Parser
Author and version
Author: Paul Hollands - < p.j.hollands@ncl.ac.uk >
Modification date: 2005/04/29 14:21:41.134 GMT+1
Version: 0.3 - DRAFT
Introduction
As part of my work for the RDN / LTSN interoperability project I have been attempting to write an
OAI-PMH harvester which would parse
and upload data from the IEEE LOM IMS XML binding.
I started this work in Python but switched to Perl when I encountered problems with the XPath
implementation in the Python libxml2
bindings.
I managed to develop a functioning harvester in Perl using the Net::OAI::Harvester
and LWP::Simple
modules as starting points. Having really struggled to manipulate such a complex XML tree with Perl's
baroque data structures, and having figured out why my XPath wasn't working in Python, I've now stopped
development on the Perl version and switched back to Python. Both the Perl
and the Python versions are described below.
What the harvester does and doesn't do
The harvester negotiates with an OAI repository, gets a list of identifiers, then iterates through
the list doing a GetRecord request for each identifier and then parses the resulting XML. (The XML
is also stored in the xml_store table for each GetRecord request). The data
is then parsed out of the XML using a series of XPath queries into a data structure. This is then passed
to a set of handler methods to be loaded into a staging database (ltsn01_harvester).
The harvester does not do any checks to see whether each record is an update or a new addition at
this stage. This is done later using the methods in the normalize_harvest.py module file.
This is why the data is loaded into a staging area, as this comparison is much easier once
the data is in MySQL.
The harvester has two modes of operation corresponding to daily or full harvests.
The type is signified as a command line argument passed to the Python script. For full harvests the ListIdentifiers request
has no date constraint. For daily harvests the script grabs the last harvest date from the database and uses it as a from constraint
on the harvest.
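The request pattern described above can be sketched as follows. This is a minimal illustration rather than the code in harvester.py: the function names, the example base URL and the "lom" metadata prefix are all assumptions, and only the standard OAI-PMH verbs and arguments (ListIdentifiers, GetRecord, metadataPrefix, from, identifier) are taken from the protocol itself.

```python
from urllib.parse import urlencode


def list_identifiers_url(base_url, metadata_prefix, harvest_type, last_harvest=None):
    """Build a ListIdentifiers request URL for a 'full' or 'daily' harvest.

    Hypothetical helper: a full harvest has no date constraint, a daily
    harvest adds the last harvest date as the OAI-PMH 'from' argument.
    """
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    if harvest_type == "daily":
        if last_harvest is None:
            raise ValueError("a daily harvest needs the last harvest date")
        params["from"] = last_harvest
    elif harvest_type != "full":
        raise ValueError("harvest type must be 'daily' or 'full'")
    return base_url + "?" + urlencode(params)


def get_record_url(base_url, metadata_prefix, identifier):
    """Build the GetRecord request URL for a single identifier."""
    params = {
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    }
    return base_url + "?" + urlencode(params)


# Example values only; the real base URL comes from the repository configuration.
BASE_URL = "http://example.org/oai"
full_url = list_identifiers_url(BASE_URL, "lom", "full")
daily_url = list_identifiers_url(BASE_URL, "lom", "daily", "2005-04-28")
record_url = get_record_url(BASE_URL, "lom", "oai:example.org:7")
```

Keeping the date constraint as the only difference between the two modes means the rest of the harvest loop can stay identical for daily and full runs.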
The harvester.py script should be run on the command line as follows:
$ python harvester.py LTSN-01 full
The two command line arguments refer to the handle of the repository you wish to harvest from (as defined
in the repository_definitions.ini configuration file) and the type of harvest e.g. daily or full.
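As a rough sketch of how the two arguments might be validated against the configuration file: the layout of repository_definitions.ini is not documented here, so the section and option names below (base_url, metadata_prefix) are assumptions, as is the function name. The example uses the Python 3 configparser module, which is the renamed successor to the ConfigParser module.

```python
import configparser

# Hypothetical contents for repository_definitions.ini; the real option
# names may well differ.
SAMPLE_INI = """
[LTSN-01]
base_url = http://example.org/oai
metadata_prefix = lom
"""


def read_harvest_args(argv, ini_text):
    """Return (handle, harvest_type, settings) or exit with a usage message."""
    if len(argv) != 3:
        raise SystemExit("usage: python harvester.py <repository-handle> <daily|full>")
    handle, harvest_type = argv[1], argv[2]
    if harvest_type not in ("daily", "full"):
        raise SystemExit("harvest type must be 'daily' or 'full'")
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    if not config.has_section(handle):
        raise SystemExit("unknown repository handle: " + handle)
    return handle, harvest_type, dict(config[handle])


handle, harvest_type, settings = read_harvest_args(
    ["harvester.py", "LTSN-01", "full"], SAMPLE_INI
)
```

In a real script the argument list would be sys.argv and the configuration would be read from the file with config.read() rather than from a string.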
In addition to the daily or full harvest, the script always issues a separate ListIdentifiers request without a date constraint. The resulting
list is concatenated into a single string with :: as separators and stored in the harvests table. This is to ensure that any deleted records
can be determined by comparing the current tally of all the identifiers with the results from the previous harvest.
Again this is handled by a function in the normalize_harvest.py module file.
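The comparison can be illustrated with a short sketch. It is not the function from normalize_harvest.py; the names are hypothetical, and it assumes that the :: separator never occurs inside an identifier (OAI identifiers normally contain single colons only).

```python
SEPARATOR = "::"


def join_identifiers(identifiers):
    """Concatenate identifiers into the single string stored per harvest."""
    return SEPARATOR.join(identifiers)


def deleted_identifiers(previous_tally, current_tally):
    """Return identifiers present in the previous tally but missing now."""
    previous = previous_tally.split(SEPARATOR) if previous_tally else []
    current = set(current_tally.split(SEPARATOR)) if current_tally else set()
    return [identifier for identifier in previous if identifier not in current]


# Example tallies from two consecutive harvests (made-up identifiers).
previous = join_identifiers(["oai:example.org:5", "oai:example.org:6", "oai:example.org:7"])
current = join_identifiers(["oai:example.org:5", "oai:example.org:7", "oai:example.org:8"])
deleted = deleted_identifiers(previous, current)
```

Here only the record that disappeared between the two tallies is reported; the newly added identifier is picked up by the normal harvest instead.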
The Python LOM Parser
The Python parser works in a similar way to the Perl version in that it pulls a list of XPath statements from a database table and uses these to parse out individual sections
of the LOM XML binding. There are in fact three parser functions defined in the script, one for the main bulk of the LOM XML binding and one each for the about
and classification sections respectively. The about section is heavily namespaced and needs different treatment and the classification
section is very nested and needs to be broken out with separate parsing runs to set narrower and narrower contexts.
What the parsers return
Again the result of the harvest is tokenized into a Python data structure as follows:
- a harvest list of records is produced
- each record is a dictionary (hash) with keys named in the same way as those produced by the Perl
scripts described below, with the exception that there is no version information encoded in the keys
(no -v0, -v1, -v2 etc).
This is because instead of containing a single value for each key, the dictionaries contain a
list to accommodate repeating values, giving entries such as:
lom_general_keyword_string: [ "Unstable angina","Chest pain" ]
lom_general_title_string__language: [ "en-GB" ]
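A minimal sketch of how a list of XPath statements can be turned into such a dictionary of lists. It is an illustration under several assumptions: it uses the standard library's xml.etree.ElementTree instead of the libxml2 bindings, the sample XML is a simplified fragment without namespaces, the XPath list is hard-coded rather than read from the xpath_map table, and the function names are invented for the example.

```python
import xml.etree.ElementTree as ET


def xpath_to_key(xpath):
    """Derive the dictionary key: '/' becomes '_' and '/@' becomes '__'."""
    return xpath.lstrip("/").replace("/@", "__").replace("/", "_")


def parse_record(lom_xml, xpaths):
    """Return {key: [values]} for every XPath that matches something."""
    # Wrap the document so that './/lom/...' can also match the root element.
    wrapper = ET.Element("wrapper")
    wrapper.append(ET.fromstring(lom_xml))
    record = {}
    for xpath in xpaths:
        # ElementTree cannot select attributes, so split off a trailing '/@name'.
        path, _, attribute = xpath.partition("/@")
        nodes = wrapper.findall("." + path)
        if attribute:
            values = [n.get(attribute) for n in nodes if n.get(attribute) is not None]
        else:
            values = [(n.text or "").strip() for n in nodes]
        if values:
            record[xpath_to_key(xpath)] = values
    return record


# Simplified, namespace-free sample; a real LOM record is namespaced.
SAMPLE = """<lom><general>
  <title><string language="en-GB">A case of chest pain</string></title>
  <keyword><string>Unstable angina</string></keyword>
  <keyword><string>Chest pain</string></keyword>
</general></lom>"""

XPATHS = [
    "//lom/general/title/string",
    "//lom/general/title/string/@language",
    "//lom/general/keyword/string",
]

record = parse_record(SAMPLE, XPATHS)
```

Because every value is a list, repeated elements such as keyword need no special handling and a single-valued element is simply a list of length one.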
See the following for a full example of what the main parser produces:
What the classification parser returns
The XML binding for the classification section is much more nested than other
areas of the LOM. The parser needs to take account of the following:
- each record can have multiple classification entries (characterised by one purpose
container each)
- each classification section can hold multiple taxonPaths which in turn can contain
multiple taxon nodes
Because of this the parser has to make three XPath parse passes for each classification node. This is to ensure that the right
taxon nodes appear in the parsed data structure still in their correct containers.
This could probably be done by manipulating XPath contexts in libxml2. However, I couldn't find any
documentation on how to do this, so instead I grab the classification nodes as a list,
then iterate through them, and for each, re-serialize it back into a new XML document and run further
XPath on that. I then do the same for each taxonPath node within these classification
documents.
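The re-serialize-and-reparse approach can be sketched roughly as below. This is only an illustration: it uses xml.etree.ElementTree as a stand-in for the libxml2 bindings, the sample XML is a simplified, namespace-free fragment, and the function name and the shape of the returned structure are assumptions.

```python
import xml.etree.ElementTree as ET


def parse_classifications(lom_xml):
    """Return one entry per classification, keeping taxons inside their paths."""
    root = ET.fromstring(lom_xml)
    result = []
    # Pass 1: grab every classification node under the lom root.
    for classification in root.findall("classification"):
        # Re-serialize the node and parse it as a document of its own.
        class_doc = ET.fromstring(ET.tostring(classification))
        entry = {"purpose": class_doc.findtext("purpose/value"), "taxon_paths": []}
        # Pass 2: the same treatment for each taxonPath in this classification.
        for taxon_path in class_doc.findall("taxonPath"):
            path_doc = ET.fromstring(ET.tostring(taxon_path))
            # Pass 3: the taxon entries that belong to this path only.
            taxons = [t.findtext("entry/string") for t in path_doc.findall("taxon")]
            entry["taxon_paths"].append(
                {"source": path_doc.findtext("source/string"), "taxons": taxons}
            )
        result.append(entry)
    return result


# Simplified sample with one classification, two taxonPaths.
SAMPLE = (
    "<lom><classification>"
    "<purpose><source>LOMv1.0</source><value>discipline</value></purpose>"
    "<taxonPath><source><string>MeSH</string></source>"
    "<taxon><id>1</id><entry><string>Diseases</string></entry></taxon>"
    "<taxon><id>2</id><entry><string>Angina, Unstable</string></entry></taxon>"
    "</taxonPath>"
    "<taxonPath><source><string>Local</string></source>"
    "<taxon><id>9</id><entry><string>Cardiology</string></entry></taxon>"
    "</taxonPath>"
    "</classification></lom>"
)

classifications = parse_classifications(SAMPLE)
```

Each pass works on a small standalone document, so a relative query such as "taxon" can never pick up nodes that belong to a different taxonPath or classification.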
The parser then produces an exact replica of the classification nodes as outlined in the
following example:
The storage of the classification data
Storing the classification section once parsed in a relational database is
problematic for the same reasons that it is problematic to parse. If you can guarantee that there will
only ever be one taxon node in each taxonPath then this is not
too big a deal as the relationships are pretty much one to one and a simple table structure like
that in the following example will work:
Classification purpose and source information are duplicated and associated with each taxon. Further there is no
representation of separate taxonPaths (since each taxonPath only contains
one taxon node such a representation is redundant).
If there's more than one taxon node, however, the taxonPath representation
becomes vital, otherwise there is no way to tell which taxon belongs in what path.
Defining a suitable database structure in these circumstances becomes very difficult, however. I
have therefore taken a pragmatic approach. Taxons are concatenated into a single string using the ::
separator and inserted into the class_taxon_entry column in the existing table. This
means that each row now represents a taxonPath node.
Developers will have to split the string back out into a list for display purposes but this is not onerous.
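A short sketch of the round trip, assuming that no taxon entry itself contains the :: separator. Only the class_taxon_entry column name comes from the description above; the other column names and the function names are hypothetical.

```python
SEPARATOR = "::"


def taxon_path_row(purpose, source, taxon_entries):
    """Build one row per taxonPath, with its taxons joined into one string."""
    return {
        "class_purpose": purpose,  # hypothetical column name
        "class_source": source,    # hypothetical column name
        "class_taxon_entry": SEPARATOR.join(taxon_entries),
    }


def split_taxon_entry(class_taxon_entry):
    """Split the stored string back into a list of taxons for display."""
    return class_taxon_entry.split(SEPARATOR) if class_taxon_entry else []


row = taxon_path_row("discipline", "MeSH", ["Diseases", "Angina, Unstable"])
taxons = split_taxon_entry(row["class_taxon_entry"])
```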
The LOM marshalling handlers
The data, once parsed, is loaded into the MySQL database by a number of handler functions defined
in the lom_handlers.py script.
Each handler is passed the record dictionary and builds and runs either MySQL insert or update statements
to upload the data into the database.
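To give an idea of the shape of such a handler, here is a hedged sketch for the title fields. It is not taken from lom_handlers.py: the general_title table, its columns and the function name are invented, and sqlite3 stands in for MySQL so that the example is self-contained (the MySQL Python modules use %s placeholders where sqlite3 uses ?).

```python
import sqlite3


def handle_general_title(connection, record_id, record):
    """Insert one row per title string found in the parsed record dictionary."""
    titles = record.get("lom_general_title_string", [])
    languages = record.get("lom_general_title_string__language", [])
    rows = []
    for index, title in enumerate(titles):
        # A title may lack a language attribute, so fall back to NULL.
        language = languages[index] if index < len(languages) else None
        rows.append((record_id, title, language))
    # Parameterized statement: values are never pasted into the SQL text.
    connection.executemany(
        "INSERT INTO general_title (record_id, title, language) VALUES (?, ?, ?)",
        rows,
    )
    return len(rows)


# In-memory stand-in for the staging database, with a hypothetical table.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE general_title (record_id TEXT, title TEXT, language TEXT)"
)
inserted = handle_general_title(
    connection,
    "oai:example.org:7",
    {
        "lom_general_title_string": ["A case of chest pain"],
        "lom_general_title_string__language": ["en-GB"],
    },
)
stored = connection.execute("SELECT title, language FROM general_title").fetchall()
```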
Post-harvest data normalization
Data is only retained in the ltsn01_harvester database until the next harvest. The
ltsn01_oai_live database is intended to be the main data store for harvested records.
The normalize_harvest.py script contains functions for managing new and updated records
in the live database and also record deletions post-harvest. Records from multiple target repositories
can be stored and managed using these functions, and this is done by calling these functions from the
harvester.py script as the final part of the harvest process.
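The kind of decision the normalization step has to make can be sketched as follows. How normalize_harvest.py actually tells a new record from an updated one is not described here, so comparing the stored XML by identifier is purely an assumption for illustration, as are the function name and the dictionaries standing in for the two databases.

```python
def plan_normalization(staged, live):
    """Classify staged records against the live store by identifier.

    Both arguments map identifier -> stored XML (an assumed representation).
    """
    plan = {"new": [], "updated": [], "unchanged": []}
    for identifier, staged_xml in staged.items():
        if identifier not in live:
            plan["new"].append(identifier)
        elif live[identifier] != staged_xml:
            plan["updated"].append(identifier)
        else:
            plan["unchanged"].append(identifier)
    return plan


# Made-up contents for the staging and live stores.
staged = {
    "oai:example.org:5": "<lom>same</lom>",
    "oai:example.org:7": "<lom>revised</lom>",
    "oai:example.org:8": "<lom>brand new</lom>",
}
live = {
    "oai:example.org:5": "<lom>same</lom>",
    "oai:example.org:7": "<lom>original</lom>",
}
plan = plan_normalization(staged, live)
```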
What is required to run the Python harvester
- Python and MySQL
- the OAI-PMH module
- libxml2 libraries and Python bindings from xmlsoft.org website
- the ConfigParser module
- the MySQL Python modules
- You will need to use our database structures for your staging area and live repository. Note the harvester
requires the xpath_map, harvests and xml_store tables
- The harvester consists of the following files (* marks a new or updated file - last update 2004-05-14)
N.B. Future updates for these scripts will be available through the LTSN-01 CVS server.
--------------------
What the Perl harvester does and doesn't do
N.B. This is work in progress (no longer in progress since I've switched
to Python. Please feel free to finish it :) )
The harvester negotiates with an OAI repository, gets a list of identifiers, then iterates through
the list doing a GetRecord request for each identifier and then parses the resulting XML. (The XML
is also stored in the xml_store table for each GetRecord request). The data
is then parsed out of the XML using a series of XPath queries into a data structure. This is then passed
to a set of handler methods to be loaded into a staging database.
The harvester does not do any checks to see whether each record is an update or a new addition.
This is why the data is loaded into a staging area, as this comparison would be much easier once
the data is in MySQL.
Functionality that needs to be added
- The list of XPath expressions stored in the xpath_map table of the ltsn01_harvester database
needs to be completed so the full LOM XML binding is parsed (or those bits you want from it)
- Some sort of recursive parser needs to be included to deal effectively with taxonpaths
- There's definitely some redundant code in the Perl parser after the parseXML method returns a
list of nodes and this needs to be tidied up
- There are only handler scripts to deal with general.title nodes; other handlers
will need to be written
- Some sort of tally of when harvests were performed will need to be kept so that the next
harvest only grabs records amended since the last time
- This could probably be inferred from the time stamps in the xml_store table,
which is where copies of the XML for each GetRecord response are kept
What the parser returns
The parseXML method returns a list which should be hashed into key / value pairs in
the following format:
lom_general_description_string-v0: Interactive case based patient simulation of chest pain in a 63 year old male.
lom_general_identifier_catalog-v0: URI
lom_general_identifier_entry-v0: http://purl.org/poi/ltsn-01.ac.uk/7
lom_general_keyword_string-v0: Unstable angina
lom_general_keyword_string-v1: Chest pain
lom_general_language-v0: en-US
lom_general_title_string-v0: Acute coronary syndromes interactive case : 'A case of chest pain in a 63 year old male'
lom_general_title_string__language-v0: en-GB
Note that keys that arise from an XPath expression with attribute elements such as
//lom/general/title/string/@language get two underscores, i.e.
lom_general_title_string__language-v0
Note also the variant numbering convention employed such that lom_general_title_string__language-v1
is the language code for the title string in lom_general_title_string-v1 and so forth.
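For comparison with the list-based dictionaries used by the Python parser, the variant-numbered keys can be collapsed into lists with a few lines of Python. This is only a sketch of the relationship between the two conventions; the function name is hypothetical and no such conversion is part of either harvester.

```python
import re

# Matches keys such as 'lom_general_keyword_string-v1'.
VARIANT = re.compile(r"^(?P<key>.+)-v(?P<number>\d+)$")


def collapse_variants(flat):
    """Turn {'key-v0': a, 'key-v1': b} into {'key': [a, b]}."""
    grouped = {}
    for key, value in flat.items():
        match = VARIANT.match(key)
        if match is None:
            raise ValueError("key has no variant suffix: " + key)
        number = int(match.group("number"))
        grouped.setdefault(match.group("key"), []).append((number, value))
    # Order each list by variant number so positions line up across keys.
    return {
        key: [value for _, value in sorted(pairs, key=lambda pair: pair[0])]
        for key, pairs in grouped.items()
    }


flat = {
    "lom_general_keyword_string-v1": "Chest pain",
    "lom_general_keyword_string-v0": "Unstable angina",
    "lom_general_title_string__language-v0": "en-GB",
}
collapsed = collapse_variants(flat)
```

Sorting by the variant number keeps related lists aligned, so the first language code still belongs to the first title string.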
What is required to run the Perl harvester
- Perl and MySQL
- the following modules available from CPAN:
Net::OAI::Harvester
LWP::Simple
DBI
DBD::mysql
XML::XPath
Data::Dumper for debugging
- You are free to use our database structure for your staging area should you wish. Note the harvester
requires the xpath_map and xml_store tables
- The harvester consists of the following files