Monday, 23 April 2007

Blank nodes for specimens without URI

Some specimens in GenBank can be easily linked to an external record via a URI (albeit one I've constructed), but for many GenBank sequences the specimen is either so poorly described, or so lacking in any digital representation, that simply linking to a URI is not possible. After playing with generating my own URIs for records in a local MySQL database of specimens, it occurred to me (eventually) that blank nodes might be a useful way to handle these. A blank node is a node in the RDF graph that has no URI, but to which all the information about that specimen is linked. The diagram on the right shows the model. In RDF/XML, it would look something like this:

<bioguid:voucher rdf:parseType="Resource">
<rdf:type rdf:resource=""/>
<darwin:Locality>Rio San Juan, 10deg56'N 84deg18'W</darwin:Locality>
<dc:title>OMNH 33325</dc:title>
</bioguid:voucher>

The original GenBank record is DQ502492.
In the absence of a URI, we can still make statements about "the specimen with the title 'OMNH 33325'".
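
To make the idea concrete, here is a minimal Python sketch (not bioGUID's actual code) that stores triples as plain tuples and hangs the specimen description off a blank node. The dc:title URI is the standard Dublin Core one; the other predicate URIs and the sequence URI are placeholders I've assumed for illustration, since the real namespace URIs aren't given above.

```python
import itertools

DC_TITLE = "http://purl.org/dc/elements/1.1/title"
# Placeholders: the real darwin: and bioguid: namespace URIs are assumptions.
DARWIN_LOCALITY = "http://example.org/darwin#Locality"
BIOGUID_VOUCHER = "http://example.org/bioguid#voucher"

_counter = itertools.count()


def new_blank_node():
    """Return a fresh blank node label such as '_:b0' (no URI involved)."""
    return f"_:b{next(_counter)}"


def describe_voucher(sequence_uri, title, locality):
    """Return triples that attach a specimen description to a blank node."""
    specimen = new_blank_node()
    return [
        (sequence_uri, BIOGUID_VOUCHER, specimen),
        (specimen, DC_TITLE, title),
        (specimen, DARWIN_LOCALITY, locality),
    ]


def sequences_with_voucher_title(triples, title):
    """Find sequences whose voucher (a blank node) has the given dc:title."""
    specimens = {s for s, p, o in triples if p == DC_TITLE and o == title}
    return [s for s, p, o in triples if p == BIOGUID_VOUCHER and o in specimens]


triples = describe_voucher(
    "http://example.org/genbank/DQ502492",  # placeholder URI for the sequence
    "OMNH 33325",
    "Rio San Juan, 10deg56'N 84deg18'W",
)
```

Calling sequences_with_voucher_title(triples, "OMNH 33325") finds the sequence by way of the specimen's title, which is the only handle we have on a node without a URI.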

Sunday, 15 April 2007

Adding GUIDs to GenBank records

GenBank records typically come with links to related NCBI records, such as the NCBI Taxonomy and PubMed databases, but not all sequences have PubMed records. For example, sequence DQ343272 has the following publication record:

REFERENCE   1  (bases 1 to 563)
  AUTHORS   Schubart,C.D., Cannicci,S., Vannini,M. and Fratini,S.
  TITLE     Molecular phylogeny of grapsoid crabs (Decapoda, Brachyura) and
            allies based on two mitochondrial genes and a proposal for
            refraining from current superfamily classification
  JOURNAL   J. Zoolog. Syst. Evol. Res. 44 (3), 193-199 (2006)

Ideally, every publication would have a GUID, and the GenBank record would be linked to that GUID. As a first step toward this, bioGUID uses a simple web service to parse the JOURNAL field and look for a DOI. The web service uses the open source ParaTools to extract metadata from the citation, then calls CrossRef's OpenURL resolver to search for a DOI.
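
As a rough illustration of the parsing step, here is a Python sketch that matches a JOURNAL field against a single hand-written template and turns the result into an OpenURL-style query string. This is not how ParaTools works internally (it has its own library of templates); the regular expression, the function names, and the query keys other than sid, spage and year are my assumptions.

```python
import re
from urllib.parse import urlencode

# One hypothetical template:
#   <journal> <volume> (<issue>), <spage>-<epage> (<year>)
JOURNAL_TEMPLATE = re.compile(
    r"^(?P<title>.+?)\s+"
    r"(?P<volume>\d+)\s*"
    r"(?:\((?P<issue>[^)]+)\))?,\s*"
    r"(?P<spage>\d+)-(?P<epage>\d+)\s*"
    r"\((?P<year>\d{4})\)$"
)


def parse_journal_field(citation):
    """Return a dict of metadata, or None if the template does not match."""
    match = JOURNAL_TEMPLATE.match(citation.strip())
    if match is None:
        return None
    # Drop optional groups (such as the issue) that were not present.
    return {key: value for key, value in match.groupdict().items() if value}


def build_openurl_query(metadata, sid="paracite"):
    """Encode the metadata as a key=value query string."""
    return urlencode({"sid": sid, **metadata})


metadata = parse_journal_field("J. Zoolog. Syst. Evol. Res. 44 (3), 193-199 (2006)")
query = build_openurl_query(metadata)
```

A citation that doesn't fit the template simply returns None, which corresponds to the case where no DOI lookup can be attempted.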

Returning to the example above, if we append the citation "J. Zoolog. Syst. Evol. Res. 44 (3), 193-199 (2006)" to the web service URL, we get this XML result:
<?xml version="1.0" encoding="UTF-8"?>
<paracite result="parsed">
<marked><publication>J. Zoolog. Syst. Evol. Res.</publication>...</marked>
<title>J. Zoolog. Syst. Evol. Res.</title>
<ref>J. Zoolog. Syst. Evol. Res. 44 (3), 193-199 (2006)</ref>
<openurl>sid=paracite&amp;spage=193...year=2006</openurl>
</paracite>

If the result attribute of the paracite tag is "parsed", then the service found a template that matches the citation (shown in the match tag) and extracted the metadata. If it didn't match a template, the attribute is set to "failed" and no metadata is returned.
Any metadata found is used to construct an OpenURL query, which is sent to CrossRef. In this example, the reference has the DOI doi:10.1111/j.1439-0469.2006.00354.x, which gives us a GUID to link the sequence to. This is an example of finding an existing GUID based on metadata, and thereby adding value to a GenBank record.
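
A client of the service might handle the response along these lines. This is only a sketch: the sample document is a cut-down version of the result above (the elided middle of the openurl string is left out, not guessed), and the function name is mine. It doesn't contact CrossRef; it just shows the check on the result attribute and how to recover the query fields.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs

# Cut-down sample response; only the openurl fields visible above are included.
SAMPLE_PARSED = b"""<?xml version="1.0" encoding="UTF-8"?>
<paracite result="parsed">
<ref>J. Zoolog. Syst. Evol. Res. 44 (3), 193-199 (2006)</ref>
<openurl>sid=paracite&amp;spage=193&amp;year=2006</openurl>
</paracite>"""

SAMPLE_FAILED = b"""<?xml version="1.0" encoding="UTF-8"?>
<paracite result="failed"/>"""


def read_paracite_result(xml_bytes):
    """Return the OpenURL fields as a dict, or None if parsing failed."""
    root = ET.fromstring(xml_bytes)
    if root.tag != "paracite" or root.get("result") != "parsed":
        return None
    openurl = (root.findtext("openurl") or "").strip()
    # parse_qs returns lists of values; keep the first value for each key.
    return {key: values[0] for key, values in parse_qs(openurl).items()}


fields = read_paracite_result(SAMPLE_PARSED)
```

The fields dict is what would be sent on to the OpenURL resolver; a None result means there is nothing to look up.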

Sunday, 1 April 2007


A couple of comments on bioGUID have appeared. Egon Willighagen posted a short note on Nature's Semantic Web for the Life Sciences forum, to which I've responded by clarifying what bioGUID does.

Leigh Dodds has a note on the All My Eye blog. Embarrassingly, the example Leigh used was broken for a while because the California Academy of Sciences DiGIR provider was offline, but it's back now. This is one of the perils of federation. When it's online (or the metadata for the record is already in the cache), it looks like this (if your browser supports SVG):