Sunday 15 April 2007

Adding GUIDs to GenBank records

GenBank records typically link to related NCBI records, such as entries in the NCBI Taxonomy and PubMed databases, but not every sequence has a PubMed record. For example, sequence DQ343272 has the following publication record:

REFERENCE   1  (bases 1 to 563)
AUTHORS Schubart,C.D., Cannicci,S., Vannini,M. and Fratini,S.
TITLE Molecular phylogeny of grapsoid crabs (Decapoda, Brachyura) and
allies based on two mitochondrial genes and a proposal for
refraining from current superfamily classification
JOURNAL J. Zoolog. Syst. Evol. Res. 44 (3), 193-199 (2006)


Ideally, every publication would have a GUID, and the GenBank record would be linked to that GUID. As a first step towards this, bioGUID uses a simple web service to parse the JOURNAL field and look for a DOI. The web service uses the open source ParaTools to extract metadata from the citation, then calls CrossRef's OpenURL resolver to search for a DOI.
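As a rough illustration of the first step, here is a minimal Python sketch that pulls the JOURNAL field out of a GenBank reference block so it can be passed to a citation parser. The function name journal_field is hypothetical (it is not part of bioGUID or ParaTools), and the sketch assumes the field fits on a single line, as it does for DQ343272; multi-line JOURNAL entries would need extra handling.

```python
import re

# Matches a line that starts (after optional indentation) with the
# JOURNAL keyword and captures the rest of that line.
JOURNAL_RE = re.compile(r"^\s*JOURNAL\s+(.+)$", re.MULTILINE)


def journal_field(reference_text):
    """Return the text of the first JOURNAL line, or None if there is none.

    Assumption: the JOURNAL field occupies a single line.
    """
    match = JOURNAL_RE.search(reference_text)
    if match is None:
        return None
    return match.group(1).strip()
```

Given the reference block above, this should return the string "J. Zoolog. Syst. Evol. Res. 44 (3), 193-199 (2006)", which is the citation text the service needs.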

Returning to the example above, if we append J. Zoolog. Syst. Evol. Res. 44 (3), 193-199 (2006) to http://bioguid.info/cgi-bin/paracite?q=, we get this XML result:
<?xml version="1.0" encoding="UTF-8"?>
<paracite result="parsed">
<issue>3</issue>
<date>2006</date>
<year>2006</year>
<publication>J. Zoolog. Syst. Evol. Res.</publication>
<volume>44</volume>
<match>_PUBLICATION_ _VOLUME_ (_ISSUE_), _SPAGE_-_EPAGE_ (_YEAR_)
</match>
<epage>199</epage>
<title>J. Zoolog. Syst. Evol. Res.</title>
<spage>193</spage>
<ref>J. Zoolog. Syst. Evol. Res. 44 (3), 193-199 (2006)</ref>
<openurl>sid=paracite&amp;spage=193...year=2006 </openurl>
<doi>10.1111/j.1439-0469.2006.00354.x</doi>
</paracite>

If the result attribute of the paracite element is "parsed", the service found a template that matches the citation (shown in the match element) and extracted the metadata. If no template matched, the attribute is set to "failed" and no metadata is returned.
Any metadata found is used to construct an OpenURL query, which is sent to CrossRef. In this example, the reference has the DOI 10.1111/j.1439-0469.2006.00354.x, which gives us a GUID to link the sequence to. This is an example of finding an existing GUID from metadata, and thereby adding value to a GenBank record.
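The request and response handling described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the helper names build_query_url and read_paracite are hypothetical, the citation is assumed to need percent-encoding before it is appended to the URL, and the response is assumed to have the flat structure shown in the sample result. No network call is made here; fetching the URL is left to the caller.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Endpoint taken from the example in the post.
SERVICE = "http://bioguid.info/cgi-bin/paracite?q="


def build_query_url(citation):
    """Append the citation to the service URL.

    Assumption: the citation should be percent-encoded.
    """
    return SERVICE + quote(citation)


def read_paracite(xml_text):
    """Return the extracted metadata as a dict, or None on failure.

    Mirrors the behaviour described in the post: result="parsed"
    means a template matched; anything else is treated as a failure.
    """
    root = ET.fromstring(xml_text)
    if root.get("result") != "parsed":
        return None
    # Each child element (volume, spage, doi, ...) becomes a dict entry.
    return {child.tag: (child.text or "").strip() for child in root}
```

A caller would check whether the returned dict contains a "doi" key before linking the sequence to it, since a citation can be parsed successfully without CrossRef finding a DOI.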
