Friday, 21 September 2007

DOAJ - Directory of Open Access Journals

Stumbled across DOAJ, which has some useful biodiversity-related content. Could think about adding this to my OpenURL resolver...

Thursday, 6 September 2007


JSTOR is a major user of SICIs, so I've been exploring it to see how hard it would be to add support for JSTOR (and SICIs) to my OpenURL resolver. Helpfully, JSTOR provides a list of the journals they archive, which is also available as a comma-delimited file. I grabbed the biological serials collection list and extracted information on the journals JSTOR carries.

An incidental by-product was a further 87 new journals for my database of journal names and ISSNs. To remind myself how to do this, here is the SQL that inserts the new journal names and ISSNs.

INSERT INTO issn(title, issn)
SELECT jstor.journal, jstor.issn FROM jstor
LEFT JOIN issn ON jstor.journal = issn.title
WHERE issn.title IS NULL;
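To try the query without touching the real database, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names follow the SQL above; the in-memory database, the function names, and the two sample journals are made up for illustration, and my actual database is MySQL, so treat this as an approximation only.

```python
import sqlite3


def make_demo_db():
    """Build a throwaway in-memory database with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE issn (title TEXT, issn TEXT);
        CREATE TABLE jstor (journal TEXT, issn TEXT);
    """)
    # One journal is already known, one is new (both are invented examples).
    conn.execute("INSERT INTO issn VALUES ('Example Journal A', '1111-1111')")
    conn.executemany(
        "INSERT INTO jstor VALUES (?, ?)",
        [("Example Journal A", "1111-1111"), ("Example Journal B", "2222-2222")],
    )
    return conn


def add_new_journals(conn):
    """Insert JSTOR journals whose title is not yet in issn; return the row count."""
    cur = conn.execute("""
        INSERT INTO issn (title, issn)
        SELECT jstor.journal, jstor.issn FROM jstor
        LEFT JOIN issn ON jstor.journal = issn.title
        WHERE issn.title IS NULL
    """)
    conn.commit()
    return cur.rowcount
```

Running add_new_journals a second time should add nothing, because every JSTOR title now has a match in the issn table.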

The next step would be to add code to the OpenURL resolver so it could figure out whether a reference was in JSTOR and provide a link to that resource (as well as any other links I have). For example, Walker et al. "The Effect of Feeding History on Retreat Construction in the Wolf Spider Hogna helluo (Araneae, Lycosidae)" is in JSTOR (sici:0161-8202(1999)27<689>2.0.CO;2-7), and is also in my database as the PDF is freely available online.

It would be nice if my OpenURL resolver could take a SICI and return metadata that included both the availability in JSTOR and on the American Arachnological Society web site.
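As a rough idea of what the first step might look like, here is a small Python sketch that pulls the ISSN, year, volume and starting page out of a SICI shaped like the one above. The regular expression is my own guess at a workable pattern and not a full implementation of the SICI standard; the issue and title code are treated as optional, and the function name is hypothetical.

```python
import re

# Item segment: ISSN(chronology)volume[:issue], then <page[:title code]>.
# The remainder of the SICI (control segment and check character) is ignored.
SICI_PATTERN = re.compile(
    r"^(?:sici:)?"
    r"(?P<issn>\d{4}-\d{3}[\dXx])"
    r"\((?P<year>\d{4})[^)]*\)"
    r"(?P<volume>\d+)(?::(?P<issue>[^<]+))?"
    r"<(?P<spage>\d+)(?::(?P<title_code>[^>]*))?>"
)


def parse_sici(sici):
    """Return a dict of the fields found in the SICI, or None if it does not match."""
    match = SICI_PATTERN.match(sici.strip())
    if match is None:
        return None
    # Drop the optional groups that were not present.
    return {key: value for key, value in match.groupdict().items() if value is not None}
```

The resulting fields could then be used to look up the article in the local database, in the same way as metadata arriving through an OpenURL.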

Thursday, 30 August 2007

Piggy Bank and bioGUID - browsing the biodiversity Semantic Web

Piggy Bank is, according to its developers:
...a Firefox extension that turns your browser into a mashup platform, by allowing you to extract data from different web sites and mix them together.

I've been rather underwhelmed in the past, but Jim Balhoff asked me why bioGUID didn't play ball, so I decided to take another look.

Pointing Piggy Bank at a bioGUID page, such as casent0498428, and clicking on the "data coin" icon in the bottom right corner of the Firefox browser window failed to produce anything interesting (Piggy Bank just grabbed some text from the web page, not the underlying RDF). After a bit of fussing, I finally got bioGUID to work with Piggy Bank.

BioGUID resolves a GUID and returns RDF as an XML document, with an XSL instruction included so that the web browser renders it as HTML. This means that a user with a web browser gets a nicely formatted page, but a Semantic Web tool gets RDF. However, Piggy Bank needs some help. First, I include a link to the RDF in the <head> element of the HTML, as described on the Piggy Bank web site, e.g.:

<link rel="alternate" type="application/rdf+xml"

The &format=raw parameter is important as this tells bioGUID to return RDF with the MIME type "application/rdf+xml". Piggy Bank needs this MIME type set, otherwise it just parses HTML. I normally return RDF as "application/xml" so it displays nicely in web browsers if I'm debugging, and to ensure the XML is transformed into HTML. Clearly I'm going to have to rethink this, perhaps by exploring content negotiation.
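For what it's worth, here is a sketch in Python of how content negotiation might decide between the two MIME types. This is not how bioGUID currently works: the function names are hypothetical, and the Accept header parsing is deliberately simplified (it only looks at the q parameter).

```python
RDF_XML = "application/rdf+xml"
PLAIN_XML = "application/xml"


def parse_accept(header):
    """Return (media_type, q) pairs from an Accept header, highest q first."""
    prefs = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so types with equal q keep their header order.
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def choose_mime(accept_header, format_param=None):
    """Pick the MIME type to serve; format=raw always forces RDF/XML."""
    if format_param == "raw":
        return RDF_XML
    for media_type, q in parse_accept(accept_header or ""):
        if q <= 0:
            continue
        if media_type == RDF_XML:
            return RDF_XML
        if media_type in (PLAIN_XML, "text/html", "*/*"):
            return PLAIN_XML
    # Default to the browser-friendly type.
    return PLAIN_XML
```

A browser that asks for text/html first would still get application/xml (and so the XSL-transformed page), while a client that asks for application/rdf+xml would get it without needing the extra parameter.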

Anyway, for the time being Piggy Bank and bioGUID seem to work together, as this screen shot shows:

If you want to explore Piggy Bank as a Semantic Web browser for biodiversity data, one place to start is and the sequences it links to, many of which link to specimens.

Wednesday, 15 August 2007

OpenURL and COinS

I've improved the bioGUID OpenURL service, based on my experience with the OpenURL Referrer extension for Firefox. This extension takes COinS links in HTML and inserts a link to an OpenURL resolver. The EDIT scratchpads being developed by Vince Smith and colleagues at the Natural History Museum contain COinS (see the APOL - Abyssal Polychaetes OnLine bibliography, for an example). I installed the OpenURL Referrer extension, but my OpenURL resolver broke. It turned out that COinS want OpenURL 1.0, whereas I supported the much simpler version 0.1 (Caveat Lector pretty much sums up how I feel about this). Anyway, a quick hack of my code and it now works.
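The gist of that kind of hack can be sketched in Python as follows. This is an illustration rather than my resolver's actual code: it assumes the OpenURL 1.0 query arrives in key/value form with rft.-prefixed metadata keys, maps those onto the plain 0.1-style names, and drops the administrative keys. The function name is hypothetical, and the value of rft_id is passed through unchanged.

```python
from urllib.parse import parse_qs


def normalise_openurl(query):
    """Map an OpenURL 1.0 key/value query onto simple 0.1-style keys."""
    flat = {}
    for key, values in parse_qs(query).items():
        value = values[0]
        if key.startswith("rft."):
            # rft.volume -> volume, rft.spage -> spage, and so on.
            key = key[len("rft."):]
            if key == "jtitle":
                # Version 0.1 uses "title" for the journal title.
                key = "title"
        elif key == "rft_id":
            key = "id"
        elif key.startswith(("ctx_", "rft_", "rfr_", "url_")):
            # Administrative keys such as ctx_ver and rft_val_fmt are not needed.
            continue
        flat.setdefault(key, value)
    return flat
```

A version 0.1 query passes through unchanged, so the same lookup code can serve both kinds of request.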

If you add as a profile, you can then use my OpenURL resolver (there's even a little button to click on at). Once installed, scratchpads such as the APOL - Abyssal Polychaetes OnLine bibliography now have links to the OpenURL resolver.

Why use bioGUID's service? Well, in addition to supporting papers that have DOIs, I'm adding other sources, as mentioned earlier in this blog, and on iPhylo. Some 9000 freely available full text articles are now available through this service. There's still a lot more to do to this service, such as supporting other identifiers (e.g., PubMed and SICIs).

Thanks to Julius Welby for reminding me about COinS.

Wednesday, 6 June 2007

Banff Manifesto

From Bio2RDF, I stumbled across this link to the Banff Manifesto. To quote:
Some participant from the HCLS-DI2007 workshop, and from the I3 Workshop propose to the community of life science to establish an authority for attributing namespaces used in semantic web to build URIs. This authority will also establish a set of rules for the construction of well formed RDF documents. When someone will use URI approved from the BM authority, they will know that some standard of naming are upheld. The set of rules are good pratices that need to be followed to be in recognized by the BM community.

Much in the way DOI work, BM will certified the appartenance of a namespace to a data provider and disambiguate the naming for it.

Essentially, it proposes that identifiers are written as


where the authority_namespace is bm. Resolution (at least in the examples presented) is via a persistent URL, e.g. bm:uniprot:p26838 becomes

This looks to be much the same idea as bioGUID. Interestingly, one of the examples of the URI chaos at present is a DOI for a Protein Data Bank record (doi:10.2210/pdb2b3y/pdb). This DOI resolves to a compressed file served by FTP. You can see the web version here. The CrossRef XML for this DOI is interesting, in that it sort of treats the record as an article, but also has a component-list (which curiously states that this record is a part of itself).

Tuesday, 29 May 2007

OpenURL resolver for Psyche

Following on from first efforts with Journal of Arachnology described on iPhylo, I've added Psyche to my OpenURL resolver. Psyche is the journal of the Cambridge Entomological Club. Three things make it easy to add: full text PDFs are available for most articles, Jonathan Rees has created an XML file for each issue (e.g., 103.xml) listing metadata for each article, and the URLs for the PDFs are easy to construct from the metadata. So, it was simply a case of harvesting the XML files, extracting the metadata, and adding this to a local MySQL database.

There were some minor "gotchas", such as the presence of named entities in the XML (e.g. &eacute; for the character é). These weren't declared, so I needed to stick these declarations into each XML file:

<!ENTITY aelig "&#230;" >
<!ENTITY ldquo "&#8220;">
<!ENTITY rdquo "&#8221;">
<!ENTITY lsquo "&#8216;">
<!ENTITY rsquo "&#8217;">
<!ENTITY ouml "&#246;">
<!ENTITY uuml "&#252;">
<!ENTITY mdash "&#8212;">
<!ENTITY eacute "&#233;">
<!ENTITY euml "&#235;">
<!ENTITY oelig "&#339;">
<!ENTITY OElig "&#338;">
<!ENTITY AElig "&#198;">
<!ENTITY acir "&#226;">
<!ENTITY oacute "&#243;">
<!ENTITY iacute "&#237;">
<!ENTITY aacute "&#225;">
<!ENTITY ndash "&#8211;">
<!ENTITY atilde "&#227;">
<!ENTITY uacute "&#250;">
<!ENTITY auml "&#228;">
<!ENTITY ocirc "&#244;">
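Doing this by hand for every file gets old quickly, so here is a sketch in Python of one way it could be automated. It wraps declarations like those above in a DOCTYPE internal subset and prepends it to the document, after which the standard library parser expands the entities. The root element name "issue", the function name, and the three-entry entity table are assumptions for illustration; the real files may be structured differently.

```python
import xml.etree.ElementTree as ET

# A small subset of the entity table above: name -> Unicode code point.
ENTITIES = {"eacute": 233, "ouml": 246, "mdash": 8212}


def add_entity_declarations(xml_text, root_name, entities=ENTITIES):
    """Return xml_text with a DOCTYPE that declares the given named entities."""
    declarations = "".join(
        '<!ENTITY %s "&#%d;">' % (name, code) for name, code in sorted(entities.items())
    )
    doctype = "<!DOCTYPE %s [%s]>" % (root_name, declarations)
    if xml_text.lstrip().startswith("<?xml"):
        # The DOCTYPE has to come after the XML declaration, if there is one.
        head, sep, rest = xml_text.partition("?>")
        return head + sep + doctype + rest
    return doctype + xml_text
```

For example, ET.fromstring(add_entity_declarations(text, "issue")) should parse a file that uses &eacute;, whereas parsing the same text without the declarations raises a ParseError for the undefined entity.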

Now my OpenURL resolver will check whether you are trying to resolve a link to an article in Psyche, and if it knows where the PDF is, you will be taken there. For example, this link on three species of Phidippus jumping spiders (Araneae: Salticidae)&title=Psyche, Camb.&date=1965&year=1965&volume=72&spage=133&epage=147 goes to the PDF of this paper on Phidippus.
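Links like this are easy to generate from article metadata. Here is a short Python sketch; the base URL is a placeholder (example.org) rather than the resolver's real address, the function name is hypothetical, and the list of keys simply follows the parameters visible in the link above plus genre and atitle.

```python
from urllib.parse import urlencode

# Placeholder address, not the real resolver.
RESOLVER = "http://example.org/openurl"

# Parameters are emitted in this order when present.
KEYS = ("genre", "atitle", "title", "date", "year", "volume", "spage", "epage")


def build_openurl(metadata, base=RESOLVER):
    """Build an OpenURL 0.1-style link from a dict of article metadata."""
    pairs = [(key, metadata[key]) for key in KEYS if metadata.get(key)]
    # urlencode takes care of escaping spaces, commas and other characters.
    return base + "?" + urlencode(pairs)
```

Calling it with the title, date, volume and page values from the example should give a link that resolves to the same article.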

Monday, 23 April 2007

Blank nodes for specimens without URI

Some specimens in GenBank can be easily linked to an external record via a URI (albeit one I've constructed), but for many GenBank sequences the specimen is either so poorly described, or doesn't have a digital representation, that simply linking to a URI is not possible. After playing with generating my own URIs for records in a local MySQL database of specimens, it occurred to me (eventually) that blank nodes might be a useful way to handle these. That is, a node in the RDF that has no URI, but to which all the information about that specimen is linked. The diagram on the right shows the model. In RDF/XML, it would look something like this:

<bioguid:voucher rdf:parseType="Resource">
  <rdf:type rdf:resource=""/>
  <darwin:Locality>Rio San Juan, 10deg56'N 84deg18'W</darwin:Locality>
  <dc:title>OMNH 33325</dc:title>
</bioguid:voucher>

The original GenBank record is DQ502492.
In the absence of a URI, we make statements such as "the specimen with the title 'OMNH 33325'".
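To make the blank node idea concrete, here is a small Python sketch that writes the same kind of statements as N-Triples, where the specimen is a blank node (_:voucher1) rather than a URI. Only the Dublin Core title URI is real; the sequence URI and the voucher and locality predicate URIs are example.org placeholders that I've made up for illustration, as are the function names.

```python
DC_TITLE = "http://purl.org/dc/elements/1.1/title"
# Hypothetical predicate URIs standing in for the bioguid and darwin vocabularies.
VOUCHER = "http://example.org/bioguid/voucher"
LOCALITY = "http://example.org/darwin/Locality"


def literal(text):
    """Quote a string as an N-Triples literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"%s"' % escaped


def voucher_triples(sequence_uri, title, locality, label="voucher1"):
    """Describe a specimen as a blank node hanging off the sequence record."""
    node = "_:" + label
    return [
        "<%s> <%s> %s ." % (sequence_uri, VOUCHER, node),
        "%s <%s> %s ." % (node, DC_TITLE, literal(title)),
        "%s <%s> %s ." % (node, LOCALITY, literal(locality)),
    ]
```

The blank node label only has meaning within one document, which is the point: the specimen is identified by what is said about it, not by a name of its own.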

Sunday, 15 April 2007

Adding GUIDs to GenBank records

GenBank records typically come with links to related NCBI records, such as the NCBI Taxonomy and PubMed databases, but not all sequences have PubMed records. For example, sequence DQ343272 has the following publication record:

REFERENCE   1  (bases 1 to 563)
AUTHORS Schubart,C.D., Cannicci,S., Vannini,M. and Fratini,S.
TITLE Molecular phylogeny of grapsoid crabs (Decapoda, Brachyura) and
allies based on two mitochondrial genes and a proposal for
refraining from current superfamily classification
JOURNAL J. Zoolog. Syst. Evol. Res. 44 (3), 193-199 (2006)

Ideally, every publication would have a GUID, and the GenBank record would be linked to that GUID. As a first step to this, bioGUID uses a simple web service to parse the JOURNAL field and look for a DOI. The web service uses the Open Source ParaTools to extract metadata from the citation, then calls CrossRef's OpenURL resolver to search for a DOI.
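To give a flavour of what template matching on a JOURNAL field involves, here is a Python sketch with a single regular expression. It is not ParaTools (which has many templates and is far more forgiving); the pattern and function name are my own and only handle citations laid out as "Journal volume (issue), start-end (year)".

```python
import re

# One hand-written template: title, volume, optional (issue), page range, (year).
JOURNAL_TEMPLATE = re.compile(
    r"^(?P<title>.+?)\s+(?P<volume>\d+)\s*(?:\((?P<issue>[^)]+)\))?,\s*"
    r"(?P<spage>\d+)-(?P<epage>\d+)\s*\((?P<year>\d{4})\)$"
)


def parse_journal_field(citation):
    """Return the metadata found in a JOURNAL line, or None if the template fails."""
    match = JOURNAL_TEMPLATE.match(citation.strip())
    if match is None:
        return None
    return {key: value for key, value in match.groupdict().items() if value is not None}
```

The extracted fields are the ones needed to build an OpenURL query for CrossRef.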

Returning to the example above, if we append J. Zoolog. Syst. Evol. Res. 44 (3), 193-199 (2006) to, we get this XML result (you can get the same result by clicking here):
<?xml version="1.0" encoding="UTF-8"?>
<paracite result="parsed">
<publication>J. Zoolog. Syst. Evol. Res.</publication...</marked>
<title>J. Zoolog. Syst. Evol. Res.</title>
<ref>J. Zoolog. Syst. Evol. Res. 44 (3), 193-199 (2006)</ref>
<openurl>sid=paracite&amp;spage=193...year=2006 </openurl>

If the result attribute of the paracite tag is parsed, then the service found a template that matches the citation (shown in the match tag) and extracted the metadata. If it didn't match a template, the attribute is set to failed and no metadata is returned.
Any metadata found is used to construct an OpenURL query, which is sent to CrossRef. In this example, the reference has the DOI doi:10.1111/j.1439-0469.2006.00354.x, which gives us a GUID to link the sequence to. This is an example of finding an existing GUID based on metadata, and thereby adding value to a GenBank record.
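A client of the service could handle the response along these lines. This Python sketch assumes a complete, well-formed response with a paracite root element, a result attribute, and an openurl element somewhere inside it, as in the fragment above; the function name is hypothetical and the rest of the response is ignored.

```python
import xml.etree.ElementTree as ET


def openurl_from_paracite(xml_text):
    """Return the OpenURL query from a parsed response, or None if parsing failed."""
    root = ET.fromstring(xml_text)
    if root.tag != "paracite" or root.get("result") != "parsed":
        # result="failed" means no template matched, so there is no metadata.
        return None
    query = root.findtext(".//openurl")
    return query.strip() if query else None
```

The returned string is the query that would be sent on to CrossRef in search of a DOI.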

Sunday, 1 April 2007


A couple of comments on bioGUID have appeared. Egon Willighagen posted a short note on Nature's Semantic Web for the Life Sciences forum, to which I've responded by clarifying what bioGUID does.

Leigh Dodds has a note on the All My Eye blog. Embarrassingly, the example Leigh used was broken for a while because the California Academy of Sciences DiGIR provider was offline, but it's back now. This is one of the perils of federation. When it's online (or the metadata for the record is already in the cache), it looks like this (if your browser supports SVG):

Friday, 30 March 2007

Common Naming

Jonathan Rees pointed me to his document Common Naming, a "memo about effective use of shared URI's to link semantic web projects." It outlines a vision scarily like bioGUIDs, but with the added use of Persistent URLs (PURLs). To quote from the introduction:

In order for independently developed semantic web projects to work together, they need to share names for things that they both talk about. Common choice and consistent use of URI's forms the backbone of the semantic web.

For example, if two resources or applications talk about a database record (or the things it describes), they need to use a common URI for the record if a combination ('mashup') of the two resources is to connect information that each resource has relating to that record.

This is not generally the case now.

I see this document as reinforcement of my feeling that the lack of good URIs is perhaps the main thing holding us back from building really cool applications.


This is where I'll post comments about bioGUID, a tool to "bootstrap the biodiversity Semantic Web" (I say that with my tongue a little in my cheek).