Summary & FAQ

BioCorder (Biological Recorder) is an attempt to develop a generalized framework to represent taxonomic and systematic data within the spirit of the Semantic Web. It is being developed to facilitate the collaborative sharing and distribution of data acquired during taxon-focused research projects. BioCorder databases track specimen-based information from inception (i.e. project design) through data acquisition, analysis, and beyond, linking the user's locally stored data to distributed data in other installations of BioCorder, as well as to other data providers (e.g. GenBank, PubMed, Google Scholar).

The BioCorder concept was defined in a grant application co-written by David Reed (University of Florida) and Vince Smith (INHS, Univ. of Illinois), and awarded a little over half a million dollars in March 2005 by the National Science Foundation. This project is a collaboration between its authors, who are joined by Mark Hafner at Louisiana State University, and Rod Page at the University of Glasgow.

FAQ:
  • What is BioCorder?
  • What is the timescale for the BioCorder project?
  • What data standards does BioCorder use?
  • What is the Semantic Web?
  • Who is developing the Semantic Web?
  • Why do taxonomists need the Semantic Web?
  • Limitations of existing standards to exchange data?
  • Does this mean the data standards being developed are redundant?
  • How does the Semantic Web work?

    What is BioCorder?
    Fundamentally, BioCorder is about data integration. It is about letting taxonomists and systematists represent their data (e.g. taxonomic names, morphological characters, images), and not just the products of those data (websites, keys, descriptive publications), in a way that allows these data to be seamlessly integrated across the World Wide Web, regardless of where they physically reside. A solution to this problem has become known among computer scientists as the Semantic Web, and BioCorder is our attempt to provide a framework for representing taxonomic and systematic data using Semantic Web architecture.

    What is the timescale for the BioCorder project?
    Although our funding was announced in March 2005, the project did not formally begin until both programmers were employed (June 2005). Our first year (05/06) is being devoted to proving and testing the concept behind the Semantic Web technologies that will underpin the sharing and exchange of data. To this end we are using our current web-based databases (notably LouseBASE, SID and the Taxonomic Search Engine) as a test bed for development. Year two (06/07) will be devoted to developing the first release of the BioCorder database, as outlined on the BioCorder website. In year three (07/08) we hope to begin deploying and testing BioCorder among our collaborators.

    What data standards does BioCorder use?
    Two sets of standards are relevant here. The first are those that allow data represented within BioCorder to become part of the Semantic Web. The second are those that BioCorder uses to store and exchange data with non-Semantic Web applications (e.g. the GBIF data portal). Data standards for the former are essentially those outlined by the World Wide Web Consortium (W3C) for the Semantic Web. Specifically, these are URIs (there are many kinds of these; BioCorder largely uses LSIDs), HTTP (Hypertext Transfer Protocol, the protocol that links up the web) and RDF (Resource Description Framework, the format for providing machine-readable information on the Semantic Web). Internally, specimen data within BioCorder are being developed around the Darwin Core, a simple exchange format for specimens that is compatible with GBIF. Other internationally recognised standards will be adopted as the project develops. However, in most cases these standards lack the breadth required by the diverse data concepts stored by BioCorder; indeed, this is partly the motivation for development of the Semantic Web. See the answer on the limitations of existing standards for further details.
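
    As a rough illustration of the identifiers mentioned above, the sketch below splits an LSID into its parts, assuming the usual layout urn:lsid:authority:namespace:object[:revision]. The names LSID and parse_lsid and the example.org identifier are hypothetical and are not taken from the BioCorder code base.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    """Components of a Life Science Identifier (illustrative only)."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split an LSID of the form urn:lsid:authority:namespace:object[:revision]."""
    parts = text.split(":")
    # The first two parts are the fixed prefix "urn:lsid", compared case-insensitively.
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not a valid LSID: {text!r}")
    # Authority, namespace and object identifier must all be present.
    if any(part == "" for part in parts[2:5]):
        raise ValueError(f"LSID has an empty component: {text!r}")
    # The revision is optional; an absent or empty one is reported as None.
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(parts[2], parts[3], parts[4], revision)


# Hypothetical identifier for a specimen record (example.org is a placeholder).
specimen = parse_lsid("urn:lsid:example.org:specimens:12345")
print(specimen.authority)  # example.org
print(specimen.object_id)  # 12345
```

    The authority part names who issued the identifier, which is what lets software work out where to ask for the data and metadata behind it.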

    What is the Semantic Web?
    In short, the Semantic Web is a mesh of information linked up in such a way that it can be easily processed by computers on a global scale. You can think of it as being an efficient way of representing data on the World Wide Web, or as a globally linked database. Information on the Semantic Web is maintained in a structured form on web servers, and is fairly easy for both computers and people to work with.

    Who is developing the Semantic Web?
    This is an initiative of the W3C, the organization that defines many of the standards and protocols upon which the World Wide Web operates. You are viewing this web page as HTML delivered over HTTP, both of which W3C has helped to standardize. As with those standards, the protocols for the Semantic Web are defined by W3C and (for the most part) organizations cannot ignore W3C standards, although they occasionally “add” to them, which explains, for example, why some web browsers display web pages slightly differently.

    Why do taxonomists need the Semantic Web?
    Taxonomists in the twenty-first century face a crisis of data organization. There are an estimated 1.7 million known species described by at least 3.5 million names, and represented by several hundred million specimens in natural history collections across the globe. Each of these specimens has thousands of morphological, molecular, ecological and behavioural attributes, which it is the job of taxonomists to describe. How can taxonomists, most of whom specialize in a tiny fraction of this diversity, represent their contribution to this ocean of information, and do this in a way that allows others to find and integrate these data with their own research? The Semantic Web provides a solution to this problem. It provides a mechanism by which anybody can make any statement about any piece of data, and do this in a way that allows everyone to precisely find that piece of data and understand its meaning across the web. What is more, because it is the computer, and not the user, that automatically processes the context (or semantics) of these data, it is possible for the computer to automatically search, organise, and integrate this information, placing it instantly at the disposal of the user.

    Limitations of existing standards to exchange data?
    In recent years there has been a considerable effort to develop data standards for exchanging taxonomic and systematic data. Examples include the “Darwin Core” and “ABCD schema” for exchanging specimen-related information, and the Taxonomic Concept Transfer Schema for taxonomic data. These standards act like templates, allowing people to share documents containing data. However, they only work if two conditions are met. First, they require everyone to have agreed on the types of information they want to exchange. The moment someone comes up with a new data type, all the users of that standard have to meet and agree on the modifications necessary to accommodate that new data concept. Second, they assume that all users define their data types in exactly the same way. In practice this is not always the case. Typically these data are exchanged between computers in XML (Extensible Markup Language), a language that was developed for sharing documents (not data) across the World Wide Web. Because of this, XML has many features (like attributes and entities) that make sense for document-oriented systems, but cause problems when expressing data. For example, basic operations like merging XML documents can be very difficult. In part this is because there are many ways to say the same thing in XML. Still not convinced? Check out "From XML to RDF: how semantic web technologies will change the design of 'omic' standards" by Xiaoshu Wang, Robert Gorlitsky & Jonas S Almeida in Nature Biotechnology (Sept. 2005), Vol. 23, No. 9, pp. 1099-1103.
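
    To make the point about there being many ways to say the same thing in XML more concrete, here is a small sketch using Python's standard library. The two records, their element and attribute names, and the facts helper are invented for illustration and do not follow Darwin Core, ABCD or any other real schema.

```python
import xml.etree.ElementTree as ET

# Two hypothetical records that state the same fact in different XML shapes.
AS_ATTRIBUTE = '<specimen id="s1" genus="Pediculus"/>'
AS_ELEMENT = '<specimen id="s1"><genus>Pediculus</genus></specimen>'


def facts(xml_text: str) -> set:
    """Flatten a record into (subject, property, value) tuples."""
    root = ET.fromstring(xml_text)
    subject = root.get("id")
    found = set()
    # Properties written as attributes (other than the identifier itself).
    for name, value in root.attrib.items():
        if name != "id":
            found.add((subject, name, value))
    # Properties written as child elements.
    for child in root:
        found.add((subject, child.tag, (child.text or "").strip()))
    return found


print(AS_ATTRIBUTE == AS_ELEMENT)                # False: the documents differ
print(facts(AS_ATTRIBUTE) == facts(AS_ELEMENT))  # True: the facts do not
```

    A program that merges such records has to know every shape a fact might take. Once each fact is reduced to a subject, a property and a value, the two records become identical, which is roughly the representation that RDF adopts from the outset.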

    Does this mean the data standards being developed are redundant?
    NO, ABSOLUTELY NOT. Semantic Web technologies are very much in their infancy. Any developments of the Semantic Web for exchange of taxonomic and systematic data will need to be compatible with these standards, and in many respects will be dependent upon them for the foreseeable future. We view our efforts with BioCorder as complementary to projects like GBIF and the Taxonomic Databases Working Group (TDWG). Lessons learnt from the development of BioCorder will be relevant to organisations like GBIF and TDWG. For example, both GBIF and TDWG are already working to develop a system of Globally Unique Identifiers (GUIDs) for data concepts and values, an essential prerequisite for sharing data on the Semantic Web. BioCorder tackles this problem through the use of IBM’s Life Science Identifiers (LSIDs), a system of GUIDs that is used throughout our Taxonomic Search Engine and experimentally in some of the BioCorder modules.

    How does the Semantic Web work?
    The Semantic Web will enable machines to COMPREHEND semantic documents and data, not human speech and writings. Meaning is expressed in RDF statements, which encode data and concepts in sets of triples, each triple being rather like the subject, verb and object of an elementary sentence. These triples can be written using XML tags. In RDF, a document makes assertions that particular things (specimens, people or whatever) have properties (such as "is a species of" or "is the author of") with certain values (e.g. beetle, book). This structure turns out to be a natural way to describe the vast majority of the data processed by machines. Subject and object are each identified by a Uniform Resource Identifier (URI), just as used in a link on a Web page. (URLs, Uniform Resource Locators, are the most common type of URI.) The verbs are also identified by URIs, which enables anyone to define a new concept, a new verb, just by defining a URI for it somewhere on the Web. Computers exchange these RDF statements in specially formatted documents made available on web servers. These are easily combined by appending one file to another. Rules and relationships can be applied to RDF statements, allowing computers to find, process and organize data around these different data concepts, regardless of where the information physically resides. Importantly, the Semantic Web architecture includes a built-in layer of proof and trust, so that you can identify who says what on the Semantic Web and apply rules to establish whom you trust.
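
    The triple model described above can be sketched in a few lines of plain Python, with each statement held as a (subject, verb, object) tuple of URIs. This is only a toy model rather than an RDF library, and every URI, along with the match helper, is a hypothetical placeholder and not a real BioCorder identifier.

```python
# All URIs below are hypothetical placeholders used purely for illustration.
EX = "http://example.org/"
IS_SPECIES_OF = EX + "terms/isSpeciesOf"
IS_AUTHOR_OF = EX + "terms/isAuthorOf"

# Statements published by one (imaginary) data provider.
site_a = {
    (EX + "taxon/1", IS_SPECIES_OF, EX + "taxon/beetle"),
}
# Statements published by another provider, one of which repeats site_a's.
site_b = {
    (EX + "person/7", IS_AUTHOR_OF, EX + "book/3"),
    (EX + "taxon/1", IS_SPECIES_OF, EX + "taxon/beetle"),
}

# Combining sources is a set union, so repeated statements collapse into one.
merged = site_a | site_b


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in sorted(triples)
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


print(len(merged))  # 2
for s, p, o in match(merged, predicate=IS_SPECIES_OF):
    print(s, p, o)
```

    Because every statement is self-contained and its parts are globally named, merging data from two sources needs no agreement on document layout, only on the URIs used.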