Biomedical Text Mining

Partners: Luis Rocha, University of Irvine; Florentino Fdez-Riverola, University of Vigo.

Abstract

Biomedical Text Mining (BioTM), the field that deals with the automatic retrieval and processing of biomedical literature, is perhaps one of today's most promising research fields. The ability to link structured database information to the essentially unstructured scientific literature, and to extract additional information from it, is necessary not only for primary sequence analysis and annotation but also for the reconstruction of metabolic and regulatory networks.

In contrast to previous approaches, the computational reconstruction of biological systems requires modelling and simulating complex biological processes composed of thousands of chemical components and reactions. Biological databases may provide basic chemical and functional information, but a deep understanding of the structure, dynamics, control and design of these systems requires the retrieval and processing of both theoretical and experimental information, most of which resides in the scientific literature.

Researchers spend much of their time searching for documents relevant to particular problems, and document search and retrieval are just the first steps. Documents have to be inspected, their relevance has to be assessed and, if they are considered relevant, their most important contents have to be processed. Manual curation is time-consuming and can be quite laborious even for an expert researcher. Documents are dense with information, namely scientific terms, identifiers (e.g. EC numbers for enzymes) and other references to external databases (e.g. sequence databases). Often, the researcher is not familiar with some particular terminology and needs to search for synonyms, definitions and context information.

The ability to cross-reference data adequately has become invaluable. Scientific publishing grows at a steady rate and research goals are becoming ever more focused and complex. The need for automatic curation methods and tools is now greater than ever. BioTM draws on the accumulated experience in free-text processing, taking advantage of Information Retrieval (IR), Natural Language Processing (NLP) and Data Mining (DM) techniques. Yet the biomedical domain presents challenging problems. Biological terminology does not always follow standard nomenclatures, new terms are constantly emerging, and term homonymy and synonymy (including term variants and abbreviations) make it very difficult to identify entities accurately. In this regard, most research efforts have addressed the development of new named entity recognition methods and techniques and, more recently, the extraction of biologically relevant relations.

Aims

The motivation for the present work is two-fold:



Development efforts are focused on: