Bio Text Mining

Text Mining

Although BioTM presents many specific and challenging problems, the automated processing of free text is not a new computational issue. The Text Mining (TM) research field has long experience in the retrieval and processing of general text. The development of search engines and indexed directories, the compilation of dictionaries and the automatic translation of documents have driven the development of powerful text processing techniques. BioTM researchers have taken advantage of many of these techniques, adapting or extending them to the specifics of the domain.

Biomedical Information Retrieval

Biomedical information retrieval is mostly supported by bibliographic databases and open-access journals. Currently, MEDLINE is the largest life science and biomedical bibliographic database, containing over 16 million records. It is fully linked with factual databases of DNA sequences, protein sequences and 3D molecular structures, and with many of the publishers that provide online journals. Moreover, the National Center for Biotechnology Information (NCBI) provides access to its contents as part of PubMed, a larger collection of literature databases. The NCBI Entrez server supports online querying of PubMed as well as programmatic access through the Entrez Programming Utilities (eUtils).
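
As an illustration of external access through eUtils, the following Python sketch builds an ESearch request against PubMed and extracts the returned record identifiers. It is a minimal example rather than a complete client: the function names (build_esearch_url, parse_pmids, search_pubmed) are hypothetical, the JSON response format is requested explicitly, and real use should follow NCBI's usage guidelines on request rates and API keys.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI E-utilities; ESearch returns the IDs of matching records.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term, retmax=20):
    """Build an ESearch URL that queries PubMed for `term` (hypothetical helper)."""
    params = {
        "db": "pubmed",      # target Entrez database
        "term": term,        # Entrez query string
        "retmax": retmax,    # maximum number of IDs to return
        "retmode": "json",   # ask for a JSON response instead of XML
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def parse_pmids(payload):
    """Extract the list of PubMed IDs from an ESearch JSON payload."""
    return json.loads(payload)["esearchresult"]["idlist"]


def search_pubmed(term, retmax=20):
    """Send the query over the network and return the matching PubMed IDs."""
    with urlopen(build_esearch_url(term, retmax)) as response:
        return parse_pmids(response.read().decode("utf-8"))
```

Calling search_pubmed("p53 AND apoptosis", retmax=5) would return up to five PubMed identifiers as strings; the corresponding abstracts could then be retrieved with a further eUtils request (EFetch), which is not shown here.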

Compared to other domains where data access may not be a problem (e.g. dictionaries, newspapers or Web site contents), scientific literature is subject to strong access restrictions. Bibliographic databases often provide document abstracts, and open-access journals offer an increasing number of full-text documents, but access to most documents is still constrained by journal subscription and copyright policies.

Most BioTM research has focused on the analysis of document abstracts, which are commonly available. However, the ever-growing need for more detailed and complex information, which resides mostly in full texts, is increasing the pressure to make full documents available.

Biomedical Named Entity Recognition

Biomedical terminology poses additional challenges to named entity recognition (NER). Some naming standards exist, but they do not cover the whole range of biological entities. Moreover, standards are not commonly used in scientific publications. For example, metabolites (small molecules) and enzymes are usually referred to in the literature by common names, although there are standards for both (IUPAC for metabolites and the EC classification system for enzymes).

Term homonymy and synonymy (including term variants and abbreviations) make it very difficult to identify entities accurately. Metabolites, proteins and genes often have several terms denoting the same concept. Furthermore, within the same document, a term can appear in its extended form as well as in several variant forms. Term normalisation, i.e. mapping text occurrences to well-defined terms and their corresponding classes, is vital for cross-referencing.
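
Term normalisation can be sketched as a dictionary lookup over simplified surface forms. The Python example below is illustrative only: the lexicon, its entries and the helper names are assumptions made for this sketch, and a real system would draw its terms from curated resources. Mentions are reduced to a comparison key (lowercase, with spaces and hyphens removed) so that simple variants of a term map to the same canonical entry and class.

```python
import re

# Hypothetical mini-lexicon: canonical term -> (class, known surface variants).
LEXICON = {
    "TP53": ("gene", ["p53", "tumor protein p53", "tumour protein p53"]),
    "D-glucose": ("metabolite", ["glucose", "dextrose"]),
    "hexokinase": ("enzyme", ["hexokinase type I"]),
}


def _key(text):
    """Reduce a surface form to a comparison key: lowercase, no spaces or hyphens."""
    return re.sub(r"[\s\-]+", "", text.lower())


def build_index(lexicon):
    """Map the key of every canonical term and variant to (canonical term, class)."""
    index = {}
    for canonical, (entity_class, variants) in lexicon.items():
        for form in [canonical, *variants]:
            index[_key(form)] = (canonical, entity_class)
    return index


INDEX = build_index(LEXICON)


def normalise(mention):
    """Return (canonical term, class) for a mention, or None if it is unknown."""
    return INDEX.get(_key(mention))
```

With this lexicon, normalise("P-53") and normalise("tumour protein p53") both resolve to the TP53 entry, while an unlisted mention such as "insulin" yields None. The sketch deals only with synonymy and spelling variation; resolving homonyms would additionally require the context in which a mention occurs.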