High quality biomedical corpora (English, German,

Date created: 20/03/2014

Website or demo video presentation:


  • Jointly foreground

Asset type:

  • Knowledge resource

Technology readiness level:

  • TRL 8 – system complete and qualified

Type of collaboration seeking:

  • Business: Commercial/Marketing support


  • ClinicalTrials.gov
  • Cochrane
  • diabetes
  • drug information web sites
  • DrugBank
  • Genetics Home Reference
  • ImageCLEF 2010
  • MEDLINE abstracts
  • semantics
  • thesaurus
  • UMLS

The biomedical corpora comprise two resources: (1) a comparable corpus of indexed monolingual texts (Czech and English biomedical data), and (2) a set of parallel corpora (aligned data).

The comparable corpus (1) contains monolingual documents in several European languages, indexed with terms from existing multilingual medical taxonomies and vocabularies (such as MeSH and other sources within UMLS).

Currently, the following Czech medical documents are indexed:

  • Bibliographia Medica Čechoslovaca (BMČ)
  • MeSH in Czech

The following English medical documents are indexed:

  • ClinicalTrials.gov
  • Cochrane
  • drug information web sites
  • DrugBank
  • Genetics Home Reference
  • HON classified diabetes web sites
  • ImageCLEF 2010
  • MEDLINE abstracts
  • UMLS

These corpora are also used in the subsequent tasks (4.1b, 4.2, etc.) for the development of language models and translation systems.

The parallel corpora (2) contain automatically aligned parallel documents from the biomedical domain, intended for improving machine translation systems. Language pairs: EN-FR, EN-CS, EN-DE.
