PhD position - CINECA

Posted on: 11.03.2019
Thousands of human datasets are now being generated in healthcare and research contexts. Centralized storage and analysis of these data are no longer feasible for ethical, legal, and social reasons. Nevertheless, integrating and analyzing genomic datasets, electronic medical records (EMRs), and data obtained with mobile devices from large cohorts is a key challenge for precision medicine. Indeed, critical directions for the advancement of precision medicine include closing gaps in access to and availability of essential data and establishing data-sharing platforms and infrastructure [1]. In this context, the H2020 Common Infrastructure for National Cohorts in Europe, Canada, and Africa (CINECA) project aims to create a federated, cloud-enabled infrastructure that makes population-scale genomic and biomolecular data accessible across international borders, accelerating research and improving the health of individuals across continents.



In this PhD project, our goal is to support the discoverability and reusability of precision medicine datasets by facilitating data integration and exchange through artificial intelligence methods combined with semantic models. Datasets are often represented in free-text formats, particularly in EMRs. The ability to derive structured representations from such narrative content is key to the standardization and semantic representation of biomedical concepts and models, enabling automated processing and secondary data analyses. Structured information can be extracted at different levels of granularity. Previous and ongoing work in the biomedical domain has focused on extracting pairwise relationships, e.g., protein-protein [2], drug-drug [3], and chemical-disease [4] interactions, or more complex networks, such as action graphs describing synthesis recipes in materials science [5]. Information extracted with such methods can be enriched with meta-information, such as timestamps [6], and linked to complementary semantic resources [7], improving the discoverability and reusability of the datasets.



In this project, we specifically aim to i) develop and apply text mining methods to support the automatic population of standard minimal metadata models, ii) investigate the use of generic and specific biomedical NLP methods for structuring and enriching short and long textual datasets, and iii) analyze how the information encoded by these structured concepts varies over time within data-sharing networks. The student will work within the framework of the H2020 CINECA project.



References

1. Fröhlich H, Balling R, Beerenwinkel N, Kohlbacher O, Kumar S, Lengauer T, Maathuis MH, Moreau Y, Murphy SA, Przytycka TM, Rebhan M. From hype to reality: data science enabling personalized medicine. BMC Medicine. 2018 Dec;16(1):150.

2. Mallory EK, Zhang C, Ré C, Altman RB. Large-scale extraction of gene interactions from full-text literature using DeepDive. Bioinformatics. 2015 Sep 3;32(1):106-13.

3. Segura-Bedmar I, Martínez P, Zazo MH. SemEval-2013 task 9: Extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013). In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013) 2013 (Vol. 2, pp. 341-350).

4. Li J, Sun Y, Johnson RJ, Sciaky D, Wei CH, Leaman R, Davis AP, Mattingly CJ, Wiegers TC, Lu Z. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database. 2016 Jan 1;2016.

5. Mysore S, Kim E, Strubell E, Liu A, Chang HS, Kompella S, Huang K, McCallum A, Olivetti E. Automatically extracting action graphs from materials science synthesis procedures. arXiv preprint arXiv:1711.06872. 2017 Nov 18.

6. Zhou H, Deng H, Huang D, Zhu M. Hedge scope detection in biomedical texts: an effective dependency-based method. PLoS ONE. 2015 Jul 28;10(7):e0133715.

7. Teodoro D, Mottin L, Gobeill J, Arighi C, Ruch P. Assessing text embedding models for assigning UniProt classes to scientific literature. Proceedings of Biocuration. f1000research.com/slides/6-1673. 2017.