The LabEx EFL is hiring a postdoctoral researcher on the topic of “Distantly Supervised Relation Extraction for Scientific Texts”. The post will be supported by the “Laboratoire d’Excellence” Empirical Foundations of Linguistics, as part of a LabEx Strand 5 collaboration between two research labs: the LIPN (http://lipn.univ-paris13.fr/en/laboratory), RCLN team “Représentation des Connaissances et Langage Naturel”, and the ERTIM, “Équipe de Recherche Textes, Informatique, Multilinguisme” (http://www.er-tim.fr). These partners have already conducted several experiments on unsupervised knowledge extraction from scientific papers [7,8,9]. This postdoc is a follow-up to that collaboration.
Semantic Relation Extraction (RE) is a central task in identifying domain-specific knowledge in text and structuring it into knowledge bases. In general, a semantic relation is encoded as a triple (entity_1, r, entity_2) in which the two entities are linked by a relation r. Currently, most systems used for this task follow either an unsupervised or a supervised paradigm, each with its own advantages and disadvantages. Unsupervised methods usually rely on hand-crafted patterns, which can achieve very good precision but have limited coverage; such patterns are also easier to define for some relations than for others. Supervised methods usually obtain a better overall score (in terms of the balance between accuracy and coverage), but they require annotated data, which are expensive and slow to produce. In previous work, we explored the scope and advantages of these paradigms [8, 11]. We found that the two methods have complementary strengths and that hybridization techniques can improve their performance. These experiments were performed on the ACL-RelAcs corpus of scientific papers in NLP. The dataset was also used for a SemEval evaluation campaign on supervised scientific information extraction.
A methodology that avoids manual intervention, whether for composing rules or for annotating data, is the so-called Distant Supervision (DS). With DS, any text containing the pair of entities to be linked can constitute a training example. Recently, DS has been the focus of various works that highlight its effectiveness, especially when paired with deep learning methods [14,15,16].
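As an illustration only, the distant supervision heuristic described above can be sketched in a few lines of Python. This is a minimal sketch and not the project's method: the names `Triple`, `Example` and `distant_label`, the relation label `used_for` and the toy sentences are all hypothetical, and entity mentions are found by naive case-insensitive substring matching.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """A known relation (entity_1, r, entity_2) from an existing knowledge base."""
    head: str
    relation: str
    tail: str


class Example(NamedTuple):
    """A training example produced automatically, with no manual annotation."""
    sentence: str
    head: str
    tail: str
    relation: str


def distant_label(sentences: List[str], triples: List[Triple]) -> List[Example]:
    """Label every sentence that mentions both entities of a known triple.

    Assumption: mentions are detected by case-insensitive substring match,
    which is far cruder than a real entity recognizer.
    """
    examples = []
    for sentence in sentences:
        lowered = sentence.lower()
        for t in triples:
            if t.head.lower() in lowered and t.tail.lower() in lowered:
                examples.append(Example(sentence, t.head, t.tail, t.relation))
    return examples


# Toy data, invented for illustration only.
KB = [Triple("conditional random fields", "used_for", "named entity recognition")]
SENTENCES = [
    "We apply conditional random fields to named entity recognition.",
    "Conditional random fields and named entity recognition are both covered in the survey.",
    "Our parser is evaluated on the Penn Treebank.",
]
EXAMPLES = distant_label(SENTENCES, KB)
```

In this toy run the first two sentences are labelled and the third is skipped. The second sentence mentions both entities without expressing the relation, which is the kind of labelling noise that distantly supervised models have to cope with.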
Our research on relation extraction in scientific text has highlighted the difficulty of the RE task in this specific domain. The difficulties derive from several factors: the entities are not “named entities” as in other knowledge bases; the same entity can appear as subject or object in different relations; and relations are expressed in varied ways, sometimes spanning several sentences or being formulated very differently. Examples of such relations are “used by”, “applied to”, “improves”, etc. In our previous work we had to combine various extractors to compensate for their individual deficiencies in order to obtain good enough accuracy in scientific RE. We believe that Distant Supervision could help to improve the extraction process and eventually replace the ensemble of extractors. The postdoc will review the state of the art in Distantly Supervised Relation Extraction and, in collaboration with the team, will work towards the definition of a Distantly Supervised methodology for RE in scientific text.
Salary: between 2100 and 2300 € per month (net)
PhD in Computer Science
Experience and/or interest in:
Natural Language Processing
Text Mining and Machine Learning
Knowledge Engineering, Semantic Web
Good scientific writing skills
Python programming, knowledge of PyTorch
Duration: 12 months (between LIPN and ERTIM)
Start: from September 2022
Note: the first interviews will take place on the afternoon of 29/06/2022
Candidates should send the following to Davide Buscaldi (email@example.com) and Kata Gábor (firstname.lastname@example.org):
a detailed CV (with a list of publications)
a cover letter
the names and e-mails of two references
Agirre E., Olatz A., Hovy E.H., Martinez D. (2000) Enriching very large ontologies using the WWW. In ECAI Workshop on Ontology Learning.
Chavalarias, D. and Cointet, J.-P. (2013). Phylomemetic patterns in science evolution - the rise and fall of scientific fields. PLOS ONE, 8(2).
Fabian M. Suchanek, Mauro Sozio, Gerhard Weikum (2009). Sofie: A self-organizing framework for information extraction. In WWW conference, pp. 631–640.
Bunescu and Mooney (2005). A shortest path dependency kernel for relation extraction. In Proceedings of Empirical Methods in Natural Language Processing, EMNLP ’05, pp. 724–731.
Auger, A., & Barrière, C. (2008). Pattern-based approaches to semantic relation extraction: A state-of-the-art. In Terminology, 14(1), pp. 1-19.
Nicolas Béchet, Peggy Cellier, Thierry Charnois, Bruno Crémilleux (2012). Discovering Linguistic Patterns Using Sequence Mining. In CICLing 2012. pp. 154-165
Gábor K., Zargayouna H., Buscaldi D., Tellier I., Charnois T. (2016): Semantic Annotation of the ACL Anthology Corpus for the Automatic Analysis of Scientific Literature, LREC, Portoroz (Slovenia).
Gábor K., Zargayouna H., Buscaldi D., Tellier I., Charnois T. (2016): Unsupervised Relation Extraction in Specialized Corpora Using Sequence Mining, Advances in Intelligent Data Analysis XV (IDA 2016), LNCS 9897, pp. 237-248, Stockholm (Sweden).
Gábor K., Zargayouna H., Tellier I., Buscaldi D., Charnois T. (2016): A Typology of Semantic Relations Dedicated to Scientific Literature Analysis. SAVE-SD Workshop at the 25th World Wide Web Conference.
Gábor K., Buscaldi D., Schumann A-K., QasemiZadeh B., Zargayouna H., Charnois T.: SemEval-2018 Task 7: Semantic Relation Extraction and Classification in Scientific Papers. In Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval-2018), New Orleans, USA.
Gábor K, Zargayouna H, Tellier I, Buscaldi D, Charnois T: Exploring Vector Spaces for Semantic Relations. In: EMNLP 2017, Copenhagen, Denmark.
Dessì, D., Osborne, F., Recupero Reforgiato, D., Buscaldi, D., & Motta E. (2020). Generating knowledge graphs by employing Natural Language Processing and Machine Learning techniques within the scholarly domain. Future Generation Computer Systems, 116, pp. 253-264.
Distantly Supervised Relation Extraction using Multi-Layer Revision Network and Confidence-based Multi-Instance Learning https://aclanthology.org/2021.emnlp-main.15/
Distantly Supervised Relation Extraction with Sentence Reconstruction and Knowledge Base Priors https://arxiv.org/abs/2104.08225
Distantly Supervised Relation Extraction via Recursive Hierarchy-Interactive Attention and Entity-Order Perception https://arxiv.org/abs/2105.08213