Understanding environment textually and linguistically
Coordinator: Marie-Claude L’Homme, Linguistique et traduction, Université de Montréal
This project is supported by the Social Sciences and Humanities Research Council (SSHRC), 2013-2018
The field of the environment is highly complex, since it integrates concepts from meteorology, climatology, geology, economics, etc., and it is dealt with in a wide variety of publications (reports written by experts, newspaper articles, ideological leaflets, popularized publications, etc.). The issues raised by the field are also extremely important, and new words are created to convey them. It thus becomes very difficult for both experts and non-experts to keep track of all the changes that occur in the field.
The objective of this project is to develop methods for characterizing the contents of texts on two different levels: textual (using methods and techniques derived from corpus linguistics and text mining) and linguistic (based on lexical semantics models). The project combines theories, methods and techniques used in linguistics, information science and terminology.
First, we will develop a text typology describing and classifying environmental texts according to two different perspectives: 1. The topic dealt with (e.g. climate change, recycling, sustainable development); 2. The level of specialization (expert to expert, expert to initiate, expert to layperson, etc.). The typology will be developed for texts written in English, French and Spanish and will draw on work on text genres by Biber (1988) and Swales (1990) and on communicative situations by Pearson (1998). Then, descriptive text mining methods will be applied in order to identify important or new topics in various texts (e.g. greenhouse effect, climate warming, shale gas). This part of the project will be carried out, first, by applying unsupervised classification algorithms to texts (to group texts dealing with comparable topics), and then by extracting terms that are representative of the groups of texts produced by the classification algorithms. This procedure will allow us to discover the thematic structure that appears in the corpora. We hypothesize that topics identified by text mining techniques correspond to conceptual clusters that are important in the field of the environment.
These topics will serve as the starting point for a linguistic description of the specialized lexicon that appears in environmental texts. The linguistic description will be based on a lexical semantics theory, namely Frame Semantics (Fillmore 1982). Frames are conceptual scenarios that lexical units (LUs) evoke from different perspectives. For instance, we can hypothesize that the LUs change (n.), change (v.), fluctuate, fluctuation, vary, and variation evoke the same frame in the field. The identification of LUs that are likely to belong to the same frame will be carried out semi-automatically with TermoStat (Drouin 2003), which is a term extractor that offers different viewpoints on the lexicon contained in a corpus.
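The two-step text mining procedure described above (grouping texts by unsupervised classification, then extracting terms representative of each group) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the project does not specify an algorithm, so TF-IDF weighting with a simple centroid-based agglomerative clustering is swapped in, and the toy documents, the stopword list and all function names are hypothetical.

```python
import math
from collections import Counter

# Toy documents invented for illustration; they are not project data.
DOCS = [
    "greenhouse gas emissions drive climate warming",
    "climate warming is linked to greenhouse gas concentrations",
    "recycling plastic waste reduces landfill volume",
    "municipal recycling programs sort plastic and paper waste",
]
# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"is", "to", "and", "the", "of", "a", "in"}


def tokenize(text):
    """Lowercase, split on whitespace, drop stopwords and non-alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha() and w not in STOPWORDS]


def normalize(vec):
    """Scale a sparse vector (dict term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def tfidf_vectors(docs):
    """Return one unit-length TF-IDF vector per document (smoothed IDF)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        vectors.append(normalize(vec))
    return vectors


def centroid(vectors):
    """Unit-length sum of a group of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return normalize(total)


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def cluster(vectors, k):
    """Merge the two most similar groups (by centroid) until k groups remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = centroid([vectors[i] for i in clusters[a]])
                cb = centroid([vectors[i] for i in clusters[b]])
                sim = cosine(ca, cb)
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


def top_terms(vectors, members, n):
    """Terms with the highest centroid weight in a group of documents."""
    c = centroid([vectors[i] for i in members])
    return sorted(c, key=lambda t: (-c[t], t))[:n]


if __name__ == "__main__":
    vectors = tfidf_vectors(DOCS)
    for members in cluster(vectors, 2):
        print(members, top_terms(vectors, members, 3))
```

On a real corpus, the number of groups would not be known in advance, and the terms ranked highest in each group would only be candidates for the topics that feed the linguistic description.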
Once the frames and their lexical units are identified, we will describe them based on the methodology developed within the FrameNet project (Ruppenhofer et al. 2010). The method includes a step in which LUs and their participants are annotated in corpora. This annotation can then lead to a better description of the LUs used in texts.
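To make the annotation step more concrete, the sketch below shows one possible way of recording an annotated LU and its participants as character spans in a sentence. It is not the FrameNet data format or the project's actual annotation scheme: the class names, the frame name "Change", the participant labels "Item" and "Duration", and the example sentences are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


@dataclass(frozen=True)
class LexicalUnit:
    lemma: str
    pos: str    # part of speech, e.g. "v" or "n"
    frame: str  # name of the frame the LU is assumed to evoke


@dataclass
class Annotation:
    sentence: str
    target: LexicalUnit
    target_span: Span
    # Participant label -> span in the sentence (labels are hypothetical).
    participants: Dict[str, Span] = field(default_factory=dict)

    def text_of(self, span: Span) -> str:
        """Return the substring of the sentence covered by a span."""
        return self.sentence[span[0]:span[1]]


def span_of(sentence: str, phrase: str) -> Span:
    """Locate the first occurrence of phrase in sentence as a span."""
    start = sentence.index(phrase)
    return (start, start + len(phrase))


def lus_by_frame(annotations):
    """Group annotated LUs (as 'lemma.pos') under the frame they evoke."""
    grouped = defaultdict(set)
    for ann in annotations:
        grouped[ann.target.frame].add(f"{ann.target.lemma}.{ann.target.pos}")
    return {frame: sorted(lus) for frame, lus in grouped.items()}


# Invented example sentences; not taken from the project corpora.
S1 = "Sea levels fluctuate over long periods."
S2 = "The variation in rainfall is significant."

ANNOTATIONS = [
    Annotation(
        sentence=S1,
        target=LexicalUnit("fluctuate", "v", "Change"),
        target_span=span_of(S1, "fluctuate"),
        participants={
            "Item": span_of(S1, "Sea levels"),
            "Duration": span_of(S1, "over long periods"),
        },
    ),
    Annotation(
        sentence=S2,
        target=LexicalUnit("variation", "n", "Change"),
        target_span=span_of(S2, "variation"),
        participants={"Item": span_of(S2, "rainfall")},
    ),
]

if __name__ == "__main__":
    for ann in ANNOTATIONS:
        labels = {k: ann.text_of(v) for k, v in ann.participants.items()}
        print(ann.target.lemma, labels)
    print(lus_by_frame(ANNOTATIONS))
```

Storing spans rather than copied strings keeps each annotation tied to its position in the corpus sentence, so the same records can be used both to describe an LU and to retrieve its contexts.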
The main expected outcome of the project is a method that will serve to improve the management of information related to the field of the environment. More specifically, it will allow us to test automatic classification methods and to adapt them to a very complex field of knowledge. Finally, the lexical descriptions will be placed online in a freely available resource. Terminological, lexicographical and pedagogical applications could then be derived from them.