29th June 2025, University of Birmingham
The CL2025 pre-conference workshop day will take place at Birmingham Business School, University of Birmingham, on Sunday 29th June 2025. Workshops are typically pedagogical in nature, teaching participants useful skills in corpus linguistics such as corpus design, data collection, analytical techniques, or the use of a specific tool. To register for the workshop day and sign up for your chosen workshop(s), please see the instructions on the registration page.
Workshops will run in parallel in the morning and afternoon. Lunch and morning and afternoon refreshments will be provided.
Please note: participants are required to bring their own devices (e.g. a laptop) to participate in the workshops.
09:00-12:30 (including 30-minute break)
Nathan Dykes, Stephanie Evert, Michaela Mahlberg, Alexander Piperski
Business School G12
Maximum number of participants: 40
Concordance analysis has long been central to corpus linguistics and other text-based disciplines, including digital humanities, computational social sciences, and computer-assisted language learning. It gives researchers a systematic lens for observing and interpreting patterns of language use, integrating both quantitative and qualitative perspectives. By focusing on a single search word or phrase in a context-limited display—commonly known as a KWIC (Key Word In Context)—scholars can investigate various aspects of its usage and meaning.
Despite its wide application, concordance reading has seen little innovation to date. The most popular functions of concordance tools are still the traditional ones, such as sorting lines alphabetically by the left or right context of the node or filtering for specific words. Another challenge for concordance reading is the documentation of the research process and the methods applied, in order to ensure reproducibility. The workshop addresses these challenges by introducing both a taxonomy of concordance-reading strategies and a set of computational algorithms that build on these strategies to organize large amounts of textual data efficiently and transparently. Through hands-on exercises using the new Python library FlexiConc (https://pypi.org/project/FlexiConc/), which will be integrated into CLiC (https://clic-fiction.com/), the workshop will demonstrate how to apply robust concordance reading approaches to a variety of research contexts.
The workshop starts with an introduction to concordance analysis, including its place in the continuum of quantitative and qualitative research. We cover the most common general strategies for concordance analysis: selecting, sorting, and grouping lines, and show how each of them can aid interpretation. Participants will also learn about basic formal definitions and mathematical properties of the computational algorithms that underlie these strategies. We will discuss how algorithms extend beyond simple alphabetical ordering, opening up new possibilities for advanced text analysis.
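To make these strategies concrete, here is a minimal, self-contained sketch in plain Python (not the FlexiConc API) that extracts KWIC lines for a node word from a toy text and applies one common ordering strategy, sorting lines by the first word of the right context:

```python
# Minimal KWIC sketch (plain Python, not the FlexiConc API): extract concordance
# lines for a node word and sort them by the first word of the right context.
text = ("We considered the proposal carefully . The proposal was rejected . "
        "A new proposal will follow .")
tokens = text.split()
node = "proposal"
window = 3  # tokens of context on each side

lines = []
for i, tok in enumerate(tokens):
    if tok.lower() == node:
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        lines.append((left, tok, right))

# Ordering strategy: sort lines alphabetically by the first word of the right context
lines.sort(key=lambda line: line[2][0].lower() if line[2] else "")

for left, node_tok, right in lines:
    print(f"{' '.join(left):>30}  [{node_tok}]  {' '.join(right)}")
```

Algorithms of this kind, generalised beyond simple alphabetical ordering, are what the workshop explores in more flexible and reproducible form.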
The workshop will include a series of practical exercises, where participants first work with existing tools such as AntConc or CQPweb and then proceed to explore the functionalities of the FlexiConc library and its web interface. This library is designed to support a wide range of concordance reading strategies and to document user decisions in a systematic way. The web interface is designed to be intuitively accessible and to enable convenient interactive exploration. We introduce the concept of an ‘analysis tree’ to ensure the reproducibility and accountability of concordance research. By using a tree structure to trace the decisions taken when selecting, ordering, and grouping concordance lines, we can document not only the final results but also the process that led there. This approach fosters transparency, which is crucial for collaborative and interdisciplinary projects, as well as for replicating or extending research.
Introduction to concordance analysis: fundamentals and strategies
Participants are introduced to basic concepts of concordance analysis. After a brief definition of fundamental terms and concepts, we give an overview of concordance software and its functionalities and allow participants to explore selected example concordances. Participants will be encouraged to share their observations on linguistic patterns as they work with existing concordancing tools.
We introduce strategies for organizing concordances (different types of selecting, ordering, and grouping). Each strategy is discussed with regard to its purpose and how it may be combined with other strategies. In a hands-on exercise, participants apply different strategies to example data and compare their observations with those from the previous step, to see how the application of dedicated strategies helps with concordance organisation and enhances systematicity.
Computational algorithms
Participants are introduced to our algorithmic approach to concordance reading, which extends the basic strategies and enhances their flexibility.
In a hands-on exercise, participants try out different concordance algorithms, including complex applications such as clustering, which are not widely available in current concordance tools. They can work with the web interface of our library on a public server, so no software installation is required (but advanced participants are welcome to work directly with the Python library).
Analysis trees for research documentation
We discuss reproducibility as a central challenge for concordance analysis and how this problem can be solved with the help of the ‘analysis tree’. The tree-like display, accessible through the web interface of our library, enables users to trace and illustrate decision-making during concordance analysis.
This workshop targets an interdisciplinary audience, including students and researchers in corpus linguistics, general linguistics, computational linguistics, digital humanities, and computer-assisted language learning. We will keep the technical discussion to a manageable level to accommodate participants from both technical and non-technical backgrounds. Those interested in advanced techniques, such as more low-level concordance processing using Python, will be directed toward additional resources and follow-up materials after the session.
No software installation will be required prior to the workshop.
Yuze Sha
Business School G05
Maximum number of participants: 30
Methods that enable comprehensive corpus analysis of multimodal data are essential for advancing our understanding of social media and digital communication. Social media posts are inherently multimodal, combining semiotic resources such as texts, emojis, memes, videos, and hyperlinks. Multimodal social media discourses have increasingly attracted research attention (Bouvier and Machin 2020; Djonov and Zhao 2013), but the methodological approaches remain largely divided into two camps: qualitative (e.g., Chałupnik and Brookes 2022; Hansson and Page 2023) and quantitative (e.g., Christiansen et al. 2020), each with strengths and limitations.
Within corpus linguistics, efforts to investigate multimodal discourse on social media are still at a relatively early stage. So far, no available corpus tool is capable of analysing multimodal data as systematically and comprehensively as monomodal linguistic data, such as by identifying (in)frequent co-occurrences and generating concordances across modes. In this workshop, I demonstrate how ATLAS.ti (version 25.0.1) can be used to construct and analyse multimodal social media corpora (Sha and Malory, in press). As a computer-assisted qualitative data analysis (CAQDAS) tool, ATLAS.ti is designed to support inductive, iterative data category development in line with grounded theory principles (Page 2022), offering transparent, systematic, and replicable workflows (Woods et al. 2016). Its quantifying functions, such as Code Co-occurrence Analysis and associated visualisation tools, can be flexibly applied to examine co-occurring patterns both within and across modes.
This workshop introduces four functionalities of the software that advance corpus-assisted discourse studies of multimodal social media data. It will be especially helpful for researchers who are interested in exploring the interplay between language and other modes of communication on social media platforms but currently lack dedicated tools for doing so.
This workshop has three primary aims.
First, it will guide participants through the process of constructing a multimodal social media corpus using ATLAS.ti. This includes collecting and cleaning data from platforms such as Twitter, and addressing ethical considerations.
Second, it will demonstrate how to use ATLAS.ti for problem-oriented corpus annotation and analysis, tailored to specific research questions. The session will cover annotation scheme design and introduce methods to: (1) obtain an overview of the corpus, (2) locate patterns of mono- and multimodal (non-)co-occurrences, (3) visualise these patterns, and (4) examine them in depth by reviewing the associated multimodal concordances and their (extra-)linguistic context.
Third, the workshop will discuss the constraints involved in using ATLAS.ti for short-form multimodal social media research and propose practical strategies to address them.
By the end of the workshop, participants should have confidence in using ATLAS.ti to construct and examine their own multimodal social media corpora.
Introduction to multimodal social media corpus design
Overview of how multimodal corpora differ from linguistic monomodal corpora, as well as how multimodal social media corpora are distinct from other multimodal corpora.
Ethical issues around collecting and using social media content.
Considerations for data sampling, cleansing and annotation, with particular attention to the influence of social media affordances.
Data management in ATLAS.ti
Demonstration of how to import and manage different semiotic resources, including texts, emojis, images, hyperlinks, and videos.
Organising units of analysis (document groups, documents, quotations).
Managing project files and backups.
Guided hands-on practice.
Coding and techniques
Developing annotation schemes aligned with specific research questions.
Strategies for improving consistency and plausibility of annotation schemes.
Guided hands-on practice.
Analysing and interpreting results
Designing analytical procedures according to research questions.
Generating quantitative insights using tools such as Code-Document Analysis, Code Co-Occurrence Analysis, and the Query Tool.
Moving beyond surface patterns: integrating qualitative analysis of multimodal concordances and (extra-)linguistic features.
Discussing how to combine various tools within ATLAS.ti to support comprehensive, research-driven analyses of multimodal social media corpora.
Guided hands-on practice.
No prior experience with ATLAS.ti is required. However, familiarity with general concepts of corpus annotation and analysis would be helpful. Those who have worked primarily with monomodal linguistic corpora will benefit from learning how to include multimodal data, while those with qualitative multimodal research experience will gain insights into applying a more replicable, systematic approach to larger datasets.
ATLAS.ti is available in both desktop (Windows & macOS) and web versions, with the desktop version offering more advanced functionality. Participants are advised to bring their laptops with ATLAS.ti (version 22 or newer) pre-installed. Brief setup instructions will be provided ahead of the workshop and reiterated at the beginning of the session. Alternatively, the web version can be used for trial purposes, although it has some functional limitations.
Mark McGlashan, Charlotte-Rose Kennedy
Business School G07
Maximum number of participants: 25
Online content glamorising and promoting suicide and self-harm is prolific and presents a pressing issue for children and young people online. Such content can potentially exacerbate existing mental health issues and instigate new ones (Ofcom 2024). Recently, the UK’s Online Safety Act (2024) introduced a new criminal offence for encouraging, promoting or providing instructions for suicide, self-harm, and eating disorders, highlighting the significance of the problem.
One particularly prevalent issue is online content that promotes eating disorders such as anorexia and bulimia. ‘Pro-ana’ and ‘pro-mia’ websites contain ‘tips’ on how to engage in extreme weight loss, share images of emaciated bodies (i.e., ‘thinspiration’), encourage competitions to be thin, and seek to prevent recovery (Rouleau and van Ranson 2011). Comorbidities with eating disorders include numerous physical (e.g., cardiovascular disease) and mental health (e.g., personality, anxiety, and mood disorders; Juli et al., 2023) issues. Anorexia, for example, causes more deaths by suicide than by starvation (Eating Disorder Hope 2024). As the median age of anorexia onset is around 12 years old (Eating Recovery Centre 2024), safeguarding children and young people from such content is imperative.
To protect young people against online harms, the Department for Education (2024: 40) requires that UK schools implement filtering and monitoring software to “block harmful and inappropriate content without unreasonably impacting teaching and learning”. Many filtering and monitoring systems use ‘keyword monitoring’ to track language use on online devices to identify specific words or phrases (e.g. ‘bomb’) that correlate with specific forms of risk (e.g. violence). However, this poses some issues; filtering and monitoring software tends only to raise concerns if there is a direct match to a ‘keyword’, and the ‘keywords’ themselves are often isolated from their context(s) of use. This can lead to ‘false positives’, wherein a keyword match raises an automatic safeguarding concern (e.g. ‘bomb’) even if the use of the keyword was innocuous (e.g. ‘bath bomb’).
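As a toy illustration of this problem (not a representation of any vendor's system), the sketch below contrasts bare keyword matching with a crude check on the immediately preceding word; the keyword list and example messages are invented for illustration only:

```python
# Toy illustration of keyword monitoring and false positives (invented examples,
# not any real safeguarding product): bare keyword matching vs. a crude context check.
import re

KEYWORDS = {"bomb"}
SAFE_COLLOCATES = {"bath"}  # left-context words that signal an innocuous sense

messages = [
    "they said there is a bomb in the station",
    "I bought a lavender bath bomb yesterday",
]

def naive_match(message):
    """Flag the message if any keyword occurs anywhere, regardless of context."""
    return any(re.search(rf"\b{kw}\b", message, re.IGNORECASE) for kw in KEYWORDS)

def context_aware_match(message):
    """Flag a keyword only if no 'safe' collocate appears immediately to its left."""
    tokens = message.lower().split()
    for i, tok in enumerate(tokens):
        if tok in KEYWORDS and not (set(tokens[max(0, i - 1):i]) & SAFE_COLLOCATES):
            return True
    return False

for msg in messages:
    print(msg)
    print("  naive match:", naive_match(msg), "| context-aware:", context_aware_match(msg))
```

Corpus techniques offer far richer ways of characterising the contexts in which risk-related language occurs, which is precisely what the sandpit invites participants to explore.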
This sandpit is informed by a broader Innovate UK-funded Knowledge Transfer Partnership that partners academics with the safeguarding solutions provider Senso.cloud to improve the accuracy and context-sensitivity of ‘keyword monitoring’ in safeguarding. The sandpit centres on using innovative corpus techniques to identify important linguistic features and constructions that provide insights into language around eating disorders. Participants will be given access to a 660,611-word corpus of disclosures and discussions about eating disorders (comprising content from Reddit and Childline message board posts) and encouraged to use any corpus or computational methods, tools, and/or techniques that they see fit to explore the data. Ultimately, the sandpit will give participants the opportunity to consider how their methods and findings could be used to inform safeguarding practice and protect children and young people from online eating disorder content.
This sandpit offers participants the opportunity to engage in knowledge exchange between academia and industry by applying their skills in corpus and/or computational linguistics to a significant social issue. Short sprint workshops encourage collaboration, task prioritisation, results-driven analysis, and the development and presentation of actionable insights – all skills that are valued in industry. Participants will also be able to engage with the company directors of Senso.cloud to better understand how (corpus) linguistics can be applied in a non-academic, safeguarding context. Participants may subsequently be invited to share their findings at Senso.cloud headquarters through a series of ‘Linguistics Knowledge Transfer Sessions’, thus providing routes to impact and application (for early career researchers especially) and strengthening ties between industry and academic partners.
The aims of the workshop will be to apply corpus/computational methods to provide insights into eating disorder content by:
Fostering team working/collaboration between participants
Highlighting the range of potential methods/tools/techniques available to participants for corpus analysis
Providing opportunities for networking, collaboration, and the development of new, original research ideas pertinent to both academic and non-academic audiences
Encouraging participants to synthesise and communicate the results of analysis to a group of academic peers and industry professionals
Introduction
Welcome to the Senso.cloud Safeguarding Sandpit!
Orientation: the initial part of the workshop will introduce the sandpit leaders (McGlashan, Kennedy) and the sponsor company (Senso.cloud). This initial introduction will detail the Knowledge Exchange partnership between academics and industry, highlighting the potential non-academic applications of corpus linguistics research knowledge and skills
Aims: see above
Task details: The task will involve participants analysing a prebuilt corpus in any way(s) they choose. See below.
Disclaimers: Participants will be made aware of the challenging content of the task and relevant disclaimers will be provided ahead of and throughout the sandpit – participants should only participate if they are comfortable working directly with social media posts that discuss safeguarding concerns
Task
Analyse the corpus in any way you choose but focus on how findings from this process can be used to help safeguard children and young people from online eating disorder content
Workshop participants will be split into groups
Groups will work independently and will together agree the approach(es) to be taken when addressing the task; participants direct this activity
Data will be made available to download in advance
Presentations
Give a 5-minute group presentation on the task. Presentations might include partial findings and results – the point is to show how methods have been applied and the results that can be generated from their application. Presentations must include the following sections:
Methods: what method(s) were used in the analysis along with each methodological step taken in their application(s)
Findings: the results generated through the application of the methods chosen alongside discussion of their context(s) and implications – why are these findings interesting and how might they be used to enhance safeguarding practice? What might the ‘real world’ implications of these findings be?
Postgraduate-level knowledge of corpus and/or computational linguistics
Data will be provided ahead of time but no specific software will be provided or recommended
13:30-17:00 (including 30-minute break)
Paul Rayson, Daisy Lal, John Vidler
Business School G12
Maximum number of participants: 80
This half-day (3-hour) workshop will provide a practical, hands-on tutorial with the new version of the web-based Wmatrix corpus analysis and comparison software (https://ucrel.lancs.ac.uk/wmatrix/). Version 7 of Wmatrix is now open access for academic researchers and incorporates the open source (Apache License 2.0) Python version of the multilingual UCREL Semantic Analysis System (PyMUSAS), which automatically assigns semantic fields to words and multiword expressions in corpora (Rayson et al., 2004). Via PyMUSAS, Wmatrix7 provides support for 8 languages (https://pypi.org/project/pymusas/) and facilitates the extension of the key semantic domains method (Rayson, 2008) to those languages. Wmatrix7 represents the most significant update to the online software since the first version was presented at the ICAME 2001 conference (Louvain-la-Neuve, Belgium) and is now free to use. Wmatrix7 has a completely new indexing system implemented in the open source SQLite database, allowing indexing of tens of millions of words. The semantic lexicons used in PyMUSAS are also now freely available under the Creative Commons CC BY-NC-SA 4.0 licence (https://github.com/UCREL/Multilingual-USAS). Open access and open source tools are vital for the replicability and reproducibility of future corpus linguistics studies, and they support the explainability of annotation and analysis methods in corpus linguistics and NLP software, especially in light of the rapid uptake of new generative AI methods and large language models (LLMs), some of which are not open source or do not declare their training materials. Open tools also facilitate the exchange of methods and techniques, enabling further developments to be built on top of existing groundwork, e.g. the Australian Text Analytics Platform (Jufri & Sun, 2022) building on PyMUSAS.
New and ongoing developments and features will also be highlighted, including future integration with large-scale parallel processing using the UCREL-hex facility at Lancaster, a hybrid multiprocessor system including shared GPUs (https://www.lancaster.ac.uk/scc/research/research-facilities/hex/). Facilities like hex have been used to speed up the large-scale annotation of extreme-scale corpora dramatically, e.g. reducing the annotation time for the 1.2 billion words of the ParlaMint II corpus of comparable parliamentary data across Europe (Erjavec et al., 2024) from 18 days to around 7 hours. We will also describe further development of the English, Spanish, Dutch and Danish PyMUSAS taggers and lexicons as part of the 4D Picture project (https://4dpicture.eu/).
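For readers unfamiliar with the keyness calculation behind the key semantic domains method, the sketch below shows the standard log-likelihood comparison of an item's frequency (a word or a semantic tag) in a target corpus against a reference corpus (Rayson, 2008). It is illustrative only, not Wmatrix code, and the frequencies are made up:

```python
# Log-likelihood keyness for a single item (word or semantic tag), following the
# standard contingency-table calculation used in keyword / key domain analysis.
import math

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    expected_target = size_target * (freq_target + freq_ref) / (size_target + size_ref)
    expected_ref = size_ref * (freq_target + freq_ref) / (size_target + size_ref)
    ll = 0.0
    if freq_target > 0:
        ll += freq_target * math.log(freq_target / expected_target)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll

# e.g. a semantic tag occurring 120 times in a 50,000-word target corpus
# and 300 times in a 1,000,000-word reference corpus
print(round(log_likelihood(120, 50_000, 300, 1_000_000), 2))
```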
To provide a guided introduction to semantic annotation methods in corpus linguistics and natural language processing
To provide an introduction to the key semantic domains method and how it is operationalised in the Wmatrix7 tool along with PyMUSAS
To allow participants to explore the tools following guided tutorials and receive live direct feedback from the workshop organisers and tool developers themselves
Participants will also be given the opportunity to feed into future developments of the software via the collection of their requirements and preferences for new and adapted features in Wmatrix. They will also have the opportunity to discuss the development of PyMUSAS for new languages and to plan further collaborations.
The workshop will begin with a 30-minute overview presentation introducing the theories and methods implemented in Wmatrix and PyMUSAS. The remainder of the time will be spent on supported, hands-on work: participants will follow online tutorials to explore the tools using ready-made corpora, e.g. the UK election manifestos corpora (https://github.com/perayson/manifestos), and can also load in their own corpora for analysis.
Participants will likely have used other corpus linguistics software already, but the main methods (frequency lists, concordances, keywords, n-grams, collocations) will be introduced in the tutorials if needed. Participants with programming and command line experience will also be guided through the Python code necessary for use and integration of PyMUSAS in their own code, via Python Notebook demonstrators.
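As a flavour of what the Python Notebook demonstrators involve, the sketch below follows the usage pattern described in the PyMUSAS documentation for tagging English text via a spaCy pipeline. The model package name (en_dual_none_contextual) and the pymusas_tags token attribute are assumptions based on that documentation and may differ in the version demonstrated at the workshop:

```python
# Sketch of PyMUSAS semantic tagging via spaCy, based on the PyMUSAS docs.
# Assumes: pip install pymusas, a spaCy English model, and the PyMUSAS English
# rule-based tagger model (named 'en_dual_none_contextual' in the docs).
import spacy

# Core spaCy pipeline for tokenisation, lemmatisation and POS tagging
nlp = spacy.load("en_core_web_sm", exclude=["parser", "ner"])

# Load the PyMUSAS rule-based tagger as a separate pipeline and add it to nlp
english_tagger_pipeline = spacy.load("en_dual_none_contextual")
nlp.add_pipe("pymusas_rule_based_tagger", source=english_tagger_pipeline)

doc = nlp("The Nile is a major north-flowing river in Northeastern Africa.")

# Each token now carries USAS semantic field tags as a custom attribute
for token in doc:
    print(f"{token.text}\t{token.lemma_}\t{token.pos_}\t{token._.pymusas_tags}")
```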
Wmatrix is a web-based tool, so participants will only require an internet connection on their laptop or tablet and a modern web browser, e.g. Chrome or Firefox. PyMUSAS will be demonstrated via web-based access, but participants can bring their own laptops to install and run it locally in a Python programming environment; see https://pypi.org/project/pymusas/ for installation instructions.
Nathan Dykes, Tim Feldmüller
Business School G05
Maximum number of participants: 20
Word embeddings are vector representations of words. In order to generate them, each type in a corpus is transformed into a list of numbers through a neural network that has learned relationships between individual words and their syntagmatic contexts. This representation makes it possible to condense a vast spectrum of semantic (and morphosyntactic) information into the embeddings and to organize words in a vector space where words close to each other tend to share semantic and/or functional aspects (Bubenhofer 2020).
The analytical potential of word embeddings is complementary to that of collocations: embeddings reflect paradigmatic distributional similarity in the sense of “words that do not themselves co-occur, but whose surrounding words are often the same” (Sahlgren 2008: 43). The potential for Corpus Assisted Discourse Studies (CADS) is substantial: for instance, embeddings can help analysts to find (near-)synonyms that reflect prominent lexical fields. Training a model on a specialised target corpus can uncover associations that differ from typical language use in everyday discourse. This, in turn, can help identify covert attitudes and evaluations. For instance, one might investigate the distributional similarity of different person references to explore how similarly certain actors are represented.
While word embeddings and the language-model architectures built on them, such as BERT and GPT, have been ubiquitous in Computational Linguistics at least since the publication of Word2Vec (Mikolov et al. 2013), Corpus Linguistics and, in particular, CADS have shown little interest in word embeddings (see, however, e.g. Bubenhofer 2020; Knuchel & Bubenhofer 2023; Meier-Vieracker 2024 for German). One important reason for this is that word embeddings are usually computed in programming languages like Python. However, extensive programming knowledge is not actually necessary for computing word embeddings and carrying out basic analyses with them. Our workshop aims to fill this gap: we will provide the skills needed both to load and analyze pre-trained word embedding models and to train word embeddings on one’s own corpora.
Our workshop aims to provide an accessible introduction to applying word embeddings to Discourse Analysis. It is directed at researchers interested in applying corpus-driven quantitative methods to research questions where traditional Corpus Linguistic approaches quickly reach their limits. Applied CL, and CADS in particular, has often focused on comparing frequencies at the level of individual words, making it challenging to explore themes realised through a wide range of lexical choices. Such examples may include phenomena tied to lexical fields (e.g. metaphor domains), near-synonyms, or attitudes expressed with a wide variety of terms.
For the most part, analysts trained in linguistics explore corpora through tools that offer a graphical user interface. While these tools are convenient in terms of usability, they have limitations when it comes to incorporating more elaborate methods. At the same time, word embeddings are a valuable resource with significant potential for CADS.
While working in Python requires more introduction than a dedicated corpus platform, basic skills such as using functions and variables and processing files can be learned quickly and are transferable to various applications. Moreover, word embeddings as a specific resource are well established in other fields and are available in relatively accessible formats. The skills conveyed in this workshop are thus intended to transfer relatively readily to participants’ individual interests.
Short introduction: “What are Word Embeddings?”
We begin with a concise introduction to the concept of word embeddings, explaining how words are represented as vectors and what potential this holds for discourse analysis.
Python basics (variables, functions, key data types, reading and writing files)
Participants receive an accessible overview of essential Python concepts for working with text. We work with prepared Jupyter Notebooks, allowing the participants to apply the concepts in an accessible environment.
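As an indication of the level, a typical snippet from this part of the session might look like the following (the file name is a placeholder for any plain-text corpus file):

```python
# Read a plain-text corpus file and build a simple word frequency list.
# "my_corpus.txt" is a placeholder file name used for illustration.
from collections import Counter

with open("my_corpus.txt", encoding="utf-8") as f:
    text = f.read().lower()

tokens = text.split()        # naive whitespace tokenisation
freq = Counter(tokens)

for word, count in freq.most_common(10):
    print(word, count)
```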
Solving technical problems
We address typical technical issues that might occur during the hands-on sessions, such as installing libraries and setting up the computing environment.
Loading an existing model
We demonstrate how to load pre-trained word embedding models (one trained on a general language corpus and one on a more specialised thematic corpus) into Python, giving participants the opportunity to explore embedding spaces without the need to train a model from scratch.
Simple analyses (e.g. nearest neighbors, clustering)
Participants learn how to retrieve the nearest neighbors of a word in vector space and how to perform basic clustering or similarity analyses. They are encouraged to explore and reflect on the differences between the two provided models.
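A minimal sketch of such a nearest-neighbour query using the gensim library (one common option, not necessarily the exact setup used in the workshop; the publicly downloadable model named here is only an example, as the workshop provides its own pre-trained models):

```python
# Nearest-neighbour queries over a pre-trained embedding model with gensim.
# The model is downloaded via gensim's public downloader (requires internet, ~130 MB).
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-100")   # small general-language model

# Ten closest words to "doctor" in the vector space
for word, similarity in model.most_similar("doctor", topn=10):
    print(f"{word}\t{similarity:.3f}")

# Pairwise cosine similarity between two words
print(model.similarity("doctor", "nurse"))
```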
Training a custom model
We guide participants through the process of training a word embedding model on their own corpus (where available) or on sample data, highlighting important parameters and considerations.
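A minimal sketch of such a training run using gensim's Word2Vec implementation (again one common option; the file name and parameter values are placeholders chosen for illustration):

```python
# Training a small word2vec model on your own corpus with gensim.
# "my_corpus.txt" is a placeholder: one sentence per line, plain text.
from gensim.models import Word2Vec

with open("my_corpus.txt", encoding="utf-8") as f:
    sentences = [line.lower().split() for line in f if line.strip()]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of the embeddings
    window=5,          # syntagmatic context window
    min_count=5,       # ignore rare words
    workers=4,
    epochs=10,
)

model.save("my_embeddings.model")

# Query the nearest neighbours of some word that occurs in the corpus
some_word = model.wv.index_to_key[0]
print(some_word, model.wv.most_similar(some_word, topn=5))
```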
Presentation of studies that use word embeddings
We showcase selected studies that apply word embeddings to discourse analysis. This includes hands-on discussion of examples from research, where participants are encouraged to critically engage with the results of embedding models and reflect on the potential for their own projects.
Discussion and further practice
If there is time left, participants can explore more advanced questions and discuss their own research interests.
No previous programming or in-depth technical knowledge is necessary. We will introduce all relevant concepts as we go along. Participants should have Jupyter Notebook installed on their computers before attending, as we will use it for all practical exercises. We will provide instructions for the installation before the workshop.
Iulianna van der Lek, Giulia Pedonese, Alexander König, Martin Wynne, Francesca Frontini, Megan Bushnell
Business School G07
Maximum number of participants: 40
Recent projects and initiatives have acknowledged that there is a lack of general awareness among lecturers, students and researchers of research data management practices, including knowledge of the FAIR data principles (https://www.go-fair.org/fair-principles/) for making digital resources more Findable, Accessible, Interoperable and Reusable (e.g. the UPSKILLS project (https://upskillsproject.eu/), the EOSC Skills and Training Working Group (https://eosc.eu/opportunity-area-exp/oa5-skills-training-rewards-recognition-upscaling/), and EC Digital Skills for FAIR and Open Science (https://data.europa.eu/doi/10.2777/59065)).
Therefore, the FAIR Competence Framework for Higher Education proposes a set of core competencies for FAIR data education that universities can use to design and integrate research data management and FAIR-data-related skills into their curricula and programmes (Demchenko et al., 2021). Students, scholars, teachers and researchers from all disciplines are encouraged to acquire fundamental skills for open science, including the ability to interact effectively with federated research infrastructures and open science tools for collaborative research. To further support the integration of these skills into university curricula, an adoption handbook, “How to be FAIR with your data – A teaching and training handbook for higher education institutions” (Engelhardt et al., 2022, https://fairsfair.gitbook.io/fair-teaching-handbook), was published; it contains ready-made lesson plans on a variety of topics, including the use of repositories, data creation and reuse. In addition, the Skills4EOSC project (https://www.skills4eosc.eu/) provides an adaptable framework focusing on digital skills and using existing technologies to improve the competencies and skills of researchers. Specifically for linguistics and other humanities disciplines, teaching resources and best practice guidelines have been created in the UPSKILLS and H2IOSC projects (Degl’Innocenti et al., 2023) to show various target audiences how the CLARIN research infrastructure (https://www.clarin.eu/) can support researchers in adopting and applying the FAIR data principles in their research practices. Based on the experience acquired in these projects, the authors propose a workshop to raise awareness of the FAIR principles (and, to a lesser degree, the CARE principles for Indigenous Data Governance, https://www.gida-global.org/care) and of how they can guide corpus linguistics projects, ensuring that language research data is not only FAIR but also follows ethical research practices and supports Open Science. Hands-on demonstrations will be included using services, tools and language resources from the CLARIN research infrastructure.
This workshop will show participants how to incorporate the FAIR and CARE principles into their corpus linguistics research projects. (Compared to the more generally applicable FAIR principles, the CARE principles focus on specific research scenarios and will therefore play a less prominent role in the workshop.) The programme will consist of theoretical and hands-on exercises, including services and tools from CLARIN, a European Research Infrastructure for language as social and cultural data. Through a combination of theoretical principles, hands-on activities and case studies, the participants will learn how to identify the requirements for a linguistic resource (e.g. a linguistic corpus) to align with the FAIR and CARE principles and how to apply them in their research workflow. Finally, the workshop will contain a case study and a roleplay on how to write a Data Management Plan (DMP) for your research. The case study will focus on an early-career researcher's experience of working on a research project in corpus linguistics. The workshop participants will learn how to draft a DMP using a sample research project as an example and will get to know the Argos application (https://argos.openaire.eu/home). This tool, developed by OpenAIRE, allows scholars to write, save and export their DMPs in line with the FAIR principles and Open Access best practices.
By the end of this workshop, participants will be able to:
Identify the requirements for a resource to align with the FAIR and CARE principles
Find and use certified research data repositories for data collection, sharing and archiving
Create the outline of a research data management plan and familiarise themselves with the Argos application
Identify and use infrastructure tools for data processing and analysis
Introduction
What is CLARIN?
What are FAIR and CARE principles, and how can they be applied in corpus linguistics?
Finding and analysing linguistic resources in CLARIN
How CLARIN supports the FAIRness of data
Guided tour of CLARIN’s language data discovery portal, the Virtual Language Observatory (https://vlo.clarin.eu)
Tool examples from the Language Resource Switchboard (https://switchboard.clarin.eu/): processing a text with Weblicht (demo)
Creating a Data Management Plan (case study and discussion)
Demo of depositing, sharing and archiving your corpus data
No previous knowledge of FAIR and CARE principles is required.
An institutional login is required to access the CLARIN services (most academic accounts can be used to log into CLARIN services thanks to the CLARIN service provider federation; see https://www.clarin.eu/content/federated-identity for details). Please test this beforehand. If you encounter access issues, you can request a CLARIN account at https://user.clarin.eu/user/register.