29th June 2025, University of Birmingham
The CL2025 pre-conference workshop day will take place at Birmingham Business School, University of Birmingham, on Sunday 29th June 2025. Workshops are typically pedagogical in nature, teaching participants useful skills in corpus linguistics such as corpus design, data collection, analytical techniques, or the use of a specific tool. To register for the workshop day and sign up for your chosen workshop(s), please see the instructions on the registration page.
Workshops will run in parallel in the morning and afternoon. Lunch and morning and afternoon refreshments will be provided.
Please note: participants are required to bring their own devices (e.g. a laptop) to participate in the workshops.
09:00-12:30 (including 30-minute break)
Nathan Dykes, Stephanie Evert, Michaela Mahlberg, Alexander Piperski
Business School G12
Maximum number of participants: 40
Concordance analysis has long been central to corpus linguistics and other text-based disciplines, including digital humanities, computational social sciences, and computer-assisted language learning. It gives researchers a systematic lens for observing and interpreting patterns of language use, integrating both quantitative and qualitative perspectives. By focusing on a single search word or phrase in a context-limited display—commonly known as a KWIC (Key Word In Context)—scholars can investigate various aspects of its usage and meaning.
Despite its wide application, concordance reading has seen little innovation to date. The most popular functions of concordance tools are still the traditional ones, such as sorting lines alphabetically by the left or right context of the node or filtering for specific words. Another challenge for concordance reading is the documentation of the research process and the methods applied, in order to ensure reproducibility. The workshop addresses these challenges by introducing both a taxonomy of concordance-reading strategies and a set of computational algorithms that build on these strategies to organize large amounts of textual data efficiently and transparently. Through hands-on exercises using the new Python library FlexiConc (https://pypi.org/project/FlexiConc/), which will be integrated into CLiC (https://clic-fiction.com/), the workshop will demonstrate how to apply robust concordance reading approaches to a variety of research contexts.
The workshop starts with an introduction to concordance analysis, including its place in the continuum of quantitative and qualitative research. We cover the most common general strategies for concordance analysis: selecting, sorting, and grouping lines, and show how each of them can aid interpretation. Participants will also learn about basic formal definitions and mathematical properties of the computational algorithms that underlie these strategies. We will discuss how algorithms extend beyond simple alphabetical ordering, opening up new possibilities for advanced text analysis.
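To make these strategies concrete, here is a minimal, self-contained sketch in plain Python (not the FlexiConc API) that extracts KWIC lines for a node word from a toy text and applies one common ordering strategy, sorting lines by the first word of the right context:

```python
# Minimal KWIC sketch (plain Python, not the FlexiConc API): extract concordance
# lines for a node word and sort them by the first word of the right context.
text = ("We considered the proposal carefully . The proposal was rejected . "
        "A new proposal will follow .")
tokens = text.split()
node = "proposal"
window = 3  # tokens of context on each side

lines = []
for i, tok in enumerate(tokens):
    if tok.lower() == node:
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        lines.append((left, tok, right))

# Ordering strategy: sort lines alphabetically by the first word of the right context
lines.sort(key=lambda line: line[2][0].lower() if line[2] else "")

for left, node_tok, right in lines:
    print(f"{' '.join(left):>30}  [{node_tok}]  {' '.join(right)}")
```

Algorithms of this kind, generalised beyond simple alphabetical ordering, are what the workshop explores in more flexible and reproducible form.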
The workshop will include a series of practical exercises, where participants first work with existing tools such as AntConc or CQPweb and then proceed to explore the functionalities of the FlexiConc library and its web interface. This library is designed to support a wide range of concordance reading strategies and to document user decisions in a systematic way. The web interface is designed to be intuitively accessible and to enable convenient interactive exploration. We introduce the concept of an ‘analysis tree’ to ensure the reproducibility and accountability of concordance research. By using a tree structure to trace the decisions taken when selecting, ordering, and grouping concordance lines, we can document not only the final results but also the process that led there. This approach fosters transparency, which is crucial for collaborative and interdisciplinary projects, as well as for replicating or extending research.
Introduction to concordance analysis: fundamentals and strategies
Participants are introduced to basic concepts of concordance analysis. After a brief definition of fundamental terms and concepts, we give an overview of concordance software and its functionalities and allow participants to explore selected example concordances. Participants will be encouraged to share their observations on linguistic patterns as they work with existing concordancing tools.
We introduce strategies for organizing concordances (different types of selecting, ordering, and grouping). Each strategy is discussed with regard to its purpose and how it may be combined with other strategies. In a hands-on exercise, participants apply different strategies to example data and compare their observations with those from the previous step, to see how the application of dedicated strategies helps with concordance organisation and enhances systematicity.
Computational algorithms
Participants are introduced to our algorithmic approach to concordance reading, which extends the basic strategies and enhances their flexibility.
In a hands-on exercise, participants try out different concordance algorithms, including complex applications such as clustering, which are not widely available in current concordance tools. They can work with the web interface of our library on a public server, so no software installation is required (but advanced participants are welcome to work directly with the Python library).
Analysis trees for research documentation
We discuss reproducibility as a central challenge for concordance analysis and how this problem can be solved with the help of the ‘analysis tree’. The tree-like display, accessible through the web interface of our library, enables users to trace and illustrate decision-making during concordance analysis.
This workshop targets an interdisciplinary audience, including students and researchers in corpus linguistics, general linguistics, computational linguistics, digital humanities, and computer-assisted language learning. We will keep the technical discussion to a manageable level to accommodate participants from both technical and non-technical backgrounds. Those interested in advanced techniques, such as more low-level concordance processing using Python, will be directed toward additional resources and follow-up materials after the session.
No software installation will be required prior to the workshop.
Yuze Sha
Business School G05
Maximum number of participants: 30
Methods that enable comprehensive corpus analysis of multimodal data are essential for advancing our understanding of social media and digital communication. Social media posts are inherently multimodal, combining semiotic resources such as texts, emojis, memes, videos, and hyperlinks. Multimodal social media discourses have increasingly attracted research attention (Bouvier and Machin 2020; Djonov and Zhao 2013), but the methodological approaches remain largely divided into two camps: qualitative (e.g., Chałupnik and Brookes 2022; Hansson and Page 2023) and quantitative (e.g., Christiansen et al. 2020), each with strengths and limitations.
Within corpus linguistics, efforts to investigate multimodal discourse on social media are still at a relatively early stage. So far, no available corpus tool is capable of analysing multimodal data as systematically and comprehensively as monomodal linguistic data, such as by identifying (in)frequent co-occurrences and generating concordances across modes. In this workshop, I demonstrate how ATLAS.ti (version 25.0.1) can be used to construct and analyse multimodal social media corpora (Sha and Malory, in press). As a computer-assisted qualitative data analysis (CAQDAS) tool, ATLAS.ti is designed to support inductive, iterative data category development in line with grounded theory principles (Page 2022), offering transparent, systematic, and replicable workflows (Woods et al. 2016). Its quantifying functions, such as Code Co-occurrence Analysis and associated visualisation tools, can be flexibly applied to examine co-occurring patterns both within and across modes.
This workshop introduces four functionalities of the software that advance corpus-assisted discourse studies of multimodal social media data. It will be especially helpful for researchers who are interested in exploring the interplay between language and other modes of communication on social media platforms but currently lack dedicated tools for doing so.
This workshop has three primary aims.
First, it will guide participants through the process of constructing a multimodal social media corpus using ATLAS.ti. This includes collecting and cleaning data from platforms such as Twitter, and addressing ethical considerations.
Second, it will demonstrate how to use ATLAS.ti for problem-oriented corpus annotation and analysis, tailored to specific research questions. The session will cover annotation scheme design and introduce methods to: (1) obtain an overview of the corpus, (2) locate patterns of mono- and multimodal (non-)co-occurrences, (3) visualise these patterns, and (4) examine them in depth by reviewing the associated multimodal concordances and their (extra-)linguistic context.
Third, the workshop will discuss the constraints involved in using ATLAS.ti for short-form multimodal social media research and propose practical strategies to address them.
By the end of the workshop, participants should have confidence in using ATLAS.ti to construct and examine their own multimodal social media corpora.
Introduction to multimodal social media corpus design
Overview of how multimodal corpora differ from linguistic monomodal corpora, as well as how multimodal social media corpora are distinct from other multimodal corpora.
Ethical issues around collecting and using social media content.
Considerations for data sampling, cleansing and annotation, with particular attention to the influence of social media affordances.
Data management in ATLAS.ti
Demonstration of how to import and manage different semiotic resources, including texts, emojis, images, hyperlinks, and videos.
Organising units of analysis (document groups, documents, quotations).
Managing project files and backups.
Guided hands-on practice.
Coding and techniques
Developing annotation schemes aligned with specific research questions.
Strategies for improving consistency and plausibility of annotation schemes.
Guided hands-on practice.
Analysing and interpreting results
Designing analytical procedures according to research questions.
Generating quantitative insights using tools such as Code-Document Analysis, Code Co-Occurrence Analysis, and the Query Tool.
Moving beyond surface patterns: integrating qualitative analysis of multimodal concordances and (extra-)linguistic features.
Discussing how to combine various tools within ATLAS.ti to support comprehensive, research-driven analyses of multimodal social media corpora.
Guided hands-on practice.
No prior experience with ATLAS.ti is required. However, familiarity with general concepts of corpus annotation and analysis would be helpful. Those who have worked primarily with monomodal linguistic corpora will benefit from learning how to include multimodal data, while those with qualitative multimodal research experience will gain insights into applying a more replicable, systematic approach to larger datasets.
ATLAS.ti is available in both desktop (Windows & macOS) and web versions, with the desktop version offering more advanced functionality. Participants are advised to bring their laptops with ATLAS.ti (version 22 or newer) pre-installed. Brief setup instructions will be provided ahead of the workshop and reiterated at the beginning of the session. Alternatively, the web version can be used for trial purposes, although it has some functional limitations.
Mark McGlashan, Charlotte-Rose Kennedy
Business School G07
Maximum number of participants: 25
Online content glamorising and promoting suicide and self-harm is prolific and presents a pressing issue for children and young people online. Such content can potentially exacerbate existing mental health issues and instigate new ones (Ofcom 2024). Recently, the UK’s Online Safety Act (2024) introduced a new criminal offence for encouraging, promoting or providing instructions for suicide, self-harm, and eating disorders, highlighting the significance of the problem.
One particularly prevalent issue is online content that promotes eating disorders such as anorexia and bulimia. ‘Pro-ana’ and ‘pro-mia’ websites contain ‘tips’ on how to engage in extreme weight loss, share images of emaciated bodies (i.e., ‘thinspiration’), encourage competitions to be thin, and seek to prevent recovery (Rouleau and van Ranson 2011). Comorbidities with eating disorders include numerous physical (e.g., cardiovascular disease) and mental health (e.g., personality, anxiety, and mood disorders; Juli et al., 2023) issues. Anorexia, for example, causes more deaths by suicide than by starvation (Eating Disorder Hope 2024). As the median age of anorexia onset is around 12 years old (Eating Recovery Centre 2024), safeguarding children and young people from such content is imperative.
To protect young people against online harms, the Department for Education (2024: 40) requires that UK schools implement filtering and monitoring software to “block harmful and inappropriate content without unreasonably impacting teaching and learning”. Many filtering and monitoring systems use ‘keyword monitoring’ to track language use on online devices to identify specific words or phrases (e.g. ‘bomb’) that correlate with specific forms of risk (e.g. violence). However, this poses some issues; filtering and monitoring software tends only to raise concerns if there is a direct match to a ‘keyword’, and the ‘keywords’ themselves are often isolated from their context(s) of use. This can lead to ‘false positives’, wherein a keyword match raises an automatic safeguarding concern (e.g. ‘bomb’) even if the use of the keyword was innocuous (e.g. ‘bath bomb’).
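As a toy illustration of this problem (not a representation of any vendor's system), the sketch below contrasts bare keyword matching with a crude check on the immediately preceding word; the keyword list and example messages are invented for illustration only:

```python
# Toy illustration of keyword monitoring and false positives (invented examples,
# not any real safeguarding product): bare keyword matching vs. a crude context check.
import re

KEYWORDS = {"bomb"}
SAFE_COLLOCATES = {"bath"}  # left-context words that signal an innocuous sense

messages = [
    "they said there is a bomb in the station",
    "I bought a lavender bath bomb yesterday",
]

def naive_match(message):
    """Flag the message if any keyword occurs anywhere, regardless of context."""
    return any(re.search(rf"\b{kw}\b", message, re.IGNORECASE) for kw in KEYWORDS)

def context_aware_match(message):
    """Flag a keyword only if no 'safe' collocate appears immediately to its left."""
    tokens = message.lower().split()
    for i, tok in enumerate(tokens):
        if tok in KEYWORDS and not (set(tokens[max(0, i - 1):i]) & SAFE_COLLOCATES):
            return True
    return False

for msg in messages:
    print(msg)
    print("  naive match:", naive_match(msg), "| context-aware:", context_aware_match(msg))
```

Corpus techniques offer far richer ways of characterising the contexts in which risk-related language occurs, which is precisely what the sandpit invites participants to explore.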
This sandpit is informed by a broader Innovate UK-funded Knowledge Transfer Partnership that partners academics with the safeguarding solutions provider Senso.cloud to improve the accuracy and context-sensitivity of ‘keyword monitoring’ in safeguarding. The sandpit centres on using innovative corpus techniques to identify important linguistic features and constructions that provide insights into language around eating disorders. Participants will be given access to a 660,611-word corpus of disclosures and discussions about eating disorders (comprising content from Reddit and Childline message board posts) and encouraged to use any corpus or computational methods, tools, and/or techniques that they see fit to explore the data. Ultimately, the sandpit will give participants the opportunity to consider how their methods and findings could be used to inform safeguarding practice and protect children and young people from online eating disorder content.
This sandpit offers participants the opportunity to engage in knowledge exchange between academia and industry by applying their skills in corpus and/or computational linguistics to a significant social issue. Short sprint workshops encourage collaboration, task prioritisation, results-driven analysis, and the development and presentation of actionable insights – all skills that are valued in industry. Participants will also be able to engage with the company directors of Senso.cloud to better understand how (corpus) linguistics can be applied in a non-academic, safeguarding context. Participants may subsequently be invited to share their findings at Senso.cloud headquarters through a series of ‘Linguistics Knowledge Transfer Sessions’, thus providing routes to impact and application (for early career researchers especially) and strengthening ties between industry and academic partners.
The aims of the workshop will be to apply corpus/computational methods to provide insights into eating disorder content by:
Fostering team working/collaboration between participants
Highlighting the range of potential methods/tools/techniques available to participants for corpus analysis
Providing opportunities for networking, collaboration, and the development of new, original research ideas pertinent to both academic and non-academic audiences
Encouraging participants to synthesise and communicate the results of analysis to a group of academic peers and industry professionals
Introduction
Welcome to the Senso.cloud Safeguarding Sandpit!
Orientation: the initial part of the workshop will introduce the sandpit leaders (McGlashan, Kennedy) and the sponsor company (Senso.cloud). This initial introduction will detail the Knowledge Exchange partnership between academics and industry, highlighting the potential non-academic applications of corpus linguistics research knowledge and skills
Aims: see above
Task details: The task will involve participants analysing a prebuilt corpus in any way(s) they choose. See below.
Disclaimers: Participants will be made aware of the challenging content of the task and relevant disclaimers will be provided ahead of and throughout the sandpit – participants should only participate if they are comfortable working directly with social media posts that discuss safeguarding concerns
Task
Analyse the corpus in any way you choose but focus on how findings from this process can be used to help safeguard children and young people from online eating disorder content
Workshop participants will be split into groups
Groups will work independently and will together agree the approach(es) to be taken when addressing the task; participants direct this activity
Data will be made available to download in advance
Presentations
Give a 5-minute group presentation on the task. Presentations might include partial findings and results – the point is to show how methods have been applied and the results that can be generated from their application. Presentations must include the following sections:
Methods: what method(s) were used in the analysis along with each methodological step taken in their application(s)
Findings: the results generated through the application of the methods chosen alongside discussion of their context(s) and implications – why are these findings interesting and how might they be used to enhance safeguarding practice? What might the ‘real world’ implications of these findings be?
Postgraduate-level knowledge of corpus and/or computational linguistics
Data will be provided ahead of time but no specific software will be provided or recommended
13:30-17:00 (including 30-minute break)
Paul Rayson, Daisy Lal, John Vidler
Business School G12
Maximum number of participants: 80
This half-day (3-hour) workshop will provide a practical, hands-on tutorial with the new version of the web-based Wmatrix corpus analysis and comparison software (https://ucrel.lancs.ac.uk/wmatrix/). Version 7 of Wmatrix is now open access for academic researchers and incorporates the open source (Apache License 2.0) Python version of the multilingual UCREL Semantic Analysis System (PyMUSAS), which automatically assigns semantic fields to words and multiword expressions in corpora (Rayson et al., 2004). Via PyMUSAS, Wmatrix7 provides support for 8 languages (https://pypi.org/project/pymusas/) and facilitates the extension of the key semantic domains method (Rayson, 2008) to those languages. Wmatrix7 represents the most significant update to the online software since the first version was presented at the ICAME 2001 conference (Louvain-la-Neuve, Belgium) and is now free to use. Wmatrix7 has a completely new indexing system implemented in the open source SQLite database, allowing indexing of tens of millions of words. The semantic lexicons used in PyMUSAS are also now freely available under the Creative Commons CC BY-NC-SA 4.0 licence (https://github.com/UCREL/Multilingual-USAS). Open access and open source tools are vital for the replicability and reproducibility of future corpus linguistics studies, and they support the explainability of annotation and analysis methods in corpus linguistics and NLP software, especially in light of the rapid uptake of new generative AI methods and large language models (LLMs), some of which are not open source or do not declare their training materials. Open tools also facilitate the exchange of methods and techniques, enabling further developments to be built on top of existing groundwork, e.g. the Australian Text Analytics Platform (Jufri & Sun, 2022) building on PyMUSAS.
New and ongoing developments and features will also be highlighted, including future integration with large-scale parallel processing using the UCREL-hex facility at Lancaster, a hybrid multiprocessor system including shared GPUs (https://www.lancaster.ac.uk/scc/research/research-facilities/hex/). Facilities like hex have been used to speed up the large-scale annotation of extreme-scale corpora dramatically, e.g. reducing the annotation time for the 1.2 billion words of the ParlaMint II corpus of comparable parliamentary data across Europe (Erjavec et al., 2024) from 18 days to around 7 hours. We will also describe further development of the English, Spanish, Dutch and Danish PyMUSAS taggers and lexicons as part of the 4D Picture project (https://4dpicture.eu/).
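For readers unfamiliar with the keyness calculation behind the key semantic domains method, the sketch below shows the standard log-likelihood comparison of an item's frequency (a word or a semantic tag) in a target corpus against a reference corpus (Rayson, 2008). It is illustrative only, not Wmatrix code, and the frequencies are made up:

```python
# Log-likelihood keyness for a single item (word or semantic tag), following the
# standard contingency-table calculation used in keyword / key domain analysis.
import math

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    expected_target = size_target * (freq_target + freq_ref) / (size_target + size_ref)
    expected_ref = size_ref * (freq_target + freq_ref) / (size_target + size_ref)
    ll = 0.0
    if freq_target > 0:
        ll += freq_target * math.log(freq_target / expected_target)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll

# e.g. a semantic tag occurring 120 times in a 50,000-word target corpus
# and 300 times in a 1,000,000-word reference corpus
print(round(log_likelihood(120, 50_000, 300, 1_000_000), 2))
```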
To provide a guided introduction to semantic annotation methods in corpus linguistics and natural language processing
To provide an introduction to the key semantic domains method and how it is operationalised in the Wmatrix7 tool along with PyMUSAS
To allow participants to explore the tools following guided tutorials and receive live direct feedback from the workshop organisers and tool developers themselves
Participants will also be given the opportunity to feed into future developments of the software via the collection of their requirements and preferences for new and adapted features in Wmatrix. They will also have the opportunity to discuss the development of PyMUSAS for new languages and to plan further collaborations.
The workshop will begin with a 30-minute overview presentation introducing the theories and methods implemented in Wmatrix and PyMUSAS. The remainder of the time will be spent on supported, hands-on work: participants will follow online tutorials to explore the tools using ready-made corpora, e.g. the UK election manifestos corpora (https://github.com/perayson/manifestos), and can also load in their own corpora for analysis.
Participants will likely have used other corpus linguistics software already, but the main methods (frequency lists, concordances, keywords, n-grams, collocations) will be introduced in the tutorials if needed. Participants with programming and command line experience will also be guided through the Python code necessary for use and integration of PyMUSAS in their own code, via Python Notebook demonstrators.
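As a flavour of what the Python Notebook demonstrators involve, the sketch below follows the usage pattern described in the PyMUSAS documentation for tagging English text via a spaCy pipeline. The model package name (en_dual_none_contextual) and the pymusas_tags token attribute are assumptions based on that documentation and may differ in the version demonstrated at the workshop:

```python
# Sketch of PyMUSAS semantic tagging via spaCy, based on the PyMUSAS docs.
# Assumes: pip install pymusas, a spaCy English model, and the PyMUSAS English
# rule-based tagger model (named 'en_dual_none_contextual' in the docs).
import spacy

# Core spaCy pipeline for tokenisation, lemmatisation and POS tagging
nlp = spacy.load("en_core_web_sm", exclude=["parser", "ner"])

# Load the PyMUSAS rule-based tagger as a separate pipeline and add it to nlp
english_tagger_pipeline = spacy.load("en_dual_none_contextual")
nlp.add_pipe("pymusas_rule_based_tagger", source=english_tagger_pipeline)

doc = nlp("The Nile is a major north-flowing river in Northeastern Africa.")

# Each token now carries USAS semantic field tags as a custom attribute
for token in doc:
    print(f"{token.text}\t{token.lemma_}\t{token.pos_}\t{token._.pymusas_tags}")
```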
Wmatrix is a web-based tool, so participants will only require an internet connection on their laptop or tablet and a modern web browser, e.g. Chrome or Firefox. PyMUSAS will be demonstrated via web-based access, but participants can bring their own laptops to install and run it locally in a Python programming environment; see https://pypi.org/project/pymusas/ for installation instructions.
Nathan Dykes, Tim Feldmüller
Business School G05
Maximum number of participants: 20
Word embeddings are vector representations of words. In order to generate them, each type in a corpus is transformed into a list of numbers through a neural network that has learned relationships between individual words and their syntagmatic contexts. This representation makes it possible to condense a vast spectrum of semantic (and morphosyntactic) information into the embeddings and to organize words in a vector space where words close to each other tend to share semantic and/or functional aspects (Bubenhofer 2020).
The analytical potential of word embeddings is complementary to that of collocations: embeddings reflect paradigmatic distributional similarity in the sense of “words that do not themselves co-occur, but whose surrounding words are often the same” (Sahlgren 2008: 43). The potential for Corpus Assisted Discourse Studies (CADS) is substantial: for instance, embeddings can help analysts to find (near-)synonyms that reflect prominent lexical fields. Training a model on a specialised target corpus can uncover associations that differ from typical language use in everyday discourse. This, in turn, can help identify covert attitudes and evaluations. For instance, one might investigate the distributional similarity of different person references to explore how similarly certain actors are represented.
While word embeddings and the language-model architectures built on them, such as BERT and GPT, have been ubiquitous in Computational Linguistics at least since the publication of Word2Vec (Mikolov et al. 2013), Corpus Linguistics and, in particular, CADS have shown little interest in word embeddings (see, however, e.g. Bubenhofer 2020; Knuchel & Bubenhofer 2023; Meier-Vieracker 2024 for German). One important reason for this is that word embeddings are usually computed in programming languages like Python. However, extensive programming knowledge is not actually necessary for computing word embeddings and carrying out basic analyses with them. Our workshop aims to fill this gap: we will provide the skills needed both to load and analyze pre-trained word embedding models and to train word embeddings on one’s own corpora.
Our workshop aims to provide an accessible introduction to applying word embeddings to Discourse Analysis. It is directed at researchers interested in applying corpus-driven quantitative methods to research questions where traditional Corpus Linguistic approaches quickly reach their limits. Applied CL, and CADS in particular, has often focused on comparing frequencies at the level of individual words, making it challenging to explore themes realised through a wide range of lexical choices. Such examples may include phenomena tied to lexical fields (e.g. metaphor domains), near-synonyms, or attitudes expressed with a wide variety of terms.
For the most part, analysts trained in linguistics explore corpora through tools that offer a graphical user interface. While these tools are convenient in terms of usability, they have limitations when it comes to incorporating more elaborate methods. At the same time, word embeddings are a valuable resource with significant potential for CADS.
While working in Python requires more introduction than a dedicated corpus platform, basic skills such as using functions and variables and processing files can be learned quickly and are transferable to various applications. Moreover, word embeddings as a specific resource are well established in other fields and are available in relatively accessible formats. The skills conveyed in this workshop are thus intended to transfer relatively readily to participants’ individual interests.
Short introduction: “What are Word Embeddings?”
We begin with a concise introduction to the concept of word embeddings, explaining how words are represented as vectors and what potential this holds for discourse analysis.
Python basics (variables, functions, key data types, reading and writing files)
Participants receive an accessible overview of essential Python concepts for working with text. We work with prepared Jupyter Notebooks, allowing the participants to apply the concepts in an accessible environment.
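As an indication of the level, a typical snippet from this part of the session might look like the following (the file name is a placeholder for any plain-text corpus file):

```python
# Read a plain-text corpus file and build a simple word frequency list.
# "my_corpus.txt" is a placeholder file name used for illustration.
from collections import Counter

with open("my_corpus.txt", encoding="utf-8") as f:
    text = f.read().lower()

tokens = text.split()        # naive whitespace tokenisation
freq = Counter(tokens)

for word, count in freq.most_common(10):
    print(word, count)
```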
Solving technical problems
We address typical technical issues that might occur during the hands-on sessions, such as installing libraries and setting up the computing environment.
Loading an existing model
We demonstrate how to load pre-trained word embedding models (one trained on a general language corpus and one on a more specialised thematic corpus) into Python, giving participants the opportunity to explore embedding spaces without the need to train a model from scratch.
Simple analyses (e.g. nearest neighbors, clustering)
Participants learn how to retrieve the nearest neighbors of a word in vector space and how to perform basic clustering or similarity analyses. They are encouraged to explore and reflect on the differences between the two provided models.
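A minimal sketch of such a nearest-neighbour query using the gensim library (one common option, not necessarily the exact setup used in the workshop; the publicly downloadable model named here is only an example, as the workshop provides its own pre-trained models):

```python
# Nearest-neighbour queries over a pre-trained embedding model with gensim.
# The model is downloaded via gensim's public downloader (requires internet, ~130 MB).
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-100")   # small general-language model

# Ten closest words to "doctor" in the vector space
for word, similarity in model.most_similar("doctor", topn=10):
    print(f"{word}\t{similarity:.3f}")

# Pairwise cosine similarity between two words
print(model.similarity("doctor", "nurse"))
```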
Training a custom model
We guide participants through the process of training a word embedding model on their own corpus (where available) or on sample data, highlighting important parameters and considerations.
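A minimal sketch of such a training run using gensim's Word2Vec implementation (again one common option; the file name and parameter values are placeholders chosen for illustration):

```python
# Training a small word2vec model on your own corpus with gensim.
# "my_corpus.txt" is a placeholder: one sentence per line, plain text.
from gensim.models import Word2Vec

with open("my_corpus.txt", encoding="utf-8") as f:
    sentences = [line.lower().split() for line in f if line.strip()]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of the embeddings
    window=5,          # syntagmatic context window
    min_count=5,       # ignore rare words
    workers=4,
    epochs=10,
)

model.save("my_embeddings.model")

# Query the nearest neighbours of some word that occurs in the corpus
some_word = model.wv.index_to_key[0]
print(some_word, model.wv.most_similar(some_word, topn=5))
```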
Presentation of studies that use word embeddings
We showcase selected studies that apply word embeddings to discourse analysis. This includes hands-on discussion of examples from research, where participants are encouraged to critically engage with the results of embedding models and reflect on the potential for their own projects.
Discussion and further practice
If there is time left, participants can explore more advanced questions and discuss their own research interests.
No previous programming or in-depth technical knowledge is necessary. We will introduce all relevant concepts as we go along. Participants should have Jupyter Notebook installed on their computers before attending, as we will use it for all practical exercises. We will provide instructions for the installation before the workshop.
Iulianna van der Lek, Giulia Pedonese, Alexander König, Martin Wynne, Francesca Frontini, Megan Bushnell
Business School G07
Maximum number of participants: 40
Recent projects and initiatives have acknowledged that there is a lack of general awareness among lecturers, students and researchers of research data management practices, including knowledge of the FAIR data principles (https://www.go-fair.org/fair-principles/) for making digital resources more Findable, Accessible, Interoperable and Reusable (e.g. the UPSKILLS project (https://upskillsproject.eu/), the EOSC Skills and Training Working Group (https://eosc.eu/opportunity-area-exp/oa5-skills-training-rewards-recognition-upscaling/), and EC Digital Skills for FAIR and Open Science (https://data.europa.eu/doi/10.2777/59065)).
Therefore, the FAIR Competence Framework for Higher Education proposes a set of core competencies for FAIR data education that universities can use to design and integrate research data management and FAIR-data-related skills into their curricula and programmes (Demchenko et al., 2021). Students, scholars, teachers and researchers from all disciplines are encouraged to acquire fundamental skills for open science, including the ability to interact effectively with federated research infrastructures and open science tools for collaborative research. To further support the integration of these skills into university curricula, an adoption handbook, “How to be FAIR with your data – A teaching and training handbook for higher education institutions” (Engelhardt et al., 2022, https://fairsfair.gitbook.io/fair-teaching-handbook), was published; it contains ready-made lesson plans on a variety of topics, including the use of repositories, data creation and reuse. In addition, the Skills4EOSC project (https://www.skills4eosc.eu/) provides an adaptable framework focusing on digital skills and using existing technologies to improve the competencies and skills of researchers. Specifically for linguistics and other humanities disciplines, teaching resources and best practice guidelines have been created in the UPSKILLS and H2IOSC projects (Degl’Innocenti et al., 2023) to show various target audiences how the CLARIN research infrastructure (https://www.clarin.eu/) can support researchers in adopting and applying the FAIR data principles in their research practices. Based on the experience acquired in these projects, the authors propose a workshop to raise awareness of the FAIR principles (and, to a lesser degree, the CARE principles for Indigenous Data Governance, https://www.gida-global.org/care) and of how they can guide corpus linguistics projects, ensuring that language research data is not only FAIR but also follows ethical research practices and supports Open Science. Hands-on demonstrations will be included using services, tools and language resources from the CLARIN research infrastructure.
This workshop will show participants how to incorporate the FAIR and CARE principles into their corpus linguistics research projects. (Compared to the more generally applicable FAIR principles, the CARE principles focus on specific research scenarios and will therefore play a less prominent role in the workshop.) The programme will consist of theoretical and hands-on exercises, including services and tools from CLARIN, a European Research Infrastructure for language as social and cultural data. Through a combination of theoretical principles, hands-on activities and case studies, the participants will learn how to identify the requirements for a linguistic resource (e.g. a linguistic corpus) to align with the FAIR and CARE principles and how to apply them in their research workflow. Finally, the workshop will contain a case study and a roleplay on how to write a Data Management Plan (DMP) for your research. The case study will focus on an early-career researcher's experience of working on a research project in corpus linguistics. The workshop participants will learn how to draft a DMP using a sample research project as an example and will get to know the Argos application (https://argos.openaire.eu/home). This tool, developed by OpenAIRE, allows scholars to write, save and export their DMPs in line with the FAIR principles and Open Access best practices.
By the end of this workshop, participants will be able to:
Identify the requirements for a resource to align with the FAIR and CARE principles
Find and use certified research data repositories for data collection, sharing and archiving
Create the outline of a research data management plan and familiarise themselves with the Argos application
Identify and use infrastructure tools for data processing and analysis
Introduction
What is CLARIN?
What are FAIR and CARE principles, and how can they be applied in corpus linguistics?
Finding and analysing linguistic resources in CLARIN
How CLARIN supports the FAIRness of data
Guided tour of CLARIN’s language data discovery portal, the Virtual Language Observatory (https://vlo.clarin.eu)
Tool examples from the Language Resource Switchboard (https://switchboard.clarin.eu/): processing a text with Weblicht (demo)
Creating a Data Management Plan (case study and discussion)
Demo of depositing, sharing and archiving your corpus data
No previous knowledge of FAIR and CARE principles is required.
An institutional login is required to access the CLARIN services (most academic accounts can be used to log into CLARIN services thanks to the CLARIN service provider federation; see https://www.clarin.eu/content/federated-identity for details). Please test this beforehand. If you encounter access issues, you can request a CLARIN account at https://user.clarin.eu/user/register.