The recent revolution in AI has led to astonishing innovations in language processing and generation, as well as the rapid advancement of multimodal human-human, human-machine, and machine-machine interfaces. Large Language Models (LLMs) have now become ubiquitous, with seemingly every device and Internet service integrating them in some way. However, the AI revolution has also raised very important questions about where, when, and how these models can be used safely, reliably, and responsibly. More fundamentally, many of the computer scientists who build these models still struggle to understand their exact nature. Are LLMs simply stochastic parrots that regurgitate language from vast Internet corpora, or are they more profound in nature, perhaps capturing some of the essence that makes human language so special?
In this talk, I will explore these questions by building on insights gained from over half a century of corpus linguistics research, combined with knowledge of machine learning, natural language processing, and recent AI innovations. First, I will outline the basic architecture of LLMs, including the latest reasoning models such as DeepSeek-R1, to help us understand their strengths, weaknesses, and innate biases. Next, I will reflect on how LLMs align with models of human language processing, drawing on insights from conversation analysis, pragmatics, discourse analysis, and other areas of applied linguistics. These insights can help us to understand the potential for LLMs to uncover and explain patterns in human language within and across registers and genres. Finally, I will introduce a new platform designed to help corpus linguists engage with the very latest AI models. This platform enables researchers not only to use AI to advance their own research and teaching but also to identify the strengths and weaknesses of current AI models and perhaps guide the development of the next generation.
In corpus linguistics, we have long grappled with the notion of context. As a nebulous concept, context has become highly polysemous, varying greatly across the range of subfields of applied linguistics that now make use of corpus approaches. In light of this variability, our quest to define context has given rise to a series of seemingly perennial questions that many working in the field continue to wrestle with to this day. These include: ‘How is context defined in corpus linguistics?’, ‘To what extent might corpora be considered devoid of context?’, ‘How do we integrate context into our corpora and into our corpus linguistic analyses?’, and ‘How much context do we need to ensure a rigorous and situated analysis?’.
In this plenary, I will address these and related questions as I take a bird’s-eye view of the debates surrounding, and the various perspectives on, the issue of context in corpus linguistics. I will consider some of the different ways in which context is conceptualised in corpus linguistics, as well as how these various conceptualisations present different challenges and, as such, give rise to distinct ways of embedding context into corpus analyses (and indeed, into corpora themselves). On this basis, I will reflect critically on long-standing claims about the ineffectiveness of corpus methods for interrogating context. In doing so, I will highlight some of the factors that have underpinned such claims and, looking to the future, consider how these factors might be attenuated through the ways we talk about and report on our data and analytical approaches.
Corpora are only valuable insofar as they represent the domain a linguist intends to study (Egbert et al., 2022). Designing highly representative corpora is particularly challenging for conversation, in part because the domain of conversation is difficult to operationalize. Previous studies have investigated remarkably heterogeneous corpora that are all labeled ‘conversation’: for example, unplanned/unedited spoken language, scripted interactions, filmed interactions, phone calls with strangers, and text messages. So, how much do we as a field know about what conversation really entails? And how much do we know about what corpora of conversation really represent?
To begin answering these questions, I present a comprehensive description of the register of conversation. This description draws on prototype theory to identify prototypical conversation texts and provide quantitative information about the situational and linguistic features that are most central to conversation (Hanks, 2025). I then introduce methods to quantitatively evaluate representativeness in terms of the domain considerations of a corpus. This evaluation compares continuous ratings for the contextual features of texts within a corpus to the central situational features of the target register.
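To make the comparison described above concrete, the minimal Python sketch below illustrates one way such an evaluation could work: each text receives continuous ratings on a set of situational features, and the corpus is summarised by how far those ratings fall from a prototype profile for the target register. The feature names, rating scale, prototype values, and distance measure here are illustrative assumptions only, not the instrument or statistics used in this research.

```python
# Illustrative sketch: comparing per-text situational ratings to a register
# prototype. Features, scales, and the distance measure are hypothetical.

from statistics import mean

# Hypothetical prototype profile for conversation: mean ratings (1-6 scale)
# for a handful of situational features.
PROTOTYPE = {"interactivity": 5.8, "planning": 1.4, "shared_context": 5.2}

def distance_to_prototype(text_ratings, prototype=PROTOTYPE):
    """Mean absolute difference between a text's ratings and the prototype."""
    return mean(abs(text_ratings[f] - prototype[f]) for f in prototype)

def corpus_representativeness(corpus_ratings):
    """Summarise how close a corpus is, on average, to the prototype profile."""
    distances = [distance_to_prototype(t) for t in corpus_ratings]
    return {"mean_distance": mean(distances), "max_distance": max(distances)}

# Toy example: two texts rated on the same situational features.
sample = [
    {"interactivity": 5.5, "planning": 1.0, "shared_context": 5.0},
    {"interactivity": 3.0, "planning": 4.0, "shared_context": 2.5},
]
print(corpus_representativeness(sample))
```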
I illustrate these ideas using the Lancaster-Northern Arizona Corpus of American Spoken English (LANA-CASE), a new 10-million-word corpus of American English conversation that will be released as an open-access resource in 2026 (Hanks et al., 2024). I evaluate several novel recruitment methods for their effectiveness at sampling situationally varied conversational language from diverse participants. I examine the extent to which each sampling method, as well as LANA-CASE as a whole, represents situational variation within the register of conversation (in terms of, e.g., planning, setting, and communicative purpose) and demographic diversity in the U.S. (across ages, geographic regions, genders, and races/ethnicities). Finally, I show how, with appropriate adaptation, the methods introduced in this talk can help us better design, evaluate, and analyze corpora from any domain.
References
Egbert, J., Biber, D., & Gray, B. (2022). Designing and evaluating language corpora: A practical framework for corpus representativeness. Cambridge University Press.
Hanks, E., McEnery, T., Egbert, J., Larsson, T., Biber, D., Reppen, R., Baker, P., Brezina, V., Brookes, G., Clarke, I., & Bottini, R. (2024). Building LANA-CASE, a spoken corpus of American English conversation: Challenges and innovations in corpus compilation. Research in Corpus Linguistics, 12(2), 24-44. https://doi.org/10.32714/ricl.12.02.03
Hanks, E. (2025). Mapping out American English conversation: Central and peripheral features of intra-register variation [Doctoral dissertation, Northern Arizona University]. ProQuest Dissertations Publishing.
AI is actively transforming language learning ecologies (Godwin-Jones, 2023). Rather than replacing traditional teaching, AI is increasingly part of a hybrid, co-constructed space where human and artificial intelligences interact to support learning. As Lévy (2025) has noted, AI represents a new phase in the evolution and manipulation of symbolic systems. These systems structure our reality: what we perceive, value, and understand is deeply shaped by the symbolic codes we inherit. In a postdigital world (Rowsell & Sandor, 2025), AI is not only creating new symbolic content; it is also transforming meta-symbolic capabilities, expanding the semiotic field, helping humans navigate symbolic complexity, and reshaping how we interact with symbols.
AI is redefining teacher and learner roles, shifting from transmission to collaboration. It supports co-construction and reflective practices through dynamic interactions. Integrated into multimodal, socio-cognitive systems, AI enhances personalized learning and feedback. This evolution demands new critical, digital, and ethical literacies, as well as reimagined pedagogical designs. As educators and learners adapt to the new technology (Godwin-Jones, 2021, 2023), AI becomes a co-agent in learning, fostering environments where human and machine intelligence intersect to enrich the process of language acquisition. While AI presents challenges and ethical issues in education (McInnes, 2025), language educators and researchers have begun to consider ways of co-creating and integrating large language models with the long-standing tradition of corpus-based approaches (Pérez-Paredes & Boulton, 2025).
This plenary explores the convergence of corpus linguistics (CL) and AI as complementary paradigms that, when combined, can offer a powerful framework for reimagining data-driven learning in second language education (Curry & McEnery, 2024; Pérez-Paredes & Boulton, 2025). Drawing on recent research from the Broadening the scope of Data-driven Language (BsDDL) project and practical models of AI and corpus literacy (Pérez-Paredes, 2024), I argue for an approach that foregrounds human agency, critical thinking, and metacognitive skill development. While AI affords accessibility and immediacy, CL anchors pedagogy in empirical, attested language use. Together, they support the design of pedagogies that are interactive, ethically grounded, and responsive to 21st-century learning goals. The talk discusses areas of convergence such as critical engagement, technological fluency, self-regulated learning, and interdisciplinary skill-building.
References
Curry, N., & McEnery, T. (2024). Corpus linguistics for language teaching and learning: A research agenda. Language Teaching, 1-20.
Godwin-Jones, R. (2021). Evolving technologies for language learning. Language Learning & Technology, 25(3), 6–26.
Godwin-Jones, R. (2023). Emerging spaces for language learning: AI bots, ambient intelligence, and the metaverse. Language Learning & Technology, 27(2), 6-27.
Lévy, P. (2025). Symbolism, digital culture and artificial intelligence. RED. Revista de Educación a Distancia, 25(81).
McInnes, R. (2025, April 11). Resist the gen-AI-driven university: A call for reclaiming thought in learning and teaching. ASCILITE TELall Blog. https://blog.ascilite.org/resist-the-gen-ai-driven-university-a-call-for-reclaiming-thought-in-learning-and-teaching/
Pérez-Paredes, P. (2024). Data-driven learning in informal contexts? Embracing Broad Data-driven learning (BDDL) research. In P. Crosthwaite (Ed.), Corpora for Language Learning: Bridging the Research-Practice Divide (pp. 211–226). Routledge.
Pérez-Paredes, P., & Boulton, A. (2025). Data-driven Learning in and out of the Language Classroom. Cambridge University Press.
Rowsell, J., & Sandor, S. (2025). Literacy in Postdigital Times. In The Comfort of Screens: Literacy in Postdigital Times (pp. 26–45). Cambridge University Press.
Professor of Discourse and Persuasion
University of Sussex, UK
Associate Professor
University of Bologna, Italy
Counting has always been at the heart of corpus linguistics and is what unites us as a community. The papers at CL2025 range across many areas of linguistics, but what we have in common is that we will all be counting something, somehow. The premise of corpus linguistics is that what counts is how language is used, and that counting how language is used reveals aspects of language that are invisible to the ‘naked eye’. This invisibility may come about because a pattern is so large, so diffuse, or so small that human perception alone cannot measure it. The centrality of counting is so ingrained that we rarely step back to look directly at our ‘numerical habits’, and, in the well-established tradition of reflexivity in CL, that is exactly what we would like to take the opportunity to do in this plenary. We open up the space for reflection by asking five questions about what counts: Why do we count? What do we count? How do we count? What counts as a big number? What shouldn’t we count? Addressing these questions shows that counting is a theory-laden process because of all the decision making which underpins it and, at the same time, unveils its creative power.