Owen, Nathaniel; Shrestha, Prithvi and Bax, Stephen (2021).
URL: https://www.britishcouncil.org/exam/aptis/research...
Abstract
This project uses automated analysis software (www.textinspector.com) to research the lexical and metadiscourse thresholds, and the lexical and metadiscourse profiles, of test-takers’ writing in the British Council's Aptis Writing test, benchmarked to the Common European Framework of Reference for Languages (CEFR). A large set of Aptis writing responses (n=6,407), representing 65 countries, was analysed together with the associated score data in terms of lexis and metadiscourse. Measures and datasets used in the analysis include standard readability measures, the British National Corpus, the Corpus of Contemporary American English, the English Vocabulary Profile, the Academic Word List, and a bespoke corpus of metadiscourse markers. The purpose of the research is to strengthen the validation argument for the Aptis test through large-scale profiling of candidates’ writing performance.
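Text Inspector's metric definitions are proprietary, so the following is only a minimal Python sketch of the kind of surface lexical profiling described above, assuming simple regex tokenisation, a vowel-group syllable heuristic, and a small placeholder set standing in for the Academic Word List:

```python
import re

def count_syllables(word: str) -> int:
    """Rough syllable estimate: count contiguous vowel groups (heuristic only)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def lexical_profile(text: str, awl: set) -> dict:
    """Surface metrics of the kind the study reports: lengths, types, syllables."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    types = set(tokens)
    syllables = [count_syllables(t) for t in tokens]
    n = len(tokens) or 1  # guard against empty input
    return {
        "sentence_count": len(sentences),
        "token_count": len(tokens),
        "type_count": len(types),
        "type_token_ratio": len(types) / n,
        "syllable_count": sum(syllables),
        "words_over_two_syllables": sum(1 for s in syllables if s > 2),
        "awl_coverage": sum(1 for t in tokens if t in awl) / n,
    }

# Tiny illustrative call; the AWL subset here is purely a placeholder.
print(lexical_profile(
    "The analysis demonstrates considerable variation. Results were significant.",
    awl={"analysis", "demonstrates", "considerable", "significant"},
))
```

The real tool additionally cross-references each token against frequency bands in the BNC, COCA and the English Vocabulary Profile, which a sketch of this kind cannot reproduce without those licensed word lists.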
The findings show that lexical complexity in the Aptis Writing test changes systematically as learners’ CEFR level increases. Of the 110 Text Inspector metrics used in the study, 26 were significant across all CEFR boundaries, including measures of text length (sentence, token and type counts) and metrics of lexical sophistication (syllable count and the number of words with more than two syllables). Fourteen of the 26 metrics represent vocabulary use, and one metric of text complexity (voc-d) was also significant across all thresholds.
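Of the measures named above, voc-d is the one with a non-trivial computation: it estimates the diversity parameter D of the curve relating type-token ratio to sample size, fitted to repeated random subsamples (the vocd procedure of Malvern and Richards). Below is a minimal sketch under the standard settings (35 to 50-token samples, 100 trials each); Text Inspector's exact implementation may differ:

```python
import random
import numpy as np
from scipy.optimize import curve_fit

def ttr_curve(n, d):
    """Expected type-token ratio for a sample of n tokens, given diversity D."""
    return (d / n) * (np.sqrt(1 + 2 * n / d) - 1)

def voc_d(tokens, sizes=range(35, 51), trials=100, seed=0):
    """Fit D to the mean TTRs of random subsamples (standard vocd settings)."""
    rng = random.Random(seed)
    ns = np.array(list(sizes), dtype=float)
    ttrs = np.array([
        np.mean([len(set(rng.sample(tokens, int(n)))) / n for _ in range(trials)])
        for n in ns
    ])
    (d,), _ = curve_fit(ttr_curve, ns, ttrs, p0=[50.0])
    return d

# Illustrative call on dummy tokens (needs at least 50 tokens to sample from);
# higher D indicates greater lexical diversity.
words = ("the quick brown fox jumps over the lazy dog " * 10).split()
print(voc_d(words))
```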
The study also explores the utility of these metrics for an automated scoring engine. Twenty metrics were used to build an ordinal logistic regression model, trained on a stratified subset of the data. The model was then used to predict the CEFR band of responses in a test subset in which nationality was held constant. Lexical-use metrics derived from the Cambridge Learner Corpus (CLC) were the most successful at identifying CEFR level, and the model was most accurate in identifying A1 and C-level responses. However, it failed to reliably differentiate A2, B1 and B2 responses, suggesting that other variables not captured in this study, such as text organisation, play a significant role in human judgements. The paper concludes with recommendations for rater training based on these findings.
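The study's scoring model is not published with the abstract, but the modelling step it describes (an ordinal logistic regression over twenty metrics, trained on a stratified split and used to predict CEFR bands) can be sketched as follows. Synthetic data stands in for the Aptis responses, and statsmodels' proportional-odds implementation is an assumed library choice, not necessarily the one the authors used:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
n = 500
levels = ["A1", "A2", "B1", "B2", "C"]

# Synthetic stand-ins for the twenty Text Inspector metrics (real data not public).
X = pd.DataFrame(rng.normal(size=(n, 20)), columns=[f"metric_{i}" for i in range(20)])
# Synthetic ordered CEFR labels, loosely driven by the first metric.
y = pd.cut(X["metric_0"] + rng.normal(scale=0.5, size=n), bins=5, labels=levels)

# Stratified split, mirroring the study's stratified training subset.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Proportional-odds (ordinal logistic) model.
res = OrderedModel(y_train, X_train, distr="logit").fit(method="bfgs", disp=False)

# Predicted band = the CEFR level with the highest predicted probability.
probs = np.asarray(res.predict(X_test))
pred = np.array(levels)[probs.argmax(axis=1)]
print("accuracy:", (pred == np.asarray(y_test)).mean())
```

An ordinal model is the natural choice here because CEFR bands are ordered categories: a proportional-odds fit shares one set of metric coefficients across bands and estimates only the thresholds between them, unlike a multinomial model that would ignore the ordering.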