Symposium 1 – The CAT in the language assessments bag

Organizer: Alina A. von Davier – Duolingo, USA

Abstract: With the growth of digital technology and advances in automated test development tools, ranging from automated item generation to automated scoring, there is now an opportunity to develop innovative forms of technology-based assessments. This symposium offers an overview of how innovative computer adaptive algorithms, especially when coupled with other advanced technologies, support language tests. The four selected papers cover a bag of CATs with a wide range of specific applications that span the language itself (English and German), the country (Brazil, Germany, USA) and the supporting methodologies and technologies (from automatic item and test development to delivery). The first paper presents a computer adaptive English test from Brazil. It provides a historical perspective that reflects on the changes to the test over time. The second paper provides an overview of the Duolingo English Test automatic item generation (AIG) and CAT algorithm procedures. The third paper describes a German language test for professionals, the Goethe Test PRO, and illustrates the psychometric considerations for the test. The fourth paper introduces a new paradigm in which the AIG and CAT algorithms are blended into a dynamic assessment design. This paper builds on the existing methodologies at Duolingo but integrates them with an Elo rating system for real-time difficulty estimation. These studies illustrate the similarities and differences in CAT design across language tests and also contribute to computational psychometric research by blending the computational models behind the automatic algorithms into more traditional CAT approaches.


  • Mariana Curi (University of São Paulo, Brazil), Elias Silva de Oliveira, & Lohan Rodrigues
  • Burr Settles (Duolingo, USA), & Geoff LaFlair
  • Aron Fink (University of Frankfurt, Germany), & Katharina Klein
  • Yigal Attali (Duolingo, USA), & Alina A. von Davier

Symposium 2 – Adaptive testing in PISA: past, present and future

Organizer: Mario Piacentini – OECD, France

Abstract: Starting with the 2018 cycle, PISA uses a multi-stage adaptive testing (MSAT) algorithm to assign different test forms to students of varying ability. This initial foray into adaptive testing helped PISA address test-fairness concerns (by limiting the share of respondents who are given tests that do not allow them to demonstrate their full proficiency); eliminated the need for country-level adaptations; and achieved some reductions in measurement error, especially for students with exceptionally low or high performance. Specifically, a multi-stage design with two branching points and a non-adaptive (random probability) layer was chosen to control exposure of items (for item calibration) and manage non-statistical constraints (coverage of sub-constructs). Only preliminary estimates of item characteristics were available for the adaptive decisions, and all item parameters were re-calibrated after the adaptive administration. The lessons learned in this first experience will inform the design for PISA 2022 as well as for future cycles. Starting with PISA 2022, multiple domains are being administered in adaptive fashion, with potential for using domain inter-correlations for adaptive decisions. Several papers presented at this symposium will illustrate the challenges that the introduction of adaptive testing in PISA faced. The opening presentations (by Matthias von Davier and Leslie Rutkowski) will discuss the conceptual tension between the potential of adaptive testing and the specific goals of large-scale assessments: characterising skill distributions, relating proficiency estimates to contextual variables, and comparing these distributions across participating countries and over time. 
Following these perspectives, the next two presenters will address technical challenges encountered in the design of the current adaptive test, and how they could be overcome. In particular, Kentaro Yamamoto will explore the importance of the non-adaptive layer introduced in the PISA 2018 MSAT design for item parameter recovery, while Peter van Rijn will present a method to evaluate local independence in a multi-stage adaptive test such as PISA, where the operational, adaptive test is used for item calibration and for assessing the assumptions of the IRT models. The two final presentations (by Andreas Frey and Hua Hua Chang) will then explore the potential benefits of introducing greater adaptivity in the design through more advanced methods for constraint management, such as shadow testing (ST) and on-the-fly assembled multistage adaptive testing (OMST), and through targeted item development. Wim van der Linden will discuss the six papers, highlighting areas of consensus and drawing implications for future research and development.
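The multi-stage design described in this abstract, with two branching points and a non-adaptive (random probability) layer, can be illustrated with a minimal sketch. This is not the operational PISA algorithm: the module names, the cutoffs, and the probability `p_adaptive` are hypothetical values chosen only to show how a score-based routing rule can be mixed with random assignment.

```python
import random

def route(score: int, cutoff: int, rng: random.Random, p_adaptive: float) -> str:
    """Pick the next module at one branching point.

    With probability p_adaptive the score-based rule is followed;
    otherwise the module is drawn at random (the non-adaptive layer).
    """
    if rng.random() < p_adaptive:
        return "hard" if score >= cutoff else "easy"
    return rng.choice(["easy", "hard"])

def assign_path(stage1_score: int, stage2_score: int,
                rng: random.Random, p_adaptive: float = 0.75) -> list:
    """Return the sequence of modules for one student (two branching points)."""
    # Hypothetical cutoffs: 5 points after stage 1, 10 points after stage 2.
    stage2 = route(stage1_score, cutoff=5, rng=rng, p_adaptive=p_adaptive)
    stage3 = route(stage1_score + stage2_score, cutoff=10, rng=rng,
                   p_adaptive=p_adaptive)
    return ["core", stage2, stage3]

rng = random.Random(2018)
print(assign_path(stage1_score=7, stage2_score=6, rng=rng))
```

Because some students are routed at random, every module is also seen by students across the ability range, which is the property the abstract links to exposure control and item calibration.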


  • Matthias von Davier (Lynch School of Education, Boston College, USA)
  • Leslie Rutkowski (Indiana University Bloomington, USA), Dubravka Svetina, & David Rutkowski
  • Kentaro Yamamoto (ETS, USA), Hyo Jeong Shin, & Lale Khorramdel
  • Peter van Rijn (ETS, USA)
  • Andreas Frey (University of Frankfurt, Germany), Christoph König, & Aron Fink
  • Hua Hua Chang (Purdue University, USA), Kit-Tai Hau, Tone Wu, & Xiuxiu Tang
  • Discussant: Wim J. van der Linden (University of Twente, the Netherlands)

Symposium 3 – Standardizing the measurement of physical, mental, and social health in adults and children with or without (chronic) conditions – The Patient-Reported Outcomes Measurement Information System (PROMIS®)

Organizer: Caroline B. Terwee – Universiteit Amsterdam, the Netherlands

Abstract: There is increasing interest in healthcare in measuring physical, mental, and social health from the patients' perspective in individual patient care to obtain transparent and comparable outcomes for health care evaluations and improvement initiatives. However, health care providers do not yet measure patient-reported health outcomes consistently because of a lack of consensus on what to measure, the time investment required, and the excess of questionnaires that differ in content and quality and have incomparable scores. The Patient-Reported Outcomes Measurement Information System (PROMIS®) was developed by a collaboration between the US National Institutes of Health and eight US research institutes to create one state-of-the-art assessment system to measure patient-reported health with highly accurate, precise and short measures for use across adult and pediatric (patient) populations. A wide range of generic item banks was developed, targeting various constructs, such as pain, physical function, anxiety, depression, fatigue, sleep disturbances, and participation in social roles and activities. Item banks were developed using item response theory (IRT) methods and can be used as standard short forms (e.g. 4-, 6-, and 8-item versions), custom short forms (selection of relevant items for a specific context) and computerized adaptive tests (CAT). To make PROMIS widely available and maintain its scientific quality, a number of resources have been established. The PROMIS Health Organization (PHO) was established to maintain and encourage the application of PROMIS. PHO is a growing open membership society with education (e.g. workshops and annual conferences) and on-demand resources. The “HealthMeasures” team (Northwestern University, Chicago) and website is the official information (helpdesk) and distribution center for PROMIS, which also coordinates all translations.
The Assessment Center Application Programming Interface (API) was developed to connect any data collection software application (e.g. REDCap) with the full library of PROMIS measures, CAT software, and standardized item parameters. PROMIS CATs have been built into electronic health record systems, such as Epic, and are available through the PROMIS iPad App. Scoring manuals and interpretation guidelines were developed for research and clinical practice. Linking studies are being performed to convert PROMIS scores to scores of related commonly used questionnaires. PROMIS National Centers have been established in 19 countries. Their role is to coordinate all translation efforts, communicate the value of PROMIS to the scientific and research community, and encourage, facilitate, and support the application of PROMIS in the local country. PROMIS measures have been translated into more than 60 languages. Cross-cultural validation studies are being performed to evaluate content validity, confirm the underlying calibration model, and assess differential item functioning between language versions. These studies test the PROMIS convention of using a single set of IRT item parameters across populations and language versions to express scores on a common scale (T-score metric). The ultimate aim is to develop PROMIS into a gold-standard outcome metric for measuring patient-reported health outcomes in an efficient, precise, and comparable way across the world.
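As a small illustration of the T-score metric mentioned in this abstract: a T-score is a linear rescaling of the IRT ability estimate (theta) so that the reference population has a mean of 50 and a standard deviation of 10. The function names below are hypothetical and are not part of the Assessment Center API; the sketch only shows the conversion itself.

```python
def theta_to_t_score(theta: float) -> float:
    """Convert an IRT theta estimate (mean 0, SD 1 in the reference
    population) to the T-score metric (mean 50, SD 10)."""
    return 50.0 + 10.0 * theta

def theta_se_to_t_se(theta_se: float) -> float:
    """The standard error is rescaled by the same factor of 10."""
    return 10.0 * theta_se

# One standard deviation above the reference mean, with SE 0.3 on the theta scale.
print(theta_to_t_score(1.0), theta_se_to_t_se(0.3))
```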


  • Matthias Rose (Charité-Universitätsmedizin Berlin, Germany)
  • Felix Fischer (Charité-Universitätsmedizin Berlin, Germany)
  • Leo D. Roorda (Amsterdam Rehabilitation Research Center | Reade, Amsterdam, the Netherlands)
  • Benjamin D. Schalet (Feinberg School of Medicine, Northwestern University Chicago, USA)
  • Discussant: Ulf Kröhne (DIPF | Leibniz Institute for Research and Information in Education, Germany)

Symposium 4 – Applications of CAT across multiple fields using the Concerto platform

Organizer: David Stillwell & Luning Sun – The Psychometrics Centre at the University of Cambridge, UK

Abstract: The University of Cambridge Psychometrics Centre strives towards making online adaptive testing available to everyone. That is why we’ve created Concerto: a powerful and user-friendly platform that empowers experts and beginners alike to make better tests, with little to no coding experience required. There are minimal set-up costs, no licence fees and no limitations. Concerto harmonises the statistical power of the R programming language, the security of MySQL databases and the flexibility of HTML to deliver advanced online tests. These instruments work in unison, giving users unparalleled freedom and control over the design of their assessments. In-built algorithms for score calculation and report generation ensure a rewarding experience for participants, whatever the context. In this symposium, scholars from around the world will share their experience of developing online adaptive tests using the Concerto platform. These projects bring forward a number of successful applications of CAT across multiple fields, including healthcare measurement, music psychology, language testing, and assessment of resilience and quality of life.


  • Chris J. Sidey-Gibbons (University of Texas, USA)
  • Conrad Harrison (University of Oxford, UK), Bao Sheng Loe, Przemysław Lis, & Chris J. Sidey-Gibbons
  • Peter Harrison (Max Planck Institute for Empirical Aesthetics, Germany), & Daniel Müllensiefen
  • Eren Can Aybek (Pamukkale University, Turkey)
  • Atsushi Mizumoto (Kansai University, Japan)
  • Ecosse Lamoureaux (National University of Singapore, Singapore)

Symposium 5 – Computerized adaptive practicing

Organizer: Han L. J. van der Maas – University of Amsterdam, the Netherlands

Abstract: Computerized adaptive practicing (CAP) is a variant of computerized adaptive testing (CAT) that combines the goals of formative and summative measurement, i.e., practicing and testing. Both are essential in education. It is well known that learning skills such as arithmetic requires intensive practice adapted to the ability level of the individual (cf. zone of proximal development, deliberate practice). It is also evident that adaptive practicing requires precise assessments of ability, which is the goal of adaptive measurement. Over the last 15 years we have developed an algorithm for CAP and applied this technology in a popular online educational system used by 2000 Dutch primary schools, in which we collect about two million item responses per day in about 50 games concerning arithmetic, intelligence, and language (Dutch and English). The algorithm is based on the Elo rating system developed for chess competitions, but incorporates response time in scoring responses to items. Both item and person parameters are estimated on the fly, so that pre-testing the 60,000 items in the item bank is no longer required. In this symposium we a) explain the educational and psychological concepts underlying this approach and introduce the Elo estimation algorithm, b) describe how and why this algorithm has been optimized in 12 years of Math Garden practice, c) explain what role A/B testing plays in this optimization and how the data can be used to provide learning analytics beyond the basic IRT estimates of ability, d) discuss limitations of the Elo algorithm and provide insights into trackers of ability in a developmental (learning) context, and e) propose a new algorithm for computerized adaptive practicing that allows for unbiased statistical testing of educational and developmental hypotheses.
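The on-the-fly estimation of person and item parameters described in this abstract can be illustrated with a minimal sketch of a basic Elo update. It is a simplification, not the algorithm presented in the symposium: responses are scored only as correct or incorrect, so the response-time component is left out, and the step size `k` and the function names are assumptions made for illustration.

```python
import math

def expected_score(theta: float, beta: float) -> float:
    """Expected probability of a correct response for a person with
    ability theta on an item with difficulty beta (logistic model)."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

def elo_update(theta: float, beta: float, correct: bool, k: float = 0.3):
    """Return the updated (theta, beta) after a single response.

    The person and the item move in opposite directions, in proportion
    to the difference between the observed and the expected score.
    """
    surprise = (1.0 if correct else 0.0) - expected_score(theta, beta)
    return theta + k * surprise, beta - k * surprise

# A person and an item that both start at 0; the person answers correctly.
theta, beta = elo_update(0.0, 0.0, correct=True)
print(theta, beta)
```

Since each response updates both estimates, a new item can start at an arbitrary difficulty and drift towards a sensible value as responses accumulate, which is why a separate pre-testing phase is not needed in this kind of design.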


  • Han L. J. van der Maas (University of Amsterdam, the Netherlands)
  • Maria Bolsinova (Tilburg University, the Netherlands)
  • Abe Hofman (University of Amsterdam, the Netherlands)
  • Matthieu Brinkhuis (Utrecht University, the Netherlands)
  • Alexander Savi (Amsterdam Center for Learning Analytics, the Netherlands)
  • Discussant: Gunter Maris (ACT-Next, USA, the Netherlands)