Thrilled by Your Progress! Large Language Models (GPT-4) No Longer Struggle to Pass Assessments in Higher Education Programming Courses
This paper studies recent developments in the ability of large language models (LLMs) to pass assessments in introductory and intermediate Python programming courses at the postsecondary level. The emergence of ChatGPT sparked heated debates about its potential uses in programming classes (e.g., exercise generation, code explanation) as well as its misuses (e.g., cheating). Recent studies show that while the technology performs surprisingly well on the diverse assessment instruments employed in typical programming classes, its performance is usually not sufficient to pass the courses. The release of GPT-4 emphasized notable improvements in capabilities related to handling assessments originally designed for human test-takers. This study provides a necessary analysis in the context of the ongoing transition towards mature generative AI systems. Specifically, we report the performance of GPT-4, compared to previous generations of GPT models, on three Python courses with assessments ranging from simple multiple-choice questions (MCQs, no code involved) to complex programming projects with code bases distributed across multiple files (599 exercises overall). Additionally, we analyze the assessments that GPT-4 did not handle well to understand the current limitations of the model, as well as its capability to leverage feedback provided by an auto-grader. We found that the GPT models evolved from completely failing the assessments of a typical programming class (the original GPT-3) to confidently passing the courses with no human involvement (GPT-4). While we identified certain limitations in GPT-4's handling of MCQs and coding exercises, the rate of improvement across recent generations of GPT models strongly suggests their potential to handle almost any type of assessment widely used in higher education programming courses. These findings could be leveraged by educators and institutions to adapt the design of programming assessments and to fuel the necessary discussions about how programming classes should be updated to reflect recent technological developments. This study provides evidence that programming instructors need to prepare for a world in which an easy-to-use, widely accessible technology can be utilized by learners to collect passing scores, with no effort whatsoever, on what today counts as viable assessments of programming knowledge and skills.
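The abstract's mention of leveraging auto-grader feedback implies an iterative resubmission loop. The Python sketch below illustrates one plausible shape of such a protocol; every name in it (GradeReport, run_autograder, generate_solution, the toy add exercise) is a hypothetical stand-in for illustration, not the study's actual evaluation harness.

from dataclasses import dataclass

@dataclass
class GradeReport:
    score: float    # fraction of unit tests passed, from 0.0 to 1.0
    feedback: str   # textual feedback produced by the grader

def run_autograder(code: str) -> GradeReport:
    # Toy grader (assumed): runs the submission and checks one hidden test case.
    namespace: dict = {}
    try:
        exec(code, namespace)
        passed = namespace["add"](2, 3) == 5
    except Exception as exc:
        return GradeReport(0.0, f"Submission crashed: {exc}")
    if passed:
        return GradeReport(1.0, "All tests passed.")
    return GradeReport(0.0, "Test failed: add(2, 3) should return 5.")

def generate_solution(prompt: str) -> str:
    # Stand-in for an LLM API call (e.g., GPT-4). It "repairs" its answer
    # after seeing grader feedback, purely to illustrate the loop's mechanics.
    if "failed" in prompt or "crashed" in prompt:
        return "def add(a, b):\n    return a + b\n"
    return "def add(a, b):\n    return a - b\n"  # deliberately buggy first try

def solve_with_feedback(exercise: str, max_attempts: int = 3) -> GradeReport:
    # Resubmit up to max_attempts times, feeding grader output back into the prompt.
    prompt = exercise
    report = GradeReport(0.0, "")
    for _ in range(max_attempts):
        report = run_autograder(generate_solution(prompt))
        if report.score == 1.0:
            break
        prompt = f"{exercise}\n\nYour previous attempt failed:\n{report.feedback}"
    return report

print(solve_with_feedback("Write a function add(a, b) that returns the sum."))

The loop stops early on a full score; a real harness would presumably also cap the number of model calls and log each attempt for later analysis.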
Tue 8 Aug, 13:00 - 14:15 (Central Time, US & Canada)
13:00 (25m, Talk): Thrilled by Your Progress! Large Language Models (GPT-4) No Longer Struggle to Pass Assessments in Higher Education Programming Courses. Research Papers. Jaromir Savelka, Arav Agarwal, Marshall An, Christopher Bogart, Majd Sakr (Carnegie Mellon University)
13:25 (25m, Talk): Exploring the Responses of Large Language Models to Beginner Programmers' Help Requests. Research Papers. Arto Hellas (Aalto University), Juho Leinonen (The University of Auckland), Sami Sarsa (Aalto University), Charles Koutcheme (Aalto University), Lilja Kujanpää (Aalto University), Juha Sorva (Aalto University)
13:50 (25m, Talk): From "Ban It Till We Understand It" to "Resistance is Futile": How University Programming Instructors Plan to Adapt as More Students Use AI Code Generation and Explanation Tools such as ChatGPT and GitHub Copilot. Research Papers.