Evaluating Distance Metrics for Program Repairs (ICER 2023 - Research Papers)

Who

Charles Koutcheme, Sami Sarsa, Juho Leinonen, Lassi Haaranen, Arto Hellas

Track

ICER 2023 Research Papers

Time Zone

The program is currently displayed in (GMT-05:00) Central Time (US & Canada).

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Thu 10 Aug 2023, 16:05 - 16:30, in the session "Coding and Learning"

Abstract

Background and Context: Struggling with programming assignments while learning to program is a common phenomenon in programming courses around the world. Supporting struggling students is a common theme in Computing Education Research (CER), where a wide variety of support methods have been created and evaluated. An important stream of research here focuses on program repair, where methods for automatically fixing erroneous code are used to support students as they debug their code. Work in this area has so far assessed the performance of these methods by evaluating the closeness of the proposed fixes to the original erroneous code. The evaluations have mainly relied on edit distance measures such as the sequence edit distance, and there is a lack of research on which distance measure is the most appropriate.

Objectives: In the present work, our objective is to provide insight into measures for quantifying the distance between erroneous code written by a student and a proposed change. We conduct the evaluation in an introductory programming context, where insight into the distance measures can help in choosing a suitable metric to inform which fixes should be suggested to novices.

Method: A team of five experts annotated a subset of the Dublin dataset, creating solutions for over a thousand erroneous programs written by students. We evaluated how the prominent edit distance measures from the CER literature compare against measures used in Natural Language Processing (NLP) tasks for retrieving the experts’ solutions from a pool of proposed solutions. We also evaluated how the expert-generated solutions compare against the solutions proposed by common program repair algorithms. The annotated dataset and the evaluation code are published as part of the work.

Findings: Our results highlight that the ROUGE score, classically used for evaluating the performance of machine summarization tasks, performs well as an evaluation and selection metric for program repair. We also highlight the practical utility of NLP metrics, which allow an easier interpretation and comparison of the performance of repair techniques when compared to the classic methods used in the CER literature.

Implications: Our study highlights the variety of distance metrics used for comparing source code. We find issues with the classically used distance measures that can be mitigated by using NLP metrics. Based on our findings, we recommend including NLP metrics, and in particular the ROUGE metric, in evaluations when considering new program repair methodologies. We also suggest incorporating NLP metrics into other areas where source code is compared, including plagiarism detection.
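
To make the two families of measures in the abstract concrete, the following is a minimal sketch, not the authors' published evaluation code. It compares a token-level sequence edit distance with a ROUGE-L style F-score (based on the longest common subsequence) between a buggy program and two candidate repairs. The abstract does not say which ROUGE variant or tokenizer was used, so the choice of ROUGE-L, the naive `tokenize` helper, and the example programs are all assumptions made for illustration.

```python
import re
from typing import List, Sequence


def tokenize(code: str) -> List[str]:
    """Naive tokenizer (assumption): identifiers, integers, or single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance over token sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: Sequence[str], reference: Sequence[str]) -> float:
    """ROUGE-L style F1: harmonic mean of LCS-based precision and recall."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical student submission and two hypothetical candidate repairs.
buggy = tokenize("def add(a, b):\n    return a - b")
small_fix = tokenize("def add(a, b):\n    return a + b")
rewrite = tokenize("def add(x, y):\n    total = x + y\n    return total")

for name, repair in [("small_fix", small_fix), ("rewrite", rewrite)]:
    print(name, edit_distance(buggy, repair), round(rouge_l_f1(repair, buggy), 3))
```

Both measures prefer the single-token fix here, but they live on different scales. The edit distance is an unbounded count that grows with program length, while the ROUGE-style score is normalized to the range 0 to 1, which is one reading of the abstract's point that NLP metrics are easier to interpret and compare across programs.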

Charles Koutcheme, Aalto University, Finland

Sami Sarsa, Aalto University, Finland

Juho Leinonen, The University of Auckland, New Zealand

Lassi Haaranen, Aalto University, Finland

Arto Hellas, Aalto University, Finland

Session Program

Thu 10 Aug
Displayed time zone: Central Time (US & Canada)

15:15 - 16:30	Coding and Learning (Research Papers). Session Chair: Neil Brown

15:15 25m Talk		Investigating the Impact of On-Demand Code Examples on Novices' Open-Ended Programming Experience (Research Papers). Wengran Wang, John Bacher, Amy Isvik, Ally Limke, Sandeep Sthapit, Yang Shi, Benyamin Tabarsi, Keith Tran, Veronica Catete, Tiffany Barnes, Chris Martens, Thomas Price (all North Carolina State University)
15:40 25m Talk		An Empirical Evaluation of Live Coding in CS1 (Research Papers). Anshul Shah (University of California, San Diego), Emma Hogan (University of California, San Diego), Vardhan Agarwal (University of California, San Diego), John Driscoll (University of California, San Diego), Leo Porter (University of California San Diego), William G. Griswold (University of California San Diego), Adalbert Gerald Soosai Raj (University of California, San Diego)
16:05 25m Talk		Evaluating Distance Metrics for Program Repairs (Research Papers). Charles Koutcheme (Aalto University), Sami Sarsa (Aalto University), Juho Leinonen (The University of Auckland), Lassi Haaranen (Aalto University), Arto Hellas (Aalto University)