
Evaluation of the code quality generated by Generative AI: Dataset

This is the repository of all of the data used in the paper "Evaluation of the code quality generated by Generative AI", from the prompts given to the generative AI models to their answers and the evaluation results.

Instructions for recreating the experiment

Process Overview

  1. Data Collection

    • A dataset of problems is collected from LeetCode (you can use the problems we chose, available here).
    • Two subsets of problems are created: an experiment dataset with 30 problems and a benchmark dataset with 10 problems. We used systematic sampling with an equal share of each difficulty level (a sketch is shown after this overview), but you can use any sampling method you like.
  2. Code Generation

    • The AI models used in the benchmark are ChatGPT 4.0, Microsoft Copilot (free version), Claude 3.0, and Google Gemini.
    • Each model generates code solutions for the problems in the benchmark problem dataset.
  3. Generated Code Compilation

    • The generated code from each model is compiled into one file per problem (40 files in total, 10 for each model).
  4. Code Quality Evaluation

    • The generated code is evaluated on three primary metrics (a scoring sketch for the first two follows this overview):
      • Code correctness: measured on a five-level scale

        • Correct: the code passes all test cases.
        • Partially correct: the code has no errors, but the output differs from the expected output in at most 50% of the test cases.
        • Incorrect: the code has no errors, but the output differs from the expected output in more than 50% of the test cases.
        • Compilation error: the submitted code cannot be compiled.
        • Runtime error: the code fails in at least one test case due to a runtime error (e.g. division by zero).
      • Code readability: measured using a checklist of predefined rules applied to the generated code.

        • The rules:
          • A line should not be longer than 120 characters.
          • Functions should not be longer than 20 lines.
          • Functions should have no more than three arguments.
          • There should be no nested loops more than one level deep.
          • There should be no more than one statement per line.
          • Consistent naming conventions for variables, functions and classes.
          • Clear and descriptive comments explaining the purpose and logic of the code.
          • Correct indentation and formatting for better visual organization.
          • Meaningful and self-documenting variable and function names.
          • Adherence to coding style guides and Python-specific conventions.
      • The final grade is calculated as the number of criteria the code meets divided by the total number of criteria, multiplied by 100.

      • Code efficiency: measured by theoretically analyzing the running-time and space complexity of the algorithm, without executing the code. The result is expressed in big-O notation, which represents an upper bound on the algorithm's growth rate.
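
For step 1, a minimal sketch of the systematic, difficulty-balanced sampling is shown below. It assumes a hypothetical problems.csv with title and difficulty columns; none of these names come from the repository, and any other sampling approach works just as well.

```python
# Sketch only: systematic sampling with an equal share per difficulty level.
# The file name and column names ("problems.csv", "difficulty") are assumptions.
import pandas as pd

def systematic_sample(group: pd.DataFrame, n: int) -> pd.DataFrame:
    """Systematic sampling: keep every k-th problem of an ordered group."""
    k = max(len(group) // n, 1)
    return group.iloc[::k].head(n)

problems = pd.read_csv("problems.csv")

# Experiment dataset: 10 problems per difficulty level -> 30 problems.
experiment = pd.concat(
    systematic_sample(group, 10)
    for _, group in problems.groupby("difficulty")
)

# Benchmark dataset: 10 problems drawn from what is left, again balanced.
remaining = problems.drop(experiment.index)
benchmark = pd.concat(
    systematic_sample(group, 4)
    for _, group in remaining.groupby("difficulty")
).head(10)
```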

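For step 4, the correctness rubric and the readability grade can be expressed as small helpers. The sketch below is purely illustrative (all names are ours) and is not the scoring procedure shipped with the repository.

```python
# Sketch only: the five-level correctness rubric and the checklist grade.
# All function and variable names here are illustrative.

def correctness_level(compiles, runtime_error, passed, total):
    """Map raw test results onto the five-level correctness scale."""
    if not compiles:
        return "compilation error"
    if runtime_error:
        return "runtime error"
    if passed == total:
        return "correct"
    if (total - passed) / total <= 0.5:  # mismatch in at most 50% of cases
        return "partially correct"
    return "incorrect"

def readability_grade(rules_met):
    """Number of checklist rules met over the total, scaled to 0-100."""
    return 100.0 * sum(rules_met) / len(rules_met)

# Example: a solution that satisfies 8 of the 10 rules scores 80.0.
print(readability_grade([True] * 8 + [False] * 2))
```
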
Our benchmark results: code correctness, code readability, code complexity

  5. Ranking and Comparison
  • Choose the top two models across all of the evaluation metrics and continue with the main experiment on the larger sample (30 problems).
  6. Continue with the same steps as before and, at the end, conduct statistical testing: a t-test for readability and a Mann-Whitney U test for correctness and complexity.

    - The t-test is easily done in Excel.

    - The Mann-Whitney U test can be done in Python (code for code complexity, code for correctness).
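
The links above point to the repository's own scripts; for orientation, both tests are also available in SciPy. The sketch below uses made-up placeholder scores, not the results reported in the paper.

```python
# Sketch only: the two statistical tests via SciPy. The score lists below
# are placeholders, not the data from the paper.
from scipy import stats

# Readability grades (0-100) for two models, one value per problem.
model_a_readability = [80, 90, 70, 100, 90, 80, 70, 90, 100, 80]
model_b_readability = [70, 80, 60, 90, 80, 70, 60, 80, 90, 70]
t_stat, t_p = stats.ttest_ind(model_a_readability, model_b_readability)

# Correctness levels coded ordinally (e.g. 1 = compilation error ... 5 = correct);
# the same call works for the complexity rankings.
model_a_correctness = [5, 5, 4, 5, 3, 5, 4, 5, 5, 4]
model_b_correctness = [4, 3, 4, 5, 3, 4, 3, 4, 5, 3]
u_stat, u_p = stats.mannwhitneyu(model_a_correctness, model_b_correctness,
                                 alternative="two-sided")

print(f"t-test:       t = {t_stat:.3f}, p = {t_p:.3f}")
print(f"Mann-Whitney: U = {u_stat:.3f}, p = {u_p:.3f}")
```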
