
Evaluation of the code quality generated by Generative AI: Dataset

This is the repository of all of the data used in the paper "Evaluation of the code quality generated by Generative AI", from the prompts given to the generative AI models to their answers and the evaluation results.

Instructions for recreating the experiment

Process Overview

  1. Data Collection

    • A dataset of problems is collected from LeetCode (you can use the problems we chose, available here).
    • Two subsets of problems are created: an experiment dataset with 30 problems and a benchmark dataset with 10 problems. We used systematic sampling with an equal share of each difficulty level (a sketch is shown after this overview), but you can use any sampling method you like.
  2. Code Generation

    • The AI models used in the benchmark are ChatGPT 4.0, Microsoft Copilot (free version), Claude 3.0, and Google Gemini.
    • Each model generates code solutions for the problems in the benchmark problem dataset.
  3. Generated Code Compilation

    • The generated code from each model is compiled into one file per problem (40 files in total, 10 for each model).
  4. Code Quality Evaluation

    • The generated code is evaluated on three primary metrics (a scoring sketch for the first two follows this overview):
      • Code correctness: measured on a five-level scale

        • Correct: the code passes all test cases.
        • Partially correct: the code has no errors, but the output differs from the expected output in at most 50% of the test cases.
        • Incorrect: the code has no errors, but the output differs from the expected output in more than 50% of the test cases.
        • Compilation error: the submitted code cannot be compiled.
        • Runtime error: the code fails in at least one test case due to a runtime error (e.g. division by zero).
      • Code readability: measured using a checklist of predefined rules applied to the generated code.

        • The rules:
          • A line should not be longer than 120 characters.
          • Functions should not be longer than 20 lines.
          • Functions should have no more than three arguments.
          • There should be no nested loops more than one level deep.
          • There should be no more than one statement per line.
          • Consistent naming conventions for variables, functions and classes.
          • Clear and descriptive comments explaining the purpose and logic of the code.
          • Correct indentation and formatting for better visual organization.
          • Meaningful and self-documenting variable and function names.
          • Adherence to coding style guides and Python-specific conventions.
      • The final grade is calculated as the number of criteria the code meets divided by the total number of criteria, multiplied by 100.

      • Code efficiency: measured by theoretically analyzing the running-time and space complexity of the algorithm, without executing the code. The result is expressed in big-O notation, which represents an upper bound on the algorithm's growth rate.
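
For step 1, a minimal sketch of the systematic, difficulty-balanced sampling is shown below. It assumes a hypothetical problems.csv with title and difficulty columns; none of these names come from the repository, and any other sampling approach works just as well.

```python
# Sketch only: systematic sampling with an equal share per difficulty level.
# The file name and column names ("problems.csv", "difficulty") are assumptions.
import pandas as pd

def systematic_sample(group: pd.DataFrame, n: int) -> pd.DataFrame:
    """Systematic sampling: keep every k-th problem of an ordered group."""
    k = max(len(group) // n, 1)
    return group.iloc[::k].head(n)

problems = pd.read_csv("problems.csv")

# Experiment dataset: 10 problems per difficulty level -> 30 problems.
experiment = pd.concat(
    systematic_sample(group, 10)
    for _, group in problems.groupby("difficulty")
)

# Benchmark dataset: 10 problems drawn from what is left, again balanced.
remaining = problems.drop(experiment.index)
benchmark = pd.concat(
    systematic_sample(group, 4)
    for _, group in remaining.groupby("difficulty")
).head(10)
```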

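For step 4, the correctness rubric and the readability grade can be expressed as small helpers. The sketch below is purely illustrative (all names are ours) and is not the scoring procedure shipped with the repository.

```python
# Sketch only: the five-level correctness rubric and the checklist grade.
# All function and variable names here are illustrative.

def correctness_level(compiles, runtime_error, passed, total):
    """Map raw test results onto the five-level correctness scale."""
    if not compiles:
        return "compilation error"
    if runtime_error:
        return "runtime error"
    if passed == total:
        return "correct"
    if (total - passed) / total <= 0.5:  # mismatch in at most 50% of cases
        return "partially correct"
    return "incorrect"

def readability_grade(rules_met):
    """Number of checklist rules met over the total, scaled to 0-100."""
    return 100.0 * sum(rules_met) / len(rules_met)

# Example: a solution that satisfies 8 of the 10 rules scores 80.0.
print(readability_grade([True] * 8 + [False] * 2))
```
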
Our benchmark results: code correctness, code readability, code complexity

  5. Ranking and Comparison
  • Choose the top two models across all of the evaluation metrics and continue with the main experiment on the larger sample (30 problems).
  6. Continue with the same steps as before and, at the end, conduct statistical testing: a t-test for readability and a Mann-Whitney U test for correctness and complexity.

    - The t-test is easily done in Excel.

    - The Mann-Whitney U test can be done in Python (code for code complexity, code for correctness).
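
The links above point to the repository's own scripts; for orientation, both tests are also available in SciPy. The sketch below uses made-up placeholder scores, not the results reported in the paper.

```python
# Sketch only: the two statistical tests via SciPy. The score lists below
# are placeholders, not the data from the paper.
from scipy import stats

# Readability grades (0-100) for two models, one value per problem.
model_a_readability = [80, 90, 70, 100, 90, 80, 70, 90, 100, 80]
model_b_readability = [70, 80, 60, 90, 80, 70, 60, 80, 90, 70]
t_stat, t_p = stats.ttest_ind(model_a_readability, model_b_readability)

# Correctness levels coded ordinally (e.g. 1 = compilation error ... 5 = correct);
# the same call works for the complexity rankings.
model_a_correctness = [5, 5, 4, 5, 3, 5, 4, 5, 5, 4]
model_b_correctness = [4, 3, 4, 5, 3, 4, 3, 4, 5, 3]
u_stat, u_p = stats.mannwhitneyu(model_a_correctness, model_b_correctness,
                                 alternative="two-sided")

print(f"t-test:       t = {t_stat:.3f}, p = {t_p:.3f}")
print(f"Mann-Whitney: U = {u_stat:.3f}, p = {u_p:.3f}")
```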
