
Feature/code eval metrics #29

Merged — 7 commits merged into main on Feb 16, 2024
Conversation

@yisz (Contributor) commented on Feb 16, 2024

Deterministic Code Eval metrics

  • string match (Exact and Fuzzy match score)
  • Python AST similarity
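The string-match idea can be sketched roughly as follows. This is a minimal illustration using the standard library's difflib; the function name and returned key names are invented here, and the actual continuous-eval implementation may compute the fuzzy score differently:

```python
import difflib

def code_string_match(answer: str, ground_truth: str) -> dict:
    """Hypothetical sketch: exact match is binary, fuzzy match is a 0-1 ratio."""
    exact = float(answer == ground_truth)
    fuzzy = difflib.SequenceMatcher(None, answer, ground_truth).ratio()
    return {"Exact_Match_Score": exact, "Fuzzy_Match_Score": round(fuzzy, 2)}

scores = code_string_match("def f(x): return x + 1", "def f(y): return y + 1")
print(scores["Exact_Match_Score"], scores["Fuzzy_Match_Score"])
```

The two snippets differ only in a variable name, so the exact score is 0.0 while the fuzzy score stays close to 1.0, which is the motivation for shipping both.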

🚀 This PR description was created by Ellipsis for commit 967273e.

Summary:

This PR introduces CodeStringMatch and PythonASTSimilarity metrics for code evaluation, includes their tests, updates the relevant documentation, and makes unrelated changes to docs/README.md.

Key points:

  • Introduced CodeStringMatch and PythonASTSimilarity metrics in continuous_eval/metrics/code_deterministic_metrics.py.
  • Imported these metrics in continuous_eval/metrics/__init__.py.
  • Added tests for these metrics in tests/code_metrics_test.py.
  • Updated documentation in /docs/src/content/docs/metrics/Code/Deterministic/python_ast_similarity.md and /docs/src/content/docs/metrics/Code/Deterministic/string_match.md.
  • Made unrelated changes to docs/README.md.

Generated with ❤️ by ellipsis.dev

@yisz requested a review from pantonante on February 16, 2024 at 02:06
@ellipsis-dev (bot) left a comment:
Changes requested. Reviewed entire PR up to commit 665be35

Reviewed 394 lines of code across 4 files in 1 minute and 10 seconds.

See details
  • Skipped files: 2 (please contact us to request support for these files): poetry.lock, pyproject.toml
  • Confidence threshold: 50%
  • Drafted 0 additional comments.
  • Workflow ID: wflow_SZUmwxIbI5H7NJ0C

Something looks wrong? You can customize Ellipsis by editing the ellipsis.yaml for this repository.

Generated with ❤️ by ellipsis.dev

Modifications: Adjusted to be used in the context of generated code evaluation
'''

def _compare_ASTs(ast_a: AST, ast_b: AST, reorder_depth: int) -> int:
ellipsis-dev (bot) commented:
The docstring for the _compare_ASTs method says it returns a boolean, but the method actually returns an integer. Please update the docstring to reflect the actual return type.

@yisz (Contributor, Author) replied:
@ellipsis-dev can you fix these docstrings for me?

ellipsis-dev (bot) replied:
@yisz, I have addressed your comments in pull request #30

"""
return len(list(ast.walk(root)))

def _apply_weights_to_subtrees(weight: float, subtree: AST) -> float:
ellipsis-dev (bot) commented:
The weights applied in the _apply_weights_to_subtrees method are hardcoded and not explained. Consider defining these weights as constants at the top of the file, with comments explaining their purpose and how they were determined.

]

similarity_scores = [
PythonASTSimilarity._compare_subtrees(answer_subtree, ground_truth_subtree, 1000)[0]
ellipsis-dev (bot) commented:
The reorder_depth parameter is hardcoded as 1000 in the calculate method. This could potentially lead to performance issues for large ASTs. Consider making this a configurable parameter.
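A possible refactor along the lines the reviewer suggests is sketched here. The class name follows the PR, but the constructor parameter is hypothetical and the real class takes other arguments as well:

```python
class PythonASTSimilarity:
    DEFAULT_REORDER_DEPTH = 1000  # the value previously hardcoded in calculate()

    def __init__(self, reorder_depth: int = DEFAULT_REORDER_DEPTH):
        # Smaller depths trade matching accuracy for speed on large ASTs.
        self.reorder_depth = reorder_depth

fast_metric = PythonASTSimilarity(reorder_depth=50)  # cheaper on large ASTs
default_metric = PythonASTSimilarity()
print(fast_metric.reorder_depth, default_metric.reorder_depth)
```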

yisz and others added 2 commits February 15, 2024 21:16
…_mj7A

[Ellipsis] Change return type of `_compare_ASTs` function in docstring
@ellipsis-dev (bot) left a comment:
Changes requested. Incremental review on commit 24518d0

Reviewed 401 lines of code across 3 files in 3 minutes and 1 second.

See details
  • Skipped files: 0
  • Confidence threshold: 50%
  • Drafted 2 additional comments.
  • Workflow ID: wflow_Kw60igPcehGZ3deG

These comments were drafted by Ellipsis, but were filtered out of the final review. They're included here so you can see our internal thought process and help you configure your ellipsis.yaml.

Drafted 2 additional comments

Comment at continuous_eval/metrics/code_deterministic_metrics.py:304

Consider logging the SyntaxError message for easier debugging when parsing the answer and ground truth code.

Comment at continuous_eval/metrics/code_deterministic_metrics.py:292

Consider making the reorder_depth in the _compare_many method configurable through method parameters or class attributes for greater flexibility.



class PythonASTSimilarity(Metric):
"""
The following functions are adapted from python-ast-comparison by Pedro Salazar Paredes
Copyright (c) 2023 Pedro Salazar Paredes
ellipsis-dev (bot) commented:
The copyright year in the docstring seems to be incorrect. It is mentioned as 2023, which is a future year. Please correct it.

answer_tree = ast.parse(answer, mode="exec")
ground_truth_trees = [ast.parse(gt, mode="exec") for gt in ground_truths]
except SyntaxError as e:
return {"Python_AST_Similarity": -1.0}
ellipsis-dev (bot) commented:
Returning a Python_AST_Similarity score of -1.0 when a SyntaxError is caught could be misleading. Consider returning None or raising an exception to indicate that an error occurred.
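The distinction the reviewer is drawing can be sketched like this. The wrapper function is hypothetical, and the placeholder comparison stands in for the real similarity computation:

```python
import ast
from typing import Dict, Optional

def ast_similarity_or_none(answer: str,
                           ground_truth: str) -> Optional[Dict[str, float]]:
    """Return None on unparsable input so callers can tell 'error' from 'low score'."""
    try:
        answer_tree = ast.parse(answer, mode="exec")
        truth_tree = ast.parse(ground_truth, mode="exec")
    except SyntaxError:
        return None  # unambiguous, unlike a sentinel score of -1.0
    # Placeholder: the real metric compares the parsed trees structurally.
    score = 1.0 if ast.dump(answer_tree) == ast.dump(truth_tree) else 0.0
    return {"Python_AST_Similarity": score}

print(ast_similarity_or_none("def broken(:", "x = 1"))
```

A sentinel of -1.0 can silently flow into averages and rankings; None (or an exception) forces the caller to handle the failure explicitly.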

"""
return len(list(ast.walk(root)))

def _apply_weights_to_subtrees(self, weight: float, subtree: ast.AST) -> float:
ellipsis-dev (bot) commented:
Consider making the weights applied to the subtrees in the _apply_weights_to_subtrees method configurable through method parameters or class attributes for greater flexibility.

@ellipsis-dev (bot) left a comment:

Looks good to me! Incremental review on commit 0ab1174

Reviewed 121 lines of code across 3 files in 1 minute and 38 seconds.

See details
  • Skipped files: 1 (please contact us to request support for these files): docs/astro.config.mjs
  • Confidence threshold: 50%
  • Drafted 3 additional comments.
  • Workflow ID: wflow_2wBmE1swc6G3R4Zo


Drafted 3 comments under confidence threshold

Filtered comment at /docs/src/content/docs/metrics/Code/Deterministic/python_ast_similarity.md:11

Confidence changes required: 50%

Commentary: The documentation for the new metrics seems to be well written and clear. The examples provided are helpful and the definitions are concise. However, I would like to see the actual implementation of these metrics to ensure they are correctly implemented and match the descriptions provided in the documentation.

The documentation is clear and concise. Please ensure the implementation matches the descriptions provided here.

Filtered comment at /docs/src/content/docs/metrics/Code/Deterministic/string_match.md:9

Confidence changes required: 50%

Commentary: The documentation for the StringMatch metric is also clear and concise. The examples provided are helpful and the definitions are concise. However, I would like to see the actual implementation of these metrics to ensure they are correctly implemented and match the descriptions provided in the documentation.

The documentation is clear and concise. Please ensure the implementation matches the descriptions provided here.

Filtered comment at /docs/src/content/docs/metrics/Code/LLM-Based/llm-based.md:9

Confidence changes required: 33%

Commentary: The LLM-based Code Metrics documentation does not provide any details about the metrics. It only mentions that they are coming soon. This is not an issue, but it would be helpful to have a brief description of what these metrics will measure and how they will be used.

Consider providing a brief description of what the LLM-Based Code Evaluation Metrics will measure and how they will be used, even if they are not yet implemented.



@ellipsis-dev (bot) left a comment:

Looks good to me! Incremental review on commit 515d087

Reviewed 54 lines of code across 1 file in 1 minute and 19 seconds.

See details
  • Skipped files: 0
  • Confidence threshold: 50%
  • Drafted 1 additional comment.
  • Workflow ID: wflow_RjiYXYv77PQIjqqx


Drafted 1 additional comment

Comment at docs/README.md:1

This change removes a lot of information about the project structure. Consider keeping this information or ensuring it's available elsewhere in the documentation.



@yisz yisz merged commit 967273e into main Feb 16, 2024
2 checks passed
@ellipsis-dev (bot) left a comment:

Looks good to me! Incremental review on commit 967273e

Reviewed 606 lines of code across 8 files in 1 minute and 25 seconds.

See details
  • Skipped files: 3 (please contact us to request support for these files): docs/astro.config.mjs, poetry.lock, pyproject.toml
  • Confidence threshold: 50%
  • Drafted 1 additional comment.
  • Workflow ID: wflow_5bztvcJN8lNxyPn6


Drafted 1 additional comment

Comment at continuous_eval/metrics/__init__.py:33

It's a good practice to end a file with a newline. Please add a newline at the end of this file.



@pantonante pantonante deleted the feature/code-eval-metrics branch February 19, 2024 01:20
2 participants