Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#9570] Fix fixtures in fixtures/subfolders throwing parsing error #9714

Merged
merged 5 commits into from
Mar 13, 2024

Conversation

slothkong
Copy link
Contributor

@slothkong slothkong commented Feb 29, 2024

Resolves #9570

Problem

Given that:

  1. The user has a valid unit test definition using csv fixtures
  2. The user stores the csv fixtures under tests/fixtures directory
  3. The execution of dbt build succeeds (target/manifest.json is created)
  4. The execution of dbt test also succeeds
  5. The user moves the csv fixtures to a subdirectory, such as tests/fixtures/my_model

Then:
When the user attempts to run once more dbt test the execution fails with an unexpected error, like:

 File "/path/to/my/site-packages/dbt/parser/manifest.py", line 407, in load
    (_, line, method) = formatted_lines[-3].split(", ")
    ^^^^^^^^^^^^^^^^^
ValueError: not enough values to unpack (expected 3, got 1)

--
Although at first glance, this looked to me like the PartialParsing class was not doing a good job at detecting changes or deleting fixture nodes in-between compilations, after further testing I managed to reproduce the issue in two different scenarios:

  • When the csv fixtures are moved to a subdirectory.
  • When the csv fixtures are initially located in subdirectories but then are pulled up to test/fixtures.

So the issue is somewhat more general than was originally reported. That leads me to believe is rather related to the logic that triggers partial parsing and what happens after that logic exits, e.g. error handling within the ManifestLoader class. In said class, the traceback is being parsed under the assumption that frames in the stack conform to an strict format:

(_, line, method) = formatted_lines[-3].split(", ")

However, in this particular case the assumption does not seem to hold, plausibly due to the exception type, how/where it was thrown, or even the presence of extra newline ('\n') or caret ('^') characters in the traceback. As a result, the exception is not gracefully caught.

In this state, the user is left with only a few options, with the easier ones perhaps being:
a) Run dbt clean before proceeding (to trick the ManifestLoader into skipping partial parsing altogether)
b) Move the csv fixture back to their original directory — needless to say, a downgrade of the feature scope

Solution

Instead of making too many assumptions about what the traceback.format_exc() statement in line 405 could return (in regard to list item order or string content and form), I am suggesting to change the ManifestLoader to use Exception object attributes and traceback object functions/attributes to zero-in on the last frame in the stack of the traceback, and extract from it the last exception details, i.e. the values of code, line and method that should be added to the exc_info dictionary for graceful exception catching and logging.

I argue this should be a safer alternative to adding some additional logic like if-else statements and string comparisons to deal with edge-cases where extra newlines or caret characters are present or missing.

The approach does not affect any function API either, and the content of the exc_info dictionary is filled as I believe it should, something along the lines of:

# example evaluation of  `exc_info` dictionary in line 408 of manifest.py
exc_info = {
    "traceback": """Traceback (most recent call last):
      File "/path/to/my/site-packages/dbt-core/core/dbt/parser/manifest.py", line 398, in load

      ...

      unit_test = self.saved_manifest.unit_tests.pop(unique_id)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      KeyError: 'unit_test.my_dbt_project.my_model.test_is_valid_email_address'""",
    "exception": "KeyError: 'unit_test.my_dbt_project.my_model.test_is_valid_email_address'", 
    "code": "    unit_test = self.saved_manifest.unit_tests.pop(unique_id)", 
    "location": "line 616 in delete_fixture_node"
}

Last but not least, with the code changes I suggest, the user can of now freely move the csv fixture to subdirectories or out of them, since whenever partial parsing runs and fails due to a fixture node no longer existing, any exception traceback would be properly parsed and full parsing will kick in to ensure a working manifest.json is created as part of the unit test execution.

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX
  • This PR includes type annotations for new and modified functions

Additional Context

  • I debugged, developed and successfully tested integration with help of the postgresql adapter.
  • After code changes where staged, make test and pre-commit run were successfully executed.
  • My development dbt project structure looks like:
.
├── dbt_project.yml
├── models
│   ├── my_model.sql
│   ├── my_model.yml
│   ├── top_level_domains.sql
│   ├── top_level_domains.yml
│   ├── unit_test.yml
│   ├── users.sql
│   └── users.yml
├── profiles.yml
└── tests
    └── fixtures
        ├── my_model_mock_output.csv
        ├── top_level_domains_mock_input.csv
        └── users_mock_input.csv
  • The development my_model.sql definition I used:
select
  users.user_id,
  users.email,
  case
    when regexp_like(users.email,'^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$') = true
      and accepted_email_domains.tld is not null
        then true
     else false
  end as is_valid_email_address
from {{ ref('users')}} users
left join {{ ref('top_level_domains') }} accepted_email_domains
on users.email_top_level_domain = lower(accepted_email_domains.tld)
  • The development unit_test.yml definition I used:
unit_tests:
  - name: test_is_valid_email_address
    model: my_model
    given:
      - input: ref('users')
        format: csv
        fixture: users_mock_input
      - input: ref('top_level_domains')
        format: csv
        fixture: top_level_domains_mock_input
    expect:
      format: csv
      fixture: my_model_mock_output

@slothkong slothkong requested a review from a team as a code owner February 29, 2024 21:54
@cla-bot cla-bot bot added the cla:yes label Feb 29, 2024
Copy link
Contributor

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@slothkong
Copy link
Contributor Author

slothkong commented Mar 1, 2024

Hello @graciegoheen, I believe this one fixes the issue you encountered, with about 5 code line changes only. Would you be so kind as to extend me a review? Or perhaps @jtcohen6 would be willing and able to share a thumbs up/ thumbs down? 🙈

Cheers

@graciegoheen graciegoheen added the ready_for_review Externally contributed PR has functional approval, ready for code review from Core engineering label Mar 1, 2024
Copy link

codecov bot commented Mar 1, 2024

Codecov Report

Attention: Patch coverage is 88.37209% with 5 lines in your changes are missing coverage. Please review.

Project coverage is 88.10%. Comparing base (e4fe839) to head (2c165fc).
Report is 6 commits behind head on main.

Files Patch % Lines
core/dbt/parser/manifest.py 88.37% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9714      +/-   ##
==========================================
+ Coverage   88.08%   88.10%   +0.01%     
==========================================
  Files         178      178              
  Lines       22433    22443      +10     
==========================================
+ Hits        19761    19774      +13     
+ Misses       2672     2669       -3     
Flag Coverage Δ
integration 85.48% <46.51%> (-0.15%) ⬇️
unit 62.36% <81.39%> (+0.10%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@slothkong
Copy link
Contributor Author

I'll deal with coverage for the one line that is missing, shortly~

* Transform skip_parsing (private variable of ManifestLoader.load()) into instance-attribute of ManifestLoader(), with default value False
  (to enable splitting of ManifestLoader.load())

* Split ManifestLoader.load(), to extract operation of PartialParsing into new method called ManifestLoader.safe_update_project_parser_files_partially()
  (to simplify both cognitive complexity in the codebase and mocking in unittestest)

* Add "ignore" type-comments in new ManifestLoader.safe_update_project_parser_files_partially()
  (to silence mypy warnings regarding instance-attributes which can be initialized as None or as something else, e.g. self.saved_manifest)[1]

[1] Although I wanted avoid "ignore" type-comments, it seems like addressing these mypy warnings in a stricter sense requires technical alignment and broader code changes.
	For example, might need to initialize self.saved_manifest as Manifest, instead of Optional[Manifest], so that PartialParsing gets inputs with type it currently expects.
	... perhaps too far beyond the scope of this fix?
@slothkong
Copy link
Contributor Author

Hallo again @graciegoheen,

Hitting the missing line in the unittest was more difficult than I originally thought, hence the delay since my last comment.
Nevertheless, I've got good news:

  • I generated the coverage report locally and, as of the latest commit (2c165fc), lines with missing coverage in manifest.py went down from 491 to 466 📉
  • Naturally, the logic I had modified in previous commits is also part of the new bulk of lines with coverage.

I am happy to hear any feedback~

Regards

Copy link
Contributor

@gshank gshank left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Nice test case. Moving that code out into a separate function is definitely an improvement.

@gshank gshank merged commit 7b8ae21 into dbt-labs:main Mar 13, 2024
56 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla:yes ready_for_review Externally contributed PR has functional approval, ready for code review from Core engineering
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] fixtures in fixtures/subfolders throwing parsing error
3 participants