Allow to configure which fields are used to find duplicates #4199
Nice; thank you for getting this started! Overall, this seems like a good direction: we can generalize `chosen_ident` to include all the metadata we could ever want, and then construct a `MatchQuery` using a chosen set of field names. 👍 Nice design!
This is a tricky part of the code overall to tweak, so I have several thoughts within about how to carry things out.
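As a rough illustration of the idea of building a match from a chosen set of field names, here is a minimal sketch using only the standard-library `sqlite3` module. It is not the beets `MatchQuery` API; the table layout, the `duplicate_where` helper, and the sample rows are assumptions made up for this example.

```python
import sqlite3


def duplicate_where(info, keys):
    """Build a WHERE clause requiring an exact match on every chosen key.

    Field names are interpolated into the SQL text, so `keys` must come from
    a trusted source (hypothetical helper, not part of beets).
    """
    clause = " AND ".join(f"{key} = ?" for key in keys)
    params = [info[key] for key in keys]
    return clause, params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE albums (albumartist TEXT, album TEXT, format TEXT)")
conn.executemany(
    "INSERT INTO albums VALUES (?, ?, ?)",
    [("Artist", "Untitled", "CD"), ("Artist", "Untitled", "Vinyl")],
)

# Metadata of the album that is about to be imported.
incoming = {"albumartist": "Artist", "album": "Untitled", "format": "Vinyl"}


def find_duplicates(keys):
    """Return the formats of all stored albums matching `incoming` on `keys`."""
    clause, params = duplicate_where(incoming, keys)
    return conn.execute(
        f"SELECT format FROM albums WHERE {clause}", params
    ).fetchall()


default_matches = find_duplicates(["albumartist", "album"])
format_matches = find_duplicates(["albumartist", "album", "format"])
```

With only `albumartist` and `album` as keys, both stored rows count as duplicates; adding `format` narrows the match to the Vinyl row, which is the kind of behavior the PR description asks for.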
beets/importer.py
keys = config['import']['duplicate_keys'].as_str().split()
info = self.chosen_info().copy()
info['albumartist'] = artist
album = library.Album(None, **info)
It's also not entirely clear to me why we need to construct this `Album` object. Is there something that doesn't work about looking directly in `info`?
It is to allow using flexible attributes. I added a comment; is that okay for you?
docs/reference/config.rst
The fields used to find duplicates in import task.
If several items have the same value for each key, they will be considered duplicates.
It occurs to me here that we search for duplicates for both albums and items, and maybe we want the list of fields to be configurable separately.
Actually, it should use album keys only; I modified the line.
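The suggestion to make the field lists configurable separately for albums and items could be sketched as below. This is only an illustration: the nested dictionary stands in for the `import` section of a config file, the `album` and `item` key names follow a later commit message in this thread ('Change config key from "single" to "item"'), and the field lists and the `duplicate_keys_for` helper are assumptions.

```python
# Hypothetical stand-in for the `import` section of a config file. The
# whitespace-separated format mirrors the `.split()` call quoted earlier.
import_settings = {
    "duplicate_keys": {
        "album": "albumartist album",
        "item": "artist title",
    }
}


def duplicate_keys_for(settings, is_album):
    """Return the list of field names to compare for this kind of import."""
    kind = "album" if is_album else "item"
    return settings["duplicate_keys"][kind].split()


album_keys = duplicate_keys_for(import_settings, is_album=True)
item_keys = duplicate_keys_for(import_settings, is_album=False)
```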
Hi @sampsyo and thanks for the review
It seems I can't trigger the CI checks somehow...
Awesome! I believe the problem is that the CI doesn't trigger when the branch has conflicts. I've resolved the conflict (in the changelog file) and the tests are running now.
This pull request would enable my use case where I want to import several releases of the same artist which are untitled. Other fields like the catalog number or year would identify the releases however. @jcassette After reading your pull request I tried to re-implement a very basic working example for the fields (artist, album, year). I found out that I can import the same album again and again without getting the duplicate warning if one of the fields is None, like the year for example.
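One possible explanation for a missing value defeating the duplicate check, not confirmed anywhere in this thread, is how SQL compares NULL: `column = NULL` is never true, so an equality-based match on a key whose value is None finds nothing. The sketch below demonstrates this with the standard-library `sqlite3` module; the table, the sample row, and both helper functions are assumptions for illustration and are not beets code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE albums (albumartist TEXT, album TEXT, year INTEGER)")
# An album that was imported earlier without a year.
conn.execute("INSERT INTO albums VALUES (?, ?, ?)", ("Artist", "Untitled", None))

# The same album being imported again.
incoming = {"albumartist": "Artist", "album": "Untitled", "year": None}
keys = ["albumartist", "album", "year"]


def where_naive(info, keys):
    """Plain equality: a NULL column never equals a NULL parameter."""
    return " AND ".join(f"{k} = ?" for k in keys), [info[k] for k in keys]


def where_null_safe(info, keys):
    """SQLite's IS operator works like = but treats NULL as equal to NULL."""
    return " AND ".join(f"{k} IS ?" for k in keys), [info[k] for k in keys]


def count_matches(builder):
    """Count stored albums that match `incoming` under the given builder."""
    clause, params = builder(incoming, keys)
    query = f"SELECT COUNT(*) FROM albums WHERE {clause}"
    return conn.execute(query, params).fetchone()[0]


naive_count = count_matches(where_naive)
null_safe_count = count_matches(where_null_safe)
```

Here the equality-based query reports no duplicates while the NULL-aware query finds the existing row, which matches the symptom described above of repeated imports without a warning.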
Hi @allesmi and thanks for trying this
Sorry about the noise - I got a bit confused with git
Unfortunately I do not have time to show this as code. Let me instead describe what I have manually tried with your most recent commit: My config file consists of these options for
The first import into an empty database completes successfully. Importing the album again does not warn me about the duplicate, does not ask me what to do with old and new files and instead imports as if it is not a duplicate. After several imports my library ends up like this:
It seems as if the
Overall, this is looking great! I like the simplification, and packaging up the query construction has made things significantly more readable.
I'm sorry for being slow, but I'm still having some trouble understanding why `tmp_album` and `tmp_item` are necessary. You mentioned above (and in the comments) that this is to make things work with flexible attributes. But I don't quite see why this is necessary for flexible attributes. Couldn't something like `duplicates` and `construct_match_queries` work using only information from the `Album` and `Item` classes rather than from temporary instances? (Again, sorry for missing the insight here!)
One last thing I forgot to mention: it would be great to avoid the duplication in the model classes. At least some of
# Conflicts: # docs/changelog.rst
Thanks for the review @sampsyo! I have made the changes you requested.
No problem, I probably didn't explain well.
Computed (inline) fields are an interesting point! I guess creating a temporary object may be the easiest/lowest-effort way to obtain those, if people want to use them as deduplication keys. I suppose there's an underlying confusion I have about flexible attributes, though: since we're dealing with newly-imported albums/items, how could they have flexible attributes on them in the first place? I thought we would only be able to distinguish newly-imported albums based on the data either from disk or from the metadata source… maybe one of the metadata sources uses a flexible attribute? But if so, I thought that would be in the
Sorry about the confusion, I think I had been using the term "flexible attributes" to actually mean "inline fields"...
If you mean user-defined fields, I was not trying to make those work, only inline fields.
Yes, that's what is implemented. The data of
Got it; thanks for clarifying! Computed fields (e.g., from the I'll look in more detail at the low-level utilities here shortly…
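To illustrate why a temporary object can help with computed fields, here is a minimal sketch. `TempAlbum`, its `computed` table, and the `decade` field are hypothetical stand-ins invented for this example; they are not the beets `Album` class or any plugin's API. The point is only that a plain `info` dictionary holds stored values, while an object can also answer lookups for values derived from them.

```python
class TempAlbum:
    """Hypothetical stand-in for a model object; not the beets Album class."""

    # Hypothetical computed field: derived from stored data, never stored itself.
    computed = {"decade": lambda album: (album.fields["year"] // 10) * 10}

    def __init__(self, **fields):
        self.fields = fields

    def get(self, key):
        """Look up a stored field first, then fall back to computed fields."""
        if key in self.fields:
            return self.fields[key]
        if key in self.computed:
            return self.computed[key](self)
        return None


def duplicate_key(info, keys):
    """Build the tuple used to compare albums, resolving computed fields."""
    tmp = TempAlbum(**info)
    return tuple(tmp.get(key) for key in keys)


info = {"albumartist": "Artist", "album": "Untitled", "year": 1994}
key = duplicate_key(info, ["albumartist", "decade"])
```

Looking up `decade` directly in `info` would fail because it is not a stored value, whereas the temporary object resolves it from `year`.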
Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward? This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
It's still very much on my to-do list to rearrange some of the low-level query manipulation stuff here… 😕
We now use somewhat more general query constructors in `dbcore`, avoiding the need for somewhat special-purpose `duplicates` methods on the model objects.
For consistency with the rest of the terminology in the docs/config. Also, correct the documentation (which previously only covered albums).
Thanks for your (extreme) patience while I finally got around to giving this PR the attention it deserved. It was 99% there, but I wanted to simplify the low-level query utilities that found their way into the Please let me know if you have any thoughts! Meanwhile, I'll merge this so it can go out in the upcoming release.
commit e584b04 Merge: 7467bc3 2ebc28d Author: Adrian Sampson <adrian@radbox.org> Date: Sun Aug 21 10:44:31 2022 -0700 Merge pull request beetbox#4199 from jcassette/duplicate Allow to configure which fields are used to find duplicates
commit 2ebc28d Author: Adrian Sampson <adrian@radbox.org> Date: Sun Aug 21 10:36:40 2022 -0700 Improve changelog for beetbox#4199
commit 1054b72 Merge: 3c945cb 6e0f7a1 Author: Adrian Sampson <adrian@radbox.org> Date: Sun Aug 21 10:34:15 2022 -0700 Merge branch 'master' into duplicate
commit 3c945cb Author: Adrian Sampson <adrian@radbox.org> Date: Sun Aug 21 10:31:45 2022 -0700 Change config key from "single" to "item" For consistency with the rest of the terminology in the docs/config. Also, correct the documentation (which previously only covered albums).
commit bcc8903 Author: Adrian Sampson <adrian@radbox.org> Date: Sun Aug 21 10:27:31 2022 -0700 Refactor query utilities We now use somewhat more general query constructors in `dbcore`, avoiding the need for somewhat special-purpose `duplicates` methods on the model objects.
commit ca38486 Author: Adrian Sampson <adrian@radbox.org> Date: Sun Aug 21 10:12:47 2022 -0700 Clarify some control flow
commit 7467bc3 Merge: 6e0f7a1 8cb3143 Author: Adrian Sampson <adrian@radbox.org> Date: Sun Aug 21 10:01:37 2022 -0700 Merge pull request beetbox#4450 from beetbox/deprecations Resolve some deprecation warnings
commit 8cb3143 Author: Adrian Sampson <adrian@radbox.org> Date: Sun Aug 21 09:50:53 2022 -0700 Avoid BeautifulSoup deprecation warning The `text` parameter to `SoupStrainer` was renamed to `string` in 2015 (4.4.0) and started producing a warning this year (4.11.0). https://bazaar.launchpad.net/%7Eleonardr/beautifulsoup/bs4/view/head:/CHANGELOG
commit 8c84bae Author: Adrian Sampson <adrian@radbox.org> Date: Sun Aug 21 08:18:49 2022 -0700 Remove `match_querystring` in `responses` Quoth the responses documentation: > querystring is matched by default Not sure how recent this is, unfortunately---but probably 0.17.0, since that's the version where `match_querystring` was deprecated.
commit 63b7595 Author: Adrian Sampson <adrian@radbox.org> Date: Sun Aug 21 08:13:07 2022 -0700 Remove use of `imp` The replacements in `importlib.util` have been available since Python 3.5.
commit 2c9f699 Author: Adrian Sampson <adrian@radbox.org> Date: Sun Aug 21 08:06:10 2022 -0700 Use non-deprecated name for `notify_all` `notifyAll` was deprecated in: python/cpython#87889 The new name, `notify_all`, has been available since Python 3.0.
commit 6e0f7a1 Merge: f0a6bbb bf8fbed Author: Adrian Sampson <adrian@radbox.org> Date: Sun Aug 21 07:09:12 2022 -0700 Merge pull request beetbox#4412 from beetbox/album-items Document Album.items() / LibModel.items() conflict
commit f0a6bbb Merge: 40d7fa6 fafddce Author: Adrian Sampson <adrian@radbox.org> Date: Sun Aug 21 07:07:23 2022 -0700 Merge pull request beetbox#4447 from wisp3rwind/pr_version_regex release.py: fix version regex (remove u'' string prefix)
commit bf8fbed Author: Callum Brown <callum@calcuode.com> Date: Sun Aug 21 14:34:18 2022 +0100 Clarify Album.items() conflict
commit 40d7fa6 Merge: 4761c35 fb9e95b Author: Adrian Sampson <adrian@radbox.org> Date: Sat Aug 20 17:14:02 2022 -0700 Merge pull request beetbox#4095 from Duncaen/formatted-modify Formatted modify and import --set-field.
commit fb9e95b Author: Adrian Sampson <adrian@radbox.org> Date: Sat Aug 20 16:50:20 2022 -0700 Fix some long lines
commit b207224 Author: Adrian Sampson <adrian@radbox.org> Date: Sat Aug 20 16:47:01 2022 -0700 Further document formatted modify with examples I think these can make it clearer why someone would want to use this feature. (Part of beetbox#4095.)
commit dad918e Author: Adrian Sampson <adrian@radbox.org> Date: Sat Aug 20 16:43:55 2022 -0700 Out-of-date changelog fixes
commit 7af40db Merge: 0456c8f 4761c35 Author: Adrian Sampson <adrian@radbox.org> Date: Sat Aug 20 16:37:52 2022 -0700 Merge branch 'master' into formatted-modify
commit 4761c35 Merge: 18ab441 b7ff616 Author: Benedikt <wisp3rwind@posteo.eu> Date: Sat Aug 20 07:33:23 2022 +0200 Merge pull request beetbox#4395 from clach04/patch-1 Version bump to 1.6.1
commit fafddce Author: wisp3rwind <17089248+wisp3rwind@users.noreply.github.com> Date: Sat Aug 20 07:30:15 2022 +0200 release.py: fix version regex (remove u'' string prefix)
commit 18ab441 Merge: 0ae7d66 93725c4 Author: Adrian Sampson <adrian@radbox.org> Date: Fri Aug 19 17:54:52 2022 -0700 Merge pull request beetbox#4444 from BinaryBrain/master Add Beetstream in the plugin list
commit 93725c4 Author: Sacha Bron <me@sachabron.ch> Date: Sat Aug 20 01:30:38 2022 +0200 Add Beetstream in the plugin list
commit 0ae7d66 Merge: e995019 32ce44f Author: Benedikt <wisp3rwind@posteo.eu> Date: Thu Aug 18 18:11:03 2022 +0200 Merge pull request beetbox#4441 from beetbox/exact-prefix Change the prefix for exact match queries
commit 32ce44f Author: Adrian Sampson <adrian@radbox.org> Date: Wed Aug 17 16:25:17 2022 -0700 One more test fix
commit 495c8ac Author: Adrian Sampson <adrian@radbox.org> Date: Wed Aug 17 16:11:16 2022 -0700 Update exact query prefix tests
commit f71e503 Author: Adrian Sampson <adrian@radbox.org> Date: Wed Aug 17 16:05:33 2022 -0700 Change the prefix for exact match queries PR beetbox#4251 added exact match queries, which are great, but it was subsequently pointed out that the `~` query prefix was already in use: beetbox#4251 (comment) So this changes the prefix from `~` to `=~`. A little longer, but hopefully it makes the relationship to the similarly-new `=` prefix obvious.
commit e995019 Author: Adrian Sampson <adrian@radbox.org> Date: Wed Aug 17 15:55:25 2022 -0700 Doc tweaks for beetbox#4438
commit fa81d6c Merge: 6eec17c 6aa9804 Author: Adrian Sampson <adrian@radbox.org> Date: Wed Aug 17 15:54:43 2022 -0700 Merge pull request beetbox#4438 from jaimeMF/singleton_unique_paths Add path template "sunique" to disambiguate between singleton tracks
commit 6aa9804 Author: Jaime Marquínez Ferrándiz <jaime.marquinez.ferrandiz@fastmail.net> Date: Wed Aug 17 17:03:16 2022 +0200 Document the %sunique template
commit f641df0 Author: Jaime Marquínez Ferrándiz <jaime.marquinez.ferrandiz@fastmail.net> Date: Tue Aug 16 17:54:12 2022 +0200 Encapsulate common code for the aunique and sunique templates in a single method
commit 8d957f3 Author: Jaime Marquínez Ferrándiz <jaime.marquinez.ferrandiz@fastmail.net> Date: Fri Aug 12 14:19:52 2022 +0200 Add path template "sunique" to disambiguate between singleton tracks
commit 6eec17c Merge: 1dddcb8 6803ef3 Author: Adrian Sampson <adrian@radbox.org> Date: Fri Aug 5 09:15:00 2022 -0400 Merge pull request beetbox#4433 from vicholp/master Fix get item file in web plugin
commit 6803ef3 Author: vicholp <vlinerospardo@gmail.com> Date: Wed Aug 3 01:22:45 2022 -0400 add test to get item file of web plugin
commit fde2ad3 Author: vicholp <vlinerospardo@gmail.com> Date: Wed Aug 3 01:22:35 2022 -0400 fix get item file of web plugin
commit 1cde938 Author: Callum Brown <callum@calcuode.com> Date: Tue Jul 12 11:21:52 2022 +0100 Document Album.items() / LibModel.items() conflict Closes: beetbox#4404
commit b7ff616 Author: clach04 <clach04@gmail.com> Date: Fri Jul 1 17:51:54 2022 -0700 Version bump to 1.6.1 Matche setup.py (package) version
commit bf9bf48 Merge: bcf2e15 10338c2 Author: Julien Cassette <jcassette@users.noreply.github.com> Date: Sun Jan 30 16:47:44 2022 +0100 Merge branch 'master' into duplicate # Conflicts: # docs/changelog.rst
commit bcf2e15 Author: Julien Cassette <jcassette@users.noreply.github.com> Date: Sun Jan 30 16:38:34 2022 +0100 Move construct_match_queries() to dbcore.Model
commit 7633465 Author: Julien Cassette <jcassette@users.noreply.github.com> Date: Sat Jan 22 22:36:47 2022 +0100 Add duplicate_keys feature for singletons
commit f50d250 Author: Julien Cassette <jcassette@users.noreply.github.com> Date: Sun Jan 2 17:25:30 2022 +0100 Review duplicate_keys feature
commit 6ce29a6 Author: Julien Cassette <jcassette@users.noreply.github.com> Date: Sat Nov 27 14:36:59 2021 +0100 Allow to use flexible attributes in duplicate_keys
commit 3fdfaaa Author: Julien Cassette <jcassette@users.noreply.github.com> Date: Sun Nov 21 18:41:06 2021 +0100 Allow to configure which fields are used to find duplicates
commit 0456c8f Author: Duncan Overbruck <mail@duncano.de> Date: Wed Dec 15 14:32:11 2021 +0100 test multiple items in test_modify_formatted
commit 795bc2e Author: Duncan Overbruck <mail@duncano.de> Date: Wed Dec 15 14:31:15 2021 +0100 compile modify templates only once
commit a2030d1 Author: Duncan Overbruck <mail@duncano.de> Date: Wed Oct 6 15:52:08 2021 +0200 changelog: import/modify field formatting
commit 5824d46 Author: Duncan Overbruck <mail@duncano.de> Date: Wed Oct 6 15:44:12 2021 +0200 changelog: rewrite permissions cover art change
commit 819ba73 Author: Duncan Overbruck <mail@duncano.de> Date: Wed Oct 6 15:40:03 2021 +0200 allow templates/formatting of set_fields on import
commit 636e36e Author: Duncan Overbruck <mail@duncano.de> Date: Wed Oct 6 15:14:34 2021 +0200 allow templates/formatting when setting fields with modify
Description
This adds an option `duplicate_keys`, which makes it possible to configure the behavior of the importer when it searches for duplicates. For example, one could add the `format` field so that the importer will always keep albums that have different formats.
Can you review this please? I still have to implement it for singletons.
Also, what do you think about allowing a Python function in `duplicate_action` for full customization?
To Do