Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow to configure which fields are used to find duplicates #4199

Merged
merged 11 commits into from
Aug 21, 2022

Conversation

jcassette
Copy link
Contributor

Description

This adds an option duplicate_keys which allows to configure the behavior of the importer when it searches for duplicates.
For example, one could add the format field so that the importer will always keep albums that have different formats.

Can you review this please? I still have to implement it for singletons.
Also what do you think about allowing to use a Python function in duplicate_action for full customization?

To Do

  • Documentation
  • Changelog
  • Tests

@jcassette jcassette marked this pull request as draft December 18, 2021 21:47
Copy link
Member

@sampsyo sampsyo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice; thank you for getting this started! Overall, this seems like a good direction—we can generalize chosen_ident to include all the metadata we could ever want, and then construct a MatchQuery using a chosen set of field names. 👍 Nice design!

This is a tricky part of the code overall to tweak, so I have several thoughts within about how to carry things out.

beets/importer.py Outdated Show resolved Hide resolved
beets/importer.py Outdated Show resolved Hide resolved
beets/importer.py Outdated Show resolved Hide resolved
beets/importer.py Outdated Show resolved Hide resolved
docs/changelog.rst Outdated Show resolved Hide resolved
beets/importer.py Outdated Show resolved Hide resolved
beets/importer.py Outdated Show resolved Hide resolved
keys = config['import']['duplicate_keys'].as_str().split()
info = self.chosen_info().copy()
info['albumartist'] = artist
album = library.Album(None, **info)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also not entirely clear to me why we need to construct this Album object. Is there something that doesn't work about looking directly in info?

Copy link
Contributor Author

@jcassette jcassette Jan 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is to allow using flexible attributes, I added a comment, is that okay for you?

beets/importer.py Outdated Show resolved Hide resolved
Comment on lines 678 to 679
The fields used to find duplicates in import task.
If several items have the same value for each key, they will be considered duplicates.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It occurs to me here that we search for duplicates for both albums and items, and maybe we want the list of fields to be configurable separately.

Copy link
Contributor Author

@jcassette jcassette Jan 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it should use album keys only, I modified the line

@jcassette
Copy link
Contributor Author

jcassette commented Jan 2, 2022

Hi @sampsyo and thanks for the review
I have updated this with your comments

@jcassette jcassette marked this pull request as ready for review January 2, 2022 16:49
@jcassette jcassette requested a review from sampsyo January 2, 2022 16:52
@jcassette jcassette marked this pull request as draft January 2, 2022 16:52
@jcassette
Copy link
Contributor Author

jcassette commented Jan 2, 2022

It seems I can't trigger the CI checks somehow...

@sampsyo
Copy link
Member

sampsyo commented Jan 3, 2022

Awesome! I believe the problem is that the CI doesn't trigger when the branch has conflicts. I've resolved the conflict (in the changelog file) and the tests are running now.

@allesmi
Copy link
Contributor

allesmi commented Jan 19, 2022

This pull request would enable my use case where I want to import several releases of the same artist which are untitled. Other fields like the catalog number or year would identify the releases however.

@jcassette After reading your pull request I tried to re-implement a very basic working example for the fields (artist, album, year).

I found out that I can import the same album again and again without getting the duplicate warning if one of the fields is None, like the year for example.

@jcassette
Copy link
Contributor Author

Hi @allesmi and thanks for trying this
Can you provide a test case which reproduces this issue?
I tried to set "flex" fields to None in test_keep_when_extra_key_is_different and the album was not imported.

@jcassette
Copy link
Contributor Author

Sorry about the noise - I got a bit confused with git
I have fixed the tests and implemented the feature for singletons
@sampsyo can you review this please ?
I think this PR is complete now

@jcassette jcassette marked this pull request as ready for review January 22, 2022 21:49
@allesmi
Copy link
Contributor

allesmi commented Jan 23, 2022

Hi @allesmi and thanks for trying this Can you provide a test case which reproduces this issue? I tried to set "flex" fields to None in test_keep_when_extra_key_is_different and the album was not imported.

Unfortunately I do not have time to show this as code. Let me instead describe what I have manually tried with your most recent commit:
I have imported the album Ketsa - May Starlight Find You (musicbrainz release, download from Free Music Archive). Note that musicbrainz does not have a year for this release.

My config file consists of these options for duplicate_keys and is other than that empty:

import:
  duplicate_keys:
    album: albumartist album year

The first import into an empty database completes successfully. Importing the album again does not warn me about the duplicate, does not ask me what to do with old and new files and instead imports as if it is not a duplicate. After several imports my library ends up like this:

$ tree ~/Music/Ketsa/ -d
/home/me/Music/Ketsa/
├── May Starlight Find You
├── May Starlight Find You [2]
└── May Starlight Find You [3]

It seems as if the year field is an issue. If I remove it from duplicate_keys and import the album again I get asked what I want to do with three old albums and one new album.

Copy link
Member

@sampsyo sampsyo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this is looking great! I like the simplification, and packaging up the query construction has made things significantly more readable.

I'm sorry for being slow, but I'm still having some trouble understanding why tmp_album and tmp_item are necessary. You mentioned above (and in the comments) that this is to make things work with flexible attributes. But I don't quite see why this is necessary for flexible attributes. Couldn't something like duplicates and construct_match_queries work using only information from the Album and Item classes rather from temporary instances? (Again, sorry for missing the insight here!)

beets/importer.py Outdated Show resolved Hide resolved
beets/library.py Outdated Show resolved Hide resolved
@sampsyo
Copy link
Member

sampsyo commented Jan 29, 2022

One last thing I forgot to mention: it would be great to avoid the duplication in the model classes. At least some of duplicates and construct_match_queries seem like they could be moved to dbcore to be available on all Model subclasses.

@jcassette
Copy link
Contributor Author

Thanks for the review @sampsyo ! I have made the changes you requested.

I'm sorry for being slow, but I'm still having some trouble understanding why tmp_album and tmp_item are necessary. You mentioned above (and in the comments) that this is to make things work with flexible attributes. But I don't quite see why this is necessary for flexible attributes. Couldn't something like duplicates and construct_match_queries work using only information from the Album and Item classes rather from temporary instances? (Again, sorry for missing the insight here!)

No problem, I probably didn't explain well.
If you checkout the first commit and run the test case of the second commit with you will see an error related to the flexible attribute.
Also chosen_info only returns values contained in the tags of the audio files, but for inline fields I think I understood that the values are computed in Album and Item classes, that's why I needed to create the temporary instances.
I haven't dug into it so much so I hope this helps, or let me know if you want to discuss this further.

@sampsyo
Copy link
Member

sampsyo commented Jan 31, 2022

Computed (inline) fields are an interesting point! I guess creating a temporary object may be the easiest/lowest-effort way to obtain those, if people want to use them as deduplication keys.

I suppose there's an underlying confusion I have about flexible attributes, though: since we're dealing with newly-imported albums/items, how could they have flexible attributes on them in the first place? I thought we would only be able to distinguish newly-imported albums based on the data either from disk or from the metadata source… maybe one of the metadata sources uses a flexible attribute? But if so, I thought that would be in the chosen_info dict already? I guess I'm missing something about the underlying problem here…

@jcassette
Copy link
Contributor Author

jcassette commented Jan 31, 2022

Sorry about the confusion, I think I had been using the term "flexible attributes" to actually mean "inline fieds"...
I may have confused because they seem to work similarly for database queries.

since we're dealing with newly-imported albums/items, how could they have flexible attributes on them in the first place?

If you mean user-defined fields, I was not trying to make this to work, only inline fields.
But maybe one album/item could already have user-defined fields if it is being reimported?
In this case, with the current code these attributes are not retrieved, so I guess they should not be used in duplicate_keys.

I thought we would only be able to distinguish newly-imported albums based on the data either from disk or from the metadata source…

Yes, that's what is implemented. The data of chosen_info is provided by autotag.current_metadata (from disk) or self.match.info (from metadata source), and inline fields that can be derived from that data are provided by the Album/Item classes.

@sampsyo
Copy link
Member

sampsyo commented Feb 1, 2022

Got it; thanks for clarifying! Computed fields (e.g., from the inline plugin) seem like an interesting use case for this. I'm convinced, I think, that creating temporary objects is the easiest route to doing this.

I'll look in more detail at the low-level utilities here shortly…

@stale
Copy link

stale bot commented Jun 12, 2022

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jun 12, 2022
@sampsyo
Copy link
Member

sampsyo commented Jun 12, 2022

It's still very much on my to-do list to rearrange some of the low-level query manipulation stuff here… 😕

@stale stale bot removed the stale label Jun 12, 2022
We now use somewhat more general query constructors in `dbcore`,
avoiding the need for somewhat special-purpose `duplicates` methods on
the model objects.
For consistency with the rest of the terminology in the docs/config.
Also, correct the documentation (which previously only covered albums).
@sampsyo
Copy link
Member

sampsyo commented Aug 21, 2022

Thanks for your (extreme) patience while I finally got around to giving this PR the attention it deserved. It was 99% there, but I wanted to simplify the low-level query utilities that found their way into the library and dbcore modules. I moved us toward some even simpler, broadly applicable utilities for constructing queries, which I can imagine using many other places throughout the codebase. I think this should also somewhat simplify the logic in the importer module itself.

Please let me know if you have any thoughts! Meanwhile, I'll merge this so it can go out in the upcoming release.

@sampsyo sampsyo merged commit e584b04 into beetbox:master Aug 21, 2022
arsaboo added a commit to arsaboo/beets that referenced this pull request Aug 26, 2022
commit e584b04
Merge: 7467bc3 2ebc28d
Author: Adrian Sampson <adrian@radbox.org>
Date:   Sun Aug 21 10:44:31 2022 -0700

    Merge pull request beetbox#4199 from jcassette/duplicate

    Allow to configure which fields are used to find duplicates

commit 2ebc28d
Author: Adrian Sampson <adrian@radbox.org>
Date:   Sun Aug 21 10:36:40 2022 -0700

    Improve changelog for beetbox#4199

commit 1054b72
Merge: 3c945cb 6e0f7a1
Author: Adrian Sampson <adrian@radbox.org>
Date:   Sun Aug 21 10:34:15 2022 -0700

    Merge branch 'master' into duplicate

commit 3c945cb
Author: Adrian Sampson <adrian@radbox.org>
Date:   Sun Aug 21 10:31:45 2022 -0700

    Change config key from "single" to "item"

    For consistency with the rest of the terminology in the docs/config.
    Also, correct the documentation (which previously only covered albums).

commit bcc8903
Author: Adrian Sampson <adrian@radbox.org>
Date:   Sun Aug 21 10:27:31 2022 -0700

    Refactor query utilities

    We now use somewhat more general query constructors in `dbcore`,
    avoiding the need for somewhat special-purpose `duplicates` methods on
    the model objects.

commit ca38486
Author: Adrian Sampson <adrian@radbox.org>
Date:   Sun Aug 21 10:12:47 2022 -0700

    Clarify some control flow

commit 7467bc3
Merge: 6e0f7a1 8cb3143
Author: Adrian Sampson <adrian@radbox.org>
Date:   Sun Aug 21 10:01:37 2022 -0700

    Merge pull request beetbox#4450 from beetbox/deprecations

    Resolve some deprecation warnings

commit 8cb3143
Author: Adrian Sampson <adrian@radbox.org>
Date:   Sun Aug 21 09:50:53 2022 -0700

    Avoid BeautifulSoup deprecation warning

    The `text` parameter to `SoupStrainer` was renamed to `string` in 2015
    (4.4.0) and started producing a warning this year (4.11.0).
    https://bazaar.launchpad.net/%7Eleonardr/beautifulsoup/bs4/view/head:/CHANGELOG

commit 8c84bae
Author: Adrian Sampson <adrian@radbox.org>
Date:   Sun Aug 21 08:18:49 2022 -0700

    Remove `match_querystring` in `responses`

    Quoth the responses documentation:

    > querystring is matched by default

    Not sure how recent this is, unfortunately---but probably 0.17.0, since
    that's the version where `match_querystring` was deprecated.

commit 63b7595
Author: Adrian Sampson <adrian@radbox.org>
Date:   Sun Aug 21 08:13:07 2022 -0700

    Remove use of `imp`

    The replacements in `importlib.util` have been available since Python
    3.5.

commit 2c9f699
Author: Adrian Sampson <adrian@radbox.org>
Date:   Sun Aug 21 08:06:10 2022 -0700

    Use non-deprecated name for `notify_all`

    `notifyAll` was deprecated in:
    python/cpython#87889

    The new name, `notify_all`, has been available since Python 3.0.

commit 6e0f7a1
Merge: f0a6bbb bf8fbed
Author: Adrian Sampson <adrian@radbox.org>
Date:   Sun Aug 21 07:09:12 2022 -0700

    Merge pull request beetbox#4412 from beetbox/album-items

    Document Album.items() / LibModel.items() conflict

commit f0a6bbb
Merge: 40d7fa6 fafddce
Author: Adrian Sampson <adrian@radbox.org>
Date:   Sun Aug 21 07:07:23 2022 -0700

    Merge pull request beetbox#4447 from wisp3rwind/pr_version_regex

    release.py: fix version regex (remove u'' string prefix)

commit bf8fbed
Author: Callum Brown <callum@calcuode.com>
Date:   Sun Aug 21 14:34:18 2022 +0100

    Clarify Album.items() conflict

commit 40d7fa6
Merge: 4761c35 fb9e95b
Author: Adrian Sampson <adrian@radbox.org>
Date:   Sat Aug 20 17:14:02 2022 -0700

    Merge pull request beetbox#4095 from Duncaen/formatted-modify

    Formatted modify and import --set-field.

commit fb9e95b
Author: Adrian Sampson <adrian@radbox.org>
Date:   Sat Aug 20 16:50:20 2022 -0700

    Fix some long lines

commit b207224
Author: Adrian Sampson <adrian@radbox.org>
Date:   Sat Aug 20 16:47:01 2022 -0700

    Further document formatted modify with examples

    I think these can make it clearer why someone would want to use this
    feature. (Part of beetbox#4095.)

commit dad918e
Author: Adrian Sampson <adrian@radbox.org>
Date:   Sat Aug 20 16:43:55 2022 -0700

    Out-of-date changelog fixes

commit 7af40db
Merge: 0456c8f 4761c35
Author: Adrian Sampson <adrian@radbox.org>
Date:   Sat Aug 20 16:37:52 2022 -0700

    Merge branch 'master' into formatted-modify

commit 4761c35
Merge: 18ab441 b7ff616
Author: Benedikt <wisp3rwind@posteo.eu>
Date:   Sat Aug 20 07:33:23 2022 +0200

    Merge pull request beetbox#4395 from clach04/patch-1

    Version bump to 1.6.1

commit fafddce
Author: wisp3rwind <17089248+wisp3rwind@users.noreply.github.com>
Date:   Sat Aug 20 07:30:15 2022 +0200

    release.py: fix version regex (remove u'' string prefix)

commit 18ab441
Merge: 0ae7d66 93725c4
Author: Adrian Sampson <adrian@radbox.org>
Date:   Fri Aug 19 17:54:52 2022 -0700

    Merge pull request beetbox#4444 from BinaryBrain/master

    Add Beetstream in the plugin list

commit 93725c4
Author: Sacha Bron <me@sachabron.ch>
Date:   Sat Aug 20 01:30:38 2022 +0200

    Add Beetstream in the plugin list

commit 0ae7d66
Merge: e995019 32ce44f
Author: Benedikt <wisp3rwind@posteo.eu>
Date:   Thu Aug 18 18:11:03 2022 +0200

    Merge pull request beetbox#4441 from beetbox/exact-prefix

    Change the prefix for exact match queries

commit 32ce44f
Author: Adrian Sampson <adrian@radbox.org>
Date:   Wed Aug 17 16:25:17 2022 -0700

    One more test fix

commit 495c8ac
Author: Adrian Sampson <adrian@radbox.org>
Date:   Wed Aug 17 16:11:16 2022 -0700

    Update exact query prefix tests

commit f71e503
Author: Adrian Sampson <adrian@radbox.org>
Date:   Wed Aug 17 16:05:33 2022 -0700

    Change the prefix for exact match queries

    PR beetbox#4251 added exact match queries, which are great, but it was
    subsequently pointed out that the `~` query prefix was already in use:
    beetbox#4251 (comment)

    So this changes the prefix from `~` to `=~`. A little longer, but
    hopefully it makes the relationship to the similarly-new `=` prefix obvious.

commit e995019
Author: Adrian Sampson <adrian@radbox.org>
Date:   Wed Aug 17 15:55:25 2022 -0700

    Doc tweaks for beetbox#4438

commit fa81d6c
Merge: 6eec17c 6aa9804
Author: Adrian Sampson <adrian@radbox.org>
Date:   Wed Aug 17 15:54:43 2022 -0700

    Merge pull request beetbox#4438 from jaimeMF/singleton_unique_paths

    Add path template "sunique" to disambiguate between singleton tracks

commit 6aa9804
Author: Jaime Marquínez Ferrándiz <jaime.marquinez.ferrandiz@fastmail.net>
Date:   Wed Aug 17 17:03:16 2022 +0200

    Document the %sunique template

commit f641df0
Author: Jaime Marquínez Ferrándiz <jaime.marquinez.ferrandiz@fastmail.net>
Date:   Tue Aug 16 17:54:12 2022 +0200

    Encapsulate common code for the aunique and sunique templates in a single method

commit 8d957f3
Author: Jaime Marquínez Ferrándiz <jaime.marquinez.ferrandiz@fastmail.net>
Date:   Fri Aug 12 14:19:52 2022 +0200

    Add path template "sunique" to disambiguate between singleton tracks

commit 6eec17c
Merge: 1dddcb8 6803ef3
Author: Adrian Sampson <adrian@radbox.org>
Date:   Fri Aug 5 09:15:00 2022 -0400

    Merge pull request beetbox#4433 from vicholp/master

    Fix get item file in web plugin

commit 6803ef3
Author: vicholp <vlinerospardo@gmail.com>
Date:   Wed Aug 3 01:22:45 2022 -0400

    add test to get item file of web plugin

commit fde2ad3
Author: vicholp <vlinerospardo@gmail.com>
Date:   Wed Aug 3 01:22:35 2022 -0400

    fix get item file of web plugin

commit 1cde938
Author: Callum Brown <callum@calcuode.com>
Date:   Tue Jul 12 11:21:52 2022 +0100

    Document Album.items() / LibModel.items() conflict

    Closes: beetbox#4404

commit b7ff616
Author: clach04 <clach04@gmail.com>
Date:   Fri Jul 1 17:51:54 2022 -0700

    Version bump to 1.6.1

    Matche setup.py (package) version

commit bf9bf48
Merge: bcf2e15 10338c2
Author: Julien Cassette <jcassette@users.noreply.github.com>
Date:   Sun Jan 30 16:47:44 2022 +0100

    Merge branch 'master' into duplicate

    # Conflicts:
    #	docs/changelog.rst

commit bcf2e15
Author: Julien Cassette <jcassette@users.noreply.github.com>
Date:   Sun Jan 30 16:38:34 2022 +0100

    Move construct_match_queries() to dbcore.Model

commit 7633465
Author: Julien Cassette <jcassette@users.noreply.github.com>
Date:   Sat Jan 22 22:36:47 2022 +0100

    Add duplicate_keys feature for singletons

commit f50d250
Author: Julien Cassette <jcassette@users.noreply.github.com>
Date:   Sun Jan 2 17:25:30 2022 +0100

    Review duplicate_keys feature

commit 6ce29a6
Author: Julien Cassette <jcassette@users.noreply.github.com>
Date:   Sat Nov 27 14:36:59 2021 +0100

    Allow to use flexible attributes in duplicate_keys

commit 3fdfaaa
Author: Julien Cassette <jcassette@users.noreply.github.com>
Date:   Sun Nov 21 18:41:06 2021 +0100

    Allow to configure which fields are used to find duplicates

commit 0456c8f
Author: Duncan Overbruck <mail@duncano.de>
Date:   Wed Dec 15 14:32:11 2021 +0100

    test multiple items in test_modify_formatted

commit 795bc2e
Author: Duncan Overbruck <mail@duncano.de>
Date:   Wed Dec 15 14:31:15 2021 +0100

    compile modify templates only once

commit a2030d1
Author: Duncan Overbruck <mail@duncano.de>
Date:   Wed Oct 6 15:52:08 2021 +0200

    changelog: import/modify field formatting

commit 5824d46
Author: Duncan Overbruck <mail@duncano.de>
Date:   Wed Oct 6 15:44:12 2021 +0200

    changelog: rewrite permissions cover art change

commit 819ba73
Author: Duncan Overbruck <mail@duncano.de>
Date:   Wed Oct 6 15:40:03 2021 +0200

    allow templates/formatting of set_fields on import

commit 636e36e
Author: Duncan Overbruck <mail@duncano.de>
Date:   Wed Oct 6 15:14:34 2021 +0200

    allow templates/formatting when setting fields with modify
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants