Valkyrie: Item contents snippet bug #769

Open
2 of 6 tasks
KatharineV opened this issue Aug 21, 2024 · 13 comments

@KatharineV
Collaborator

KatharineV commented Aug 21, 2024

Summary

This ticket tracks discrepancies with existing works deposited prior to the Valkyrie sprint.

Acceptance Criteria

  • see comment
  • On catalog search, if the term matches OCR of any work, highlighted snippets will be displayed in the search results page. Otherwise, item contents shouldn't display at all.

Screenshots or Video

Screenshot of a work with "Item contents" showing complete OCR in the catalog search, yikes haha

Image

Testing Instructions

Testing note: the staging site was only partially reindexed, so most, but not all, existing records were updated. Our assumption is that the cutover will update all works. Newly created works (or collections) should pass QA.

The works must be uploaded with UV turned on in order to get OCR processed.

Turn UV on in the tenant's feature settings within the admin dashboard
Create a work. Attach a multipage PDF. Wait for all of the jobs to complete.
Once the UV is loaded, scroll down to the items section. The file sets should have an ACTIONS drop down where you can select download txt file. You may have to click into the child work if you don't see it.
This should be a file of OCR words that you can search.
Pick a word and use it for a catalog search.
If there's a match, the word should be highlighted in a snippet of the catalog search results page.

Notes

Known remaining issues: #863

  • When there is a match, it should highlight the first one. ❌ (existing prod issue - M3)
  • When there is a match, it also finds more matches than it should. ❌ (existing prod issue - M3)
@KatharineV KatharineV converted this from a draft issue Aug 21, 2024
@laritakr
Contributor

Regarding your second point... The term for Alternative Title was previously used to store the generated slug. This is no longer the case, but we can't show the alternative title on the works because of this prior data.

We could remove the field entirely if you prefer. If we keep it so you can use it in the future, we will need a migration to clean up the existing data. This is why, for now, it is not included on the show views.

@KatharineV
Collaborator Author

@laritakr thanks for that explanation. It makes perfect sense, and it helps me to understand that I am not seeing a bug to be concerned about. I struck out that part of the ticket above. We don't need to modify at this time.

@ShanaLMoore ShanaLMoore added the M1 Milestone 1 label Sep 3, 2024
@ShanaLMoore
Contributor

ShanaLMoore commented Sep 4, 2024

cc @KatharineV for the acceptance criteria, should item contents not be displayed at all, or what do you expect to see?

@KatharineV
Collaborator Author

@ShanaLMoore The way the feature has worked on production previously is that the Item Contents only show in the catalog search after a keyword search, in which case they display with the keyword highlighted in context.

Here's a visual from production after keyword search for "Duluth":
image

And here's a screenshot of items in the catalog search when I just entered the catalog to browse with no keywords. You'll see there are no Item Contents fields showing.
image

One side note: In the first screenshot, the third result is the only one showing item contents. The first two items have been split by Tesseract, so I would expect them to show Item Contents with keywords highlighted in context. Apparently there is a bug blocking them from displaying the field. So I want to mention that they aren't displaying as intended, but the third result is, so that's the one we'd want to emulate. Thanks!

@laritakr laritakr assigned laritakr and unassigned laritakr Sep 12, 2024
@ShanaLMoore ShanaLMoore self-assigned this Sep 16, 2024
@ShanaLMoore
Contributor

ShanaLMoore commented Sep 16, 2024

dev notes

"Item contents" is the label for the file_set_text_tsimv index field. Locally, it is not displaying in catalog search when there is a match.

original implementation. knapsack may be missing a few things?: scientist-softserv/adventist-dl@a5de938#diff-bd4eb77984b740347ff2aa902be664d1aa01addc73964897450d5bd3ff09b3c6

resources aren't using app indexer

To confirm:

are resources getting the following indexed?
file_set_text_tsimv
all_text_timv

hyku has an iiif print helper but isn't including it anywhere. do we need it for the render_ocr_snippet method?

does application controller need it and a render_ocr helper method?
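
A minimal sketch for the "To confirm" check above, assuming the Solr document is available as a plain Hash (how it gets fetched is left out). The helper name `missing_full_text_fields` and the sample document are hypothetical; only the two field names come from these notes.

```ruby
# Hypothetical helper: list the full-text fields that are absent or empty
# in a Solr document represented as a plain Hash.
FULL_TEXT_FIELDS = %w[file_set_text_tsimv all_text_timv].freeze

def missing_full_text_fields(solr_doc, fields = FULL_TEXT_FIELDS)
  fields.select do |field|
    values = Array(solr_doc[field])
    # A field counts as missing when it has no values or only blank ones.
    values.all? { |value| value.to_s.strip.empty? }
  end
end

# Made-up sample document: only one of the two fields was indexed.
doc = { "id" => "abc123", "all_text_timv" => ["SHADE and light"] }
p missing_full_text_fields(doc)
```

An empty array back means both fields were indexed for that resource.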

@ShanaLMoore ShanaLMoore changed the title Valkyrie: Existing works Valkyrie: Item contents snippet bug Sep 17, 2024
@jillpe jillpe moved this from Ready for Development to In Development in Adventist Knapsack Sep 17, 2024
kirkkwang added a commit to samvera/hyku that referenced this issue Sep 17, 2024
This commit will introduce the Hyku::Indexers::FileSetIndexer to add
indexing logic for born digital PDFs when using PDF.js.  We also change
the works' indexing field to match the file sets' indexing field
(all_text_tsimv).  We also "valkyrized" the logic in the HykuIndexing
module to accomplish this.

Ref:
- scientist-softserv/adventist_knapsack#769
@ShanaLMoore ShanaLMoore assigned kirkkwang and unassigned ShanaLMoore Sep 17, 2024
kirkkwang added a commit to samvera/hyku that referenced this issue Sep 18, 2024
kirkkwang added a commit to samvera/hyku that referenced this issue Sep 18, 2024
@ShanaLMoore ShanaLMoore moved this from In Development to Deploy to Staging in Adventist Knapsack Sep 19, 2024
@ShanaLMoore ShanaLMoore moved this from Deploy to Staging to SoftServ QA in Adventist Knapsack Sep 20, 2024
@ShanaLMoore
Contributor

ShanaLMoore commented Sep 20, 2024

QA RESULTS: ✅ PASS

Acceptance Criteria

EMPTY CATALOG SEARCH

tested on STAGING

  • item contents shouldn't display at all.

note: the blank values will be handled in another ticket #819

Image

CATALOG SEARCH OCR MATCH

tested on STAGING

This work produced the following ocr: 2d-txt (2).txt
I searched for "SHADE"

  • On catalog search, if the term matches OCR of any work, highlighted snippets will be displayed in the search results page.

Image

CATALOG SEARCH NO MATCH

tested on STAGING

I searched for RAINBOW

Image

@ShanaLMoore ShanaLMoore added reindex blocked other work must be completed first labels Sep 20, 2024
@ShanaLMoore
Contributor

ShanaLMoore commented Sep 20, 2024

blocked until resolution for error: multiple values encountered for non multiValued field date_issued_tesi: [unknown, ]
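
For context, the `_tesi` suffix has no `m`, so Solr treats the field as single-valued and rejects an array like `[unknown, ]`. The following is only a sketch of one possible way to avoid that, not the fix that was actually applied: collapse the values to the first non-blank one before indexing. The helper name `single_value` is hypothetical.

```ruby
# Hypothetical helper, NOT the fix that was actually applied:
# reduce a list of candidate values to one non-blank value so it can be
# written to a single-valued Solr field such as date_issued_tesi.
def single_value(values)
  Array(values).map { |value| value.to_s.strip }.reject(&:empty?).first
end

solr_doc = {}
# ["unknown", ""] mirrors the two values reported in the error message.
solr_doc["date_issued_tesi"] = single_value(["unknown", ""])
p solr_doc
```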

kirkkwang added a commit that referenced this issue Sep 24, 2024
This commit will add the index field for snippets onto the
CatalogControllerDecorator so ADL can see snippets.  We had to add this
because we remove all the add index fields prior and only add select
ones.  That means we have to manually add this one.

Ref:
- #769
kirkkwang added a commit that referenced this issue Sep 25, 2024
This commit will add the index field for snippets onto the
CatalogControllerDecorator so ADL can see snippets. We had to add this
because we remove all the add index fields prior and only add select
ones. That means we have to manually add this one.

Ref:
- #769

<img width="1032" alt="image"
src="https://github.com/user-attachments/assets/6d260506-0645-4ebf-ad4d-70b31c4ac2e7">
@ShanaLMoore ShanaLMoore removed the blocked other work must be completed first label Sep 25, 2024
@ShanaLMoore ShanaLMoore moved this from SoftServ QA to Client QA in Adventist Knapsack Sep 25, 2024
@KatharineV
Collaborator Author

Appears to be working as expected in several staging tenants

@KatharineV
Collaborator Author

Team, I uploaded a set of PDFs to test the UV with a compound work, and this particular work is reverting back to the OCR in search results bug.

https://adl.s2.adventistdigitallibrary.org/catalog?utf8=%E2%9C%93&locale=en&search_field=all_fields&q=compound+work

I did a keyword search for "compound work" and this work came up, of course, because that's the title. However, the keyword match section is displaying a huge block of irrelevant text. As I understand the feature, it is supposed to show a restricted number of characters, so this string is too much to begin with. Second, the keyword match field should only show if there is a keyword match, and IF there is a match, THEN the search terms would show in the snippet and they would be highlighted.

https://adl.s2.adventistdigitallibrary.org/concern/generic_works/600641e8-8cc7-453c-a12d-c0ad34a027cf?q=compound%20work&parent_query=compound+work

This work doesn't have useful OCR because the PDFs are handwritten. So, I know there is no chance that the OCR actually contains the words Compound and Work. That's why nothing is highlighted in the snippet. The behavior I would expect from this work is a) no keyword match field showing, OR b) keyword match snippet with fewer characters surrounding the highlighted search terms.

Image

@laritakr laritakr added the needs rework issue needs additional work label Oct 3, 2024
@laritakr laritakr moved this from Client QA to Ready for Development in Adventist Knapsack Oct 3, 2024
@laritakr
Contributor

laritakr commented Oct 4, 2024

Another one with an issue is this one: https://adl.s2.adventistdigitallibrary.org/catalog?utf8=%E2%9C%93&search_field=all_fields&q=20088972

It looks like when there is any match to the search term, it includes the snippet text, but it doesn't limit the length unless there are matches in the text itself.

The snippet text should only be shown if there is a match IN the text.
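
To make the expected behavior concrete, here is a rough sketch in plain Ruby. It is an illustration only: the real snippets come from Solr highlighting, and the name `ocr_excerpt` and the `radius` parameter are hypothetical. It returns nothing when the term is not in the text, and otherwise a short excerpt with the first match wrapped in `<em>`.

```ruby
require "cgi"

# Hypothetical illustration of the expected behavior, not the app's code.
# Returns nil when the term does not occur in the OCR text; otherwise
# returns a short, HTML-escaped excerpt around the first match.
def ocr_excerpt(text, term, radius: 40)
  return nil if text.to_s.empty? || term.to_s.strip.empty?

  found = text.match(/#{Regexp.escape(term)}/i)
  return nil unless found # no match IN the text: show no snippet at all

  first  = found.begin(0)
  last   = found.end(0)
  start  = [first - radius, 0].max
  finish = [last + radius, text.length].min

  # Mark the excerpt as truncated on whichever side was cut off.
  prefix = start.positive? ? "..." : ""
  suffix = finish < text.length ? "..." : ""
  before = CGI.escapeHTML(text[start...first])
  after  = CGI.escapeHTML(text[last...finish])

  "#{prefix}#{before}<em>#{CGI.escapeHTML(found[0])}</em>#{after}#{suffix}"
end

p ocr_excerpt("no useful words here", "20088972")
puts ocr_excerpt("The SHADE of the tree", "shade", radius: 4)
```

With a search like "20088972" against OCR that does not contain it, the helper returns nil, so there would be nothing to render under the label.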

@ShanaLMoore
Contributor

TODO: update the logic so that we don't show snippets when they aren't supposed to show.

@ShanaLMoore ShanaLMoore self-assigned this Oct 14, 2024
@ShanaLMoore
Contributor

ShanaLMoore commented Oct 15, 2024

dev notes

This example should show no snippets. I uploaded it locally. The string doesn't match anything in the snippet.

doc['all_text_tsimv'].first.include?('20088972') => false

Relevant PR (it is supposed to return early when there are no snippets):

scientist-softserv/iiif_print#260

  • uv search params should not search if no highlights exist ✅
  • #render_ocr_snippets should check for highlights and not snippets.blank? ✅
  • When there are no snippets the "keyword matches" label still appears. ❌ (M3) [SHANA]
  • When there is a match, it should highlight the first one. ❌ (existing prod issue - M3)
  • When there is a match, it also finds more matches than it should. ❌ (existing prod issue - M3)
  • Thumbnail link does not perform a UV search ❌ (M1)
  • Update Adventist knapsack and make any necessary additional changes (M1)
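
A minimal sketch of the "check for highlights" idea from the list above, assuming the Solr response is a plain Hash with the standard `highlighting` section keyed by document id. This is not the iiif_print implementation; `highlights_for` and `show_keyword_matches?` are hypothetical names, and the sample response is made up.

```ruby
# Hypothetical stand-in for the decision inside #render_ocr_snippets.
# Looks only at Solr's "highlighting" section, never at the stored text,
# so a work that matched on title alone yields no snippets.
def highlights_for(solr_response, doc_id, field = "all_text_tsimv")
  Array(solr_response.dig("highlighting", doc_id, field))
    .reject { |highlight| highlight.to_s.strip.empty? }
end

# True only when Solr returned at least one highlighted fragment.
def show_keyword_matches?(solr_response, doc_id)
  highlights_for(solr_response, doc_id).any?
end

# Made-up response: "a" matched in the OCR text, "b" matched elsewhere.
response = {
  "highlighting" => {
    "a" => { "all_text_tsimv" => ["under the <em>SHADE</em> of a tree"] },
    "b" => {}
  }
}

p show_keyword_matches?(response, "a")
p show_keyword_matches?(response, "b")
```

Gating the label on the same check would also cover the case where "keyword matches" still appears with no snippets under it.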

@ShanaLMoore ShanaLMoore assigned laritakr and unassigned ShanaLMoore Oct 24, 2024
@ShanaLMoore ShanaLMoore removed the needs rework issue needs additional work label Oct 24, 2024
ShanaLMoore added a commit that referenced this issue Oct 31, 2024
# Story

Refs

- #769

# Expected Behavior Before Changes

Snippets didn't work correctly

# Expected Behavior After Changes

- [ ] Search on catalog page performs full text search and shows
highlighted snippets.
- [ ] Title and thumbnail urls carry the search terms through to the
show pages and perform UV search automatically
- [ ] Search with no highlighting opens show page normally.

# Screenshots / Video

<details>
<summary></summary>

### Search on catalog page picks up terms in both full text and other
metadata
![Screenshot 2024-10-25 at 5 30
51 PM](https://github.com/user-attachments/assets/717d5b0b-92d8-4573-8408-825d7305e86d)
### clicking on work automatically searches the UV
![Screenshot 2024-10-25 at 5 31
17 PM](https://github.com/user-attachments/assets/d1963379-041e-4a35-bf5b-169782044634)

</details>

# Notes
@laritakr laritakr moved this from Ready for Development to Deploy to Staging in Adventist Knapsack Oct 31, 2024
@laritakr laritakr moved this from Deploy to Staging to Client QA in Adventist Knapsack Oct 31, 2024
@KatharineV
Collaborator Author

Tested in SDAPI staging and the snippets worked as expected.

Image

Status: Deploy to Production