Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Task]: Implement query filtering to the search endpoint #1455

Closed
1 task
chouinar opened this issue Mar 8, 2024 · 3 comments · Fixed by #1514
Closed
1 task

[Task]: Implement query filtering to the search endpoint #1455

chouinar opened this issue Mar 8, 2024 · 3 comments · Fixed by #1514
Assignees
Labels
project: grants.gov Grants.gov Modernization tickets

Comments

@chouinar
Copy link
Collaborator

chouinar commented Mar 8, 2024

Summary

As part of the search endpoint we are building out, we want to support a generic text box that a user can populate and it queries several different fields in the database. However, exactly how this works has to be sorted out and then implemented.

From the designs, it seems we want it to roughly query against the following (but I want to confirm this):

  • Assistance listing numbers (formerly CFDAs) -exact
  • Assistance listing program names?
  • Opportunity title - contains
  • Opportunity number - contains
  • Summary description - contains
  • Agency code - contains
  • Agency name - not sure - might not have the right value to query against
  • Anything else?

Additionally, do we need to consider anything regarding how the queries work against the DB?

  • Only support contains / exact match (case-insensitive)?
  • Something with regexes?
  • What happens if a user enters health 05.005 - this likely means they want to search health against one field (like the description), but search the numeric value against the assistance listing. Is this something we need to support now?

I'd prefer to keep this somewhat simple now, as a search index in the future just does a lot of this and it likely is better to build that out than spend too much time on this approach.

Acceptance criteria

  • Query filtering is implemented as agreed
@chouinar chouinar added the project: grants.gov Grants.gov Modernization tickets label Mar 8, 2024
@chouinar chouinar added this to the Search API - API development milestone Mar 8, 2024
@chouinar
Copy link
Collaborator Author

chouinar commented Mar 8, 2024

It might also be good to get some product thought on this - there are technically some very elaborate ways of building out a query approach in a Postgres database, but they would require a lot of setup. Is it better to go very simple now so that we can get to building out a search index sooner, or build something more elaborate here now?

@andycochran
Copy link
Collaborator

@chouinar / @crystabelrangel, y'all thought this thru and made some decisions, right? Can you document them in this ticket?

@crystabelrangel
Copy link
Collaborator

Assistance listing numbers (formerly CFDAs) - exact match. Partial may not make sense for now e.g. if someone searches 01 then they may get too many
Assistance listing program names - contains
Opportunity title - contains
Opportunity number - contains
Summary description - contains
Agency code - partial
Anything else?
Will not have Agency name
if someone types [keyword] [opp num] it will not return anything. This will be handled with search index

chouinar added a commit that referenced this issue Mar 18, 2024
## Summary
Fixes #1167

### Time to review: __15 mins__

## Changes proposed
Added filtering to the search endpoint, includes all but the query box
parameter which has its own follow-up ticket

Added utilities to help generate the search filter schema

Added indexes to improve the performance of search (see additional info
below for details)

Extensive additions to the tests

Added the ability to choose examples on the OpenAPI docs (included an
example with no filters, and one with many)

Fixed a bug in the Paginator for handling counts (will follow-up and fix
in the template repo)

## Context for reviewers
This change has been extensively tested, manually, and through an
enormous amount of new unit tests. As the change was already getting
quite large, a few things will be dealt with in follow-up tickets:
* Query filtering: #1455
* Fixing logging formatting:
#1466
* Additional order_by fields:
#1467

For the filters, they're all `one_of` filters which means that only one
of the supplied values needs to match for it to pass the where clause
(literally the where clauses generate as `where table.column in (1, 2,
3)`). You can see an example query below.

The agency filter is a bit odd as I made it a `startswith` style filter
instead to handle the way agency codes get nested. We may want to adjust
this further in the future, but this will at least technically handle
hierarchies of agencies right now.

## Additional information
I extensively tested the performance of the queries we run. I locally
loaded in ~11k records using our factories (ran the `seed-local-db`
script 300 times). With the API functioning, I make SQLAlchemy output
the queries it ran and did an `EXPLAIN ANALYZE ...` on the big ones. I
then added several indexes which improved the performance.

The primary query of the API looks like this:
```sql
SELECT
	opportunity.opportunity_id,
	opportunity.opportunity_number,
	opportunity.opportunity_title,
	opportunity.agency,
	opportunity.opportunity_category_id,
	opportunity.category_explanation,
	opportunity.is_draft,
	opportunity.revision_number,
	opportunity.modified_comments,
	opportunity.publisher_user_id,
	opportunity.publisher_profile_id,
	opportunity.created_at,
	opportunity.updated_at
FROM
	opportunity
	JOIN current_opportunity_summary ON opportunity.opportunity_id = current_opportunity_summary.opportunity_id
	JOIN opportunity_summary ON current_opportunity_summary.opportunity_summary_id = opportunity_summary.opportunity_summary_id
	JOIN link_opportunity_summary_funding_instrument ON opportunity_summary.opportunity_summary_id = link_opportunity_summary_funding_instrument.opportunity_summary_id
	JOIN link_opportunity_summary_funding_category ON opportunity_summary.opportunity_summary_id = link_opportunity_summary_funding_category.opportunity_summary_id
	JOIN link_opportunity_summary_applicant_type ON opportunity_summary.opportunity_summary_id = link_opportunity_summary_applicant_type.opportunity_summary_id
WHERE
	opportunity.is_draft IS FALSE
	AND(EXISTS (
			SELECT
				1 FROM current_opportunity_summary
			WHERE
				opportunity.opportunity_id = current_opportunity_summary.opportunity_id))
	AND current_opportunity_summary.opportunity_status_id IN(1,2)
	AND link_opportunity_summary_funding_instrument.funding_instrument_id IN(1,2)
	AND link_opportunity_summary_funding_category.funding_category_id IN(1,3,20)
	AND link_opportunity_summary_applicant_type.applicant_type_id IN(1, 2, 13)
	AND((opportunity.agency ILIKE 'US-ABC%')
	OR(opportunity.agency ILIKE 'HHS%'))
ORDER BY
	opportunity.opportunity_id DESC
LIMIT 25 OFFSET 25
```

Without any of the new indexes, `EXPLAIN ANALYZE` gives this a cost of
~1100 (non-specific unit). With the new indexes it becomes ~800. The
actual runtime of these queries is in the 5-10ms range with or without
the indexes, so it's minor either way. Note that querying the API
locally, this gives response times of 50-150ms (slower initially before
caching likely takes hold). Also if we're just filtering by something
like opportunity status, then the costs are around 10-15.
See:
https://www.postgresql.org/docs/current/using-explain.html#USING-EXPLAIN-ANALYZE

---------

Co-authored-by: nava-platform-bot <platform-admins@navapbc.com>
@chouinar chouinar self-assigned this Mar 18, 2024
chouinar added a commit that referenced this issue Mar 22, 2024
## Summary
Fixes #1455

### Time to review: __10 mins__

## Changes proposed
Implements logic for the query field in the search API

Updated the seed script to allow you to generate more opportunities
locally

## Context for reviewers
This logic is deliberately very, very simple at the moment. It basically
just does a text contains check across several columns. This is the most
naive approach, and it is expected that if you tried to search two
separate words at the same time, you probably won't get any results.

A few options we may consider later (depending on the level of effort we
want to put into a non-search-index approach):
- Postgres text search / tsvectors:
https://www.postgresql.org/docs/current/textsearch.html
- Postgres text similarity:
https://www.postgresql.org/docs/current/pgtrgm.html

## Additional information
The `ids` function I added for the tests was because there are a lot of
tests and when test `16` fails, its really hard to figure out which one
that is. Now it outputs the request+expected results:
![Screenshot 2024-03-21 at 10 47 34
AM](https://github.com/HHS/simpler-grants-gov/assets/46358556/a7732375-00a8-4e4f-8e16-c00114a80342)

---------

Co-authored-by: nava-platform-bot <platform-admins@navapbc.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
project: grants.gov Grants.gov Modernization tickets
Projects
Development

Successfully merging a pull request may close this issue.

3 participants