Add Leader-Follower paradigm to Solr-on-ECS #3854

Closed
11 tasks
nickumia-reisys opened this issue Jun 9, 2022 · 9 comments
Labels: component/solr-service (Related to Solr-as-a-Service, a brokered Solr offering), component/ssb, Feature

Comments

@nickumia-reisys
Contributor

nickumia-reisys commented Jun 9, 2022

User Story

In order to serve more traffic load, the Data.gov SSB Team wants to implement a leader-follower paradigm for standalone Solr configurations.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • GIVEN [a contextual precondition]
    [AND optionally another precondition]
    WHEN [a triggering event] happens
    THEN [a verifiable outcome]
    [AND optionally another verifiable outcome]

Background

Parent Issue:

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch

In order to implement this design, we would...

  • Leave all existing code as is
  • Create an n-count of ECS task definitions/services to start up each replica (very similar to current ecs.tf)
  • Create a secondary LB with target groups pointing to each ECS solr replica service (small subset of lb.tf)
  • Create a subdomain on the parent solr-[id number].ssb.data.gov domain and point it to the secondary LB just created
  • Duplicate admin.tf to initialize the admin user/password for each of the replicas by way of the same n-count parameter
    • Alternatively, modify the admin.tf service to optionally update the admin password for each of the replicas as well. (Be careful not to break existing functionality)
  • Output the secondary replica domain/user/pass for the user.
    • Presumably, all replicas will share the same admin user/pass combination
  • Update the bind methodology to add the new binding to every solr
    • In this case, an end user can connect to either the admin Solr or the replica Solr with the same password. If this is undesirable, complicated logic would need to be written to choose between binding to the admin Solr or the replica Solr -- fair warning, this may not be possible within the CSB mentality
  • Make all of the above conditional on the replica parameter being > 0
  • Create a different startup script for a Solr follower, with its own solr.xml or solrconfig.xml
  • Add a Solr startup process that boots up from the AWS backup vault (may not be possible?) or resyncs from the Solr leader
@hkdctol
Contributor

hkdctol commented Jun 9, 2022

We will discuss, but logs show we can proceed with Leader for now.

@nickumia-reisys
Contributor Author

nickumia-reisys commented Jul 1, 2022

The easiest-to-implement, but more resource-intensive, design is to replicate exactly what we have on FCS, which is two versions of the catalog application: (1) Public for end-users, (2) Admin console for harvesting.

In order to implement this design, we would...

  • Leave all existing code as is
  • Create an n-count of EFS volumes for each solr replica desired (very similar to current efs.tf)
  • Create an n-count of ECS task definitions/services to start up each replica (very similar to current ecs.tf)
  • Reference the specific n-count of the EFS volumes by count_id for each of the n-count task definitions
  • Create a secondary LB with target groups pointing to each ECS solr replica service (small subset of lb.tf)
  • Create a subdomain on the parent solr-[id number].ssb.data.gov domain and point it to the secondary LB just created
  • Duplicate admin.tf to initialize the admin user/password for each of the replicas by way of the same n-count parameter
    • Alternatively, modify the admin.tf service to optionally update the admin password for each of the replicas as well. (Be careful not to break existing functionality)
  • Output the secondary replica domain/user/pass for the user.
    • Presumably, all replicas will share the same admin user/pass combination
  • Update the bind methodology to add the new binding to every solr
    • In this case, an end user can connect to either the admin Solr or the replica Solr with the same password. If this is undesirable, complicated logic would need to be written to choose between binding to the admin Solr or the replica Solr -- fair warning, this may not be possible within the CSB mentality
  • Make all of the above conditional on the replica parameter being > 0
  • Create a different startup script for a Solr follower, with its own solr.xml or solrconfig.xml
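
The n-count fan-out above can be sketched in Python to make the shape of the plan concrete. This is only an illustration, not the Terraform that would actually be written: the resource naming scheme, the `replica` subdomain label, and the function names are all hypothetical. The only things taken from the outline are the one-EFS-volume-and-one-ECS-service-per-replica layout, the secondary LB target groups, the subdomain on solr-[id number].ssb.data.gov, and the rule that everything is conditional on the replica count being > 0.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ReplicaPlan:
    """Resources for one Solr follower (all names are hypothetical)."""
    index: int
    efs_volume: str    # one EFS volume per replica
    ecs_service: str   # one ECS task definition/service per replica
    target_group: str  # target group attached to the secondary LB


def plan_replicas(instance_id: str, replica_count: int) -> List[ReplicaPlan]:
    """Return one plan entry per follower; empty when replica_count is 0."""
    if replica_count < 0:
        raise ValueError("replica_count must be >= 0")
    return [
        ReplicaPlan(
            index=i,
            efs_volume=f"solr-{instance_id}-follower-{i}-efs",
            ecs_service=f"solr-{instance_id}-follower-{i}",
            target_group=f"solr-{instance_id}-follower-{i}-tg",
        )
        for i in range(replica_count)
    ]


def replica_domain(instance_id: str, replica_count: int) -> Optional[str]:
    """Subdomain for the secondary LB; None when no replicas are requested."""
    if replica_count <= 0:
        return None
    # "replica" is an assumed label; the outline only says "a subdomain".
    return f"replica.solr-{instance_id}.ssb.data.gov"
```

With a replica count of 0, both functions return nothing to create, which mirrors the "leave all existing code as is" requirement.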

PRs are needed for both the datagov-brokerpak-solr repo and the catalog.data.gov repo.

I am against converting the existing solr-on-ecs plan to be the leader-follower plan. However, if we wanted to update existing Solrs with the new functionality, this is the only way. I believe this leader/follower paradigm should be a separate CSB plan, in which case we would provision a separate service to replace the existing one. If we don't believe we'll ever have a use case for both plans, updating the existing one is acceptable.

To ensure the above paragraph does not cause confusion, I'm not suggesting a new repository; "a separate CSB plan" is merely a new service directory in the existing datagov-brokerpak-solr repo. It is a point of discussion though.

@nickumia-reisys
Contributor Author

The above outline is just the setup for Solr. Effort would need to be put into modifying the manifest.yml to duplicate all of the configuration and code to support a secondary catalog.

  • The manifest.yml
  • The .profile
  • The vars.[everything].yml
  • The startup scripts
  • No need for a proxy (but can still use the same proxy if necessary)
  • The create-cloudgov-services.sh to create a secondary DB/Redis (not solr)
  • ...et cetera

@nickumia-reisys
Contributor Author

Future work (maybe): https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service_definition_parameters.html#sd-networkconfiguration

If your service's tasks take a while to start and respond to health checks, you can specify a health check grace period of up to 2,147,483,647 seconds during which the ECS service scheduler ignores the health check status. This grace period can prevent the ECS service scheduler from marking tasks as unhealthy and stopping them before they have time to come up.
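
As a small sanity check on the number quoted above: 2,147,483,647 is 2**31 - 1, the largest signed 32-bit integer. The sketch below validates a requested grace period against that limit before it is passed to the ECS service definition parameter `healthCheckGracePeriodSeconds`. The helper name and the clamping behavior are assumptions for illustration only.

```python
# Upper bound quoted from the ECS documentation above; equals 2**31 - 1.
MAX_GRACE_PERIOD_SECONDS = 2_147_483_647


def clamp_grace_period(seconds: int) -> int:
    """Clamp a requested health check grace period to the allowed range."""
    if seconds < 0:
        raise ValueError("grace period must be non-negative")
    return min(seconds, MAX_GRACE_PERIOD_SECONDS)
```

For a Solr follower that has to resync its index on startup, a value of several minutes (for example `clamp_grace_period(600)`) would probably be more realistic than anything near the maximum.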

@FuhuXia
Member

FuhuXia commented Jul 18, 2022

Changes to solrconfig.xml to enable replication.
https://solr.apache.org/guide/8_7/index-replication.html
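
Once replication is enabled, the replication handler described in that guide can be queried over HTTP to check on a follower. A minimal sketch for building those URLs follows. The base URL and the core name `ckan` used in the usage note are assumptions, and nothing here makes a network call. `details`, `indexversion`, and `fetchindex` are commands documented for the handler.

```python
from urllib.parse import urlencode


def replication_url(base_url: str, core: str, command: str) -> str:
    """Build a URL for a core's /replication handler (no request is made)."""
    query = urlencode({"command": command, "wt": "json"})
    return f"{base_url.rstrip('/')}/solr/{core}/replication?{query}"
```

For example, `replication_url("https://follower.example.gov", "ckan", "details")` gives a URL that reports replication status for that follower, and the `fetchindex` command asks a follower to pull the index from the leader.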

@nickumia-reisys
Contributor Author

This should be sufficiently complete now. The lists below categorize the above work for future reference (the sub-bullets are mostly minor, sometimes breaking, fixes to the parent bullet).

Infrastructure Changes:

Solr/CKAN Configuration Changes:

PR that created appropriate tests for the above changes:

Each PR has specific information about what needed to be changed or added to get us further. The highest level design consists of:

  • One Solr Leader + (EFS/Ephemeral) Storage
  • Zero or more Solr Followers + (EFS/Ephemeral) Storage
  • One access point to communicate with leader
  • One access point that distributes traffic among followers
  • One access point per follower to communicate with a specific follower
  • Custom configuration to set up Solr Leader
  • Custom configuration to set up Solr Follower
  • Separate EFS Volume configuration options for Leader/Follower

[image: diagram of the design]

@hkdctol
Contributor

hkdctol commented Aug 4, 2022

@nickumia-reisys, is the diagram above available for re-use? @mogul wanted to use it for SSB documentation.

hkdctol closed this as completed Aug 4, 2022
@nickumia-reisys
Contributor Author

Unfortunately, it's just the screenshot. I made it with a random online flowchart tool and didn't have a way of saving it without making an account. If it can be used as is, I see no problem adding it to the SSB documentation.

@mogul
Contributor

mogul commented Aug 5, 2022

No worries, I think I can recreate it pretty easily.
