
Stuck Harvest Fix


The goal is to automate this process so that the system will automatically fix a stuck harvest source after 24 hours (or some amount of fixed time). This was originally done in the geodatagov extension, though the process is outdated and needs to be revamped.

This is set via https://github.com/GSA/catalog.data.gov/blob/main/ckan/setup/ckan.ini#L229-L230. The following is a discussion of how to check on this process, and/or what it would take to render this moot (no dataset harvests are left in a bad state).

Understanding what is stuck

Mostly, harvest jobs get stuck in the fetch stage. To determine what is stuck, consider:

  • Are the harvest-fetch workers idle?
  • For harvest jobs in the Running state, how recent is the gather_finished timestamp?

Review the output of the following query:

SELECT 
harvest_source.title, 
harvest_object.state, 
COUNT(*),
MAX(harvest_object.fetch_started) AS last_fetched,
MAX(harvest_object.gathered) AS last_gathered
FROM harvest_object 
LEFT JOIN harvest_source 
ON harvest_source.id = harvest_object.harvest_source_id 
WHERE 
harvest_job_id IN (SELECT id FROM harvest_job WHERE status = 'Running') 
GROUP BY 
harvest_source.title, 
harvest_object.state
ORDER BY last_gathered DESC;

If any jobs show recent activity, the harvester is probably still running and processing. If a harvest was started but shows no recent activity, it is probably stuck.

Review the /var/log/harvest-fetch.log on catalog-harvest1p. Are harvest objects being processed, or does it seem like the workers are idle?

If the fetch workers are idle and the job shows no recent activity, the job is probably stuck. Otherwise, let the fetch workers continue working, since "unsticking" jobs that might still finish could cause undesired side effects.
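
For a quicker check, the diagnostic query above can be wrapped in a small script that flags sources with no recent activity. This is a minimal sketch only, assuming psycopg2 is installed, a DATABASE_URL environment variable points at the CKAN database, and a 12-hour idle threshold (an example value, not an established cutoff):

# Flags Running harvest sources that have shown no fetch/gather activity
# for longer than STALE_AFTER. Assumes psycopg2 and a DATABASE_URL env var.
import os
from datetime import timedelta

import psycopg2

STALE_AFTER = timedelta(hours=12)  # example threshold

conn = psycopg2.connect(os.environ["DATABASE_URL"])
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT harvest_source.title,
               now() - greatest(MAX(harvest_object.fetch_started),
                                MAX(harvest_object.gathered)) AS idle_for
        FROM harvest_object
        LEFT JOIN harvest_source
          ON harvest_source.id = harvest_object.harvest_source_id
        WHERE harvest_job_id IN
              (SELECT id FROM harvest_job WHERE status = 'Running')
        GROUP BY harvest_source.title
        """
    )
    for title, idle_for in cur.fetchall():
        # idle_for is None when no object has any timestamps yet.
        if idle_for is None or idle_for > STALE_AFTER:
            print(f"Possibly stuck: {title} (idle for {idle_for})")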

"Unsticking" the jobs

Restart Supervisor

Sometimes jobs stall simply because supervisor didn't restart its workers properly, and they are no longer picking up objects from the queue. Only restart supervisor if no objects are being processed; otherwise in-flight objects will be lost and you will have to use the queries below to fix their state (see the log files in /var/log/, check the latest log date). To restart supervisor, simply run sudo supervisorctl restart all. If gather or harvest_run jobs are running (check the log files /var/log/gather-consumer.log and /var/log/harvester_run.log), you can restart just the fetch processes: sudo supervisorctl restart harvest-fetch:*

Manual SQL intervention

To manually force a harvest job to complete, the harvest object records in the DB that were never processed (never marked as complete or error) need to be marked as errored, and an error record needs to be saved for each. The process looks like this:

  1. Create harvest error descriptions for the stuck harvest tasks (fetch or gather tasks).
  2. Mark the stuck harvest tasks as errored.
  3. Mark any harvest jobs that were not started properly as New.
  4. (automated) The harvest run job will see the errored tasks and clean up or restart jobs as appropriate.

Get harvest_job_ids for running jobs

A few of the queries below require a unique harvest_job_id. Below is the query to get the ids by job.

SELECT harvest_source.title,
harvest_object.state,
harvest_object.harvest_job_id,
COUNT(*)
FROM harvest_object
LEFT JOIN harvest_source
ON harvest_source.id = harvest_object.harvest_source_id
WHERE
harvest_job_id IN (SELECT id FROM harvest_job WHERE status = 'Running')
GROUP BY 1, 2, 3;

Create error descriptions for unprocessed/stuck tasks

The SQL for saving the unprocessed records is below (replace the empty harvest_job_id string with the id of the job that needs to be released).

Note: as written, the query below will fail with a non-unique key error on harvest_object_error.id when it inserts more than one row. Each harvest object error record needs its own unique id, which has to be generated by python; it cannot be generated natively via SQL (a sketch is shown below).

-- Creates harvest_object_errors to be shown to the user after the job is complete.
INSERT INTO harvest_object_error ("harvest_object_id", "message", "stage", "id")
SELECT id, 'Unknown error occurred, object did not harvest', 'unknown', 'ERROR_TEST'
FROM harvest_object
WHERE state <> 'COMPLETE'
AND state <> 'ERROR'
AND state <> 'STUCK'
AND harvest_job_id = '';

The above query could be improved by deriving the stage of failure from whichever timestamp is null (gathered, fetch_started, fetch_finished, import_started, import_finished) and using one of the three stage values: Import, Fetch, or Validation.
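
Since the unique id has to come from outside SQL, one option is a small python script that does both the per-row id generation and a rough stage guess. This is a sketch only, assuming psycopg2, a DATABASE_URL environment variable, and a placeholder job id; the stage mapping is one possible interpretation of the note above:

# Inserts a harvest_object_error row (with a generated uuid) for every
# unprocessed harvest object in the given job.
# HARVEST_JOB_ID is a placeholder to fill in.
import os
import uuid

import psycopg2

HARVEST_JOB_ID = "replace-with-the-stuck-job-id"

conn = psycopg2.connect(os.environ["DATABASE_URL"])
with conn, conn.cursor() as cur:
    # Find harvest objects that never reached a terminal state.
    cur.execute(
        """
        SELECT id, fetch_finished, import_finished
        FROM harvest_object
        WHERE state <> 'COMPLETE' AND state <> 'ERROR' AND state <> 'STUCK'
          AND harvest_job_id = %s
        """,
        (HARVEST_JOB_ID,),
    )
    for obj_id, fetch_finished, import_finished in cur.fetchall():
        # Rough stage guess based on which timestamp is missing (see note above).
        if fetch_finished is None:
            stage = "Fetch"
        elif import_finished is None:
            stage = "Import"
        else:
            stage = "Validation"
        cur.execute(
            """
            INSERT INTO harvest_object_error (id, harvest_object_id, message, stage)
            VALUES (%s, %s, %s, %s)
            """,
            (str(uuid.uuid4()), obj_id,
             "Unknown error occurred, object did not harvest", stage),
        )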

Mark stuck tasks as errored

The SQL for forcing a stuck task to complete is below (replace the empty harvest_job_id string with the id of the job that needs to be released). Once the harvester run checks for completed jobs, it will see all the tasks as errored or complete and mark the harvest job complete.

-- Marks stuck tasks as errored for the given harvest job id
UPDATE harvest_object
SET state = 'ERROR'
WHERE state <> 'COMPLETE'
AND harvest_job_id = '';

Eventually these queries could be run on any jobs that have not had any activity in 12 or more hours. This would look something like:

-- Marks all stuck tasks as errored after 12 hours
UPDATE harvest_object 
SET state='ERROR' 
WHERE 
state <> 'COMPLETE' AND state <> 'ERROR' 
AND import_finished IS NULL 
AND fetch_started < now() - '12 hours'::interval;

Mark harvest jobs that never started correctly as New

There are also use cases where a harvest is started, but the harvesters are so backed up that the gather process is never started, and there are no harvest objects to mark as error. At this point, you can either restart the job or mark it as completed. To force the restart, use the following SQL:

UPDATE harvest_job 
SET status = 'New' 
WHERE gather_started IS null
AND status = 'Running' 
AND created < now() - interval '1 day';
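
To move toward the automation goal described at the top of this page, the two cleanups above could be run together on a schedule (e.g. from cron). Below is a minimal sketch, assuming psycopg2 and a DATABASE_URL environment variable; the 12-hour and 1-day thresholds are the example values used above, not settled numbers. Note that the error descriptions from the previous section should be created before objects are marked as errored, so the failures remain visible to the user.

# Scheduled cleanup: error out long-idle harvest objects and reset
# never-started jobs to New. Assumes psycopg2 and a DATABASE_URL env var.
import os

import psycopg2

conn = psycopg2.connect(os.environ["DATABASE_URL"])
with conn, conn.cursor() as cur:
    # Mark harvest objects with no activity in 12+ hours as errored.
    cur.execute(
        """
        UPDATE harvest_object
        SET state = 'ERROR'
        WHERE state <> 'COMPLETE' AND state <> 'ERROR'
          AND import_finished IS NULL
          AND fetch_started < now() - '12 hours'::interval
        """
    )
    print(f"Marked {cur.rowcount} stuck harvest objects as ERROR")

    # Reset Running jobs that never started gathering back to New.
    cur.execute(
        """
        UPDATE harvest_job
        SET status = 'New'
        WHERE gather_started IS NULL
          AND status = 'Running'
          AND created < now() - interval '1 day'
        """
    )
    print(f"Reset {cur.rowcount} never-started harvest jobs to New")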

Redis

Sometimes a job will "choke" the harvester: memory runs out and the harvester crashes. The job then never leaves the gather queue properly, even if it is cancelled through the UI. In this case, you may need to clean up redis manually. First, ssh onto the redis host. You'll need the harvest job id (the CKAN UI should have it) and the redis password, located at /etc/redis/[something].conf. Then run the following command to remove that job from the queue:

$ redis-cli -a password KEYS "*harvest-job-id*" | xargs redis-cli -a password DEL
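
The same cleanup can also be done from python. This is a sketch only, mirroring the redis-cli command above; the host, password, and job id are placeholders, the redis package is assumed to be installed, and SCAN is used instead of KEYS to avoid blocking the server:

# Removes redis keys matching the given harvest job id.
# HOST, PASSWORD, and HARVEST_JOB_ID are placeholders to fill in.
import redis

HARVEST_JOB_ID = "replace-with-the-harvest-job-id"

r = redis.Redis(host="localhost", password="password")
keys = list(r.scan_iter(match=f"*{HARVEST_JOB_ID}*"))
if keys:
    r.delete(*keys)
print(f"Deleted {len(keys)} redis keys for job {HARVEST_JOB_ID}")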