update ATspectrograph failed doc and created ATcamera-recovery doc
isotuela authored and JackieS-NL committed May 27, 2024
1 parent b6a5671 commit a169ff8
Showing 6 changed files with 221 additions and 57 deletions.
@@ -0,0 +1,160 @@
.. This is a template for troubleshooting when some part of the observatory enters an abnormal state. This comment may be deleted when the template is copied to the destination.
.. Review the README in this procedure's directory on instructions to contribute.
.. Static objects, such as figures, should be stored in the _static directory. Review the _static/README in this procedure's directory on instructions to contribute.
.. Do not remove the comments that describe each section. They are included to provide guidance to contributors.
.. Do not remove other content provided in the templates, such as a section. Instead, comment out the content and include comments to explain the situation. For example:
- If a section within the template is not needed, comment out the section title and label reference. Include a comment explaining why this is not required.
- If a file cannot include a title (surrounded by hash marks (#)), comment out the title from the template and include a comment explaining why this is implemented (in addition to applying the ``title`` directive).
.. Include one Primary Author and list of Contributors (comma separated) between the asterisks (*):
.. |author| replace:: *Tony Johnson*
.. If there are no contributors, write "none" between the asterisks. Do not remove the substitution.
.. |contributors| replace:: *Erik Dennihys*

.. This is the label that can be used for cross-referencing this procedure.
.. Recommended format is "Directory Name"-"Title Name" -- Spaces should be replaced by hyphens.
.. _Templates-Title-of-Troubleshooting-Procedure:
.. Each section should include a label for cross referencing to a given area.
.. Recommended format for all labels is "Title Name"-"Section Name" -- Spaces should be replaced by hyphens.
.. To reference a label that isn't associated with an reST object such as a title or figure, you must give the link an explicit title using the syntax :ref:`link text <label-name>`.
.. An error will alert you of identical labels during the build process.
#########################
AT camera recovery
#########################


.. _Title-of-Troubleshooting-Procedure-Overview:

Overview
========

.. In one or two sentences, explain when this troubleshooting procedure needs to be used. Describe the symptoms that the user sees to use this procedure.
The camera is designed to go into the ``FAULT`` state whenever a limit (temperature, voltage, current, etc.)
goes out of tolerance (for limits there is typically a warning range before a hard error occurs),
or if some unexpected failure occurs during camera operation. Once the camera goes into the ``FAULT`` state,
it is necessary to diagnose the problem, fix it, and then put the camera back into the ``ENABLED`` state
before operations can resume. This document describes the general procedure for doing this,
and will document any known common failure modes.
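
The relationship between the warning range and the hard limit can be pictured with a small sketch. The ``classify`` function, the level names, and the threshold values below are hypothetical and are not taken from the CCS configuration; they only illustrate that a monitored value passes through a warning band before it reaches the level that would put the camera in ``FAULT``.

.. code-block:: python

    from enum import Enum


    class AlertLevel(Enum):
        NOMINAL = "nominal"
        WARNING = "warning"
        ALARM = "alarm"  # in this sketch, the level that would drive FAULT


    def classify(value, warn_low, warn_high, alarm_low, alarm_high):
        """Classify a monitored value against hypothetical limits.

        Assumes alarm_low <= warn_low <= warn_high <= alarm_high.
        """
        if value < alarm_low or value > alarm_high:
            return AlertLevel.ALARM
        if value < warn_low or value > warn_high:
            return AlertLevel.WARNING
        return AlertLevel.NOMINAL


    # Hypothetical temperature limits in degrees Celsius (illustration only).
    limits = dict(warn_low=-100.0, warn_high=-90.0, alarm_low=-105.0, alarm_high=-85.0)
    for reading in (-95.0, -88.0, -80.0):
        print(reading, classify(reading, **limits).value)
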

This article was triggered by `OBS-97`_ - The LATISS camera got timeout from REB IN PROGRESS on 28 February 2023,
but is more general than that specific incident.

.. _OBS-97: https://rubinobs.atlassian.net/browse/OBS-97


.. Following note was below in the original page https://confluence.lsstcorp.org/display/OOD/ATCamera+Recovering+from+Fault+state
.. note::
The instructions below assume:
#. The ability to login to the AuxTel CCS computers,
#. Some familiarity with basic CCS commands/functionality.

We need a separate document to provide this background information since it will need to be referred to from multiple places.


.. _Title-of-Troubleshooting-Procedure-Error-Diagnosis:

Error diagnosis
===============

.. This section should provide simple overview of known or suspected causes for the error.
.. It is preferred to include them as a bulleted or enumerated list.
.. Post screenshots of the error state or relevant tracebacks.
.. Added error diagnosis
- ATCamera goes to the ``FAULT`` state.

.. _Title-of-Troubleshooting-Procedure-Procedure-Steps:

Procedure Steps
===============
.. This section should include the procedure. There is no strict formatting or structure required for procedures. It is left to the authors to decide which format and structure is most relevant.
.. In the case of more complicated procedures, more sophisticated methodologies may be appropriate, such as multiple section headings or a list of linked procedures to be performed in the specified order.
.. For highly complicated procedures, consider breaking them into separate procedure. Some options are a high-level procedure with links, separating into smaller procedures or utilizing the reST ``include`` directive <https://docutils.sourceforge.io/docs/ref/rst/directives.html#include>.
.. In general the steps are:
#. Identify which **CCS subsystem triggered** the problem.
#. Review the **raised alerts and/or log files**, and determine whether:

   #. this was a transitory problem which can be documented (via JIRA ticket) and reset, or
   #. it is something which requires a camera expert to diagnose.

#. Clear the raised alerts in both the CCS subsystem which triggered the problem and the Master Control Module (MCM), which tracks the overall camera state.
#. Clear the fault in the ocs-bridge, and switch it back to ``OFFLINE_AVAILABLE`` mode.

.. note::
In either case it is important that an **OBS ticket** be created so we can track how often specific problems occur, and whether software or hardware changes are needed to prevent future occurrences.

Specific CCS commands for performing these operations are documented below.


.. _Title-of-Troubleshooting-Procedure-tracking-down-a-CSC-problem:


Tracking down a CCS problem
--------------------------------
In general there are two approaches to tracking down a CCS problem:
using the **ccs-shell** command line tool, or using the **ccs-console** graphical interface.
Currently only the first approach is described.



.. warning::
Pending **TODO**: Simulate a fault and verify these commands are correct (perhaps on TTS) (plus highlight responses)

.. this note was added to be able to copy the commands without cs. BUT UNCERTAIN WHETHER IT'S CORRECT OR NEEDED.
.. admonition:: Important

   The following commands are entered at the ``ccs>`` prompt.

#. Identify which CCS subsystem triggered the problem:

   .. code-block:: bash

      ats-mcm getRaisedAlertSummary

#. Review the raised alerts and log files:

   .. code-block:: bash

      ats-fp getRaisedAlertSummary

#. Clear the alerts:

   .. code-block:: bash

      ats-fp clearAllAlerts
      ats-fp getRaisedAlertSummary
      ats-mcm clearAllAlerts
      ats-mcm getRaisedAlertSummary

#. Clear the ocs-bridge:

   .. code-block:: bash

      ats-ocs-bridge clearFault
      ats-ocs-bridge setAvailable

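
The sequence above can also be kept as a small printable checklist. The sketch below only formats text; it does not talk to CCS. The command strings are transcribed from the steps above (with ``ats-fp`` as the subsystem, as written there), while ``RECOVERY_STEPS`` and ``format_checklist`` are hypothetical names introduced for illustration.

.. code-block:: python

    # Ordered recovery checklist, transcribed from the steps in this section.
    RECOVERY_STEPS = [
        ("Identify which CCS subsystem triggered the problem",
         ["ats-mcm getRaisedAlertSummary"]),
        ("Review the raised alerts and log files",
         ["ats-fp getRaisedAlertSummary"]),
        ("Clear the alerts",
         ["ats-fp clearAllAlerts", "ats-fp getRaisedAlertSummary",
          "ats-mcm clearAllAlerts", "ats-mcm getRaisedAlertSummary"]),
        ("Clear the ocs-bridge",
         ["ats-ocs-bridge clearFault", "ats-ocs-bridge setAvailable"]),
    ]


    def format_checklist(steps, prompt="ccs>"):
        """Render the steps as numbered text with the commands under each step."""
        lines = []
        for number, (title, commands) in enumerate(steps, start=1):
            lines.append(f"{number}. {title}")
            lines.extend(f"   {prompt} {command}" for command in commands)
        return "\n".join(lines)


    print(format_checklist(RECOVERY_STEPS))

If a different subsystem raised the alert, the ``ats-fp`` entries would presumably be replaced by that subsystem's name; this is an assumption and should be confirmed with a camera expert.
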
.. _Title-of-Troubleshooting-Procedure-Post-Condition:

Post-Condition
==============

.. This section should provide a simple overview of conditions or results after executing the procedure; for example, state of equipment or resulting data products.
.. It is preferred to include them as a bulleted or enumerated list.
.. Please provide screenshots of the software status or relevant display windows to confirm.
.. Do not include actions in this section. Any action by the user should be included in the end of the Procedure section below. For example: Do not include "Verify the telescope azimuth is 0 degrees with the appropriate command." Instead, include this statement as the final step of the procedure, and include "Telescope is at 0 degrees." in the Post-condition section.
- AT Camera can now be set to the ``ENABLED`` state.


.. _Title-of-Troubleshooting-Procedure-Contingency:

Contingency
===========

If the procedure was not successful, report the issue in `#summit_auxtel`_ and/or activate the :ref:`Out of hours support <Safety-out-of-hours-support>`.

.. _#summit_auxtel: https://lsstc.slack.com/archives/C01K4M6R4AH

@@ -24,18 +24,14 @@
ATSpectrograph failed - grating stage position and timed out
####################################################################################################

.. note::
This is a procedure template file that is associated with a template directory structure. This note should be deleted when the section is properly populated.

.. _Title-of-Troubleshooting-Procedure-Overview:

Overview
========

ATspectrograph failed during checkout_latiss after the power cut off due to the UPS failure issue at AuxTel on Nov. We have seen this error after hard resets of the ATSpectrograph.

.. In one or two sentences, explain when this troubleshooting procedure needs to be used. Describe the symptoms that the user sees to use this procedure.
The ATSpectrograph failed during ``latiss_checkout`` after the power cut caused by the UPS failure at AuxTel in November 2023. We have seen this error after hard resets of the ATSpectrograph.

.. _Title-of-Troubleshooting-Procedure-Error-Diagnosis:

@@ -46,86 +42,92 @@ Error diagnosis
.. It is preferred to include them as a bulleted or enumerated list.
.. Post screenshots of the error state or relevant tracebacks.
#. Check the error message from checkout_latiss.py or another script.
#. It appears that the encoder position is lost on a reset,
as the linear stage reported being at position -324mm
but it had not moved since the reset,
#. Check the **error message** from ``latiss_checkout.py`` or another script.
#. It appears that the **encoder position is lost** on a reset: the linear stage reported being at position -324 mm although it had not moved since the reset, suggesting that the encoder position was incorrectly initialized. When the ATSpectrograph is first initialized after a power reset and a move is commanded (such as during ``latiss_checkout``, when it tried to position the stage at +67 mm), the stage cannot arrive in position because the encoder position is incorrect, and the move times out.
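
The way a move that never completes turns into a timeout can be shown with a toy model. This is not the ATSpectrograph CSC or ``salobj`` code; ``move_stage`` and ``arrived`` are hypothetical stand-ins, and the only real API used is ``asyncio.wait_for``, which raises a timeout error when the awaited task does not finish in time.

.. code-block:: python

    import asyncio


    async def move_stage(arrived: asyncio.Event) -> None:
        """Stand-in for a move command: returns only once the stage is in position."""
        await arrived.wait()


    async def main() -> str:
        arrived = asyncio.Event()  # never set: models a stage that never arrives
        try:
            await asyncio.wait_for(move_stage(arrived), timeout=0.1)
        except asyncio.TimeoutError:
            return "timed out"
        return "in position"


    result = asyncio.run(main())
    print(result)
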


.. code-block:: text
    :caption: Error message

    2023/11/03 17:17:52 TAI
    Error in run
    Traceback (most recent call last):
      File "/opt/lsst/software/stack/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/asyncio/tasks.py", line 500, in wait_for
        return fut.result()
               ^^^^^^^^^^^^
      File "/opt/lsst/software/stack/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/site-packages/lsst/ts/salobj/topics/remote_command.py", line 239, in _get_next_ackcmd
        await self._next_ack_event.wait()
      File "/opt/lsst/software/stack/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/asyncio/locks.py", line 213, in wait
        await fut
    asyncio.exceptions.CancelledError

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "/opt/lsst/software/stack/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/site-packages/lsst/ts/salobj/topics/remote_command.py", line 189, in next_ackcmd
        ackcmd = await self._wait_task
                 ^^^^^^^^^^^^^^^^^^^^^
      File "/opt/lsst/software/stack/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/site-packages/lsst/ts/salobj/topics/remote_command.py", line 214, in _basic_next_ackcmd
        ackcmd = await asyncio.wait_for(
                 ^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/lsst/software/stack/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/asyncio/tasks.py", line 502, in wait_for
        raise exceptions.TimeoutError() from exc
    TimeoutError

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/opt/lsst/software/stack/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/site-packages/lsst/ts/salobj/base_script.py", line 603, in do_run
        await self._run_task
      File "/net/obs-env/auto_base_packages/ts_standardscripts/python/lsst/ts/standardscripts/auxtel/daytime_checkout/latiss_checkout.py", line 110, in run
        await self.latiss.setup_instrument(
      File "/net/obs-env/auto_base_packages/ts_observatory_control/python/lsst/ts/observatory/control/auxtel/latiss.py", line 176, in setup_instrument
        await self.setup_atspec(
      File "/net/obs-env/auto_base_packages/ts_observatory_control/python/lsst/ts/observatory/control/auxtel/latiss.py", line 242, in setup_atspec
        await asyncio.gather(*setup_coroutines)
      File "/opt/lsst/software/stack/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/site-packages/lsst/ts/salobj/topics/remote_command.py", line 416, in set_start
        return await self.start(timeout=timeout, wait_done=wait_done)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/lsst/software/stack/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/site-packages/lsst/ts/salobj/topics/remote_command.py", line 487, in start
        return await cmd_info.next_ackcmd(timeout=timeout)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/lsst/software/stack/conda/envs/lsst-scipipe-7.0.1/lib/python3.11/site-packages/lsst/ts/salobj/topics/remote_command.py", line 205, in next_ackcmd
        raise base.AckTimeoutError(
    lsst.ts.salobj.base.AckTimeoutError: msg='Timed out waiting for command acknowledgement', ackcmd=(ackcmd private_seqNum=1137560160, ack=<SalRetCode.CMD_NOACK: -301>, error=0, result='No command acknowledgement seen')

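
When scanning script logs for this failure, the last line of the error message is the most recognizable part. The following is a minimal sketch, assuming the log line has the same shape as the one recorded here; ``parse_ack`` and ``is_ack_timeout`` are hypothetical helpers, not part of ``salobj``. An acknowledgement timeout on its own does not prove the encoder problem, since other causes can produce the same signature.

.. code-block:: python

    import re

    # Final line of the error message recorded in this document.
    ERROR_LINE = (
        "lsst.ts.salobj.base.AckTimeoutError: "
        "msg='Timed out waiting for command acknowledgement', "
        "ackcmd=(ackcmd private_seqNum=1137560160, "
        "ack=<SalRetCode.CMD_NOACK: -301>, error=0, "
        "result='No command acknowledgement seen')"
    )

    ACK_PATTERN = re.compile(r"ack=<SalRetCode\.(?P<name>\w+): (?P<code>-?\d+)>")


    def parse_ack(line):
        """Return (ack name, ack code) from a log line, or None if absent."""
        match = ACK_PATTERN.search(line)
        if match is None:
            return None
        return match["name"], int(match["code"])


    def is_ack_timeout(line):
        """True if the line carries the signature seen in this failure."""
        return "AckTimeoutError" in line and parse_ack(line) == ("CMD_NOACK", -301)


    print(parse_ack(ERROR_LINE))
    print(is_ack_timeout(ERROR_LINE))
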
.. _Title-of-Troubleshooting-Procedure-Procedure-Steps:

Procedure Steps
===============

.. todo::
Make sure everything is in a safe or idle state before troubleshooting. Describe relevant safety steps if necessary.

.. This section should include the procedure. There is no strict formatting or structure required for procedures. It is left to the authors to decide which format and structure is most relevant.
.. In the case of more complicated procedures, more sophisticated methodologies may be appropriate, such as multiple section headings or a list of linked procedures to be performed in the specified order.
.. For highly complicated procedures, consider breaking them into separate procedure. Some options are a high-level procedure with links, separating into smaller procedures or utilizing the reST ``include`` directive <https://docutils.sourceforge.io/docs/ref/rst/directives.html#include>.
#. Step 1 for Condition A.
#. step 2 to test

.. warning::
.. For this example, this step is critical.
.. _Title-of-Troubleshooting-Procedure-Critical-Step-1:

#. Power cycle the ATSpectrograph
.. the link below gets an "This site can’t be reached"
#. Power cycle the ATSpectrograph
#. Check that the ATSpectrograph CSC is in the ``STANDBY`` state.

#. Connect to :ref: aux-pdu-spectrograph.cp.lsst.org
.. `[http://aux-pdu-spectrograph.cp.lsst.org/] <http://aux-pdu-spectrograph.cp.lsst.org/>`
fet
#. Connect to ``http://aux-pdu-spectrograph.cp.lsst.org/``
#. Log in with the username and password stored in the AuxTel 1Password vault.
#. Click On/Off button for **Outlet 2** only.

#. Click On/Off button for Outlet 2 only.


.. IMAGE HERE *****************
.. image:: ./_static/1_power_cycle_ATSpec.png
:width: 700px

#. THIS SHOULD BE STEP 2
#. Connect to AuxTel EUI desktop
#. **Install Microsoft Remote Desktop**
#. Setup to connect to AuxTel EUI desktop.
#. Click '+' button on the top menu.
#. Select "Add PC" from drop-down menu.
#. Click the drop-down menu "Gateway" > "Add Gateway"

#. Step 4 has two branches, but Step 5 is independent of Step 4.
.. image:: ./_static/2_connect_auxtel_EUI_part1.png
:width: 700px

a. If Condition A, do the following action in :ref:`Condition A Instructions <Title-of-Troubleshooting-Procedure-Condition-A-for-Step-4>`.
#. Enter ``aux-brick01.cp.lsst.org`` as the Gateway name.

b. If Condition B, do the following action in :ref:`Condition B instructions <Title-of-Troubleshooting-Procedure-Condition-B-for-Step-4>`.
.. image:: ./_static/3_connect_auxtel_EUI_part2.png
:width: 700px

.. _Title-of-Troubleshooting-Procedure-Final-Step:
#. **Log in with the username and password for ATMCS/ATSpectrograph/ATDome EUI access**, found in the 1Password vault.

#. Complete the procedure's final step.
#. **Open a new tab or window in the web browser on the remote desktop.**

#. Connect to ``139.229.170.44:8000/Spectrograph.html``

.. _Title-of-Troubleshooting-Procedure-Condition-A-for-Step-4:
.. image:: ./_static/4_ACE_spec_EUI_Labview.png
:width: 700px

Condition A for Step 4
----------------------
#. Click "**Re-init Axes**" Button on the EUI

This is an example of a sub-section, used when Condition A applied. Complete the steps in this section:
.. #. This clears the encoder position and should return it to its home position near 0mm.
#. Step 1 for Condition A.
#. Return to :ref:`Step 5 <Title-of-Troubleshooting-Procedure-Final-Step>` in the section above.
.. _Title-of-Troubleshooting-Procedure-Condition-B-for-Step-4:

Condition B for Step 4
----------------------

This is an example of a sub-section, used when Condition B applied. Complete the steps in this section:

#. Step 1 for Condition B.
#. Return to :ref:`Step 5 <Title-of-Troubleshooting-Procedure-Final-Step>` in the section above.
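
If a page in the steps above does not load in the browser, it can help to check whether the host accepts connections at all from the machine where the browser is running. The sketch below uses only the Python standard library; ``can_connect`` is a hypothetical helper, and the ports are an assumption read off the URLs in the steps (80 for the ``http://`` PDU address, 8000 for the EUI address).

.. code-block:: python

    import socket


    def can_connect(host, port, timeout=3.0):
        """Return True if a TCP connection to host:port succeeds within timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            # Covers refused connections, timeouts, and name-resolution failures.
            return False


    # Targets taken from the URLs in the procedure; ports are assumptions.
    TARGETS = [
        ("aux-pdu-spectrograph.cp.lsst.org", 80),
        ("139.229.170.44", 8000),
    ]


    def report(targets=TARGETS):
        """Print one reachability line per target."""
        for host, port in targets:
            status = "reachable" if can_connect(host, port) else "NOT reachable"
            print(f"{host}:{port} {status}")

Calling ``report()`` prints one line per target. A "NOT reachable" result from a machine outside the summit network is expected and says nothing about the hardware itself.
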
.. _Title-of-Troubleshooting-Procedure-Post-Condition:

@@ -137,12 +139,14 @@ Post-Condition
.. Please provide screenshots of the software status or relevant display windows to confirm.
.. Do not include actions in this section. Any action by the user should be included in the end of the Procedure section below. For example: Do not include "Verify the telescope azimuth is 0 degrees with the appropriate command." Instead, include this statement as the final step of the procedure, and include "Telescope is at 0 degrees." in the Post-condition section.
- This is an example bullet of a post-condition (Telescope azimuth is 0 degrees.)
- This is another example of a post-condition (This procedure leaves the telescope with the E-stop activated.)
- The AT Spectrograph's encoder has been cleared and the stage is now in the home position, near 0 mm.

.. _Title-of-Troubleshooting-Procedure-Contingency:

Contingency
===========

If the procedure was not successful, report the issue in [relevant Slack channel] and/or activate the :ref:`Out of hours support <Safety-out-of-hours-support>`.
If the procedure was not successful, report the issue in `#summit_auxtel`_ and/or activate the :ref:`Out of hours support <Safety-out-of-hours-support>`.

.. _#summit_auxtel: https://lsstc.slack.com/archives/C01K4M6R4AH
