docs: add a troubleshooting guide
Signed-off-by: Suleyman Akbas <sakbas@redhat.com>
suleymanakbas91 committed May 24, 2023
1 parent e6814c5 commit c11f274
Showing 4 changed files with 154 additions and 6 deletions.
6 changes: 6 additions & 0 deletions README.md
@@ -301,6 +301,8 @@ To perform a full cleanup, follow these steps:
$ oc delete lvmclusters.lvm.topolvm.io my-lvmcluster
lvmcluster.lvm.topolvm.io "my-lvmcluster" deleted
```
If the previous command is stuck, it may be necessary to perform a [forced cleanup procedure](./docs/troubleshooting.md#forced-cleanup).
4. Verify that the only remaining resource in the `openshift-storage` namespace is the Operator.
@@ -350,6 +352,10 @@ LVMS does not support the reconciliation of multiple LVMCluster custom resources
It is not possible to upgrade from release-4.10 and release-4.11 to a newer version due to a breaking change that has been implemented. For further information on this matter, consult [the relevant documentation](https://github.com/topolvm/topolvm/blob/main/docs/proposals/rename-group.md).
## Troubleshooting
See the [troubleshooting guide](docs/troubleshooting.md).
## Contributing
See the [contribution guide](CONTRIBUTING.md).
7 changes: 4 additions & 3 deletions docs/README.md
@@ -1,6 +1,7 @@
# Contents

-1. [Reconciler Design](design/architecture.md)
+1. [Architecture](design/architecture.md)
2. [The LVM Operator Manager](design/lvm-operator-manager.md)
-2. [The Volume Group Manager](design/vg-manager.md)
-5. [Thin Provisioning](design/thin-provisioning.md)
+3. [The Volume Group Manager](design/vg-manager.md)
+4. [Thin Provisioning](design/thin-provisioning.md)
+5. [Troubleshooting Guide](troubleshooting.md)
6 changes: 3 additions & 3 deletions docs/design/lvm-operator-manager.md
@@ -2,7 +2,7 @@

The LVM Operator Manager runs the LVM Cluster controller/reconciler that manages the following reconcile units:

-- [LVMCluster Custom Resource (CR)](#lvmcluster-custom-resource--cr-)
+- [LVMCluster Custom Resource (CR)](#lvmcluster-custom-resource-cr)
- [TopoLVM CSI](#topolvm-csi)
* [CSI Driver](#csi-driver)
* [TopoLVM Controller](#topolvm-controller)
@@ -11,9 +11,9 @@ The LVM Operator Manager runs the LVM Cluster controller/reconciler that manages
- [Storage Classes](#storage-classes)
- [Volume Group Manager](#volume-group-manager)
- [LVM Volume Groups](#lvm-volume-groups)
-- [Openshift Security Context Constraints (SCCs)](#openshift-security-context-constraints--sccs-)
+- [Openshift Security Context Constraints (SCCs)](#openshift-security-context-constraints-sccs)

-Upon receiving a valid [LVMCluster custom resource](#lvmcluster-custom-resource--cr-), the LVM Cluster Controller initiates the reconciliation process to set up the TopoLVM Container Storage Interface (CSI) along with all the required resources for using locally available storage through Logical Volume Manager (LVM).
+Upon receiving a valid [LVMCluster custom resource](#lvmcluster-custom-resource-cr), the LVM Cluster Controller initiates the reconciliation process to set up the TopoLVM Container Storage Interface (CSI) along with all the required resources for using locally available storage through Logical Volume Manager (LVM).

## LVMCluster Custom Resource (CR)

141 changes: 141 additions & 0 deletions docs/troubleshooting.md
@@ -0,0 +1,141 @@
# Troubleshooting Guide

## Persistent Volume Claim (PVC) is stuck in `Pending` state

A Persistent Volume Claim (PVC) can get stuck in the `Pending` state for a number of reasons. This section lists common causes and ways to troubleshoot them.

```bash
$ oc get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
lvms-test Pending lvms-vg1 11s
```

To troubleshoot the issue, inspect the events associated with the PVC. These events can provide valuable insights into any errors or issues encountered during the provisioning process.

```bash
$ oc describe pvc lvms-test
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ProvisioningFailed 4s (x2 over 17s) persistentvolume-controller storageclass.storage.k8s.io "lvms-vg1" not found
```
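The failure reason can also be extracted from the events mechanically. Below is a small sketch that isolates the message column from the sample `oc describe pvc` output above; in practice you would pipe the real command output instead of the embedded sample line:

```shell
# Sample ProvisioningFailed event line, as printed by `oc describe pvc lvms-test` above.
events='Warning  ProvisioningFailed  4s (x2 over 17s)  persistentvolume-controller  storageclass.storage.k8s.io "lvms-vg1" not found'

# Keep only provisioning failures and strip everything up to the message column.
echo "$events" | grep 'ProvisioningFailed' | sed 's/.*persistentvolume-controller *//'
# prints: storageclass.storage.k8s.io "lvms-vg1" not found
```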

### `LVMCluster` CR or the Logical Volume Manager Storage (LVMS) components are missing

If you encounter a "storageclass.storage.k8s.io 'lvms-vg1' not found" error, verify the presence of the `LVMCluster` resource:

```bash
$ oc get lvmcluster -n openshift-storage
NAME AGE
my-lvmcluster 65m
```

If an `LVMCluster` resource is not found, create one based on your requirements; [here](../config/samples/lvm_v1alpha1_lvmcluster.yaml) is a sample you can use.

```bash
$ oc create -n openshift-storage -f https://github.com/openshift/lvm-operator/raw/main/config/samples/lvm_v1alpha1_lvmcluster.yaml
```

If an `LVMCluster` already exists, check whether all the LVMS pods are in the `Running` state in the `openshift-storage` namespace:

```bash
$ oc get pods -n openshift-storage
NAME READY STATUS RESTARTS AGE
lvms-operator-7b9fb858cb-6nsml 3/3 Running 0 70m
topolvm-controller-5dd9cf78b5-7wwr2 5/5 Running 0 66m
topolvm-node-dr26h 4/4 Running 0 66m
vg-manager-r6zdv 1/1 Running 0 66m
```

There should be one running instance of `lvms-operator` and one of `topolvm-controller`, and one instance each of `topolvm-node` and `vg-manager` per node.
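This layout can be sanity-checked mechanically. A sketch against the sample pod listing above (the pod names are the hypothetical ones from the example; in practice pipe `oc get pods -n openshift-storage --no-headers` instead of the embedded sample):

```shell
# Sample pod listing, names taken from the example output above.
pods='lvms-operator-7b9fb858cb-6nsml        3/3   Running   0   70m
topolvm-controller-5dd9cf78b5-7wwr2   5/5   Running   0   66m
topolvm-node-dr26h                    4/4   Running   0   66m
vg-manager-r6zdv                      1/1   Running   0   66m'

# There should be exactly one lvms-operator pod; flag any pod not in Running state.
echo "$pods" | grep -c '^lvms-operator'          # expect: 1
echo "$pods" | awk '$3 != "Running" {print $1}'  # expect: no output
```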

#### `topolvm-node` is stuck in `Init:0/1`

This indicates a failure in locating an available disk for LVMS utilization. To investigate further and obtain relevant information, review the logs of the `vg-manager` pod.

```bash
$ oc logs <vg-manager-pod-name> -n openshift-storage
```

### Disk failure

If you encounter a failure message such as "failed to check volume existence" while inspecting the events associated with the PVC, it points to a problem with the availability or accessibility of the underlying volume or disk. Further investigation is needed to identify the exact cause.

```bash
$ oc describe pvc lvms-test
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ProvisioningFailed 4s (x2 over 17s) persistentvolume-controller failed to provision volume with StorageClass "lvms-vg1": rpc error: code = Internal desc = failed to check volume existence
```

To investigate further, connect directly to the host where the problem occurs and try creating a file on the affected disk; this quickly shows whether the disk is reachable and writable. If the issue still recurs after the underlying disk problem has been resolved, it may be necessary to perform a [forced cleanup procedure](#forced-cleanup) for LVMS and then re-create the `LVMCluster`. Re-creating the `LVMCluster` re-creates all associated objects and resources, providing a clean starting point for the LVMS deployment.

### Node failure

If PVCs associated with a specific node remain in a `Pending` state, it suggests a potential issue with that particular node. To identify the problematic node, you can examine the restart count of the `topolvm-node` pod. An increased restart count indicates potential problems with the underlying node, which may require further investigation and troubleshooting.

```bash
$ oc get pods -n openshift-storage
NAME READY STATUS RESTARTS AGE
lvms-operator-7b9fb858cb-6nsml 3/3 Running 0 70m
topolvm-controller-5dd9cf78b5-7wwr2 5/5 Running 0 66m
topolvm-node-dr26h 4/4 Running 0 66m
topolvm-node-54as8 4/4 Running 0 66m
topolvm-node-78fft 4/4 Running 17 (8s ago) 66m
vg-manager-r6zdv 1/1 Running 0 66m
vg-manager-990ut 1/1 Running 0 66m
vg-manager-an118 1/1 Running 0 66m
```
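Restart counts can be compared across nodes mechanically. A sketch over the sample listing above (pod names are the hypothetical ones from the example; in practice pipe `oc get pods -n openshift-storage --no-headers` instead):

```shell
# Sample `topolvm-node` rows from the example output above.
pods='topolvm-node-dr26h   4/4   Running   0             66m
topolvm-node-54as8   4/4   Running   0             66m
topolvm-node-78fft   4/4   Running   17 (8s ago)   66m'

# Print topolvm-node pods with a non-zero restart count (column 4).
echo "$pods" | awk '/^topolvm-node/ && $4+0 > 0 {print $1 " restarted " $4 " times"}'
# prints: topolvm-node-78fft restarted 17 times
```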

If the problem persists even after the issue with the node has been resolved, it may be necessary to perform a [forced cleanup procedure](#forced-cleanup) for LVMS and then re-create the `LVMCluster`, which re-creates all associated objects and resources for a clean starting point.

## Forced cleanup

If issues keep recurring even after the underlying disk or node problem has been resolved, it may be necessary to perform a forced cleanup of LVMS. This procedure comprehensively removes persistent state so the LVMS deployment can start cleanly.

1. Remove all PVCs provisioned by LVMS, and the pods using them.
2. Switch to `openshift-storage` namespace:

```bash
$ oc project openshift-storage
```

3. Make sure there is no `LogicalVolume` CR left:

```bash
$ oc get logicalvolume
No resources found
```

If any `LogicalVolume` CRs remain, remove their finalizers and delete the resources:

```bash
$ oc patch logicalvolume <name> -p '{"metadata":{"finalizers":[]}}' --type=merge
$ oc delete logicalvolume <name>
```

4. Make sure there are no `LVMVolumeGroup` CRs left:

```bash
$ oc get lvmvolumegroup
No resources found
```

If any `LVMVolumeGroup` CRs remain, remove their finalizers and delete the resources:

```bash
$ oc patch lvmvolumegroup <name> -p '{"metadata":{"finalizers":[]}}' --type=merge
$ oc delete lvmvolumegroup <name>
```

5. Remove any `LVMVolumeGroupNodeStatus` CRs:

```bash
$ oc delete lvmvolumegroupnodestatus --all
```

6. Remove the `LVMCluster` CR:

```bash
$ oc delete lvmcluster --all
```
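Steps 3 and 4 above follow the same pattern, so they can be wrapped in a small helper. The sketch below is ours (the function name is hypothetical, and it assumes `oc` is logged in to the cluster):

```shell
# Hypothetical helper: strip finalizers from, then delete, every leftover CR
# of the given kind. An empty finalizer list lets Kubernetes finish deletion.
cleanup_stuck_crs() {
  kind="$1"
  for cr in $(oc get "$kind" -o name); do
    oc patch "$cr" -p '{"metadata":{"finalizers":[]}}' --type=merge
    oc delete "$cr"
  done
}
# Usage: cleanup_stuck_crs logicalvolume
#        cleanup_stuck_crs lvmvolumegroup
```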
