Client method to dump cluster state #5470

Merged (13 commits), Nov 10, 2021

Conversation

@fjetter (Member) commented Oct 27, 2021

This adds a client method to dump the entire cluster state to a file for debugging purposes. It has been incredibly handy for the deadlock scenarios I've been debugging recently.
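
Roughly, the intended usage looks like this (a sketch only; Client.dump_cluster_state is the entrypoint discussed in the review below, and the filename argument and scheduler address are placeholders rather than the final signature):

    from distributed import Client

    client = Client("tcp://127.0.0.1:38993")   # connect to an existing scheduler
    # Write a snapshot of scheduler and worker state to disk for offline debugging
    client.dump_cluster_state("cluster-dump")  # filename is a placeholder argument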

This method is called automatically when a test runs into a timeout, and the content is persisted as a GH artefact. This should help us debug spurious, flaky test failures.

I implemented the test dump as YAML for readability, but for real-world examples YAML is not well suited. In my experience these dumps can grow to several MB or even GB, and a feasible approach so far has been msgpack + gzip. None of that is set in stone.
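
A rough sketch of that msgpack + gzip approach (not the code in this PR; it assumes the dumped state has already been collected into a plain dict):

    import gzip
    import msgpack

    def write_state(state: dict, path: str) -> None:
        # msgpack keeps the dump compact; gzip helps once dumps reach GB scale.
        # default=str falls back to a string repr for objects msgpack can't encode.
        with gzip.open(path, "wb") as f:
            f.write(msgpack.packb(state, default=str))

    def read_state(path: str) -> dict:
        with gzip.open(path, "rb") as f:
            return msgpack.unpackb(f.read(), raw=False)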

The implementation is not very elegant: I added a to_dict method (similar to identity, but more verbose) to the most relevant classes. If somebody has an idea for a more elegant approach, I'm all ears.
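
As a sketch of the general shape (attribute names below are illustrative, not the actual implementation; the methods added here were later renamed to _to_dict):

    from typing import Container

    def to_dict(self, *, exclude: Container[str] = ()) -> dict:
        # Start from the compact identity() summary and layer on the more verbose
        # per-object state that is useful when debugging deadlocks.
        info = self.identity()
        verbose = {
            "status": str(self.status),
            "tasks": {key: repr(ts) for key, ts in self.tasks.items()},
            "log": list(self.log),
        }
        info.update({k: v for k, v in verbose.items() if k not in exclude})
        return info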

Example output
scheduler_info:
  address: tcp://127.0.0.1:38993
  events:
    Client-cb05ef34-372a-11ec-881b-e9dd5dbb64c5: !!python/tuple
    - '(1635341747.6048865, {''action'': ''add-client'', ''client'': ''Client-cb05ef34-372a-11ec-881b-e9dd5dbb64c5''})'
    all: !!python/tuple
    - !!python/tuple
      - '1635341747.5759628'
      - action: add-worker
        worker: tcp://127.0.0.1:41249
    - !!python/tuple
      - '1635341747.5792954'
      - action: add-worker
        worker: tcp://127.0.0.1:42413
    - !!python/tuple
      - '1635341747.6048865'
      - action: add-client
        client: Client-cb05ef34-372a-11ec-881b-e9dd5dbb64c5
    stealing: !!python/tuple []
    tcp://127.0.0.1:41249: !!python/tuple
    - !!python/tuple
      - '1635341747.5758505'
      - action: heartbeat
        bandwidth:
          total: '100000000'
          types: {}
          workers: {}
        cpu: '0.0'
        executing: '0'
        in_flight: '0'
        in_memory: '0'
        memory: '134877184'
        num_fds: '24'
        read_bytes: '0.0'
        read_bytes_disk: '0.0'
        ready: '0'
        spilled_nbytes: '0'
        time: '1635341747.557952'
        write_bytes: '0.0'
        write_bytes_disk: '0.0'
    - !!python/tuple
      - '1635341747.5759585'
      - action: add-worker
    - !!python/tuple
      - '1635341747.5858035'
      - action: worker-status-change
        prev-status: undefined
        status: running
    - !!python/tuple
      - '1635341748.5936973'
      - action: heartbeat
        bandwidth:
          total: '100000000'
          types: {}
          workers: {}
        cpu: '4.0'
        executing: '0'
        in_flight: '0'
        in_memory: '0'
        memory: '135020544'
        num_fds: '39'
        read_bytes: '0.0'
        read_bytes_disk: '0.0'
        ready: '0'
        spilled_nbytes: '0'
        time: '1635341748.5843215'
        write_bytes: '0.0'
        write_bytes_disk: '0.0'
    - !!python/tuple
      - '1635341749.5884714'
      - action: heartbeat
        bandwidth:
          total: '100000000'
          types: {}
          workers: {}
        cpu: '4.0'
        executing: '0'
        in_flight: '0'
        in_memory: '0'
        memory: '135135232'
        num_fds: '43'
        read_bytes: '0.0'
        read_bytes_disk: '0.0'
        ready: '0'
        spilled_nbytes: '0'
        time: '1635341749.5846045'
        write_bytes: '0.0'
        write_bytes_disk: '0.0'
    tcp://127.0.0.1:42413: !!python/tuple
    - !!python/tuple
      - '1635341747.579201'
      - action: heartbeat
        bandwidth:
          total: '100000000'
          types: {}
          workers: {}
        cpu: '0.0'
        executing: '0'
        in_flight: '0'
        in_memory: '0'
        memory: '134877184'
        num_fds: '25'
        read_bytes: '0.0'
        read_bytes_disk: '0.0'
        ready: '0'
        spilled_nbytes: '0'
        time: '1635341747.5610273'
        write_bytes: '0.0'
        write_bytes_disk: '0.0'
    - !!python/tuple
      - '1635341747.579291'
      - action: add-worker
    - !!python/tuple
      - '1635341747.5859537'
      - action: worker-status-change
        prev-status: undefined
        status: running
    - !!python/tuple
      - '1635341748.5940423'
      - action: heartbeat
        bandwidth:
          total: '100000000'
          types: {}
          workers: {}
        cpu: '11.4'
        executing: '0'
        in_flight: '0'
        in_memory: '0'
        memory: '135020544'
        num_fds: '39'
        read_bytes: '16806.444267185172'
        read_bytes_disk: '0.0'
        ready: '0'
        spilled_nbytes: '0'
        time: '1635341748.085588'
        write_bytes: '26748.096587213608'
        write_bytes_disk: '0.0'
    - !!python/tuple
      - '1635341749.5887933'
      - action: heartbeat
        bandwidth:
          total: '100000000'
          types: {}
          workers: {}
        cpu: '4.0'
        executing: '0'
        in_flight: '0'
        in_memory: '0'
        memory: '135135232'
        num_fds: '43'
        read_bytes: '0.0'
        read_bytes_disk: '0.0'
        ready: '0'
        spilled_nbytes: '0'
        time: '1635341749.5856717'
        write_bytes: '0.0'
        write_bytes_disk: '0.0'
  id: Scheduler-87138d38-3606-49a3-b896-5d2d5d31bff7
  log: !!python/tuple []
  services:
    dashboard: '33755'
  started: '1635341747.549502'
  status: running
  tasks: {}
  thread_id: '140209022129984'
  transition_log: !!python/tuple []
  type: Scheduler
  workers:
    tcp://127.0.0.1:41249:
      host: 127.0.0.1
      id: '0'
      last_seen: '1635341749.5884442'
      local_directory: /home/runner/work/distributed/distributed/dask-worker-space/worker-eru8w8qn
      memory_limit: '7291699200'
      metrics:
        bandwidth:
          total: '100000000'
          types: {}
          workers: {}
        cpu: '4.0'
        executing: '0'
        in_flight: '0'
        in_memory: '0'
        memory: '135135232'
        num_fds: '43'
        read_bytes: '0.0'
        read_bytes_disk: '0.0'
        ready: '0'
        spilled_nbytes: '0'
        time: '1635341749.5846045'
        write_bytes: '0.0'
        write_bytes_disk: '0.0'
      name: '0'
      nanny: null
      nthreads: '1'
      resources: {}
      services:
        dashboard: '34057'
      type: Worker
    tcp://127.0.0.1:42413:
      host: 127.0.0.1
      id: '1'
      last_seen: '1635341749.588775'
      local_directory: /home/runner/work/distributed/distributed/dask-worker-space/worker-uj4ir590
      memory_limit: '7291699200'
      metrics:
        bandwidth:
          total: '100000000'
          types: {}
          workers: {}
        cpu: '4.0'
        executing: '0'
        in_flight: '0'
        in_memory: '0'
        memory: '135135232'
        num_fds: '43'
        read_bytes: '0.0'
        read_bytes_disk: '0.0'
        ready: '0'
        spilled_nbytes: '0'
        time: '1635341749.5856717'
        write_bytes: '0.0'
        write_bytes_disk: '0.0'
      name: '1'
      nanny: null
      nthreads: '2'
      resources: {}
      services:
        dashboard: '41997'
      type: Worker
worker_info:
  tcp://127.0.0.1:41249:
    address: tcp://127.0.0.1:41249
    config:
      array:
        chunk-size: 128MiB
        rechunk-threshold: '4'
        slicing:
          split-large-chunks: null
        svg:
          size: '120'
      dataframe:
        parquet:
          metadata-task-size-local: '512'
          metadata-task-size-remote: '16'
        shuffle-compression: null
      distributed:
        adaptive:
          interval: 1s
          maximum: inf
          minimum: '0'
          target-duration: 5s
          wait-count: '3'
        admin:
          event-loop: tornado
          log-format: '%(name)s - %(levelname)s - %(message)s'
          log-length: '10000'
          max-error-length: '10000'
          pdb-on-err: 'False'
          system-monitor:
            interval: 500ms
          tick:
            interval: 20ms
            limit: 3s
        client:
          heartbeat: 5s
          scheduler-info-interval: 2s
        comm:
          compression: auto
          default-scheme: tcp
          offload: 10MiB
          recent-messages-log-length: '0'
          require-encryption: null
          retry:
            count: '0'
            delay:
              max: 20s
              min: 1s
          shard: 64MiB
          socket-backlog: '2048'
          timeouts:
            connect: 5s
            tcp: 30s
          tls:
            ca-file: null
            ciphers: null
            client:
              cert: null
              key: null
            scheduler:
              cert: null
              key: null
            worker:
              cert: null
              key: null
          ucx:
            cuda_copy: 'False'
            infiniband: 'False'
            net-devices: null
            nvlink: 'False'
            rdmacm: 'False'
            reuse-endpoints: null
            tcp: 'False'
          websockets:
            shard: 8MiB
          zstd:
            level: '3'
            threads: '0'
        dashboard:
          export-tool: 'False'
          graph-max-items: '5000'
          link: '{scheme}://{host}:{port}/status'
          prometheus:
            namespace: dask
        deploy:
          cluster-repr-interval: 500ms
          lost-worker-timeout: 15s
        diagnostics:
          computations:
            ignore-modules: !!python/tuple
            - distributed
            - dask
            - xarray
            - cudf
            - cuml
            - prefect
            - xgboost
            max-history: '100'
          nvml: 'True'
        nanny:
          environ:
            MALLOC_TRIM_THRESHOLD_: '65536'
            MKL_NUM_THREADS: '1'
            OMP_NUM_THREADS: '1'
          preload: !!python/tuple []
          preload-argv: !!python/tuple []
        rmm:
          pool-size: null
        scheduler:
          active-memory-manager:
            interval: 2s
            policies: !!python/tuple
            - class: distributed.active_memory_manager.ReduceReplicas
            start: 'False'
          allowed-failures: '3'
          allowed-imports: !!python/tuple
          - dask
          - distributed
          bandwidth: '100000000'
          blocked-handlers: !!python/tuple []
          dashboard:
            bokeh-application:
              allow_websocket_origin: !!python/tuple
              - '*'
              check_unused_sessions_milliseconds: '500'
              keep_alive_milliseconds: '500'
            status:
              task-stream-length: '1000'
            tasks:
              task-stream-length: '100000'
            tls:
              ca-file: null
              cert: null
              key: null
          default-data-size: 1kiB
          default-task-durations:
            rechunk-split: 1us
            split-shuffle: 1us
          events-cleanup-delay: 1h
          events-log-length: '100000'
          http:
            routes: !!python/tuple
            - distributed.http.scheduler.prometheus
            - distributed.http.scheduler.info
            - distributed.http.scheduler.json
            - distributed.http.health
            - distributed.http.proxy
            - distributed.http.statics
          idle-timeout: null
          locks:
            lease-timeout: 30s
            lease-validation-interval: 10s
          pickle: 'True'
          preload: !!python/tuple []
          preload-argv: !!python/tuple []
          transition-log-length: '100000'
          unknown-task-duration: 500ms
          validate: 'False'
          work-stealing: 'True'
          work-stealing-interval: 100ms
          worker-ttl: null
        version: '2'
        worker:
          blocked-handlers: !!python/tuple []
          connections:
            incoming: '10'
            outgoing: '50'
          daemon: 'True'
          http:
            routes: !!python/tuple
            - distributed.http.worker.prometheus
            - distributed.http.health
            - distributed.http.statics
          lifetime:
            duration: null
            restart: 'False'
            stagger: 0 seconds
          memory:
            pause: '0.8'
            rebalance:
              measure: optimistic
              recipient-max: '0.6'
              sender-min: '0.3'
              sender-recipient-gap: '0.1'
            recent-to-old-time: 30s
            spill: '0.7'
            target: '0.6'
            terminate: '0.95'
          multiprocessing-method: spawn
          preload: !!python/tuple []
          preload-argv: !!python/tuple []
          profile:
            cycle: 1000ms
            interval: 10ms
            low-level: 'False'
          resources: {}
          use-file-locking: 'True'
          validate: 'False'
      optimization:
        fuse:
          active: null
          ave-width: '1'
          max-depth-new-edges: null
          max-height: inf
          max-width: null
          rename-keys: 'True'
          subgraphs: null
      scheduler: dask.distributed
      shuffle: tasks
      temporary-directory: null
      tokenize:
        ensure-deterministic: 'False'
    constrained: !!python/tuple []
    executing_count: '0'
    id: Worker-5bd7c8a8-6ca3-4457-ac0f-0c49fa1e56f5
    in_flight_tasks: '0'
    in_flight_workers: {}
    incoming_transfer_log: !!python/tuple []
    log: !!python/tuple []
    logs: !!python/tuple
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:41249'
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:41249'
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO -          dashboard at:            127.0.0.1:34057'
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:38993'
    - !!python/tuple
      - INFO
      - distributed.worker - INFO - -------------------------------------------------
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO -               Threads:                          1'
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO -                Memory:                   6.79
        GiB'
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO -       Local Directory: /home/runner/work/distributed/distributed/dask-worker-space/worker-eru8w8qn'
    - !!python/tuple
      - INFO
      - distributed.worker - INFO - -------------------------------------------------
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:42413'
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:42413'
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO -          dashboard at:            127.0.0.1:41997'
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:38993'
    - !!python/tuple
      - INFO
      - distributed.worker - INFO - -------------------------------------------------
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO -               Threads:                          2'
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO -                Memory:                   6.79
        GiB'
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO -       Local Directory: /home/runner/work/distributed/distributed/dask-worker-space/worker-uj4ir590'
    - !!python/tuple
      - INFO
      - distributed.worker - INFO - -------------------------------------------------
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:38993'
    - !!python/tuple
      - INFO
      - distributed.worker - INFO - -------------------------------------------------
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:38993'
    - !!python/tuple
      - INFO
      - distributed.worker - INFO - -------------------------------------------------
    long_running: !!python/tuple []
    memory_limit: '7291699200'
    memory_pause_fraction: '0.8'
    memory_spill_fraction: '0.7'
    memory_target_fraction: '0.6'
    ncores: '1'
    nthreads: '1'
    outgoing_transfer_log: !!python/tuple []
    ready: !!python/tuple []
    scheduler: tcp://127.0.0.1:38993
    status: '<Status.running: ''running''>'
    tasks: {}
    thread_id: '140209022129984'
    type: Worker
  tcp://127.0.0.1:42413:
    address: tcp://127.0.0.1:42413
    config:
      array:
        chunk-size: 128MiB
        rechunk-threshold: '4'
        slicing:
          split-large-chunks: null
        svg:
          size: '120'
      dataframe:
        parquet:
          metadata-task-size-local: '512'
          metadata-task-size-remote: '16'
        shuffle-compression: null
      distributed:
        adaptive:
          interval: 1s
          maximum: inf
          minimum: '0'
          target-duration: 5s
          wait-count: '3'
        admin:
          event-loop: tornado
          log-format: '%(name)s - %(levelname)s - %(message)s'
          log-length: '10000'
          max-error-length: '10000'
          pdb-on-err: 'False'
          system-monitor:
            interval: 500ms
          tick:
            interval: 20ms
            limit: 3s
        client:
          heartbeat: 5s
          scheduler-info-interval: 2s
        comm:
          compression: auto
          default-scheme: tcp
          offload: 10MiB
          recent-messages-log-length: '0'
          require-encryption: null
          retry:
            count: '0'
            delay:
              max: 20s
              min: 1s
          shard: 64MiB
          socket-backlog: '2048'
          timeouts:
            connect: 5s
            tcp: 30s
          tls:
            ca-file: null
            ciphers: null
            client:
              cert: null
              key: null
            scheduler:
              cert: null
              key: null
            worker:
              cert: null
              key: null
          ucx:
            cuda_copy: 'False'
            infiniband: 'False'
            net-devices: null
            nvlink: 'False'
            rdmacm: 'False'
            reuse-endpoints: null
            tcp: 'False'
          websockets:
            shard: 8MiB
          zstd:
            level: '3'
            threads: '0'
        dashboard:
          export-tool: 'False'
          graph-max-items: '5000'
          link: '{scheme}://{host}:{port}/status'
          prometheus:
            namespace: dask
        deploy:
          cluster-repr-interval: 500ms
          lost-worker-timeout: 15s
        diagnostics:
          computations:
            ignore-modules: !!python/tuple
            - distributed
            - dask
            - xarray
            - cudf
            - cuml
            - prefect
            - xgboost
            max-history: '100'
          nvml: 'True'
        nanny:
          environ:
            MALLOC_TRIM_THRESHOLD_: '65536'
            MKL_NUM_THREADS: '1'
            OMP_NUM_THREADS: '1'
          preload: !!python/tuple []
          preload-argv: !!python/tuple []
        rmm:
          pool-size: null
        scheduler:
          active-memory-manager:
            interval: 2s
            policies: !!python/tuple
            - class: distributed.active_memory_manager.ReduceReplicas
            start: 'False'
          allowed-failures: '3'
          allowed-imports: !!python/tuple
          - dask
          - distributed
          bandwidth: '100000000'
          blocked-handlers: !!python/tuple []
          dashboard:
            bokeh-application:
              allow_websocket_origin: !!python/tuple
              - '*'
              check_unused_sessions_milliseconds: '500'
              keep_alive_milliseconds: '500'
            status:
              task-stream-length: '1000'
            tasks:
              task-stream-length: '100000'
            tls:
              ca-file: null
              cert: null
              key: null
          default-data-size: 1kiB
          default-task-durations:
            rechunk-split: 1us
            split-shuffle: 1us
          events-cleanup-delay: 1h
          events-log-length: '100000'
          http:
            routes: !!python/tuple
            - distributed.http.scheduler.prometheus
            - distributed.http.scheduler.info
            - distributed.http.scheduler.json
            - distributed.http.health
            - distributed.http.proxy
            - distributed.http.statics
          idle-timeout: null
          locks:
            lease-timeout: 30s
            lease-validation-interval: 10s
          pickle: 'True'
          preload: !!python/tuple []
          preload-argv: !!python/tuple []
          transition-log-length: '100000'
          unknown-task-duration: 500ms
          validate: 'False'
          work-stealing: 'True'
          work-stealing-interval: 100ms
          worker-ttl: null
        version: '2'
        worker:
          blocked-handlers: !!python/tuple []
          connections:
            incoming: '10'
            outgoing: '50'
          daemon: 'True'
          http:
            routes: !!python/tuple
            - distributed.http.worker.prometheus
            - distributed.http.health
            - distributed.http.statics
          lifetime:
            duration: null
            restart: 'False'
            stagger: 0 seconds
          memory:
            pause: '0.8'
            rebalance:
              measure: optimistic
              recipient-max: '0.6'
              sender-min: '0.3'
              sender-recipient-gap: '0.1'
            recent-to-old-time: 30s
            spill: '0.7'
            target: '0.6'
            terminate: '0.95'
          multiprocessing-method: spawn
          preload: !!python/tuple []
          preload-argv: !!python/tuple []
          profile:
            cycle: 1000ms
            interval: 10ms
            low-level: 'False'
          resources: {}
          use-file-locking: 'True'
          validate: 'False'
      optimization:
        fuse:
          active: null
          ave-width: '1'
          max-depth-new-edges: null
          max-height: inf
          max-width: null
          rename-keys: 'True'
          subgraphs: null
      scheduler: dask.distributed
      shuffle: tasks
      temporary-directory: null
      tokenize:
        ensure-deterministic: 'False'
    constrained: !!python/tuple []
    executing_count: '0'
    id: Worker-bc9ba117-6ebd-4c25-8ff7-c181040cf688
    in_flight_tasks: '0'
    in_flight_workers: {}
    incoming_transfer_log: !!python/tuple []
    log: !!python/tuple []
    logs: !!python/tuple
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:41249'
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:41249'
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO -          dashboard at:            127.0.0.1:34057'
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:38993'
    - !!python/tuple
      - INFO
      - distributed.worker - INFO - -------------------------------------------------
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO -               Threads:                          1'
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO -                Memory:                   6.79
        GiB'
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO -       Local Directory: /home/runner/work/distributed/distributed/dask-worker-space/worker-eru8w8qn'
    - !!python/tuple
      - INFO
      - distributed.worker - INFO - -------------------------------------------------
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:42413'
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:42413'
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO -          dashboard at:            127.0.0.1:41997'
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:38993'
    - !!python/tuple
      - INFO
      - distributed.worker - INFO - -------------------------------------------------
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO -               Threads:                          2'
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO -                Memory:                   6.79
        GiB'
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO -       Local Directory: /home/runner/work/distributed/distributed/dask-worker-space/worker-uj4ir590'
    - !!python/tuple
      - INFO
      - distributed.worker - INFO - -------------------------------------------------
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:38993'
    - !!python/tuple
      - INFO
      - distributed.worker - INFO - -------------------------------------------------
    - !!python/tuple
      - INFO
      - 'distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:38993'
    - !!python/tuple
      - INFO
      - distributed.worker - INFO - -------------------------------------------------
    long_running: !!python/tuple []
    memory_limit: '7291699200'
    memory_pause_fraction: '0.8'
    memory_spill_fraction: '0.7'
    memory_target_fraction: '0.6'
    ncores: '2'
    nthreads: '2'
    outgoing_transfer_log: !!python/tuple []
    ready: !!python/tuple []
    scheduler: tcp://127.0.0.1:38993
    status: '<Status.running: ''running''>'
    tasks: {}
    thread_id: '140209022129984'
    type: Worker

@jrbourbeau (Member) left a comment

Thanks for putting this together @fjetter. Overall this looks like a nice addition

Resolved review threads: distributed/worker.py, distributed/client.py
Comment on lines 386 to 388
def to_dict(
    self, comm: Comm = None, *, exclude: Container[str] = None
) -> dict[str, str]:

Do we want users directly interacting with this (and other) to_dict methods? If not, I'd prefer to prepend a leading underscore to the method name. My sense is Client.dump_cluster_state is the main user-facing entrypoint for this feature

Resolved review threads: distributed/scheduler.py, distributed/tests/test_client.py, .github/workflows/tests.yaml, distributed/client.py (two threads), distributed/utils.py, distributed/utils_test.py
@jrbourbeau (Member)

Would including the output of client.get_versions() be useful here? This is often one of the first questions I ask users reporting issues
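
For reference, this is a single extra call on the client; the result is a dict with "client", "scheduler" and "workers" sections:

    # Python, dask, distributed, etc. versions across the whole cluster
    versions = client.get_versions()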

@fjetter self-assigned this Nov 5, 2021
@fjetter (Member, Author) commented Nov 5, 2021

I still get broken docs builds due to typing_extensions, even though it has already been imported elsewhere 🤔
Note: I had the TYPE_CHECKING guard in place. I'll wait for another build since I pushed more changes.

Now it works

@fjetter mentioned this pull request Nov 9, 2021
@fjetter (Member, Author) commented Nov 9, 2021

Added the version, good point.

I think the only topic left to address is whether or not we prefix to_dict with an underscore. I have a slight preference not to, but if that's a blocker, I will change it.

Friendly ping @jrbourbeau if that's ok for you

@jrbourbeau (Member)

Thanks for all the updates @fjetter. For the sake of trying to be more intentional about our public API, I'd prefer to use leading underscores. I pushed a small commit to make the to_dict -> _to_dict changes. It will be relatively straightforward to move these methods into the public API in the future (I'm happy to handle that too)
