Improve dealing with blocking evaluate requests #157

Closed · jackwickham opened this issue Feb 19, 2020 · 16 comments
Labels: bug (Something isn't working)
@jackwickham commented Feb 19, 2020

Environment data

  • PTVSD version: 4.3.2
  • OS and version: MacOS and CentOS
  • Python version: 3.6 and 3.7
  • Using VS Code or Visual Studio: VS Code, and interacting with ptvsd directly

Actual behavior

When justMyCode is set to false, operations on a Spark dataframe that was constructed in Python using session.createDataFrame hang when run in the debugger REPL, but not when run normally. Spark appears to hang, printing [Stage 0:> (0 + 1) / 1].

Expected behavior

The code should not hang; at minimum, the behavior should not depend on the justMyCode setting.

Steps to reproduce:

With pyspark installed (I have the latest Spark 3.0 preview build), I can reproduce it by debugging the following script with "justMyCode": false in launch.json.

from pyspark.sql import Row, SparkSession

def my_compute_function(session):
    row1 = Row(id='1', name='name')
    df = session.createDataFrame([row1])
    df.show() # set a breakpoint on this or the next line, and evaluate `df.show()` in the repl when it breaks
    return df


session = SparkSession.builder.master("local").appName("df-show").getOrCreate()
my_compute_function(session)

In ptvsd 4.2.10, it also hangs if you step over the df.show() line, but that was fixed by moving to 4.3.2. I guess that is because of microsoft/ptvsd#1372, although I haven't verified that.

@fabioz (Collaborator) commented Feb 19, 2020

I'll investigate to see what could be causing that...

@fabioz fabioz self-assigned this Feb 19, 2020
@fabioz (Collaborator) commented Feb 19, 2020

I just checked with ptvsd 5 and it seems to work well for me, so @jackwickham, please switch to ptvsd 5 by following the steps in microsoft/ptvsd#1706 (comment). Please reopen if you still see the error after upgrading.

@fabioz fabioz closed this as completed Feb 19, 2020
@jackwickham (Author) commented Feb 19, 2020

I've just tested with the Insiders version with the experiments enabled (and verified that it's using new_ptvsd), both with "debug current file" and by manually running ptvsd (python -m ptvsd --host localhost --port 12224 --wait --log-dir . [my_file]) and attaching the VS Code debugger, and I still run into the issue. The relevant entries from my launch.json are:

{
    "name": "Python: Current File",
    "type": "python",
    "request": "launch",
    "program": "${file}",
    "console": "integratedTerminal",
    "justMyCode": false,
    "redirectOutput": true
},
{
    "name": "Python: Remote Attach",
    "type": "python",
    "request": "attach",
    "port": 12224,
    "host": "localhost",
    "pathMappings": [
        {
            "localRoot": "${workspaceFolder}",
            "remoteRoot": "."
        }
    ],
    "justMyCode": false
}

(@fabioz I don't seem to have permission to reopen)

@fabioz (Collaborator) commented Feb 19, 2020

Can you provide the logs for the run?

i.e.:

  • Open VS Code
  • Select the command Extensions: Open Extensions Folder
  • Locate the Python extension directory, typically of the form ms-python.python-2020..***
  • In that directory, ensure you do not have any debug*.log files; if you do, please delete them
  • Go back into VS Code and modify your launch.json to add the setting "logToFile": true, see below:
"version": "0.2.0",
"configurations": [
    {
        "name": "Python: Current File (Integrated Terminal)",
        "type": "python",
        "request": "launch",
        "program": "${file}",
        "stopOnEntry": true,
        "console": "integratedTerminal",
        "logToFile": true
    },
  • Start debugging
  • When done, go back into the extension directory and upload the file debug*.log into this GitHub issue.

@fabioz fabioz reopened this Feb 19, 2020
@jackwickham (Author)

From attaching to an existing ptvsd process:
ptvsd.adapter-85777.log
ptvsd.server-85761.log

From running using current file (I presume it's these ones that you want):
ptvsd.adapter-86448.log
ptvsd.launcher-86459.log
ptvsd.server-86467.log
debugger.vscode_fe239409-31cb-4137-8320-c731a6f94edf.log

@fabioz (Collaborator) commented Feb 19, 2020

Thanks for the logs... I was able to reproduce it with that info (although I still don't know the reason). I'll investigate.

@fabioz (Collaborator) commented Mar 7, 2020

Ok, I did the investigation here... what happens is that pyspark has 2 threads: the main thread, where commands are requested, and a secondary thread, which is responsible for handling commands and serving as an intermediary with a Java process.

So, when you do df.show(), it creates a command and sends it to be processed by the secondary thread, which communicates with Java.

In the case where justMyCode=true, that secondary thread is never actually paused because it only runs library code, so things work. In the case where justMyCode=false, the debugger pauses that thread too, so it can't process requests; the whole process then deadlocks, because the evaluation can never finish while the thread that would process it is paused.
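
To see the shape of the problem without Spark, here is a minimal sketch (an illustration of the pattern described above, not pyspark's actual code) of a blocking request sent to a worker thread -- if a debugger suspends the worker, the reply never arrives:

import queue
import threading

requests = queue.Queue()

def worker():
    # Stands in for pyspark's secondary thread that talks to the Java process.
    while True:
        command, reply = requests.get()
        reply.put("result of " + command)  # never runs if this thread is suspended

threading.Thread(target=worker, daemon=True).start()

def send_command(command):
    # Stands in for an operation like df.show(): blocks until the worker answers.
    reply = queue.Queue()
    requests.put((command, reply))
    return reply.get()  # deadlocks if the debugger has paused the worker thread

print(send_command("show"))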

Also, because the current policy is to always stop or run all threads together, you can't selectively run just that one thread.

The current workarounds are using justMyCode=true, or locally asking the debugger to never trace that thread.

To ask the debugger to never trace that thread, edit /site-packages/pyspark/accumulators.py: in the function _start_update_server, after thread = threading.Thread(target=server.serve_forever), add a line with the contents thread.pydev_do_not_trace = True (any thread with a pydev_do_not_trace attribute is not traced, so that thread will never be stopped).
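
As a standalone illustration of that workaround (generic code, not pyspark's actual _start_update_server), marking a thread this way keeps pydevd from tracing it:

import threading
import time

def serve_forever():
    # Placeholder for the accumulator server loop in pyspark.
    while True:
        time.sleep(1)

thread = threading.Thread(target=serve_forever)
# pydevd skips any thread that has a pydev_do_not_trace attribute,
# so the debugger will never suspend this thread.
thread.pydev_do_not_trace = True
thread.daemon = True
thread.start()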

For an actual fix, I think we'd need to run/stop threads independently. Right now we stop/resume all threads by design, because that's the only option supported by Visual Studio, but I believe Visual Studio Code itself does support running/stopping threads independently (as does the debugger), so maybe we could add an option letting users choose the policy here (it needs some investigation though, and it's probably not a simple fix). @karthiknadig @int19h what do you think?

@karthiknadig (Member)

@fabioz is steppingResumesAllThreads not enough here? If not, we should consolidate the behaviors we want into one simple setting. I don't want to add something like controlThreadsIndependently, because then we'd have to support all combinations of steppingResumesAllThreads and controlThreadsIndependently. Instead we could have a behavior-based setting, something like threadControl: [all (default), individual]. all is the current behavior, where all threads are paused and resumed together; individual means each thread can be paused or resumed independently. Maybe also user: all user threads are paused and resumed together (I'm not sure about the feasibility of this option). Does this work?
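
For illustration, a launch.json entry under this proposal might look like the following (threadControl is a hypothetical setting being proposed in this discussion, not a shipped option):

{
    "name": "Python: Current File",
    "type": "python",
    "request": "launch",
    "program": "${file}",
    "justMyCode": false,
    "threadControl": "individual"
}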

@fabioz (Collaborator) commented Mar 8, 2020

@karthiknadig steppingResumesAllThreads is not enough, because in this case the user is not stepping (he's just evaluating some code). In this particular case, for things to work, just the thread with the breakpoint has to be stopped while the other thread keeps running. Another option, instead of having threads run independently, would be a setting such as evaluateResumesAllThreads, which would resume all other threads when the evaluation starts and pause them again when it finishes. I think having threads controlled independently would be the better solution, the downside being that it wouldn't work for Visual Studio, just for Visual Studio Code.

I don't think we can reliably differentiate user threads from system threads (because we can't tell whether a particular thread will touch user code given its current state).

As for the setting, I didn't really understand the difference between controlThreadsIndependently and threadControl: all|individual -- wouldn't we need to support all combinations of threadControl and steppingResumesAllThreads anyway? (i.e., we'd still need to support stopping all threads while having step resume all threads -- although I agree that threadControl may be nicer, since it's not just a boolean switch and can be extended in the future.)

Note that supporting that combination may not be all that troublesome, as the code paths are independent from one another (although it probably doesn't make much sense to control threads independently and also have stepping resume all threads).

@karthiknadig (Member)

@fabioz I actually meant completely replacing steppingResumesAllThreads with threadControl or controlThreadsIndependently, whichever makes more sense. I want to reduce the number of settings.

@int19h Let me know if you have any ideas.

@fabioz (Collaborator) commented Mar 12, 2020

Well, the first thing is deciding whether we should support Visual Studio here or whether it's ok to support this use case just for Visual Studio Code. If Visual Studio needs to be supported, then we need something like evaluateResumesAllThreads (and I don't think we can really remove steppingResumesAllThreads, so I find it hard to unify everything under a single setting).

@int19h @karthiknadig waiting on feedback here.

@int19h (Contributor) commented Apr 7, 2020

As I recall, we originally decided to use the all-threads-paused model for VSC not just because it's what VS was doing, but also because it was felt to be more intuitive for the users. I think we should make a similar decision on behavior here first. There are obvious downsides to not letting threads run, as this bug shows; but there are also downsides to letting them run, especially if this means hitting breakpoints etc.

For what it's worth, in VS, other threads do not run during evaluation, but there is also a timeout, which brings you to this help page:
https://docs.microsoft.com/en-us/visualstudio/debugger/error-evaluating-the-function-function-timed-out-and-needed-to-be-aborted-in-an-unsafe-way

Also, apparently the docs might not even be up to date, because trying this on VS 2019, I get this error once it times out:

[screenshot: VS 2019 error dialog shown after the evaluation times out]

So it looks like it's set up to not resume threads by default, but if evaluation times out, it assumes there's a deadlock and resumes them. Perhaps we can do something similar?

@int19h int19h transferred this issue from microsoft/ptvsd May 4, 2020
@int19h int19h added the bug Something isn't working label Jun 19, 2020
@fabioz (Collaborator) commented Jun 24, 2020

I'll start to work on this issue.

My current idea for the fix is doing the following:

  • If an evaluation doesn't finish within a given timeout, notify the user and temporarily resume the other threads so that the evaluation can complete (pausing them again afterwards).
  • If the evaluation still doesn't finish, interrupt it.

As a note, I think that for the first one we can have something like a 10s timeout (which the user could configure to be lower/higher/disabled), and the second should not be active by default, since it's possible that the evaluation really is expected to be a long-running operation (for now I'll only make these configurable through environment variables -- we can add a proper setting for that later if needed).
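
A minimal sketch of the unblock-on-timeout idea (illustration only: the environment variable name and the helper callbacks are placeholders, not debugpy's actual internals):

import os
import threading

# Hypothetical environment variable name; the real names were not given here.
UNBLOCK_TIMEOUT = float(os.environ.get("EVALUATE_UNBLOCK_TIMEOUT", "10"))

def evaluate_with_unblock(evaluate, resume_other_threads, pause_other_threads):
    # Run the evaluation in a background thread; if it has not finished
    # within the timeout, assume it is blocked on a suspended helper
    # thread, resume the other threads, and pause them again afterwards.
    done = threading.Event()
    result = {}

    def run():
        result["value"] = evaluate()
        done.set()

    threading.Thread(target=run, daemon=True).start()
    if not done.wait(UNBLOCK_TIMEOUT):
        resume_other_threads()  # let helper threads process pending requests
        done.wait()             # the evaluation can now complete
        pause_other_threads()   # restore the all-threads-stopped state
    return result["value"]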

After that's done for the evaluation we should probably do something similar for the use case where just getting a representation of some object takes a really long time (which is probably the culprit in #274).

@int19h @karthiknadig I'll start to work on that approach, but I'm open to feedback if you think something should be different here.

@int19h (Contributor) commented Jun 24, 2020

Sounds good!

For the notification, we can always add a custom message and wire it up on the VSCode side - indeed, we already discussed such a thing for #170. It could then be an async message that does not block or otherwise interrupt the evaluation - it merely notifies the user that it's still ongoing.

Also, DAP added "progress" events not long ago - take a look at ProgressStartEvent in the spec. VSCode has some built-in UX for this, but it's more low-key, like a spinning circle. If we use that, I think we can reduce the timeout so it kicks in after, say, 1s. If async exception injection works out, we can wire it up to CancelRequest. This would also fix #173.
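
Per the DAP spec, a progressStart event the adapter could emit when an evaluation runs long looks roughly like this (field values are illustrative):

{
    "seq": 153,
    "type": "event",
    "event": "progressStart",
    "body": {
        "progressId": "eval-1",
        "title": "Evaluating expression...",
        "cancellable": true
    }
}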

@int19h (Contributor) commented Jun 24, 2020

To clarify: 1s is probably too low to try resuming other threads - 10s sounds more reasonable (we can experiment and figure out what works best). But 1s would be a better threshold for a low-key notification that evaluation is taking a long time.

@fabioz fabioz changed the title Evaluating spark code with justMyCode: false causes it to hang indefinitely Provide a way to unblock secondary threads after a given timeout Jul 9, 2020
@fabioz fabioz changed the title Provide a way to unblock secondary threads after a given timeout Provide a way to unblock secondary threads after a given timeout during evaluate request Jul 9, 2020
@fabioz (Collaborator) commented Jul 9, 2020

As a note, I'm still working on this issue.

I've just finished the part related to unblocking threads, which required dealing with asynchronous messages (both in pydevd and in debugpy), but some things are still missing:

  • provide a message saying that evaluation is taking a long time and how to mitigate that
  • provide thread interruption
  • more work on the corner case where a breakpoint is hit in the secondary thread during the evaluation (unblocking the evaluate request from debugpy was the first step, but it still seems to deadlock afterwards, so this needs a bit more investigation).

@fabioz fabioz changed the title Provide a way to unblock secondary threads after a given timeout during evaluate request Improve dealing with blocking evaluate requests Jul 16, 2020
fabioz added a commit to fabioz/debugpy that referenced this issue Jul 16, 2020
fabioz added a commit to fabioz/debugpy that referenced this issue Jul 17, 2020
fabioz added a commit to fabioz/debugpy that referenced this issue Jul 17, 2020
@fabioz fabioz closed this as completed in 1bef8e5 Jul 18, 2020