Improve dealing with blocking evaluate requests #157

Closed · jackwickham opened this issue Feb 19, 2020 · 16 comments
Labels: bug (Something isn't working)
@jackwickham commented Feb 19, 2020

Environment data

  • PTVSD version: 4.3.2
  • OS and version: MacOS and CentOS
  • Python version: 3.6 and 3.7
  • Using VS Code or Visual Studio: VS Code, and interacting with ptvsd directly

Actual behavior

When justMyCode is set to false, operations on a Spark dataframe that was constructed in Python using session.createDataFrame hang when run in the debugger REPL, but not when run normally. Spark appears to hang, printing [Stage 0:> (0 + 1) / 1].

Expected behavior

The code should not hang; at minimum, the behavior should not depend on the justMyCode setting.

Steps to reproduce:

With pyspark installed (I have the latest Spark 3.0 preview build), I can reproduce it by debugging the following script with "justMyCode": false in launch.json.

from pyspark.sql import Row, SparkSession

def my_compute_function(session):
    row1 = Row(id='1', name='name')
    df = session.createDataFrame([row1])
    df.show() # set a breakpoint on this or the next line, and evaluate `df.show()` in the repl when it breaks
    return df


session = SparkSession.builder.master("local").appName("df-show").getOrCreate()
my_compute_function(session)

In ptvsd 4.2.10, it also hangs if you step over the df.show() line, but that was fixed by moving to 4.3.2. I guess that is because of microsoft/ptvsd#1372, although I haven't verified that.

@fabioz (Collaborator) commented Feb 19, 2020

I'll investigate to see what could be causing that...

@fabioz fabioz self-assigned this Feb 19, 2020
@fabioz (Collaborator) commented Feb 19, 2020

I just checked with ptvsd 5 and it seems to work well for me, so @jackwickham, please switch to ptvsd 5 by following the steps in microsoft/ptvsd#1706 (comment). Please reopen if you still see the error after upgrading.

@fabioz fabioz closed this as completed Feb 19, 2020
@jackwickham (Author) commented Feb 19, 2020

I've just tested with the Insiders version with the experiments enabled (and verified that it's using new_ptvsd), both with "debug current file" and by manually running ptvsd (python -m ptvsd --host localhost --port 12224 --wait --log-dir . [my_file]) and attaching the VS Code debugger, and I still run into the issue. The relevant entries from my launch.json are:

{
    "name": "Python: Current File",
    "type": "python",
    "request": "launch",
    "program": "${file}",
    "console": "integratedTerminal",
    "justMyCode": false,
    "redirectOutput": true
},
{
    "name": "Python: Remote Attach",
    "type": "python",
    "request": "attach",
    "port": 12224,
    "host": "localhost",
    "pathMappings": [
        {
            "localRoot": "${workspaceFolder}",
            "remoteRoot": "."
        }
    ],
    "justMyCode": false
}

(@fabioz I don't seem to have permission to reopen)

@fabioz (Collaborator) commented Feb 19, 2020

Can you provide the logs for the run?

i.e.:

  • Open VS Code
  • Select the command Extensions: Open Extensions Folder
  • Locate the Python extension directory, typically of the form ms-python.python-2020..***
  • In that directory, ensure you do not have any debug*.log files; if you do, please delete them
  • Go back into VS Code and modify your launch.json to add the setting "logToFile": true, see below:
"version": "0.2.0",
"configurations": [
    {
        "name": "Python: Current File (Integrated Terminal)",
        "type": "python",
        "request": "launch",
        "program": "${file}",
        "stopOnEntry": true,
        "console": "integratedTerminal",
        "logToFile": true
    },
  • Start debugging
  • When done, go back into the extension directory and upload the file debug*.log into this GitHub issue.

@fabioz fabioz reopened this Feb 19, 2020
@jackwickham (Author)

From attaching to an existing ptvsd process:
ptvsd.adapter-85777.log
ptvsd.server-85761.log

From running using current file (I presume it's these ones that you want):
ptvsd.adapter-86448.log
ptvsd.launcher-86459.log
ptvsd.server-86467.log
debugger.vscode_fe239409-31cb-4137-8320-c731a6f94edf.log

@fabioz (Collaborator) commented Feb 19, 2020

Thanks for the logs... I was able to reproduce it with that info (although I still don't know the reason). I'll investigate.

@fabioz (Collaborator) commented Mar 7, 2020

Ok, I did the investigation here... what happens is that pyspark has 2 threads: the main thread, where commands are requested, and a secondary thread, which is responsible for handling commands and serving as an intermediary with a Java process.

So, when you do df.show(), it creates a command and sends it to be processed by the secondary thread, which communicates with Java.

In the case where justMyCode=true, that secondary thread is never actually paused because it only runs library code, so things work. In the case where justMyCode=false, the debugger pauses that thread too, so it can't process requests; the whole process then deadlocks, because the evaluation can never finish while the thread that would process it is paused.
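
To see the shape of the problem without Spark, here is a minimal sketch (an illustration of the pattern described above, not pyspark's actual code) of a blocking request sent to a worker thread -- if a debugger suspends the worker, the reply never arrives:

import queue
import threading

requests = queue.Queue()

def worker():
    # Stands in for pyspark's secondary thread that talks to the Java process.
    while True:
        command, reply = requests.get()
        reply.put("result of " + command)  # never runs if this thread is suspended

threading.Thread(target=worker, daemon=True).start()

def send_command(command):
    # Stands in for an operation like df.show(): blocks until the worker answers.
    reply = queue.Queue()
    requests.put((command, reply))
    return reply.get()  # deadlocks if the debugger has paused the worker thread

print(send_command("show"))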

Also, because the current policy is to always stop or run all threads together, you can't selectively run just that one thread.

The current workarounds are using justMyCode=true, or locally asking the debugger to never trace that thread.

To ask the debugger to never trace that thread, edit /site-packages/pyspark/accumulators.py: in the function _start_update_server, after thread = threading.Thread(target=server.serve_forever), add a line with the contents thread.pydev_do_not_trace = True (any thread with a pydev_do_not_trace attribute is not traced, so that thread will never be stopped).
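
As a standalone illustration of that workaround (generic code, not pyspark's actual _start_update_server), marking a thread this way keeps pydevd from tracing it:

import threading
import time

def serve_forever():
    # Placeholder for the accumulator server loop in pyspark.
    while True:
        time.sleep(1)

thread = threading.Thread(target=serve_forever)
# pydevd skips any thread that has a pydev_do_not_trace attribute,
# so the debugger will never suspend this thread.
thread.pydev_do_not_trace = True
thread.daemon = True
thread.start()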

For an actual fix, I think we'd need to run/stop threads independently. Right now we stop/resume all threads by design, because that's the only option supported by Visual Studio, but I believe Visual Studio Code itself does support running/stopping threads independently (as does the debugger), so maybe we could add an option letting users choose the policy here (it needs some investigation though, and it's probably not a simple fix). @karthiknadig @int19h what do you think?

@karthiknadig (Member)

@fabioz is steppingResumesAllThreads not enough here? If not, we should consolidate the behaviors we want into one simple setting. I don't want to add something like controlThreadsIndependently, because then we'd have to support all combinations of steppingResumesAllThreads and controlThreadsIndependently. Instead we could have a behavior-based setting, something like threadControl: [all (default), individual]. all is the current behavior, where all threads are paused and resumed together; individual means each thread can be paused or resumed independently. Maybe also user: all user threads are paused and resumed together (I'm not sure about the feasibility of this option). Does this work?
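
For illustration, a launch.json entry under this proposal might look like the following (threadControl is a hypothetical setting being proposed in this discussion, not a shipped option):

{
    "name": "Python: Current File",
    "type": "python",
    "request": "launch",
    "program": "${file}",
    "justMyCode": false,
    "threadControl": "individual"
}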

@fabioz (Collaborator) commented Mar 8, 2020

@karthiknadig steppingResumesAllThreads is not enough, because in this case the user is not stepping (he's just evaluating some code). In this particular case, for things to work, just the thread with the breakpoint has to be stopped while the other thread keeps running. Another option, instead of having threads run independently, would be a setting such as evaluateResumesAllThreads, which would resume all other threads when the evaluation starts and pause them again when it finishes. I think having threads controlled independently would be the better solution, the downside being that it wouldn't work for Visual Studio, just for Visual Studio Code.

I don't think we can reliably differentiate user threads from system threads (because we can't tell whether a particular thread will touch user code given its current state).

As for the setting, I didn't really understand the difference between controlThreadsIndependently and threadControl: all|individual -- wouldn't we need to support all combinations of threadControl and steppingResumesAllThreads anyway? (i.e., we'd still need to support stopping all threads while having step resume all threads -- although I agree that threadControl may be nicer, since it's not just a boolean switch and can be extended in the future.)

Note that supporting that combination may not be all that troublesome, as the code paths are independent from one another (although it probably doesn't make much sense to control threads independently and also have stepping resume all threads).

@karthiknadig (Member)

@fabioz I actually meant completely replacing steppingResumesAllThreads with threadControl or controlThreadsIndependently, whichever makes more sense. I want to reduce the number of settings.

@int19h Let me know if you have any ideas.

@fabioz (Collaborator) commented Mar 12, 2020

Well, the first thing is deciding whether we should support Visual Studio here or whether it's ok to support this use case just for Visual Studio Code. If Visual Studio needs to be supported, then we need something like evaluateResumesAllThreads (and I don't think we can really remove steppingResumesAllThreads, so I find it hard to unify everything under a single setting).

@int19h @karthiknadig waiting on feedback here.

@int19h (Contributor) commented Apr 7, 2020

As I recall, we originally decided to use the all-threads-paused model for VSC not just because it's what VS was doing, but also because it was felt to be more intuitive for the users. I think we should make a similar decision on behavior here first. There are obvious downsides to not letting threads run, as this bug shows; but there are also downsides to letting them run, especially if this means hitting breakpoints etc.

For what it's worth, in VS, other threads do not run during evaluation, but there is also a timeout, which brings you to this help page:
https://docs.microsoft.com/en-us/visualstudio/debugger/error-evaluating-the-function-function-timed-out-and-needed-to-be-aborted-in-an-unsafe-way

Also, apparently the docs might not even be up to date, because trying this on VS 2019, I get this error once it times out:

[screenshot: VS 2019 error dialog shown after the evaluation times out]

So it looks like it's set up to not resume threads by default, but if evaluation times out, it assumes there's a deadlock and resumes them. Perhaps we can do something similar?

@int19h int19h transferred this issue from microsoft/ptvsd May 4, 2020
@int19h int19h added the bug Something isn't working label Jun 19, 2020
@fabioz (Collaborator) commented Jun 24, 2020

I'll start to work on this issue.

My current idea for the fix is doing the following:

  • If an evaluation doesn't finish within a given timeout, notify the user and temporarily resume the other threads so that the evaluation can complete (pausing them again afterwards).
  • If the evaluation still doesn't finish, interrupt it.

As a note, I think that for the first one we can have something like a 10s timeout (which the user could configure to be lower/higher/disabled), and the second should not be active by default, since it's possible that the evaluation really is expected to be a long-running operation (for now I'll only make these configurable through environment variables -- we can add a proper setting for that later if needed).
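
A minimal sketch of the unblock-on-timeout idea (illustration only: the environment variable name and the helper callbacks are placeholders, not debugpy's actual internals):

import os
import threading

# Hypothetical environment variable name; the real names were not given here.
UNBLOCK_TIMEOUT = float(os.environ.get("EVALUATE_UNBLOCK_TIMEOUT", "10"))

def evaluate_with_unblock(evaluate, resume_other_threads, pause_other_threads):
    # Run the evaluation in a background thread; if it has not finished
    # within the timeout, assume it is blocked on a suspended helper
    # thread, resume the other threads, and pause them again afterwards.
    done = threading.Event()
    result = {}

    def run():
        result["value"] = evaluate()
        done.set()

    threading.Thread(target=run, daemon=True).start()
    if not done.wait(UNBLOCK_TIMEOUT):
        resume_other_threads()  # let helper threads process pending requests
        done.wait()             # the evaluation can now complete
        pause_other_threads()   # restore the all-threads-stopped state
    return result["value"]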

After that's done for the evaluation we should probably do something similar for the use case where just getting a representation of some object takes a really long time (which is probably the culprit in #274).

@int19h @karthiknadig I'll start to work on that approach, but I'm open to feedback if you think something should be different here.

@int19h (Contributor) commented Jun 24, 2020

Sounds good!

For the notification, we can always add a custom message and wire it up on the VSCode side - indeed, we already discussed such a thing for #170. It could then be an async message that does not block or otherwise interrupt the evaluation - it merely notifies the user that it's still ongoing.

Also, DAP added "progress" events not long ago - take a look at ProgressStartEvent in the spec. VSCode has some built-in UX for this, but it's more low-key, like a spinning circle. If we use that, I think we can reduce the timeout so it kicks in after, say, 1s. If async exception injection works out, we can wire it up to CancelRequest. This would also fix #173.
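
Per the DAP spec, a progressStart event the adapter could emit when an evaluation runs long looks roughly like this (field values are illustrative):

{
    "seq": 153,
    "type": "event",
    "event": "progressStart",
    "body": {
        "progressId": "eval-1",
        "title": "Evaluating expression...",
        "cancellable": true
    }
}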

@int19h (Contributor) commented Jun 24, 2020

To clarify: 1s is probably too low to try resuming other threads - 10s sounds more reasonable (we can experiment and figure out what works best). But 1s would be a better threshold for a low-key notification that evaluation is taking a long time.

@fabioz fabioz changed the title Evaluating spark code with justMyCode: false causes it to hang indefinitely Provide a way to unblock secondary threads after a given timeout Jul 9, 2020
@fabioz fabioz changed the title Provide a way to unblock secondary threads after a given timeout Provide a way to unblock secondary threads after a given timeout during evaluate request Jul 9, 2020
@fabioz (Collaborator) commented Jul 9, 2020

As a note, I'm still working on this issue.

I've just finished the part related to unblocking threads, which required dealing with asynchronous messages (both in pydevd and in debugpy), but some things are still missing:

  • provide a message saying that evaluation is taking a long time and how to mitigate that
  • provide thread interruption
  • more work on the corner case where a breakpoint is hit in the secondary thread during the evaluation (unblocking the evaluate request from debugpy was the first step, but it still seems to deadlock afterwards, so this needs a bit more investigation).

@fabioz fabioz changed the title Provide a way to unblock secondary threads after a given timeout during evaluate request Improve dealing with blocking evaluate requests Jul 16, 2020
fabioz added a commit to fabioz/debugpy that referenced this issue Jul 16, 2020
fabioz added a commit to fabioz/debugpy that referenced this issue Jul 17, 2020
fabioz added a commit to fabioz/debugpy that referenced this issue Jul 17, 2020
@fabioz fabioz closed this as completed in 1bef8e5 Jul 18, 2020