Test failure: System.Diagnostics.Tests.ProcessTests/TestCheckChildProcessUserAndGroupIds #28922

Closed
ghost opened this issue Mar 11, 2019 · 21 comments · Fixed by #46302
Labels
area-System.Diagnostics.Process disabled-test The test is disabled in source code against the issue test-bug Problem in test source code (most likely)
Comments

@ghost

ghost commented Mar 11, 2019

Opened on behalf of @wfurt

The test System.Diagnostics.Tests.ProcessTests/TestCheckChildProcessUserAndGroupIds has failed.

Failure Message:

System.Diagnostics.RemoteExecutorTestBase+RemoteInvokeHandle+RemoteExecutionException : Remote process failed with an unhandled exception.

Stack Trace:


Child exception:
  Xunit.Sdk.XunitException: Expected: 4, 20, 24, 25, 27, 29, 30, 44, 46, 109, 110, 998, 1000
Actual: 4, 20, 24, 25, 27, 29, 30, 44, 46, 109, 110, 1000
   at System.AssertExtensions.Equal[T](HashSet`1 expected, HashSet`1 actual) in /__w/1/s/src/CoreFx.Private.TestUtilities/src/System/AssertExtensions.cs:line 349
   at System.Diagnostics.Tests.ProcessTests.CheckUserAndGroupIds(String userId, String groupId, String groupIdsJoined, String checkGroupsExact) in /__w/1/s/src/System.Diagnostics.Process/tests/ProcessTests.Unix.cs:line 515

Child process:
  System.Diagnostics.Process.Tests, Version=4.2.1.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51 System.Diagnostics.Tests.ProcessTests Int32 CheckUserAndGroupIds(System.String, System.String, System.String, System.String)

Child arguments:
  1000, 1000, 4,20,24,25,27,29,30,44,46,109,110,998,1000, True

Build : 3.0 - 20190310.5 (Core Tests)
Failing configurations:

  • Ubuntu.1604.Amd64-x64
    • Release
@wfurt
Member

wfurt commented Mar 11, 2019

cc: @tmds It seems like 998 is missing for some reason. That looks different from the duplicates.

@stephentoub
Member

That looks different from the duplicates.

I believe this is the first time we've seen it fail again since I added the extra diagnostics to output the full contents of both sets.

@stephentoub
Member

cc: @danmosemsft

@danmoseley
Member

In a build the previous day, identical result:

https://mc.dot.net/#/product/netcore/30/source/official~2Fdotnet~2Fcorefx~2Frefs~2Fheads~2Fmaster/type/test~2Ffunctional~2Fcli~2F/build/20190309.4/workItem/System.Diagnostics.Process.Tests/analysis/xunit/System.Diagnostics.Tests.ProcessTests~2FTestCheckChildProcessUserAndGroupIds

Xunit.Sdk.XunitException: Expected: 4, 20, 24, 25, 27, 29, 30, 44, 46, 109, 110, 998, 1000
Actual: 4, 20, 24, 25, 27, 29, 30, 44, 46, 109, 110, 1000

So what is the mysterious 998?

@wfurt
Member

wfurt commented Mar 11, 2019

I can get a repro machine tomorrow and check whether I can make it fail again.

@tmds
Member

tmds commented Mar 11, 2019

Thanks @wfurt .

The list that contains 998 is determined by calling id -G <username>.
The child process will launch successfully when the actual groups are a subset of the requested groups. The test is stricter and requires an exact match (checkGroupsExact is true on Linux):

https://github.com/dotnet/corefx/blob/fd0911e27f6c6f958f213bac8b3fb99b7700167b/src/System.Diagnostics.Process/tests/ProcessTests.Unix.cs#L513-L520
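
The difference between the two checks can be sketched in a few lines of Python (the actual test is written in C#, so this is only an illustration). The function name `groups_ok` and its parameters are hypothetical; the group IDs are the ones from the failure reported above.

```python
def groups_ok(requested, actual, exact):
    """Return True if the `actual` group IDs are acceptable for `requested`.

    exact=False mirrors the launch rule described above (actual must be a
    subset of requested); exact=True mirrors the stricter test assertion.
    """
    requested, actual = set(requested), set(actual)
    if exact:
        return actual == requested
    return actual <= requested


# Group IDs taken from the failing run reported in this issue.
requested = [4, 20, 24, 25, 27, 29, 30, 44, 46, 109, 110, 998, 1000]
actual = [4, 20, 24, 25, 27, 29, 30, 44, 46, 109, 110, 1000]

print(groups_ok(requested, actual, exact=False))  # subset check
print(groups_ok(requested, actual, exact=True))   # exact check
print(sorted(set(requested) - set(actual)))       # groups missing from actual
```

With these inputs the subset check passes and the exact check fails, and the only ID missing from the actual set is 998.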

@wfurt
Member

wfurt commented Mar 11, 2019

The mysterious 998 is 'docker'. So far I have not been able to reproduce the failure on the repro machine. At least the groups and OS should be identical to those on the official machines. The test output is sorted, but in the 'id -G' output the missing group is last.

helixbot@a001DOJ:/mnt/workspace/Work/d5bc6e62-4804-4bcd-a225-9386663e3c88/Exec$ id -G
1000 4 20 24 25 27 29 30 44 46 109 110 998
helixbot@a001DOJ:/mnt/workspace/Work/d5bc6e62-4804-4bcd-a225-9386663e3c88/Exec$ grep 998 /etc/group
docker:x:998:helixbot

@tmds
Member

tmds commented Mar 11, 2019

I'm also part of a docker group (981 on my machine).
What does getgroups return when you strace the groups command?

$ id -G tmds
1000 10 981 1001
$ strace -e getgroups groups 
getgroups(0, NULL)                      = 4
getgroups(4, [10, 981, 1000, 1001])     = 4

@wfurt
Member

wfurt commented Mar 11, 2019

I did 100+ rounds of the test and have not seen a single failure so far.

getgroups(0, NULL)                      = 13
getgroups(13, [4, 20, 24, 25, 27, 29, 30, 44, 46, 109, 110, 998, 1000]) = 13

@tmds
Member

tmds commented Mar 11, 2019

One thing that may be causing this is that the helixbot user was added to the docker group, but the process that launches the tests isn't aware of it. This can happen if that process was already running before the helixbot user was added to the docker group. Is this something that could occur in the CI environment?

@danmoseley
Member

@MattGal do you know the answer to @tmds's question here?

@MattGal
Member

MattGal commented Mar 11, 2019

I don't think that'd be the case, but I will investigate after 4 PM or so PST.

@wfurt
Member

wfurt commented Mar 12, 2019

I did ~2000 runs of the whole test set on the repro machine and did not get a single failure.
I don't know if the helixbot user is part of the base image, @MattGal, but if not, that would be the place to look.
To generalize @tmds's comment: adding a user to a group will not be visible to that user's running processes until the user logs in again (or the process re-parses /etc/group).

furt@net-dale:/tmp$ ./test
gr = 6
furt@net-dale:/tmp$ sudo usermod -a -G foo furt
furt@net-dale:/tmp$ ./test
gr = 6

<login/logout>

furt@net-dale:/tmp$ ./test
gr = 7

test simply prints the number of groups the user is in: printf("gr = %d\n", getgroups(0, NULL));
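
A rough Python counterpart of that check, as a sketch only: it compares the groups configured for the current user in the group database with the groups the running process actually holds. `stale_groups` is a hypothetical helper name; `os.getgrouplist`, `os.getgroups`, `os.getegid` and the `pwd` module are standard-library calls that are available on Unix only.

```python
import os
import pwd


def stale_groups():
    """Return (configured_only, held_only) sets of group IDs for this process.

    configured_only: groups the group database lists for the user but the
    process does not hold (for example, added after the session started).
    held_only: groups the process holds that the database does not list.
    """
    # Raises KeyError if the current UID has no passwd entry.
    entry = pwd.getpwuid(os.getuid())
    # Groups configured for the user in the group database.
    configured = set(os.getgrouplist(entry.pw_name, entry.pw_gid))
    # Groups the running process holds; getgroups() may omit the effective
    # GID, so it is added explicitly.
    held = set(os.getgroups()) | {os.getegid()}
    return configured - held, held - configured
```

Calling `stale_groups()` in a session that started before a `usermod -a -G` would be expected to report the new group in the first set, and an empty first set after logging in again.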

@MattGal
Member

MattGal commented Mar 12, 2019

I'm trying to chat with @wfurt offline. For the queue mentioned by @stephentoub: while docker is installed on the machine in question, I see no evidence that anything attempted to run the work item inside a docker container. Machines in Ubuntu.1604.Amd64.Open are just plain Azure D2_V3 VMs. Still looking...

@MattGal
Member

MattGal commented Mar 12, 2019

We think this should be an easy reordering fix in machine provisioning (though it does seem odd that this test is that sensitive to slight machine changes). I filed https://github.com/dotnet/core-eng/issues/5470 to track tweaking the way we add the user to that group.

@tmds
Member

tmds commented Mar 12, 2019

though it does seem odd that this test is that sensitive to slight machine changes

The test verifies whether the process is running with the expected groups. It detects a mismatch between the groups the user (process) has and the groups the user is configured to be in (/etc/group).
The corefx implementation doesn't fail as long as the groups the user has are a subset of the configured set (which is the case here). The test is stricter and expects an exact match.

I filed dotnet/core-eng#5470 to track tweaking the way we add the user to that group.

Great, thank you. It is preferable that we keep the strict check in the test.

@ViktorHofer
Member

Marked it as 3.0. Test still failing?

@wfurt
Member

wfurt commented Apr 3, 2019

I no longer see failures. This was fixed by infrastructure changes.

@wfurt wfurt closed this as completed Apr 3, 2019
@wtgodbe wtgodbe reopened this Jun 25, 2019
@wtgodbe
Member

wtgodbe commented Jun 25, 2019

The test is actually disabled: dotnet/corefx#35949. May be the same issue as https://github.com/dotnet/corefx/issues/38833

@msftgits msftgits transferred this issue from dotnet/corefx Feb 1, 2020
@msftgits msftgits added this to the Future milestone Feb 1, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020
@adamsitnik adamsitnik removed the untriaged New issue has not been triaged by the area owner label Jul 6, 2020
@adamsitnik adamsitnik added test-bug Problem in test source code (most likely) disabled-test The test is disabled in source code against the issue labels Dec 16, 2020
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Dec 16, 2020
@adamsitnik adamsitnik self-assigned this Dec 16, 2020
@adamsitnik adamsitnik modified the milestones: Future, 6.0.0 Dec 16, 2020
@adamsitnik
Member

I was able to reproduce the failure using our CI in #46138

I've verified that it's caused by a difference between what id -G (which we use to obtain the expected value) and getgroups (which we use to get the actual value) return, even without starting any new processes:

eba057f

https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-46138-merge-21aa167a19b94858a8/System.Diagnostics.Process.Tests/console.aa0d1c27.log?sv=2019-07-07&se=2021-01-07T10%3A48%3A30Z&sr=c&sp=rl&sig=yOZ94Wil3yQaONO%2FVo4b293jCZUZ4jwNFtdVd7K33pY%3D

    System.Diagnostics.Tests.ProcessTests.TestCheckChildProcessUserAndGroupIds [FAIL]
      id -G returned: 4
      getgroups returned: 
      Stack Trace:
        /_/src/libraries/System.Diagnostics.Process/tests/ProcessTests.Unix.cs(557,0): at System.Diagnostics.Tests.ProcessTests.TestCheckChildProcessUserAndGroupIds()
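
The comparison described in that comment can be sketched as follows, in Python rather than the C# of the actual test. `parse_id_output` and `diff_groups` are hypothetical helper names; the sample values come from the log above, where `id -G` printed a single group and nothing was printed for `getgroups`.

```python
def parse_id_output(text):
    """Parse the whitespace-separated group IDs printed by `id -G`."""
    return {int(field) for field in text.split()}


def diff_groups(expected, actual):
    """Return (missing, unexpected) group IDs as sorted lists."""
    expected, actual = set(expected), set(actual)
    return sorted(expected - actual), sorted(actual - expected)


# Values from the CI log above: `id -G` printed "4", getgroups printed nothing.
expected = parse_id_output("4")
actual = parse_id_output("")

missing, unexpected = diff_groups(expected, actual)
print("missing from getgroups:", missing)
print("not reported by id -G:", unexpected)
```

Reporting both directions of the difference makes it clear whether the process lacks a configured group or holds one that `id -G` does not list. On a live Unix system the two inputs could instead come from running `id -G` and from `os.getgroups()`, which is an assumption about usage and not something the log shows.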

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Dec 24, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Jan 23, 2021