Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore whitespace text nodes when parsing projects #1187

Merged
merged 2 commits into from
Oct 12, 2016
Merged

Conversation

jeffkl
Copy link
Contributor

@jeffkl jeffkl commented Oct 12, 2016

This just ignores purely whitespace text nodes. The XML reader only looks at the first 4k of text to determine if it should be ignored. Our tests are generating 70k of whitespace so the XML reader gives up after 4k and assumes its a text node.

Closes #270

Fixes test failures in #1004 when this change is ported to master.

This just ignores purely whitespace text nodes which seem to be part of the parsing in .NET Core and newer versions of System.Xml

Closes dotnet#270, fixes test failures in dotnet#1004
@jeffkl
Copy link
Contributor Author

jeffkl commented Oct 12, 2016

@rainersigwald figured out the problem. The XML parser only looks at the first 4k of text to see if it's whitespace or not. Since our whitespace is 70k, it gives up and assumes it's a text node instead of ignoring it. Our call to IsNullOrWhitespace() could be a little slow but the case is going to be very rare.
https://github.com/dotnet/corefx/blob/bffef76f6af208e2042a2f27bc081ee908bb390b/src/System.Private.Xml/src/System/Xml/Core/XmlTextReaderImpl.cs#L5571

Copy link
Member

@rainersigwald rainersigwald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nits. Awesome to finally have this root-caused and fixed!

@@ -48,6 +48,13 @@ private static List<XmlElementWithLocation> GetChildElements(XmlElementWithLocat
VerifyThrowProjectValidNamespace(childElement);
children.Add(childElement);
break;
case XmlNodeType.Text:
// Whitespace is read as a #text element so if the node is purely whitespace, just ignore it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's odd that we're getting a whitespace-only text node when it should come back as a Whitespace node. But the reason is that there's a limit in XmlTextReaderImpl's determine-node-type code: if it's greater than 4kB of whitespace, it'll return as Text.

But given that it works this way, this check sounds good.

if (!String.IsNullOrWhiteSpace(child.InnerText) && throwForInvalidNodeTypes)
{
ThrowProjectInvalidChildElement(child.Name, element.Name, element.Location);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer

if (string.IsNullOrWhitespace(child.InnerText)) {
    // long whitespace comes back as text, see [link]
    break;
}
goto default;

and rely on the default case, instead of having to check whether or not to throw and throwing in this case.
with a fall

@@ -48,8 +48,15 @@ private static List<XmlElementWithLocation> GetChildElements(XmlElementWithLocat
VerifyThrowProjectValidNamespace(childElement);
children.Add(childElement);
break;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AHH more whitespace, i'll remove this. #whitespacematters

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants