Skip to content
This repository has been archived by the owner on Feb 15, 2023. It is now read-only.

Public error reporting API #39

Closed
craigbarnes opened this issue Aug 22, 2013 · 12 comments
Closed

Public error reporting API #39

craigbarnes opened this issue Aug 22, 2013 · 12 comments

Comments

@craigbarnes
Copy link
Contributor

Is there anything a contributor could do to help make the error reporting API public? Any tips on issues I might come accross if I just go ahead and try? I'd like to use Gumbo as the basis for a lint tool, so it seems helping with this would be a good start.

@gsnedders
Copy link
Contributor

Parse errors, I presume you mean? Anything else? OOM?

@nostrademons
Copy link
Contributor

The most important thing you could do is to come up with a detailed list of your requirements. The reason the error reporting API didn't make it into the initial release is because I don't have enough data points to triangulate an API that's reasonably simple and yet convenient for all the folks who might use it.

What I'd do is to build the API that would be most convenient for your tool, as a change to gumbo.h and possibly to the Python bindings, and send me a pull request for it. I'll likely sit on that pull request for a bit while I figure out how it dovetails with a couple groups within Google that have expressed interest in having error information available, and also how it might integrate with html5lib (ideally I'd like to be able to mimic the html5lib error API from the information Gumbo exposes: it doesn't have to be a 1:1 correspondence, but it should be possible to reconstruct the errors that html5lib reports with a reasonable amount of adapter code). Meanwhile, you'll be able to make progress on your lint tool, using your own fork. In my experience, the API you need tends to change a lot as you explore the problem space, and so it's good not to nail it down formally until you've written a bunch of real client code. But having the current requirements captured as code lets me take a look at the usage, perhaps suggest ways to make it more general, and figure out how to implement it. Then once we have an API that works reasonably well, I/we can start working on changing the internal calls to add_parse_error to record everything needed to support that API.

Responding to Geoffrey as well: I'd like to support OOM errors as well, but I'm unsure what the ideal API and contract with the client should be. Should it be a flag in GumboOutput? Part of the return value of gumbo_parse? Both of these are slightly painful since there are already 7 third-party bindings out there whose APIs might need to be updated.

Or should it be a special type of GumboError that'll appear in GumboOutput? OOM errors are a bit tricky to deal with because at the point you get them, you have no more memory available, and so you have no way to allocate an error object or do any significant error recovery. But that can usually be gotten around by pre-allocating the error object when parsing starts and setting a flag or longjmping back to the entry point. Should Gumbo free the partial parse tree when an OOM occurs, or should it simply report the error and leave the partial parse tree in the output structure? The latter will often mean that client code will OOM immediately if it's not built to handle OOM errors. The former complicates memory management, as sometimes you'll need to gumbo_destroy_output while other times it'll already be destroyed.

@galdor
Copy link

galdor commented Aug 23, 2013

I would suggest the following:

  • When parsing fails or an allocation error occurs, report an error code, either GUMBO_ERR_PARSING or GUMBO_ERR_ALLOCATION. It could be a field in GumboOutput.
  • Fill a buffer in GumboOutput with a description on the error (either "cannot parse document" or "cannot allocate memory: %m".
  • If this is a parsing error, the error list contains all parsing errors with for each one, the start and end position in the input buffer, and a string describing the error.
  • If this is an allocation error, there is no point is keeping a half-parsed document. It cannot be used on its own, and any processing will probably need to allocate memory. Just free any memory allocated.

The following code is then possible:

GumboParser *parser;

parser = gumbo_parse(input);
if (parser->error_code != GUMBO_ERR_OK) {
    fprintf(stderr, "Cannot parse input: %s\n", parser->error_desc);

    if (parser->err_code == GUMBO_ERR_PARSING) {
        for (int i = 0; i < parser->parse_errors.length; i++) {
            GumboParseError *err = parser->parse_errors.data[i];
            fprintf(stderr, "%d:%d: %s\n", err->line, err->column, err->desc);
        }
    }
}

@craigbarnes
Copy link
Contributor Author

The above seems more or less like what I was thinking too. Making GumboError available in gumbo.h seems like the simplest API, unless I'm missing something.

@craigbarnes
Copy link
Contributor Author

Hmm, I should have looked a little closer at the code before I spoke. I guess separating the error representation from all the internal structures will require another layer. I see the point about defining an API now.

@gsnedders
Copy link
Contributor

Also worthwhile to note that the parse error expectations in html5lib-tests are likely wrong in the python-0.95 tag (nobody had used them for ages!). @nolanw updated them in html5lib/html5lib-tests@acaf74d which is the current tip of the master branch — nobody apart from him has checked they're valid, though, so I won't guarantee they're all right.

@nostrademons
Copy link
Contributor

Yeah, my main concern about any error API is to not encode structure into string error messages, because inevitably someone will come to depend upon that functionality, and then we can never change the error messages. I'd much rather expose a bunch of structs and enums where all useful information can be extracted programmatically, and then provide a layer to convert that to a string representation (like gumbo_error_to_string and gumbo_caret_diagnostic_to_string in the not-quite-public error.h). That makes it reasonably easy to just print out the errors if that's what your tool needs to do, but it also makes it possible to do other things, like:

  1. Count the number of errors of a certain type, if eg. you need to prepare a report to management to justify a cleanup effort.
  2. Analyze which of a set of pages contain errors of a certain type, if eg. you want to improve development practices across an organization.
  3. Present errors as part of a GUI or non-console app.
  4. Automatically suggest fixes for errors. You could build a tool that searches for errors and if there's a single unambiguous fix, it just does it. If there're multiple possible interpretations, it prompts the user for their intent and then applies the fix.

The key to those use-cases is that I'd like the errors to have some sort of semantic meaning in developer's heads, much more like "You put non-table content in a table" or "You closed a body tag early" rather than "In state in_table_row of the parser, encountered a

tag." The latter is unlikely to be helpful to a developer, while something like "Encountered a starting
tag in a table. The only legal tags here are , , , , , etc." could be quite helpful.

Or should I just expose the insertion mode and current input and let translating that into useful error messages be a value-add for a third-party library?

As for OOM errors, my current thinking is to have it be a normal GumboError object in the GumboOutput.errors vector with type GUMBO_ERR_ALLOCATION, pre-allocated and inserted at the end of the errors vector if an OOM occurs. The parse tree won't be freed, it'll just be returned in whatever state it was in when the OOM occurred. That requires some complexity to pre-allocate and hold it, and it also means that the errors vector will always need to reserve one spot more than its current size (since it can't be re-allocated if an OOM occurs). But it keeps the current API intact (so nobody who's done bindings for another language will need to re-do them), it keeps memory ownership consistent between the "OOM occurred" and "OOM didn't occur" cases, it lets clients do a best-effort analysis of the parse-tree returned even if an OOM occurred (sometimes that's all that's necessary, if eg. you're counting nodes or features of the document), and it provides a consistent test for checking if an OOM occurred. (output->errors->length > 0 && ((GumboError*) output->errors->data[output-errors->length]).type == GUMBO_ERR_ALLOCATION).

Sound good to everyone? I'd like to get OOM handling done ASAP because it's blocking a couple security bugs that really need to be freed before 1.0.

@galdor
Copy link

galdor commented Aug 26, 2013

As long as I can get error descriptions as strings without having to generate the strings from structures myself, I'm good. Having a formal error representation may be useful, but it seems like a lot of work.

Regarding OOM error handling, I only care about them being detected. I fail to see the point in keeping an uncomplete html tree, since no matter how simple your usage is (counting the nodes), it will yield a wrong result. But it does not impact me as a library user; it's totally up to you :)

@gsnedders
Copy link
Contributor

So the huge dict at the start of https://github.com/html5lib/html5lib-python/blob/master/html5lib/constants.py may well be worth looking through — notably, there are things where there are nothing formatted strings (e.g., "Unexpected End of file. Expected DOCTYPE."), where there is one (e.g., "Unexpected start tag (%(name)s)."), and where there is two (e.g., "Unexpected end tag (%(gotName)s). Expected end tag (%(expectedName)s).").

@kevinhendricks
Copy link
Contributor

A public interface for errors is still really needed. To do basic sanity checks on epub xhtml pages before loading then in a QWebView widget, I woud love to get back a basic "its okay" check that lays out nicely where the error is (line number and column) and a basic error message.

Right now to do that I have to include error.h and do something like the following just to get the errors in my own well_formed.cc example:

[CODE]
...
GumboOutput* output = gumbo_parse_with_options(&myoptions, contents.data(), contents.length());
GumboParser parser;
parser._options = &myoptions;
const GumboVector* errors = &output->errors;
for (int i=0; i< errors->length; ++i) {
GumboError* er = static_cast<GumboError*>(errors->data[i]);
unsigned int linenum = er->position.line;
unsigned int colnum = er->position.column;
unsigned int typenum = er->type;
GumboStringBuffer text;
gumbo_string_buffer_init(&parser, &text);
gumbo_error_to_string(&parser, er, &text);
std::string errmsg(text.data, text.length);
fprintf(stdout, "line: %d col: %d type %d %s\n", linenum, colnum, typenum, errmsg.c_str());
gumbo_string_buffer_destroy(&parser, &text);
// gumbo_print_caret_diagnostic(&parser, er, contents.c_str());
}
gumbo_destroy_output(&myoptions, output);
[/CODE]

I used to stop at the first error by setting that in options, but there is a bug someplace in the code that sets has_error to true when there is no actual errors stored by the parser. Making stopping, less than worthwhile.

I was thinking that has_error at the end of parser.c might want to simply check to see if the length of the error vector is > 0 before actually stopping on the first error.

So I was wondering what your plans and timeline are for making an error interface available?

Thanks,

Kevin

@nostrademons
Copy link
Contributor

What is it that you'd like to see, ideally, for an API? The biggest blocker is that we don't really know what such an API should look like. So if you could give us an idea of what the use-case is, what sort of errors would be most helpful, and how you plan to display or otherwise handle them, that'd be helpful.

error.h is currently considered internal/unstable - it provides a low-level API that does what the spec says, but isn't all that useful for client code. It's probably possible to recover most of the useful error information from it, but not easy.

@stevecheckoway
Copy link

I found the api in error.h to be almost exactly what I wanted for Nokogumbo. The one thing that seemed awkward was constructing the fake GumboParser to pass the options in.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants
@nostrademons @galdor @gsnedders @craigbarnes @stevecheckoway @kevinhendricks @awhalley and others