ER: curl #650
Comments
Hey, this is very cool! Thanks! I'll do my best to integrate this soon. Also, I've started on a streaming parser, which will help with huge JSON inputs.
I'm thinking that for GET everything should be passed as an input. For POST/PUT the interface design gets trickier. Also, these have to be builtins, but I really want them in their own module. But again, I'm not sure how best to express sandbox vs. not-sandbox.
@nicowilliams asked:
It seems to me that it might be best to introduce support for these enhancements in (well-thought-out) phases, with Phase 1, for example, being confined to GET. To me, GET is similar to […]. As for factorization, I don't understand the concern about cross-product behaviors.

Anyway, my main hope is that traversing a "graph database" will be straightforward. For this and other scenarios, the only thing that typically changes is the URL (by which I mean to include all the forms allowed by curl). That's why I suggested it be the input. (However, as I mentioned, I am not advocating conformance to libcurl for its own sake. For example, I could see the input being an array composed of bits of a curl URL.)

As for the timeout, it seems to me that that is something that might need to be tweaked independently of everything else, and possibly even dynamically.
Some unfortunate servers use GET inappropriately... Also, for the I/O builtins I have a variety of options: by default you […]. I think perhaps that is as fine-grained as we can hope to get.
It's just that jq function call argument lists generally result in closures being passed. Basically, it's best to think of jq def function arguments as closures (callbacks) rather than values.
Python "requests" is rightly well-regarded, and looking over the documentation (http://docs.python-requests.org/en/latest), a few thoughts occur to me. First, in a nutshell, Python requests work like this:
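The snippet that followed did not survive; the requests pattern being described is presumably along these lines (the URL and credentials here are illustrative placeholders, not from the original):

```python
# Sketch of the Python "requests" pattern referred to above.
# The URL, user, and password are hypothetical placeholders.
import requests

def fetch_json(url, user, password, timeout=10):
    # The "what" (the URL) is kept separate from the "how" (auth, timeout).
    r = requests.get(url, auth=(user, password), timeout=timeout)
    r.raise_for_status()  # turn HTTP errors into exceptions
    return r.json()       # the Response also carries .status_code, .headers, etc.
```

The returned `Response` object neatly encapsulates the status code, headers, and body, which is the encapsulation the next paragraph refers to.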
Notice that the "what" (the URL) is kept separate from the "how" (here, the authorization information). More significantly, the returned value neatly encapsulates a ton of information. Perhaps jq should have something like Requests::get as well as a slightly higher-level function (Requests::json) for URLs that are supposed to return JSON. For Requests::json, jq's try/catch mechanism could be used to provide information about errors. Something like:
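A hypothetical sketch of what that might look like (the `Requests::json` name comes from the text above; the error-object shape is purely illustrative):

```jq
# Hypothetical: Requests::get returns {status, headers, body};
# Requests::json additionally parses the body as JSON, or raises an error.
def fetch_or_report(url):
  try (url | Requests::json)
  catch {url: url, error: .};
```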
Given that in jq the best way to think about function arguments is as closures, I think I'd prefer something like this:
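A hypothetical sketch of such an interface (the `GET` name and option fields are illustrative, not an actual jq builtin):

```jq
# Hypothetical builtin: the URL comes in as the input,
# and the options are passed as an object argument.
"https://example.com/a.json"
| GET({timeout: 30, headers: {"Accept": "application/json"}})
```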
or something like that. Among the available options would be whether to treat the response body as raw, raw and line-oriented (raw and slurped), or as JSON (and whether to stream and/or slurp), though perhaps we should separate jq from curl options, so we'd have:
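One hypothetical way of writing the separated form (names and option fields illustrative):

```jq
# Hypothetical: curl options and jq options as separate arguments.
# The second object is the jq side: how to treat the response body.
"https://example.com/a.json"
| GET({timeout: 30}; {body: "json", slurp: false})
```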
If we make anything a closure argument, it should be the URL, in which case we'd have a "get all of these URLs" function:
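A hypothetical sketch of the closure form (again, `GET` is illustrative): because a jq closure argument can generate many values, one call fetches every URL the closure produces.

```jq
# Hypothetical: the closure argument generates the URLs to fetch.
{urls: ["https://example.com/a.json", "https://example.com/b.json"]}
| GET(.urls[]; {timeout: 30})
```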
Now, how to output the response? The caller might not want to "slurp" the entire response body, instead outputting a value per item in the response (e.g., if the response is a jq-style JSON text sequence).

But there's more than the response body: there's also the headers and status code, and even things like the server's certificate and cert chain, the trust anchor to which the server cert was validated, and so on. One option would be to output headers and other response metadata as the first value, then the response body as zero, one, or more values as appropriate. Another is to slurp the response body and then output it together with the metadata as a single value.
To flesh that out a bit more, we might have:
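One hedged reading of the "metadata first, then body values" convention (all names hypothetical):

```jq
# Hypothetical output convention: the first output is the metadata,
# subsequent outputs are the body values.
#   {kind: "meta", status: 200, headers: {...}}
#   {kind: "body", value: ...}    # zero or more of these
"https://example.com/seq" | GET({}; {stream: true})
```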
Also, we might have a form where a closure is passed that decides whether the metadata is acceptable and returns true or false (or calls `error`).
Now we see that we have an ambiguity w.r.t. the other HTTP::GET/1 mentioned earlier, though the input array's length and the […].
@nicowilliams observed:
Exactly! HTTP::GET(a_list_of_URLs) in my opinion was not a good idea to begin with. Your GET/0 and simple_GET/0 have it exactly right w.r.t. URLs.
@pkoppstein This line of argument also leads one to apply the same design to the regexp builtins, no? Have we made a mistake with those? But for 1.6 my hope is to move all builtins into appropriate builtin modules, with all the ones that people expect from 1.5 in a "jq" module that could be imported as […].
@nicowilliams asked:
We have not made a mistake with regexp, which can be thought of as adhering to a "substrate-as-input" design; that design calls for URLs-as-input in the case of filters supporting HTTP GET. That is, "STRING | regexp(RE)" is entirely analogous to "URL | get(OBJ)". By the way, since libcurl is not restricted to HTTP/HTTPS, and since I'm hoping that the new functionality will support the "file" URI scheme (file:///), I'm not sure that emphasizing HTTP and GET in the module or function names is such a good idea.
EDIT: As I remember, @pkoppstein first pointed out the ability to "flip" functions quite a while back.

As to all the protocols that curl supports, you're quite right, but there will be cases where a specific HTTP verb is desired, and while the name of the module and builtin might not say "HTTP", the details of HTTP will probably leak (e.g., headers). How about:
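A hypothetical reading of the naming sketch (the module and filter names are illustrative):

```jq
# Hypothetical: a "curl" module with per-verb filters,
# so the module name doesn't say "HTTP" but the verbs are explicit.
import "curl" as curl;
"https://example.com/a.json" | curl::get({timeout: 30})
# curl::post, curl::put, ...
```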
and so on.
@nicowilliams wrote:
Yes -- you may have forgotten that we explored this together around the time of #524. Regarding the jack-of-all-trades function, you most recently suggested that the input (.) have the form:
Previously you had suggested an array. To me, the most important considerations would be (1) efficiency, and (2) ease of error-checking and minimizing the likelihood of errors in the first place. (Maybe the array-format would make it less likely that a URL would be missing?) I was also wondering whether it mightn't be better to avoid nesting, except for "headers"; for example:
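A hypothetical sketch of the flat form being described, with nesting reserved for "headers" (the field names and the `do_request` filter are illustrative):

```jq
# Hypothetical flat request object: nesting only for "headers".
{url: "https://example.com/a.json",
 method: "GET",
 timeout: 30,
 headers: {"Accept": "application/json"}}
| do_request
```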
Yes, I remember (see my edit above, alluding to that). I agree that an array will perform better than an object, but if we're going over the wire it might not matter. OTOH, an array with just 4 things will have a memorable form, so, sure. But I do want to separate curl options from jq options; the jq options here would be the equivalent of the command-line options. (E.g., if you're GETting a text file, you probably want raw input, and possibly slurp, if the text is line-oriented. If you're GETting a […].)
I just discovered jq (d'oh!). How long has this been going on, i.e. when was jq first made available to the world at large?
@stedolan's first commit on the public repo was on July 18, 2012.
|
Hey! I'm very interested in this discussion. This jq+shell sample application we made for Typeform I/O would be way cleaner if it lived entirely inside jq: https://github.com/TypeformIO/JQ-FormCreation
I think I'd rather see file/popen I/O builtins than have jq link with libcurl and OpenSSL and such. A module system extension for C-coded modules would also work. |
I agree with Nico that a popen builtin or a C-coded module is the best way to implement this. As for design, I think we should focus on a general, low-level API analogous to Go's (*http.Client) Do. My weak preference is:
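The stated preference itself did not survive, but an API "analogous to Go's (*http.Client) Do" would presumably take a complete request value and produce a complete response value, e.g. (all names hypothetical):

```jq
# Hypothetical low-level filter: a full request object in,
# a full response object out, with no verb-specific sugar.
{url: "https://example.com", method: "POST", headers: {}, body: "..."}
| http_do
# output shape: {status: ..., headers: {...}, body: ...}
```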
@dtolnay Welllll, if you wanted to call jq from Java, say, then you'd be unhappy with popen on systems where the kernel/libc don't support vfork() and use it in posix_spawn(). Also, a curl jq interface must not expose the mess of CLI options that is curl. That said, we'll get a lot of mileage out of a shell-out, so we should do it. As for a libcurl interface, if we ever do it at all then I'd like to do that via dynamically loaded jq functions. It's reasonable to have a build dependency on Oniguruma (or a descendant) for regexp, as that brings in no further dependencies, but once we're talking about curl we also get OpenSSL and/or friends, and things begin to get ugly. Also, we'd have to finish the C-coded generators business if we're going to talk to libcurl in any way other than through curl(1).
glibc might support vfork() on some kernels nowadays since about a year ago, IIUC, but I'm probably not looking in the right place, and I'm just wasting my time. We should hope popen() uses posix_spawn(), and that the latter uses a non-COW vfork(), and if anyone is unhappy with the lack of a true vfork() then we can tell them to complain to their OS vendor/distro. |
There has been some discussion about endowing jq with curl-like
support for retrieving information (and especially JSON) from remote
resources. In order to expedite enhancements in this direction (as
well as to provide support for reading from LOCAL files), I would like
to propose that jq use the "easy interface" of libcurl. To this end,
I am appending a stand-alone C program, parts of which could be used
for integrating jq and libcurl to support synchronous retrieval.
I realize that libcurl may be overkill, so if there are better
alternatives that would enable similar support for synchronous
retrieval of JSON documents (both locally and remotely) to be provided
expeditiously, that is fine. However, I'd also like to point out that
libcurl could be used in the short-run without making a
long-term commitment to using libcurl: stability is only needed for
the jq filters.
Still, libcurl has its advantages. It is widely used and will
presumably be supported indefinitely. It is sophisticated and will
provide a path for jq to follow as its own capabilities become more
sophisticated.
Motivation
One of the reasons for requesting this enhancement is that I am using
a RESTful collection of tens of thousands of JSON documents, linked
together by ids in a kind of "graph database". For example, an id
field in one document might be "1622", referring to another JSON
document at
http://api.legiscan.com/?key=InsertKeyHere&op=getSponsor&id=1622
Similarly, there are references within this "graph database" to entities
available elsewhere on the web as JSON objects.
The ability to query local files is also a major motivation.
Non-JSON resources
For the sake of simplicity, the following assumes that the resource being
queried will return a single JSON entity.
Specification
In order to accommodate future enhancements, I would tentatively
propose that all parameters except the URL and timeout be passed in
via a JSON object, e.g.
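A hypothetical example of such a parameter object (the option names are illustrative, loosely modeled on libcurl option names):

```jq
# Hypothetical: everything except the URL and timeout
# is passed in via a single JSON object.
{followlocation: true,
 headers: {"Accept": "application/json"}}
```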
One possibility along these lines would be for jq initially to support
two jq filters:
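The two filter signatures were not preserved; hedged guesses consistent with the surrounding text would be a minimal form and a general form (names hypothetical):

```jq
# Hypothetical: a minimal filter and a fully-parameterized one.
"URL" | getjson                    # defaults for timeout and options
"URL" | getjson(TIMEOUT; OPTIONS)  # e.g. getjson(30; {headers: {...}})
```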
These would of course either fail (or return null) or return a JSON entity.
Questions
What should the name of the jq filter for retrieving JSON entities be?
How should non-JSON resources be supported?
First Steps
The following is a standalone C program with functions that could be used to integrate jq with libcurl. Please feel free to use it.