
ER: curl #650

Closed
pkoppstein opened this issue Dec 18, 2014 · 21 comments
Comments

@pkoppstein
Contributor

There has been some discussion about endowing jq with curl-like
support for retrieving information (and especially JSON) from remote
resources. In order to expedite enhancements in this direction (as
well as to provide support for reading from LOCAL files), I would like
to propose that jq use the "easy interface" of libcurl. To this end,
I am appending a stand-alone C program, parts of which could be used
for integrating jq and libcurl to support synchronous retrieval.

I realize that libcurl may be overkill, so if there are better
alternatives that would enable similar support for synchronous
retrieval of JSON documents (both locally and remotely) to be provided
expeditiously, that is fine. However, I'd also like to point out that
libcurl could be used in the short run without making a long-term
commitment to it: stability is only needed for the jq filters.

Still, libcurl has its advantages. It is widely used and will
presumably be supported indefinitely. It is sophisticated and will
provide a path for jq to follow as its own capabilities become more
sophisticated.

Motivation

One of the reasons for requesting this enhancement is that I am using
a RESTful collection of tens of thousands of JSON documents, linked
together by ids in a kind of "graph database". For example, an id
field in one document might be "1622", referring to another JSON
document at

http://api.legiscan.com/?key=InsertKeyHere&op=getSponsor&id=1622

Similarly, there are references within this "graph database" to entities
available elsewhere on the web as JSON objects.

The ability to query local files is also a major motivation.

Non-JSON resources

For the sake of simplicity, the following assumes that the resource being
queried will return a single JSON entity.

Specification

In order to accommodate future enhancements, I would tentatively
propose that all parameters except the URL and timeout be passed in
via a JSON object, e.g.

 { "username": "jqUser", "password": "secret", 
   "headers": { "User-Agent": "jq" } }

One possibility along these lines would be for jq initially to support
two jq filters:

# Input: a string specifying the URL (with query parameters); timeout is in seconds
def curl(obj; timeout): ...;

def curl(obj): curl(obj; 10);

These would of course either fail (or return null) or return a JSON entity.
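For concreteness, a call under this proposal might look as follows. This is only a sketch: curl/2 is the proposed (not yet existing) builtin, and .sponsor.name is a made-up field for illustration:

```jq
# Hypothetical usage of the proposed curl/2:
"http://api.legiscan.com/?key=InsertKeyHere&op=getSponsor&id=1622"
| curl({ "username": "jqUser", "password": "secret",
         "headers": { "User-Agent": "jq" } };
       30)                 # timeout of 30 seconds
| .sponsor.name            # then process the retrieved JSON as usual
```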

Questions

What should the name of the jq filter for retrieving JSON entities be?

How should non-JSON resources be supported?

First Steps

The following is a standalone C program with functions that could be used to integrate jq with libcurl. Please feel free to use it.

// The following is based on http://curl.haxx.se/libcurl/c/getinmemory.html

// gcc curl.c -lcurl   (libraries after source files; some linkers are order-sensitive)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>

struct MemoryStruct {
  char *memory;
  size_t size;
};

static size_t
WriteMemoryCallback(void *contents, size_t size, size_t nmemb, void *userp)
{
  size_t realsize = size * nmemb;
  struct MemoryStruct *mem = (struct MemoryStruct *)userp;

  /* use a temporary pointer so the existing block is not leaked if realloc fails */
  char *ptr = realloc(mem->memory, mem->size + realsize + 1);
  if (ptr == NULL) {
    /* out of memory! */
    fprintf(stderr, "WriteMemoryCallback: not enough memory (realloc returned NULL)\n");
    return 0;  /* a short count tells libcurl to abort the transfer */
  }
  mem->memory = ptr;

  memcpy(&(mem->memory[mem->size]), contents, realsize);
  mem->size += realsize;
  mem->memory[mem->size] = 0;

  // To verify mem->memory has the string:
  // printf("%s\n", mem->memory); 
  return realsize;
}

// timeout in seconds
long jv_curl(char *url, long timeout, void *userp) {
  struct MemoryStruct *chunk = (struct MemoryStruct *)userp;

  struct curl_slist *headers = NULL;

  CURL *curl = curl_easy_init();
  if (curl) {
    CURLcode res;
    curl_easy_setopt(curl, CURLOPT_URL, url);
    /* follow redirection */ 
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);

    curl_easy_setopt(curl, CURLOPT_TIMEOUT, timeout); 

    /* github requires User-Agent so for now ...*/
    headers = curl_slist_append(headers, "User-Agent: jq");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);

    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteMemoryCallback);
    /* pass our 'chunk' struct to the callback function */ 
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, (void *)chunk);

    /* Perform the request, res will get the return code */ 
    res = curl_easy_perform(curl);

    /* Check for errors */ 
    if(res != CURLE_OK) {
      fprintf(stderr, "curl_easy_perform() failed: %s\n",
              curl_easy_strerror(res));
    } else {
       /* chunk->memory now points to a memory block that is chunk->size
        * bytes big and contains the remote file.
    */
      printf("TEST: %lu bytes retrieved\n", (unsigned long)chunk->size);
      printf("TEST: %s\n", chunk->memory);

    }
    /* always cleanup */
    curl_easy_cleanup(curl);
  }
  curl_slist_free_all(headers); /* free the custom header list */
  return (long)chunk->size;
}

int main() {
  struct MemoryStruct chunk;
  chunk.memory = malloc(1);  /* will be grown as needed by the realloc above */ 
  chunk.size = 0;            /* no data at this point */ 
  char *url = 
    // "http://api.legiscan.com/?key=InsertKeyHere&op=getSponsor&id=1622";
       "http://apicommons.org/api-commons-manifest.json";
  printf("%ld bytes retrieved\n", jv_curl(url, 10L, &chunk));
  printf("%s\n", chunk.memory);

  free(chunk.memory);
  return 0;
}
@nicowilliams
Contributor

Hey, this is very cool! Thanks! I'll do my best to integrate this soon.
I can probably do this a bit over break, and if not the week after. It's
also a kick in the pants to finish the module system.

Also, I've started on a streaming parser, which will help with huge JSON
text inputs...

@nicowilliams
Contributor

I'm thinking that for GET everything should be passed as an input,
URL, timeout, everything, otherwise we get nasty cross-product
behaviors... A /1 builtin could take a stream of URLs and GET them
all. We could call this GET/0 and GET/1.

For POST/PUT the interface design gets trickier.

Also, these have to be builtins, but I really want them in their own
module namespace, partly so we can do something about authorizing
programs to use modules, so that we can retain the current behavior of
sandboxing by default. I'm not sure how to do this yet, but I really
like the idea that jq programs are filters with no more harmful
side-effects than local resource consumption (of course, a malicious
jq program could do more, such as observe timing effects to steal
secrets, for example, but let's leave that aside for now), so I'd like
that to continue to be the default.

But again, I'm not sure how best to express sandbox vs. not-sandbox.
What do you think?

@pkoppstein
Contributor Author

@nicowilliams asked:

What do you think?

It seems to me that it might be best to introduce support for these enhancements in (well-thought-out) phases, with Phase 1, for example, being confined to GET.

To me, GET is similar to env -- that is, no special flag is required to use env and I don't see any real need to add a special flag for GET in general, or for GET with non-"file:///" requests. If the clamor for such a flag (or flags) becomes deafening, it can always be added after Phase 1.

As for factorization -- I don't understand the concern about cross-product behaviors. Anyway, my main hope is that traversing a "graph database" will be straightforward. For this and other scenarios, the only thing that typically changes is the URL (by which I mean to include all the forms allowed by curl). That's why I suggested it be the input. (However, as I mentioned, I am not advocating conformance to libcurl for its own sake. For example, I could see the input being an array composed of bits of a curl URL.)

As for the timeout, it seems to me that that is something that might need to be tweaked independently of everything else and possibly even dynamically.

@nicowilliams
Contributor

What do you think?

[...]
To me, GET is similar to env -- that is, no special flag is required
to use env and I don't see any real need to add a special flag for
GET in general, or for GET with non-"file:///" requests. If the clamor
for such a flag (or flags) becomes deafening, it can always be added
after Phase 1.

Some unfortunate servers use GET inappropriately...

Also, for the I/O builtins I have a variety of options: by default you
get to read from stdin, write to stdout/stderr, but the jq program can
be given the right to open files for reading, to open them for writing,
and to popen() (execute stuff). HEAD/GET would be like opening
arbitrary files, so that falls into the open-files-for-reading
permission, and POST and friends fall into the open-files-for-writing
permission.

I think perhaps that is as fine-grained as we can hope to get.

As for factorization -- I don't understand the concern about
cross-product behaviors. [...]

It's just that jq function call argument lists generally result in
cartesian product behavior that often surprises users. A GET/0 that
takes URL, headers, and options in an input object (or perhaps an array
of headers, options, and URL) would be easy to use. A GET/1 that takes
as an argument a stream of URLs would GET each of them with the headers
and options taken from . (the input).

Basically, it's best to think of jq def function arguments as
streams/futures and of . as the "plain" arguments of a jq def
function, with many inputs == as many calls. We should design
interfaces with this pattern in mind, and perhaps some syntactic sugar
might help.

@pkoppstein
Contributor Author

Python "requests" is rightly well-regarded, and looking over the documentation (http://docs.python-requests.org/en/latest), a few thoughts occur to me.

First, in a nutshell, Python requests work like this:

>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
u'{"type":"User"...'
>>> r.json()
{u'private_gists': 419, u'total_private_repos': 77, ...}

Notice that the "what" (the URL) is kept separate from the "how" (here, the authorization information). More significantly, the returned value neatly encapsulates a ton of information.

Perhaps jq should have something like Requests::get as well as a slightly higher-level function (Requests::json) for URLs that are supposed to return JSON. For Requests::json, jq's try/catch mechanism could be used to provide information about errors.

Something like:

# input is the URL
def Requests::json(obj):
  Requests::get(obj) as $r
  | $r.status_code
  | if . == 200 then $r.json
    elif . == 400 then error("HTTP ERROR 400: Bad request")
    elif . == 401 then error("HTTP ERROR 401: Unauthorized")
    else error("HTTP STATUS \(.)")
    end;

@nicowilliams
Contributor

@pkoppstein

Given that in jq the best way to think about function arguments is as follows:

  • the "input" (.) is an argument, possibly many arguments by convention (e.g., if an array or an object)
  • the "arguments" are closures -- always closures

I think I'd prefer something like this:

["<URL>", {<headers>}, {<options>}] | HTTP::GET

or something like that. Among the available options would be whether to treat the response body as raw, raw and line-oriented (raw and slurped), or as JSON (and whether to stream and/or slurp), though perhaps we should separate jq from curl options, so we'd have:

["<URL>", {<headers>}, {<curl options>}, {<jq options>}] | HTTP::GET

If we make anything a closure argument, it should be the URL, in which case we'd have a "get all of these URLs" function:

[{<headers>}, {<curl options>}, {<jq options>}] | HTTP::GET(a_list_of_URLs)

Now, how to output the response??

The caller might not want to "slurp" the entire response body, instead outputting a value per-item in the response (e.g., if the response is a jq-like JSON text sequence like 0\n1\n2\n you might want HTTP::GET to output a stream of these values: 0, 1, 2).

But there's more than the response body: there's also the headers and status code, and even things like the server's certificate and cert chain, trust anchor to which the server cert was validated, ...

One option would be to output headers and other response metadata as the first value, then the response body as zero, one, or more values as appropriate. Another is to slurp the response body and then output [<response metadata>, <response body>]. Another is to output a stream of [<URL>, <response metadata>, <value from response body>], one per-value in the response body. These could all be options to the jq function, or we could have a different function for each of these options.
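To make the three options concrete, suppose the response body is the JSON text sequence 0, 1, 2. The output shapes would differ as follows (all shapes below are hypothetical, not an existing API):

```jq
# Option A: metadata first, then one output per body value:
#   {"status": 200, "headers": {...}}, 0, 1, 2
#
# Option B: slurp the body and output a single pair:
#   [{"status": 200, "headers": {...}}, [0, 1, 2]]
#
# Option C: one [<URL>, <metadata>, <value>] triple per body value:
#   ["http://x/seq", {...}, 0], ["http://x/seq", {...}, 1], ["http://x/seq", {...}, 2]
```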

@nicowilliams
Contributor

To flesh that out a bit more, we might have:

def simple_GET:
    {slurp: true, stream: false, raw: false, metadata_first: false, metadata_always: false} as $jq_opts |
    {timeout: 3} as $curl_opts |
    {Accept: "application/json"} as $request_headers |
    [., $request_headers, $curl_opts, $jq_opts] | HTTP::GET;

# Get /this, /that, and /other relative to ., whatever that is, with one output per-resource
. + ("/this", "/that", "/other") | simple_GET

@nicowilliams
Contributor

Also, we might have a form where a closure is passed that decides whether the metadata is acceptable and returns true or false (or calls error):

def simple_GET:
    def check_it:
        ...;
    {check_metadata: true, slurp: true, stream: false, raw: false, metadata_first: false, metadata_always: false} as $jq_opts |
    {timeout: 3} as $curl_opts |
    {Accept: "application/json"} as $request_headers |
    [., $request_headers, $curl_opts, $jq_opts] | HTTP::GET(check_it);

Now we see that we have an ambiguity w.r.t. the other HTTP::GET/1 mentioned earlier, though the input array's length and the $jq_opts resolve it (though that feels like a hack).

@pkoppstein
Contributor Author

@nicowilliams observed:

Now we see that we have an ambiguity w.r.t. the other HTTP::GET/1 mentioned earlier ...

Exactly! HTTP::GET(a_list_of_URLs) in my opinion was not a good idea to begin with.

Your GET/0 and simple_GET/0 have it exactly right w.r.t. URLs.

@nicowilliams
Contributor

@pkoppstein This line of argument also leads one to apply the same design to the regexp builtins, no? Have we made a mistake with those?

But for 1.6 my hope is to move all builtins into appropriate builtin modules, with all the ones that people expect from 1.5 in a "jq" module that could be imported as import jq {version:1.5};. So there's no harm to our builtin design mistakes. We'll be able to fix them later.

@pkoppstein
Contributor Author

@nicowilliams asked:

This line of argument also leads one to apply the same design to the regexp builtins, no? Have we made a mistake with those?

We have not made a mistake with regexp, which can be thought of as adhering to a "substrate-as-input" design, which calls for URLs-as-input in the case of filters supporting HTTP GET. That is, "STRING | regexp(RE)" is entirely analogous to "URL | get( OBJ )".

By the way, since libcurl is not restricted to HTTP/HTTPS, and since I'm hoping that the new functionality will support the "file URI scheme" (file:///), I'm not sure that emphasizing HTTP and GET in the module or function names is such a good idea.

@nicowilliams
Contributor

In <string> | regexp(<re>), <re> isn't an input but a stream of regexps -- it looks like an argument, but it can produce a cartesian product. OTOH, {subject:<string>, re:<re>} | regexp has the same problem anyway, so it really all comes down to: what should be the input (.), and which other things should be arguments. That requires thinking about what filters one might want to build, but in practice we can flip things around pretty easily anyway, so it probably doesn't matter (i.e., given def foo(a): ...; we can def bar(b): . as $dot | b | foo($dot); to define the "flipped" version of foo). OK, good.
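A concrete instance of the flip, in plain jq (foo and bar are illustrative names, not proposed builtins):

```jq
def foo(a): . + a;                       # the addend arrives as a (stream-valued) argument
def bar(b): . as $dot | b | foo($dot);   # flipped: the addend is now the input

1 | bar(10, 20)    # emits 11, then 21 -- one call of foo per value produced by b
```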

EDIT: As I remember, @pkoppstein first pointed out the ability to "flip" functions quite a while back.

As to all the protocols that curl supports, you're quite right, but there will be cases where a specific HTTP verb is desired, and while the name of the module and builtin might not say "HTTP", the details of HTTP will probably leak (e.g., headers).

How about:

# we shouldn't call it "url", as we'll probably want a module just for URI/URN/URL manipulations
module curl;

# The jack of all trades, takes a description of what to do on input.
# 
# Inputs are of the form:
# {url: <URL>, verb: <verb>, headers: {<headers>}, curlopts: ..., jqopts: ...}
#
# (verb being scheme-specific, and optional; if absent it
# defaults to GET or scheme-specific equivalent)
def perform: ...;

# Only fetches resources:
def get: ...;

# Like get, but inputs are strictly URLs:
def get(headers; curlopts; jqopts): ...;

# Only fetches resource metadata:
# def head: ...;
# def headers(headers; curlopts; jqopts): ...;

and so on.

@pkoppstein
Contributor Author

@nicowilliams wrote:

.. isn't an input but a stream of regexps ...

Yes -- you may have forgotten that we explored this together around the time of #524.

Regarding the jack-of-all-trades function, you most recently suggested that the input (.) have the form:

{url: <URL>, verb: <verb>, headers: {<headers>}, curlopts: ..., jqopts: ...}

Previously you had suggested an array. To me, the most important considerations would be (1) efficiency, and (2) ease of error-checking and minimizing the likelihood of errors in the first place. (Maybe the array-format would make it less likely that a URL would be missing?)

I was also wondering whether it mightn't be better to avoid nesting, except for "headers"; for example:

  • If the top-level is an object: {url: _, verb: _, headers: {_}, ....}
    where for each supported CURL option (http://curl.haxx.se/libcurl/c/options-in-examples.html), CURL_<OPTION> becomes { "OPTION": _} and we choose any other tag names carefully.
  • If the top-level is an array: [ <URL>, headers: {}, options: {} ] | MODULE::VERB

@nicowilliams
Contributor

Yes, I remember (see my edit above, alluding to that).

I agree that an array will perform better than an object, but if we're going over the wire it might not matter. OTOH, an array with just 4 things will have a memorable form, so, sure, but I do want to separate curl options from jq options.

jq options here would be the equivalent of the command-line --slurp, --raw-input, --stream, --seq, and so on, both for input processing (GET and such) and for output (POST/PUT/PATCH and such). Curl and jq options shouldn't get mixed up, as preventing future collisions would be difficult. The two need distinct namespaces.

(E.g., if you're GETting a text file, you probably want raw input, and possibly slurp (if the text is line-oriented). If you're GETting an application/json resource, then you probably don't want raw, and if it's huge then you'll want streaming. And so on.)
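Under the four-element form sketched above, those two cases might be written like this (HTTP::GET and all option names are still hypothetical, as are the URLs):

```jq
# A line-oriented text file: raw input, slurped
["http://example.com/notes.txt", {}, {"timeout": 3},
 {"raw": true, "slurp": true}] | HTTP::GET

# A huge application/json resource: parsed as JSON, streamed
["http://example.com/big.json", {"Accept": "application/json"}, {"timeout": 30},
 {"raw": false, "stream": true}] | HTTP::GET
```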

@ModalUrsine

I just discovered jq (d'oh!). How long has this been going on, i.e. when was jq first made available to the world at large?
thanx

@nicowilliams
Contributor

nicowilliams commented Apr 1, 2015 via email

@ghost

ghost commented Apr 14, 2015

Hey!

I'm very interested in this discussion. This jq+shell sample application we made for Typeform I/O would be way cleaner if it lived entirely inside jq: https://github.com/TypeformIO/JQ-FormCreation

@nicowilliams
Contributor

I think I'd rather see file/popen I/O builtins than have jq link with libcurl and OpenSSL and such. A module system extension for C-coded modules would also work.

@dtolnay dtolnay removed this from the 1.6 release milestone Aug 16, 2015
@dtolnay
Member

dtolnay commented Oct 24, 2015

I agree with Nico that a popen builtin or a C-coded module is the best way to implement this.

As for design, I think we should focus on a general, low-level API analogous to Go's (*http.Client) Do. My weak preference is curl/0 with an input map similar to http.Request and an output map similar to http.Response. Then higher-level helpers can be built on top as we figure out which ones would be most useful.

{ url: "http://api.legiscan.com/?key=InsertKeyHere&op=getSponsor&id=1622",
  method: http::GET,
  header: {
    Accept: ["application/json"]
  },
  timeout: 5 * time::second }
| curl
| [.statuscode, .header["Content-Type"], (.body | fromjson)]

@nicowilliams
Contributor

@dtolnay Welllll, if you wanted to call jq from Java, say, then you'd be unhappy with popen on systems where the kernel/libc don't support vfork() and use it in posix_spawn(). Also, a curl jq interface must not expose the mess of CLI options that is curl. That said, we'll get a lot of mileage out of a shell-out, so we should do it.

As for a libcurl interface, if we ever do it at all then I'd like to do it via dynamically loaded jq functions. It's reasonable to have a build dependency on Oniguruma (or a descendant) for regexp, as that brings in no further dependencies, but once we're talking about curl we also get OpenSSL and/or friends, and things begin to get ugly. Also, we'd have to finish the C-coded generators business if we're going to talk to libcurl in any way other than through curl(1).

@nicowilliams
Contributor

glibc might support vfork() on some kernels nowadays since about a year ago, IIUC, but I'm probably not looking in the right place, and I'm just wasting my time. We should hope popen() uses posix_spawn(), and that the latter uses a non-COW vfork(), and if anyone is unhappy with the lack of a true vfork() then we can tell them to complain to their OS vendor/distro.


4 participants