
Feature/az/api entrypoints coverage #1084

Closed
wants to merge 1 commit

Conversation


@hmmr hmmr commented Feb 19, 2015

This PR depends on basho/riak_pb#106, in turn depending on the bunch of PRs by @zeeshanlakhani.

]).

-record(ctx, {
api_version, %% integer() - Determine which version of the API to use.
Contributor

Is there a reason you are not specifying the types in the signature instead of just in a comment? For example: api_version :: integer().

Contributor

That's actually from my PR... but I was keeping consistency with https://github.com/basho/riak_kv/blob/develop/src/riak_kv_wm_object.erl#L157 and other files. The better question is why we've been doing that, I reckon.

Contributor

@cmeiklejohn @hmmr I've made an update to https://github.com/basho/riak_kv/blob/develop/src/riak_kv_wm_object.erl#L157 that updates the types for wm_preflist. It still relies on the riak_core PR -> https://github.com/basho/riak_core/pull/705/files (for funs and types).

Contributor Author

Actually, I have no function that takes a preflist as an argument or returns one, so I should be fine. It's good to know where those types are, however.

An unrelated question: in newly written code, there is only one api_version I have to deal with (the current one). Do I still need to future-proof it by introducing api_version into my ctx record?

@hmmr hmmr force-pushed the feature/az/api-entrypoints-coverage branch from 2ea87e9 to d5feabb Compare February 28, 2015 01:38
@hmmr
Contributor Author

hmmr commented Feb 28, 2015

It still merges cleanly, so no changes; just a rebase onto HEAD of develop. The build failed because the required bits (basho/riak_pb#106) are still awaiting approval.

@seancribbs
Contributor

I disagree generally with the approach. We already have preflist calculation as an API request, why not simply provide the nodename + host/port/protocol for every interface?

The client can cache the network information and simply request the preflist for the desired bucket-type/bucket/key, then match up the nodenames to existing connections or new connections to be opened. Obviously, the client should also periodically re-request the live nodes' interfaces, or be triggered by some signal from the server.

Also, I'm doubtful about how you're deriving the host interface information, because it is entirely valid for a listener to start up with 0.0.0.0 or ::0 as the address, which binds all interfaces but is not routable. Nor do you have adequate information about how the client can (or cannot) be routed to the bound interfaces.

@hmmr
Contributor Author

hmmr commented Mar 3, 2015

@seancribbs The issue you point out, of reporting the listening IP where the external IP is meant, is now quite embarrassingly obvious to me. It's a blunder I've just committed (no, worse: I have sat on it feeling good for a week!). What's needed here is some code path eventually calling getifaddrs() and picking the address of the interface we assume to be the externally accessible one.

I will review the whole thing tomorrow and report back.
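The address-picking heuristic could be sketched like this (Python for illustration; the function name and the candidate-list input are hypothetical, standing in for what getifaddrs() would report on each node):

```python
import ipaddress

def pick_external(candidates):
    """Pick the first address that is neither loopback nor unspecified.

    `candidates` stands in for the addresses getifaddrs() would report;
    on a multihomed host (eth0 + infiniband) a smarter heuristic is needed.
    """
    for addr in candidates:
        ip = ipaddress.ip_address(addr)
        if not (ip.is_loopback or ip.is_unspecified):
            return addr
    return None
```

On a host with interfaces 127.0.0.1 and 192.168.33.44, this would pick the latter; with only loopback available there is no externally usable answer.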

There are accompanying changes in newly created branches:

 * riak_pb/feature/az/api-entrypoints-coverage, and
 * riak_api/feature/az/api-entrypoints-coverage

that will have to be merged first.

The following HTTP requests are accepted (the values for parameter P are
"http" and "pbc", with the latter assumed by default):

1) /ring/coverage/bucket/B/key/K?proto=P

  returning a JSON of the form:

   [{Host:[Port]}]

indicating Host:Port is where an entry point is for optimal data access
to the key K in bucket B, via the given protocol API.

2) /ring/coverage?proto=P

  returning a JSON of the form:

  [{Host:[Port]}]

providing a complete API entry point map for access to the entire
cluster via that protocol.

HTTP responses are cached until the next multiple of 15 seconds (the default),
configurable via the 'ring_coverage_cache_expiry_time' parameter.
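As a sketch of that caching rule, assuming expiry at the next wall-clock multiple of the configured interval (the function name is made up for illustration):

```python
import time

RING_COVERAGE_CACHE_EXPIRY_TIME = 15  # seconds, mirroring the default above

def cache_valid_until(now=None, expiry=RING_COVERAGE_CACHE_EXPIRY_TIME):
    """Return the epoch second at which a response cached at `now`
    expires: the next multiple of `expiry` seconds."""
    now = time.time() if now is None else now
    return (int(now) // expiry + 1) * expiry
```

So a response cached at t=100s is served from cache until t=105s, regardless of when within the window it was generated.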

For protobuf, four new messages are provided:

 RpbApiEpReq     (code 90),
 RpbApiEpResp    (code 91),
 RpbApiEpMapReq  (code 92),
 RpbApiEpMapResp (code 93),

with the relevant snippet in riak_pb/src/riak_kv.proto:

 enum RpbApiProto {
     pbc = 0;
     http = 1;
 }
 message RpbApiEp {
     required bytes host = 1;
     repeated int32 port = 2;
 }

 message RpbApiEpReq {
     required bytes bucket = 1;
     required bytes key = 2;
     optional RpbApiProto proto = 3 [default = pbc];
 }
 message RpbApiEpResp {
     repeated RpbApiEp eplist = 1;
 }

 message RpbApiEpMapReq {
     optional RpbApiProto proto = 1 [default = pbc];
 }

 message RpbApiEpMapResp {
     repeated RpbApiEp eplist = 1;
 }
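On the wire, each protobuf message is framed as a 4-byte big-endian length (message-code byte plus payload) followed by the one-byte message code and the encoded payload; a minimal sketch of framing an all-defaults RpbApiEpMapReq (Python, for illustration):

```python
import struct

RPB_API_EP_MAP_REQ = 92  # message code from the snippet above

def frame(msg_code, payload=b""):
    """Frame a Riak protobuf message: 4-byte big-endian length of
    (code byte + payload), then the message code, then the payload."""
    return struct.pack(">IB", len(payload) + 1, msg_code) + payload

# An RpbApiEpMapReq with its only (optional) field left at the default
# encodes to an empty payload, so the frame is just length + code.
req = frame(RPB_API_EP_MAP_REQ)
```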

The coverage information is obtained in two steps:

1. using riak_core_apl:get_apl_ann(), get active preflist for given
   Bucket and Key, and extract riak nodes from it;

2. make an rpc call on those nodes to:

   (a) determine which of the relevant listener apps (i.e., riak_api_pb_listener,
       riak_api_web) are running on those nodes,
       and find the ports these are listening on.

   (b) get the bound IP address of the first non-loopback interface on the
       underlying OS.
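A toy model of those two steps (the data shapes are illustrative; in the real code step 1 uses riak_core_apl:get_apl_ann() and step 2 is an rpc call against the live nodes):

```python
def nodes_from_preflist(preflist):
    """Step 1: extract the distinct node names from an annotated
    preflist of ((Partition, Node), primary|fallback) entries."""
    seen = []
    for (_partition, node), _ann in preflist:
        if node not in seen:
            seen.append(node)
    return seen

def entry_points(preflist, listener_info):
    """Step 2 (simulated): look up each node's listener address and
    ports; `listener_info` stands in for what the rpc would return."""
    return {node: listener_info[node]
            for node in nodes_from_preflist(preflist)
            if node in listener_info}
```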

The corresponding new Riak Python client methods are

* client.get_api_entry_point(bucket, key, proto),
* client.get_api_entry_points_map(proto)

included in
github.com/basho/riak-python-client/tree/feature/az/api-entrypoints-coverage
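Client-side, the [{Host:[Port]}] JSON shape can be flattened into (host, port) pairs; a hypothetical helper, not part of the actual client:

```python
import json

def flatten_entry_points(body):
    """Flatten the [{Host: [Port]}] response of the /ring/coverage
    requests into a list of (host, port) pairs."""
    pairs = []
    for entry in json.loads(body):
        for host, ports in entry.items():
            for port in ports:
                pairs.append((host, port))
    return pairs
```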
@hmmr
Contributor Author

hmmr commented Mar 3, 2015

Reworked: the functions now return the proper external IP instead of 127.0.0.1.

@seancribbs Concerning your suggestion to return the nodename and host/port: much as I'd like to just go with that, the Java guys seem to only want the external IP and the port(s), not the nodenames proper, or the part after '@' in them. I've been scratching my head hard, but other than calling getifaddrs on location, I just don't see how one can figure out the IP to which external clients can connect. I'll be happy to stand corrected, but at any rate, the scope of DP-6 is precisely about that.

I grepped the entire riak/deps/ and found no matches on getifaddr (except an odd one in erlang_js).

I am provisionally force-pushing the amended commit, for @srgg to continue his work on the Riak Java client for the Thursday demo. There are a couple of issues left to resolve, notably what heuristics to apply in the case of multihomed hosts (such as those with InfiniBand in addition to plain eth0).

@hmmr hmmr force-pushed the feature/az/api-entrypoints-coverage branch from 43176af to 99613c9 Compare March 3, 2015 18:52
@seancribbs
Contributor

I suppose this is fine for a demo, but before this merges, I'd like to see the following changes:

  1. The host interface information is at a fundamentally different granularity than the preflist information. Since we already have requests for fetching the preflist, there's no reason to duplicate that effort here; all we need is the host/port/protocol information.
  2. The code in this pull request RPCs to every node in the preflist on every request. The host interface information doesn't change often enough to warrant this, and under load RPC will become a bottleneck. Instead, each host should publish its listener information in cluster metadata, which will be gossiped around automatically. Fetching the data is an inexpensive, local operation.

@hmmr
Contributor Author

hmmr commented Mar 5, 2015

To go the whole hog, I now see I'll have to:

  1. Provide a function to enumerate the interfaces and export the bound addresses to the preflist (still have to figure out how to do the latter).
  2. Call that function once at listener app start, then once every minute (or not at all, with a big fat warning to whoever will be moving hosts around).
  3. If a listener is configured to listen on an address other than 0.0.0.0 that ends up being different from what has been discovered, reconfigure that listener.
  4. I will scrap my bucket+key coverage call. The other call, which returns the full coverage map, would remain.

Is that a plan?

@seancribbs
Contributor

  1. I don't know what the preflist has to do with the bound addresses, but yes, you will need to export them to cluster metadata.
  2. I don't think you need to check every minute. Why would a listener address be changing that often? Or are you talking about from the client? If the latter, then perhaps that could be configurable. Polling might be ok, or triggered on a failed connection to a previously discovered host.
  3. Not sure what you mean here. Can you elaborate?
  4. When you say "coverage map", what do you mean exactly? "Coverage" usually means "a list of partition/node pairs covering the entire keyspace", and doesn't require a bucket/key to seed it. The "preflist (preference list)" usually refers to the replicas calculated for an individual key, based on consistent hashing.

@hmmr
Contributor Author

hmmr commented Mar 6, 2015

@seancribbs Thanks for commenting and dropping precious hints.

I. My line of reasoning was about this call: https://github.com/basho/riak_kv/blob/develop/src/riak_kv_web.erl#L115. It returns (on my devrel setup):

$ curl localhost:10018//buckets/B/keys/K/preflist
{"preflist":[{"partition":29,"node":"dev5@127.0.0.1","primary":true},...]}

where I expect these new properties to appear: "http_ip":"192.168.33.44", "http_port":10018 (plus same for pbc). I see I used the word 'preflist' meaning 'cluster metadata' in my previous comment, inaccurate me. But, are these new entry point details now in the proper place as returned by this call?

II. Indeed, let it be triggered by a dropped connection (and by failure to establish one initially, too). Then, when a client times out or sees a connection gone, it will re-request the entire entry points map, possibly with &force_refresh=1 (or some such) to cause it to be regenerated. No periodic autorefresh would then be necessary.

III. Suppose riak.conf initially has, e.g., listener.protobuf.internal = 10.10.10.10:10017, and then the host migrates from 10.10.10.10 to 10.10.10.11. If the configuration had 0.0.0.0:10017, the listener would just keep going. Otherwise, the listener would be listening on an address which is no longer routed (it would even fail to restart, correct?).

My proposition is to check whether the configured listening address, unless it is 0.0.0.0, does match the bound address on some ethX in the system, and take steps to correct the situation.
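That check could amount to a trial bind (a sketch only; the function name is made up, and the real implementation would rather compare the configured address against what getifaddrs() reports):

```python
import socket

def address_is_bindable(host, port=0):
    """Report whether `host` is an address this machine can actually
    bind, i.e. whether a configured listener address is still local.
    Port 0 asks the OS for any free ephemeral port."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind((host, port))
        return True
    except OSError:
        return False
```

After a host migration, the stale configured address would fail this check while 0.0.0.0 and the newly bound address would pass.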

IV. It's simply the complete list of all API entry points into the cluster. The idea is to extend this list with details on what key ranges are quickest accessed via each entry point -- and thus attain the meaning of "coverage". Properly naming things goes a long way, indeed.

@hmmr
Contributor Author

hmmr commented Mar 11, 2015

Another reworking effort to address @seancribbs's latest comments. Due to massive changes over what went into this branch, I pushed the new work to https://github.com/basho/riak_kv/tree/feature/az/api-entrypoints-coverage-rebased.

@hmmr
Contributor Author

hmmr commented Mar 12, 2015

Work and further discussions moved to #1087; closing this one then.

@hmmr hmmr closed this Mar 12, 2015
@hmmr hmmr deleted the feature/az/api-entrypoints-coverage branch December 22, 2015 02:29
4 participants