add support for data transfer encryption via rc4 and aes #236

Closed

Conversation

@zeroshade (Contributor)

addresses #145

I've only implemented rc4 encryption here, as I haven't figured out 3des/des yet, but this at least solves the use case in my own environment, which is nice.

For the implementation I used the following references:

I was able to test this with my own setup using encrypted data transfer and it works! Huzzah!

@zeroshade (Contributor Author)

Hmm, @colinmarc, I haven't been able to reproduce the failure seen on the Travis run and have no idea what is causing it. Any ideas as to what I can do to track it down? :)

@colinmarc (Owner)

Wooh nice! This looks pretty good - I think shifting some abstractions may be in order, though. And we'll need to figure out some testing. Can you add a case to travis.yml for data transfer encryption?

The Append error is unrelated.

@zeroshade (Contributor Author)

I'll see about adding a case to the .travis.yml for testing the data transfer encryption; let me know what shifting around of the abstractions you're thinking of. There were a couple of aspects here that I wasn't completely sure about design-wise, but I wanted to keep things as unobtrusive as possible.

My testing was also only against Hadoop 2.x, and all the test cases on Travis failing with the Append thing and timing out are for CDH6, which is Hadoop 3.x. Is there any way we can resolve that Append and timeout issue so I can make sure the problem isn't some incompatibility between my change and Hadoop 3.x?

@colinmarc (Owner) commented Jul 1, 2020

Go ahead and add t.Skip() to the append tests that are failing. I have been trying to figure those out for a while and will do so in a different branch.

@zeroshade (Contributor Author)

W00t! Figured out the cause of the failure and put a fix in. :) I'll play around with getting a good test case added for the encrypted data case tomorrow.

@zeroshade (Contributor Author)

@colinmarc Unit test added for the digest-md5 code, and Travis updated with test cases for data transfer encryption, which are now passing. I even found a couple of bugs in my implementation that I've since fixed.

I see a couple of avenues for cleaning up some of the abstractions and interfaces, so I'll take a look at that. In the meantime, if you have any suggestions, let me know.

@zeroshade (Contributor Author)

Hey @colinmarc, anything else you need from me here to get this merged?

@colinmarc (Owner)

Hey @zeroshade,

Thanks for the ping. I’ll give this a review in the next few days (if I don’t, please harass me). Excited to have this feature!

Colin

@zeroshade (Contributor Author)

@colinmarc harass harass harass harass harass :)

consider this me harassing you :)

@colinmarc (Owner)

Ok, looking through - I think the raw material is really good here but I think it would be better if the functionality was broken up a bit into composable pieces.

Here's how I would organize this:

  • Rather than having a DataTransporter struct, just give blockReader/Writer/etc a net.Conn (which does the wrapping we need), and let them be blissfully unaware.
  • Avoid wrapping the connection in the default unencrypted case. This is important to avoid the allocation and copy in Decode/Encode, which will probably have a negative impact on performance.
  • While you're at it, reusing a buffer in Decode/Encode is probably called for.
  • Break DigestMD5 into digestMD5PrivacyConn and digestMD5IntegrityConn, or, if those are impossible to untangle, just make the digestMd5Conn struct implement net.Conn directly. (It's actually not clear to me from the RFC whether those two things can be combined, and what that looks like, but we're not handling that case here anyway.) A rough sketch of the overall wrapping idea follows this list.
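
For illustration, here is a minimal sketch of that wrapping idea, assuming hypothetical names (wrapConn, digestMD5Conn) rather than the PR's actual API: the block reader/writer only ever see a net.Conn, and the raw connection is returned untouched in the plain case so the default path avoids any extra Encode/Decode copy.

package transfer

import "net"

// digestMD5Conn would wrap every Read/Write in a SASL integrity or privacy
// frame; the embedded net.Conn satisfies the rest of the interface.
type digestMD5Conn struct {
    net.Conn
    // kic, kis, sequence numbers, and reusable buffers would live here.
}

// wrapConn only wraps when a QOP beyond plain "auth" was negotiated, so the
// unencrypted default keeps the raw connection with no allocation or copy.
func wrapConn(conn net.Conn, qop string) net.Conn {
    switch qop {
    case "auth-conf", "auth-int":
        return &digestMD5Conn{Conn: conn}
    default:
        return conn
    }
}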

Some other general comments:

  • I saw one or two cases where we silently fall back to sending unencrypted data if negotiation fails. That's a really big security hole, and we definitely shouldn't do that.
  • Please avoid exporting names (e.g. NewDigestMD5, DataTransporter) that don't need to be exported.
  • Please write comments in full sentences and with the godoc conventions in mind. Sorry for being a stickler about that, and don't stress about it overmuch - I will make a comment pass before I merge in any case.

Thanks so much for your work on this! Based on the number of people that have been asking for it, it's sorely needed.

@colinmarc (Owner)

One more performance suggestion: note that XORKeyStream allows dst and src to be the same slice (they must overlap entirely or not at all), so you can encrypt in place instead of allocating a second buffer.
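
As a quick sketch of what that buys (crypto/rc4 is from the standard library; the helper name is made up): passing the same slice as dst and src encrypts the payload in place with no temporary buffer.

package transfer

import "crypto/rc4"

// encryptInPlace overwrites payload with its rc4 ciphertext, reusing the
// caller's buffer instead of allocating a new one.
func encryptInPlace(c *rc4.Cipher, payload []byte) {
    c.XORKeyStream(payload, payload)
}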

@zeroshade (Contributor Author)

The reason I created the DataTransporter struct was that I wanted to consolidate the connection logic rather than duplicating the key, secure, and token members in each of blockReader/blockWriter/etc. Those structs already had a net.Conn object; the DataTransporter struct just gives them a connect function which handles or wraps the net.Conn object so they can all stay completely unaware, because I realized that the wrapping code ended up being identical in all those cases. I'm not really sure how to do what you're asking without ending up with a bunch of duplication among those structs.

Avoid wrapping the connection in the default unencrypted case. This is important to avoid the allocation and copy in Decode/Encode, which will probably have a negative impact on performance.

If you look at transport.go:127, I don't wrap it in the default case: if you're not using privacy or a token, the connection isn't wrapped. In the meantime, I'll see about reusing the buffers where I can.

I saw one or two cases where we silently fall back to sending unencrypted data if negotiation fails. That's a really big security hole, and we definitely shouldn't do that.

That's the situation where the client config specifies encryption but the server config doesn't match and doesn't respond with encryption (a common implementation, which I pulled from libhdfs3). From a consumer standpoint, I'd prefer for it to work rather than die when the client config supports encryption but the server has it disabled.

@colinmarc (Owner) commented Jul 11, 2020

I'm not really sure how to do what you're asking without ending up with a bunch of duplication amongst those structs.

I see what you mean. Let's encapsulate that in a datanodeSaslDialer closure:

BlockReader{
    DialFunc: (&datanodeSaslDialer{
        token:  block.GetBlockToken(),
        dialer: f.client.options.DatanodeDialFunc,
        ...
    }).DialContext,
    ...
}

I don't think it's abusing that interface to do negotiation as part of the dial; that seems to be a pattern elsewhere.

As a side note, I've been writing rust for a bit and it just about broke my brain to remember that you can put the receiver in a closure like that: https://play.golang.org/p/OXQBmmsdLEG

@colinmarc (Owner)

That's the situation where the client config is specifying encryption and the server config doesn't match and doesn't respond with the encryption, (which is a common implementation which I pulled from libhdfs3)...

I really think this is a bad idea, regardless of what libhdfs does. If we're talking to a malicious interloper, they can just respond in plaintext and we'll send them the nuclear access codes anyway. On the other hand, the hadoop docs for data.transfer.encrypted say:

Whether or not actual block data that is read/written from/to HDFS should be encrypted on the wire. This only needs to be set on the NN and DNs, clients will deduce this automatically.

Am I missing something, or is that really poorly designed? I'd say for now we require clients to be explicit, and fail if things don't match up. If someone shows up to complain that we're missing a feature for "deducing" server encryption settings, then maybe they can explain to me why it's fine.

@zeroshade (Contributor Author)

@colinmarc From a consumer standpoint it can make sense: you don't need to worry about configuring the client; instead the client essentially asks the server whether or not it wants to use encryption and then does what the server requests. At the same time, I can see where you're coming from: it makes sense for a client to be able to say "I want to use an encrypted connection, and I do not want to allow my data across an unencrypted one." I'm fine with requiring clients to be explicit for now and failing if the settings don't match up; I'll make that change.

@zeroshade (Contributor Author)

So quick question, when you say:

While you're at it, reusing a buffer in Decode/Encode is probably called for.

Do you mean reusing the input buffer for the output? I'm not sure how I feel about overwriting the input buffer with the modified data; I think I prefer doing the copy. Or was there a different kind of reuse you were referring to?

@colinmarc (Owner) commented Jul 11, 2020

I meant that we shouldn't be doing any allocation inside net.Conn.Write or net.Conn.Read, since that's a super duper hot path. The best way to reuse an allocation is with bytes.Buffer. So instead of:

b := make([]byte, n)
conn.Read(b)

It's much better to have a bytes.Buffer saved on the struct, and do something like (on read, for example):

d.buf.Reset()
io.Copy(d.buf, conn) // This will use ReadFrom under the hood
b := d.buf.Bytes()
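
Putting those two fragments together, a compact, self-contained sketch of the struct-level reuse (the decoder type, readFrame, and the use of io.CopyN for a fixed-length frame are assumptions for illustration, not the PR's code):

package transfer

import (
    "bytes"
    "io"
    "net"
)

type decoder struct {
    conn net.Conn
    buf  bytes.Buffer // reused across reads; grows once, then stays allocated
}

// readFrame reads exactly n bytes into the reusable buffer and returns them,
// avoiding a fresh make([]byte, n) on every call.
func (d *decoder) readFrame(n int) ([]byte, error) {
    d.buf.Reset()
    if _, err := io.CopyN(&d.buf, d.conn, int64(n)); err != nil {
        return nil, err
    }
    return d.buf.Bytes(), nil
}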

@zeroshade (Contributor Author)

Ah, gotcha. That makes sense.

@zeroshade (Contributor Author) commented Jul 11, 2020

Hmm, @colinmarc, any reason you can think of not to just do the check and wrap of the dialer inside NewClient? Looking at things, I think that if I wrap the provided DatanodeDialFunc in the ClientOptions during NewClient (or wrap the default dialer), then basically everything gets it for free without having to care. :)
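
A rough sketch of that "wrap it once in NewClient" idea, with assumed names (dialFunc, wrapDatanodeDial, and negotiate are illustrative, not the PR's API): the stored DatanodeDialFunc is replaced by one that runs the SASL negotiation right after dialing, so block readers and writers get an already-wrapped net.Conn for free.

package transfer

import (
    "context"
    "net"
)

type dialFunc func(ctx context.Context, network, addr string) (net.Conn, error)

// wrapDatanodeDial returns a dialer that performs the datanode SASL handshake
// after connecting; negotiate returns either a wrapped or the raw conn.
func wrapDatanodeDial(dial dialFunc, negotiate func(net.Conn) (net.Conn, error)) dialFunc {
    return func(ctx context.Context, network, addr string) (net.Conn, error) {
        conn, err := dial(ctx, network, addr)
        if err != nil {
            return nil, err
        }
        return negotiate(conn)
    }
}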

@colinmarc (Owner)

any reason you can think of to not just do the check and wrap of the dialer inside of NewClient?

Sounds elegant to me!

@colinmarc (Owner)

If you rebase, you can remove the t.Skip for those append tests.

@zeroshade (Contributor Author) commented Jul 13, 2020

@colinmarc I still need to go through and put more comments in and such, but I'd like your opinion on the refactor/redesign based on our conversations. Let me know what you think. Thanks!

I also added a new unit test for the encrypted case to ensure it doesn't get broken without having to run the Travis tests to find out. :)

@zeroshade (Contributor Author)

Hmmm, I don't know why that writeLeaseRenewal function is failing the test now; I'll have to look at this in the morning.

@zeroshade (Contributor Author)

Looked at master; that test is currently skipped, so the failure is not my fault. :) I just got overzealous when you said I could rebase and remove the skips, haha. I've put the skip back, and hopefully everything should succeed now.

@zeroshade (Contributor Author)

@colinmarc Finally! :) Passing all tests, and redesigned based on your comments. Consider this me harassing you to take a look and let me know what you think now :)

@colinmarc (Owner) left a comment

Ok, did a first pass through. This is looking really good.

EncryptedDataNode bool
// SecureDataNode specifies whether we're using block access tokens to
// communicate with datanodes, specified by dfs.block.access.token.enable in the config
SecureDataNode bool
@colinmarc (Owner)

I thought this was something else, actually? We've always sent the block token, and I think this is just a server-side option for verifying them (with or without Kerberos). In any case it should be EnableBlockAccessToken.

@zeroshade (Contributor Author)

Turns out you're right; this should be changed to reference dfs.data.transfer.protection, not dfs.block.access.token.enable.

@zeroshade (Contributor Author)

Let me know if the new comments and the updated name are good.

options.EncryptedDataNode = true
options.SecureDataNode = true
}

@colinmarc (Owner)

The comment for this method needs to be updated (to say that we munge these fields).

@zeroshade (Contributor Author)

Comment updated. Let me know if this is good.

origHashStart := msgTypeStart - macHMACLen

if !bytes.Equal(hmac, input[origHashStart:origHashStart+macHMACLen]) ||
!bytes.Equal(macMsgType[:], input[msgTypeStart:msgTypeStart+macMsgTypeLen]) ||
@colinmarc (Owner)

Is the [:] necessary?

@zeroshade (Contributor Author)

bytes.Equal needs a slice, but macMsgType is statically initialized as a [2]byte array, so you need the [:] to pass it to bytes.Equal.
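
A tiny illustration of that point (the value of macMsgType is made up here): a fixed-size [2]byte array has to be sliced with [:] before it can be passed to bytes.Equal, which takes []byte arguments.

package transfer

import "bytes"

var macMsgType = [2]byte{0x00, 0x01} // illustrative value only

func msgTypeMatches(input []byte, msgTypeStart int) bool {
    return bytes.Equal(macMsgType[:], input[msgTypeStart:msgTypeStart+len(macMsgType)])
}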

@@ -223,6 +223,24 @@ func (c *NamenodeConnection) Execute(method string, req proto.Message, resp prot
return nil
}

// GetEncryptionKeys will use the `getDataEncryptionKey` operation on the
// namenode in order to fetch the current data encryption keys
func (c *NamenodeConnection) GetEncryptionKeys() *EncryptionKey {
@colinmarc (Owner)

I'm not sure about exporting this. Does it have utility to users?

@zeroshade (Contributor Author)

Right now I kept it exported because it's the Client that uses it, passing it to the DatanodeSaslDialer as the KeyFunc so that the dialer doesn't need a reference to the namenode connection directly, just a function that gives it the key. Since it's being used in the client.go file, it needs to be exported.
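
A small sketch of that indirection, with assumed field names (the EncryptionKey fields and keyFunc are illustrative): the dialer only holds a function that returns the current encryption key, so the exported GetEncryptionKeys method can be handed over as that function without the dialer ever touching the namenode connection.

package transfer

type EncryptionKey struct {
    KeyID       uint32
    BlockPoolID string
    Nonce       []byte
}

type datanodeSaslDialer struct {
    // keyFunc is, e.g., namenodeConn.GetEncryptionKeys; the dialer never sees
    // the namenode connection itself.
    keyFunc func() *EncryptionKey
    // token, wrapped dialer, etc. would also live here.
}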

@zeroshade (Contributor Author)

Hm, maybe just pass the two options into NamenodeConnection and then put the wrapDatanodeDialer func there?

But then I'd have to export wrapDatanodeDialer, since NamenodeConnection is in internal/rpc and this needs to be called from file_reader/file_writer. Either way something would need to get exposed, since the block reader/writer don't have a reference to the namenode, which they shouldn't need anyway.

@zeroshade (Contributor Author)

I figured it made more sense to export this utility function, which is just a call on the namenode, than it would have been to export a wrapping function for the datanode dialer.

@zeroshade (Contributor Author)

@colinmarc Holy crap, all the tests still pass and I've addressed the feedback. W00t w00t :) Just waiting for a few more comments from you on some of the things, to let me know if they are good.

@zeroshade (Contributor Author)

@colinmarc Well, I'm stumped. Something about how I've changed things now is changing the way the allocations happen for the size of that []byte slice, and now it's not exhibiting that issue anymore. -_- Oy vey. I'm so confused, but at least it all works now, haha.

@colinmarc (Owner)

Can you rebase in 15f6da0 (and squash) to make sure those tests pass too?


// if the server didn't send us a Nonce, then the data isn't encrypted
// but we will still attempt to authenticate
if useSecure && len(key.Nonce) == 0 {
@colinmarc (Owner)

In which case does the server not send a nonce?

@zeroshade (Contributor Author)

If the client has its data transfer protection set to privacy, but the server is configured for auth-integrity or authentication rather than auth-privacy, then there is no key to send and the nonce will be empty. In that case we fall back to whatever the server's QOP is.

@colinmarc (Owner) commented Jul 15, 2020

Just found this in the RFC:

Digest authentication is vulnerable to "man in the middle" (MITM)
attacks. Clearly, a MITM would present all the problems of
eavesdropping. But it also offers some additional opportunities to
the attacker.

A possible man-in-the-middle attack would be to substitute a weaker
qop scheme for the one(s) sent by the server; the server will not be
able to detect this attack. For this reason, the client should always
use the strongest scheme that it understands from the choices
offered, and should never choose a scheme that does not meet its
minimum requirements.

Sounds like we can just use the server's QOP, and hope we're not mitm'd. I'm munging some comments now, so I'll make the change.

d = newDigestMD5IntegrityConn(conn, kic[:], kis[:])
}

if len(msg.GetCipherOption()) > 0 {
@colinmarc (Owner)

Shouldn't this always be true if the qop is privacy? Put another way, is digestMD5PrivacyConn ever actually used (beyond the call to decode below)?

@zeroshade (Contributor Author)

If for some reason the server doesn't support AES, then this would be empty and it would only use the rc4 cipher that was negotiated during the digest authentication.

@zeroshade (Contributor Author)

@colinmarc Rebased and Squashed

@zeroshade (Contributor Author)

@colinmarc

According to Travis this completed successfully; I don't know why it is still listed as in progress here. :(

colinmarc pushed a commit that referenced this pull request Jul 17, 2020
This implements the hadoop version of DIGEST-MD5 SASL data protection, loosely
based on RFC 2831. Authentication/encryption using rc4 is supported, but not
3des.

Fixes #145, closes #236.
@colinmarc (Owner) commented Jul 17, 2020

When I extend the matrix to test different rpc.protection and data.transfer.protection values, I get test failures: https://travis-ci.org/github/colinmarc/hdfs/jobs/709051530

Edit: actually, my fault. I thought the configuration could be entirely determined at connection time, but looks like that's not the case.

colinmarc closed this in 1596ee1 on Jul 17, 2020
@colinmarc (Owner)

Merged :) Thanks so much for your work on this!!!

@colinmarc (Owner)

Just a note: I've been trying to track down a flaky test failure in privacy mode, in case you have any insight: https://travis-ci.org/github/colinmarc/hdfs/jobs/709475318

This is the error from the datanode logs:

java.lang.NumberFormatException: For input string: "3817551356"
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.lang.Integer.parseInt(Integer.java:583)
	at java.lang.Integer.parseInt(Integer.java:615)
	at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getEncryptionKeyFromUserName(SaslDataTransferServer.java:266)
	at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.access$000(SaslDataTransferServer.java:73)
	at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$1.apply(SaslDataTransferServer.java:177)
	at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$SaslServerCallbackHandler.handle(SaslDataTransferServer.java:241)
	at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
	at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
	at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslParticipant.evaluateChallengeOrResponse(SaslParticipant.java:115)
	at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:376)
	at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getEncryptedStreams(SaslDataTransferServer.java:180)
	at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:112)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
	at java.lang.Thread.run(Thread.java:748)

I haven't started looking through the traceback, but I will tomorrow.

@zeroshade (Contributor Author)

@colinmarc Yeah, unfortunately you can't use the hdfs.DataEncryptionKeyProto directly. It turns out that even though the protobuf declares the key id as a uint32, you need to treat it as an int32 when generating the username and token during the digest handshake. The key id can be negative -_- that's why I had a separate struct for it.
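
For concreteness, a sketch of that conversion (the space-separated "keyId blockPoolId base64(nonce)" username layout is an assumption based on the datanode handshake; the signed reinterpretation is the point): the server parses the first field with Integer.parseInt, so a uint32 like 3817551356 has to be formatted as its int32 value, -477415940.

package transfer

import (
    "encoding/base64"
    "fmt"
)

// digestUsername builds the DIGEST-MD5 username for the data transfer
// handshake, reinterpreting the protobuf's uint32 key id as a signed int32.
func digestUsername(keyID uint32, blockPoolID string, nonce []byte) string {
    signedID := int32(keyID) // may be negative; Java's Integer.parseInt expects this
    return fmt.Sprintf("%d %s %s", signedID, blockPoolID,
        base64.StdEncoding.EncodeToString(nonce))
}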

@colinmarc (Owner)

Aha! I missed that detail when futzing with your code. Sorry about that.

@zeroshade (Contributor Author)

@colinmarc Any chance of seeing a release with this? I'm having to use a go.mod replace directive to point at the master branch right now, since there's no release with the changes yet.
