-
-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: added transform option #195
base: master
Are you sure you want to change the base?
Conversation
@@ -455,37 +460,37 @@ | |||
|
|||
* update range-parser and fresh | |||
|
|||
0.1.4 / 2013-08-11 | |||
0.1.4 / 2013-08-11 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, my editor removed the trailing white space. This seems like a good thing, but I can make a separate commit if you think it worth it.
Thanks @wesleytodd for helping pish this forward. I am generally skeptical that this type of option is truly worth all the complexity added, ultimately. Did you have a specific use case you are trying to achieve by this feature you can describe? |
3faa226
to
7553109
Compare
I do have a specific use case. I have an npm registry proxy, it caches the packuments and for now is re-writing the tarball urls before it writes them to disk. This is for robust npm client testing, and so I often open the registry on port 0. So every time I restart I have to blow away the cache because it points to the wrong port. I prefer to re-write on the other end, before sending it from the cache, which would enable me to re-use the cache. I was using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not all my comments yet, as it seems when you force push and i was in the middle of the review, all my pending comments are erased... i assumed it was ready to review by the review assignment.
`etag` and `lastModified` and to use the underlying file as it normally would. | ||
|
||
*Note on range requests:* When a range request is sent, the read stream will | ||
only read part of the file. If your transform is preforming replacements, or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would violate the actual http spec, as the range is on the actual recieved output. This will be a large issue for example if the transform doubled the output (maybe encode as hex); the web browser will send a range way past the end of the actual underlying file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think defaulting acceptRanges: false
when transform
is passed is a good compromise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If transform starts removing pretty much all the features of this module, at what point is the feature actually just a separate functionality?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, default is not removal. In my use case I actually would force set etag
and lastModified
to true and acceptRanges
to false anyway. To me that is not removing, just normal options setting for a specific use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, to clarify i mean the removal as in what happens by default when the transform option is enabled. Yes, folks can then understand the foot guns and re enable certain things, but many will use the defaults.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I think the defaults are good mostly with the transform
option, especially if we turn off range requests by default. FWIW, range requests are so uncommon in my experience. The only time I have ever used them was building a video player, but that is a very uncommon use case to optimize for.
If you are using the transform
option, the default should be no range requests because it means you need to be very careful your transform can handle range requests. The content-length
is a little less clear, but it really depends on your transform, so it makes sense to not set it if you use transform
. Lastly for last-modified
, this also depends on what your transform does so it makes sense to leave it up to the user.
The more I think about this, the more I think this is the right approach to a transform
option default as long as it is well documented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree on some of this; I think we can make a transform work with all the features if we implement it instead along the lines of a virtual fs, as I poked at in another comment, and not like this implementation. This implementation, ultimately, is just a simple solution that ends up having a lot of pitfalls I think don't need to be there if done differently. Range requests are not uncommon in practice; the module implements specifically what is not uncommon around the web. A very common point (besides audio and video files) that use range requests are pdf files too. That is how the web browser can display the first page of a 50mb pdf without waiting around forever. The browsers make use of range requests for quite a variety of file types under the hood.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also generally think it is bad API design for the default setting of one option to change based on what another, different option is set to, as it just adds confusion as to what exactly is going to happen with a set of options, as the defaults are now dynamic to other factors.
A function which returns a stream which can be used to transform the data. | ||
If this option is passed it will change the defaults of `etag` and `lastModified` | ||
to `false`. If the transform stream changes the content length, it is also | ||
responsible for updating the `Content-Length` header or changing to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is likely to be a large foot gun. Is there some way to just make this work without adding such a potential issue? For example, for the intended use cases of this, is it more likely than not that the content length will be different from before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought hard on this. I don't think there is a good solution which doesn't rely on the user setting it if they are modifying the content. So in my case, I need to buffer the entire file to parse the json and then transform.
In this case, I can calculate the new length, but if you are doing something like escaping content in a CSV you might be able to do it mid stream and in that way you would not know the content length in advance. In this case, you might consider moving to chunked encoding to avoid it. One of the reference implementations just does that for you, but that is also very presumptuous.
TBQH, I feel that this is an advanced feature, requiring advanced understanding. That justifies, to me, documenting it and letting users "do the right thing" for their use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this feature is being implemented on the "wrong side" of this module's operations. As in, maybe the transformation should happen prior to the stream entering into this module, perhaps solving the virtual fs stuff folks want can also solve this as well, including keeping all the features of this module working an intact, like ranges, etags, etc.
I'm thinking like in a way, the transform is a kind of virtual fs in itself, where the contents, at least, are virtual.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, well if we allowed you to pass in the reference to fs
then yeah we could do that. That said, I am strongly opposed to virtual fs implementations. There is so much complexity in that, it is just a really hard trade off compared to such a simple feature such as this. Have you ever dug into things like vinyl-fs
? Maybe there are better more modern ones, but either way it is a big dependency to take on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So to expand on what I mean by a virtual fs is not those enormous things. More like providing an interface that is much simpler to do the little but that this module needs like "hey, give me a readable stream for the file of this name" or some such. In that way, the returned stream in this case could have all the transforms added at that point, which would be below where things like ranging and similar are done. If the "virtual" thing didn't understand (or able to support) ranges, it would just return the full stream, for example. I am just thinking out loud here, to be honest. I just think having the transform part between this module and the fs seems like a better fit to get all the features working, but happy to hear other solutions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, ok then yeah I think I can get on board with that. So we would need fs.stat
and fs.createReadStream
. I think it would be pretty simple to do that. So in the case of doing a transform you would have to read the metadata and manage your own stat
data based on the transform you are going to do. So in my case, I would have to modify the file size in the stat
call by reading the whole file and applying the transform, then throw it out. I guess maybe I could cache it in memory, anyway it would push more of the complexity to the end user, but would allow you to be more technically correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. And to also be clear: I don't think we even need to copy those names / those specific interfaces. We can "dumb it down" to what is actually needed for this module to do it's operations and let the "default impl" map to those fs.*
calls, but when users want to do something different, they get an API that is easier and only what they need to work with. For example, fs.stat
returns a lot of things we don't need -- so no need to make the user implement a fs.stat
interface. The same for fs.createReadStream
especially, as that take a whole slew of options that this module is just never going to need to send, and hammering it down to a simplified interface also means that it is clear to the user what, exactly, they need to implement to get it working well.
I know we need to fix the TOCTOU issue, so perhaps we can roll those things together. But even if not, we would do well to make this interface not prevent us from resolving that issue, at least :) To be specific, the underlying changes that were being held is to (1) open a fd to the fs path, (2) stat the fd and then (3) read from the fd. I suppose this simplified interface could provide the same mechanical interface or something.
7553109
to
b4e2d3f
Compare
Github really needs to fix that. Sorry I was fixing the build failures. I wont push anymore for a bit to be safe. |
How does this interact with a http HEAD request, by the way? We want to be sure, even if this is an advanced feature, it doesn't end up violating http itself. Ideally we should prevent the user from making such violations if possible. |
Great question, and one I did not test or think of. Now that you say it, I am confident this would be broken if you did a transform which modified anything represented in the headers values. Would it make sense to, when EDIT: also only do that when it is a |
b4e2d3f
to
7ae38be
Compare
Yea, that could probably work. It is of course hard to know what the transform is going to do, and the transform (as it is currently) cannot make any shortcuts to simply in the case of HEAD, which may be useful. I assume that a transform could change the Content-Type and other aspects, is that right? Ultimately, all the HEAD needs to do is just return the same headers as the same request would have returned if it were a GET method (including Content-Length, if GET would set that). |
Yeah, it could for sure do that. With your idea above of a fs wrapper I think we could get the behavior you are describing. To be honest though, we can get it also from this implementation. If the only requirement is to be technically correct, then I think I prefer the |
I mean, I would say that I think the "requirement" is that the user can implement a transform without scracificing all the other features in this module by default. As in, they should be able to easily implement a transform that "just works" with the module and it's feature set that makes this module actually more valuable than
Ideally I tried to clear this in a comment above; I'm not suggesting the user copy the interface of those two fs methods, as they do more than is necessary for this module. The two most popular requests for this module is transform and virtual fs. I think we can actually solve both at the same time, and to me, that is worth it just based on what the users seem to be wanting. |
I am definately not following on what your use-case means and how, exactly, the transform of this module actually helps in this case, haha. Are you saying that you have a bunch of files on disk you want to serve up, and then some of those are json, in which you want to read the json, modify it, and then return that as the response? If so, I can show you a solution that does all of that very easily without this functionality even being necessary. But I think I'm just not understanding the underlying ask, perhaps, not sure. Because unless these json files are gigabytes, doing what you are describing as a transform stream seems unnecessarily complicated compare with what is already possible with express today. |
Yes, but remember I currently have I also could add this functionality at the If not, I will take a stab at doing an abstraction over the fs calls tomorrow and see what I come up with. I think your concerns are valid, I just worry about overcomplicating things for less common use cases. For example, writing a transform stream on a pdf. While I totally understand that there are many specific use cases for range requests, those are not commonly the use cases where you would use a transform stream. Lastly, I fully agree we should never break spec, and also should "do the right thing" by default. But we cannot also let perfect get in the way of shipping the things that are useful to the end users. |
Actually, without support here, wouldn't that just move all of the complexity we are discussing into that layer? I guess that might also not be a perfect solution. |
Adding caveats doesn't help :) I'm sure you can keeping adding them all day until you eventually describe what you're trying to achieve. Is it possible to share the code at all? My understanding above would still work even with your caveat, though.
No, but also I cannot say for sure unless you can share what you are trying to achieve so we can actually work together on a solution. Right now, there is not enough information unless I just accept your solution here for what it is. I want to help make a good, workable solution, but to do this, it needs to start at the goals what what it is that is trying to be accomplished in a meaningful way, so alternative solutions can be thought of and measured against that high-level goal. |
I added you on the private repo where I am working on this. One thing to remember, I took up this PR because it was existing work and it also solved my need. There are other ways I can solve my problem, but I wanted to take this opportunity to also move forward the more generic work here. So if we can use my direct use case as an example, but think about how a more generic solution can be added here for others to use, that would be awesome. |
Also an aside, this is where having a good real time chat for the team would be nice 😄. Slack even has pretty decent video chat and screen share. |
The reason those did not land is due to the issues mentioned in those PRs, and this PR doesn't actually resolve those issues. On top of that, I ultimately think it is the wrong solution to the underlying goal, and the initial PRs simply just did what looked easy, but it turns out, it is much more complex and those PRs just didn't want to solve that complexity. Landing something "just because" feels very wrong to me. I think we need to provide good solutions to the issues and now just around the "hey, add this 1-2 lines to your source and then it works, but with all these caveats that are ultimately going to make it hard to use in practice" :) |
Yeah I agree completely that if it is hard to use in practice it is not worth landing. My point above is that "what is use in practice"? Because that usually doesn't mean every feature all the time working together. That said, I take your feedback and would rather move on to trying to see what a fs wrapper would look like because that can solve all of the problems well. |
@@ -78,6 +78,8 @@ the 4th byte in the stream. | |||
|
|||
Enable or disable etag generation, defaults to true. | |||
|
|||
This will default to `false` if the [`transform` option is passed](#transform). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This wording is misleading; if the transform
option is passed, but as transform: undefined
, I assume this will not default to false
, but the wording says it will as long as the option is passed (without regard to what value is passed).
@@ -104,6 +106,8 @@ in preferred order. | |||
Enable or disable `Last-Modified` header, defaults to true. Uses the file | |||
system's last modified value. | |||
|
|||
This will default to `false` if the [`transform` option is passed](#transform). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This wording is misleading; if the transform
option is passed, but as transform: undefined
, I assume this will not default to false
, but the wording says it will as long as the option is passed (without regard to what value is passed).
##### transform | ||
|
||
A function which returns a stream which can be used to transform the data. | ||
If this option is passed it will change the defaults of `etag` and `lastModified` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This wording is misleading; if the transform
option is passed, but as transform: undefined
, I assume this will not default those to false
, but the wording says it will as long as the option is passed (without regard to what value is passed).
@@ -119,6 +123,55 @@ Serve files relative to `path`. | |||
Byte offset at which the stream starts, defaults to 0. The start is inclusive, | |||
meaning `start: 2` will include the 3rd byte in the stream. | |||
|
|||
##### transform | |||
|
|||
A function which returns a stream which can be used to transform the data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be useful to point to what, exactly, "a stream" means. I assume it means it needs to be a Node.js ReadableStream ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see it is addressed below, which seems like duplicate wording. I guess that it is explained down below, but maybe it doesn't need to be listed twice, idk on this comment, but left it for posterity.
`Transfer-Encoding: chunked`. | ||
|
||
The transform function is given the readable stream from `fs.createReadStream`, | ||
the `response` object and, the `options` passed to `send()`. It must return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of passing in the options
to this transform function? Remember that there is those APIs like sendStream.etag(false)
that changes the setting, but is not reflected in that options object; a transform implementer may not realize that, so if those are not important, we should strip them out. Or if they are important, they should get added in from the root source of the setting values.
can set `acceptRanges: false` to disallow range requests or you will have | ||
to ensure that your transform logic can handle partial data. For an example | ||
of a transform which can work on range requests, see the test labled | ||
`should transform range requests`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the test, it is honestly hard to tell what exactly is going on there. Is there a better real-world transform like the tobi one that can be used instead? This is mainly because the docs are pointing folks to look there.
read: function () { } | ||
}) | ||
stream.pipe(concatStream(function (d) { | ||
var _d = Buffer.from(d.toString('utf8').replace('tobi', 'not tobi')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume that this example does not work with range requests; just above this it says if it doesn't work with them, be sure to turn them off. So the example should probably head the notice from the docs above as a good example :)
// headers for their stream type. | ||
res.removeHeader('Content-Length') | ||
|
||
stream = this._transform(stream, res, options) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So since the original stream
is overwritten and now lost; does this open open the floor to fd leaks again? Does using a transform stream mean users now need to be careful not to cause them with the fs steam encounters an error, or similar?
Yea, sorry, I am trying to articulate myself over this forum, so please bear with me if it is not quite coming through :) I think the point I am trying to make is that ideally a system does not have "two modes of operation" for example -- if using a transform stream by default will cut off a large amount of functionality this module is providing, at what point is it just a different module. For example, you can certainly get by with just But apart from that, I think we can actually implement transform in a way that doesn't require turning off all those features, which is really what I'm trying to drive towards. Of course, users can still turn off features they do not want, but those features add a lot of value, and I think if transform is provided by the platform, it should feel like it's part of the platform, letting you take advantage of all the features the platform is providing (etags for caching, range requests for partial downloads / download resumption, etc.). And from the two main feature requests from this module (transform + virtual fs), I think we can provide an interface that can be both easy to use and solve both things, which would be the biggest win for users and ourselves. |
Ok, so you are not using Express, I see. That would make the solution more complicated, because the added crutch added by trying to implement everything over again that Express provides, which seems a bit odd, really. I see quite a bit of things in there that is already Express sugar, but the bolt ons actually provide spec footguns that Express takes away :) Regardless, where is the spot in the code in which you would need to place this functionality, exactly? I assume at the one spot where the |
Ok, I see where in there. The way you are using the router, you can implement this and reduce your overall complexity a lot. I think were you may get mixed up in that, for your packuments, those are not actually "static" for your use-case, but here you are, trying to use something created for static use-cases (I mean, it's part of the name of the module you imported 😆 ). I see what you mean by wanting to move the transform. I assume I can make a PR to your repo? If so, I would be happy to throw one up tomorrow, if you are willing, to see what I mean. I think it would reduce a lot of the complexity in your impl there, ultimately, and get what you are trying to achieve. |
Happy to have a PR, but I don't want you to spend your time if it just to use send directly and do |
Also, happy to hop on a call, because it might be faster. The problem here is that I shared a work in progress, hence the private repo, so it is messy. Since it is clear this PR will not move forward, I think we should close all of these so that noone else makes the same mistake I did with trying to pick it up again. Then I can work on my project with just |
Gotcha. So I wasn't going to just change it to
I am always happy to do so as well anytime; we don't have a "norm" for that, so would need to ad-hoc it somehow if you like; I think you have my phone number for example. While we were in the TC meeting yesterday, I actually opened for discussion the idea to have a "realtime" communications area, to see what folks think. If that gets going, that could be a method in the future (but for the time being, you can email me to set up something I suppose as a solution, lol).
Hm... I guess that makes sense. I feel bad closing folk's PRs on them, but looking back on at least one (I didn't look at all before responding here), I guess they were stale in that I provided some feedback and OP didn't respond back. I can see from that POV. There is actually a top-level issue about transform that helps, as I think the idea to transform is valid. Just looking at the age of at least one of those PRs, I think once the virtual fs requests came up is when I started to form the new idea about getting them both done together, as I think I tried to articulate above (again, sorry if it's not clear, please be gentle; it's hard to share a large type of design over comments in GH, so probably need to whiteboard them and/or write up a technical document to describe what I'm thinking of to convey it better). |
Ok, so I spent a bit this morning looking at how we could provide a "synthetic filesystem". The first challenge will be that we expose the underlying I think there are reasonable breaking changes we can make, but since I don't think this is nearly important enough to hold up |
OK! I did a bunch more experimentation and I found an interesting solution which enables what I want while also not requiring any changes in this lib. Posting here to see what you think, and if it seems fine to you, I will publish the work as a module and we can include an example of how to use it. function transformResp (res, transform) {
transform.pipe(res)
return new Proxy(res, {
get: (target, prop) => {
if (prop === 'write' || prop === 'end') {
return (d, e, c) => transform[prop](d, e, c)
}
return target[prop]
}
})
} And here is an example of using it: const buf = []
const s = new TransformStream({
transform: (d, e, cb) => {
buf.push(d)
cb()
},
flush: (cb) => {
const json = JSON.parse(Buffer.concat(buf).toString('utf8'))
// transform json...
const b = Buffer.from(JSON.stringify(json), 'utf8')
res.setHeader('content-length', b.length)
cb(null, b)
}
})
// Send file
send(req, '/file.json').pipe(transformResp(res, s)) Obviously this is not quite complete, there are other properties you would need to proxy to get the full behavior of the stream, but it illustrates the general point. With this I am able to use the same transform stream in |
Prior work:
Adds a
transform
option. Allows users to transform inflight requests.Things to note:
transform
option it can invalidateetag
andlast-modified
. I added documentation to that effect and set the defaults tofalse
whentransform
is passed.acceptRanges: false
if you cannot write a transform which handles this kind of request.res
andopts
to the function, this should make it easier to implement transforms which support range and other types of requests.I think this resolves all the open conversations on the topic I saw from before. Let me know what you all think!