Move ecosystem detection tool to nodejs org #7935
Citing myself from #7619 (comment) and below:
@ChALkeR: Can you describe the infrastructure / hosting requirements that you have to run this? I can look into whether or not some part of our existing dedicated hosting resources can be used.
@jasnell Atm it's suboptimal; some fixes are required to make it faster and less resource-hungry. The main requirement is storage, but I can't give exact numbers right now. Something around 200 GiB, perhaps (it could be hosted on slow storage, as that affects only dataset rebuild time). With proper fixes, that could be brought down to a few GiB, but a cache of about 80 GiB for packed files would still be useful in case the rebuilding rules change. The built dataset itself is around 5 GiB, so it's quite small.
@jasnell I will try to move the whole process to a small VPS again to check that everything works fine (another reason is that I am away from my PC atm and can't rebuild this on my notebook due to slow internet). That would also help me outline (and lower) the requirements for running this thing autonomously.
No worries.
@jasnell More specifics: tarballs for partials (uncompressed) are expected to consume around 30-50 GiB (I need to readjust the blacklist). Those are required to obtain reasonable rebuild speed and are essentially pre-built dataset chunks for each package. They could be stored in compressed form (reducing their size to something around 5 GiB), but that's not supported yet. The dataset itself should be about the same size as all the partials together (minus a few GiB). The compressed size is expected to be about 5 GiB, and we might want to keep several versions of those. The deep dependencies builder has some memory requirements, but I don't remember exactly how much memory it consumes; it's below 2 GiB, I think. I will post an update with real numbers once I finish rebuilding the dataset on my VPS.
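Since compressed partial storage isn't supported yet, here is a minimal sketch (not Gzemnid's actual code; the file name and layout are assumptions for illustration) of how a partial could be gzipped with Node's built-in zlib streams:

```js
'use strict';
const fs = require('fs');
const zlib = require('zlib');

// Sketch only: compress one pre-built partial so it can be kept in compressed form.
// Streaming keeps memory use flat no matter how large the partial is.
function compressPartial(srcPath) {
  fs.createReadStream(srcPath)
    .pipe(zlib.createGzip({ level: 9 }))
    .pipe(fs.createWriteStream(srcPath + '.gz'))
    .on('finish', () => console.log('compressed', srcPath));
}

compressPartial('partials/example-package.txt'); // hypothetical file name
```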
Ok. Partials are consuming 40 GiB. That could increase, as I would probably want to add more information there, e.g. package.json (to check postinstall scripts, for example) and disk usage (to build a list of abnormally huge packages). That would be only +2-4 GiB though, I suppose. Or it could also decrease once I update the blacklist. Upd: partials are 36 GiB with the updated blacklist.
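As an illustration of the kind of check that keeping package.json in the partials would enable, here is a small hypothetical helper (not part of Gzemnid; the path in the example is made up) that flags install-time scripts:

```js
'use strict';
const fs = require('fs');

// Hypothetical helper: report which install-time hooks a package declares
// in its package.json.
const hookNames = ['preinstall', 'install', 'postinstall'];

function installHooks(packageJsonPath) {
  const pkg = JSON.parse(fs.readFileSync(packageJsonPath, 'utf8'));
  const scripts = pkg.scripts || {};
  return hookNames.filter((name) => typeof scripts[name] === 'string');
}

// Example (made-up path): prints something like [ 'postinstall' ]
console.log(installHooks('./node_modules/some-package/package.json'));
```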
Ok. The unpacked dataset size is 29 GiB, 27 GiB of which is code search data. The actual (packed) dataset size is 4.3 GiB; the latest version is uploaded to http://oserv.org/npm/Gzemnid/2016-08-04/.
Status update: no additional manual actions are required anymore, and it can be used on its own to build the same datasets that I'm using, without any internal knowledge. I have begun documenting the commands and making the tool easier to work with; I will hopefully turn it into something sensible this week.
Perhaps the build group can host this? We have room in our infrastructure.
Status update: initial documentation is in place, the search script has also been merged, and atm it has no more bash parts. The usage is pretty simple now. I believe we could start moving this into the org at this point, if we decide to do that.
/cc @nodejs/ctc, should this be mentioned at a CTC meeting?
Big +1 on adding this to the org and having documentation that will allow individuals to make use of the tool!
+1 to moving this in.
@ChALkeR does your tool need to make a lot of calls to the npm database? If that's the case, it may be useful (for speed) to host a copy with continuous replication of https://skimdb.npmjs.com/registry on the same server.
@targos I thought about that. No, it doesn't. I don't think a replica is needed. Everything that I get from skimdb could be handled with a follower (and takes one or two minutes per run even without a follower), so I don't think replication would have a noticeable effect there. The dependency builder is the only thing that could actually benefit from a skimdb replica, not because calling the npm registry is slow, but because it could speed up the storage side, and I am not sure if it is worth it at the moment: I estimate it could save only up to ~30-40% of the dependency builder build time.
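For context, a "follower" here just means consuming the registry's CouchDB _changes feed on demand instead of keeping a full replica. A rough one-shot sketch under that assumption (not the actual Gzemnid follower; the endpoint's availability and accepted parameters may have changed since this thread):

```js
'use strict';
const https = require('https');

// Fetch a batch of changes from the public skimdb CouchDB changes feed.
// Error handling and pagination are simplified for illustration.
function fetchChanges(since, limit, callback) {
  const url = `https://skimdb.npmjs.com/registry/_changes?since=${since}&limit=${limit}`;
  https.get(url, (res) => {
    let body = '';
    res.on('data', (chunk) => { body += chunk; });
    res.on('end', () => callback(null, JSON.parse(body)));
  }).on('error', callback);
}

fetchChanges(0, 10, (err, changes) => {
  if (err) throw err;
  // Each entry identifies a package document that changed since `since`.
  changes.results.forEach((change) => console.log(change.seq, change.id));
  console.log('next since:', changes.last_seq);
});
```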
Result from CTC discussion last week was:
Ok, this got stalled for a bit, mostly due to personal time constraints; I will now try to continue this effort =).
Pre-built dataset update: http://oserv.org/npm/Gzemnid/2016-10-22/ (under 5 GiB). To perform code search, you need three of the files from there. That dataset is exactly the one I currently use for code search, and it was built following the instructions at https://github.com/ChALkeR/Gzemnid/blob/master/README.md.
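To give a feel for what code search over such a dataset looks like, here is a hedged sketch in plain Node. The file name and per-line format are assumptions; the authoritative build and search instructions are in the README linked above.

```js
'use strict';
const fs = require('fs');
const zlib = require('zlib');
const readline = require('readline');

// Sketch: grep one of the pre-built code-search files as a gunzip stream,
// printing every line that matches the given pattern.
function searchDataset(file, pattern) {
  const input = fs.createReadStream(file).pipe(zlib.createGunzip());
  const rl = readline.createInterface({ input });
  rl.on('line', (line) => {
    if (pattern.test(line)) console.log(line); // each line points back into a package
  });
}

// Example query: how often the ecosystem calls process.binding().
searchDataset('slim.code.basic.txt.gz', /process\.binding\(/);
```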
Should this remain open?
I think so. Gzemnid seems like it'll be as useful as CitGM for testing how breaking a change is, and I assume this is just waiting on people to get the time to add it to our infra.
Removing
This moved to nodejs/TSC#490 and subsequently nodejs/admin#130. Closing this.
This is a continuation of the discussion in #7619. For a while now, we have pinged @ChALkeR every time we wanted to know how heavily something was used in the ecosystem. I'm proposing that we bring this into the nodejs org so that it is more accessible to other people (busFactor++;) and more likely to get additional maintainers. @ChALkeR is currently working on the project in https://github.com/ChALkeR/Gzemnid.