Create lists of commands to test coverage parity against #1070
Comments
We should absolutely leverage the online linux man pages to periodically fetch a big, big list of commands. Sample: http://linux.die.net/man/1/ has almost 10,000 commands. |
We could make separate projects to track commands based on platform (since they overlap, we can't use milestones, which is a pity since it would give us a nice progress bar) Linux:
Windows: OS X: |
@waldyrious For Windows, commands in CMD and PowerShell are DIFFERENT. For example, |
@be5invis thanks for bringing that up. It is certainly something we need to consider (e.g. we currently treat all linuxes the same, even though some of the commands are shell-specific). See #190 and #816 for previous discussion. That said, that problem does not affect this issue: the former deals with how we organize the command pages we do have, while this issue is about identifying which commands we don't yet have, but should. |
@waldyrious The full PowerShell commands on my PC: |
If the tldr client emits a different exit status depending on whether the page exists or not (like tldr-bash-client does), then we could have a semi-automatic bash script that runs through a list of commands and emits a list of the ones that don't exist yet. I could even write something like that & create a gist quite easily. It would certainly help people who want to contribute find a page that needs doing. |
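A minimal sketch of the script described above, assuming the client is invoked as `tldr <command>` and exits with a non-zero status when no page exists (the comment says tldr-bash-client behaves this way; other clients may not). The function name `list_missing` is a hypothetical choice for illustration:

```shell
# list_missing [CLIENT]
# Reads command names from stdin, one per line, and prints those for which
# the client exits with a non-zero status (assumed to mean "no page").
# CLIENT defaults to "tldr"; the exit-status behaviour is an assumption.
list_missing() {
  client=${1:-tldr}
  while read -r cmd; do
    # Skip blank lines in the input list.
    [ -n "$cmd" ] || continue
    # Redirect stdin so the client cannot swallow the rest of the list.
    if ! "$client" "$cmd" </dev/null >/dev/null 2>&1; then
      echo "$cmd"
    fi
  done
}
```

It could be run as `list_missing < linux-commands.txt > missing.txt`. Passing the client as an argument keeps the loop usable with any client that follows the same exit-status convention.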
You can always check the files present in the repo itself for parity, no? |
@agnivade Yeah, we could do that too! Do a |
Executing
on http://linux.die.net/man/1/, gives this file: linux-commands.txt This is obviously pending sorting, which I'll do soon. |
Sorting complete! Here's what I came up with:
cat linux-commands.txt | xargs -P4 -I {} bash -c 'if [[ "$(find tldr/pages/ -name {}.md | wc -l)" -ne 0 ]]; then echo yep>>yeses.txt; else echo nope>>nos.txt; fi'
echo We have $(cat yeses.txt | wc -l) out of $(cat linux-commands.txt | wc -l) commands in tldr-pages - $(cat nos.txt | wc -l) commands are missing.
Running the above reveals that:
|
I wonder if, after we've compiled one or more lists of commands to add, we could somehow calculate the completeness percentages automatically and display them in the README with a badge. If we do compile multiple lists, we could even organize the completion badges in a table to provide a dashboard similar to the progress table of Wikipedia's WikiProject Missing encyclopedic articles. Does anyone have an idea whether something like that is doable and/or hints about how to go about implementing it? |
I would like to take a stab at this. I am thinking of just taking the GNU coreutils list and testing parity against it. The linux.die.net page contains a lot of commands which have to be installed separately. The badge thing can be easily done with a custom svg element. |
@agnivade I can't wait to see what you come up with! I'm more than willing to provide the actual content of the lists if that takes some work off your plate (I have a bunch of notes and links in a google doc, besides the resources I listed above). |
Sure, that would be great. |
Oooh, awesome :D |
@waldyrious - I might take a stab at it this weekend. Can you share the links/notes that you have ? |
Sure. I'll block off one hour to work on this today, and will post the resulting data. |
Heads-up: the wiki page "Pages plan" has been deleted to centralize tracking of missing pages in this thread. I've moved all the information that was present there to this spreadsheet, which is publicly viewable and anyone can add comments. It's a work in progress (I just started it). I'll give write access to the current maintainers. |
@waldyrious Wow, that's an impressive spreadsheet! Is there a filter for just the ones that haven't been done yet? How are bulk lists of commands added to the list? |
There will be a filter, yeah -- that's one of the reasons I've decided to build it in a spreadsheet. The lists of commands will be added manually (using various helper tools, of course), since the various sources don't use a common format. Let me know on Gitter if you'd like to work on this so we can coordinate. |
I am concerned about how I would get the total list of commands programmatically, since I would like to run the list against every commit merged into master. |
That document is by no means meant to be the final location of the list. It's just the way I figured would be easiest to get it started and fill it quickly. I don't know yet what setup would be the best balance of (1) community maintenance of the data, (2) machine consumption of the contents, (3) automatic synchronization (as much as possible) as new pages are added. Ideas are welcome. Also, the choice of how to set this up would depend on how often we would want to update the list. I think we can start with something reasonably static, to make things easier, especially since we have a lot of work to do catching up on established commands before it would make sense to start chasing more dynamic lists (say, top node.js-based CLI tools or something like that). |
Umm no .. I think you got the wrong idea. 😝 We don't need to synchronize when new pages are added. That would be crazy. It seems like you put a lot of effort into this. Frankly, I didn't need so much detail. Here's what my plan is -
That's it. No need to update any list when new pages are added. |
Hahah, yeah, I got a little carried away there. Although I might have given off the wrong impression. The way I was planning to have this "automatic sync" feature was to simply open one issue per command to add, and assign them to milestones according to the lists they appear in. That way we'd get a nice live overview page with progress bars for each of the lists we'd want to reach parity with. For reference, my inspiration came from the overview table of Wikipedia's WikiProject Missing encyclopedic articles. In addition to one milestone per (major) source, we might also want platform-specific lists (windows commands, bsd, etc.), and maybe topic-specific lists (email clients, text editors, compilers, etc.). Of course, this doesn't prevent us from having a "master completeness list" and using that to compute a single "overall completeness" metric. We do need to decide what goes into that list, though. The obvious choice is a metric of the most popular pages (e.g. the top 1000 entries sorted by how many of those lists they appear on), but let me know if you think something else would make more sense. |
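The "top entries sorted by how many lists they appear on" idea could be sketched roughly as follows, assuming each source list is a plain text file with one command name per line (no spaces inside a name). The function name `rank_commands` and the file layout are hypothetical:

```shell
# rank_commands LIMIT LIST_FILE...
# Prints the LIMIT most widely listed commands as "<command> <count>",
# where <count> is the number of list files the command appears in.
rank_commands() {
  limit=$1
  shift
  # De-duplicate within each file first, so a command counts once per list.
  for f in "$@"; do
    sort -u "$f"
  done |
    awk 'NF' |                 # drop blank lines
    sort | uniq -c |           # count occurrences across all lists
    sort -k1,1nr -k2,2 |       # highest count first, then by name
    head -n "$limit" |
    awk '{ print $2, $1 }'     # reorder to "<command> <count>"
}
```

For example, `rank_commands 1000 lists/*.txt > master-list.txt` would produce a candidate master list, with `lists/` standing in for wherever the per-source lists end up being stored.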
Your idea seems like a lot of manual work, something which I personally would want to avoid. I was planning to compute that "completeness" metric and be done with it. If we indeed decide that it's just gonna be 1000 commands, then we might as well compute the list and check it into the repo, so that my code can easily compare with it. |
Sure, as I said, the list will probably not change much after we compile it. My idea is just a nice-to-have I might do on my own later on (unless you guys object). For the master list, we just need to decide which criteria we'll use to define its contents -- from there it's just a matter of collecting the rest of the data and applying the filters. So in that regard, what are your thoughts on which criteria to use: which lists to compare against, how many commands to include, etc? |
Update: the table is pretty much ready now. Some areas that still need some help:
Apart from that, we can start deciding how to use that data to compile our master list :) Note: I didn't include the linux.die.net manpages, since even just the first section contains about 10,000 commands, which makes the table unwieldy and kinda overwhelming, to be honest. |
By the way, the plan to use milestones won't be possible after all. I had already reached this conclusion before, but forgot it in the meantime: it turns out GitHub only allows a single milestone per issue, so there would be no way to simultaneously track progress towards multiple coverage parity goals :( That said, we could still have a milestone for the master parity list, which IMO would be a good thing as it would make those missing commands more visible as issues that newcomers could tackle. (It could also be the target URL for the badge.) |
So are you saying you have manually counted each 'x' just to verify the automated count ? That's some dedication ! Why bother with the manual count at all if there is already automation for it ? Unless you suspect that |
Oh god no, haha :P I'm not that crazy ;) |
Ah I see :) Didn't notice that there was another sheet. |
Awesome work! Yeah, perhaps we could have a 'current goal' to document all the commands in a given list, and keep moving to new lists as we complete old ones. Having a list of commands auto-generated that have yet to be documented for the 'current goal' parity list would be helpful for newcomers, yeah. The sheet is rather unwieldy though on my screen, since the frozen panes take up about 60% of my available screen real-estate on my laptop 😕 |
I think we should move the orange and yellow cells to a new row below, because they're in the same row as coverage but just signify the expected count, not coverage. And lastly, our current coverage is 52%, right? |
I made the heading more compact. Is that workable now?
Agreed, I just did that. Ideally we won't even have to include the expected count on the table, but until we figure out what's going on with the mismatched values, we'll need those cells. |
Ok, I've filled the table some more. The two sources that still need parsing into a plain list of command names are Inconsolation and ArchWiki's List of applications. Any help appreciated!
Yes, but that's a plain fraction that doesn't consider the relative importance of the missing commands. I'd rather have a weighted coverage percentage, where each entry is weighted by the number of occurrences in these other lists. |
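One way the weighted percentage could be computed, as a sketch under stated assumptions: the weights live in a text file with one `<command> <occurrences>` pair per line, and a command counts as covered when a file named `<command>.md` exists anywhere under the pages directory (the same `find` check used earlier in this thread). The function name `weighted_coverage` and the file format are hypothetical:

```shell
# weighted_coverage WEIGHTS_FILE PAGES_DIR
# Prints the integer percentage (rounded down) of total weight that belongs
# to commands which already have a page under PAGES_DIR.
weighted_coverage() {
  weights_file=$1
  pages_dir=$2
  covered=0
  total=0
  while read -r cmd weight; do
    # Skip blank lines; a missing weight counts as 1.
    [ -n "$cmd" ] || continue
    weight=${weight:-1}
    total=$((total + weight))
    # Note: -name treats glob characters in the command name as a pattern.
    if [ -n "$(find "$pages_dir" -name "$cmd.md" | head -n 1)" ]; then
      covered=$((covered + weight))
    fi
  done < "$weights_file"
  # Avoid dividing by zero on an empty weights file.
  [ "$total" -gt 0 ] || { echo 0; return; }
  echo $((100 * covered / total))
}
```

With weights `ls 3` and `cat 1` and only `ls.md` present, the plain fraction is 50% while the weighted figure is 75%, which is the difference the comment above is pointing at.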
Through weird es6 magic, I bring you a list of commands for the Inconsolation lists! Here's the code I used in the Firefox console for reference:
(function() {
    let result = [];
    // Each index link is expected to read "name: description" or "name1, name2 and name3: description".
    document.querySelectorAll(".entry-content > p:nth-child(4) a[href]").forEach((el) => {
        const text = el.innerText.trim();
        // Skip links without a colon, starting with an uppercase letter, or containing a dot.
        if(text.search(":") === -1 || text[0] !== text[0].toLowerCase() || text.search(/\./) !== -1) return;
        // Split on commas and on the standalone word "and" only, so names such as "pandoc" stay intact.
        result.push(...text.split(":")[0].split(/\s*,\s*|\s+and\s+/i));
    });
    // Drop empty strings and anything that still contains punctuation or spaces.
    result = result.filter((cmd) => cmd.length > 0 && cmd.search(/[,*(){} ]/) === -1);
    // Print each name once, one per line.
    console.log(result.filter((el, i, arr) => arr.indexOf(el) === i).join("\n"));
})();
...I've pasted them into the spreadsheet. They might need a little bit of tidy-up work though, since the input was messy. That archwiki one though looks tough, since they don't detail the name of all commands in the list. |
Can you explain the code? I'm afraid just parsing the link titles will produce a list with way too many missing entries, because many of the titles don't contain command names directly. On the other hand, I'm not sure I can think of anything that would work better without involving manual processing of each page linked from the entries... 😕 As for the ArchWiki page, I guess it would suffice to extract only the contents of the sections titled "Console". That will definitely leave some gaps in the output, but the page isn't meant to be a structured list anyway, nor does it focus specifically on command line programs, so I guess it's reasonable to parse it more loosely. |
It is a bit messy, isn't it! 😛 What it does is extract the names of the commands listed on the page, since I assumed that it was an index of all the commands the author had talked about. It discards the following:
Once done, it extracts the bit before the colon and does the following:
|
Some related discussion here: #1953 |
This has been pending too long ! I will go on vacation soon, I promise to work on this during that time ! |
Enjoy your vacation! If the two coincide, then so be it 👍 |