Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Idea on how to handle manually editing the docs files while keeping the automated conversion #396

Closed
Shazwazza opened this issue Nov 25, 2020 · 9 comments

Comments

@Shazwazza
Copy link
Contributor

Decided to create a central discussion about this here instead of all of the ideas getting lost in various threads :)

Here's some found previous notes

The issue is: We have a conversion program that converts the Lucene (java) docs to a doc format that we can use for Lucene.Net. Each time this is run it will overwrite our docs files. This means that any manual updates made to these docs files such as fixing spelling mistakes and more importantly code snippets would just be overwritten.

I have a simple proposal

In various old threads we had some complex proposals but none of these are 'perfect' and won't cater very well for things like spelling mistakes. Instead of inventing some system, we can simply use Git:

  • We have a branch for each lucene (java) docs conversion that we plan to do
    • currently this is just one: 4.8.1. This branch can be called something like docs/converted/4.8.1 and created from master
  • These converted branches are the only branches we ever execute the JavaDocToMarkdownConverter conversion program.
    • So currently, we would run JavaDocToMarkdownConverter once we create this branch. There will probably be no changes now since we've already executed this in master.
  • Merge the converted branch to the master branch
  • Make any required changes to the master branch docs files
  • If we need to re-run JavaDocToMarkdownConverter because we've made some fixes to it to fix some conversions, formatting, etc... we re-run this on the converted branch, then merge changes to the master branch. This will trigger a bunch of merge conflicts which can be resolved by Git merge.

We won't be running this conversion program that often so the amount of merging would be minimal and we'd only have merge conflicts on files that we've changed that have large conflicts with changes we've made in the converter ... which would be quite rare so I don't foresee a lot of work with a Git merge.

Some basic rules to this:

  • We will never merge into a converted branch from master since this will mean merge conflicts are not triggered and everythig would just be overwritten again
  • When we start working on a newer version of Lucene.Net, a new converted branch will be created from the current converted branch and the conversion is run there for the new lucene version.

@NightOwl888 Any thoughts on that?

@NightOwl888
Copy link
Contributor

I agree this sounds simpler to manage than other ideas that have been kicked around, provided the procedure is documented in the contributing section of the web site so we never forget how to manage it.

Implementing this procedure would close #283 and allow us to begin working on #284 immediately (and allow us to accept PRs to update API documentation).

However, the tag docs/converted/4.8.1 sounds like a finished product rather than a half-way point which is what it actually is. Perhaps one of the following would be more descriptive?

  • docs/converted2markdown/4.8.1
  • docs/markdown-converted/4.8.1
  • docs/raw-converted-markdown/4.8.1

Changes to JavaDocToMarkdownConverter

Since JavaDocToMarkdownConverter is only executed against the converted branch, any changes to it must also be committed to the converted branch. And being that merges only happen in one direction, changing it in the master branch would need to be strictly prohibited.

Perhaps the simplest way to enforce this would be to remove git tracking for JavaDocToMarkdownConverter from the master branch and add it to .gitignore on the master branch, but keep both enabled on the converted branches. If we follow the same conflict resolution pattern as the nbgv prepare-release command uses for the version.json file with our .gitignore file, we can prevent these changes from creeping their way back onto master when merging into it.

If we need to re-run JavaDocToMarkdownConverter because we've made some fixes to it to fix some conversions, formatting, etc... we re-run this on the converted branch, then merge changes to the master branch. This will trigger a bunch of merge conflicts which can be resolved by Git merge.

By that I hope you mean git mergetool. I use it with Git for Windows in combination with Beyond Compare, and it makes dealing with conflicts much quicker and easier.

The "LOCAL", "BASE", and "REMOTE" branches visible side by side as well as a way to quickly scroll to each merge conflict. The scrollbars are linked so you see the same section in each pane when you scroll one of them. There are color coded arrows to quickly inject different blocks/lines of text or for more complicated edits manual editing is also allowed. Beyond Compare Pro also has features that allow ignoring of "unimportant changes" such as whitespace or code comments, which make some changes easier to deal with.

image

In this example, I am merging master into the current branch. The head of master is displayed on the right, the current commit of the current branch is displayed on the left, and "BASE" is what the current branch looked like prior to the current commit. The bottom section is a "best guess" of how to merge it, and around 40% of the time it guesses right.

At the end of the operation of a file, it creates a backup of the merge conflict in a file with a .orig extension added to the original, so if you get it wrong on the first try there is a way to go back and fix it manually.

It takes a little getting used to, but it is much quicker to pop in a block of text by clicking a colored arrow or to resolve a conflict by clicking on an exclamation point than going through the blocks in the raw source file that git creates.

@Shazwazza
Copy link
Contributor Author

Implementing this procedure would close #283 and allow us to begin working on #284 immediately (and allow us to accept PRs to update API documentation).

yep! without a lot of work we can very quickly start getting all of the docs and code samples fixed up

However, the tag docs/converted/4.8.1 sounds like a finished product rather than a half-way point which is what it actually is. Perhaps one of the following would be more descriptive?

yep sure, I'm happy with docs/markdown-converted/4.8.1

By that I hope you mean git mergetool

yes exactly. I just use GitExtensions so this happens all automatically for me. I use KDiff instead of beyond compare since I like it better and it does all the 3 way merging, auto guessing, whitespace options, etc... (although it looks like it was made in the 80s and never changed)

Perhaps the simplest way to enforce this would be to remove git tracking for JavaDocToMarkdownConverter from the master branch and add it to .gitignore on the master branch

Yep all of what you said makes sense. The converter code could even be removed entirely from the master branch if we wanted and only live in the conversion branches. I have no experience with what nbgv prepare-release so I don't really have a comment about that one (but seems like an interesting tool!)


I'm happy to get started on this if we agree with the branch naming?

@NightOwl888
Copy link
Contributor

Yep all of what you said makes sense. The converter code could even be removed entirely from the master branch if we wanted and only live in the conversion branches. I have no experience with what nbgv prepare-release so I don't really have a comment about that one (but seems like an interesting tool!)

I poked around under the hood of nbgv and then started experimenting to see what options Git may have to solve the issue with keeping JavaDocToMarkdownConverter updated on each of the converted branches. Then I realized something: this is not a VCS problem, it is a versioning problem.

This issue can be addressed simply by:

  1. Moving JavaDocToMarkdownConverter to a repo of its own so there is only 1 copy to deal with (https://selfserve.apache.org/)
  2. Converting it into a dotnet tool so it can be packaged and versioned with NuGet
  3. Making all of the scripts that utilize it always uninstall and reinstall the latest version of the tool from the feed (see dotnet tool update) or use a specific version in cases where version drift is a problem

We don't have to use the official NuGet.org feed for this, it can be MyGet, Azure Devops, GitHub, or whatever as long as it is Internet-accessible.

I'm happy to get started on this if we agree with the branch naming?

Sure, that's fine.

@NightOwl888
Copy link
Contributor

I have converted the JavaToMarkdownConverter to a dotnet tool (currently at https://github.com/NightOwl888/lucenenet-javadoc2markdown) simply named javadoc2markdown and set up an official NuGet feed to download it from at https://dev.azure.com/lucene-net/Lucene.NET/_packaging?_a=feed&feed=lucene-net-tools

To install the latest version of the tool in the scripts to a local path, just run the command:

dotnet tool install javadoc2markdown --add-source https://pkgs.dev.azure.com/lucene-net/_packaging/lucene-net-tools/nuget/v3/index.json --tool-path ./

Then run the command:

javadoc2markdown <LUCENE DIRECTORY> <LUCENENET DIRECTORY>

The parameters are exactly the same as they were before.

Then to clean up, run the command:

dotnet tool uninstall javadoc2markdown --tool-path ./

You might want to put that last line in a finally block to ensure the script is re-runnable in the case of an exception.

The GitHub repo is only meant to be temporary until an Apache repo is set up for it. I haven't yet removed the tool from this repo, but I suggest to do so and update the scripts before doing any branching.

The changes I made are enough to decouple the docs build scripts from any specific version of the tool, but the JavaDoc2Markdown tool's build and upload to the feed could be fully automated by referencing Nerdbank.GitVersioning and copying the azure-pipelines.yml and related files from Morfologik.Stemming (minus the test stage, we don't need that). We would also need to remove some of the custom settings from the version.json file, which is what configures Nerdbank.GitVersioning and remove the explicit <Version> property from the project file.

@Shazwazza
Copy link
Contributor Author

Well done sir! 🎉

Amazing start, looks like I have some work to start on :)

@Shazwazza
Copy link
Contributor Author

@NightOwl888 This is complete now, there is a new branch docs/markdown-converted/4.8.1 which will be used solely for re-executing the dotnet global tool for the docs conversion.

What needs to happen now is:

  • Rebuild and publish the website and docs to reflect recent changes and fixes
  • Update the docs to reflect this approach

There's a bunch of other outstanding docs tasks too that I'll look into. I'll close this one.

@NightOwl888
Copy link
Contributor

Per @Shazwazza, when I asked him via chat whether it is now safe to update the documents with .NET-specific content:

Yes it is safe to modify the .md files now with .NET content. In some cases however, it is more safe to use docfx 'override' files. For example, we are already doing this with the /websites/apidocs/apispec/core/Lucene_Net_Codecs.md file. This is 'overriding' the default converted namespace file at /src/Lucene.Net/Codecs/package.md . In some cases it may make more sense to use override files as opposed to modifying the converted file. I guess there's no strict rules on whether to use override files or not. I dunno, i guess there's pros/cons to both. By changing the converted files it means we'd know about any changes made upstream to them in the Java Lucene project and we'd have to manually merge those, but by using override files we don't end up with merge issues which might be nicer in cases where we totally overhaul the docs.

It sounds like override files are probably better for maintenance. But on the other hand, it sounds like it might be confusing for contributors if they think that modifying the original will modify the API site.

We should probably do an experiment with each to try to get a feel for which choice is the right one for given scenarios.

@Shazwazza
Copy link
Contributor Author

I think what is easiest is in most cases is to only use override files when namespace documentation doesn't already exist. Namespace documentation will exist for most namespaces in the lucene.net project based on how our markdown converter tool works.

This will be consistent with how the "Improve this doc" button works on the documentation. When clicking on that button and namespace documentation exists, it will edit the currently existing file. When namespace documentation doesn't exist already (this is rare), there will be no "Improve this doc" button. In this case we should use override files. As an example of such a namespace, see https://lucenenet.apache.org/docs/4.8.0-beta00009/api/analysis-phonetic/Lucene.Net.Analysis.Phonetic.Language.html

There might be some cases where we use override files even if the namespace doc is already there. I think this can occur only if the updated docs are completely different from the originals.

@NightOwl888
Copy link
Contributor

As an example of such a namespace, see https://lucenenet.apache.org/docs/4.8.0-beta00009/api/analysis-phonetic/Lucene.Net.Analysis.Phonetic.Language.html

That was actually not part of Lucene, but part of the commons-codec package from Apache, which was imported to save us from maintaining an external library and porting the parts of it we don't need. There are a couple others, the SAX and TagSoup modules that were imported into Lucene.Net.Benchmarks to parse HTML. AFAIK, we could use HTML Agility Pack instead and dump these classes if someone were willing to analyze this at a higher level to map over the functionality.

The only actual Lucene case I can think of where we are missing the document is the migration guide (#399), presumably because it was named Migrate.txt instead of following the other "overview" and "package" naming conventions.

Lucene.Net.Codecs was only different because we were trying to release that document before we had the conversion process sorted out. Now that it is, would it make sense to integrate these changes back into the original doc?

Are there any other cases you can specifically recall where the documentation doesn't exist in Lucene? If there are no other exceptions and your suggestion is not to use override files on the rest, I am on board with that - it would be fewer files to maintain and less confusing to contributors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants