Idea on how to handle manually editing the docs files while keeping the automated conversion #396

Shazwazza · 2020-11-25T04:50:36Z

Decided to create a central discussion about this here instead of all of the ideas getting lost in various threads :)

Here's some found previous notes

Docs - Build/Deploy Automation #282 (comment) (note: a lot has changed from this discussion to now!)
Make Highlighter docs readable #393
Documentation: Tidy up of Lucene.Net.Join #231
Updates docs build #229 (comment) (scroll down to the "code samples" heading)

The issue is: We have a conversion program that converts the Lucene (java) docs to a doc format that we can use for Lucene.Net. Each time this is run it will overwrite our docs files. This means that any manual updates made to these docs files such as fixing spelling mistakes and more importantly code snippets would just be overwritten.

I have a simple proposal

In various old threads we had some complex proposals but none of these are 'perfect' and won't cater very well for things like spelling mistakes. Instead of inventing some system, we can simply use Git:

We have a branch for each lucene (java) docs conversion that we plan to do
- currently this is just one: 4.8.1. This branch can be called something like docs/converted/4.8.1 and created from master
These converted branches are the only branches we ever execute the JavaDocToMarkdownConverter conversion program.
- So currently, we would run JavaDocToMarkdownConverter once we create this branch. There will probably be no changes now since we've already executed this in master.
Merge the converted branch to the master branch
Make any required changes to the master branch docs files
If we need to re-run JavaDocToMarkdownConverter because we've made some fixes to it to fix some conversions, formatting, etc... we re-run this on the converted branch, then merge changes to the master branch. This will trigger a bunch of merge conflicts which can be resolved by Git merge.

We won't be running this conversion program that often so the amount of merging would be minimal and we'd only have merge conflicts on files that we've changed that have large conflicts with changes we've made in the converter ... which would be quite rare so I don't foresee a lot of work with a Git merge.

Some basic rules to this:

We will never merge into a converted branch from master since this will mean merge conflicts are not triggered and everythig would just be overwritten again
When we start working on a newer version of Lucene.Net, a new converted branch will be created from the current converted branch and the conversion is run there for the new lucene version.

@NightOwl888 Any thoughts on that?

The text was updated successfully, but these errors were encountered:

NightOwl888 · 2020-11-25T14:04:25Z

I agree this sounds simpler to manage than other ideas that have been kicked around, provided the procedure is documented in the contributing section of the web site so we never forget how to manage it.

Implementing this procedure would close #283 and allow us to begin working on #284 immediately (and allow us to accept PRs to update API documentation).

However, the tag docs/converted/4.8.1 sounds like a finished product rather than a half-way point which is what it actually is. Perhaps one of the following would be more descriptive?

docs/converted2markdown/4.8.1
docs/markdown-converted/4.8.1
docs/raw-converted-markdown/4.8.1

Changes to JavaDocToMarkdownConverter

Since JavaDocToMarkdownConverter is only executed against the converted branch, any changes to it must also be committed to the converted branch. And being that merges only happen in one direction, changing it in the master branch would need to be strictly prohibited.

Perhaps the simplest way to enforce this would be to remove git tracking for JavaDocToMarkdownConverter from the master branch and add it to .gitignore on the master branch, but keep both enabled on the converted branches. If we follow the same conflict resolution pattern as the nbgv prepare-release command uses for the version.json file with our .gitignore file, we can prevent these changes from creeping their way back onto master when merging into it.

If we need to re-run JavaDocToMarkdownConverter because we've made some fixes to it to fix some conversions, formatting, etc... we re-run this on the converted branch, then merge changes to the master branch. This will trigger a bunch of merge conflicts which can be resolved by Git merge.

By that I hope you mean git mergetool. I use it with Git for Windows in combination with Beyond Compare, and it makes dealing with conflicts much quicker and easier.

The "LOCAL", "BASE", and "REMOTE" branches visible side by side as well as a way to quickly scroll to each merge conflict. The scrollbars are linked so you see the same section in each pane when you scroll one of them. There are color coded arrows to quickly inject different blocks/lines of text or for more complicated edits manual editing is also allowed. Beyond Compare Pro also has features that allow ignoring of "unimportant changes" such as whitespace or code comments, which make some changes easier to deal with.

In this example, I am merging master into the current branch. The head of master is displayed on the right, the current commit of the current branch is displayed on the left, and "BASE" is what the current branch looked like prior to the current commit. The bottom section is a "best guess" of how to merge it, and around 40% of the time it guesses right.

At the end of the operation of a file, it creates a backup of the merge conflict in a file with a .orig extension added to the original, so if you get it wrong on the first try there is a way to go back and fix it manually.

It takes a little getting used to, but it is much quicker to pop in a block of text by clicking a colored arrow or to resolve a conflict by clicking on an exclamation point than going through the blocks in the raw source file that git creates.

Shazwazza · 2020-11-26T02:42:02Z

Implementing this procedure would close #283 and allow us to begin working on #284 immediately (and allow us to accept PRs to update API documentation).

yep! without a lot of work we can very quickly start getting all of the docs and code samples fixed up

However, the tag docs/converted/4.8.1 sounds like a finished product rather than a half-way point which is what it actually is. Perhaps one of the following would be more descriptive?

yep sure, I'm happy with docs/markdown-converted/4.8.1

By that I hope you mean git mergetool

yes exactly. I just use GitExtensions so this happens all automatically for me. I use KDiff instead of beyond compare since I like it better and it does all the 3 way merging, auto guessing, whitespace options, etc... (although it looks like it was made in the 80s and never changed)

Perhaps the simplest way to enforce this would be to remove git tracking for JavaDocToMarkdownConverter from the master branch and add it to .gitignore on the master branch

Yep all of what you said makes sense. The converter code could even be removed entirely from the master branch if we wanted and only live in the conversion branches. I have no experience with what nbgv prepare-release so I don't really have a comment about that one (but seems like an interesting tool!)

I'm happy to get started on this if we agree with the branch naming?

NightOwl888 · 2020-11-26T11:52:41Z

Yep all of what you said makes sense. The converter code could even be removed entirely from the master branch if we wanted and only live in the conversion branches. I have no experience with what nbgv prepare-release so I don't really have a comment about that one (but seems like an interesting tool!)

I poked around under the hood of nbgv and then started experimenting to see what options Git may have to solve the issue with keeping JavaDocToMarkdownConverter updated on each of the converted branches. Then I realized something: this is not a VCS problem, it is a versioning problem.

This issue can be addressed simply by:

Moving JavaDocToMarkdownConverter to a repo of its own so there is only 1 copy to deal with (https://selfserve.apache.org/)
Converting it into a dotnet tool so it can be packaged and versioned with NuGet
Making all of the scripts that utilize it always uninstall and reinstall the latest version of the tool from the feed (see dotnet tool update) or use a specific version in cases where version drift is a problem

We don't have to use the official NuGet.org feed for this, it can be MyGet, Azure Devops, GitHub, or whatever as long as it is Internet-accessible.

I'm happy to get started on this if we agree with the branch naming?

Sure, that's fine.

NightOwl888 · 2020-11-26T17:30:58Z

I have converted the JavaToMarkdownConverter to a dotnet tool (currently at https://github.com/NightOwl888/lucenenet-javadoc2markdown) simply named javadoc2markdown and set up an official NuGet feed to download it from at https://dev.azure.com/lucene-net/Lucene.NET/_packaging?_a=feed&feed=lucene-net-tools

To install the latest version of the tool in the scripts to a local path, just run the command:

dotnet tool install javadoc2markdown --add-source https://pkgs.dev.azure.com/lucene-net/_packaging/lucene-net-tools/nuget/v3/index.json --tool-path ./

Then run the command:

javadoc2markdown <LUCENE DIRECTORY> <LUCENENET DIRECTORY>

The parameters are exactly the same as they were before.

Then to clean up, run the command:

dotnet tool uninstall javadoc2markdown --tool-path ./

You might want to put that last line in a finally block to ensure the script is re-runnable in the case of an exception.

The GitHub repo is only meant to be temporary until an Apache repo is set up for it. I haven't yet removed the tool from this repo, but I suggest to do so and update the scripts before doing any branching.

The changes I made are enough to decouple the docs build scripts from any specific version of the tool, but the JavaDoc2Markdown tool's build and upload to the feed could be fully automated by referencing Nerdbank.GitVersioning and copying the azure-pipelines.yml and related files from Morfologik.Stemming (minus the test stage, we don't need that). We would also need to remove some of the custom settings from the version.json file, which is what configures Nerdbank.GitVersioning and remove the explicit <Version> property from the project file.

Shazwazza · 2020-12-02T05:37:07Z

Well done sir! 🎉

Amazing start, looks like I have some work to start on :)

Shazwazza · 2021-02-21T23:31:39Z

@NightOwl888 This is complete now, there is a new branch docs/markdown-converted/4.8.1 which will be used solely for re-executing the dotnet global tool for the docs conversion.

What needs to happen now is:

Rebuild and publish the website and docs to reflect recent changes and fixes
Update the docs to reflect this approach

There's a bunch of other outstanding docs tasks too that I'll look into. I'll close this one.

NightOwl888 · 2021-02-27T00:19:39Z

Per @Shazwazza, when I asked him via chat whether it is now safe to update the documents with .NET-specific content:

Yes it is safe to modify the .md files now with .NET content. In some cases however, it is more safe to use docfx 'override' files. For example, we are already doing this with the /websites/apidocs/apispec/core/Lucene_Net_Codecs.md file. This is 'overriding' the default converted namespace file at /src/Lucene.Net/Codecs/package.md . In some cases it may make more sense to use override files as opposed to modifying the converted file. I guess there's no strict rules on whether to use override files or not. I dunno, i guess there's pros/cons to both. By changing the converted files it means we'd know about any changes made upstream to them in the Java Lucene project and we'd have to manually merge those, but by using override files we don't end up with merge issues which might be nicer in cases where we totally overhaul the docs.

It sounds like override files are probably better for maintenance. But on the other hand, it sounds like it might be confusing for contributors if they think that modifying the original will modify the API site.

We should probably do an experiment with each to try to get a feel for which choice is the right one for given scenarios.

Shazwazza · 2021-03-02T22:29:43Z

I think what is easiest is in most cases is to only use override files when namespace documentation doesn't already exist. Namespace documentation will exist for most namespaces in the lucene.net project based on how our markdown converter tool works.

This will be consistent with how the "Improve this doc" button works on the documentation. When clicking on that button and namespace documentation exists, it will edit the currently existing file. When namespace documentation doesn't exist already (this is rare), there will be no "Improve this doc" button. In this case we should use override files. As an example of such a namespace, see https://lucenenet.apache.org/docs/4.8.0-beta00009/api/analysis-phonetic/Lucene.Net.Analysis.Phonetic.Language.html

There might be some cases where we use override files even if the namespace doc is already there. I think this can occur only if the updated docs are completely different from the originals.

NightOwl888 · 2021-03-03T06:45:01Z

As an example of such a namespace, see https://lucenenet.apache.org/docs/4.8.0-beta00009/api/analysis-phonetic/Lucene.Net.Analysis.Phonetic.Language.html

That was actually not part of Lucene, but part of the commons-codec package from Apache, which was imported to save us from maintaining an external library and porting the parts of it we don't need. There are a couple others, the SAX and TagSoup modules that were imported into Lucene.Net.Benchmarks to parse HTML. AFAIK, we could use HTML Agility Pack instead and dump these classes if someone were willing to analyze this at a higher level to map over the functionality.

The only actual Lucene case I can think of where we are missing the document is the migration guide (#399), presumably because it was named Migrate.txt instead of following the other "overview" and "package" naming conventions.

Lucene.Net.Codecs was only different because we were trying to release that document before we had the conversion process sorted out. Now that it is, would it make sense to integrate these changes back into the original doc?

Are there any other cases you can specifically recall where the documentation doesn't exist in Lucene? If there are no other exceptions and your suggestion is not to use override files on the rest, I am on board with that - it would be fewer files to maintain and less confusing to contributors.

NightOwl888 added docs is:idea labels Nov 25, 2020

NightOwl888 mentioned this issue Dec 8, 2020

Migrating from lucene.net 3.0 to 4.8 #398

Closed

Shazwazza closed this as completed Feb 21, 2021

This was referenced Mar 23, 2021

Docs - store .net code snippets beside the converted markdown #283

Closed

Docs - convert java snippets to c# #284

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Idea on how to handle manually editing the docs files while keeping the automated conversion #396

Idea on how to handle manually editing the docs files while keeping the automated conversion #396

Shazwazza commented Nov 25, 2020

NightOwl888 commented Nov 25, 2020

Shazwazza commented Nov 26, 2020

NightOwl888 commented Nov 26, 2020

NightOwl888 commented Nov 26, 2020

Shazwazza commented Dec 2, 2020

Shazwazza commented Feb 21, 2021

NightOwl888 commented Feb 27, 2021

Shazwazza commented Mar 2, 2021

NightOwl888 commented Mar 3, 2021

Idea on how to handle manually editing the docs files while keeping the automated conversion #396

Idea on how to handle manually editing the docs files while keeping the automated conversion #396

Comments

Shazwazza commented Nov 25, 2020

I have a simple proposal

NightOwl888 commented Nov 25, 2020

Changes to JavaDocToMarkdownConverter

Shazwazza commented Nov 26, 2020

NightOwl888 commented Nov 26, 2020

NightOwl888 commented Nov 26, 2020

Shazwazza commented Dec 2, 2020

Shazwazza commented Feb 21, 2021

NightOwl888 commented Feb 27, 2021

Shazwazza commented Mar 2, 2021

NightOwl888 commented Mar 3, 2021