Adding non-breakable spaces lua filter #119

Delanii · 2020-10-13T08:51:09Z

This adresses #114 .

Its first pull request I have ever made, so hopefully I did not messed anything up. I know it is a whole bunch of files, which should be avoided, but I think that in this case it should be put-and-packaged as one ...

Per #114 it still has not implemented the change in tables and presence-check, since I am having trouble understanding that syntax (idiom) and using it. Hopefully it doesnt mind too much.

Regards, Tomas

Delanii · 2020-10-13T11:59:49Z

OK, it seems that somehow I did not upload the makefile?

I have added makefile to my fork, which looks like added the makefile also here ... Well, I have to learn about git and GitHub :D

alerque · 2020-10-13T12:34:46Z

On GitHub a Pull Request is a request to merge one branch into another, in this case your fork's master branch into this repository's. You can update PR's by adding, removing, or rebasing commits in a branch. Typically (and to get more GitHub tooling) you'd open a PR from a branch on your fork other than master. In that case maintainers of the target project can also make changes to the PR before merging. As it is only you can, but anything you do in your master branch will continue to change this PR until it is merged or closed.

Delanii · 2020-10-13T12:51:31Z

Oh, I wasnt sure if I should create new branch in my fork or not. Next time I do better.
Thank you for clarification.

Do I understand you correctly, that now nobody (except me) can make edit to this PR? I thought, that the "Allow edits by maintainers" checker is to control this. I have it checked, which I hoped is for making it possible for project maintainers to correct any errors in this PR (well, if they prefer that over telling me what should I fix).

alerque · 2020-10-13T12:54:26Z

I thought, that the "Allow edits by maintainers" checker is to control this. I have it checked

Really? Last I checked (less than a week ago) that checkbox was disabled for all PR's from master branches. If you see the box and have it checked then GitHub has updated something recently, and that's a nice fix that I'm glad to hear about.

Delanii · 2020-10-13T13:14:56Z

I have used this manual:

Pull request from a fork

Step no. 7 is the one which we are talking about, I hope. Except I did the exact opposite: Left it checked to allow maintainers to do with this PR anything they deem neccessary ...

Btw. It seems that bibliography test are misbehaving; maybe because of newest update?

stroobandt · 2020-10-13T22:14:30Z

Interesting. Today, I was looking at some legacy production code with the exact same purpose; thinking by myself that I better replace it by a Lua filter. The current code is a monster of a nested sed that lives in my makefile.

The sed command creates secondary sed substitution commands from a list of so called short begin words.
These secondary sed commands will replace the ordinary space after a short begin word at the beginning of a sentence with a non-breaking space.

The list of short begin words is language dependent. Here are the English and Dutch lists.

Finally, we do not want such substitutions to occur in <PRE>-formatted code blocks.
That is where the /```/,/```/! is for.

Delanii · 2020-10-14T07:51:23Z

@stroobandt Thank you for posting that!

Sadly, I am not enough versed in makefiles or sed to make much of it, but from the lists I see that you went in simillar direction. I am actually not sure how to implement different behavior according to language (well, I would have to read somehow the metadata variable lang).

If you are concerned about code-block (I was too) I have tested that (and it is even in test file of the PR) - codeblocks and inline code is recognized well and no space substitution happens there.

tarleb · 2020-10-18T12:05:11Z

Thanks for the updates! I'll take a closer look soon, I just need to fix the tests first. This may take a while, as I'm going to spend less time in front of the screen for the next couple of days.

The filter name "nonbreakingspace" is a bit generic and somewhat cryptic. Can we find something more specific? How about using the same name as the TeX package, "vlna"? Would that be too confusing? Maybe "west-slavic-nbsp"? I'm open for ideas.

Delanii · 2020-10-19T11:30:02Z

Well, "pandoc-vlna" in itself could be good, but I would think that it would introduce different "crypticity;" that is, that LaTeX packages will be known only to LaTeX users; and most probably to the subset that doesnt already used theyre own homemade solutions (I would guess that those with homemade solutions will be actually plenty).

I thought that naming the filter by what it does as much as possible will prevent its name from being misleading. Very often I am taking inspiration from lua filter pagebreak - which does exactly what it is named after: inserts pagebreak.

But in the spirit of your previous comment, also "insert-nbsp" (or no-abbreviated form "insert-nonbreakable-space") could work too. I would like to aviod specifying any nationalities or geografic locations to the filter, because it might be offputting to someone (or even any way worse). The code state is mostly caused by me being lua newbie and natively Czech, which leads me to make the default setting suitable for Czech language requirements. I hope that the filter as it is is customizable enough for anybody to make changes according to his native language requirements.

If anybody would know how to make this filter conformant to more than one language, I would be happy to make more improvements. In this regard, my only idea is to have multiple (well, depending on how many languages a lot) tables for each language and then (somehow) read metadata lang variable. But that is highly beyond my linguistic and pandoc-hacking skills ...

@tarleb Let me know which identification is more suitable:

pandoc-vlna
insert-nbsp
insert-nonbreakable-space

At the end of the day, as famous classic says:

"A rose by any other name would smell as sweet"

... as long as its well described in README.md (hopefully it is).

Regards, Tomas

tarleb · 2020-10-28T09:06:00Z

I'd like to ask a favor. I compiled a new contributing guide over the last two weeks and would like to get some feedback on it. I'd appreciate if you could take a look at the document and tell me about all passages that are unclear or that you feel should be improved for other reasons.

I am not commenting on the code yet, as I'm also trying to evaluate if the information I added to the guide is sufficient to prevent some common issues which often come up during code review. Hope that's ok with you?

As for the name: each options seems fine. Since Czech is the only supported language, we could also be more explicit about the filter's purpose and include czeck in the name. I'll leave the choice to you.

Delanii · 2020-11-01T19:40:23Z

Considering the new guide, everything mentioned seems pretty completely clear and OK to me. I can accomodate for the ammount of characters per line. Editorconfig seems like a great suggestion, although I am not familiar with that utility.
I take another look at the diagram-generator filter and try to accomodate for setting this fiter according to lang metadata value. I can try to make another set of rules for english (and make that actually default).
That could even make it valid to name the filter pandoc-vlna (well, in my opinion).
I let you know when all changes are finished (could be pretty fast).

Updated to have max 80 chars per line.

Renamed to `sampleCZ` to allow testing of czech language setting.

Another sample for english language testing.

Expected result from compilation of `sampleCZ.md`. Just renamed.

Updated to read `lang` variable setting with default value for english, also with fallback to english if `lang` is not specified. Also added english `prefixes`.

New file for testing after latest filter update.

Added explicit `lang` setting.

Added test result for english `lang` setting.

Added test for english `lang` setting, renamed filter file.

Delanii · 2020-11-06T20:59:28Z

@tarleb I have finished changes according to the newest contributing guide. All should be good to go!
Only thing I realized that in case of Spans or Divs it doesnt check, if those have set theyre lang (attribute?) differently. It is something I can work on when there will be more languages supported (for two languages it might not make much sense).
The only thing is that I am having trouble changing the directory name itself from nonbreakablespace to pandocVlna (now I dare to name it that way) ... :)

tarleb

Thank you for the changes, I like where this is heading! I left some inline comments which I hope are helpful. Ready to merge once these are resolved.

tarleb · 2020-11-07T09:47:29Z

nonbreakablespace/pandocVlna.lua

+  local insert = insert_nonbreakable_space(FORMAT)
+
+  for i = 1, #inlines do
+    if inlines[i].t == 'Space' then


How about naming the elements which we are looking at? I find prev or previous much easier to read than inlines[i - 1].

OK, I can accomodate for that, eventhough in this case for me personally was easier to see exactly what is it I am calling at.

I have made required changes, but I am having trouble assigning the replacement string wtih them, meaning if I write:

currentElement = insert

it doesnt work.

But this works:

inlines[i] = insert

works. It makes sense that these make-up variables dont reassing to original inlines list, but how can I accomodate for that?

I have also noticed another issue - writing SoftBreak element instead of Space in the place, where should be &nbsp wont trigger replacement (of course) - I have remedied for that: Now I am testing for Space of SoftBreak elements.

I have uploaded new files.

Please, let me know what you think about it. I personally in this specific case would prefer writing inlines[i]; but modifying the filter as you suggest would allow me to learn more, so I am open to doing that. After resolving this issue, I will work on the next one.

nonbreakablespace/pandocVlna.lua

Also added fixes per first review suggestion (at least partially)

Complete reupload of new files because fo failing tests.

Complete reupload

Fixed review suggestion no. 2

about to reupload new one

Changes per review suggestion 3 and 4.

Also updated file header according to new contribution guide.

Delanii · 2020-11-22T12:35:47Z

Added another update with all improvements suggested in review. Also updated filter file header according to new contribution guide.

I was returning from Meta function, which caused destruction of all metadata. That should be fixed now, Hopefully it will fix the filter tests too.

The filter failed if last element in paragraph would be "Space" (result of other filter or from writing), similarly to case when "Space" would be first element of par block. Fixed now in replacement loop beginning (line 122)

netique · 2021-05-06T10:30:50Z

Many thanks for this filter, @Delanii! I wonder if it is possible to add the % character to the list. In Czech, we write non-breakable space between the number and % sign (when percentages are used as an adjective, there is no space). I've tried to change the filter, but failed miserably. Maybe % needs to be escaped as \%, but that fails too (as well as \\% or \\\% does). Help would be greatly appreciated!

badumont · 2021-05-06T12:41:20Z

Just a side note: in lua regexp, % is the escape char, so you have to type %% so that it escapes itself. I wonder if this filter could be made more general by creating some metadata fields taking a list a characters to be associated with a non-breakable space. Something like "non-breakable-space-before", "non-breakable-space-after", "thin-space-before", "thin-space-after". Of course it would be complementary to the built-in configuration files per language that you mentioned. What are your thoughts about that? Le Thursday 06 May 2021 à 03:31:12AM, Jan Netík a écrit :

…

Many thanks for this filter, @Delanii! I wonder if it is possible to add the % character to the list. In Czech, we write non-breakable space between the number and % sign (when percentages are used as an adjective, there is no space). I've tried to change the filter, but failed miserably. Maybe % needs to be escaped as \%, but that fails too (as well as \\% or \\\% does). Help would be greatly appreciated! — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.*

Delanii · 2021-05-06T18:01:51Z

Many thanks for this filter, @Delanii! I wonder if it is possible to add the % character to the list. In Czech, we write non-breakable space between the number and % sign (when percentages are used as an adjective, there is no space). I've tried to change the filter, but failed miserably. Maybe % needs to be escaped as \%, but that fails too (as well as \\% or \\\% does). Help would be greatly appreciated!

Hello @netique ,

well, this filter doesnt cover it all. I admit I have settled up to dealing with things that were the easiest -- just list few prepositions and deal with split numbers. The percent character could be thought about as a unit, which has actually the opposite requirement -- has to be coupled with preceeding text element (number, as in "10 %"), not with the following one (word, as in "10 apples").

As I wrote in the beginning, I was and still am a lua newbie, so by cutting off units, chapters (ex: "chapter 9") and such, I have made creating this filter a lot easier.

I am not very sure how I would deal with units and such in this filter. In LaTeX this is being dealt with by means of famous siunitx package, so I could take a peek at its code and try to inspire myself, but I think that that would require actually adding extra markup that would be recognized by filter, something like @Unit{5}{%} or @Unit{10}{m/h}, but I am not sure if everybody would wont to write so much markup just to have nonbreakable spaces between number and unit.

Delanii · 2021-05-06T18:10:24Z

Just a side note: in lua regexp, % is the escape char, so you have to type %% so that it escapes itself. I wonder if this filter could be made more general by creating some metadata fields taking a list a characters to be associated with a non-breakable space. Something like "non-breakable-space-before", "non-breakable-space-after", "thin-space-before", "thin-space-after". Of course it would be complementary to the built-in configuration files per language that you mentioned. What are your thoughts about that? Le Thursday 06 May 2021 à 03:31:12AM, Jan Netík a écrit :

Hi @badumont !

Well, that is a great idea, I was actually thinking about something like that in the beginning, but then I thought/hoped that this filter could grow by feedback to incorporate more languages with their respective rules. I wanted the filter to be as simple to use as possible so there is no configuration. But the code is open as well, so you can modify it to your respective needs (by just modifying the table with prepositions) to tweak the behaviour as you need.

I also have no idea how to implement configuration of a lua filter. And how would that be stated (configuration file, default file, stating at command line??)

Do you have an idea how to do that?

badumont · 2021-05-06T19:50:01Z

I am not an expert either, but here is my guess.

If you wish to support multiple languages, I think that the filter itself should
contain only general-purpose functions, not language-related data, exactly what
you do in l. 117-140. So we could imagine an Inlines function that only calls
functions like: replace-space-by-nbsp and replace-space-by-thin (if you expect
from the user that he or she types a normal space where the nbsp should be, which of
course is debatable). Every function would then read a table containing the
characters that need a nbsp.

For the configuration, there could be two non-exclusive ways: the one that I
mentioned, and another based on configuration files per language. The easiest
way from the programmer's point of view is to write configuration files as
Lua tables, so that they can be directly imported in the lua filter with dofile.
It could look like this (simplified for French):

local nbsp_before = {
  ':',
  '»'
}

local nbsp_after = {
  '«'
}

local thin_before = {
  ';'
}

spaces_config = {
  nbsp_before = nbsp_before,
  nbsp_after = nbsp_after,
  thin_before = thin_before
}

Then simply add dofile "config/FR_config.lua" near the beginning of the file. Then you will be able to get the list of the characters that require a thin space before it with spaces_config.thin_before. Of course, these "characters" can be whole strings like the. That would only require to map the supported IETF language tags to the corresponding configuration files before calling dofile. Such configuration file could be easily written by users by modifying an existing one, I think.

Another thing: why do you define the nbsp character per format? Pandoc does the
conversion itself (try pandoc -t latex -f html <<< '&nbsp').

I can help if you want. As a French user who currently selects manually every special space, I am very interested in this filter! However, in that case, it may be easier to do it on a separate repo for I already have a fork of lua-filters with a pending pull request...

hugoroy · 2021-08-31T14:44:34Z

Hi both
I am very interested in this as well. I currently do this with a bash script as filter and use sed. Not very clean :-)

What about specifying functions depending on the language set?

For instance if meta.lang = fr then we may provide a list of functions to replace the non-breakable spaces with the right non-breakable space that should appear in French.

Edit: Okay I just saw how this is implemented.

Delanii added 3 commits October 13, 2020 10:40

Create sample.md

1c951b5

Add files via upload

ed6a50a

Add files via upload

dd2f12e

Delanii added 11 commits November 6, 2020 21:23

Update README.md

36c91bf

Updated to have max 80 chars per line.

Rename sample.md to sampleCZ.md

7df6f7c

Renamed to `sampleCZ` to allow testing of czech language setting.

Add files via upload

64bf968

Another sample for english language testing.

Rename expected.html to expectedCZ.html

e167899

Expected result from compilation of `sampleCZ.md`. Just renamed.

Update nonbreakablespace.lua

5e829a0

Updated to read `lang` variable setting with default value for english, also with fallback to english if `lang` is not specified. Also added english `prefixes`.

Rename nonbreakablespace.lua to pandocVlna.lua

6a62c16

Delete expectedCZ.html

6a8bc4f

Add files via upload

d0bc724

New file for testing after latest filter update.

Update sampleCZ.md

feec602

Added explicit `lang` setting.

Add files via upload

0cca88c

Added test result for english `lang` setting.

Update makefile

ea8e13a

Added test for english `lang` setting, renamed filter file.

tarleb approved these changes Nov 7, 2020

View reviewed changes

Delanii added 3 commits November 8, 2020 21:55

Delete expectedCZ.html

f78a1fa

New test file with fixes error causes by SoftBreak element

4b5d58a

Now testing also for SoftBreak element

d5fce8b

Also added fixes per first review suggestion (at least partially)

Delanii added 18 commits November 8, 2020 22:18

Delete expectedEN.html

80a6900

New test file for english

25a7564

Delete expectedEN.html

a4c6e81

Another try for test file for english

5f13001

Delete README.md

f060207

Delete expectedCZ.html

c02aa69

Delete expectedEN.html

22c7c82

Delete makefile

fa6a2b8

Delete pandocVlna.lua

7008e81

Delete sampleCZ.md

a675d44

Delete sampleEN.md

5ebf029

Reupload

bc67eb1

Complete reupload of new files because fo failing tests.

Add files via upload

7de7a14

Complete reupload

Update pandocVlna.lua

76b00e7

Fixed review suggestion no. 2

Delete pandocVlna.lua

11794ff

about to reupload new one

Add files via upload

7536ba9

Changes per review suggestion 3 and 4.

Rewrite or README.md: Default setting to English

dad0797

Update per suggestions in PR

8f7501f

Also updated file header according to new contribution guide.

Delanii added 2 commits December 2, 2020 17:29

Bugfix

ce063f3

I was returning from Meta function, which caused destruction of all metadata. That should be fixed now, Hopefully it will fix the filter tests too.

Bugfix if last element in par block is "Space"

d407e90

The filter failed if last element in paragraph would be "Space" (result of other filter or from writing), similarly to case when "Space" would be first element of par block. Fixed now in replacement loop beginning (line 122)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding non-breakable spaces lua filter #119

Adding non-breakable spaces lua filter #119

Delanii commented Oct 13, 2020

Delanii commented Oct 13, 2020

alerque commented Oct 13, 2020

Delanii commented Oct 13, 2020

alerque commented Oct 13, 2020

Delanii commented Oct 13, 2020

stroobandt commented Oct 13, 2020

Delanii commented Oct 14, 2020

tarleb commented Oct 18, 2020

Delanii commented Oct 19, 2020

tarleb commented Oct 28, 2020

Delanii commented Nov 1, 2020

Delanii commented Nov 6, 2020

tarleb left a comment

tarleb Nov 7, 2020

Delanii Nov 8, 2020

Delanii commented Nov 22, 2020

netique commented May 6, 2021

badumont commented May 6, 2021 via email

Delanii commented May 6, 2021 •

edited

Loading

Delanii commented May 6, 2021 •

edited

Loading

badumont commented May 6, 2021

hugoroy commented Aug 31, 2021 •

edited

Loading

Adding non-breakable spaces lua filter #119

Are you sure you want to change the base?

Adding non-breakable spaces lua filter #119

Conversation

Delanii commented Oct 13, 2020

Delanii commented Oct 13, 2020

alerque commented Oct 13, 2020

Delanii commented Oct 13, 2020

alerque commented Oct 13, 2020

Delanii commented Oct 13, 2020

stroobandt commented Oct 13, 2020

Delanii commented Oct 14, 2020

tarleb commented Oct 18, 2020

Delanii commented Oct 19, 2020

tarleb commented Oct 28, 2020

Delanii commented Nov 1, 2020

Delanii commented Nov 6, 2020

tarleb left a comment

Choose a reason for hiding this comment

tarleb Nov 7, 2020

Choose a reason for hiding this comment

Delanii Nov 8, 2020

Choose a reason for hiding this comment

Delanii commented Nov 22, 2020

netique commented May 6, 2021

badumont commented May 6, 2021 via email

Delanii commented May 6, 2021 • edited Loading

Delanii commented May 6, 2021 • edited Loading

badumont commented May 6, 2021

hugoroy commented Aug 31, 2021 • edited Loading

Delanii commented May 6, 2021 •

edited

Loading

Delanii commented May 6, 2021 •

edited

Loading

hugoroy commented Aug 31, 2021 •

edited

Loading