Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: use the sorting function StrCmpLogicalW provided by the win32 API #798

Merged
merged 1 commit into from
Dec 8, 2023

Conversation

dyphire
Copy link
Contributor

@dyphire dyphire commented Nov 18, 2023

Thank @stax76 for providing the code. See-Also mpvnet-player/mpv.net#575 (comment), jonniek/mpv-playlistmanager#128

The current sorting function sort_filenames() is very different from the sorting in Windows Explorer.
Microsoft does not open source the sorting algorithm, but does provide the API, see StrCmpLogicalW.

Fortunately, LuaJIT can call Windows API, and most popular mpv windows build are builds with LuaJIT, such as shinchiro/mpv-winbuild-cmake.

So if OS is Windows and mpv builds with luaJIT, we can use the win32 API to ensure that the sorting in uosc is exactly the same as Windows Explorer, otherwise, use the default sorting function sort_filenames().

Fixes #599

@hooke007
Copy link
Contributor

All indents doesn't use Tab.

@dyphire
Copy link
Contributor Author

dyphire commented Nov 18, 2023

Fix formatting.

src/uosc/lib/utils.lua Outdated Show resolved Hide resolved
@dyphire dyphire changed the title feat: use LuaJIT to implement native sorting algorithms on Windows feat: use the sorting function StrCmpLogicalW provided by the win32 API Nov 18, 2023
@tomasklaen
Copy link
Owner

I assume this will use window's sorting algorithm to sort stuff?

What is the performance impact of this compared to our current solution? Because what we have now works pretty well, and I'm not willing to take much of a performance hit for no perceivable difference.


Regarding formatting, we already have a .editorconfig set up. If you install sumneko.lua you can just run its formatter on the file.

@hooke007
Copy link
Contributor

I think it firstly considers to fix #599 instead of performance.

@stax76
Copy link

stax76 commented Nov 19, 2023

Maybe make it optional, users could then choose which sort method to use.

@hooke007
Copy link
Contributor

Because what we have now works pretty well.

In addition to the issues I mentioned above, the current method cannot handle other languages's sorting at all. So I think it's far away form "pretty well".

@dyphire
Copy link
Contributor Author

dyphire commented Nov 19, 2023

Maybe make it optional, users could then choose which sort method to use.

This is a strict improvement that should always be unconditionally enabled for users using the mpv builds with luajit on Windows. I don't think an option should be added for switches because Lua's sorting algorithm has many issues.
As for the performance concerns, the Win32 API performed well in my testing.

@dyphire
Copy link
Contributor Author

dyphire commented Nov 19, 2023

As @hooke007 mentioned above, the current Lua sorting algorithm cannot properly handle the sorting of non Latin characters, which leads to a bad experience.

It is quite difficult to optimize sorting algorithms in Lua to handle non Latin characters correctly. This is different from romanization of search characters, as we only need to romanize the user's local language for searching. But for sorting, we actually need to romanize all non Latin characters otherwise we won't be able to sort correctly.
This requires much more work and performance waste than using the Win32 API to implement native sorting algorithms, and I cannot see the advantages of doing so.

In addition, even the sorting of Latin characters still has problems with special characters. Please refer to jonniek/mpv-playlistmanager#128. The same sorting algorithm is used in the playlistmanager.lua script, so the sorting problem mentioned there also exists in uosc.

@christoph-heinrich
Copy link
Contributor

the current Lua sorting algorithm cannot properly handle the sorting of non Latin characters

Maybe we'd have to convert the strings to utf-32 (or an array of unicode codepoints?

@dyphire
Copy link
Contributor Author

dyphire commented Nov 19, 2023

the current Lua sorting algorithm cannot properly handle the sorting of non Latin characters

Maybe we'd have to convert the strings to utf-32 (or an array of unicode codepoints?

Yes, this may be a good idea, we can use Unicode table order for sorting, which will improve the sorting of non Latin characters.
This is much simpler than the romanization process I previously thought, but even so, the native sorting algorithm based on the win32 API is still much more powerful.

@christoph-heinrich
Copy link
Contributor

I've come to the realization that utf-32 won't work. It'll sort the same as utf-8.
How come non Latin characters aren't sorted correctly? Is the order of their code points not how they should be sorted?

@dyphire
Copy link
Contributor Author

dyphire commented Nov 19, 2023

Yes, this may be a good idea, we can use Unicode table order for sorting, which will improve the sorting of non Latin characters.
This is much simpler than the romanization process I previously thought, but even so, the native sorting algorithm based on the win32 API is still much more powerful.

For this reason, I searched for the sorting method of the unicode table. The logogram text included in unicode is sorted by glyph order rather than pronunciation. This means that sorting according to the unicode table is also incorrect, and only after romanization can the correct sorting be obtained.

I've come to the realization that utf-32 won't work. It'll sort the same as utf-8. How come non Latin characters aren't sorted correctly? Is the order of their code points not how they should be sorted?

As I mentioned above, the sorting of logogram text by Windows Explorer and Macos Finder is based on romanized pronunciation rather than code points.

@christoph-heinrich
Copy link
Contributor

The ideographic text included in unicode is sorted by glyph order rather than pronunciation. This means that sorting according to the unicode table is also incorrect, and only after romanization can the correct sorting be obtained.

Good to know. If I understand that correctly that could simply be done by shoving the result from gsub into the char conversion function without having to do anything else.

@dyphire
Copy link
Contributor Author

dyphire commented Nov 19, 2023

The ideographic text included in unicode is sorted by glyph order rather than pronunciation. This means that sorting according to the unicode table is also incorrect, and only after romanization can the correct sorting be obtained.

Good to know. If I understand that correctly that could simply be done by shoving the result from gsub into the char conversion function without having to do anything else.

Not only that, we also need to place latin characters above romanization conversion characters in sorting. And as I mentioned earlier, romanization sort in lua may not be a good idea.

It is quite difficult to optimize sorting algorithms in Lua to handle non Latin characters correctly. This is different from romanization of search characters, as we only need to romanize the user's local language for searching. But for sorting, we actually need to romanize all non Latin characters otherwise we won't be able to sort correctly.
This requires much more work and performance waste than using the Win32 API to implement native sorting algorithms, and I cannot see the advantages of doing so.

Additionally, there is a difficulty in distinguishing and sorting romanization characters in different languages, which will have to be handled using Unicode ranges.

@dyphire
Copy link
Contributor Author

dyphire commented Nov 19, 2023

Benchmark performance testing was conducted using the following test code:

filenames = {"11", "19", "速20", "度21", "测0", "试1", "9", "10"}
local begin = os.clock()
for i = 1, 10000 do
    sort_filenames_windows(filenames)  --or sort_filenames_lua(filenames)
end
print(string.format('total time:%.2fms\n', ((os.clock() - begin) * 1000)))

test result:

  • sort_filenames_lua(filenames): total time:37.00ms
  • sort_filenames_windows(filenames): total time:61.00ms

The performance of the win32 API meets expectations, and its impact on performance is relatively small compared to the significant improvement brought by native sorting algorithms.

By the way, when introducing a simple romanization process to Lua's sorting algorithm and then conducting the above benchmark testing, the results are total time:140.00ms.

@dyphire
Copy link
Contributor Author

dyphire commented Nov 20, 2023

Since the win32 API solution is only applicable to Windows users, I am thinking about how to optimize sorting algorithms for Linux and MacOS users. After the above discussion, it is difficult to implement romanization sorting in lua and cannot guarantee performance. We need a new alternative solution, as uosc already has a ziggy tool, using Go's sorting algorithm to optimize the sorting experience on Unix systems may be a good idea.

Fortunately, Go has a powerful library golang.org/x/text/collate which could help us.

I tried using the following code to test the sorting algorithm with collate of Go.

package main

import (
	"fmt"

	"golang.org/x/text/collate"
	"golang.org/x/text/language"
)

func main() {
	letters := []string{"ä", "å", "ö", "o", "a"}

	ec := collate.New(language.English)
	ec.SortStrings(letters)
	fmt.Printf("English Sorting: %v\n", letters)

	sc := collate.New(language.Swedish)
	sc.SortStrings(letters)
	fmt.Printf("Swedish Sorting: %v\n", letters)

	numbers := []string{"ä", "å", "ö", "o", "a", "阿", "不", "0", "11", "01", "2", "3", "23", "050", "50", "-50", "The Woodshed-Episode 5", "The Woodshed-Episode 5_1", "sort&test.mp4", "sort.test19.mp4", "sort_test.mp4", "sort+test.mp4", "sort1test.mp4", "sort07test .mp4", "sorttest(1).mp4", "sort-test.mp4"}

	ec.SortStrings(numbers)
	fmt.Printf("Alphabetic Sorting: %v\n", numbers)

	nc := collate.New(language.Chinese, collate.IgnoreCase, collate.Numeric)
	nc.SortStrings(numbers)
	fmt.Printf("Numeric Sorting: %v\n", numbers)
}

Here are the test results:

English Sorting: [a å ä o ö]
Swedish Sorting: [a o å ä ö]
Alphabetic Sorting: [-50 0 01 050 11 2 23 3 50 a å ä o ö sort_test.mp4 sort-test.mp4 sort.test19.mp4 sort&test.mp4 sort+test.mp4 sort07test .mp4 sort1test.mp4 sorttest(1).mp4 The Woodshed-Episode 5 The Woodshed-Episode 5_1 不 阿]
Numeric Sorting: [-50 0 01 2 3 11 23 050 50 a å ä o ö sort_test.mp4 sort-test.mp4 sort.test19.mp4 sort&test.mp4 sort+test.mp4 sort1test.mp4 sort07test .mp4 sorttest(1).mp4 The Woodshed-Episode 5 The Woodshed-Episode 5_1 阿 不]

We can see that the sorting algorithm of Go can fix #599. Although it cannot handle special characters like the native sorting algorithm provided by the win32 API, and therefore cannot fix the problem mentioned jonniek/mpv-playlistmanager#128, but it can correctly sort non latin characters when the correct language options are set.

@tomasklaen is more familiar with the Go, and it should be up to him to decide how to use and optimize the sorting algorithm of Go correctly. It is not within the range of this pr.

@tomasklaen
Copy link
Owner

First of all, how big of an issue is the lua sorting we have right now? It sorts some languages slightly differently than they'd prefer, but it's still in the realm of good enough and requires minimal effort to get used to it.

Is it completely nonsensical for asian languages?


Doing it in ziggy might be an option, but that depends on how much delay would the data passing and json encode/decode add, and I don't think I'll like the numbers. I already don't like the double in sorting time this solution adds. And it feels like it should be in ziggy, since making this only for windows feels wrong.

I'm trying to balance the added value of features to their performance hit and development/maintenance cost going forward, and I'm on the fence here.

@dyphire
Copy link
Contributor Author

dyphire commented Dec 7, 2023

First of all, how big of an issue is the lua sorting we have right now? It sorts some languages slightly differently than they'd prefer, but it's still in the realm of good enough and requires minimal effort to get used to it.

Is it completely nonsensical for asian languages?

As mentioned above (#798 (comment)), in the current Lua sorting algorithm, all logogram texts have absurd sorting results. They are sorted according to the code point order in the Unicode table rather than the correct pronunciation order. In addition, for accented characters that exist in many languages, it cannot be correctly sorted according to different languages. The current lua sorting algorithm is far from being good enough.

Doing it in ziggy might be an option, but that depends on how much delay would the data passing and json encode/decode add, and I don't think I'll like the numbers. I already don't like the double in sorting time this solution adds. And it feels like it should be in ziggy, since making this only for windows feels wrong.

I'm trying to balance the added value of features to their performance hit and development/maintenance cost going forward, and I'm on the fence here.

In my opinion, implementing better sorting algorithms in ziggy is meaningful for Linux and Macos platforms, otherwise non Latin users on these platforms would only have a poor sorting experience.

We still need this win platform specific solution on Windows. As shown in the results of my testing of the sorting algorithm in Go above (#798 (comment)), although the local language based sorting algorithm in Go is already excellent enough, its handling of special characters is still completely different from the native sorting algorithm in Windows. When we have a way to implement native sorting algorithms on Windows to provide users with a consistent experience, there is no reason not to use it.

Although the new win32 api sorting algorithm incurs additional time costs, as demonstrated in performance tests, the differences and delays between the two are still not within human perception even in sorting tens of thousands of files, as their speed is fast enough. And this kind of usage scenario is almost never encountered in mpv.

@tomasklaen tomasklaen merged commit d01eb25 into tomasklaen:main Dec 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

any way to change playlist sorting (by name -> ascending) to how windows does?
5 participants