Use CMake to build uchardet and update upstream submodule #5

Bobo1239 · 2016-10-26T12:51:18Z

Upstream change: see README

I also had to change the examples a bit, because the algorithm changed and now detects other charsets. ASCII is now a valid result of uchardet, so we can directly return an EncodingDetectorResult<String>.

This is a breaking change.

uchardet now returns "ASCII" when the input is ASCII and returns "" only when it fails to detect a charset.

Bobo1239 · 2016-10-26T12:53:00Z

I'm going to test this on Windows (GNU and MSVC) in a couple of hours.

emk · 2016-10-26T15:40:26Z

As for Windows, I'm happy to set up an appveyor.yml file so that it stays
working.

On Oct 26, 2016 8:54 AM, "Eric Kidd" eric@randomhacks.net wrote:

This is a fantastic idea! I'll take a look and merge it shortly. If not,
ping me.

If this changes the charsets detected for plain ASCII inputs, that will be
a breaking change for semver. No worries; I was thinking about a version
bump anyway.

emk · 2016-10-26T15:45:56Z

This is a fantastic idea! I'll take a look and merge it shortly. If not, ping me. If this changes the charsets detected for plain ASCII inputs, that will be a breaking change for semver. No worries; I was thinking about a version bump anyway.

Bobo1239 · 2016-10-26T21:31:22Z

(At least on my machine...)

Setting up Windows CI is a good idea. I only tested an x64 build with Visual Studio 15 (preview version; had to make cmake-rs recognize it first...).

Now on to windows-gnu...

Bobo1239 · 2016-10-26T22:05:24Z

Got it (windows-gnu) working after I rediscovered the solution I found two months ago. Depends on rust-lang/cmake-rs#17.

emk

This is a really great patch, and I'm looking forward to merging it as soon as I can test it against substudy. Happily, I was already planning to update substudy (and overhaul the build system for rust-uchardet), and you've just saved me a ton of work! 👌🏻

I ask several questions below and suggest a couple of minor changes, but this is a great PR and you've really made my day.

emk · 2016-10-27T00:35:44Z

.travis.yml

@@ -3,7 +3,7 @@ sudo: required
 rust:
 - nightly
 - beta
- 1.0.0
+- stable


Do we want to always require the latest stable Rust, or do we want to support back to some specific version? I could be convinced to go either way.

I don't really have an opinion on that (I'm normally running nightly). Theoretically (I believe) 1.2 should suffice, as the incompatibility was caused by the use of debug builders, which have been stable since 1.2.

emk · 2016-10-27T00:40:08Z

src/lib.rs

 //!
 //! Note that the underlying implemention is written in C and C++, and I'm
 //! not aware of any security audits which have been performed against it.
 //!
 //! ```
 //! use uchardet::detect_encoding_name;
 //!
-//! assert_eq!(Some("windows-1252".to_string()),
-//!            detect_encoding_name(&[0x66u8, 0x72, 0x61, 0x6e, 0xe7,
-//!                                 0x61, 0x69, 0x73]).unwrap());


Does the detector still detect "windows-1252" and output that string? I notice that this PR removes several tests related to this encoding.

For my use case, I encounter a lot of input data in "windows-1252" (including older subtitle data for European films), and I want to make sure the detector can still detect that format correctly.

It outputs "WINDOWS-1252" when, for example, 0x80 is appended to the array. I only changed the examples so that the encoded string isn't totally random, but I can also change them back to windows-1252 if you like.

This character encoding is a superset of ISO 8859-1 in terms of printable characters, but differs from the IANA's ISO-8859-1 by using displayable characters rather than control characters in the 80 to 9F (hex) range.

Relevant wiki pages:
https://en.wikipedia.org/wiki/Windows-1252
https://en.wikipedia.org/wiki/ISO/IEC_8859-1
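
To make the difference between the two encodings concrete, here is a minimal sketch. It is not how uchardet decides between them; the helper name `has_c1_range_byte` is made up for illustration. It only checks whether the input contains a byte in the 80 to 9F (hex) range, which is printable in windows-1252 but a control character in ISO-8859-1.

```rust
/// Hypothetical helper: true if `data` contains a byte in 0x80..=0x9F,
/// the range where windows-1252 and ISO-8859-1 disagree.
fn has_c1_range_byte(data: &[u8]) -> bool {
    data.iter().any(|b| (0x80..=0x9F).contains(b))
}

fn main() {
    // The bytes from the old doc test; 0xE7 lies outside the
    // disputed range, so both encodings agree on every byte here.
    let plain = [0x66u8, 0x72, 0x61, 0x6e, 0xe7, 0x61, 0x69, 0x73];
    assert!(!has_c1_range_byte(&plain));

    // Appending 0x80 (the euro sign in windows-1252) adds a byte that
    // is only a printable character under windows-1252.
    let mut with_euro = plain.to_vec();
    with_euro.push(0x80);
    assert!(has_c1_range_byte(&with_euro));
}
```

This would be consistent with the detector switching its answer once 0x80 is appended, but the actual decision is made inside uchardet.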

emk · 2016-10-27T00:41:07Z

src/lib.rs

@@ -54,27 +53,30 @@ struct EncodingDetector {
 }

 /// Return the name of the charset used in `data`, or `None` if the
-/// charset is ASCII or if the encoding can't be detected.  This is
+/// charset if the encoding can't be detected. This is


This function can no longer return None, so this comment is now incorrect.

emk · 2016-10-27T00:43:37Z

src/lib.rs

    /// `None` on error, or if the data appears to be ASCII.
-    fn charset(&self) -> Option<String> {
+    fn charset(&self) -> Result<String, EncodingDetectorError> {


This type is written as EncodingDetectorResult<String> above, but Result<String, EncodingDetectorError> here. At the least we should be consistent—but I'm halfway tempted to convert this library to use error-chain since we're making breaking API changes, and since our error type is relatively heavyweight already thanks to the message: String field. What do you think?

I'm totally happy to handle the error-chain conversion myself; I've done several of them lately.

I'd like to try the conversion. Though it's my first time using quick-error so tell me if I'm doing dumb stuff.

Ehm, I meant error-chain...

emk · 2016-10-27T00:44:05Z

src/lib.rs

-                Ok("") => None,
-                Ok(encoding) => Some(encoding.to_string())
+                Ok("") => Err(EncodingDetectorError {
+                    message: "uchardet failed to recognize a charset".to_string()


This looks like a good idea, but I need to double-check this new behavior against substudy to see if it still gets good results with older subtitle files.

And if I convert the library to use error-chain, I want to break this out as a distinct error type so that our callers can handle it specially.

Will try to integrate this.
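
A minimal sketch of what a distinct error for this case could look like, written with a plain enum rather than error-chain. The names `DetectError`, `UnrecognizableCharset` and `charset_from_raw` are hypothetical and not taken from the crate.

```rust
use std::fmt;

/// Hypothetical error type; the real crate's names may differ.
#[derive(Debug, PartialEq)]
enum DetectError {
    /// The detector reported "" because no charset matched.
    UnrecognizableCharset,
}

impl fmt::Display for DetectError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match *self {
            DetectError::UnrecognizableCharset => {
                write!(f, "uchardet failed to recognize a charset")
            }
        }
    }
}

/// Convert the raw charset name reported by the detector into a `Result`.
fn charset_from_raw(raw: &str) -> Result<String, DetectError> {
    match raw {
        "" => Err(DetectError::UnrecognizableCharset),
        name => Ok(name.to_string()),
    }
}

fn main() {
    assert_eq!(charset_from_raw("ASCII"), Ok("ASCII".to_string()));

    // Callers can match on the specific variant and, for example,
    // fall back to a default encoding.
    match charset_from_raw("") {
        Err(DetectError::UnrecognizableCharset) => println!("using a fallback encoding"),
        other => println!("{:?}", other),
    }
}
```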

emk · 2016-10-27T00:44:26Z

uchardet-sys/Cargo.toml

@@ -18,3 +18,4 @@ libc = "*"

 [build-dependencies]
 pkg-config = '*'
+cmake = "*"


I'm quite happy to see this dependency, especially if it helps us build on Windows and stay in sync with upstream.

emk · 2016-10-27T00:46:12Z

uchardet-sys/build.rs

-            .arg(&src.join("uchardet"))
-            .arg("-DCMAKE_BUILD_TYPE=Release")
-            .arg(&format!("-DCMAKE_INSTALL_PREFIX={}", dst.display()))
-            .arg(&format!("-DCMAKE_CXX_FLAGS={}", cxxflags)));


This massive code deletion is immensely satisfying.

emk

This is a great conversion to error-chain, and it makes things feel tidier.

I have a couple of very minor remarks below.

emk · 2016-10-27T16:19:03Z

src/lib.rs

-    /// `None` on error, or if the data appears to be ASCII.
-    fn charset(&self) -> Result<String, EncodingDetectorError> {
+    /// Get the decoder's current best guess as to the encoding. May return
+    /// an error if uchardet was unable to detect an encoding


Missing period.

emk · 2016-10-27T16:21:06Z

src/lib.rs

+///            detect_encoding_name("ascii".as_bytes()).unwrap());
+/// assert_eq!("UTF-8".to_string(),
+///            detect_encoding_name("©français".as_bytes()).unwrap());
+/// assert_eq!("ISO-8859-1".to_string(),


These to_string calls in these assertions are no longer necessary now that the surrounding Ok has been removed from the strings, because you can compare &str and String, but not Ok(&str) and Ok(String), if I recall correctly.
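
A small sketch of the comparison rules involved. The `detect` function is a hypothetical stand-in for `detect_encoding_name(...).unwrap()`.

```rust
/// Hypothetical stand-in for `detect_encoding_name(...).unwrap()`.
fn detect() -> String {
    "UTF-8".to_string()
}

fn main() {
    // `&str` and `String` can be compared directly, so no `to_string`
    // call is needed on the expected value.
    assert_eq!("UTF-8", detect());

    // `Result<&str, E>` and `Result<String, E>` are different types,
    // so the expected value has to be converted when it is wrapped.
    let wrapped: Result<String, ()> = Ok(detect());
    assert_eq!(Ok("UTF-8".to_string()), wrapped);
    // assert_eq!(Ok("UTF-8"), wrapped); // does not compile: mismatched types
}
```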

emk · 2016-10-27T16:24:56Z

src/lib.rs

-    fn description(&self) -> &str { "encoding detector error" }
-    fn cause(&self) -> Option<&Error> { None }
-}
+use errors::*;


Try pub use instead, so that users can access all the types used in our API signatures. Older Rust will fail to compile without this, and callers can't use links to import our error chain into theirs.

emk · 2016-10-27T16:26:33Z

src/lib.rs

        let result = unsafe {
            ffi::uchardet_handle_data(self.ptr, data.as_ptr() as *const i8,
                                      data.len() as size_t)
        };
        match result {
            0 => Ok(()),
            _ => {
-                let msg =  "Error handling data".to_string();
-                Err(EncodingDetectorError{message: msg})
+                Err(ErrorKind::DataHandlingError.into())


Are there any other uchardet error codes we should map to separate ErrorKind values? Or does it just return one cryptic error code for everything?

As I see it, there are currently only two possible return values:

enum nsresult { NS_OK, NS_ERROR_OUT_OF_MEMORY };

source
In nsUniversalDetector::HandleData only those values are returned but I don't think they guarantee that it stays like this in the future.

OK, in this case, we should map nsresult to Ok(()), Err(ErrorKind::OutOfMemory) and Err(ErrorKind::Other). We might want to do this using an impl From<nsresult> for ErrorKind in the errors module if nsresult is an actual struct or enum in Rust. Or if nsresult is just a type alias, then we could define a private ErrorKind::from_nsresult function.
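
A minimal sketch of that mapping, assuming the C enum quoted above is exposed to Rust as a plain integer (so NS_OK is 0 and NS_ERROR_OUT_OF_MEMORY is 1). The constants, the plain `ErrorKind` enum and `result_from_nsresult` are illustrative stand-ins, not the crate's actual definitions.

```rust
// Assumed integer values, following the order of the C enum
// `enum nsresult { NS_OK, NS_ERROR_OUT_OF_MEMORY };`.
const NS_OK: i32 = 0;
const NS_ERROR_OUT_OF_MEMORY: i32 = 1;

/// Illustrative stand-in for the error-chain generated `ErrorKind`.
#[derive(Debug, PartialEq)]
enum ErrorKind {
    OutOfMemory,
    Other,
}

/// Map a raw status code to a `Result`, keeping a catch-all arm in case
/// upstream adds new codes later.
fn result_from_nsresult(code: i32) -> Result<(), ErrorKind> {
    match code {
        NS_OK => Ok(()),
        NS_ERROR_OUT_OF_MEMORY => Err(ErrorKind::OutOfMemory),
        _ => Err(ErrorKind::Other),
    }
}

fn main() {
    assert_eq!(result_from_nsresult(0), Ok(()));
    assert_eq!(result_from_nsresult(1), Err(ErrorKind::OutOfMemory));
    assert_eq!(result_from_nsresult(42), Err(ErrorKind::Other));
}
```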

emk · 2016-10-27T16:28:38Z

src/lib.rs

@@ -128,9 +120,7 @@ impl EncodingDetector {
            match charset {
                Err(_) =>
                    panic!("uchardet_get_charset returned invalid value"),


I know this is my code, but I'd like to change this panic! message to something like "uchardet_get_charset returned a charset name containing invalid characters". I can do this when merging unless you want to do it yourself.

emk · 2016-10-27T16:59:41Z

OK, here's my TODO list to get this patch merged and a 2.0.0 release made. Some of these items are for me, some you'll probably want to do yourself, and some are for either of us.

Merge checklist:

@Bobo1239 Add back unit tests for "WINDOWS-1252" (preferably using 1252 smart quotes or something like that).
@emk Decide what the oldest Rust version we want to support is. Debian testing seems to have 1.10; we could try that, maybe?
Fix charset name decoding panic message to be more informative.
Fix missing period.
@Bobo1239 Remove all unnecessary to_string calls from doc tests.
Check what error codes are returned from uchardet and decide if we want to break out any of them as separate Rust-level errors.
@Bobo1239 pub use errors::*;
@emk Test a build against emk/substudy to make sure that we still produce good results with older real-world data (from subtitle files in this case).
@emk Set up AppVeyor on this PR branch so it will merge over to master with the PR.

I'll check off items as we do them.

Bobo1239 · 2016-10-27T17:37:56Z

What about AppVeyor Windows CI (GNU and MSVC)?

emk · 2016-10-27T18:08:40Z

AppVeyor is added to the checklist! I need to do some setup on my end, and I already have a standard config file that I use for my portable projects. I'll take a look at it shortly.

Bobo1239 · 2016-10-28T23:21:04Z

src/lib.rs

+    }
+
+    impl ErrorKind {
+        pub fn from_nsresult(nsresult: ::ffi::nsresult) -> ErrorKind {


I don't know how to make this private but still accessible for EncodingDetector::handle_data.

I could make a public free function error_kind_from_nsresult(res: i32) -> ErrorKind in the errors module and (in the crate root) only reexport errors::{Error, ErrorKind, ChainErr}.
Does that seem fine?

In that case, we should probably just hoist all the code in mod errors up to the top of the crate for simplicity.

Thank you for taking the time to respond to all my excessively detailed review comments, by the way!

I put it in a module so I could #[allow(missing_docs)] as it appears that you're unable to document the generated code (and I don't see another way to set that attribute). I'm currently trying to write up a PR for error-chain so the generated code has #[allow(missing_docs)].

It's no problem at all; rather, it's really educational for me to have somebody more experienced guide me.

(This comment is just a remark about something I noticed, not a feature request!)

You know, I think it might actually be possible to fix error-chain to support doc comments. Comments starting with /// actually get transformed to #[doc...] attributes before macros see them (if I recall correctly), so the macro could move them around and attach them to the right enum members on output.
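
A minimal sketch of the idea, using a toy macro rather than error-chain itself. The macro `error_kinds!` and the enum `DemoErrorKind` are invented for illustration. Because `///` comments reach the macro as `#[doc = "..."]` attributes, they can be captured with a `meta` fragment and re-attached to the generated variants.

```rust
// Toy macro (not error-chain): captures any attributes written before
// each variant, including doc comments, and re-emits them on the enum.
macro_rules! error_kinds {
    ($name:ident { $( $(#[$meta:meta])* $variant:ident ),* $(,)? }) => {
        #[derive(Debug, PartialEq)]
        pub enum $name {
            $( $(#[$meta])* $variant ),*
        }
    };
}

error_kinds!(DemoErrorKind {
    /// The detector could not recognize a charset.
    UnrecognizableCharset,
    /// The detector ran out of memory.
    OutOfMemory,
});

fn main() {
    // The doc comments above end up attached to the generated variants.
    println!("{:?}", DemoErrorKind::UnrecognizableCharset);
    println!("{:?}", DemoErrorKind::OutOfMemory);
}
```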

I actually also thought about that possibility but after taking a glance at quick_error.rs I quickly abandoned that idea... 😅

(PR is up: rust-lang-deprecated/error-chain#50)

emk · 2016-10-28T23:24:26Z

Looks great! I'm thinking that from_nsresult might be better as private, especially since it has that assertion.

emk · 2016-10-29T11:48:57Z

OK, this patch is looking really great!

I'm going to go ahead and merge it, because I want to take a look at #1 (which might be more feasible now that we have https://github.com/hsivonen/encoding_rs) as well as BurntSushi/ripgrep#1 (which can't depend on any C code, but which would benefit from a Rust-native implementation of detecting the most common character sets).

If you have any other tweaks, please feel free to open a new PR! And thank you for all your terrific work here.

emk · 2016-10-29T16:09:48Z

OK, I've set up AppVeyor for both Windows targets, and I've updated Travis to try building against 1.10.0. We can cross-compile from Linux to MinGW!

rustup run nightly cargo build --target x86_64-pc-windows-gnu
   Compiling uchardet-sys v1.0.0 (file:///home/emk/w/src/rust-uchardet/uchardet-sys)
   Compiling dbghelp-sys v0.2.0
   Compiling kernel32-sys v0.2.2
   Compiling backtrace v0.2.3
   Compiling error-chain v0.5.0
   Compiling uchardet v1.0.0 (file:///home/emk/w/src/rust-uchardet)
    Finished debug [unoptimized + debuginfo] target(s) in 2.35 secs

Unfortunately, AppVeyor fails for the MinGW build, because I don't seem to have everything installed:

CMake Error: CMAKE_CXX_COMPILER not set, after EnableLanguage
CMake Error: CMAKE_C_COMPILER not set, after EnableLanguage
CMake Error: CMAKE_AR was not found, please set to archive program.

If you have a spare moment, would you be interested in looking at this and offering me a hint? :-)

Bobo1239 · 2016-10-29T16:46:40Z

I can take a look at it in a couple of hours after I'm back.

emk · 2016-10-29T19:30:18Z

Thank you! I don't have a working Windows setup at the moment, so I can't test any of the Windows builds locally, except for cross-compilation (which works). I really appreciate your help.

Before we release this, we also need to take care of licensing. As the README.md points out, this crate is in the public domain so that it can be used without any restrictions at all, and to keep it that way, contributors need to include the following notice in one of their commit messages:

I dedicate any and all copyright interest in my contributions to this project to the public domain. I make this dedication for the benefit of the public at large and to the detriment of my heirs and successors. I intend this dedication to be an overt act of relinquishment in perpetuity of all present and future rights to this software under copyright law.

Once again, many thanks for helping update and port this project!

Bobo1239 added 2 commits October 26, 2016 14:21

Use CMake to build uchardet and update upstream submodule

7c37c65

Remove unnecessary Option around the charset name

f319faf

uchardet now returns "ASCII" when the input was ASCII and only returns "" when it failed to detect a charset

emk self-assigned this Oct 26, 2016

Bobo1239 added 2 commits October 26, 2016 15:33

Also consider 32-bit systems

784a47b

Test with stable on Travis

69c80ba

Make it work on windows-msvc

e80234a

Make it work on windows-gnu

44553be

emk reviewed Oct 27, 2016

Use error-chain for errors

11fdb93

emk reviewed Oct 27, 2016

Address remarks

5be347f

Make errors more specific

4a5a4fe

emk mentioned this pull request Oct 28, 2016

add support for other text encodings BurntSushi/ripgrep#1

Closed

emk merged commit 2281551 into emk:master Oct 29, 2016
