USX 3.0 support #38
Comments
Update: I did some more digging and found that converting the Relax NG file from https://github.com/ubsicap/usx to an XSD file is not really possible in an automated way; even with the most recent version of Trang it still requires many manual fixes. The main difference between USX 3.0 and 2.5/2.6 is mostly the addition of the "Peripheral" feature set. For my requirements I only need to convert files that solely use the "Scripture" feature set. I would like to make an open-source contribution to this project, but I'm not sure how to proceed. I can do a few things:
Note: Instead of converting the most recent Relax NG file for USX 3.0, I could also copy and alter the current usx.xsd file. What do you think? |
Hello Rolf, welcome and thanks for your willingness to contribute to this project.

First, about plans to support USX 3 (or in other words any new bible format) and implement it myself: it mainly depends on how many bibles are available/circulating in that format. As a ballpark figure, there should be at least a few dozen Bibles (in more than one language) which are published primarily (or exclusively) in the new format, or alternatively a few hundred Bibles which are available in the new format in addition to other formats implemented by BibleMultiConverter. As ebible.org (the largest repository I know that has USFM/USFX/USX bibles) still uses 2.6, there would have to be other sources (which may exist without my being aware of them) to make the format interesting for me.

For contributing an implementation of a new format, the threshold is a lot lower - if it compiles and if it is able to convert the test case bibles to the new format without crashing, I'm fine with incorporating it (even more so if you are willing to follow this repo and don't mind getting issues about your format assigned to you). If you try to obtain software that can convert these formats (so probably Paratext) and compare the Paratext dumps before/after converting bibles both with the official software and your contribution, even better.

Now in particular about USX 3: As I see it (from a quick glance at the spec and the list of changes), USX3 is paired with a new USFM3 format, which introduces new meta tags (like

As I said, this was only a quick glance at the spec. Maybe it is possible to use the same AbstractParatextFormat and use some tagging to distinguish which of the paragraph or character content tags are Version 3 and how to best "downconvert" them to Version 2.6. So the old formats will have to downconvert all new features before outputting them. What I don't want is an exporter that exports USFM 2.6 files that contain USFM 3 tags or vice versa.
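The "tagging" idea above could look roughly like the following Java sketch. Everything in it is an assumption made for illustration: the `ParaTag` enum, the `sinceVersion` and `fallback` fields, and in particular the choice of fallback tags are hypothetical and not taken from BibleMultiConverter or from the USFM specification.

```java
import java.util.ArrayList;
import java.util.List;

class DownconvertSketch {
    // Hypothetical tag catalogue: each tag records the format version that
    // introduced it and, for version 3 tags, an older tag to fall back to.
    enum ParaTag {
        P(2, null),
        Q1(2, null),
        LH(3, P),  // treated as a version 3 tag; the fallback is illustrative only
        LF(3, P);  // treated as a version 3 tag; the fallback is illustrative only

        final int sinceVersion;
        final ParaTag fallback;

        ParaTag(int sinceVersion, ParaTag fallback) {
            this.sinceVersion = sinceVersion;
            this.fallback = fallback;
        }

        // Returns this tag if the target version knows it, otherwise follows
        // the fallback chain until a supported tag is found.
        ParaTag forVersion(int targetVersion) {
            if (sinceVersion <= targetVersion || fallback == null) {
                return this;
            }
            return fallback.forVersion(targetVersion);
        }
    }

    // Maps every tag in the list to something the target version supports.
    static List<ParaTag> downconvert(List<ParaTag> tags, int targetVersion) {
        List<ParaTag> result = new ArrayList<>();
        for (ParaTag tag : tags) {
            result.add(tag.forVersion(targetVersion));
        }
        return result;
    }
}
```

With such a table, an exporter for the older format would call `downconvert(tags, 2)` before writing, so version 3 tags never reach a version 2 file.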
The USX2.6 Schema also needed a bunch of edits, mainly because the Relax NG conversion got rid of all the enumeration values, which are very useful when writing a converter, as you can use code completion and other features to test if you implemented support for every allowed value. So I'm not surprised you have some difficulties converting the USX3 schema. If you are hitting a dead end at some point, tell me - I cannot promise I'll find time soon, but eventually I will find some time to look at it. You should not have to rewrite or edit any of the generated classes (if you need to, please ask, as there should be a way to avoid that).

I'd suggest not to "hijack" the old schema but to make a new one. I don't care if you create it by manually updating the old schema, editing the converted new schema, or a mixture of both. You should update the known schema names in the ValidateXML class and validate some real-world USX3 bibles to make sure your final schema is correct.

About Peripheral content - it again boils down to how common this feature is in "real-world" bibles. My USFM2.6 implementation does not support Extended Study Content or the 3 alternate verse numbers for each verse supported by the spec, as these features require quite some effort to implement, yet are not used in most available bibles. BibleMultiConverter has support for a Bible Introduction, two Testament Introductions, Book/Chapter prologs (at the beginning of books/chapters), and an Appendix, which are each streams of the "FormattedText" elements (so just Rich Text without any semantic tagging). So you can at least map some of the peripheral content to those sections, and put all the rest either into the Bible Introduction or the Appendix. For your use case of converting USX3 to USFM3, it does not matter, as it should take the shortcut route and will keep those tags as is.

Last but not least, sorry if I missed any questions. Kindly ask again :)

Regards, Michael |
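Validating sample files against a hand-edited schema, as suggested above, can be sketched with the JDK's `javax.xml.validation` API. This is only a minimal sketch: the inline schema is a tiny hypothetical stand-in (one `note` element with an enumerated `style` attribute), not the real USX schema, and the class is not the project's ValidateXML class.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

class SchemaCheckSketch {
    // Hypothetical, heavily reduced stand-in for a schema. The named simple
    // type with enumeration values is the part that survives manual editing.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:simpleType name='NoteStyle'>"
      + "<xs:restriction base='xs:string'>"
      + "<xs:enumeration value='f'/><xs:enumeration value='x'/>"
      + "</xs:restriction></xs:simpleType>"
      + "<xs:element name='note'><xs:complexType>"
      + "<xs:attribute name='style' type='NoteStyle' use='required'/>"
      + "</xs:complexType></xs:element>"
      + "</xs:schema>";

    // Returns true if the document validates against the schema above.
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            // Without a custom ErrorHandler, validation errors are thrown.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            // SAXException for schema or validation errors, IOException for read failures.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<note style='f'/>"));
        System.out.println(isValid("<note style='zz'/>"));
    }
}
```

A value outside the enumeration is rejected, which is the kind of check that gets lost when the converted schema drops its enumeration values.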
Hi Michael, Thanks for answering my questions so quickly and in such detail.

I'm currently looking to at least read USX3 into Java objects. The format I need is a simplified and custom version of USFM3 that is used by my mobile app. This means I don't necessarily need a direct conversion to official USFM3 or 2 for that matter; as soon as I have access to Java objects it is easy enough to write my own custom format. The reason for this custom format is that USFM, when simplified and normalised, is really fast and efficient to parse, faster than any XML parser even. When I first wrote this app speed was quite important.

Ok, enough background. Currently I'm getting most of the Bibles from the Digital Bible Library (DBL), which currently uses USX3 as its official format. Especially now that Paratext 9 has been out for quite some time, the number of Bibles available in USX3 keeps growing. I was using this library as a step in between my own simplifier and normaliser tool, which only accepts USFM as input. By using the BibleMultiConverter I was able to also process USX. As you can imagine, with the growing number of Bibles available in USX3 I really need to be able to convert from that format as well. I have considered moving to USX3 completely as the internal format used by the app, but XML parsing performance is just not fast enough to satisfy my needs ;)

I would love to add support for USX3 to this library, but the schema alone gives me headaches: The schema provided by UBS is in the Relax NG format, which has no tooling whatsoever available to generate POJOs from (at least not in the Java world). Creating an XSD from this schema can be done, but it will never be as strict and descriptive as its Relax NG counterpart. For example the Relax NG schema for USX3 defines Footnote and CrossReference as two completely different types, however in xml they both use the Anyhow... 
I think I'm going to make an attempt at this, but I will start with an MVP implementation that supports only Scripture content and not Peripheral content and will only import USX3 and export USFM3. By looking at the differences between the USX3 and USX2 specification it indeed seems to make more sense to have an Btw: It seems more recent versions of Trang do generate the enumerations, but the Relax NG schema is so full of features that it basically generates an unusable SAX compliant XSD (ambiguity between different types that share the same element): |
OK, I have stumbled upon the DBL website a few times, but never could find out how mere mortals could register there to download their content...

About the Relax NG: it is nice that Relax NG can do such things as different elements with the same name; on the other hand, this will make validation and parsing a lot slower. Perhaps even to the point where it gets Turing complete? Parsing C++ templates, unlike parsing Java, is Turing complete, and there is an infamous C++ program of 4 lines (of 80 chars each) that takes several hours to even validate. When Paratext defined their own XML format, they could have used different tags for different elements and avoided this altogether.

Using JAXB binding mappings, you could assign different classes to the same xs:element depending on which xs:choice inside it is taken, but that does not help you here as the difference is in attribute presence (or even attribute value) and attributes may not be inside xs:choice, only subelements. You could still map the tag to two different possible classes, but when parsing from XML only one would ever get generated. So you'd probably have to live with unifying the elements into one, and checking attributes in code to find out which one is the right one. Or alternatively run the XML through some transformations (XSL or otherwise) to disambiguate the tag names based on presence of attributes, before converting them into Java objects. Which you will have to reverse when exporting USX files.

And about the enumerations, I think I was a bit unclear here. When having an xs:attribute that has an anonymous simple type restriction with enumeration values, JAXB will still not create enumerations from them. You'd have to change them to have a named simple type, which is defined as an enumeration. So even the new Trang output will need some manual improvement. |
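The "unify the elements and check attributes in code" option could be sketched as below. The sketch uses plain DOM rather than JAXB to stay self-contained, and it rests on assumptions: that footnotes and cross references share a `note` element and are told apart by a `style` attribute with values `f` and `x`. The `NoteKind` enum and the method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

class NoteKindSketch {
    enum NoteKind { FOOTNOTE, CROSS_REFERENCE, UNKNOWN }

    // One unified element type; the attribute value decides the kind.
    // Assumption: style "f" marks a footnote and "x" a cross reference.
    static NoteKind classify(Element note) {
        switch (note.getAttribute("style")) { // returns "" when the attribute is absent
            case "f": return NoteKind.FOOTNOTE;
            case "x": return NoteKind.CROSS_REFERENCE;
            default:  return NoteKind.UNKNOWN;
        }
    }

    // Convenience wrapper that parses a single element from a string.
    static NoteKind classifyXml(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
            return classify(root);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }
}
```

The same check would work on a JAXB-generated class by switching on its style property instead of a DOM attribute.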
Yeah, this is a bit of an issue. I managed to join them, but it is quite a hassle; I'd prefer to talk about that in private.
I don't see why they did not go this route; it makes a lot of sense from a usability perspective. Based on your experience with XSD and its possibilities, I think the easiest way to do this is to have unified models. I'm almost done with the XSD file; as soon as I finish I will point you to a branch so you can keep an eye on the work and give me some tips along the way. |
Hi Michael, First of all thanks for helping me out here and answering so quickly and in such detail. Really appreciated! I have a few thoughts/questions to share:

Reuse existing Paratext classes

It seems that, since the structure of USX2 and USX3 is really comparable, we can reuse the internal Paratext classes if we like. The main changes are basically additional Char and Para styles and the end chapter/verse milestones. But there is a small catch...

How is it ensured during an export that only supported attributes/elements/milestones are exported?

For example
It seems you have already answered this question here:
However I'm obviously asking because I need to add some new I was thinking about a few ways to solve this:
Any advice? |
The
As I understand it, the main difference between USFM 2 and USFM 3 here is that USFM 2 allows arbitrary content (titles) in the \periph tag, while USFM 3 provides an enumeration of allowed values for it. Which would mean the conversion can go 3->2 without problem, while 2->3 would have to check the values...

About how to handle the path, I'm fine with either way. If you disallow the direct conversion between version 2 and version 3, it's fine. If you have some list of "problematic tags" and disallow it only if one of them is included, even better. If you convert the problematic tags, I don't mind either. You can probably have convert functions at different depths of the type hierarchy to avoid a big ball of spaghetti mud converter, so you can call

As the new chapter/verse end tags seem to exist only in USX3 and not in USFM3, you probably don't need to add them to the internal representation, but can just synthesize them during export and strip them during import |
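The last suggestion, stripping end milestones on import and synthesizing them on export, might be sketched like this. The `Token` model is hypothetical and much simpler than the project's internal Paratext classes; a real exporter would also have to handle chapter ends and paragraph boundaries, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;

class MilestoneSketch {
    enum Type { VERSE_START, VERSE_END, TEXT }

    // Hypothetical flat token: a verse start or end (value = verse id) or text.
    static final class Token {
        final Type type;
        final String value;

        Token(Type type, String value) {
            this.type = type;
            this.value = value;
        }

        @Override
        public String toString() {
            return type + "(" + value + ")";
        }
    }

    // Import direction: drop end milestones so the internal model never sees them.
    static List<Token> stripEnds(List<Token> tokens) {
        List<Token> out = new ArrayList<>();
        for (Token t : tokens) {
            if (t.type != Type.VERSE_END) {
                out.add(t);
            }
        }
        return out;
    }

    // Export direction: close the open verse before the next start and at the end.
    static List<Token> addEnds(List<Token> tokens) {
        List<Token> out = new ArrayList<>();
        String openVerse = null;
        for (Token t : tokens) {
            if (t.type == Type.VERSE_START) {
                if (openVerse != null) {
                    out.add(new Token(Type.VERSE_END, openVerse));
                }
                openVerse = t.value;
            }
            out.add(t);
        }
        if (openVerse != null) {
            out.add(new Token(Type.VERSE_END, openVerse));
        }
        return out;
    }
}
```

Because `stripEnds(addEnds(tokens))` gives back the original sequence, the internal representation can stay identical for both format versions.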
I must have missed that, it is indeed! |
@schierlm this took a bit longer than expected, but I think I'm there. Only thing I left out is:
Since for now I don't need import or export support for USFM 3, but I may add it in the future. PR is here: #39. I can imagine it takes some time to look at this, so take your time. |
@schierlm I'm closing this issue, as the PR has been merged. |
Ok. Just a note, I may have to reduce the size of the USX Genesis book test cases, as I have to agree that for a 2MB total source code size, having >25% of it for a single test case may be a bit overkill. Do you have any preference which chapters to keep in the test? From a quick glance I would keep chapters 1 and 3, which is about 5% of the original size and about 2% of the total source code size. |
First of all, thanks a lot for this extremely useful library.
Today I tried to convert a USX 3.0 file to USFM, but it seems BibleMultiConverter only supports USX 2.6. I tried updating the schema after converting it from Relax NG Compact to XSD using Trang, but my knowledge of XML and schemas is just lacking, and the converted schema contains a lot of errors, such as Unique Particle Attribution violations.
Anyhow, before I dive in deeper, do you have any plans to support USX 3.0? Java is no problem for me, however I usually don't work with XML; as a mobile developer, XML is just not really part of the skill set I guess.
Is updating the schema even enough? Or does this also require a complete rewrite of the USX class and Usx class? I can imagine it does?
Update: I'm now working on adding USX 3.0 Import support and export for USFM 3.0. Once this works I might also start working on Export for USX 3.0 and Import for USFM 3.0.
Branch: https://github.com/Rolf-Smit/BibleMultiConverter/tree/feature/usx-3.0
Progress:
format/paratext/*) with new tags from the USX 3.0 specification.
AbstractParatextFormat that imports USX 3.0 into the internal Paratext models.