Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug where formatId is set to application/vnd.ms-excel for csvs #930

Merged
merged 2 commits into from
Jun 16, 2020

Conversation

dmullen17
Copy link
Member

Bug fix where the editor sets the formatId to application/vnd.ms-excel for csv files with their mime type set as excel (only occurs for excel-exported csvs in windows).

I decided to add a new method because DataONEObject.getFormatId returns the first formatId that matches an object's mediatype. The body of sanitizeFormatId would need to be inserted before that first return, so an additional method seemed cleaner.

Some possible changes:

  1. This could be condensed into one line - I thought this was more readable.
var formatId = this.sanitizeFormatId(this.get("formatId") || this.getFormatId());
xml.find("formatid").text(formatId);
  1. I grabbed this block of code to extract the extension from the fileName and modified it so it only calls this.get("fileName") once, but I can revert that to maintain consistency if needed.

  2. This block could be removed without affecting the function, but I thought it would improve the speed to skip all non application/vnd.ms-excel files.

         if (formatId !== "application/vnd.ms-excel") {
            return formatId;
          }

@csjx
Copy link
Member

csjx commented Mar 17, 2019

Heya @dmullen17 - Thanks for the PR. I certainly understand your logic for creating sanitizeFormatId(), my inclination is that we should keep the format id resolution stuff in getFormatId(). We certainly don't have to be married to returning immediately based on the formatId matching the mediaType. When I originally wrote this, I knew it was a naive approach, but works a good deal of the time. So I think we should make getFormatId() smarter. What do you think about changing it up to weight the potential matches based on formatId, mediaType, and extension? The trick is to figure out which indicator is "wrong" - like in this case, extension is correct, and mediaType is incorrect. We might weight extension higher, and then apply the formatId that get's the highest score, on say a scale of 0 to 1. Thoughts are welcome from @laurenwalker .

@dmullen17
Copy link
Member Author

@csjx thanks for the review! I updated getFormatId based on your recommendations. This version gets the matching formatIds based on mediaType and extension and store them as arrays in the objectFormats object. If both arrays are populated it compares the first two items in each array and returns if they're equal. If they're unequal it handles specific discrepancies - we can add more cases below the excel mime type test if they arise. Next, if either of the arrays were empty it will use the first mediaTypes match to return a formatId, and finally if that's empty it will use the first extensions match (behavior from the previous version of getFormatId).

Additional comments / explanations:

  1. I noticed this line never returns anything. I believe it's because the backbone where method calls filter which fails on objects. I elected to remove it as getFormatId was working without it.

  2. My reasoning for adding a specific test for application/vnd.ms-excel media type: If instead of a test we picked one of the two to return as a default (in this case extensions) then there would no point in checking to see if they're equal - that value would get returned in either case. So we can add specific tests inside of the block if mediaType and extensions are unequal, and default to mediaType over extension outside of the block.

@laurenwalker
Copy link
Member

Hey Dom - I looked at your code yesterday and it's looking good!

Have you done some testing? It'd be a good test to upload like a hundred different file types in the editor and test whether the expected formatId is set on each. There are 129 DataONE formats in stage 2. If we could test all of them, even better.

@dmullen17
Copy link
Member Author

@laurenwalker thanks for looking over it! I tested it with ~ 20 common file types - I can test the rest of the DataOne formats as well. Is there somewhere I could add a unit test? I can also store a folder that contains a sample file of each Dataone format somewhere useful (if something similar already exists let me know!)

@laurenwalker
Copy link
Member

We're developing unit tests on the feature-config-refactor branch right now, but it's in a pretty hacky state right now. You could add the test files to that branch.

@dmullen17
Copy link
Member Author

dmullen17 commented Apr 4, 2019

hey @laurenwalker @csjx when one of you has a second can I get your thoughts on the tests I'm writing for DataONEObject.getFormatId()? I'm trying to write one to test all the registered dataone formats. What I have so far:

  1. Get all mediaTypes and their possible formatId matches (because one mediaType can correspond to multiple formatIds) from the global MetacatUI` object and store them in an object. Basically Object.keys() = mediaTypes, and Object.values() = arrays of formatIds
  2. Create an array of DataONEObjects and set their mediaType attributes to each unique mediaType
  3. Call getFormatId() on each object
  4. Check that the returned value of getFormatId() matches one of the formatIds in the corresponding array from step 1.

The test is working as expected, it actually helped me uncover some other file formats that aren't being set correctly, like application/json-ld, because the where method can't filter by objects. I'm trying to update that now.

Here's a fiddle of the tests so far. If you want to run them the best way I've found is by: adding a console.log(this); to one of the methods of DataONEObject(I just put it in getFormatId()) , importing the chai library in index.html ( <script type = "text/javascript", src = "../test/js/chai.js"></script>), launching a local instance of MetacatUI, then when the object logs you can right-click and store it as a global variable temp1, then extract the constructor var DataONEObject = temp1.__proto__.constructor. That's the only way I could get the DataONEObject constructor and the global MetacatUI object in the same environment. Then you can just copy-paste the plain JS from the tests (no describe, before, it, or require.js) into the console to run the tests. (super hacky sorry).

Any feedback would be really appreciated!

@laurenwalker
Copy link
Member

I've been working on getting all the tests working with access to the MetacatUI global object and all it's attributes. Once that is done, could you submit a separate pull request with your test and I'll take a look?

I'll let you know here when that code is checked in.

@laurenwalker
Copy link
Member

This issue popped up again with our OPC hosted repository. Related to #560.

@laurenwalker laurenwalker added this to the 2.12.0 milestone Apr 24, 2020
@laurenwalker laurenwalker added bug editor Requested by: Hosted repo Features requested by a DataONE hosted repository labels Apr 24, 2020
@csjx
Copy link
Member

csjx commented Apr 24, 2020

The upshot I think is that Microsoft Excel doesn't change the OS-level metadata when it exports a CSV file. When someone then uploads a CSV using the editor, the Content-Disposition field in the File object within the browser remains as application/vnd.ms-excel instead of text/csv, and so our lookup in the ObjectFormats collection ends up typing this file incorrectly. Perhaps we need to trust the extension more than the disposition that is set.

@laurenwalker
Copy link
Member

Yeah, we could make special exceptions for Excel formatids to trust the .csv extension

@amoeba
Copy link
Contributor

amoeba commented Apr 24, 2020

The upshot I think is that Microsoft Excel doesn't change the OS-level metadata when it exports a CSV file.

Have you seen that under some combination of variables? It doesn't happen here on a CSV I saved with Excel:

00000000  ef bb bf 31 2c 32 2c 33                           |...1,2,3|
00000008

That's just a UTF-8 BOM followed by my data and Chrome/Firefox/Safari reports it as text/csv. Might be worth testing on Windows. Here's a CodePen to test with.

@laurenwalker
Copy link
Member

I've only seen this bug happen on Windows.

@amoeba
Copy link
Contributor

amoeba commented Apr 24, 2020

Gotcha. I was just curious so I checked: It turns out that Chrome and other browsers are picking this up from Computer\HKEY_CLASSES_ROOT. You can go in and change the associated type to any value you like actually, for any extension. I think checking for a .csv extension when we get application/vnd.ms-excel is probably good for 99% of use cases, short of using magic numbers.

@laurenwalker laurenwalker modified the milestones: 2.12.0, 2.12.1 Jun 4, 2020
@laurenwalker laurenwalker changed the base branch from dev-2.6 to dev-2.12 June 16, 2020 15:18
@laurenwalker laurenwalker merged commit 0eeb57e into NCEAS:dev-2.12 Jun 16, 2020
@laurenwalker
Copy link
Member

Thanks @dmullen17 for your work on this. It's finally merged in!

@laurenwalker laurenwalker modified the milestones: 2.12.1, 2.13.0 Jul 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug editor Requested by: Hosted repo Features requested by a DataONE hosted repository
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants