UTF-8 support for XML output in pycsw? #276

Closed
mattenp opened this issue Oct 6, 2014 · 7 comments
@mattenp

mattenp commented Oct 6, 2014

Hi,

I use pycsw 1.10.0 only for CSW publication, not for harvesting. I use a PostgreSQL database with UTF-8 encoding, and that works. However, although the XML in the database is UTF-8, the XML output sent to the browser isn't UTF-8, it's Win1252. The browser converts the encoding to UTF-8, so if I harvested the records I wouldn't get UTF-8; only the browser converts the XML to Unicode. Do you know a solution for this issue?

An example:

     UTF-8 in xml document: commerciale de la société
     UTF-8 in database: commerciale de la société
     WIN1252 in xml output stream: commerciale de la société
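
One common way this kind of symptom arises is UTF-8 bytes being interpreted as Windows-1252 somewhere between the server and the client. The sketch below only illustrates that general mechanism with the string from the example; it is not taken from pycsw, and the thread does not establish that this is the exact path the bytes take.

```python
# Illustration only: how UTF-8 bytes look when misread as Windows-1252.
text = "commerciale de la société"

utf8_bytes = text.encode("utf-8")      # 'é' becomes the two bytes 0xC3 0xA9
misread = utf8_bytes.decode("cp1252")  # each byte is read as a separate character

print(misread)

# Reversing the wrong decode recovers the original text.
repaired = misread.encode("cp1252").decode("utf-8")
assert repaired == text
```

Each accented character turns into two characters in the misread string, which is a quick visual hint that UTF-8 data was decoded with a single-byte code page.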

Best regards,
Matthias

@tomkralidis
Member

@mattenp do you have a CSW URL or test case (configuration, test file[s]) you can send to demonstrate the issue?

@mattenp
Author

mattenp commented Oct 7, 2014

Hi Tom,

I got the XML document from here: http://www.ifremer.fr/geonetwork-sdn/srv/eng/csw-csr?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetRecordById&id=urn:SDN:CSR:LOCAL:1000010&outputschema=csw:IsoRecord&ElementSetName=full

https://github.com/mattenp/testcase/blob/master/IFREMER-CSW_1000010_2009-02-24.xml

Then I load this record into the database: pycsw-admin.py -c load_records -f default.cfg -p /path/to/records

default.cfg https://github.com/mattenp/testcase/blob/master/default.cfg

If I use a PostgreSQL database, then it's UTF-8; if I use an Oracle database, then it's Latin-1.
By the way, I would like to use Oracle with UTF-8 support.

If I request the record

http://localhost:8080/pycsw-wsgi?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetRecordById&id=urn:SDN:CSR:LOCAL:1000010&ElementSetName=full&outputSchema=http://www.isotc211.org/2005/gmd

the browser shows a fine result in UTF-8, but if I look at the source code in Firefox or IE, or if I request the result with wget, the result is encoded in Latin-1.
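
Because a browser may re-decode what it displays, the raw response bytes are the more reliable thing to inspect. A minimal sketch of such a check follows; the helper name `looks_like_utf8` is hypothetical and not part of pycsw.

```python
def looks_like_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


sample = "société"
as_utf8 = sample.encode("utf-8")      # b'soci\xc3\xa9t\xc3\xa9'
as_latin1 = sample.encode("latin-1")  # b'soci\xe9t\xe9'

print(looks_like_utf8(as_utf8))
print(looks_like_utf8(as_latin1))
```

To apply it to a real response, one could save the wget output to a file and pass the result of reading that file in binary mode (`open(path, "rb").read()`) to the helper. Note that pure-ASCII output, including output that uses numeric character references, also passes this check.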

@tomkralidis
Member

@mattenp thanks. FYI I'm unable to load the data into pycsw:

$ pycsw-admin.py -c load_records -f default.cfg -p issue-276-testcase/
Initializing static context
creating new engine: sqlite:///tests/suites/cite/data/records.db
binding ORM to existing database
setting repository queryables
Processing file issue-276-testcase/IFREMER-CSW_1000010_2009-02-24.xml (1 of 1)
Serialized metadata, parsing content model
Traceback (most recent call last):
  File "/home/tkralidi/work/foss4g/pycsw/master/bin/pycsw-admin.py", line 7, in <module>
    execfile(__file__)
  File "/home/tkralidi/work/foss4g/pycsw/master/pycsw/bin/pycsw-admin.py", line 234, in <module>
    admin.load_records(CONTEXT, DATABASE, TABLE, XML_DIRPATH, RECURSIVE)
  File "/home/tkralidi/work/foss4g/pycsw/master/pycsw/pycsw/admin.py", line 337, in load_records
    record = metadata.parse_record(context, exml, repo)
  File "/home/tkralidi/work/foss4g/pycsw/master/pycsw/pycsw/metadata.py", line 102, in parse_record
    return _parse_metadata(context, repos, record)
  File "/home/tkralidi/work/foss4g/pycsw/master/pycsw/pycsw/metadata.py", line 127, in _parse_metadata
    return [_parse_iso(context, repos, exml)]
  File "/home/tkralidi/work/foss4g/pycsw/master/pycsw/pycsw/metadata.py", line 844, in _parse_iso
    _set(context, recobj, 'pycsw:Title', md.identification.title)
AttributeError: 'NoneType' object has no attribute 'title'

When I inspect https://github.com/mattenp/testcase/blob/master/IFREMER-CSW_1000010_2009-02-24.xml more closely, it appears to be an ISO 19115/SeaDataNet profile record, which is not supported by OWSLib (which pycsw uses to parse metadata).

1./ are you working off a fork/custom OWSLib build that supports the SeaDataNet profile? (aside: this would be a valuable enhancement to OWSLib)
2./ can you provide an ISO XML that OWSLib can handle so I can reproduce the encoding issue?

@tomkralidis
Member

> If I use a PostgreSQL database, then it's UTF-8; if I use an Oracle database, then it's Latin-1.
> By the way, I would like to use Oracle with UTF-8 support.

@mattenp FYI there would need to be some enhancement to pycsw to support some of the Oracle specifics like geometry and full text search (Oracle text?). I'm cc'ing @msmitherdc, who I believe/heard is interested in this as well (we can open another issue related to Oracle).

@mattenp
Author

mattenp commented Oct 9, 2014

> 1./ are you working off a fork/custom OWSLib build that supports the SeaDataNet profile? (aside: this would be a valuable enhancement to OWSLib)

@tomkralidis That's right, I work for SeaDataNet and I will do this. But it's the ISO 19139/SeaDataNet profile for CSR and CDI:

http://www.seadatanet.org/Standards-Software/Metadata-formats/CSR
http://www.seadatanet.org/Standards-Software/Metadata-formats/CDI

I'm on holiday for the next 2 weeks; after that I will report a solution for OWSLib concerning ISO 19139/SDN.

> 2./ can you provide an ISO XML that OWSLib can handle so I can reproduce the encoding issue?

Well, I've uploaded another test file: https://github.com/mattenp/testcase/blob/master/ISO19139-example.xml
The important string (société) is in the following tag:

<gmd:organisationName>
<gco:CharacterString>commerciale de la société -- Centre for Ecology &amp; Hydrology</gco:CharacterString>
</gmd:organisationName>
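
As a sanity check that the test string itself is intact before it reaches pycsw, the element can be parsed and read back. This is a sketch using Python's standard-library parser rather than OWSLib or lxml; the inline snippet and its namespace declarations are assumptions added so that the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

# ISO 19139 namespaces, declared inline so the fragment is self-contained.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

snippet = (
    f'<gmd:organisationName xmlns:gmd="{NS["gmd"]}" xmlns:gco="{NS["gco"]}">'
    "<gco:CharacterString>commerciale de la société -- "
    "Centre for Ecology &amp; Hydrology</gco:CharacterString>"
    "</gmd:organisationName>"
)

# Parse from UTF-8 bytes, as a file on disk would be read.
root = ET.fromstring(snippet.encode("utf-8"))
value = root.find("gco:CharacterString", NS).text

print(value)
```

If the parsed value contains the accented characters unchanged, the input file is not the source of the problem and attention can move to how the response is serialized.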

> @mattenp FYI there would need to be some enhancement to pycsw to support some of the Oracle specifics like geometry and full text search (Oracle text?). I'm cc'ing @msmitherdc, who I believe/heard is interested in this as well (we can open another issue related to Oracle).

@tomkralidis @msmitherdc Yes, I'm very interested in this.

@mattenp mattenp closed this as completed Oct 9, 2014
@mattenp mattenp reopened this Oct 9, 2014
@tomkralidis
Member

@mattenp are you able to make a local change/test? This should fix things:

diff --git a/pycsw/server.py b/pycsw/server.py
index f078e5e..55f1c1b 100644
--- a/pycsw/server.py
+++ b/pycsw/server.py
@@ -2323,7 +2323,7 @@ class Csw(object):
         else:  # it's XML
             self.contenttype = self.mimetype
             response = etree.tostring(self.response,
-            pretty_print=self.pretty_print)
+            pretty_print=self.pretty_print, encoding='unicode')
             xmldecl = '<?xml version="1.0" encoding="%s" standalone="no"?>\n' \
             % self.encoding
             appinfo = '<!-- pycsw %s -->\n' % self.context.version
@@ -2331,7 +2331,7 @@ class Csw(object):
         LOGGER.debug('Response:\n%s' % response)

         s = '%s%s%s' % (xmldecl, appinfo, response)
-        return s.encode()
+        return s.encode('utf8')


     def _gen_soap_wrapper(self):
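
The idea behind the patch above is to serialize the tree to text first and to encode to bytes only once, at the very end. The sketch below reproduces that pattern with the standard-library `xml.etree.ElementTree` as a stand-in for `lxml.etree` (an assumption made so the example has no dependencies); the element name and the `encoding` variable are illustrative and are not pycsw's actual objects.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)

root = ET.Element("organisationName")
root.text = "commerciale de la société"

# Step 1: serialize to text (str), not to encoded bytes.
body = ET.tostring(root, encoding="unicode")

# Step 2: build the XML declaration; `encoding` plays the role of self.encoding.
encoding = "UTF-8"
xmldecl = '<?xml version="1.0" encoding="%s" standalone="no"?>\n' % encoding

# Step 3: encode the assembled document explicitly as UTF-8, as the patch does.
payload = (xmldecl + body).encode("utf-8")

print(payload)
```

Note that the patch hard-codes `'utf8'` in the final encode while the declaration is built from `self.encoding`, so the declared and actual encodings agree only when the configured encoding is UTF-8.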

@tomkralidis tomkralidis added this to the 1.10.1 milestone Oct 9, 2014
@tomkralidis tomkralidis self-assigned this Oct 9, 2014
tomkralidis added a commit that referenced this issue Oct 10, 2014
fix response encoding support (#276)
@tomkralidis
Member

Applied to master and 1.10 branch.
