UTF-8 support for XML output in pycsw? #276

mattenp · 2014-10-06T07:46:43Z

Hi,

I use pycsw 1.10.0 only for CSW publication, not for harvesting. My backend is a PostgreSQL database with UTF-8 encoding, and that works. However, although the XML stored in the database is UTF-8, the XML output delivered to the browser is not UTF-8 but Win1252. The browser converts it to UTF-8 for display, so if I harvested the service I would not get UTF-8; only the browser converts the XML to Unicode. Do you know a solution for this issue?

An example:

     UTF-8 in xml document: commerciale de la société
     UTF-8 in database: commerciale de la société
     WIN1252 in xml output stream: commerciale de la soci&#233;t&#233;
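
A note on the third line: `&#233;` is the XML numeric character reference for é (U+00E9), so a stream containing only such references is plain ASCII rather than Win1252. The sketch below uses Python's standard-library xml.etree.ElementTree purely for illustration (pycsw itself uses lxml, and the element name `name` is made up); it suggests that a conforming XML parser, such as one inside a harvester, would resolve the references back to the original characters:

```python
import xml.etree.ElementTree as ET

# Bytes as they might appear in the output stream: pure ASCII, with
# "é" written as the numeric character reference &#233;.
# The <name> element is a hypothetical stand-in for the real record.
raw = b"<name>commerciale de la soci&#233;t&#233;</name>"

# Every byte is below 0x80, so decoding as ASCII raises no error.
raw.decode("ascii")

# An XML parser resolves the references back to the original characters.
text = ET.fromstring(raw).text
print(text)
```

So the escaped form is not lossy, although it is also not the literal UTF-8 output the request is about.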

Best regards,
Matthias


tomkralidis · 2014-10-06T19:42:09Z

@mattenp do you have a CSW URL or test case (configuration, test file[s]) you can send to demonstrate the issue?

mattenp · 2014-10-07T10:19:14Z

Hi Tom,

I get the xml document from here http://www.ifremer.fr/geonetwork-sdn/srv/eng/csw-csr?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetRecordById&id=urn:SDN:CSR:LOCAL:1000010&outputschema=csw:IsoRecord&ElementSetName=full

https://github.com/mattenp/testcase/blob/master/IFREMER-CSW_1000010_2009-02-24.xml

Then I load this record into the database: pycsw-admin.py -c load_records -f default.cfg -p /path/to/records

default.cfg https://github.com/mattenp/testcase/blob/master/default.cfg

If I use a PostgreSQL database, then it's UTF-8; if I use an Oracle database, then it's Latin-1.
By the way, I would like to use Oracle with UTF-8 support.

If I request the record

http://localhost:8080/pycsw-wsgi?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetRecordById&id=urn:SDN:CSR:LOCAL:1000010&ElementSetName=full&outputSchema=http://www.isotc211.org/2005/gmd

the browser shows a fine result in UTF-8, but if I look at the source code in Firefox or IE, or if I request the result with wget, the result is encoded in Latin-1.

tomkralidis · 2014-10-08T23:16:45Z

@mattenp thanks. FYI I'm unable to load the data into pycsw:

$ pycsw-admin.py -c load_records -f default.cfg -p issue-276-testcase/
Initializing static context
creating new engine: sqlite:///tests/suites/cite/data/records.db
binding ORM to existing database
setting repository queryables
Processing file issue-276-testcase/IFREMER-CSW_1000010_2009-02-24.xml (1 of 1)
Serialized metadata, parsing content model
Traceback (most recent call last):
  File "/home/tkralidi/work/foss4g/pycsw/master/bin/pycsw-admin.py", line 7, in <module>
    execfile(__file__)
  File "/home/tkralidi/work/foss4g/pycsw/master/pycsw/bin/pycsw-admin.py", line 234, in <module>
    admin.load_records(CONTEXT, DATABASE, TABLE, XML_DIRPATH, RECURSIVE)
  File "/home/tkralidi/work/foss4g/pycsw/master/pycsw/pycsw/admin.py", line 337, in load_records
    record = metadata.parse_record(context, exml, repo)
  File "/home/tkralidi/work/foss4g/pycsw/master/pycsw/pycsw/metadata.py", line 102, in parse_record
    return _parse_metadata(context, repos, record)
  File "/home/tkralidi/work/foss4g/pycsw/master/pycsw/pycsw/metadata.py", line 127, in _parse_metadata
    return [_parse_iso(context, repos, exml)]
  File "/home/tkralidi/work/foss4g/pycsw/master/pycsw/pycsw/metadata.py", line 844, in _parse_iso
    _set(context, recobj, 'pycsw:Title', md.identification.title)
AttributeError: 'NoneType' object has no attribute 'title'

When I inspect https://github.com/mattenp/testcase/blob/master/IFREMER-CSW_1000010_2009-02-24.xml more closely, it appears to be an ISO 19115/SeaDataNet profile record, which is not supported by OWSLib (which pycsw uses to parse metadata).

1./ are you working off a fork/custom OWSLib build that supports the SeaDataNet profile? (Aside: this would be a valuable enhancement to OWSLib.)
2./ can you provide an ISO XML that OWSLib can handle so I can reproduce the encoding issue?

tomkralidis · 2014-10-08T23:20:19Z

If I use a PostgreSQL database, then it's UTF-8; if I use an Oracle database, then it's Latin-1.
By the way, I would like to use Oracle with UTF-8 support.

@mattenp FYI there would need to be some enhancement to pycsw to support some of the Oracle specifics like geometry and full text search (Oracle text?). I'm cc'ing @msmitherdc, who I believe/heard is interested in this as well (we can open another issue related to Oracle).

mattenp · 2014-10-09T12:56:03Z

1./ are you working off a fork/custom OWSLib build that supports the SeaDataNet profile? (Aside: this would be a valuable enhancement to OWSLib.)

@tomkralidis That's right, I work for SeaDataNet and I will do this. But it's the ISO 19139/SeaDataNet profile for CSR and CDI:

http://www.seadatanet.org/Standards-Software/Metadata-formats/CSR
http://www.seadatanet.org/Standards-Software/Metadata-formats/CDI

I'll be on holiday for the next 2 weeks; after that I will report a solution for OWSLib concerning ISO 19139/SDN.

2./ can you provide an ISO XML that OWSLib can handle so I can reproduce the encoding issue?

Well, I've uploaded another test file: https://github.com/mattenp/testcase/blob/master/ISO19139-example.xml
The important string (société) is in the following tag:

<gmd:organisationName>
<gco:CharacterString>commerciale de la société -- Centre for Ecology &amp; Hydrology</gco:CharacterString>
</gmd:organisationName>

@mattenp FYI there would need to be some enhancement to pycsw to support some of the Oracle specifics like geometry and full text search (Oracle text?). I'm cc'ing @msmitherdc, who I believe/heard is interested in this as well (we can open another issue related to Oracle).

@tomkralidis @msmitherdc Yes, I'm very interested in this.

tomkralidis · 2014-10-09T17:18:15Z

@mattenp are you able to make a local change/test? This should fix things:

diff --git a/pycsw/server.py b/pycsw/server.py
index f078e5e..55f1c1b 100644
--- a/pycsw/server.py
+++ b/pycsw/server.py
@@ -2323,7 +2323,7 @@ class Csw(object):
         else:  # it's XML
             self.contenttype = self.mimetype
             response = etree.tostring(self.response,
-            pretty_print=self.pretty_print)
+            pretty_print=self.pretty_print, encoding='unicode')
             xmldecl = '<?xml version="1.0" encoding="%s" standalone="no"?>\n' \
             % self.encoding
             appinfo = '<!-- pycsw %s -->\n' % self.context.version
@@ -2331,7 +2331,7 @@ class Csw(object):
         LOGGER.debug('Response:\n%s' % response)

         s = '%s%s%s' % (xmldecl, appinfo, response)
-        return s.encode()
+        return s.encode('utf8')


     def _gen_soap_wrapper(self):
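
The effect of the two changed lines can be sketched outside pycsw. The example below is only an approximation under stated assumptions: it uses the standard-library xml.etree.ElementTree in place of lxml's etree (both accept `encoding='unicode'` in `tostring`), and the element and variable names are made up for illustration rather than taken from pycsw/server.py:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the response tree built by the server.
elem = ET.Element("organisationName")
elem.text = "commerciale de la société"

# Default serialization: ASCII bytes, with non-ASCII characters
# escaped as numeric character references such as &#233;.
ascii_bytes = ET.tostring(elem)

# encoding="unicode" returns a str with the characters left intact.
body = ET.tostring(elem, encoding="unicode")

# Prepend the XML declaration by hand, then encode explicitly as UTF-8,
# mirroring the s.encode('utf8') call in the patch.
xmldecl = '<?xml version="1.0" encoding="UTF-8" standalone="no"?>\n'
utf8_bytes = (xmldecl + body).encode("utf8")

print(ascii_bytes)
print(utf8_bytes)
```

Serializing to text first and encoding once at the end keeps the declared encoding and the actual bytes in agreement, which appears to be the intent of the patch.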

fix response encoding support (#276)

tomkralidis · 2014-10-10T16:59:06Z

Applied to master and 1.10 branch.

mattenp closed this as completed Oct 9, 2014

mattenp reopened this Oct 9, 2014

tomkralidis added this to the 1.10.1 milestone Oct 9, 2014

tomkralidis self-assigned this Oct 9, 2014

tomkralidis added bug server labels Oct 9, 2014

tomkralidis added a commit that referenced this issue Oct 10, 2014

Merge pull request #277 from tomkralidis/issue-276

d4b8e95

fix response encoding support (#276)

tomkralidis added a commit that referenced this issue Oct 10, 2014

fix response encoding support (#276)

b96a7ea

tomkralidis closed this as completed Oct 10, 2014
