Thursday, September 13, 2018

Sphinx 1.8 and feedformatter

I installed and tried the new Sphinx 1.8. There was a problem when generating my blog:

Traceback (most recent call last):
  File "/site-packages/sphinx/cmd/build.py", line 304, in build_main
    app.build(args.force_all, filenames)
  File "/site-packages/sphinx/application.py", line 369, in build
    self.emit('build-finished', None)
  File "/site-packages/sphinx/application.py", line 510, in emit
    return self.events.emit(event, self, *args)
  File "/site-packages/sphinx/events.py", line 80, in emit
    results.append(callback(*args))
  File "/sphinxfeed/sphinxfeed.py", line 95, in emit_feed
    feed.format_rss2_file(path)
  File "/site-packages/feedformatter.py", line 399, in format_rss2_file
    string = self.format_rss2_string(validate, pretty)
  File "/site-packages/feedformatter.py", line 393, in format_rss2_string
    return _stringify(RSS2root, pretty=pretty)
  File "/site-packages/feedformatter.py", line 273, in _stringify
    return ET.tostring(tree)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1126, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 820, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "/etgen/etgen/etree.py", line 29, in _serialize_xml
    return _original_serialize_xml(write, elem, *args, **kwargs)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/etgen/etgen/etree.py", line 29, in _serialize_xml
    return _original_serialize_xml(write, elem, *args, **kwargs)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/etgen/etgen/etree.py", line 29, in _serialize_xml
    return _original_serialize_xml(write, elem, *args, **kwargs)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/etgen/etgen/etree.py", line 29, in _serialize_xml
    return _original_serialize_xml(write, elem, *args, **kwargs)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 937, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1073, in _escape_cdata
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 7: ordinal not in range(128)

I opened #2534 and explored the problem:

  • It seems that this is caused by either sphinxfeed, feedformatter or etgen.etree.

  • etgen : I removed the patch in etgen.etree because it looked suspicious. But that didn’t help, so etgen.etree is probably innocent. That patch for writing CDATA has maybe become useless, but I am not sure, so I leave it there for the moment.

  • feedformatter : I tried a newer version. That didn’t help either. Note that feedformatter seems to be unmaintained. The PyPI version is 0.4 and still points to code.google.com, but there are two forks. I created a third fork but deleted it again when I found the explanation below.

  • sphinxfeed: I use my own fork of sphinxfeed (see Sunday, September 2, 2018), so I cannot simply try other versions.

Adding a try…except in my /usr/lib/python2.7/xml/etree/ElementTree.py finally revealed the explanation which I am going to simulate here:

sphinxfeed sets the pubDate field of feed items to a struct_time:

>>> import time
>>> fmt = '%Y-%m-%d %H:%M'
>>> pubDate = time.strptime("2018-03-13 11:07", fmt)
>>> pubDate
time.struct_time(tm_year=2018, tm_mon=3, tm_mday=13, tm_hour=11, tm_min=7, tm_sec=0, tm_wday=1, tm_yday=72, tm_isdst=-1)

When sphinxfeed then calls feedformatter, feedformatter writes all dates using the format demanded by the RSS 2.0 specification, which itself refers to the venerable RFC 822 (search for - 25 - in that document to get to the “5. DATE AND TIME SPECIFICATION” section). Anyway, here is what a pubDate field in an rss.xml file should look like:

>>> s = time.strftime("%a, %d %b %Y %H:%M:%S %Z", pubDate)
>>> repr(s)
"'Tue, 13 Mar 2018 11:07:00 '"

Now Sphinx version 1.8 (at least on my machine) sets the locale to Estonian:

>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'et_EE.utf8')
'et_EE.utf8'

And feedformatter now gets a localized string containing non-ASCII characters, which under Python 2 is not even a unicode string but a bytestring:

>>> s = time.strftime("%a, %d %b %Y %H:%M:%S %Z", pubDate)
>>> type(s)
<type 'str'>
>>> repr(s)
"'T, 13 m\\xc3\\xa4rts 2018 11:07:00 '"

And when trying to serialize that bytestring, we get our decoding error, because Python 2 implicitly decodes a bytestring as ASCII before it can encode it:

>>> s.encode("ascii", 'xmlcharrefreplace')
Traceback (most recent call last):
  File "/usr/lib/python2.7/doctest.py", line 1315, in __run
    compileflags, 1) in test.globs
  File "<doctest 0913.rst[13]>", line 1, in <module>
    s.encode("ascii", 'xmlcharrefreplace')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 7: ordinal not in range(128)
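
The same failure can be reproduced without touching the locale. The following is a minimal sketch written for Python 3 (an assumption, since the session above runs under Python 2), where the implicit ASCII decode has to be spelled out. The variable names are only illustrative, and the byte string is the one shown above.

```python
# Bytes as returned by time.strftime() under Python 2 with the Estonian locale.
raw = b"T, 13 m\xc3\xa4rts 2018 11:07:00 "

# Python 2 silently decodes a bytestring as ASCII before encoding it.
# Doing that step explicitly fails at the first byte of the UTF-8 encoded "ä".
try:
    raw.decode("ascii")
    error_position = None
except UnicodeDecodeError as exc:
    error_position = exc.start

# Decoding with the real encoding first lets the serialization succeed:
# the non-ASCII character becomes an XML character reference.
text = raw.decode("utf-8")
escaped = text.encode("ascii", "xmlcharrefreplace")

print(error_position)
print(escaped)
```

So the string would serialize fine if it were decoded as UTF-8 first. The problem is that nothing in the chain knows that it has to do this.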

It is true that I live in Estonia and that my Ubuntu system probably has some setting somewhere saying this. But in my conf.py I have:

language = 'en'

So why does Sphinx version 1.8 set the locale to “Estonian” on my machine? It is because of the environment variable LC_TIME. I can work around the problem by setting this variable to en_GB.UTF-8 before building:

$ export LC_TIME=en_GB.UTF-8
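
The same workaround could probably be applied from within Python, by switching LC_TIME to the "C" locale only while the date is being formatted. This is just a sketch under assumptions: the helper name c_time_locale is hypothetical, it is not something Sphinx, sphinxfeed or feedformatter provide, and I left out the %Z directive because its output depends on the machine.

```python
import contextlib
import locale
import time


@contextlib.contextmanager
def c_time_locale():
    """Temporarily switch LC_TIME to the "C" locale (hypothetical helper)."""
    saved = locale.setlocale(locale.LC_TIME)  # no second argument: query only
    locale.setlocale(locale.LC_TIME, "C")
    try:
        yield
    finally:
        # Restore whatever was active before, even if formatting failed.
        locale.setlocale(locale.LC_TIME, saved)


pub_date = time.strptime("2018-03-13 11:07", "%Y-%m-%d %H:%M")
with c_time_locale():
    formatted = time.strftime("%a, %d %b %Y %H:%M:%S", pub_date)
print(formatted)
```

Note that setlocale changes the locale for the whole process and is not thread-safe, so this is a workaround and not a clean fix.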

I added a unit test in my sphinxfeed clone which reproduces the problem (when LC_TIME=et_EE.UTF-8): the test suite passes with “Sphinx<1.8” and fails with 1.8.

But setting my LC_TIME to en_GB.UTF-8 is not really a satisfying solution.
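
One direction that would not depend on the environment at all is to build the date string without strftime, using fixed English day and month names as RFC 822 demands. The following is only a sketch: the function name rfc822_date and the default zone "GMT" are my assumptions, and this is not what feedformatter currently does.

```python
import time

# RFC 822 prescribes these English abbreviations regardless of the locale.
_DAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
_MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def rfc822_date(t, zone="GMT"):
    """Format a struct_time as an RFC 822 date without consulting the locale.

    The zone is passed in as plain text. Treating the time as GMT is an
    assumption made for this example.
    """
    return "%s, %02d %s %04d %02d:%02d:%02d %s" % (
        _DAYS[t.tm_wday], t.tm_mday, _MONTHS[t.tm_mon - 1], t.tm_year,
        t.tm_hour, t.tm_min, t.tm_sec, zone)


pub_date = time.strptime("2018-03-13 11:07", "%Y-%m-%d %H:%M")
print(rfc822_date(pub_date))
```

Because the names come from fixed tuples, the result is a plain ASCII string under any LC_TIME setting, in Python 2 as well as in Python 3.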

Miscellaneous

There was a warning no files found matching '.idea' during inv test in lino.

Lino and WeasyPrint

The new accounting report shows us that WeasyPrint is a great tool for most Lino printing jobs. That’s why I invested some time into trying to find out who’s behind this package.

Oh, here is a post by its author (gayoub from kozea group) where he explains why he wrote WeasyPrint: Comment générer automatiquement des jolis documents ? It’s so nice to read about somebody who shares similar experiences and feelings about producing printable documents!

Later I read another blog post by the Kozea group: Philippe et sa montre, an interview with Philippe Donadieu, manager of the Kozea group. Their main product is a suite of software solutions for drugstores in France. It seems to be proprietary software, though.

But their main developer is Guillaume Ayoub who also gave an interview. This corresponds to the AUTHORS file and the author of the first blog post.

And who is Simon Sapin, the first author mentioned in that file? According to his site exyr.org he has previously worked on WeasyPrint at Kozea. In 2012 he presented WeasyPrint at the W3C Developer Meetup in Lyon. On the slides I read that Kozea had 10 employees at that time, was located in the Lyon area and built custom web applications for businesses (“Industrialization, HTML5/CSS3 e-learning and Semi-automated reporting”), and that they had recently become a W3C member. The latter seems to be no longer true (at least they aren’t listed here).

Their community website finally confirms that they invite us to collaborate or to just tell them about ourselves.

It seems that Simon left the Kozea community when he left his job there, and that he has moved away from Python to Rust since then.

WeasyPrint was written and is maintained by a “corporate-driven community”. But unlike the Python extension for Visual Studio Code (see Monday, September 10, 2018), this is what I would call a corporation working for a free culture, because their product also serves people who are not customers of the corporation. That’s why I find Kozea more likeable than Microsoft.

Hi Simon and Guillaume, I’d like to say thank you for the great work you have done and are doing on WeasyPrint! I hope that its maintenance will continue to give you much joy and satisfaction. At the moment we just use WeasyPrint (in the lino.modlib.weasyprint plugin), and WP simply works as expected. This is great! Don’t expect active contributions because we have other things to do as well. Let us know if you see how we can help.

Lino Tera for therapists, continued

DONE:

  • Activated lino.modlib.dashboard.

  • Overdue appointments: do not include today’s appointments yet, only those from yesterday or earlier.

  • users.UserDetail has no tabs (Dashboard, event_type, …)

  • Changed the symbol for a “Cancelled” calendar entry in lino_xl.lib.cal from ☉ to ⚕, because the symbol ☉ (a sun) is used in Lino Tera for events where the guest missed the appointment without a valid reason. The sun reminds one of a day on the beach while the ⚕ reminds one of a drugstore.

  • New state “Verpasst” (“Missed – Guest missed the appointment”) for appointments.

Note the analogy: a guest (participant) can be “absent” or “excused”, an appointment can be “missed” or “cancelled”. In Lino Tera we will need this analogy because they have a mixture of group appointments and individual appointments.