This report is out-of-date.
The state of things has changed dramatically, for the better, since I first wrote this in early 2008. Although my test cases are still quite useful, any information regarding specific Python packages is likely to be inaccurate. I am leaving these pages here primarily for historic interest.
Unicode support
The set of tests on this page is primarily concerned with strings containing non-ASCII Unicode characters (those above U+007F). See the Strings test page for more details on how ASCII characters are handled.
Notation: Because some of the examples on this page contain
unprintable or invisible characters, a special notation delimited by
guillemets, as in «U+007B», is adopted
to represent a single occurrence of a Unicode character, in this case
U+007B.
Test environment
The Python implementation used in these tests internally uses UCS-4 Unicode characters, such that:
sys.maxunicode = 1114111, sys.getdefaultencoding() = 'ascii', sys.byteorder = 'little', unicodedata.unidata_version = '4.1.0'.
This is also Python version 2.5, so there are no builtin UTF-32 codecs.
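The same properties can be inspected on any interpreter. The following is a minimal sketch, assuming a modern Python 3 interpreter rather than the Python 2.5 used for this report, so the values it prints will differ from those listed above (for example, the default encoding on Python 3 is 'utf-8'). The helper name describe_environment is hypothetical.

```python
import sys
import unicodedata


def describe_environment():
    """Collect the interpreter properties listed in the test environment."""
    return {
        "sys.maxunicode": sys.maxunicode,
        "sys.getdefaultencoding()": sys.getdefaultencoding(),
        "sys.byteorder": sys.byteorder,
        "unicodedata.unidata_version": unicodedata.unidata_version,
    }


env = describe_environment()
for name, value in env.items():
    # All of these values are plain ASCII, so printing is safe everywhere.
    print("%s = %r" % (name, value))
```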
About JSON and Unicode
The JSON string type supports representing any of the nearly 1.1 million possible characters from the entire Unicode repertoire: U+0000 through U+10FFFF, except for the 2048 surrogate code points U+D800–U+DFFF, which cannot be represented by JSON.
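The arithmetic behind "nearly 1.1 million" can be checked directly. This is a small sketch, assuming Python 3 (where chr covers the full code point range); the helper name can_encode_utf8 is hypothetical.

```python
# Code point range of Unicode: U+0000 through U+10FFFF.
TOTAL_CODE_POINTS = 0x10FFFF + 1          # 1,114,112
# Surrogate code points: U+D800 through U+DFFF.
SURROGATE_COUNT = 0xDFFF - 0xD800 + 1     # 2,048
REPRESENTABLE = TOTAL_CODE_POINTS - SURROGATE_COUNT


def can_encode_utf8(code_point):
    """Return True if the code point can be serialized as UTF-8."""
    try:
        chr(code_point).encode("utf-8")
    except UnicodeEncodeError:
        # Python 3 refuses to encode lone surrogates as UTF-8.
        return False
    return True


print(REPRESENTABLE)
print(can_encode_utf8(0xD800), can_encode_utf8(0x1D120))
```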
However, it is important to realize that the JSON standard ultimately
defines a byte
stream, not a character stream, which makes sense considering
that JSON is intended as a data interchange format between languages
and platforms. Consequently, if JSON data is read from or written to files,
be sure to use the binary-file open flag, e.g.,
open(file,'rb'). The default character
encoding for JSON is UTF-8, although the
standard requires readers to apply certain autodetection techniques. Since
the Python bytes type is not available until Python 3.0
and we are only testing with Python 2.x in this report, we will use
the regular Python str type to hold any JSON byte stream
data.
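As an illustration of treating JSON as a byte stream, here is a minimal sketch. It assumes Python 3 and its standard-library json module (neither of which was tested in this report), where the bytes type holds the stream. The function names and the file name example.json are hypothetical.

```python
import json
import os
import tempfile


def write_json_bytes(path, data):
    """Serialize data to UTF-8 encoded JSON and write it in binary mode."""
    json_bytes = json.dumps(data, ensure_ascii=False).encode("utf-8")
    with open(path, "wb") as f:
        f.write(json_bytes)
    return json_bytes


def read_json_bytes(path):
    """Read the raw JSON byte stream without any implicit text decoding."""
    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.json")
    written = write_json_bytes(path, ["\u2021"])
    raw = read_json_bytes(path)
    # The caller chooses the decoding step explicitly.
    decoded = json.loads(raw.decode("utf-8"))
```

Opening the file in binary mode keeps the platform's default text encoding and newline translation out of the picture.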
Writers: All JSON-compliant writers must be able to generate a UTF-8 data stream. Note that generating a 7-bit US-ASCII output stream, although perhaps inefficient, intrinsically qualifies as being a UTF-8 stream as well. On the other hand, while being able to produce ASCII output may be nice (and is always possible thanks to JSON's \u-escape syntax), it is not required by the JSON specification.
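The claim that ASCII output is automatically valid UTF-8 can be demonstrated with a short sketch. It assumes Python 3 and the standard-library json module, which is not one of the modules tested in this report.

```python
import json

data = ["\u2021"]

# Every non-ASCII character is \u-escaped, so the result is pure ASCII.
ascii_bytes = json.dumps(data, ensure_ascii=True).encode("ascii")
# Characters are emitted literally and then encoded as UTF-8.
utf8_bytes = json.dumps(data, ensure_ascii=False).encode("utf-8")

# The ASCII stream decodes identically whether treated as ASCII or UTF-8,
# and both streams carry the same data.
from_ascii = json.loads(ascii_bytes.decode("utf-8"))
from_utf8 = json.loads(utf8_bytes.decode("utf-8"))
```

The ASCII form costs eight bytes for the one character (a six-character escape replaces a three-byte UTF-8 sequence), which is the inefficiency mentioned above.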
Readers: All JSON-compliant readers must be able to accept any of UTF-8, UTF-16, or UTF-32. Note that per the Unicode specifications for those encoding forms, UTF-16 and UTF-32 byte streams may be either little-endian or big-endian, and may or may not be prefixed with a BOM.
The missing UTF-32 codec: Python does not come with a UTF-32 codec until the Python 3.0 release, so any module that wishes to support reading UTF-32 encoded JSON, as required by the standard, must supply its own.
How to get UTF-8 output from the modules
To test JSON output compliance it is necessary for all modules to be able to generate a UTF-8 encoded byte stream. This requires a varying amount of work with each module. Some modules also support other output encodings, and for those we also test the ability to output a 7-bit ASCII byte stream.
The demjson module can directly produce output in any character encoding as long as a codec is available; it also includes built-in UTF-32 support. It adaptively decides whether to use \u-escapes on a character-by-character basis, depending on the repertoire of the chosen encoding. For example, converting the unicode string "aÀ†" into JSON with the ISO-8859-9 encoding will result in "aÀ\u2020" rather than the much longer "a\u00c0\u2020". For these tests it was invoked twice to illustrate the different behavior, as follows:
# For demjson/utf8
json_bytes = demjson.encode( pydata, encoding='utf-8' )
# For demjson/ascii
json_bytes = demjson.encode( pydata, encoding='ascii' )
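The idea of adaptive escaping can be sketched as follows. This is not demjson's actual implementation, only an illustration of the technique for a single JSON string, assuming Python 3; the names unicode_escape and encode_json_string are hypothetical. A character is emitted literally when the target codec can encode it and is \u-escaped otherwise.

```python
# Escapes that JSON always requires inside a string literal.
MANDATORY_ESCAPES = {
    '"': '\\"', '\\': '\\\\', '\b': '\\b', '\f': '\\f',
    '\n': '\\n', '\r': '\\r', '\t': '\\t',
}


def unicode_escape(ch):
    """Return the JSON \\u-escape for one character."""
    cp = ord(ch)
    if cp > 0xFFFF:
        # Characters beyond the BMP need a UTF-16 surrogate pair.
        offset = cp - 0x10000
        return "\\u%04x\\u%04x" % (0xD800 + (offset >> 10),
                                   0xDC00 + (offset & 0x3FF))
    return "\\u%04x" % cp


def encode_json_string(text, encoding):
    """Encode text as a JSON string, escaping only what the codec cannot hold."""
    parts = ['"']
    for ch in text:
        if ch in MANDATORY_ESCAPES:
            parts.append(MANDATORY_ESCAPES[ch])
        elif ord(ch) < 0x20:
            parts.append(unicode_escape(ch))   # other control characters
        else:
            try:
                ch.encode(encoding)            # probe the codec's repertoire
                parts.append(ch)
            except UnicodeEncodeError:
                parts.append(unicode_escape(ch))
    parts.append('"')
    return "".join(parts).encode(encoding)


latin5 = encode_json_string("a\u00c0\u2020", "iso-8859-9")
ascii_only = encode_json_string("a\u00c0\u2020", "ascii")
```

With ISO-8859-9 only the dagger is escaped, while with ASCII both non-ASCII characters are, matching the example in the paragraph above.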
Both the jsonlib and simplejson modules give a choice
of outputting either ASCII (using \u-style escapes as necessary)
or a Python Unicode string that can subsequently be encoded by the
caller into a byte stream. A limitation of this approach is that the
production of \u-style escape sequences is not adaptive to the
output encoding: either all the non-ASCII characters are
\u-escaped or none of them are. This means that the final encoding
used must be capable of handling the entire Unicode repertoire, or
the encoding step risks raising a UnicodeEncodeError. Of course this is
not a concern for any of the UTF-[8,16,32] encodings, but it could be
if you wanted ISO-8859-4, for example.
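The failure mode described above is easy to reproduce. The sketch below assumes Python 3 and uses the standard-library json module as a stand-in for the all-or-nothing escaping behavior; it is not one of the modules tested here, and encodes_cleanly is a hypothetical helper name.

```python
import json


def encodes_cleanly(text, encoding):
    """Return True if text can be encoded without raising an error."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Nothing escaped: the double dagger U+2021 is emitted literally.
unescaped = json.dumps(["\u2021"], ensure_ascii=False)
# Everything non-ASCII escaped: safe for any ASCII-compatible encoding.
escaped = json.dumps(["\u2021"], ensure_ascii=True)

results = {
    ("unescaped", "utf-8"): encodes_cleanly(unescaped, "utf-8"),
    ("unescaped", "iso-8859-4"): encodes_cleanly(unescaped, "iso-8859-4"),
    ("escaped", "iso-8859-4"): encodes_cleanly(escaped, "iso-8859-4"),
}
```

Only the unescaped text combined with ISO-8859-4 fails, because that encoding has no double dagger character.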
The jsonlib module was invoked as follows:
# For jsonlib/utf-8
json_ustr = jsonlib.write( pydata, ascii_only=False )
json_bytes = json_ustr.encode('utf-8')
# For jsonlib/ascii
json_ustr = jsonlib.write( pydata, ascii_only=True )
json_bytes = json_ustr.encode('ascii')
The simplejson module was invoked as follows:
# For simplejson/utf8
json_ustr = simplejson.dumps( pydata, ensure_ascii=False )
json_bytes = json_ustr.encode('utf-8')
# For simplejson/ascii
json_ustr = simplejson.dumps( pydata, ensure_ascii=True )
json_bytes = json_ustr.encode('ascii')
For python-cjson, the output is always an ASCII string, using \u-escapes for all non-ASCII characters. It is maximally portable at the expense of possibly producing inefficiently large JSON data streams. Since it is ASCII, it is by implication also UTF-8. It is therefore called simply as:
# For python-cjson
json_bytes = cjson.encode( pydata )
For python-json, the output is either a Python
str or a Python unicode string, depending on
whether non-ASCII characters were ever involved. JSON \u-style escapes are
never produced, thus this is the only module tested here that cannot
be made to restrict its output to just the ASCII subset of UTF-8. It
is called as follows:
# For python-json
json_ustr = json.write( pydata )
json_bytes = json_ustr.encode('utf-8')
Converting Python strings to JSON output
In these tests we convert various Python unicode strings into a
JSON UTF-8 data stream. The output of those that can be restricted
to just ASCII is also shown.
| Test# | from Python | to JSON | demjson/utf8 | demjson/ascii | jsonlib/utf8 | jsonlib/ascii | python-cjson | python-json | simplejson/utf8 | simplejson/ascii |
|---|---|---|---|---|---|---|---|---|---|---|
| 1–1 | [u'\x80'] | [«U+0080»] | yes | yes, outputs: ["\u0080"] | yes | yes, outputs: ["\u0080"] | yes, outputs: ["\u0080"] | yes | yes | yes, outputs: ["\u0080"] |
| 1–2 | [u'abc'] | ["abc"] | yes | yes | yes | yes | yes | yes | yes | yes |
| 1–3 | [u'\u0061'] | ["a"] | yes | yes | yes | yes | yes | yes | yes | yes |
| 1–4 | [u'\u00c0'] | ["À"] | yes | yes, outputs: ["\u00c0"] | yes | yes, outputs: ["\u00c0"] | yes, outputs: ["\u00c0"] | yes | yes | yes, outputs: ["\u00c0"] |
| 1–5 | [u'\u2021'] | ["‡"] | yes | yes, outputs: ["\u2021"] | yes | yes, outputs: ["\u2021"] | yes, outputs: ["\u2021"] | yes | yes | yes, outputs: ["\u2021"] |
| 1–6 | [u'\U0001d120'] | ["«U+1D120»"] | yes | yes, outputs: ["\ud834\udd20" | yes | yes, outputs: ["\ud834\udd20" | no, outputs: ["\U0001d120"] | yes | yes | yes, outputs: ["\ud834\udd20" |
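The \ud834\udd20 output in test 1–6 is the UTF-16 surrogate pair for U+1D120, which is how JSON \u-escapes represent characters beyond U+FFFF. The arithmetic can be sketched as below, assuming Python 3; to_surrogate_pair is a hypothetical helper, and the comparison against the standard-library json module is only a cross-check, since that module was not part of these tests.

```python
import json


def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000            # a 20-bit value
    high = 0xD800 + (offset >> 10)           # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1D120)
escaped = "\\u%04x\\u%04x" % (high, low)

# Cross-check against an ASCII-only JSON encoder.
stdlib_output = json.dumps(["\U0001d120"], ensure_ascii=True)
```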
When normal Python str types are converted rather than
unicode strings, the characters in the Python string are
normally interpreted as being ASCII, or in whatever encoding
sys.getdefaultencoding() happens to return. However, the
simplejson module allows the caller to specify the encoding
used by the Python strings, which is independent of the encoding in
which the JSON is being output. It is called like:
# simplejson encoding of python str strings
json_ustr = simplejson.dumps( pydata_with_str, encoding="cp1252" )
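The separation between the encoding of the input strings and the encoding of the JSON output can also be made explicit by decoding first. The sketch below assumes Python 3 and the standard-library json module (not tested in this report); it decodes the legacy bytes itself rather than relying on any encoding argument, and the sample byte string is invented for illustration.

```python
import json

# Bytes in Windows code page 1252: "caf", e-acute (0xE9), space,
# double dagger (0x87).
legacy_bytes = b"caf\xe9 \x87"

# Step 1: interpret the input bytes using their own encoding.
text = legacy_bytes.decode("cp1252")

# Step 2: choose the JSON output encoding independently.
json_ascii = json.dumps([text], ensure_ascii=True).encode("ascii")
json_utf8 = json.dumps([text], ensure_ascii=False).encode("utf-8")
```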
Autodetecting the character encoding of JSON input
These tests determine how well the module can automatically detect and decode an identical JSON input that has been encoded in different character encodings. A compliant JSON parser should be able to accept input encoded in any of UTF-8, UTF-16, and UTF-32; including all the different encoding schemes (endianness and use of BOM).
Note that manual decoding is almost always possible for any module, as long as the codec is supported by Python (which includes almost everything except UTF-32). However, these tests concern the automatic detection that the JSON specification requires.
For the tests that follow, the JSON input consists of the five characters ["‡"], or U+005B U+0022 U+2021 U+0022 U+005D. The actual byte sequence of the input is also shown along with each test case.
| Test# | JSON encoding | raw input bytes | demjson | jsonlib | python-cjson | python-json | simplejson |
|---|---|---|---|---|---|---|---|
| 3–1 | UTF-8 | 5b 22 e2 80 a1 22 5d | yes | yes | no | no | yes |
| 3–2 | UTF-16LE | 5b 00 22 00 21 20 22 00 5d 00 | yes | yes | no | no | no |
| 3–3 | UTF-16BE | 00 5b 00 22 20 21 00 22 00 5d | yes | yes | no | no | no |
| 3–4 | UTF-16LE with BOM | ff fe 5b 00 22 00 21 20 22 00 5d 00 | yes | no | no | no | no |
| 3–5 | UTF-16BE with BOM | fe ff 00 5b 00 22 20 21 00 22 00 5d | yes | no | no | no | no |
| 3–6 | UTF-32LE | 5b 00 00 00 22 00 00 00 21 20 00 00 22 00 00 00 5d 00 00 00 | yes | yes | no | no | no |
| 3–7 | UTF-32BE | 00 00 00 5b 00 00 00 22 00 00 20 21 00 00 00 22 00 00 00 5d | yes | yes | no | no | no |
| 3–8 | UTF-32LE with BOM | ff fe 00 00 5b 00 00 00 22 00 00 00 21 20 00 00 22 00 00 00 5d 00 00 00 | yes | no | no | no | no |
| 3–9 | UTF-32BE with BOM | 00 00 fe ff 00 00 00 5b 00 00 00 22 00 00 20 21 00 00 00 22 00 00 00 5d | yes | no | no | no | no |
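One way to perform this kind of autodetection is to look for a BOM first and otherwise examine the pattern of zero bytes in the first four octets, which works because a JSON text begins with ASCII characters. The sketch below is an illustration of that approach, not the algorithm used by any of the tested modules. It assumes Python 3 (whose codecs include UTF-32), and the function names are hypothetical.

```python
import codecs
import json

# Zero-byte patterns of the first four octets (True means the byte is 0x00).
NULL_PATTERNS = {
    (True, True, True, False): "utf-32-be",
    (False, True, True, True): "utf-32-le",
    (True, False, True, False): "utf-16-be",
    (False, True, False, True): "utf-16-le",
}


def detect_json_encoding(data):
    """Guess the Unicode encoding form of a JSON byte string."""
    # UTF-32 BOMs must be tested first: the UTF-32LE BOM (ff fe 00 00)
    # begins with the UTF-16LE BOM (ff fe).
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"      # this codec consumes the BOM and picks the order
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    head = data[:4]
    if len(head) == 4:
        pattern = tuple(byte == 0 for byte in head)
        return NULL_PATTERNS.get(pattern, "utf-8")
    return "utf-8"           # too short to tell; assume the default


def loads_autodetect(data):
    """Decode a JSON byte string after guessing its encoding."""
    return json.loads(data.decode(detect_json_encoding(data)))


# Two of the byte sequences from the table above (tests 3-3 and 3-8).
utf16be = bytes.fromhex("00 5b 00 22 20 21 00 22 00 5d")
utf32le_bom = bytes.fromhex(
    "ff fe 00 00 5b 00 00 00 22 00 00 00 21 20 00 00 22 00 00 00 5d 00 00 00")
parsed = [loads_autodetect(utf16be), loads_autodetect(utf32le_bom)]
```

Inputs shorter than four bytes fall back to UTF-8 in this sketch, which is a simplification.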
Converting JSON unicode strings to Python
The following JSON-encoded strings are converted into Python
strings. If they contain any non-ASCII characters (either literally
or by using \u-style escapes) then the resulting Python string should
be of the unicode type. The input JSON string for the
purposes of these tests is a UTF-8 encoded byte string, the default
encoding for JSON.
| Test# | from JSON | to Python | demjson/strict | demjson/loose | jsonlib | python-cjson | python-json | simplejson |
|---|---|---|---|---|---|---|---|---|
| 2–1 | ["\u0041"] | ['A'] | yes | yes | yes | yes | yes | yes |
| 2–2 | ["\u00b6"] | [u'\xb6'] | yes | yes | yes | yes | yes | yes |
| 2–3 | ["\u00B6"] | [u'\xb6'] | yes | yes | yes | yes | yes | yes |
| 2–4 | ["¶"] | [u'\xb6'] | yes | yes | yes | no, outputs: [u'\xc2\xb6'] | no, outputs: ['\xc2\xb6'] | yes |
| 2–5 | ["\u2021"] | [u'\u2021'] | yes | yes | yes | yes | yes | yes |
| 2–6 | ["‡"] | [u'\u2021'] | yes | yes | yes | no, outputs: [u'\xe2\x80\xa1 | no, outputs: ['\xe2\x80\xa1' | yes |
| 2–7 | ["\x20\x21"] | n/a | yes: error | almost, outputs: [' !'] | yes: error | yes: error | yes: error | yes: error |
| 2–8 | ["\ud834\udd20" ] | [u'\U0001d120'] | yes | yes | yes | no, outputs: [u'\xf0\x9d\x84 | no, outputs: ['\xf0\x9d\x84\ | no, outputs: [u'\ud834\udd20 |
| 2–9 | ["«U+1D120»"] | [u'\U0001d120'] | yes | yes | yes | no, outputs: [u'\xf0\x9d\x84 | no, outputs: ['\xf0\x9d\x84\ | yes |
| 2–10 | ["\ud834"] | n/a | yes: error | yes: error | yes: error | no, outputs: [u'\ud834'] | no, outputs: [u'\ud834'] | no, outputs: [u'\ud834'] |
| 2–11 | ["\udd20"] | n/a | yes: error | yes: error | no, outputs: [u'\udd20'] | no, outputs: [u'\udd20'] | no, outputs: [u'\udd20'] | no, outputs: [u'\udd20'] |
| 2–12 | ["\U0001d120"] | n/a | yes: error | almost, outputs: ['U0001d120'] | yes: error | no, outputs: ['\\\\U0001d120 | yes: error | yes: error |
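A few of these decoding cases can be re-run against a present-day parser with a small harness. The sketch below assumes Python 3 and its standard-library json module, which was not one of the modules tested here, so its results are not a substitute for the columns above. The name try_loads is hypothetical. Results are printed with ascii() so that the output stays printable regardless of the terminal encoding.

```python
import json


def try_loads(json_text):
    """Parse JSON text, returning ('ok', value) or ('error', exception name)."""
    try:
        return ("ok", json.loads(json_text))
    except ValueError as exc:   # json.JSONDecodeError is a ValueError subclass
        return ("error", type(exc).__name__)


# Raw string literals keep the backslashes intact for the JSON parser.
cases = {
    "2-1": r'["\u0041"]',
    "2-7": r'["\x20\x21"]',
    "2-8": r'["\ud834\udd20"]',
    "2-10": r'["\ud834"]',
    "2-12": r'["\U0001d120"]',
}

results = {name: try_loads(text) for name, text in cases.items()}
for name, outcome in results.items():
    print(name, ascii(outcome))
```

With this parser the surrogate pair in 2-8 is combined into a single character and the invalid escapes in 2-7 and 2-12 are rejected, but the lone surrogate in 2-10 is accepted rather than reported as an error.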
The end.

