Pages in this report:
  1. Introduction
  2. Basic JSON conformance
  3. Whitespace
  4. Numbers
  5. Sequences
  6. Strings
  7. Unicode

This report is out-of-date.

The state of things has changed dramatically, for the better, since I first wrote this in early 2008. Although my test cases are still quite useful, any information regarding specific python packages is likely to be inaccurate. I am leaving these pages here primarily for historic interest.

Unicode support

The set of tests on this page is primarily concerned with strings containing non-ASCII Unicode characters (those above U+007F). See the Strings test page for more details on how ASCII characters are handled.

Notation: Because some of the examples on this page contain unprintable or invisible characters, a special notation delimited by guillemets, as in «U+007B», is adopted to represent a single occurrence of a Unicode character, in this case U+007B.

Test environment

The Python implementation used in these tests was built with UCS-4 ("wide") Unicode characters, so a single Python unicode character can represent any code point up through U+10FFFF.

This is also Python version 2.5, so there are no built-in UTF-32 codecs.
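A wide (UCS-4) build can be recognized from sys.maxunicode; the following check is shown only for illustration:

import sys

# On a UCS-4 ("wide") build sys.maxunicode is 0x10FFFF (1114111); on a UCS-2
# ("narrow") build it is 0xFFFF and supplementary characters are stored
# internally as surrogate pairs.
assert sys.maxunicode == 0x10FFFF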

About JSON and Unicode

The JSON string type supports representing any of the nearly 1.1 million possible characters from the entire Unicode repertoire, U+0000 through U+10FFFF, except for the 2048 surrogate code points U+D800 through U+DFFF, which cannot be represented by JSON.

However, it is important to realize that the JSON standard ultimately defines a byte stream, not a character stream, which makes sense considering that JSON is intended as a data interchange format between languages and platforms. Consequently, if JSON data is read from or written to files, be sure to use the binary-file open flag, e.g., open(file,'rb'). The default character encoding for JSON is UTF-8, although the standard requires readers to perform certain autodetection techniques. Since the Python bytes type is not available until Python 3.0 and we are only testing with Python 2.x in this report, we will use the regular Python str type to hold any JSON byte stream data.
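For example, a minimal sketch of reading raw JSON bytes under Python 2.x (the file name is hypothetical):

f = open('data.json', 'rb')   # binary mode: no newline or locale translation
json_bytes = f.read()         # a Python 2.x str holding the raw (e.g. UTF-8) JSON bytes
f.close()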

Writers: All JSON-compliant writers must be able to generate a UTF-8 data stream. Note that generating a 7-bit US-ASCII output stream, although perhaps inefficient, intrinsically qualifies as being a UTF-8 stream as well. On the other hand, while being able to produce ASCII output may be nice (and is always possible thanks to JSON's \u-escape syntax), it is not required by the JSON specification.

Readers: All JSON-compliant readers must be able to accept any of UTF-8, UTF-16, or UTF-32. Note that per the Unicode specifications for those encoding forms, UTF-16 and UTF-32 byte streams may be either little-endian or big-endian, and may or may not be prefixed with a BOM.

The missing UTF-32 codec: Python does not come with a UTF-32 codec until the Python 3.0 release, so any module that wishes to support reading UTF-32-encoded JSON, as the standard requires, must supply its own.
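For illustration only (this is not code from any of the tested modules), a hand-rolled UTF-32 decoder for a UCS-4 build of Python 2.x might look like:

import struct

def decode_utf32(data, byteorder='<'):
    # byteorder is '<' for little-endian or '>' for big-endian.  Assumes a
    # UCS-4 build so that unichr() accepts code points above U+FFFF; BOM
    # handling and validity checks (range, surrogates) are omitted.
    if len(data) % 4:
        raise ValueError('UTF-32 data must be a multiple of 4 bytes')
    codepoints = struct.unpack('%s%dL' % (byteorder, len(data) // 4), data)
    return u''.join(unichr(cp) for cp in codepoints)

# decode_utf32('\x5b\x00\x00\x00\x22\x00\x00\x00', '<')  ->  u'["'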

How to get UTF-8 output from the modules

To test JSON output compliance it is necessary for all modules to be able to generate a UTF-8 encoded byte stream. This requires a varying amount of work with each module. Some modules also support other output encodings, and for those we also test the ability to output a 7-bit ASCII byte stream as well.

The demjson module can directly produce output in any character encoding for which a codec is available (it also includes built-in UTF-32 support). It adaptively determines whether to use \u-escapes on a character-by-character basis, depending on the repertoire of the chosen encoding. For example, converting the unicode string "aÀ†" into JSON with the ISO-8859-9 encoding results in "aÀ\u2020" rather than the much longer "a\u00c0\u2020". For these tests it was invoked twice to illustrate the different behavior, as follows:

# For demjson/utf8
json_bytes = demjson.encode( pydata, encoding='utf-8' )

# For demjson/ascii
json_bytes = demjson.encode( pydata, encoding='ascii' )
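As a sketch of the adaptive escaping just described (the expected bytes shown are those implied by the paragraph above, not captured test output):

# demjson with a non-Unicode output encoding: U+00C0 exists in ISO-8859-9 and
# so is emitted as the single byte 0xc0, while U+2020 does not and is \u-escaped.
json_bytes = demjson.encode( u'a\u00c0\u2020', encoding='iso-8859-9' )
# expected: '"a\xc0\\u2020"'  (the JSON string "aÀ\u2020" as ISO-8859-9 bytes)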

Both the jsonlib and simplejson modules give a choice of either outputting ASCII (using \u-style escapes as necessary) or a Python Unicode string that can subsequently be encoded by the caller into a byte stream. A limitation of this approach is that the production of \u-style escape sequences is not adaptive to the output encoding: either all of the non-ASCII characters are \u-escaped or none of them are. This means that the final encoding used should be capable of handling the entire Unicode repertoire, or the caller risks a UnicodeEncodeError. Of course this is not a concern for any of the UTF-[8,16,32] encodings, but it could be if you wanted ISO-8859-4, for example.
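A quick sketch of that risk, using the simplejson invocation shown further below:

json_ustr = simplejson.dumps( [u'\u2021'], ensure_ascii=False )   # u'["\u2021"]'
json_bytes = json_ustr.encode('iso-8859-4')   # raises UnicodeEncodeError; Latin-4 has no U+2021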

The jsonlib module was invoked as follows:

# For jsonlib/utf-8
json_ustr = jsonlib.write( pydata, ascii_only=False )
json_bytes = json_ustr.encode('utf-8')

# For jsonlib/ascii
json_ustr = jsonlib.write( pydata, ascii_only=True )
json_bytes = json_ustr.encode('ascii')

The simplejson module was invoked as follows:

# For simplejson/utf8
json_ustr = simplejson.dumps( pydata, ensure_ascii=False )
json_bytes = json_ustr.encode('utf-8')

# For simplejson/ascii
json_ustr = simplejson.dumps( pydata, ensure_ascii=True )
json_bytes = json_ustr.encode('ascii')

For python-cjson, the output is always an ASCII string, using \u-escapes for all non-ASCII characters. It is maximally portable at the expense of possibly producing inefficiently large JSON data streams. Since it is ASCII, it is by implication also UTF-8. It is therefore called simply as:

# For python-cjson
json_bytes = cjson.encode( pydata )

For python-json, the output is either a Python str or a Python unicode string, depending on whether non-ASCII characters were ever involved. JSON \u-style escapes are never produced, so this is the only module tested here that cannot be made to restrict its output to just the ASCII subset of UTF-8. It is called as follows:

# For python-json
json_ustr = json.write( pydata )
json_bytes = json_ustr.encode('utf-8')

Converting Python strings to JSON output

In these tests we convert various Python unicode strings into a JSON UTF-8 data stream. The output of those modules that can be restricted to just ASCII is also shown.

Table 1: Python Unicode strings to JSON

Test 1–1: from Python [u'\x80'], expected JSON [«U+0080»]
    demjson/utf8, jsonlib/utf8, python-json, simplejson/utf8: yes
    demjson/ascii, jsonlib/ascii, python-cjson, simplejson/ascii: yes, outputs: ["\u0080"]

Test 1–2: from Python [u'abc'], expected JSON ["abc"]
    all eight configurations: yes

Test 1–3: from Python [u'\u0061'], expected JSON ["a"]
    all eight configurations: yes

Test 1–4: from Python [u'\u00c0'], expected JSON ["À"]
    demjson/utf8, jsonlib/utf8, python-json, simplejson/utf8: yes
    demjson/ascii, jsonlib/ascii, python-cjson, simplejson/ascii: yes, outputs: ["\u00c0"]

Test 1–5: from Python [u'\u2021'], expected JSON ["‡"]
    demjson/utf8, jsonlib/utf8, python-json, simplejson/utf8: yes
    demjson/ascii, jsonlib/ascii, python-cjson, simplejson/ascii: yes, outputs: ["\u2021"]

Test 1–6: from Python [u'\U0001d120'], expected JSON ["«U+1D120»"]
    demjson/utf8, jsonlib/utf8, python-json, simplejson/utf8: yes
    demjson/ascii, jsonlib/ascii, simplejson/ascii: yes, outputs: ["\ud834\udd20"]
    python-cjson: no, outputs: ["\U0001d120"]

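Test 1–6 shows why output such as ["\ud834\udd20"] appears for U+1D120: a JSON \u escape can only name a 16-bit code unit, so characters above U+FFFF must be written as a UTF-16 surrogate pair. A minimal sketch of that calculation (not taken from any of the modules):

def to_surrogate_pair(codepoint):
    # Split a supplementary-plane code point (U+10000 through U+10FFFF) into
    # the high and low UTF-16 surrogates used by JSON \u escapes.
    offset = codepoint - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

print '\\u%04x\\u%04x' % to_surrogate_pair(0x1D120)   # prints \ud834\udd20
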
When normal Python str types are converted rather than unicode strings, the characters in the Python string are normally interpreted as being ASCII, or whatever the default sys.getdefaultencoding() happens to be. However, the simplejson module allows the caller to specify the encoding used by the Python strings, which is independent of the encoding in which the JSON is being output. It is called like:

# simplejson encoding of python str strings
json_ustr = simplejson.dumps( pydata_with_str, encoding="cp1252" )
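For example, a small illustrative sketch (the input bytes are hypothetical):

# The cp1252 bytes 0x93 and 0x94 are the curly quotes U+201C and U+201D; they
# are decoded from the str input and, since ensure_ascii defaults to True,
# come out \u-escaped in the JSON:
simplejson.dumps( [ '\x93x\x94' ], encoding='cp1252' )
# -> '["\\u201cx\\u201d"]'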

Autodetecting the character encoding of JSON input

These tests determine how well each module can automatically detect and decode an identical JSON input that has been encoded in different character encodings. A compliant JSON parser should be able to accept input encoded in any of UTF-8, UTF-16, and UTF-32, including all the different encoding schemes (endianness and presence or absence of a BOM).

Note that manual decoding is almost always possible for any module, as long as the codec is supported by Python (which covers nearly everything except UTF-32). However, these tests concern the automatic detection that the JSON specification requires.

For the tests that follow, the JSON input consists of the five characters ["‡"], or U+005B U+0022 U+2021 U+0022 U+005D. The actual byte sequence of the input is also shown along with each test case.
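A sketch of one possible detection routine, roughly following the heuristic suggested by RFC 4627 (check for a BOM; otherwise use the pattern of zero bytes among the first four octets, since the first two characters of any JSON text are ASCII). This is illustrative only and is not code from any of the tested modules:

def detect_json_encoding(data):
    # 'data' is a Python 2.x str of raw bytes; at least four octets are assumed.
    # Check for a BOM first (longest signatures first), then fall back to the
    # RFC 4627 zero-byte pattern.
    if data.startswith('\x00\x00\xfe\xff'):
        return 'utf-32be'          # with BOM
    if data.startswith('\xff\xfe\x00\x00'):
        return 'utf-32le'          # with BOM
    if data.startswith('\xfe\xff'):
        return 'utf-16be'          # with BOM
    if data.startswith('\xff\xfe'):
        return 'utf-16le'          # with BOM
    if data[0:3] == '\x00\x00\x00':
        return 'utf-32be'
    if data[1:4] == '\x00\x00\x00':
        return 'utf-32le'
    if data[0] == '\x00':
        return 'utf-16be'
    if data[1] == '\x00':
        return 'utf-16le'
    return 'utf-8'

# Note that actually decoding the UTF-32 cases still requires a codec that
# Python 2.5 does not supply (see "The missing UTF-32 codec" above).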

Table 3: Automatically detecting character encoding of JSON input

Test 3–1: UTF-8
    raw input bytes: 5b 22 e2 80 a1 22 5d
    demjson: yes   jsonlib: yes   python-cjson: no   python-json: no   simplejson: yes

Test 3–2: UTF-16LE
    raw input bytes: 5b 00 22 00 21 20 22 00 5d 00
    demjson: yes   jsonlib: yes   python-cjson: no   python-json: no   simplejson: no

Test 3–3: UTF-16BE
    raw input bytes: 00 5b 00 22 20 21 00 22 00 5d
    demjson: yes   jsonlib: yes   python-cjson: no   python-json: no   simplejson: no

Test 3–4: UTF-16LE w/ BOM
    raw input bytes: ff fe 5b 00 22 00 21 20 22 00 5d 00
    demjson: yes   jsonlib: no   python-cjson: no   python-json: no   simplejson: no

Test 3–5: UTF-16BE w/ BOM
    raw input bytes: fe ff 00 5b 00 22 20 21 00 22 00 5d
    demjson: yes   jsonlib: no   python-cjson: no   python-json: no   simplejson: no

Test 3–6: UTF-32LE
    raw input bytes: 5b 00 00 00 22 00 00 00 21 20 00 00 22 00 00 00 5d 00 00 00
    demjson: yes   jsonlib: yes   python-cjson: no   python-json: no   simplejson: no

Test 3–7: UTF-32BE
    raw input bytes: 00 00 00 5b 00 00 00 22 00 00 20 21 00 00 00 22 00 00 00 5d
    demjson: yes   jsonlib: yes   python-cjson: no   python-json: no   simplejson: no

Test 3–8: UTF-32LE w/ BOM
    raw input bytes: ff fe 00 00 5b 00 00 00 22 00 00 00 21 20 00 00 22 00 00 00 5d 00 00 00
    demjson: yes   jsonlib: no   python-cjson: no   python-json: no   simplejson: no

Test 3–9: UTF-32BE w/ BOM
    raw input bytes: 00 00 fe ff 00 00 00 5b 00 00 00 22 00 00 20 21 00 00 00 22 00 00 00 5d
    demjson: yes   jsonlib: no   python-cjson: no   python-json: no   simplejson: no

Converting JSON unicode strings to Python

The following JSON-encoded strings are converted into Python strings. If they contain any non-ASCII characters (either literally or by using \u-style escapes) then the resulting Python string should be of the unicode type. The input JSON string for the purposes of these tests is a UTF-8 encoded byte string, the default encoding for JSON.
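As an illustrative sketch only (this is not a description of how the report's test harness invoked each module), decoding one of these UTF-8 byte strings looks like this with two of the modules:

json_bytes = '["\xe2\x80\xa1"]'            # the UTF-8 bytes for ["‡"]
pydata = simplejson.loads( json_bytes )    # -> [u'\u2021']
pydata = demjson.decode( json_bytes )      # -> [u'\u2021']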

Table 2: JSON strings to Python Unicode

Test 2–1: from JSON ["\u0041"], expected Python ['A']
    all six modules: yes

Test 2–2: from JSON ["\u00b6"], expected Python [u'\xb6']
    all six modules: yes

Test 2–3: from JSON ["\u00B6"], expected Python [u'\xb6']
    all six modules: yes

Test 2–4: from JSON ["¶"], expected Python [u'\xb6']
    demjson/strict, demjson/loose, jsonlib, simplejson: yes
    python-cjson: no, outputs: [u'\xc2\xb6']
    python-json: no, outputs: ['\xc2\xb6']

Test 2–5: from JSON ["\u2021"], expected Python [u'\u2021']
    all six modules: yes

Test 2–6: from JSON ["‡"], expected Python [u'\u2021']
    demjson/strict, demjson/loose, jsonlib, simplejson: yes
    python-cjson: no, outputs: [u'\xe2\x80\xa1']
    python-json: no, outputs: ['\xe2\x80\xa1']

Test 2–7: from JSON ["\x20\x21"], expected Python: n/a
    demjson/strict, jsonlib, python-cjson, python-json, simplejson: yes: error
    demjson/loose: almost, outputs: [' !']

Test 2–8: from JSON ["\ud834\udd20"], expected Python [u'\U0001d120']
    demjson/strict, demjson/loose, jsonlib: yes
    python-cjson: no, outputs: [u'\xf0\x9d\x84\xa0']
    python-json: no, outputs: ['\xf0\x9d\x84\xa0']
    simplejson: no, outputs: [u'\ud834\udd20']

Test 2–9: from JSON ["«U+1D120»"], expected Python [u'\U0001d120']
    demjson/strict, demjson/loose, jsonlib, simplejson: yes
    python-cjson: no, outputs: [u'\xf0\x9d\x84\xa0']
    python-json: no, outputs: ['\xf0\x9d\x84\xa0']

Test 2–10: from JSON ["\ud834"], expected Python: n/a
    demjson/strict, demjson/loose, jsonlib: yes: error
    python-cjson, python-json, simplejson: no, outputs: [u'\ud834']

Test 2–11: from JSON ["\udd20"], expected Python: n/a
    demjson/strict, demjson/loose: yes: error
    jsonlib, python-cjson, python-json, simplejson: no, outputs: [u'\udd20']

Test 2–12: from JSON ["\U0001d120"], expected Python: n/a
    demjson/strict, jsonlib: yes: error
    demjson/loose: almost, outputs: ['U0001d120']
    python-cjson: no, outputs: ['\\\\U0001d120']
    python-json, simplejson: yes: error

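Tests 2–10 and 2–11 (and simplejson in test 2–8) show unpaired or un-combined surrogates leaking through into the decoded Python string. On a UCS-4 build a quick check for that condition is straightforward; this is a sketch for illustration, not part of the original tests:

def contains_surrogate(u):
    # Any code unit left in the surrogate range U+D800..U+DFFF means the
    # decoder either failed to combine a surrogate pair or let an unpaired
    # surrogate through.
    return any( 0xD800 <= ord(ch) <= 0xDFFF for ch in u )

# contains_surrogate(u'\ud834')      -> True   (cf. test 2-10)
# contains_surrogate(u'\U0001d120')  -> False  (a correctly decoded U+1D120)
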
The end.