This report is out-of-date.
The state of things has changed dramatically, for the better, since I first wrote this in early 2008. Although my test cases are still quite useful, any information regarding specific Python packages is likely to be inaccurate. I am leaving these pages here primarily for historical interest.
Dealing with whitespace in JSON
The JSON format generally allows any amount of whitespace before or after the basic lexemes (tokens) of the language, where the allowed whitespace consists of sequences of zero or more of any of the following four characters:
- Space (U+0020)
- Horizontal tab (U+0009)
- Line feed, or newline (U+000A)
- Carriage return (U+000D)
Whitespace is never required in JSON, except what you choose to put inside string literal values. So when generating JSON it is possible to produce a very tightly compacted one-line string (or zero-line if you will, as there is no need for a newline character at the end of the data). This compaction can be useful to save on bandwidth when transmitting JSON data in AJAX applications.
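As a rough illustration of the point above, the following sketch removes every insignificant whitespace character from a JSON text while leaving string literals untouched. The function name `strip_insignificant_whitespace` is hypothetical and not part of any module discussed here, and the sketch assumes its input is already valid JSON; it does not validate anything.

```python
import json

# The four characters JSON treats as whitespace.
JSON_WHITESPACE = " \t\n\r"


def strip_insignificant_whitespace(text):
    """Drop JSON whitespace that lies outside of string literals.

    Hypothetical helper; assumes `text` is valid JSON.
    """
    out = []
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            out.append(ch)          # everything inside a string is kept
            if escaped:
                escaped = False     # the character after a backslash is literal
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False   # closing quote
        elif ch == '"':
            in_string = True        # opening quote
            out.append(ch)
        elif ch not in JSON_WHITESPACE:
            out.append(ch)          # punctuation, digits, true/false/null
    return "".join(out)


loose = ' { "a b" : [ 1 ,\t2 ] ,\r\n "c" : "x \\" y" } '
tight = strip_insignificant_whitespace(loose)
print(tight)
```

Both the loose and the tight text decode to the same value, which is one way to check that only insignificant whitespace was removed.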
Whitespace in JavaScript:
It is worth pointing out that JavaScript allows a much larger variety
of whitespace than does JSON. Generally any character is considered to
be whitespace if it has the
Unicode category of Zs, Zl, or Zp;
or it is one of the control characters for horizontal tab,
line feed, vertical tab, carriage return, form feed, or next line. The following
are all of the whitespace characters:
| U+0009 | U+0085 | U+2002 | U+2008 | U+205F |
| U+000A | U+00A0 | U+2003 | U+2009 | U+3000 |
| U+000B | U+1680 | U+2004 | U+200A | |
| U+000C | U+180E | U+2005 | U+2028 | |
| U+000D | U+2000 | U+2006 | U+2029 | |
| U+0020 | U+2001 | U+2007 | U+202F | |
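A table like the one above can be regenerated from the Unicode database that ships with Python. The sketch below is only an approximation of the rule described: the exact result depends on the Unicode version bundled with the interpreter, so it may not match the table character for character (U+180E, for example, is no longer in category Zs in recent Unicode versions). The function name `javascript_whitespace` is made up for this example.

```python
import sys
import unicodedata

# Control characters named in the text: tab, line feed, vertical tab,
# form feed, carriage return, and next line.
CONTROL_WHITESPACE = {"\t", "\n", "\x0b", "\x0c", "\r", "\x85"}


def javascript_whitespace():
    """Return a sorted list of characters matching the rule above."""
    chars = set(CONTROL_WHITESPACE)
    for codepoint in range(sys.maxunicode + 1):
        ch = chr(codepoint)
        # Zs = space separator, Zl = line separator, Zp = paragraph separator
        if unicodedata.category(ch) in ("Zs", "Zl", "Zp"):
            chars.add(ch)
    return sorted(chars)


whitespace = javascript_whitespace()
print(" ".join("U+%04X" % ord(ch) for ch in whitespace))
```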
End of line: JSON needs no concept of an end-of-line, but JavaScript does, if only so that comments can be parsed correctly. However, some JSON modules may also wish to detect line endings in order to produce error messages with line numbers. In JavaScript the following sequences of whitespace characters are treated as end-of-line indicators (longer matches are tried first):
- U+000A U+000D
- U+000D U+000A
- U+000A
- U+000D
- U+0085
- U+2028
- U+2029
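To show how such end-of-line detection might feed into error messages, here is a minimal sketch that converts a character offset into a line and column number. It treats a line feed and carriage return pair, in either order, as a single line ending by trying the two-character sequences first. The name `line_and_column` is hypothetical; none of the tested modules is claimed to work this way.

```python
import re

# Two-character sequences come first so that the longest match wins.
EOL_PATTERN = re.compile("\r\n|\n\r|[\n\r\x85\u2028\u2029]")


def line_and_column(text, offset):
    """Return the 1-based (line, column) of `offset` within `text`."""
    line = 1
    line_start = 0
    for match in EOL_PATTERN.finditer(text):
        if match.end() > offset:
            break                   # this line ending is at or after the offset
        line += 1
        line_start = match.end()    # the next line begins after the ending
    return line, offset - line_start + 1


sample = "a\r\nb\u2028c\n\rd"
print(line_and_column(sample, sample.index("d")))
```

A parser could call such a helper only when it is about to raise an error, so the cost of scanning for line endings is not paid on successful parses.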
Converting Python to JSON
When producing JSON, different choices can be made in how whitespace is incorporated. There are two extremes: the first is to omit all whitespace for the most compact representation, and the latter is to introduce copious whitespace, primarily for indentation and pretty-printing purposes.
In the examples which follow, we use the following Python input data:
# Some arbitrary Python object used in the following examples
pydata = {'one': True,
          'three': ['red', 'yellow',
                    ['blue', 'azure', 'cobalt', 'teal'], 'orange'],
          'two': 19.5}
demjson
The demjson module can output JSON with two whitespace options:
- Compact (default): contains no whitespace, not even a newline at the end of the data.
- Pretty-printed: whitespace separates punctuation, and nested lists and objects are indented.
It has no further options or control over the generation of whitespace in the output.
demjson.encode( pydata )
{"one":true,"three":["red","yellow",["blue","azure","cobalt","teal"],"orange"],"two":19.5}
demjson.encode( pydata, compactly=False )
{ "one" : true, "three" : [ "red", "yellow", [ "blue", "azure", "cobalt", "teal" ], "orange" ], "two" : 19.5 }
jsonlib
The jsonlib module can output JSON with two whitespace options:
- almost-compact: only includes spaces to set off punctuation
- indented: a pretty-printed format where the caller controls the indentation amount
There is no way to create an optimally compact representation.
jsonlib.write( pydata )
{"three": ["red", "yellow", ["blue", "azure", "cobalt", "teal"], "orange"], "two": 19.5, "one": true}
jsonlib.write( pydata, indent=' ' )
{ "three": [ "red", "yellow", [ "blue", "azure", "cobalt", "teal" ], "orange" ], "two": 19.5, "one": true }
python-cjson
The python-cjson module has no options on how it emits whitespace. It generally adds spaces after punctuation, but does not perform pretty-printing.
cjson.encode( pydata )
{"three": ["red", "yellow", ["blue", "azure", "cobalt", "teal"], "orange"], "two": 19.5, "one": true}
python-json
The python-json module has no options on how it emits whitespace. It creates compact JSON, not adding any whitespace.
json.write( pydata )
{"three":["red","yellow",["blue","azure","cobalt","teal"],"orange"],"two":19.500000,"one":true}
simplejson
The simplejson module provides the most options on how it emits whitespace. It can output JSON compactly or in a pretty-printed, indented mode. The caller can control the amount of indentation used, and can also control whether spaces are added after punctuation.
simplejson.dumps( pydata )
{"three": ["red", "yellow", ["blue", "azure", "cobalt", "teal"], "orange"], "two": 19.5, "one": true}
simplejson.dumps( pydata, separators=(',',':') )
{"three":["red","yellow",["blue","azure","cobalt","teal"],"orange"],"two":19.5,"one":true}
simplejson.dumps( pydata, indent=4 )
{ "three": [ "red", "yellow", [ "blue", "azure", "cobalt", "teal" ], "orange" ], "two": 19.5, "one": true }
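For readers without any of these modules installed, the same three whitespace styles can be tried with the `json` module from the Python standard library, which accepts `separators` and `indent` arguments like those shown for simplejson. This is only a sketch using a different, newer module than the ones tested in this report; `sort_keys=True` is added here so that the output order is predictable.

```python
import json

pydata = {'one': True,
          'three': ['red', 'yellow',
                    ['blue', 'azure', 'cobalt', 'teal'], 'orange'],
          'two': 19.5}

# Default: a space after each comma and colon, all on one line.
default_text = json.dumps(pydata, sort_keys=True)

# Most compact: override the separators so that no whitespace is emitted.
compact_text = json.dumps(pydata, sort_keys=True, separators=(',', ':'))

# Pretty-printed: one item per line, nested items indented by four spaces.
indented_text = json.dumps(pydata, sort_keys=True, indent=4)

print(default_text)
print(compact_text)
print(indented_text)
```

All three strings decode to the same Python value; only the insignificant whitespace differs.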
Parsing whitespace in JSON input
All the tested modules correctly handle any amount of whitespace at any legal location; including at the beginning of the data or at the end.
However, only the demjson module, when operating in strict-mode, will reject any whitespace which is not one of the four JSON whitespace characters; for example, the presence of form feeds.
| Test# | Type of whitespace | demjson/strict | demjson/loose | jsonlib | python-cjson | python-json | simplejson |
|---|---|---|---|---|---|---|---|
| 1–1 | ws at start | yes | yes | yes | yes | yes | yes |
| 1–2 | ws at end | yes | yes | yes | yes | yes | yes |
| 1–3 | Tab U+0009 | yes | yes | yes | yes | yes | yes |
| 1–4 | Space U+0020 | yes | yes | yes | yes | yes | yes |
| 1–5 | LF U+000A | yes | yes | yes | yes | yes | yes |
| 1–6 | CR U+000D | yes | yes | yes | yes | yes | yes |
| 1–7 | VT U+000B | yes: error | yes: allows | yes: allows | yes: allows | yes: allows | yes: allows |
| 1–8 | FF U+000C | yes: error | yes: allows | yes: allows | yes: allows | yes: allows | yes: allows |
| 1–9 | NBSP U+00A0 | yes: error | yes: allows | yes: allows | yes: error | yes: error | yes: error |
| 1–10 | ENSP U+2002 | yes: error | yes: allows | yes: allows | yes: error | yes: error | yes: error |
| 1–11 | LS U+2028 | yes: error | no: error | yes: allows | yes: error | yes: error | yes: error |
| 1–12 | PS U+2029 | yes: error | no: error | yes: allows | yes: error | yes: error | yes: error |
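A test of this kind can be approximated with a small harness that places each candidate character between two tokens and records whether a given parser accepts it. The sketch below runs the harness against `json.loads` from the Python standard library, which is not one of the modules in the table, so its results should not be read as a correction of any column above. The names `CANDIDATES` and `whitespace_report` are invented for this example.

```python
import json

CANDIDATES = [
    ("Tab", "\t"), ("Space", " "), ("LF", "\n"), ("CR", "\r"),
    ("VT", "\x0b"), ("FF", "\x0c"), ("NBSP", "\u00a0"),
    ("ENSP", "\u2002"), ("LS", "\u2028"), ("PS", "\u2029"),
]


def whitespace_report(parse):
    """Map each candidate name to 'allows' or 'error' for the given parser."""
    report = {}
    for name, ch in CANDIDATES:
        text = "[1," + ch + "2]"        # candidate sits between two tokens
        try:
            accepted = parse(text) == [1, 2]
        except ValueError:              # json.JSONDecodeError is a ValueError
            accepted = False
        report[name] = "allows" if accepted else "error"
    return report


report = whitespace_report(json.loads)
for name, ch in CANDIDATES:
    print("%-5s U+%04X  %s" % (name, ord(ch), report[name]))
```

Any other decoder that raises `ValueError` on bad input could be passed in place of `json.loads`.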
JavaScript comments
Although it is not strict JSON, some modules allow parsing input that has a more JavaScript flavor. Of these, demjson and python-json handle JavaScript comments; demjson must be used in non-strict mode to do so.
JavaScript has two kinds of comments, both similar to C.
- A delimited comment includes anything between the characters /* ... */. These cannot nest.
- A line comment starts with // and continues to the end of the line.
Both demjson and python-json appear to correctly handle these comments according to
the JavaScript rules; with the exception that only demjson recognizes all Unicode end-of-line
characters for parsing the // comment, and not just linefeed or carriage-return.
| Test# | JavaScript comment | demjson/strict | demjson/loose | jsonlib | python-cjson | python-json | simplejson |
|---|---|---|---|---|---|---|---|
| 2–1 | /* ... */ | yes: error | yes: allows | yes: error | yes: error | yes: allows | yes: error |
| 2–2 | // ... | yes: error | yes: allows | yes: error | yes: error | yes: allows | yes: error |
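When a strict parser must be used on commented input, one possible workaround is to remove the comments first. The following is a minimal sketch of such a pre-pass, not a description of how demjson or python-json handle comments. The name `strip_js_comments` is hypothetical; the sketch only understands double-quoted strings, and it ends a // comment at any of the line terminators discussed earlier, replacing that terminator with a plain line feed.

```python
import json

LINE_TERMINATORS = "\n\r\x85\u2028\u2029"


def strip_js_comments(text):
    """Remove /* ... */ and // comments that lie outside string literals."""
    out = []
    i, n = 0, len(text)
    while i < n:
        if text[i] == '"':
            # Copy the whole string literal verbatim, honoring backslashes.
            j = i + 1
            while j < n and text[j] != '"':
                j += 2 if text[j] == "\\" else 1
            j = min(j + 1, n)           # include the closing quote
            out.append(text[i:j])
            i = j
        elif text.startswith("/*", i):
            end = text.find("*/", i + 2)
            if end == -1:
                raise ValueError("unterminated /* comment")
            out.append(" ")             # a comment separates tokens
            i = end + 2
        elif text.startswith("//", i):
            j = i + 2
            while j < n and text[j] not in LINE_TERMINATORS:
                j += 1
            out.append("\n")            # normalize the line ending
            i = j + 1
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


source = '{"url": "http://x/*y*/", /* note */ "n": 1 // trailing\u2028}'
print(json.loads(strip_js_comments(source)))
```

Note that comment-like text inside a string, such as the URL in the example, is left alone.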
Format control characters
In addition to whitespace, JavaScript (or more technically ECMAScript) allows any format control character to appear anywhere within the source text with no apparent effect. These special characters can even appear in the middle of a keyword such as true. This does not apply to JSON, in which a format control character should be treated like any ordinary character, and so should result in a parsing error unless it is inside a quoted string literal.
Only the demjson module, when operating in non-strict mode, allows any format control character to appear anywhere in the JSON input stream.
| Test# | JavaScript input | demjson/strict | demjson/loose | jsonlib | python-cjson | python-json | simplejson |
|---|---|---|---|---|---|---|---|
| 3–1 | format ctl char | yes: error | yes: allows | yes: error | yes: error | yes: error | yes: error |
A format control character is any Unicode character that has a general category of Cf.
There are currently about 138 such characters, including:
| U+00AD | U+202A | U+206F | U+E0020 | U+E002E | U+E003C | U+E004A | U+E0058 | U+E0066 | U+E0074 |
| U+0600 | U+202B | U+FEFF | U+E0021 | U+E002F | U+E003D | U+E004B | U+E0059 | U+E0067 | U+E0075 |
| U+0601 | U+202C | U+FFF9 | U+E0022 | U+E0030 | U+E003E | U+E004C | U+E005A | U+E0068 | U+E0076 |
| U+0602 | U+202D | U+FFFA | U+E0023 | U+E0031 | U+E003F | U+E004D | U+E005B | U+E0069 | U+E0077 |
| U+0603 | U+202E | U+FFFB | U+E0024 | U+E0032 | U+E0040 | U+E004E | U+E005C | U+E006A | U+E0078 |
| U+06DD | U+2060 | U+1D173 | U+E0025 | U+E0033 | U+E0041 | U+E004F | U+E005D | U+E006B | U+E0079 |
| U+070F | U+2061 | U+1D174 | U+E0026 | U+E0034 | U+E0042 | U+E0050 | U+E005E | U+E006C | U+E007A |
| U+17B4 | U+2062 | U+1D175 | U+E0027 | U+E0035 | U+E0043 | U+E0051 | U+E005F | U+E006D | U+E007B |
| U+17B5 | U+2063 | U+1D176 | U+E0028 | U+E0036 | U+E0044 | U+E0052 | U+E0060 | U+E006E | U+E007C |
| U+200B | U+206A | U+1D177 | U+E0029 | U+E0037 | U+E0045 | U+E0053 | U+E0061 | U+E006F | U+E007D |
| U+200C | U+206B | U+1D178 | U+E002A | U+E0038 | U+E0046 | U+E0054 | U+E0062 | U+E0070 | U+E007E |
| U+200D | U+206C | U+1D179 | U+E002B | U+E0039 | U+E0047 | U+E0055 | U+E0063 | U+E0071 | U+E007F |
| U+200E | U+206D | U+1D17A | U+E002C | U+E003A | U+E0048 | U+E0056 | U+E0064 | U+E0072 | |
| U+200F | U+206E | U+E0001 | U+E002D | U+E003B | U+E0049 | U+E0057 | U+E0065 | U+E0073 | |
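The list of format control characters can be recomputed with the `unicodedata` module from the Python standard library. This is only a sketch: the count it prints depends on the Unicode version bundled with the interpreter and will probably differ from the figure quoted above. The function name `format_control_characters` is made up for this example.

```python
import sys
import unicodedata


def format_control_characters():
    """Return every character whose Unicode general category is Cf."""
    return [chr(codepoint)
            for codepoint in range(sys.maxunicode + 1)
            if unicodedata.category(chr(codepoint)) == "Cf"]


cf_chars = format_control_characters()
print(len(cf_chars), "format control characters in Unicode",
      unicodedata.unidata_version)
print(" ".join("U+%04X" % ord(ch) for ch in cf_chars[:10]), "...")
```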

