Overhaul formatting, improve wording, and add one additional rule

2017-10-08 13:04:36 +01:00
parent c3fa7547dd
commit ee28af7806

349
README.md

@@ -1,221 +1,238 @@
# CSV Spec
CSV Spec 0.9.0-draft.0
======================
CSV is not a file format, it is typically a loose set of guidelines of how to
structure tabular data into a plain text string. As such there's an endless
amount of `*.csv` files floating around which are highly incompatible with
each other. The closest thing there is to a specification is [RFC 4180][].
Summary
-------
[rfc 4180]: http://tools.ietf.org/html/rfc4180
CSV is not a file format, it is a loose set of guidelines of how to structure
tabular data into a plain text string. As such there's an endless amount of
`*.csv` files floating around which are highly incompatible with each other. The
closest thing there is to a specification is [RFC
4180](http://tools.ietf.org/html/rfc4180).
## Goals
Goals
-----
This project is an attempt to summarize RFC 4180 and the information in the
[Comma-separated values (CSV)][csv] Wikipedia article into an easy to
understand format. The spec will also take into account that the comma (`,`)
character is not the only character used as a field delimiter. Semi-colons
(`;`), tabs (`\t`), and more are popular field delimiter characters. As such
the specification will more accurately be describing a CSV-like structured
data format.
[csv]: http://en.wikipedia.org/wiki/Comma-separated_values
[Comma-separated values
(CSV)](http://en.wikipedia.org/wiki/Comma-separated_values) Wikipedia article
into an easy to understand format. The spec will also take into account that the
comma (`,`) character is not the only character used as a field
delimiter. Semi-colons (`;`), tabs (`\t`), and more are popular field delimiter
characters. As such the specification will more accurately be describing a
CSV-like structured data format.
We will also provide input/output test files that CSV parser/writer software
libraries can use to validate if they properly adhere to the rules laid out in
this specification. And if possible we will even try to provide code snippets
in various languages that attempt to automatically determine the delimiter
this specification. And if possible we will even try to provide code snippets in
various languages that attempt to automatically determine the delimiter
character used in any given input CSV-like formatted file/data.
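
As a rough illustration of what such a delimiter detection snippet might look like, here is a minimal Python sketch. It is not code from this project: the function names `count_outside_quotes` and `guess_delimiter` and the candidate list are assumptions made for this example. The heuristic picks the candidate that occurs the same non-zero number of times on every line, ignoring occurrences inside double quotes. It splits on raw line breaks, so a sample containing quoted fields with embedded line breaks may be misjudged.

```python
# Hypothetical candidate list; a real tool may want more or fewer characters.
CANDIDATES = [",", ";", "\t", "|"]


def count_outside_quotes(line, delimiter):
    """Count delimiter occurrences that are not inside a double-quoted field."""
    count = 0
    in_quotes = False
    for ch in line:
        if ch == '"':
            # An escaped quote ("") toggles twice, leaving the state unchanged.
            in_quotes = not in_quotes
        elif ch == delimiter and not in_quotes:
            count += 1
    return count


def guess_delimiter(sample, candidates=CANDIDATES):
    """Return the most plausible delimiter in `sample`, or None if unclear."""
    # Normalize CRLF and CR to LF, then drop empty lines.
    text = sample.replace("\r\n", "\n").replace("\r", "\n")
    lines = [line for line in text.split("\n") if line]

    best, best_count = None, 0
    for delimiter in candidates:
        counts = {count_outside_quotes(line, delimiter) for line in lines}
        # A plausible delimiter appears equally often on every line.
        if len(counts) == 1:
            (per_line,) = counts
            if per_line > best_count:
                best, best_count = delimiter, per_line
    return best
```

For a sample such as `aaa;bbb;ccc` followed by `xxx;"y, yy";zzz`, the comma inside the quoted field is ignored and the semi-colon is chosen. When no candidate is consistent across lines the function returns `None` rather than guessing.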
Roadmap
-------
## Roadmap
1. Write up core specification rules. _[in-progress - 1st draft]_
1. Write up core specification rules. _[in-progress]_
2. Create input/output test files covering all rules in the specification.
3. Create website for [csv-spec.org](http://csv-spec.org/).
4. Create linting tool as an NPM module, allowing easy validation of CSV
data both client-side in a web browser, and server side via a command line
tool.
4. Create linting tool as an NPM module, allowing easy validation of CSV data
both client-side in a web browser, and server side via a command line tool.
5. Create automatic delimiter character detection code snippets in various
programming languages which CSV parser developers can freely use to enhance
their libraries.
## Terminology
Terminology
-----------
- **Field** — A singular String value within a record.
- **Record** (or **Row**) — A collection of fields. This is often referred to
as a "line", but a single record can span multiple text lines if a field
within it contains one or more line breaks.
- **Delimiter** — The character used to separate fields within a
row. Commonly this will be a comma (`,`), but semi-colons (`;`) or tabs
(`\t`) are two other popular delimiter characters.
- **Record** (or **Row**) — A collection of fields. This is often referred to as
a "line", but a single record can span multiple text lines if a field within
it contains one or more line breaks.
- **Delimiter** — The character used to separate fields within a row. Commonly
this will be a comma (`,`), but semi-colons (`;`) or tabs (`\t`) are two other
popular delimiter characters.
- **Header** — The first row is often used to contain the column names for all
remaining rows. Header names would be used as key names when CSV data is
converted to JSON for example.
- **Line Break** — Line breaks in CSV files can be CRLF (`\r\n`), LF (`\n`),
and even in rare cases CR (`\r`).
- **LF, CR, and CRLF** — Different types of line breaks, typically determined
by the OS. Linux, OSX, and other *NIX operating systems generally use a line
feed (LF or `\n`) character. Windows uses a carriage return (CR or `\r`) and
a line feed character, effectively "CRLF" (`\r\n`).
- **Line Break** — Line breaks in CSV files can be CRLF (`\r\n`), LF (`\n`), and
even in rare cases CR (`\r`).
- **LF, CR, and CRLF** — Different types of line breaks, typically determined by
the OS. Linux, OSX, and other *NIX operating systems generally use a line feed
(LF or `\n`) character. Windows uses a carriage return (CR or `\r`) and a line
feed character, effectively "CRLF" (`\r\n`).
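
To make the three line break types concrete, the following Python sketch reports which styles occur in a piece of text. The name `line_break_styles` is hypothetical and chosen only for this example. It looks at raw text and does not distinguish line breaks inside quoted fields from record separators.

```python
import re

# CRLF is listed first so that "\r\n" is matched as one unit, not as CR then LF.
LINE_BREAK = re.compile(r"\r\n|\r|\n")

NAMES = {"\r\n": "CRLF", "\n": "LF", "\r": "CR"}


def line_break_styles(text):
    """Return the set of line break styles ("CRLF", "LF", "CR") found in text."""
    return {NAMES[match.group()] for match in LINE_BREAK.finditer(text)}
```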
## CSV Format Definition
CSV Format Specification
------------------------
- These rules are mostly based on the corresponding section from
[RFC 4180][def], with minor changes, clarifications and improved examples.
- Where relevant, examples include both the CSV text version and the
equivalent data in JSON format.
- Line breaks in the CSV examples are displayed using the `¬` character.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
interpreted as described in [RFC 2119](https://tools.ietf.org/html/rfc2119).
[def]: http://tools.ietf.org/html/rfc4180#section-2
These rules are mostly based on the corresponding section from [RFC
4180](http://tools.ietf.org/html/rfc4180#section-2), with minor changes,
clarifications and improved examples.
### Rules
1. Each record starts at the beginning of its own line, and ends with a line
break (shown as `¬`).
1. Each record starts at the beginning of its own line, and ends with a line
break (shown as `¬`).
CSV:
CSV:
```csv
aaa,bbb,ccc¬
xxx,yyy,zzz¬
```
```csv
aaa,bbb,ccc¬
xxx,yyy,zzz¬
```
JSON:
JSON:
```json
[ ["aaa", "bbb", "ccc"],
["xxx", "yyy", "zzz"] ]
```
```json
[ ["aaa", "bbb", "ccc"],
["xxx", "yyy", "zzz"] ]
```
2. Though it is RECOMMENDED, the last record in a file is not required to have an
ending line break.
2. Though it is recommended, the last record in a file is not required to
have an ending line break.
CSV:
CSV:
```csv
aaa,bbb,ccc¬
xxx,yyy,zzz
```
```csv
aaa,bbb,ccc¬
xxx,yyy,zzz
```
JSON:
JSON:
```json
[ ["aaa", "bbb", "ccc"],
["xxx", "yyy", "zzz"] ]
```
```json
[ ["aaa", "bbb", "ccc"],
["xxx", "yyy", "zzz"] ]
```
3. There may be an OPTIONAL header line appearing as the first line of the file
with the same format as normal records. This header will contain names
corresponding to the fields in the file, and MUST contain the same number of
fields as the records in the rest of the file.
3. There may be an optional header line appearing as the first line of the
file with the same format as normal records. This header will contain
names corresponding to the fields in the file, and must contain the same
number of fields as the records in the rest of the file.
CSV:
CSV:
```csv
field_1,field_2,field_3¬
aaa,bbb,ccc¬
xxx,yyy,zzz¬
```
```csv
field_1,field_2,field_3¬
aaa,bbb,ccc¬
xxx,yyy,zzz¬
```
JSON (ignoring headers):
JSON (ignoring headers):
```json
[ ["field_1", "field_2", "field_3"],
["aaa", "bbb", "ccc"],
["xxx", "yyy", "zzz"] ]
```
```json
[ ["field_1", "field_2", "field_3"],
["aaa", "bbb", "ccc"],
["xxx", "yyy", "zzz"] ]
```
JSON (using headers):
JSON (using headers):
```json
[ {"field_1": "aaa", "field_2": "bbb", "field_3": "ccc"},
{"field_1": "xxx", "field_2": "yyy", "field_3": "zzz"} ]
```
```json
[ {"field_1": "aaa", "field_2": "bbb", "field_3": "ccc"},
{"field_1": "xxx", "field_2": "yyy", "field_3": "zzz"} ]
```
4. Within each record and the OPTIONAL header, there may be one or more fields,
separated by a delimiter (normally a comma). Each record MUST contain the
same number of fields throughout the file.
4. Within each record and the optional header, there may be one or more
fields, separated by a delimiter (normally a comma). Each record should
contain the same number of fields throughout the file.
CSV (invalid):
CSV (invalid):
```csv
aaa,bbb,ccc¬
111,222,333,444¬
xxx,yyy,zzz¬
```
```csv
aaa,bbb,ccc¬
111,222,333,444¬
xxx,yyy,zzz¬
```
5. The last field in a record MUST NOT be followed by a comma. This results in an
additional field with nothing in it.
5. The last field in the record must not be followed by a comma. This results
in an additional field with nothing in it.
CSV:
CSV:
```csv
aaa,bbb,ccc,¬
xxx,yyy,zzz,¬
```
```csv
aaa,bbb,ccc,¬
xxx,yyy,zzz,¬
```
JSON:
JSON:
```json
[ ["aaa", "bbb", "ccc", ""],
["xxx", "yyy", "zzz", ""] ]
```
```json
[ ["aaa", "bbb", "ccc", ""],
["xxx", "yyy", "zzz", ""] ]
```
6. Spaces are considered part of a field and MUST NOT be ignored.
6. Spaces are considered part of a field and should not be ignored. For
example:
CSV:
CSV:
```csv
aaa , bbb , ccc¬
xxx, yyy ,zzz ¬
```
```csv
aaa , bbb , ccc¬
xxx, yyy ,zzz ¬
```
JSON:
JSON:
```json
[ ["aaa ", " bbb ", " ccc"],
[" xxx", " yyy ", "zzz "] ]
```
```json
[ ["aaa ", " bbb ", " ccc"],
[" xxx", " yyy ", "zzz "] ]
```
7. Fields containing line breaks (CRLF, LF, or CR), double quotes, or the
delimiter character (normally a comma) MUST be enclosed in double-quotes.
7. Fields containing line breaks (CRLF, LF, or CR), double quotes, or the
delimiter character (normally a comma) must be enclosed in double-quotes.
CSV:
CSV:
```csv
aaa,"b¬
bb",ccc¬
xxx,"y, yy",zzz¬
```
```csv
aaa,"b¬
bb",ccc¬
xxx,"y, yy",zzz¬
```
JSON:
JSON:
```json
[ ["aaa", "b\r\nbb", "ccc"],
["xxx", "y, yy", "zzz"] ]
```
```json
[ ["aaa", "b\r\nbb", "ccc"],
["xxx", "y, yy", "zzz"] ]
```
8. A double-quote appearing inside a field MUST be escaped by preceding it with
another double quote, and the field itself MUST be enclosed in double quotes.
8. A double-quote appearing inside a field must be escaped by preceding it
with another double quote, and the field itself must be enclosed in double
quotes.
CSV:
CSV:
```csv
aaa,"b""bb",ccc¬
```
```csv
aaa,"b""bb",ccc¬
```
JSON:
JSON:
```json
[ ["aaa", "b\"bb", "ccc"] ]
```
```json
[ ["aaa", "b\"bb", "ccc"] ]
```
9. When a field enclosed in double quotes has spaces before and/or after the
double quotes, the spaces MUST be ignored, as the field starts and ends with
the double quotes. However this is considered invalid formatting and the CSV
parser SHOULD report some form of warning message.
9. Though it is not recommended, each field may be enclosed in double quotes
even if it does not contain a line break, double quote, or delimiter
character.
CSV:
```csv
aaa,bbb,ccc¬
xxx, "y, yy" ,zzz¬
```
JSON:
```json
[ ["aaa", "bbb", "ccc"],
["xxx", "y, yy", "zzz"] ]
```
10. It is possible to enclose every field in double quotes even if they don't
need to be enclosed. However it is RECOMMENDED to only enclose fields in
double quotes that require it.
CSV:
@@ -231,12 +248,12 @@ character used in any given input CSV-like formatted file/data.
["xxx", "yyy", "zzz"] ]
```
10. All fields are always strings. CSV itself does not support type casting to
11. All fields are always strings. CSV itself does not support type casting to
integers, floats, booleans, or anything else. It is not a CSV library's
responsibility to type cast input CSV data.
If type casting is required, it is up to the developer using a specific
CSV library to ensure types are correctly dealt with.
If type casting is required, it is up to the developer using a specific CSV
library to ensure types are correctly dealt with.
Input JSON:
@@ -262,21 +279,21 @@ character used in any given input CSV-like formatted file/data.
At this point it is up to the developer themselves to type cast the above
output data from the CSV parser.
11. However, when rendering type cast input data to CSV text, non-string
types should be converted to a string in such a way that minimal
information is lost.
- Integers and floats should simply be rendered as a string version
of themselves.
- Booleans `true` and `false` should be rendered as `true` and `false`
12. However, when rendering type cast input data to CSV text, non-string types
MUST be converted to a string in such a way that minimal information is
lost.
- Integers and floats MUST be rendered as a string version of themselves.
- Booleans `true` and `false` MUST be rendered as `true` and `false`
strings, not as `1` or `0` numbers. If numbers are used the resulting
CSV data is indistinguishable from actual integer numbers.
- Null/Nil values should be rendered as empty strings.
- `Null`/`nil` values MUST be rendered as empty strings.
12. All forms of line breaks (CRLF, LF, and CR) should be supported when
parsing input CSV data. When rendering output CSV data, CRLF should be
used for line breaks to ensure maximum cross-platform compatibility.
13. When parsing input CSV data all forms of line breaks (CRLF, LF, and CR) MUST
be supported.
14. When rendering output CSV data, CRLF MUST be used for line breaks to ensure
maximum cross-platform compatibility.
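
The rendering rules above (quoting only where required, doubling embedded double quotes, string conversion of non-string types, and CRLF line endings) can be sketched in a few lines of Python. This is an illustrative sketch, not a reference implementation of the specification; the function names `to_text`, `quote_if_needed`, and `render_csv` are assumptions made for this example.

```python
def to_text(value):
    """Convert a value to its CSV string form with minimal information loss."""
    if value is None:
        return ""  # Null/nil values become empty strings.
    if value is True:
        return "true"  # Booleans become "true"/"false", not 1/0.
    if value is False:
        return "false"
    return str(value)  # Integers, floats, and strings render as themselves.


def quote_if_needed(text, delimiter=","):
    """Enclose a field in double quotes only when its content requires it."""
    if any(token in text for token in ('"', "\r", "\n", delimiter)):
        # Escape each embedded double quote by preceding it with another one.
        return '"' + text.replace('"', '""') + '"'
    return text


def render_csv(rows, delimiter=","):
    """Render rows (lists of values) as CSV text using CRLF line breaks."""
    lines = []
    for row in rows:
        fields = [quote_if_needed(to_text(value), delimiter) for value in row]
        lines.append(delimiter.join(fields) + "\r\n")
    return "".join(lines)
```

For example, `render_csv([["aaa", 'b"bb', "ccc"]])` produces the record `aaa,"b""bb",ccc` followed by CRLF, matching the escaping example in the rules.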
## License
License
-------
[CC0 1.0 Universal](http://creativecommons.org/publicdomain/zero/1.0/)