From 29459df76fca3122c6605e296944236f4b1ef38e Mon Sep 17 00:00:00 2001 From: Jim Myhrberg Date: Wed, 1 Apr 2015 20:27:27 +0100 Subject: [PATCH 01/20] Add first rules with experimental example format --- README.md | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/README.md b/README.md index 8be144c..3829fe7 100644 --- a/README.md +++ b/README.md @@ -56,8 +56,22 @@ character used in any given input CSV-like formatted file/data. examples the `¬` character will be used to visually display line breaks. +## Rules +1. Each record is located on a separate line, each line ending with CRLF + (`\r\n`). For example: + CSV: + + aaa,bbb,ccc¬ + xxx,yyy,zzz¬ + + JSON: + + [ + ["aaa", "bbb", "ccc"], + ["xxx", "yyy", "zzz"] + ] ## License From d9435d35ef3a78505f2781fbbac7245335b2e328 Mon Sep 17 00:00:00 2001 From: Jim Myhrberg Date: Wed, 1 Apr 2015 20:31:02 +0100 Subject: [PATCH 02/20] Change markdown formatting --- README.md | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index 3829fe7..9fed0d1 100644 --- a/README.md +++ b/README.md @@ -63,15 +63,19 @@ character used in any given input CSV-like formatted file/data. CSV: - aaa,bbb,ccc¬ - xxx,yyy,zzz¬ + ```csv + aaa,bbb,ccc¬ + xxx,yyy,zzz¬ + ``` JSON: - [ - ["aaa", "bbb", "ccc"], - ["xxx", "yyy", "zzz"] - ] + ```json + [ + ["aaa", "bbb", "ccc"], + ["xxx", "yyy", "zzz"] + ] + ``` ## License From ad4773df8071461384acc02e950d723eb4dd9ac4 Mon Sep 17 00:00:00 2001 From: Jim Myhrberg Date: Wed, 1 Apr 2015 20:48:30 +0100 Subject: [PATCH 03/20] Update formatting and add rule 2. --- README.md | 44 ++++++++++++++++++++++++++++---------------- 1 file changed, 28 insertions(+), 16 deletions(-) diff --git a/README.md b/README.md index 9fed0d1..25605ad 100644 --- a/README.md +++ b/README.md @@ -52,30 +52,42 @@ character used in any given input CSV-like formatted file/data. - **Header** — The first row is often used to contain the column names for all remaining rows. 
Header names would be used as key names when CSV data is converted to JSON for example. -- **Line Break** — Line breaks in CSV files should be CRLF (`\r\n`). In - examples the `¬` character will be used to visually display line breaks. +- **Line Break** — Line breaks in CSV files should be CRLF (`\r\n`). ## Rules -1. Each record is located on a separate line, each line ending with CRLF - (`\r\n`). For example: +_Where relevant examples include the CSV text version and the equivalent data +in JSON format. Line breaks in the CSV examples are displayed as `¬`._ - CSV: +1. Each record is located on a separate line, each line ending with CRLF + (`\r\n`). For example: - ```csv - aaa,bbb,ccc¬ - xxx,yyy,zzz¬ - ``` + CSV: - JSON: + ```csv + aaa,bbb,ccc¬ + xxx,yyy,zzz¬ + ``` - ```json - [ - ["aaa", "bbb", "ccc"], - ["xxx", "yyy", "zzz"] - ] - ``` + JSON: + + ```json + [ + ["aaa", "bbb", "ccc"], + ["xxx", "yyy", "zzz"] + ] + ``` + +2. Though recommended, the last record in a file is not required to have a + ending line break. For example: + + CSV: + + ```csv + aaa,bbb,ccc¬ + xxx,yyy,zzz + ``` ## License From 3eb3540afc28aae0d1d032842fae62ec38e88d1e Mon Sep 17 00:00:00 2001 From: Jim Myhrberg Date: Wed, 1 Apr 2015 20:53:47 +0100 Subject: [PATCH 04/20] Update example JSON formatting --- README.md | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 25605ad..79a3838 100644 --- a/README.md +++ b/README.md @@ -73,10 +73,8 @@ in JSON format. Line breaks in the CSV examples are displayed as `¬`._ JSON: ```json - [ - ["aaa", "bbb", "ccc"], - ["xxx", "yyy", "zzz"] - ] + [ ["aaa", "bbb", "ccc"], + ["xxx", "yyy", "zzz"] ] ``` 2. Though recommended, the last record in a file is not required to have a @@ -88,6 +86,13 @@ in JSON format. 
Line breaks in the CSV examples are displayed as `¬`._ aaa,bbb,ccc¬ xxx,yyy,zzz ``` + + JSON: + + ```json + [ ["aaa", "bbb", "ccc"], + ["xxx", "yyy", "zzz"] ] + ``` ## License From 2a41c5e04222a9e77ad900a22a78c0de269e29ec Mon Sep 17 00:00:00 2001 From: Jim Myhrberg Date: Wed, 1 Apr 2015 22:17:18 +0100 Subject: [PATCH 05/20] Add rule 3 --- README.md | 43 +++++++++++++++++++++++++++++++++++++++---- 1 file changed, 39 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 79a3838..0393339 100644 --- a/README.md +++ b/README.md @@ -55,10 +55,17 @@ character used in any given input CSV-like formatted file/data. - **Line Break** — Line breaks in CSV files should be CRLF (`\r\n`). -## Rules +## CSV Format Definition -_Where relevant examples include the CSV text version and the equivalent data -in JSON format. Line breaks in the CSV examples are displayed as `¬`._ +- The rules are mostly based on the corresponding section from + [RFC 4180][def], with minor changes, clarifications and improved examples. +- Where relevant, examples include both the CSV text version and the + equivalent data in JSON format. +- Line breaks in the CSV examples are displayed as `¬`. + +[def]: http://tools.ietf.org/html/rfc4180#section-2 + +### Rules 1. Each record is located on a separate line, each line ending with CRLF (`\r\n`). For example: @@ -86,7 +93,7 @@ in JSON format. Line breaks in the CSV examples are displayed as `¬`._ aaa,bbb,ccc¬ xxx,yyy,zzz ``` - + JSON: ```json @@ -94,6 +101,34 @@ in JSON format. Line breaks in the CSV examples are displayed as `¬`._ ["xxx", "yyy", "zzz"] ] ``` +3. There maybe an optional header line appearing as the first line of the + file with the same format as normal record lines. This header will contain + names corresponding to the fields in the file and should contain the same + number of fields as the records in the rest of the file. 
For example: + + ```csv + field_1,field_2,field_3¬ + aaa,bbb,ccc¬ + xxx,yyy,zzz¬ + ``` + + JSON (ignoring headers): + + ```json + [ ["field_1", "field_2", "field_3"], + ["aaa", "bbb", "ccc"], + ["xxx", "yyy", "zzz"] ] + ``` + + JSON (using headers): + + ```json + [ {"field_1": "aaa", "field_2": "bbb", "field_3": "ccc"}, + {"field_1": "xxx", "field_2": "yyy", "field_3": "zzz"} ] + ``` + + + ## License From 79ec249ab8fadb56753d3c6e29cdea1f2217e383 Mon Sep 17 00:00:00 2001 From: Jim Myhrberg Date: Wed, 1 Apr 2015 22:20:24 +0100 Subject: [PATCH 06/20] See how it looks when specific terms are made bold --- README.md | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/README.md b/README.md index 0393339..2f385d2 100644 --- a/README.md +++ b/README.md @@ -57,7 +57,7 @@ character used in any given input CSV-like formatted file/data. ## CSV Format Definition -- The rules are mostly based on the corresponding section from +- These rules are mostly based on the corresponding section from [RFC 4180][def], with minor changes, clarifications and improved examples. - Where relevant, examples include both the CSV text version and the equivalent data in JSON format. @@ -67,8 +67,8 @@ character used in any given input CSV-like formatted file/data. ### Rules -1. Each record is located on a separate line, each line ending with CRLF - (`\r\n`). For example: +1. Each **record** is located on a separate line, each line ending with a + **line break** (CRLF or `\r\n`). For example: CSV: @@ -85,7 +85,7 @@ character used in any given input CSV-like formatted file/data. ``` 2. Though recommended, the last record in a file is not required to have a - ending line break. For example: + ending **line break**. For example: CSV: @@ -101,10 +101,11 @@ character used in any given input CSV-like formatted file/data. ["xxx", "yyy", "zzz"] ] ``` -3. There maybe an optional header line appearing as the first line of the - file with the same format as normal record lines. 
This header will contain - names corresponding to the fields in the file and should contain the same - number of fields as the records in the rest of the file. For example: +3. There maybe an optional **header** line appearing as the first line of the + file with the same format as normal **record** lines. This **header** will + contain names corresponding to the **fields** in the file and should + contain the same number of **fields** as the **records** in the rest of + the file. For example: ```csv field_1,field_2,field_3¬ From 9c0056d1ffc0c14bb3ecae924a0ae980eb100b85 Mon Sep 17 00:00:00 2001 From: Jim Myhrberg Date: Wed, 1 Apr 2015 22:22:15 +0100 Subject: [PATCH 07/20] Making certain terminology words bold was a horrible idea >_< --- README.md | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/README.md b/README.md index 2f385d2..a919510 100644 --- a/README.md +++ b/README.md @@ -67,8 +67,8 @@ character used in any given input CSV-like formatted file/data. ### Rules -1. Each **record** is located on a separate line, each line ending with a - **line break** (CRLF or `\r\n`). For example: +1. Each record is located on a separate line, each line ending with a line + break (CRLF). For example: CSV: @@ -85,7 +85,7 @@ character used in any given input CSV-like formatted file/data. ``` 2. Though recommended, the last record in a file is not required to have a - ending **line break**. For example: + ending line break. For example: CSV: @@ -101,11 +101,10 @@ character used in any given input CSV-like formatted file/data. ["xxx", "yyy", "zzz"] ] ``` -3. There maybe an optional **header** line appearing as the first line of the - file with the same format as normal **record** lines. This **header** will - contain names corresponding to the **fields** in the file and should - contain the same number of **fields** as the **records** in the rest of - the file. For example: +3. 
There maybe an optional header line appearing as the first line of the + file with the same format as normal record lines. This header will contain + names corresponding to the fields in the file and should contain the same + number of fields as the records in the rest of the file. For example: ```csv field_1,field_2,field_3¬ From d5abebbc727a5ab30976ee1951d242cb285edaa8 Mon Sep 17 00:00:00 2001 From: Jim Myhrberg Date: Wed, 1 Apr 2015 22:23:38 +0100 Subject: [PATCH 08/20] Make example headers italic --- README.md | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index a919510..b80b76b 100644 --- a/README.md +++ b/README.md @@ -70,14 +70,14 @@ character used in any given input CSV-like formatted file/data. 1. Each record is located on a separate line, each line ending with a line break (CRLF). For example: - CSV: + _CSV:_ ```csv aaa,bbb,ccc¬ xxx,yyy,zzz¬ ``` - JSON: + _JSON:_ ```json [ ["aaa", "bbb", "ccc"], @@ -87,14 +87,14 @@ character used in any given input CSV-like formatted file/data. 2. Though recommended, the last record in a file is not required to have a ending line break. For example: - CSV: + _CSV:_ ```csv aaa,bbb,ccc¬ xxx,yyy,zzz ``` - JSON: + _JSON:_ ```json [ ["aaa", "bbb", "ccc"], @@ -106,13 +106,15 @@ character used in any given input CSV-like formatted file/data. names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file. For example: + _CSV:_ + ```csv field_1,field_2,field_3¬ aaa,bbb,ccc¬ xxx,yyy,zzz¬ ``` - JSON (ignoring headers): + _JSON (ignoring headers):_ ```json [ ["field_1", "field_2", "field_3"], @@ -120,7 +122,7 @@ character used in any given input CSV-like formatted file/data. 
["xxx", "yyy", "zzz"] ] ``` - JSON (using headers): + _JSON (using headers):_ ```json [ {"field_1": "aaa", "field_2": "bbb", "field_3": "ccc"}, From a44dd3dfc15e9a70e6f89fc67860d2ae02fd6a44 Mon Sep 17 00:00:00 2001 From: Jim Myhrberg Date: Wed, 1 Apr 2015 23:16:15 +0100 Subject: [PATCH 09/20] Update terminology --- README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index b80b76b..ed33a1a 100644 --- a/README.md +++ b/README.md @@ -42,10 +42,10 @@ character used in any given input CSV-like formatted file/data. ## Terminology -- **Field** — A singular String value within a row. -- **Record** (or **Row**) — A collection of fields. -- **Column** — Fields from multiple rows at the same offset. For example the - second column would be a list of the second field from every row. +- **Field** — A singular String value within a record. +- **Record** (or **Row**) — A collection of fields. This is often referred to + as a "line", but a single record can in span multiple text lines if a field + within it contains one or more line breaks. - **Delimiter** — The character used to separate fields withing a row. Commonly this will be a comma (`,`), but semi-colons (`;`) or tabs (`\t`) are two other popular delimiter characters. From da639a1da6e424f900065885e0a9cbb4cad16800 Mon Sep 17 00:00:00 2001 From: Jim Myhrberg Date: Wed, 1 Apr 2015 23:16:29 +0100 Subject: [PATCH 10/20] Initial draft of all CSV format definition rules --- README.md | 152 ++++++++++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 142 insertions(+), 10 deletions(-) diff --git a/README.md b/README.md index ed33a1a..cedccac 100644 --- a/README.md +++ b/README.md @@ -61,7 +61,7 @@ character used in any given input CSV-like formatted file/data. [RFC 4180][def], with minor changes, clarifications and improved examples. - Where relevant, examples include both the CSV text version and the equivalent data in JSON format. 
-- Line breaks in the CSV examples are displayed as `¬`. +- Line breaks in the CSV examples are displayed using the `¬` character. [def]: http://tools.ietf.org/html/rfc4180#section-2 @@ -70,31 +70,31 @@ character used in any given input CSV-like formatted file/data. 1. Each record is located on a separate line, each line ending with a line break (CRLF). For example: - _CSV:_ + CSV: ```csv aaa,bbb,ccc¬ xxx,yyy,zzz¬ ``` - _JSON:_ + JSON: ```json [ ["aaa", "bbb", "ccc"], ["xxx", "yyy", "zzz"] ] ``` -2. Though recommended, the last record in a file is not required to have a - ending line break. For example: +2. Though it is recommended, the last record in a file is not required to + have a ending line break. For example: - _CSV:_ + CSV: ```csv aaa,bbb,ccc¬ xxx,yyy,zzz ``` - _JSON:_ + JSON: ```json [ ["aaa", "bbb", "ccc"], @@ -106,7 +106,7 @@ character used in any given input CSV-like formatted file/data. names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file. For example: - _CSV:_ + CSV: ```csv field_1,field_2,field_3¬ @@ -114,7 +114,7 @@ character used in any given input CSV-like formatted file/data. xxx,yyy,zzz¬ ``` - _JSON (ignoring headers):_ + JSON (ignoring headers): ```json [ ["field_1", "field_2", "field_3"], @@ -122,14 +122,146 @@ character used in any given input CSV-like formatted file/data. ["xxx", "yyy", "zzz"] ] ``` - _JSON (using headers):_ + JSON (using headers): ```json [ {"field_1": "aaa", "field_2": "bbb", "field_3": "ccc"}, {"field_1": "xxx", "field_2": "yyy", "field_3": "zzz"} ] ``` +4. Within each record and the header, there may be one or more fields, + separated by a delimiter (normally a comma). Each record should contain + the same number of fields throughout the file. For example: + CSV (invalid): + + ```csv + aaa,bbb,ccc¬ + 111,222,333,444¬ + xxx,yyy,zzz¬ + ``` + +5. The last field in the record must not be followed by a comma. 
This results + in a additional field with nothing in it. For example: + + CSV: + + ```csv + aaa,bbb,ccc,¬ + xxx,yyy,zzz,¬ + ``` + + JSON: + + ```json + [ ["aaa", "bbb", "ccc", ""], + ["xxx", "yyy", "zzz", ""] ] + ``` + +6. Spaces are considered part of a field and should not be ignored. For + example: + + CSV: + + ```csv + aaa , bbb , ccc¬ + xxx, yyy ,zzz ¬ + ``` + + JSON: + + ```json + [ ["aaa ", " bbb ", " ccc"], + [" xxx", " yyy ", "zzz "] ] + ``` + +7. Fields containing line breaks, double quotes, or the delimiter character + (normally a comma) must be enclosed in double-quotes. For example: + + CSV: + + ```csv + aaa,"b¬ + bb",ccc¬ + xxx,"y, yy",zzz¬ + ``` + + JSON: + + ```json + [ ["aaa", "b\r\nbb", "ccc"], + ["xxx", "y, yy", "zzz"] ] + ``` + +8. If double-quotes are used to enclose fields, then a double-quote appearing + inside a field must be escaped by preceding it with another double quote. + For example: + + CSV: + + ```csv + aaa,"b""bb",ccc¬ + ``` + + JSON: + + ```json + [ ["aaa", "b\"bb", "ccc"] ] + ``` + +9. Though it is not recommended, each field may be enclosed in double quotes + even if it does not contain a line break, double quote, or delimiter + character. For example: + + CSV: + + ```csv + "aaa","bbb","ccc"¬ + "xxx",yyy,zzz¬ + ``` + + JSON: + + ```json + [ ["aaa", "bbb", "ccc"], + ["xxx", "yyy", "zzz"] ] + ``` + +10. All fields are always strings. CSV itself does not support type casting to + integers, floats, booleans, or anything else. If type casting is required, + it is be up to the developer using a specific CSV library to ensure types + are correctly dealt with. It is not the responsibility of the CSV + parsing/writing library itself. For example: + + Input JSON: + + ```json + [ [10, true, 0.3, "aaa"], + [11, false, 2.13, "bbb"] ] + ``` + + Output CSV: + + ```csv + 10,true,0.3,aaa¬ + 11,false,2.13,bbb¬ + ``` + + Output CSV parsed back to JSON: + + ```json + [ ["10", "true", "0.3", "aaa"], + ["11", "false", "2.13", "bbb"] ] + ``` + +11. 
When rendering output CSV data, non-string types should be converted to a + string in such a way that minimal information is lost. For example: + - Integers and floats should simply be rendered as a string version + of themselves. + - Booleans `true` and `false` should be rendered as `true` and `false` + strings, not as `1` or `0` numbers. If numbers are used the resulting + CSV data is indistinguishable from actual integer numbers. + - Null/Nil values should be rendered as empty strings. ## License From b46688b53488cc807f9b04dc483418c6e9e68056 Mon Sep 17 00:00:00 2001 From: Jim Myhrberg Date: Wed, 1 Apr 2015 23:29:31 +0100 Subject: [PATCH 11/20] Tweaks to wording --- README.md | 37 +++++++++++++++++++++---------------- 1 file changed, 21 insertions(+), 16 deletions(-) diff --git a/README.md b/README.md index cedccac..e0b2a74 100644 --- a/README.md +++ b/README.md @@ -67,7 +67,7 @@ character used in any given input CSV-like formatted file/data. ### Rules -1. Each record is located on a separate line, each line ending with a line +1. Each record starts at the beginning of its own line, and ends with a line break (CRLF). For example: CSV: @@ -101,9 +101,9 @@ character used in any given input CSV-like formatted file/data. ["xxx", "yyy", "zzz"] ] ``` -3. There maybe an optional header line appearing as the first line of the - file with the same format as normal record lines. This header will contain - names corresponding to the fields in the file and should contain the same +3. There may be an optional header line appearing as the first line of the + file with the same format as normal records. This header will contain + names corresponding to the fields in the file, and must contain the same number of fields as the records in the rest of the file. For example: CSV: @@ -129,9 +129,9 @@ character used in any given input CSV-like formatted file/data. {"field_1": "xxx", "field_2": "yyy", "field_3": "zzz"} ] ``` -4. 
Within each record and the header, there may be one or more fields, - separated by a delimiter (normally a comma). Each record should contain - the same number of fields throughout the file. For example: +4. Within each record and the optional header, there may be one or more + fields, separated by a delimiter (normally a comma). Each record should + contain the same number of fields throughout the file. For example: CSV (invalid): @@ -193,9 +193,9 @@ character used in any given input CSV-like formatted file/data. ["xxx", "y, yy", "zzz"] ] ``` -8. If double-quotes are used to enclose fields, then a double-quote appearing - inside a field must be escaped by preceding it with another double quote. - For example: +8. A double-quote appearing inside a field must be escaped by preceding it + with another double quote, and the field itself must be enclosed in double + quotes. For example: CSV: @@ -228,10 +228,11 @@ character used in any given input CSV-like formatted file/data. ``` 10. All fields are always strings. CSV itself does not support type casting to - integers, floats, booleans, or anything else. If type casting is required, - it is be up to the developer using a specific CSV library to ensure types - are correctly dealt with. It is not the responsibility of the CSV - parsing/writing library itself. For example: + integers, floats, booleans, or anything else. It is not a CSV library's + responsibility to type cast input CSV data. + + If type casting is required, it is be up to the developer using a specific + CSV library to ensure types are correctly dealt with. For example: Input JSON: @@ -254,8 +255,12 @@ character used in any given input CSV-like formatted file/data. ["11", "false", "2.13", "bbb"] ] ``` -11. When rendering output CSV data, non-string types should be converted to a - string in such a way that minimal information is lost. For example: + At this point it is up to the developer themselves to type cast the above + output data from the CSV parser. 
+ +11. However, when rendering type cast input data to CSV text, non-string + types should be converted to a string in such a way that minimal + information is lost. For example: - Integers and floats should simply be rendered as a string version of themselves. - Booleans `true` and `false` should be rendered as `true` and `false` From 8b2c7a22d48c22d63628d0fea0f82c5f5619c019 Mon Sep 17 00:00:00 2001 From: Jim Myhrberg Date: Wed, 1 Apr 2015 23:35:48 +0100 Subject: [PATCH 12/20] Removed "For example:" from all rules, it felt redundant --- README.md | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/README.md b/README.md index e0b2a74..e5870d4 100644 --- a/README.md +++ b/README.md @@ -68,7 +68,7 @@ character used in any given input CSV-like formatted file/data. ### Rules 1. Each record starts at the beginning of its own line, and ends with a line - break (CRLF). For example: + break (CRLF). CSV: @@ -85,7 +85,7 @@ character used in any given input CSV-like formatted file/data. ``` 2. Though it is recommended, the last record in a file is not required to - have a ending line break. For example: + have a ending line break. CSV: @@ -104,7 +104,7 @@ character used in any given input CSV-like formatted file/data. 3. There may be an optional header line appearing as the first line of the file with the same format as normal records. This header will contain names corresponding to the fields in the file, and must contain the same - number of fields as the records in the rest of the file. For example: + number of fields as the records in the rest of the file. CSV: @@ -131,7 +131,7 @@ character used in any given input CSV-like formatted file/data. 4. Within each record and the optional header, there may be one or more fields, separated by a delimiter (normally a comma). Each record should - contain the same number of fields throughout the file. For example: + contain the same number of fields throughout the file. 
CSV (invalid): @@ -142,7 +142,7 @@ character used in any given input CSV-like formatted file/data. ``` 5. The last field in the record must not be followed by a comma. This results - in a additional field with nothing in it. For example: + in a additional field with nothing in it. CSV: @@ -176,7 +176,7 @@ character used in any given input CSV-like formatted file/data. ``` 7. Fields containing line breaks, double quotes, or the delimiter character - (normally a comma) must be enclosed in double-quotes. For example: + (normally a comma) must be enclosed in double-quotes. CSV: @@ -195,7 +195,7 @@ character used in any given input CSV-like formatted file/data. 8. A double-quote appearing inside a field must be escaped by preceding it with another double quote, and the field itself must be enclosed in double - quotes. For example: + quotes. CSV: @@ -211,7 +211,7 @@ character used in any given input CSV-like formatted file/data. 9. Though it is not recommended, each field may be enclosed in double quotes even if it does not contain a line break, double quote, or delimiter - character. For example: + character. CSV: @@ -232,7 +232,7 @@ character used in any given input CSV-like formatted file/data. responsibility to type cast input CSV data. If type casting is required, it is be up to the developer using a specific - CSV library to ensure types are correctly dealt with. For example: + CSV library to ensure types are correctly dealt with. Input JSON: @@ -260,7 +260,7 @@ character used in any given input CSV-like formatted file/data. 11. However, when rendering type cast input data to CSV text, non-string types should be converted to a string in such a way that minimal - information is lost. For example: + information is lost. - Integers and floats should simply be rendered as a string version of themselves. 
- Booleans `true` and `false` should be rendered as `true` and `false` From 60a244220592d3e5cd72a464d845f13b37bbbe32 Mon Sep 17 00:00:00 2001 From: Jim Myhrberg Date: Wed, 1 Apr 2015 23:41:13 +0100 Subject: [PATCH 13/20] Clarified line breaks a little bit --- README.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index e5870d4..abf73ea 100644 --- a/README.md +++ b/README.md @@ -53,7 +53,8 @@ character used in any given input CSV-like formatted file/data. remaining rows. Header names would be used as key names when CSV data is converted to JSON for example. - **Line Break** — Line breaks in CSV files should be CRLF (`\r\n`). - +- **CRLF** — Means the standard line break used by Windows. It is a carriage + return character (CR or `\r`) and a line feed character (LF or `\n`). ## CSV Format Definition @@ -175,8 +176,8 @@ character used in any given input CSV-like formatted file/data. [" xxx", " yyy ", "zzz "] ] ``` -7. Fields containing line breaks, double quotes, or the delimiter character - (normally a comma) must be enclosed in double-quotes. +7. Fields containing line breaks (CRLF, LF, or CR), double quotes, or the + delimiter character (normally a comma) must be enclosed in double-quotes. CSV: From 9aaf3ccfa0c97504ba74ceee6f89e530a74efc53 Mon Sep 17 00:00:00 2001 From: Jim Myhrberg Date: Wed, 1 Apr 2015 23:44:39 +0100 Subject: [PATCH 14/20] Fix typo --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index abf73ea..9ace1a9 100644 --- a/README.md +++ b/README.md @@ -232,7 +232,7 @@ character used in any given input CSV-like formatted file/data. integers, floats, booleans, or anything else. It is not a CSV library's responsibility to type cast input CSV data. - If type casting is required, it is be up to the developer using a specific + If type casting is required, it is up to the developer using a specific CSV library to ensure types are correctly dealt with. 
Input JSON: From a02a27e361fe55588589f8e16176d47829de4ed1 Mon Sep 17 00:00:00 2001 From: Jim Myhrberg Date: Wed, 1 Apr 2015 23:44:49 +0100 Subject: [PATCH 15/20] Update type casting examples --- README.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index 9ace1a9..e4b0ec6 100644 --- a/README.md +++ b/README.md @@ -238,22 +238,22 @@ character used in any given input CSV-like formatted file/data. Input JSON: ```json - [ [10, true, 0.3, "aaa"], - [11, false, 2.13, "bbb"] ] + [ [10, true, 0.3, null, "aaa"], + [11, false, 2.13, "", "bbb"] ] ``` Output CSV: ```csv - 10,true,0.3,aaa¬ - 11,false,2.13,bbb¬ + 10,true,0.3,,aaa¬ + 11,false,2.13,,bbb¬ ``` Output CSV parsed back to JSON: ```json - [ ["10", "true", "0.3", "aaa"], - ["11", "false", "2.13", "bbb"] ] + [ ["10", "true", "0.3", "", "aaa"], + ["11", "false", "2.13", "", "bbb"] ] ``` At this point it is up to the developer themselves to type cast the above From b19dd95b07516e13fa53faa97f1942ee687b8581 Mon Sep 17 00:00:00 2001 From: Jim Myhrberg Date: Wed, 1 Apr 2015 23:48:45 +0100 Subject: [PATCH 16/20] Update roadmap --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index e4b0ec6..994c499 100644 --- a/README.md +++ b/README.md @@ -29,7 +29,7 @@ character used in any given input CSV-like formatted file/data. ## Roadmap -1. Write up core specification rules. _[in-progress]_ +1. Write up core specification rules. _[in-progress - 1st draft]_ 2. Create input/output test files covering all rules in the specification. 3. Create website for [csv-spec.org](http://csv-spec.org/). 4. 
Create linting tool as a NPM module, allowing easy validation of CSV

From 2b3a7ae7d9a8bd7b954002fcfe7f05b426d4e75c Mon Sep 17 00:00:00 2001
From: Jim Myhrberg
Date: Thu, 2 Apr 2015 01:05:45 +0100
Subject: [PATCH 17/20] Fix typo

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 994c499..f4f11ee 100644
--- a/README.md
+++ b/README.md
@@ -44,7 +44,7 @@ character used in any given input CSV-like formatted file/data.

 - **Field** — A singular String value within a record.
 - **Record** (or **Row**) — A collection of fields. This is often referred to
-  as a "line", but a single record can in span multiple text lines if a field
+  as a "line", but a single record can span multiple text lines if a field
   within it contains one or more line breaks.
 - **Delimiter** — The character used to separate fields withing a
   row. Commonly this will be a comma (`,`), but semi-colons (`;`) or tabs

From c3fa7547ddff33ab163c5684056979a93a862f77 Mon Sep 17 00:00:00 2001
From: Jim Myhrberg
Date: Thu, 2 Apr 2015 01:16:40 +0100
Subject: [PATCH 18/20] Updates regarding line breaks

---
 README.md | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index f4f11ee..720c0c5 100644
--- a/README.md
+++ b/README.md
@@ -52,9 +52,12 @@ character used in any given input CSV-like formatted file/data.
 - **Header** — The first row is often used to contain the column names for all
   remaining rows. Header names would be used as key names when CSV data is
   converted to JSON for example.
-- **Line Break** — Line breaks in CSV files should be CRLF (`\r\n`).
-- **CRLF** — Means the standard line break used by Windows. It is a carriage
-  return character (CR or `\r`) and a line feed character (LF or `\n`).
+- **Line Break** — Line breaks in CSV files can be CRLF (`\r\n`), LF (`\n`),
+  and even in rare cases CR (`\r`).
+- **LF, CR, and CRLF** — Different types of line breaks, typically determined
+  by the OS. Linux, OSX, and other *NIX operating systems generally use a line
+  feed (LF or `\n`) character. Windows uses a carriage return (CR or `\r`) and
+  a line feed character, effectively "CRLF" (`\r\n`).

 ## CSV Format Definition

@@ -69,7 +72,7 @@ character used in any given input CSV-like formatted file/data.
 ### Rules

 1.  Each record starts at the beginning of its own line, and ends with a line
-    break (CRLF).
+    break (shown as `¬`).

     CSV:

@@ -269,6 +272,10 @@ character used in any given input CSV-like formatted file/data.
       CSV data is indistinguishable from actual integer numbers.
     - Null/Nil values should be rendered as empty strings.

+12. All forms of line breaks (CRLF, LF, and CR) should be supported when
+    parsing input CSV data. When rendering output CSV data, CRLF should be
+    used for line breaks to ensure maximum cross-platform compatibility.
+

 ## License


From ee28af780660126d0faf8cd8b2723724077b99a4 Mon Sep 17 00:00:00 2001
From: Jim Myhrberg
Date: Sun, 8 Oct 2017 13:04:36 +0100
Subject: [PATCH 19/20] Overhaul formatting, improve wording, and add one aditional rule

---
 README.md | 349 ++++++++++++++++++++++++++++--------------------------
 1 file changed, 183 insertions(+), 166 deletions(-)

diff --git a/README.md b/README.md
index 720c0c5..050d792 100644
--- a/README.md
+++ b/README.md
@@ -1,221 +1,238 @@
-# CSV Spec
+CSV Spec 0.9.0-draft.0
+======================

-CSV is not a file format, it is typically a loose set of guidelines of how to
-structure tabular data into a plain text string. As such there's an endless
-amount of `*.csv` files floating around which are highly incompatible with
-each other. The closest thing there is to a specification is [RFC 4180][].
+Summary
+-------

-[rfc 4180]: http://tools.ietf.org/html/rfc4180
+CSV is not a file format, it is a loose set of guidelines of how to structure
+tabular data into a plain text string. As such there's an endless amount of
+`*.csv` files floating around which are highly incompatible with each other. The
+closest thing there is to a specification is [RFC
+4180](http://tools.ietf.org/html/rfc4180).

-
-## Goals
+Goals
+-----

 This project is an attempt to summarize RFC 4180 and the information in the
-[Comma-separated values (CSV)][csv] Wikipedia article into a easy to
-understand format. The spec will also take into account that the comma (`,`)
-character is not the only character used as a field delimiter. Semi-colons
-(`;`), tabs (`\t`), and more are popular field delimiter characters. As such
-the specification will more accurately be describing a CSV-like structured
-data format.
-
-[csv]: http://en.wikipedia.org/wiki/Comma-separated_values
+[Comma-separated values
+(CSV)](http://en.wikipedia.org/wiki/Comma-separated_values) Wikipedia article
+into a easy to understand format. The spec will also take into account that the
+comma (`,`) character is not the only character used as a field
+delimiter. Semi-colons (`;`), tabs (`\t`), and more are popular field delimiter
+characters. As such the specification will more accurately be describing a
+CSV-like structured data format.

 We will also provide input/output test files that CSV parser/writer software
 libraries can use to validate if they properly adhere to the rules laid out in
-this specification. And if possible we will even try to provide code snippets
-in various languages that attempts to automatically determine the delimiter
+this specification. And if possible we will even try to provide code snippets in
+various languages that attempts to automatically determine the delimiter
 character used in any given input CSV-like formatted file/data.

+Roadmap
+-------

-## Roadmap
-
-1. Write up core specification rules. _[in-progress - 1st draft]_
+1. Write up core specification rules. _[in-progress]_
 2. Create input/output test files covering all rules in the specification.
 3. Create website for [csv-spec.org](http://csv-spec.org/).
-4. Create linting tool as a NPM module, allowing easy validation of CSV
-   data both client-side in a web browser, and server side via a command line
-   tool.
+4. Create linting tool as a NPM module, allowing easy validation of CSV data
+   both client-side in a web browser, and server side via a command line tool.
 5. Create automatic delimiter character detection code snippets in various
    programming languages which CSV parser developers can freely use to enhance
    their libraries.

-
-## Terminology
+Terminology
+-----------

 - **Field** — A singular String value within a record.
-- **Record** (or **Row**) — A collection of fields. This is often referred to
-  as a "line", but a single record can span multiple text lines if a field
-  within it contains one or more line breaks.
-- **Delimiter** — The character used to separate fields withing a
-  row. Commonly this will be a comma (`,`), but semi-colons (`;`) or tabs
-  (`\t`) are two other popular delimiter characters.
+- **Record** (or **Row**) — A collection of fields. This is often referred to as
+  a "line", but a single record can span multiple text lines if a field within
+  it contains one or more line breaks.
+- **Delimiter** — The character used to separate fields withing a row. Commonly
+  this will be a comma (`,`), but semi-colons (`;`) or tabs (`\t`) are two other
+  popular delimiter characters.
 - **Header** — The first row is often used to contain the column names for all
   remaining rows. Header names would be used as key names when CSV data is
   converted to JSON for example.
-- **Line Break** — Line breaks in CSV files can be CRLF (`\r\n`), LF (`\n`),
-  and even in rare cases CR (`\r`).
-- **LF, CR, and CRLF** — Different types of line breaks, typically determined
-  by the OS. Linux, OSX, and other *NIX operating systems generally use a line
-  feed (LF or `\n`) character. Windows uses a carriage return (CR or `\r`) and
-  a line feed character, effectively "CRLF" (`\r\n`).
+- **Line Break** — Line breaks in CSV files can be CRLF (`\r\n`), LF (`\n`), and
+  even in rare cases CR (`\r`).
+- **LF, CR, and CRLF** — Different types of line breaks, typically determined by
+  the OS. Linux, OSX, and other *NIX operating systems generally use a line feed
+  (LF or `\n`) character. Windows uses a carriage return (CR or `\r`) and a line
+  feed character, effectively "CRLF" (`\r\n`).

-## CSV Format Definition
+CSV Format Specification
+------------------------

-- These rules are mostly based on the corresponding section from
-  [RFC 4180][def], with minor changes, clarifications and improved examples.
-- Where relevant, examples include both the CSV text version and the
-  equivalent data in JSON format.
-- Line breaks in the CSV examples are displayed using the `¬` character.
+The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
+"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
+interpreted as described in [RFC 2119](https://tools.ietf.org/html/rfc2119).

-[def]: http://tools.ietf.org/html/rfc4180#section-2
+These rules are mostly based on the corresponding section from [RFC
+4180](http://tools.ietf.org/html/rfc4180#section-2), with minor changes,
+clarifications and improved examples.

-### Rules
+1. Each record starts at the beginning of its own line, and ends with a line
+   break (shown as `¬`).

-1.  Each record starts at the beginning of its own line, and ends with a line
-    break (shown as `¬`).
+   CSV:

-    CSV:
+   ```csv
+   aaa,bbb,ccc¬
+   xxx,yyy,zzz¬
+   ```

-    ```csv
-    aaa,bbb,ccc¬
-    xxx,yyy,zzz¬
-    ```
+   JSON:

-    JSON:
+   ```json
+   [ ["aaa", "bbb", "ccc"],
+     ["xxx", "yyy", "zzz"] ]
+   ```

-    ```json
-    [ ["aaa", "bbb", "ccc"],
-      ["xxx", "yyy", "zzz"] ]
-    ```
+2. Though it is RECOMMENDED, the last record in a file is not required to have a
+   ending line break.

-2.  Though it is recommended, the last record in a file is not required to
-    have a ending line break.
+   CSV:

-    CSV:
+   ```csv
+   aaa,bbb,ccc¬
+   xxx,yyy,zzz
+   ```

-    ```csv
-    aaa,bbb,ccc¬
-    xxx,yyy,zzz
-    ```
+   JSON:

-    JSON:
+   ```json
+   [ ["aaa", "bbb", "ccc"],
+     ["xxx", "yyy", "zzz"] ]
+   ```

-    ```json
-    [ ["aaa", "bbb", "ccc"],
-      ["xxx", "yyy", "zzz"] ]
-    ```
+3. There may be an OPTIONAL header line appearing as the first line of the file
+   with the same format as normal records. This header will contain names
+   corresponding to the fields in the file, and MUST contain the same number of
+   fields as the records in the rest of the file.

-3.  There may be an optional header line appearing as the first line of the
-    file with the same format as normal records. This header will contain
-    names corresponding to the fields in the file, and must contain the same
-    number of fields as the records in the rest of the file.
+   CSV:

-    CSV:
+   ```csv
+   field_1,field_2,field_3¬
+   aaa,bbb,ccc¬
+   xxx,yyy,zzz¬
+   ```

-    ```csv
-    field_1,field_2,field_3¬
-    aaa,bbb,ccc¬
-    xxx,yyy,zzz¬
-    ```
+   JSON (ignoring headers):

-    JSON (ignoring headers):
+   ```json
+   [ ["field_1", "field_2", "field_3"],
+     ["aaa", "bbb", "ccc"],
+     ["xxx", "yyy", "zzz"] ]
+   ```

-    ```json
-    [ ["field_1", "field_2", "field_3"],
-      ["aaa", "bbb", "ccc"],
-      ["xxx", "yyy", "zzz"] ]
-    ```
+   JSON (using headers):

-    JSON (using headers):
+   ```json
+   [ {"field_1": "aaa", "field_2": "bbb", "field_3": "ccc"},
+     {"field_1": "xxx", "field_2": "yyy", "field_3": "zzz"} ]
+   ```

-    ```json
-    [ {"field_1": "aaa", "field_2": "bbb", "field_3": "ccc"},
-      {"field_1": "xxx", "field_2": "yyy", "field_3": "zzz"} ]
-    ```
+4. Within each record and the OPTIONAL header, there may be one or more fields,
+   separated by a delimiter (normally a comma). Each record MUST contain the
+   same number of fields throughout the file.

-4.  Within each record and the optional header, there may be one or more
-    fields, separated by a delimiter (normally a comma). Each record should
-    contain the same number of fields throughout the file.
+   CSV (invalid):

-    CSV (invalid):
+   ```csv
+   aaa,bbb,ccc¬
+   111,222,333,444¬
+   xxx,yyy,zzz¬
+   ```

-    ```csv
-    aaa,bbb,ccc¬
-    111,222,333,444¬
-    xxx,yyy,zzz¬
-    ```
+5. The last field in a record MUST NOT be followed by a comma. This results in a
+   additional field with nothing in it.

-5.  The last field in the record must not be followed by a comma. This results
-    in a additional field with nothing in it.
+   CSV:

-    CSV:
+   ```csv
+   aaa,bbb,ccc,¬
+   xxx,yyy,zzz,¬
+   ```

-    ```csv
-    aaa,bbb,ccc,¬
-    xxx,yyy,zzz,¬
-    ```
+   JSON:

-    JSON:
+   ```json
+   [ ["aaa", "bbb", "ccc", ""],
+     ["xxx", "yyy", "zzz", ""] ]
+   ```

-    ```json
-    [ ["aaa", "bbb", "ccc", ""],
-      ["xxx", "yyy", "zzz", ""] ]
-    ```
+6. Spaces are considered part of a field and MUST NOT be ignored.

-6.  Spaces are considered part of a field and should not be ignored. For
-    example:
+   CSV:

-    CSV:
+   ```csv
+   aaa , bbb , ccc¬
+    xxx, yyy ,zzz ¬
+   ```

-    ```csv
-    aaa , bbb , ccc¬
-     xxx, yyy ,zzz ¬
-    ```
+   JSON:

-    JSON:
+   ```json
+   [ ["aaa ", " bbb ", " ccc"],
+     [" xxx", " yyy ", "zzz "] ]
+   ```

-    ```json
-    [ ["aaa ", " bbb ", " ccc"],
-      [" xxx", " yyy ", "zzz "] ]
-    ```
+7. Fields containing line breaks (CRLF, LF, or CR), double quotes, or the
+   delimiter character (normally a comma) MUST be enclosed in double-quotes.

-7.  Fields containing line breaks (CRLF, LF, or CR), double quotes, or the
-    delimiter character (normally a comma) must be enclosed in double-quotes.
+   CSV:

-    CSV:
+   ```csv
+   aaa,"b¬
+   bb",ccc¬
+   xxx,"y, yy",zzz¬
+   ```

-    ```csv
-    aaa,"b¬
-    bb",ccc¬
-    xxx,"y, yy",zzz¬
-    ```
+   JSON:

-    JSON:
+   ```json
+   [ ["aaa", "b\r\nbb", "ccc"],
+     ["xxx", "y, yy", "zzz"] ]
+   ```

-    ```json
-    [ ["aaa", "b\r\nbb", "ccc"],
-      ["xxx", "y, yy", "zzz"] ]
-    ```
+8. A double-quote appearing inside a field MUST be escaped by preceding it with
+   another double quote, and the field itself MUST be enclosed in double quotes.

-8.  A double-quote appearing inside a field must be escaped by preceding it
-    with another double quote, and the field itself must be enclosed in double
-    quotes.
+   CSV:

-    CSV:
+   ```csv
+   aaa,"b""bb",ccc¬
+   ```

-    ```csv
-    aaa,"b""bb",ccc¬
-    ```
+   JSON:

-    JSON:
+   ```json
+   [ ["aaa", "b\"bb", "ccc"] ]
+   ```

-    ```json
-    [ ["aaa", "b\"bb", "ccc"] ]
-    ```
+9. When a field enclosed in double quotes has spaces before and/or after the
+   double quotes, the spaces MUST be ignored, as the field starts and ends with
+   the double quotes. However this is considered invalid formatting and the CSV
+   parser SHOULD report some form of warning message.

-9.  Though it is not recommended, each field may be enclosed in double quotes
-    even if it does not contain a line break, double quote, or delimiter
-    character.
+   CSV:
+
+   ```csv
+   aaa,bbb,ccc¬
+   xxx, "y, yy" ,zzz¬
+   ```
+
+   JSON:
+
+   ```json
+   [ ["aaa", "bbb", "ccc"],
+     ["xxx", "y, yy", "zzz"] ]
+   ```
+
+10. It is possible to enclose every field in double quotes even if they don't
+    need to be enclosed. However it is RECOMMENDED to only enclose fields in
+    double quotes that requires it.

     CSV:

@@ -231,12 +248,12 @@ character used in any given input CSV-like formatted file/data.
       ["xxx", "yyy", "zzz"] ]
     ```

-10. All fields are always strings. CSV itself does not support type casting to
+11. All fields are always strings. CSV itself does not support type casting to
     integers, floats, booleans, or anything else. It is not a CSV library's
     responsibility to type cast input CSV data.

-    If type casting is required, it is up to the developer using a specific
-    CSV library to ensure types are correctly dealt with.
+    If type casting is required, it is up to the developer using a specific CSV
+    library to ensure types are correctly dealt with.

     Input JSON:

@@ -262,21 +279,21 @@ character used in any given input CSV-like formatted file/data.
     At this point it is up to the developer themselves to type cast the above
     output data from the CSV parser.

-11. However, when rendering type cast input data to CSV text, non-string
-    types should be converted to a string in such a way that minimal
-    information is lost.
-    - Integers and floats should simply be rendered as a string version
-      of themselves.
-    - Booleans `true` and `false` should be rendered as `true` and `false`
+12. However, when rendering type cast input data to CSV text, non-string types
+    MUST be converted to a string in such a way that minimal information is
+    lost.
+    - Integers and floats MUST be rendered as a string version of themselves.
+    - Booleans `true` and `false` MUST be rendered as `true` and `false`
       strings, not as `1` or `0` numbers. If numbers are used the resulting
       CSV data is indistinguishable from actual integer numbers.
-    - Null/Nil values should be rendered as empty strings.
+    - `Null`/`nil` values MUST be rendered as empty strings.

-12. All forms of line breaks (CRLF, LF, and CR) should be supported when
-    parsing input CSV data. When rendering output CSV data, CRLF should be
-    used for line breaks to ensure maximum cross-platform compatibility.
+13. When parsing input CSV data all forms of line breaks (CRLF, LF, and CR) MUST
+    be supported.
+14. When rendering output CSV data, CRLF MUST be used for line breaks to ensure
+    maximum cross-platform compatibility.

-
-## License
+License
+-------

 [CC0 1.0 Universal](http://creativecommons.org/publicdomain/zero/1.0/)

From b63758ef2f8b12ba480900400ecd25260971e453 Mon Sep 17 00:00:00 2001
From: Jim Myhrberg
Date: Sun, 8 Oct 2017 13:13:16 +0100
Subject: [PATCH 20/20] Add about section

---
 README.md | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/README.md b/README.md
index 050d792..45dbdb3 100644
--- a/README.md
+++ b/README.md
@@ -293,6 +293,14 @@ clarifications and improved examples.
 14. When rendering output CSV data, CRLF MUST be used for line breaks to ensure
     maximum cross-platform compatibility.

+About
+-----
+
+This CSV specification is authored by [Jim Myhrberg](https://jimeh.me/).
+
+If you'd like to leave feedback,
+please [open an issue on GitHub](https://github.com/parsecsv/csv-spec/issues).
+
 License
 -------
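
The parsing and rendering rules in the specification text above (rules 1, 2, 6, 7, 8, 13, and 14) can be sketched in Python. This is a minimal illustration under stated assumptions, not part of the patch series and not a reference implementation: the names `parse_csv` and `render_csv` are hypothetical, the delimiter defaults to a comma, and neither the field-count check from rule 4 nor the warning from rule 9 is handled.

```python
def parse_csv(text, delimiter=","):
    """Parse CSV text into a list of records, each a list of string fields."""
    records, record, field = [], [], []
    in_quotes = False  # currently inside a double-quoted field
    started = False    # the current record has consumed at least one character
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if in_quotes:
            if ch == '"' and i + 1 < n and text[i + 1] == '"':
                field.append('"')  # rule 8: a doubled quote is an escaped quote
                i += 1
            elif ch == '"':
                in_quotes = False  # closing quote
            else:
                field.append(ch)   # rule 7: delimiters and line breaks are literal
        elif ch == '"' and not field:
            in_quotes = started = True  # opening quote at the start of a field
        elif ch == delimiter:
            record.append("".join(field))
            field = []
            started = True
        elif ch in "\r\n":
            # rule 13: accept CRLF, LF, and CR; CRLF counts as one line break
            if ch == "\r" and text[i + 1:i + 2] == "\n":
                i += 1
            record.append("".join(field))
            records.append(record)
            record, field, started = [], [], False
        else:
            field.append(ch)  # rule 6: spaces are kept as part of the field
            started = True
        i += 1
    if started:
        # rule 2: the last record may lack an ending line break
        record.append("".join(field))
        records.append(record)
    return records


def render_csv(records, delimiter=","):
    """Render records (lists of strings) as CSV text with CRLF line breaks."""
    def quote(value):
        # rules 7 and 8: quote only when needed, doubling embedded quotes
        if any(c in value for c in ('"', delimiter, "\r", "\n")):
            return '"' + value.replace('"', '""') + '"'
        return value

    # rule 14: always end each record with CRLF
    return "".join(
        delimiter.join(quote(f) for f in rec) + "\r\n" for rec in records
    )
```

One design choice worth noting: an empty line is parsed as a record with a single empty field, which the specification text does not address either way.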