
What is 'plain text'? Is it ASCII? 7 or 8 bit? EBCDIC? UTF-8?


The distinction you are missing is that what is generally meant by (and what is actually valuable about) plain text is "simultaneously human- and machine-readable data".

Many "plain text" formats, like Markdown, INI files, or JSON, actually have very strict formatting requirements and character set constraints, but the value-add comes from a human's ability to inspect the object on the file system with well-known and reliable tools (grep, awk, a text editor, etc.), figure out what it's supposed to mean, then feed it to the machine, and compare the machine's behavior with their expectations.

With non-human-readable data, this is much harder: you pretty much need a tool to convert the binary data to readable text to distinguish between "my program is broken" and "my program works but is getting bad input."

Note that even structured ASCII can still make this hard: XML is nominally human-readable, but as a practical matter reading it can be difficult.


Encodings do not make text a binary format. This pedantry is uncalled for.


I’m not sure the text/binary distinction is that useful.

ASCII is much simpler than Unicode encodings, to the point where text can even become an attack vector. A fully featured UTF-8 parsing and rendering engine is a sophisticated thing.

Does it matter whether one or the other is classified as text or binary? Not as much as it matters which requires the more complex code to process.


> A fully featured UTF-8 parsing and rendering engine is a sophisticated thing.

No, UTF-8 decoding is trivial; you can do it in a few dozen lines in just about any language. It's Unicode that is a complex and moving target. But you can also just choose to implement a sane subset of Unicode for your application.
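
To make the "few dozen lines" claim concrete, here is a minimal sketch of a strict UTF-8 decoder in Python. The function name `decode_utf8` is hypothetical, and the sketch only turns bytes into code points; it does nothing about normalization, grapheme clusters, or rendering, which is where the Unicode complexity lives.

```python
def decode_utf8(data: bytes) -> list:
    """Decode UTF-8 bytes into a list of Unicode code points (ints)."""
    code_points = []
    i = 0
    while i < len(data):
        first = data[i]
        # The lead byte determines the sequence length and the smallest
        # code point that may legally use that length.
        if first < 0x80:                 # 0xxxxxxx (ASCII)
            length, cp, minimum = 1, first, 0x0
        elif first >> 5 == 0b110:        # 110xxxxx
            length, cp, minimum = 2, first & 0x1F, 0x80
        elif first >> 4 == 0b1110:       # 1110xxxx
            length, cp, minimum = 3, first & 0x0F, 0x800
        elif first >> 3 == 0b11110:      # 11110xxx
            length, cp, minimum = 4, first & 0x07, 0x10000
        else:
            raise ValueError(f"invalid lead byte at offset {i}")

        if i + length > len(data):
            raise ValueError(f"truncated sequence at offset {i}")

        # Each continuation byte must look like 10xxxxxx and adds 6 bits.
        for byte in data[i + 1:i + length]:
            if byte >> 6 != 0b10:
                raise ValueError(f"invalid continuation byte after offset {i}")
            cp = (cp << 6) | (byte & 0x3F)

        if cp < minimum:
            raise ValueError(f"overlong encoding at offset {i}")
        if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            raise ValueError(f"code point out of range at offset {i}")

        code_points.append(cp)
        i += length
    return code_points


if __name__ == "__main__":
    print([hex(cp) for cp in decode_utf8("héllo €".encode("utf-8"))])
```

The overlong and surrogate checks are what separate a strict decoder from a naive one; skipping them is a common source of the "text as an attack vector" problems mentioned upthread.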

Recommended reading: http://cat-v.org/, https://github.com/cls/libutf


> I’m not sure the text/binary distinction is that useful.

Encoding a number in binary takes 4 or 8 bytes. Encoding it in plain text (ASCII or Unicode) takes as many bytes as there are digits, plus one for the sign / decimal separator. If you're talking about ASCII/Unicode, you're not talking about text/binary.
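
As a rough illustration of the size comparison, the sketch below packs an integer as fixed-width binary with Python's `struct` module and as ASCII decimal text. The helper name `encoded_sizes` is made up for this example, and it assumes integers that fit in 32 bits; floats and the decimal separator are left out.

```python
import struct


def encoded_sizes(value: int) -> tuple:
    """Return (bytes as int32, bytes as int64, bytes as ASCII decimal text)."""
    as_int32 = struct.pack("<i", value)    # fixed 4 bytes, little-endian
    as_int64 = struct.pack("<q", value)    # fixed 8 bytes, little-endian
    as_text = str(value).encode("ascii")   # one byte per digit, plus the sign
    return len(as_int32), len(as_int64), len(as_text)


if __name__ == "__main__":
    for n in (7, -42, 1234567890):
        print(n, encoded_sizes(n))
```

Note that the text form is not always larger: a single-digit number takes one byte as text but still four or eight in fixed-width binary.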

> ASCII is much simpler than Unicode encodings

I don't disagree with you, but Unicode is not binary.

> Does it matter whether one or the other is classified as text or binary?

They're both text.


I'm just saying there is no such thing as 'plain text'.

I also think that plain text is a bad idea; binary files are much easier to parse.


> I'm just saying there is no such thing as 'plain text'.

And yet, everyone agrees there is, and has no problem telling it apart from other formats, even if they're all 0s and 1s underneath.


American programmers with little exposure to the world beyond their personal milieu agree that there is. The rest of the world strongly disagrees.


I'm not an American programmer, and we used to have 2-3 of our own encodings in my country (or "codepages", as they were called back in the day) before UTF-8 took care of that. Because of that, having to handle international text was the default from the time I started programming professionally. So, nope.

Still, missing the point.

The difference between plain text formats and binary formats is not that plain text files do not consist of bytes or don't need an encoding to read them.

It's being able to work on them with a plain text editor, and being based on actual written text -- as opposed to packed bytes and custom (proprietary or not) formats.


That is not true. Guessing encoding is really, really difficult, and you can easily end up reading a file with an uncommon Japanese encoding as a Chinese encoding and end up with subtle errors.
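
A small sketch of why guessing is hard: the same bytes can decode without any error under several encodings, each giving different text. The helper `decode_candidates` is hypothetical; it only uses Python's built-in codecs and does not attempt real detection.

```python
def decode_candidates(data: bytes, encodings: list) -> dict:
    """Return {encoding: decoded text} for every encoding that decodes cleanly."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding, so rule it out.
            continue
    return results


if __name__ == "__main__":
    # Japanese text saved as Shift_JIS, then read back with several guesses.
    data = "日本語".encode("shift_jis")
    for name, text in decode_candidates(
        data, ["utf-8", "shift_jis", "gbk", "latin-1"]
    ).items():
        print(f"{name}: {text!r}")
```

Only UTF-8 rejects these bytes outright. The Shift_JIS, GBK, and Latin-1 decoders all accept them, so the absence of a decoding error says little about whether the guess was right.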

Plain text is anything but plain.


> I'm just saying there is no such thing as 'plain text'.

And I'm just saying it's pedantry. 'plain text' is an umbrella term for 'not binary'.

> I also think that plain text is a bad idea, binary files are much easier to parse.

Binary files are definitely faster and more compact. And tools such as Google Protocol Buffers make passing information very convenient and efficient. Unfortunately, most APIs out there use JSON, so we just have to live with it.

Maybe the widespread adoption of JSON and plain text APIs is a reflection of how we, as developers, have become more likely to optimize for our own development process rather than the actual hardware (see the whole Electron craze).


I think I understand where you may be coming from. You are correct in that it is quite possible to get into a situation where the "plain text" editor that you are using does not understand the binary encoding of (all or some of) the characters present in what would otherwise be expected to be a text file, and displays some "binary noise" character instead. So, in this sense the definition of "plain text" becomes tied to the definition (or, rather, implementation) of a "plain text editor". But I think this is wrong. For example, is an XML file in ASCII plain text? The definition of the plain text file format is simply that the file must contain no control data or metadata other than what is provided by the encoding itself (such as the newline character).



