In computing, a newline, also known as a line break or end-of-line (EOL) marker, is a special character or sequence of characters signifying the end of a line of text. The name comes from the fact that the next character after the newline will appear on a new line, that is, on the next line below the text immediately preceding the newline. The actual codes representing a newline vary across operating systems, which can be a problem when exchanging text files between systems with different newline representations.
There is also some confusion whether newlines terminate or separate lines. If a newline is considered a separator, there will be no newline after the last line of a file. The general convention on most systems is to add a newline even after the last line, i.e. to treat newline as a line terminator. Some programs have problems processing the last line of a file if it is not newline terminated. Conversely, programs that expect newline to be used as a separator will interpret a final newline as starting a new (empty) line.
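The difference between the two interpretations can be sketched in a few lines of Python. The function names are hypothetical and chosen only for this illustration; the sketch assumes LF is the newline.

```python
def lines_as_separator(text: str) -> list[str]:
    # Newline separates lines: a final newline starts a new (empty) line.
    return text.split("\n")


def lines_as_terminator(text: str) -> list[str]:
    # Newline terminates lines: a final newline ends the last line.
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()
    return lines


text = "alpha\nbeta\n"
print(lines_as_separator(text))   # ['alpha', 'beta', '']
print(lines_as_terminator(text))  # ['alpha', 'beta']
```

The same input yields three lines under the separator reading and two under the terminator reading, which is the mismatch described above.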
In text intended primarily to be read by humans using software which implements the word wrap feature, a newline character typically only needs to be stored if a line break is required independent of whether the next word would fit on the same line, such as between paragraphs and in vertical lists. See hard return and soft return.
Representations
Software applications and operating systems usually represent a newline with one or two control characters:
- Systems based on ASCII or a compatible character set use either LF (Line feed, '\n', 0x0A, 10 in decimal) or CR (Carriage return, '\r', 0x0D, 13 in decimal) individually, or CR followed by LF (CR+LF, '\r\n', 0x0D0A). These characters are based on printer commands: the line feed instructed the printer to advance the paper one line, and the carriage return indicated that the printer carriage should return to the beginning of the current line. Some rare systems, such as QNX before version 4, used the ASCII RS (record separator, 0x1E, 30 in decimal) character as the newline character.
- LF: Multics, Unix and Unix-like systems (GNU/Linux, AIX, Xenix, Mac OS X, FreeBSD, etc.), BeOS, Amiga, RISC OS, and others.
- CR+LF: Microsoft Windows, DEC TOPS-10, RT-11 and most other early non-Unix and non-IBM OSes, CP/M, MP/M, DOS (MS-DOS, PC-DOS, etc.), Atari TOS, OS/2, Symbian OS, Palm OS.
- LF+CR: Acorn BBC and RISC OS spooled text output.
- CR: Commodore 8-bit machines, Acorn BBC, TRS-80, Apple II family, Mac OS up to version 9 and OS-9.
- RS: QNX pre-POSIX implementation.
- EBCDIC systems, mainly IBM mainframe systems including z/OS (OS/390) and i5/OS (OS/400), use NEL (Next Line, 0x15) as the newline character. Note that EBCDIC also has control characters called CR and LF, but the numerical value of LF (0x25) differs from the one used by ASCII (0x0A). Additionally, there are some EBCDIC variants that also use NEL but assign a different numeric code to the character.
- Operating systems for the CDC 6000 series defined a newline as two or more zero-valued six-bit characters at the end of a 60-bit word. Some configurations also defined a zero-valued character as a colon character, with the result that multiple colons could be interpreted as a newline depending on position.
- The ZX80 and ZX81, home computers from Sinclair Research Ltd, used a specific non-ASCII character set with code 0x76 (118 decimal) as the newline character.
- OpenVMS uses a record-based file system, which stores text files as one record per line. In most file formats, no line terminators are actually stored, but the Record Management Services facility can transparently add a terminator to each line when it is retrieved by an application. The records themselves could contain the same line terminator characters, which could either be considered a feature or a nuisance depending on the application.
- Fixed line length was used by some early mainframe operating systems. In such a system, an implicit end-of-line was assumed every 80 characters, for example. No newline character was stored. If a file was imported from the outside world, lines shorter than the line length had to be padded with spaces, while lines longer than the line length had to be truncated. This mimicked the use of punched cards, on which each line was stored on a separate card, usually with 80 columns on each card. Many of these systems added a carriage control character to the start of the next record; this could indicate whether the next record was a continuation of the line started by the previous record, a new line, or should overprint the previous line (similar to a CR). Often this was a normal printing character such as '#' that thus could not be used as the first character in a line. Some early line printers interpreted these characters directly in the records sent to them.
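The fixed-line-length scheme from the last item can be illustrated with a short Python sketch. The 80-character width comes from the example above; the function names are hypothetical, and carriage control characters are left out for brevity.

```python
RECORD_LENGTH = 80  # assumed width, matching the 80-column example


def to_fixed_records(text: str, width: int = RECORD_LENGTH) -> str:
    """Import newline-delimited text as fixed-width records; no newline is stored."""
    records = []
    for line in text.splitlines():
        # Truncate lines that are too long, then pad short ones with spaces.
        records.append(line[:width].ljust(width))
    return "".join(records)


def from_fixed_records(data: str, width: int = RECORD_LENGTH) -> list[str]:
    """Recover lines by assuming an implicit end-of-line every `width` characters."""
    return [data[i:i + width].rstrip(" ") for i in range(0, len(data), width)]


stored = to_fixed_records("short\n" + "x" * 100)
print(len(stored))                 # 160: two records of 80 characters each
print(from_fixed_records(stored))  # the 100-character line comes back truncated to 80
```

Note that the round trip is lossy: trailing spaces and anything beyond the record width cannot be recovered.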
Most textual Internet protocols (including HTTP, SMTP, FTP, IRC and many others) mandate the use of ASCII CR+LF (0x0D 0x0A) on the protocol level, but recommend that tolerant applications recognize lone LF as well. In practice, there are many applications that erroneously use the C newline character '\n' instead (see the section In programming languages below). This leads to problems when trying to communicate with systems adhering to a stricter interpretation of the standards; one such system is the qmail MTA, which actively refuses to accept messages from systems that send bare LF instead of the required CR+LF.
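A tolerant receiver and a strict check might look roughly like the following Python sketch. Both function names are hypothetical, and the code is not taken from any particular protocol implementation; it only shows the two attitudes toward bare LF described above.

```python
def split_protocol_lines(data: bytes) -> list[bytes]:
    """Tolerant receiver: accept both CR+LF and bare LF as line endings."""
    lines = []
    for raw in data.split(b"\n"):
        # Drop the CR of a CR+LF pair; after a bare LF there is nothing to strip.
        lines.append(raw[:-1] if raw.endswith(b"\r") else raw)
    # The last element is whatever followed the final terminator (usually empty).
    if lines and lines[-1] == b"":
        lines.pop()
    return lines


def has_bare_lf(data: bytes) -> bool:
    """Strict check: report any LF that is not immediately preceded by CR."""
    return any(
        data[i] == 0x0A and (i == 0 or data[i - 1] != 0x0D)
        for i in range(len(data))
    )


received = b"one\r\ntwo\nthree\r\n"
print(split_protocol_lines(received))  # [b'one', b'two', b'three']
print(has_bare_lf(received))           # True: "two" ends with a bare LF
```

A strict system would reject the input as soon as `has_bare_lf` returns true, while a tolerant one would process all three lines.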
FTP has a feature to transform newlines between CR+LF and bare LF when transferring text files. This must not be used on binary files. Usually binary files and text files are recognised by checking their filename extension.
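The following Python sketch shows the kind of transformation involved and why it damages binary data. It is not an FTP implementation; the function names and the sample bytes are made up for illustration, and the local convention is assumed to be LF.

```python
def to_network(local_text: bytes) -> bytes:
    """Convert LF line endings to CR+LF for transmission."""
    # Normalise existing CR+LF first so it is not doubled into CR CR LF.
    return local_text.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


def to_local(network_text: bytes) -> bytes:
    """Convert CR+LF line endings back to LF on receipt."""
    return network_text.replace(b"\r\n", b"\n")


print(to_network(b"a\nb\n"))    # b'a\r\nb\r\n'
print(to_local(b"a\r\nb\r\n"))  # b'a\nb\n'

# Arbitrary binary data that happens to contain the bytes 0x0D 0x0A:
binary = bytes([0x89, 0x0D, 0x0A, 0x1A])
print(to_local(binary) == binary)  # False: one byte was silently removed
```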
Unicode
The Unicode standard defines a large number of characters that conforming applications should recognize as line terminators:
- LF: Line Feed, U+000A
- VT: Vertical Tab, U+000B
- FF: Form Feed, U+000C
- CR: Carriage Return, U+000D
- CR+LF: CR (U+000D) followed by LF (U+000A)
- NEL: Next Line, U+0085
- LS: Line Separator, U+2028
- PS: Paragraph Separator, U+2029
This may seem overly complicated compared to an approach such as converting all line terminators to a single character, for example LF. However, Unicode was designed to preserve all information when converting a text file from any existing encoding to Unicode and back. Therefore Unicode should contain characters included in existing encodings. NEL is included in ISO-8859-1 and EBCDIC (0x15). The approach taken in the Unicode standard allows round-trip transformation to be information-preserving while still enabling applications to recognize all possible types of line terminators.
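As a rough illustration, a splitter that recognizes exactly the eight terminators listed above can be written with a regular expression. The names below are hypothetical; Python's built-in `str.splitlines()` covers these characters as well, along with a few additional separators.

```python
import re

# CR+LF is listed first so the pair is consumed as a single terminator
# rather than as a CR followed by an LF.
LINE_TERMINATOR = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")


def split_unicode_lines(text: str) -> list[str]:
    """Split text on any of the Unicode line terminators listed above."""
    parts = LINE_TERMINATOR.split(text)
    # Treat a trailing terminator as ending the last line, not starting a new one.
    if parts and parts[-1] == "":
        parts.pop()
    return parts


sample = "a\r\nb\nc\x0bd\x0ce\rf\x85g\u2028h\u2029i"
print(split_unicode_lines(sample))  # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
```

Because the text is only split and never rewritten, the original terminators remain available in the source string, in keeping with the information-preserving goal described above.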
Recognizing and using the newline codes greater than 0x7F is not often done. They are multiple bytes in UTF-8, and the code for NEL has been used as the ellipsis ('…') character in Windows-1252. For instance:
- YAML no longer recognizes them as special in order to be compatible with JSON.
- ECMAScript accepts LS and PS as line breaks, but considers U+0085 (NEL) white space, not a line break.
- Microsoft Windows 2000 does not treat any of NEL, LS or PS as a line break in the default text editor Notepad.
- On Linux, the popular editor gedit treats LS and PS as newlines but does not do so for NEL.
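The two encoding facts mentioned before the list (multi-byte UTF-8 forms, and the clash between NEL and the Windows-1252 ellipsis) can be checked with Python's standard codecs. The variable names are chosen only for this sketch.

```python
TERMINATORS = {"NEL": "\x85", "LS": "\u2028", "PS": "\u2029"}

# Each of these terminators needs more than one byte in UTF-8.
utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in TERMINATORS.items()}
print(utf8_lengths)  # {'NEL': 2, 'LS': 3, 'PS': 3}

for name, ch in TERMINATORS.items():
    print(name, ch.encode("utf-8").hex())  # c285, e280a8, e280a9

# The single byte 0x85 is an ellipsis when interpreted as Windows-1252.
print(b"\x85".decode("cp1252"))  # …
```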
History
ASCII was developed simultaneously by the ISO and the ASA, the predecessor organization to ANSI. During the period of 1963–1968, the ISO draft standards supported the use of either CR+LF or LF alone as a newline, while the ASA drafts supported only CR+LF.
The sequence CR+LF was in common use on many early computer systems that had adopted Teletype machines, typically a Teletype Model 33 ASR, as a console device, because this sequence was required to position those printers at the start of a new line. On these systems, text was often routinely composed to be compatible with these printers, since the concept of device drivers hiding such hardware details from the application was not yet well developed; applications had to talk directly to the teletype machine and follow its conventions.
Most minicomputer systems from DEC used this convention. CP/M used it as well, to print on the same terminals that minicomputers used. From there MS-DOS (1981) adopted CP/M's CR+LF in order to be compatible, and this convention was inherited by Microsoft's later Windows operating system.
The separation of the two functions concealed the fact that the print head could not return from the far right to the beginning of the next line in one-character time. That is why the sequence was always sent with the CR first. In fact, it was often necessary to send extra characters (extraneous CRs or NULs, which are ignored) to give the print head time to move to the left margin. Even many early video displays required multiple character times to scroll the display.
The Multics operating system began development in 1964 and used LF alone as its newline. Multics used a device driver to translate this character to whatever sequence a printer needed (including extra padding characters), and the single byte was much more convenient for programming. The seemingly more obvious choice of CR was not used, as a plain CR provided the useful function of overprinting one line with another, and thus it was useful to not translate it. Unix followed the Multics practice, and later systems followed Unix.
In programming languages
To facilitate the creation of portable programs, programming languages provide some abstractions to deal with the different types of newline sequences used in different environments.
The C programming language provides the escape sequences '\n' (newline) and '\r' (carriage return). However, these are not required to be equivalent to the ASCII LF and CR control characters. The C standard only guarantees two things:
- Each of these escape sequences maps to a unique implementation-defined number that can be stored in a single char value.
- When writing a file in text mode, '\n' is transparently translated to the native newline sequence used by the system, which may be longer than one character. When reading in text mode, the native newline sequence is translated back to '\n'. In binary mode, no translation is performed, and the internal representation produced by '\n' is output directly.
On Unix platforms, where C originated, the native newline sequence is ASCII LF (0x0A), so '\n' was simply defined to be that value. With the internal and external representation being identical, the translation performed in text mode is a no-op, and text mode and binary mode behave the same. This has caused many programmers who developed their software on Unix systems simply to ignore the distinction completely, resulting in code that is not portable to different platforms.
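The text-mode translation can be demonstrated without a C compiler by using Python's text I/O layer, which behaves analogously. This is a stand-in for illustration, not C's stdio: the helper names are hypothetical, and the `newline` argument is used here to simulate different native conventions on any platform.

```python
import io


def write_text(text: str, native_newline: str) -> bytes:
    # Write through a text layer that replaces each '\n' with `native_newline`.
    raw = io.BytesIO()
    with io.TextIOWrapper(raw, encoding="ascii", newline=native_newline) as stream:
        stream.write(text)
        stream.flush()
        return raw.getvalue()


def read_text(data: bytes) -> str:
    # newline=None enables translation on input: CR, LF and CR+LF all become '\n'.
    with io.TextIOWrapper(io.BytesIO(data), encoding="ascii", newline=None) as stream:
        return stream.read()


print(write_text("a\nb\n", "\r\n"))  # b'a\r\nb\r\n' (a CR+LF system)
print(write_text("a\nb\n", "\n"))    # b'a\nb\n'     (a Unix-like system: no-op)
print(read_text(b"a\r\nb\r\n"))      # the program sees 'a\nb\n' again
```

With LF as the native sequence the bytes written equal the characters given, which is why the distinction is easy to overlook on Unix.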
The C library function fgets is best avoided in binary mode because any file not written with the UNIX newline convention will be misread. Also, in text mode, any file not written with the system's native newline sequence (such as a file created on a UNIX system, then copied to a Windows system) will be misread as well.
Another common problem is the use of '\n' when communicating using an Internet protocol that mandates the use of ASCII CR+LF for ending lines. Writing '\n' to a text mode stream works correctly on Windows systems, but produces only LF on Unix, and something completely different on more exotic systems. Using "\r\n" in binary mode is slightly better, although it still relies on '\r' and '\n' having the ASCII values of CR and LF, which C does not guarantee.
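One way to avoid the problem is to build the outgoing bytes explicitly instead of relying on text-mode translation. A minimal Python sketch, assuming an ASCII-based protocol; the function name `format_protocol_lines` and the sample commands are hypothetical:

```python
CRLF = b"\r\n"  # explicit byte values 0x0D 0x0A, independent of the platform


def format_protocol_lines(lines):
    """Encode each line as ASCII and terminate it with CR+LF."""
    return b"".join(line.encode("ascii") + CRLF for line in lines)


payload = format_protocol_lines(["HELO example.org", "QUIT"])
print(payload)
```

The resulting bytes would then be written to a socket or to a stream opened in binary mode, so that no further translation takes place.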
Many languages, such as C++, Perl, and Haskell, provide the same interpretation of '\n' as C.
Java, PHP, and Python also provide '\n' and '\r' escape sequences. In contrast to C, these are guaranteed to represent the values U+000A and U+000D, respectively.
The Java I/O libraries do not transparently translate these into platform-dependent newline sequences on input or output. Instead, they provide functions for writing a full line that automatically add the native newline sequence, and functions for reading lines that accept any of CR, LF, or CR+LF as a line terminator (see BufferedReader.readLine). The System.getProperty method can be used to retrieve the underlying line separator.
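Example:
// Look up the platform's native line separator ("\n" on Unix, "\r\n" on Windows).
String eol = System.getProperty("line.separator");
// Append it explicitly instead of hard-coding "\n".
String lineColor = "Color: Red" + eol;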
Python permits "Universal Newline Support" when opening a file for reading, when importing modules, and when executing a file.
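In Python 3, this behaviour is selected through the `newline` parameter of `open` and `io.TextIOWrapper`. A short sketch using an in-memory stream (the sample data is made up for illustration):

```python
import io

# One line per convention: CR+LF, lone CR, and LF.
mixed = b"dos\r\nmac\runix\n"

# newline=None enables universal newlines: every terminator reads as '\n'.
universal = io.TextIOWrapper(io.BytesIO(mixed), encoding="ascii", newline=None)
translated_lines = universal.readlines()

# newline='' still recognises all three terminators but leaves them untranslated.
untranslated = io.TextIOWrapper(io.BytesIO(mixed), encoding="ascii", newline="")
raw_lines = untranslated.readlines()

print(translated_lines)
print(raw_lines)
```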
Some languages have created special variables, constants, and subroutines to facilitate newlines during program execution.
Common problems
The different newline conventions often cause text files that have been transferred between systems of different types to be displayed incorrectly. For example, files originating on Unix or Apple Macintosh systems may appear as a single long line on some Windows programs. Conversely, when viewing a file originating from a Windows computer on a Unix system, the extra CR may be displayed as ^M at the end of each line or as a second line break.
The problem can be hard to spot if some programs handle the foreign newlines properly while others do not. For example, a compiler may fail with obscure syntax errors even though the source file looks correct when displayed on the console or in an editor. On a Unix system, the command cat -v myfile.txt will send the file to stdout (normally the terminal) and make the ^M visible, which can be useful for debugging. Modern text editors generally recognize all flavours of CR/LF newlines and allow the user to convert between the different standards. Web browsers are usually also capable of displaying text files and websites which use different types of newlines.
The File Transfer Protocol can automatically convert newlines in files being transferred between systems with different newline representations when the transfer is done in "ASCII mode". However, transferring binary files in this mode usually has disastrous results: any occurrence of the newline byte sequence (which does not have line terminator semantics in this context, but is just part of a normal sequence of bytes) will be translated to whatever newline representation the other system uses, effectively corrupting the file. FTP clients often employ some heuristics (for example, inspection of filename extensions) to automatically select either binary or ASCII mode, but in the end it is up to the user to make sure the files are transferred in the correct mode. If there is any doubt as to the correct mode, binary mode should be used, as then no files will be altered by FTP, though they may display incorrectly.
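The corruption is easy to reproduce. The sketch below does not use FTP at all; it merely simulates, under the assumption of a Unix sender and a Windows receiver, the LF to CR+LF rewriting that an ASCII-mode transfer would apply. The function name and sample bytes are hypothetical.

```python
def simulate_ascii_transfer(data: bytes) -> bytes:
    """Rewrite every LF byte as CR+LF, as a Unix-to-Windows ASCII-mode transfer would."""
    return data.replace(b"\n", b"\r\n")


# Arbitrary binary data in which the byte 0x0A is payload, not a line ending.
binary_payload = bytes([0x89, 0x0A, 0x00, 0xFF, 0x0A, 0x10])
received = simulate_ascii_transfer(binary_payload)

print(len(binary_payload), len(received))
print(received == binary_payload)
```

Every 0x0A byte gains a preceding 0x0D, so the received data is longer than the original and no longer matches it.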
Conversion utilities
Text editors are often used for converting a text file between different newline formats; most modern editors can read and write files using at least the different ASCII CR/LF conventions. The standard Windows editor Notepad is not one of them (although WordPad and the MS-DOS Editor are).
Editors are often unsuitable for converting larger files. For larger files (on Windows NT/2000/XP) the following command is often used:
TYPE unix_file | FIND "" /V > dos_file
On many Unix systems, the dos2unix (sometimes named fromdos or d2u) and unix2dos (sometimes named todos or u2d) utilities are used to translate between ASCII CR+LF (DOS/Windows) and LF (Unix) newlines. Different versions of these commands vary slightly in their syntax. However, the tr command is available on virtually every Unix-like system and is used to perform arbitrary replacement operations on single characters. A DOS/Windows text file can be converted to Unix format by simply removing all ASCII CR characters with
tr -d '\r' < inputfile > outputfile
or, if the text has only CR newlines, by converting all CR newlines to LF with
tr '\r' '\n' < inputfile > outputfile
The same tasks are sometimes performed with sed, or in Perl if the platform has a Perl interpreter:
sed -e 's/$/\r/' inputfile > outputfile # UNIX to DOS (adding CRs)
sed -e 's/\r$//' inputfile > outputfile # DOS to UNIX (removing CRs)
perl -pe 's/\r\n|\n|\r/\r\n/g' inputfile > outputfile # Convert to DOS
perl -pe 's/\r\n|\n|\r/\n/g' inputfile > outputfile # Convert to UNIX
perl -pe 's/\r\n|\n|\r/\r/g' inputfile > outputfile # Convert to old Mac
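Where neither tool is available, the same substitution can be written in a general-purpose language. A minimal Python sketch that mirrors the regular expression of the Perl one-liners above; the function name `convert_newlines` is an assumption for illustration, and the data is processed as bytes so that no text-mode translation interferes:

```python
import re

# CR+LF is listed first so that a DOS line ending is not treated as two breaks.
ANY_NEWLINE = re.compile(rb"\r\n|\n|\r")


def convert_newlines(data: bytes, target: bytes) -> bytes:
    """Replace every CR+LF, LF, or CR in data with the target sequence."""
    return ANY_NEWLINE.sub(lambda _match: target, data)


sample = b"dos\r\nunix\nmac\r"
to_dos = convert_newlines(sample, b"\r\n")
to_unix = convert_newlines(sample, b"\n")
to_mac = convert_newlines(sample, b"\r")

print(to_dos)
print(to_unix)
print(to_mac)
```

In practice the input would be read from a file opened with mode "rb" and the result written with mode "wb".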
To identify what type of line breaks a text file contains, the file command can be used. Moreover, the editor vim can conveniently convert a file so that it is compatible with the Windows Notepad text editor. For example:
[prompt] > file myfile.txt
myfile.txt: ASCII English text
[prompt] > vim myfile.txt
within vim :set fileformat=dos
:wq
[prompt] > file myfile.txt
myfile.txt: ASCII English text, with CRLF line terminators
The following grep commands echo the filename (in this case myfile.txt) to the command line if the file is of the specified style:
grep -PL $'\r\n' myfile.txt # show UNIX style file (LF terminated)
grep -Pl $'\r\n' myfile.txt # show DOS style file (CRLF terminated)
For Debian-based systems, these commands are used:
egrep -L $'\r\n' myfile.txt # show UNIX style file (LF terminated)
egrep -l $'\r\n' myfile.txt # show DOS style file (CRLF terminated)
The above grep commands work under Unix systems or in Cygwin under Windows. Note that these commands make some assumptions about the kinds of files that exist on the system (specifically, they assume only UNIX and DOS-style files, with no Mac OS 9-style files).
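A more direct check counts each kind of terminator, which also covers CR-only files. This is a minimal Python sketch; the function name `count_newline_styles` and the labels it returns are hypothetical:

```python
import re


def count_newline_styles(data: bytes) -> dict:
    """Count CR+LF pairs, lone LFs, and lone CRs in a byte string."""
    crlf = len(re.findall(rb"\r\n", data))
    lone_lf = len(re.findall(rb"(?<!\r)\n", data))  # LF not preceded by CR
    lone_cr = len(re.findall(rb"\r(?!\n)", data))   # CR not followed by LF
    return {"CRLF": crlf, "LF": lone_lf, "CR": lone_cr}


counts = count_newline_styles(b"dos\r\nunix\nmac\rdos again\r\n")
print(counts)
```

A file in which more than one count is non-zero has mixed line endings, a case the grep commands do not distinguish.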
This technique is often combined with find to list files recursively. For instance, the following command checks all "regular files" (that is, it excludes directories, symbolic links, etc.) to find all UNIX-style files in a directory tree, starting from the current directory (.), and saves the results in file unix_files.txt, overwriting it if the file already exists:
find . -type f -exec grep -PL '\r\n' {} \; > unix_files.txt
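A comparable recursive scan can be sketched in Python with `pathlib`. Like the grep commands, it assumes that only Unix-style and DOS-style files are present; the function name `find_unix_style_files` is hypothetical:

```python
from pathlib import Path


def find_unix_style_files(root):
    """Return regular files under root that contain no CR+LF sequence."""
    matches = []
    for path in sorted(Path(root).rglob("*")):
        # Skip directories and symbolic links, keeping only regular files.
        if path.is_symlink() or not path.is_file():
            continue
        if b"\r\n" not in path.read_bytes():
            matches.append(path)
    return matches
```

Calling `find_unix_style_files(".")` scans the current directory tree. Each file is read fully into memory, which is acceptable for a sketch but not for very large files.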
This example will find C files and convert them to LF style line endings:
find . -name '*.[ch]' -exec fromdos {} \;
The file command also detects the type of EOL used:
file myfile.txt
> myfile.txt: ASCII text, with CRLF line terminators
Other tools permit the user to visualise the EOL characters:
od -a myfile.txt
cat -e myfile.txt
hexdump -c myfile.txt
The dos2unix, unix2dos, mac2unix, unix2mac, mac2dos, and dos2mac utilities can perform conversions. The flip command is also often used.