Overlapping heredocs

Dec 31, 2013

A heredoc (short for “here document”) is defined as follows on Wikipedia:

In computer science, a here document (here-document, heredoc, hereis, here-string or here-script) is a file literal or input stream literal: it is a section of a source code file that is treated as if it were a separate file. The term is also used for a form of multiline string literals that use similar syntax, preserving line breaks and other whitespace (including indentation) in the text.

As far as I know, the most popular use of here documents is writing multiline strings that contain special characters without having to escape them manually. According to the same Wikipedia page, a heredoc is declared as follows:

The most common syntax for here documents, originating in Unix shells, is << followed by a delimiting identifier, followed, starting on the next line, by the text to be quoted, and then closed by the same identifier on its own line.

So far so good. From here onwards I’ll be talking about heredocs specifically in Ruby, but as far as I know the discussion applies to most implementations of here documents.

In Ruby, a simple heredoc assignment looks like this:

irb> content = <<HEREDOC
Hello there!
Hurr Burr.
HEREDOC
=> "Hello there!\nHurr Burr.\n"

After the above block of code runs, the variable content has the value "Hello there!\nHurr Burr.\n", which is what I expected. However, it is also possible for here documents to overlap. The book The Ruby Programming Language has the following to say about this:

In fact, after reading the content of a here document, the Ruby interpreter goes back to the line it was on and continues parsing it.
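A minimal sketch of what that sentence implies: whatever follows the heredoc marker on the opening line is still parsed as ordinary Ruby, even though the heredoc body sits on the lines below. The variable names greeting and parts and the delimiters TEXT and FIRST are only illustrative.

```ruby
# The body starts on the line after <<TEXT, but the rest of the opening
# line (.strip.upcase) is parsed as a normal method chain on the string.
greeting = <<TEXT.strip.upcase
Hello there!
TEXT

# The same holds inside other expressions: the array literal continues
# on the opening line after the heredoc marker.
parts = [<<FIRST, "second"]
first
FIRST

p greeting
p parts
```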

So, the following block of code is perfectly valid Ruby:

irb> content = <<HEREDOC + <<THEREDOC + "World"
Hello
HEREDOC
There
THEREDOC

According to the definition of here documents, the created here document should contain everything between the line following <<DELIMITER and the line that contains only DELIMITER. Accordingly, I expected content to be interpreted as

content = "Hello\n" + "Hello\nHEREDOC\nThere\n" + "World"

To clarify my thought process, I expected that

  1. The first here document would contain everything from the line following <<HEREDOC (line 2) till the line containing only HEREDOC (line 3)
  2. The second here document would have everything from the line following <<THEREDOC (line 2) till the line containing only THEREDOC (line 5)

However, this is not what is observed in practice. In reality, the first here document contains the text between lines 2 and 3, and the second here document contains the text between lines 4 and 5, giving us

=> "Hello\nThere\nWorld"
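To look at each here document on its own, the two markers can be put on one line in a multiple assignment rather than a concatenation. This is only a sketch; the names first and second are mine and do not come from the original example.

```ruby
# Two heredocs opened on the same line. Their bodies follow in the same
# left-to-right order as the markers.
first, second = <<HEREDOC, <<THEREDOC
Hello
HEREDOC
There
THEREDOC

p first   # body of the first here document
p second  # body of the second here document
p first + second + "World"
```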

What actually happens here is that the first here document effectively removes lines 2 and 3 from the input. Hence, the line immediately after the <<THEREDOC declaration is actually line 4 and not line 2. This can be best seen from the following example:

irb> content = <<HEREDOC + <<THEREDOC + "World"
Why
THEREDOC
Hello
HEREDOC
There
THEREDOC
=> "Why\nTHEREDOC\nHello\nThere\nWorld"

The first here document contains everything between line 2 and line 5, and the second here document has everything between lines 6 and 7.
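The behaviour can be modelled with a small simulation. This is a rough sketch of the idea, not how the Ruby parser is implemented: read_heredoc_bodies is a hypothetical helper, and it assumes every delimiter has a terminator line somewhere in the input.

```ruby
# Hypothetical helper (not part of Ruby): models how heredoc bodies are
# taken from the lines that follow the opening line. Each delimiter, in
# left-to-right order, consumes lines up to and including its terminator,
# so the next delimiter starts reading right after that terminator.
def read_heredoc_bodies(delimiters, following_lines)
  remaining = following_lines.dup
  delimiters.map do |delimiter|
    body = []
    # Consume lines until this delimiter's terminator line is found.
    while (line = remaining.shift)
      break if line == delimiter
      body << line
    end
    body.map { |l| "#{l}\n" }.join
  end
end

# Lines 2 to 7 of the last example above.
lines = %w[Why THEREDOC Hello HEREDOC There THEREDOC]
first, second = read_heredoc_bodies(%w[HEREDOC THEREDOC], lines)

p first
p second
p first + second + "World"
```

The THEREDOC on line 3 is swallowed by the first here document because, at that point, only HEREDOC counts as a terminator.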