out of space on /root

Paul Smith paul at mad-scientist.net
Mon Mar 6 18:42:36 UTC 2017


On Mon, 2017-03-06 at 18:35 +0100, Xen wrote:
>>> cat "file" | split -b 500M
>>
>> can be more efficiently written:
>>   split -b 500M < "file"
> 
> I disagree. The second expression is harder to write mentally and
> more prone to error.

Those are subjective assertions, and people who have a lot more
experience than either of us have disagreed with you for over 20 years.
I have no interest in arguing about it in general.

> There is barely a functional difference apart from the extra cat
> process, but this doesn't really take any resources. It's either Bash
> doing it, or Cat, there is no other difference.

That's not correct.  In the "with cat" method you have this:

   1. shell creates a pipe
   2. shell forks a new process A
   3. shell forks a new process B
   4. In process A, duplicate the write side of the pipe to stdout
   5. In process A, exec /usr/bin/cat "file"
   6. In process B, duplicate the read side of the pipe to stdin
   7. In process B, exec split -b 500M
   8. The "cat" process reads from the file, writes to the pipe
   9. The "split" process reads from the pipe, writes to files.
  10. Wait for A and B to finish
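The ten steps above can be made visible with a named pipe, which works
the same way as the anonymous pipe from step 1 (a sketch; the demo
file, fifo, and piece names are all invented):

```shell
# Small stand-in for the large "file".
printf 'hello pipe world\n' > demo.txt      # 17 bytes

# A named pipe plays the role of the anonymous pipe from step 1.
mkfifo demo.fifo

# Process A: cat reads the file and writes into the pipe (steps 4-5, 8).
cat demo.txt > demo.fifo &

# Process B: split reads from the pipe and writes pieces (steps 6-7, 9).
split -b 8 - piece_ < demo.fifo

# Step 10: wait for the background cat to finish.
wait

ls piece_*    # piece_aa piece_ab piece_ac (8 + 8 + 1 bytes)
```

Every byte of the file goes through the pipe, which is exactly the
extra copying discussed below.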

In this method the file's contents are read into user-space memory by
one process (cat), copied back into the kernel via the pipe, then read
again by the second process (split) and worked on.  Essentially you
pay the overhead of an entire extra read and write of the file.

In the "without cat" method you have this:

   1. Shell forks a new process A
   2. In process A, open "file"
   3. In process A, duplicate the "file" descriptor to stdin
   4. In process A, exec split -b 500M
   5. The "split" process reads the "file" (stdin), writes to files.
   6. Wait for A to finish

Here the split process reads directly from the file: there's no extra
copying, and neither the shell nor any other extra process ever reads
the file's data.
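On Linux you can see the difference directly: with redirection a
process's stdin is the regular file itself, while with cat it is a
pipe (a sketch using /proc, so Linux-only; demo.txt is an invented
name):

```shell
printf 'some data\n' > demo.txt

# With redirection, fd 0 points at the file itself.
readlink /proc/self/fd/0 < demo.txt        # a path ending in /demo.txt

# With cat, fd 0 points at an anonymous pipe.
cat demo.txt | readlink /proc/self/fd/0    # something like pipe:[12345]
```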

Even better would be:

  split -b 500M "file"

with no redirection at all.  In this method you get this:

   1. Shell forks a new process A
   2. In process A, exec split -b 500M "file"
   3. The split process opens "file", reads from it, writes to files
   4. Wait for A to finish
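All three invocations produce byte-identical pieces; only the plumbing
differs.  A quick sketch (file and directory names invented, with a
tiny block size standing in for 500M):

```shell
# Ten bytes standing in for the huge file.
printf 'abcdefghij' > big.dat
mkdir with_cat redirected by_name

( cd with_cat   && cat ../big.dat | split -b 4 )   # method 1: extra cat
( cd redirected && split -b 4 < ../big.dat )       # method 2: redirection
( cd by_name    && split -b 4 ../big.dat )         # method 3: file argument

# Each method yields xaa (4 bytes), xab (4), xac (2);
# concatenating them recovers the original.
cat by_name/xaa by_name/xab by_name/xac
```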

Here split is handed the actual file name, which is nice; some
programs can behave better if they have a real file they can stat(2)
to find the size, etc.  (Note that split still names its output pieces
xaa, xab, ... unless you pass a PREFIX argument.)

The advantage of the "read from the file" methods over the first
"read from pipe" method is that a pipe is uni-directional (you can't
seek backward in it), and it buffers only a limited amount of data at
a time (64KiB by default on modern Linux; writes are guaranteed atomic
only up to PIPE_BUF, which is 4KiB).

A program like "tail", if it works on a file, can go to the end of the
file and then back up from there and not have to read the beginning.  A
program like "split" can read blocks much larger than 4k at a time and
gain efficiency.  There could even be special kernel support for bulk
file IO that can be taken advantage of, which clearly can't be used
with pipes.
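The seek point is easy to check with tail: it gives the same answer
either way, but only when handed a real file can it lseek(2) to the
end instead of reading everything (a sketch; log.txt is an invented
name):

```shell
# 100000 numbered lines standing in for a large log file.
seq 100000 > log.txt

# With a file argument, tail can seek to the end and read only the tail.
tail -n 1 log.txt            # prints 100000

# Fed through a pipe, tail must read all 100000 lines to find the last.
cat log.txt | tail -n 1      # prints 100000 (same answer, far more I/O)
```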

> So I don't mind if you use that syntax, but please don't berate
> others for using the other style when it is, in fact, better.

Please don't state your personal subjective opinions as if they were
facts.

It's true that in many cases the processing difference is not
significant, because the amount of data involved is small.  But, I
didn't comment on the GENERAL case, I commented on THIS case.  In this
case, where we're dealing with such enormous files, there is absolutely
no question that the non-cat version is far superior and the cat
version should be avoided.

My view on using cat/pipes vs. simple redirection in general disagrees
with yours, but that's an opinion (held by many, but an opinion
nonetheless).




More information about the ubuntu-users mailing list