[erlang-questions] Request for enhancement: Sparse files

Ville Silventoinen ville.silventoinen@REDACTED
Wed Jun 10 10:43:31 CEST 2009


Hi Richard,

Thank you for the reminders. I need to do two things: scan big
filesystems (billions of files, petabytes of data) and copy them. I've
implemented both in Erlang, but while doing so I've found some
limitations in the Erlang file module (some may remember my earlier
post about file:write_link_info). In my original email I requested
two things:
1) Adding st_blocks and st_blksize information to the file_info record.
2) Support for sparse files in file:copy if possible.
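
To illustrate what the first request would expose, here is a minimal sketch in Python (not Erlang, since the file_info record does not carry these fields). The function name block_info and the "maybe_sparse" heuristic are hypothetical, and the 512-byte unit for st_blocks is an assumption that holds on Linux but is not fixed by POSIX.

```python
import os

def block_info(path):
    """Return size and block-allocation fields for path, without following symlinks."""
    st = os.lstat(path)
    # st_blocks and st_blksize are only available on Unix-like platforms.
    blocks = getattr(st, "st_blocks", None)
    blksize = getattr(st, "st_blksize", None)
    # Assumption: st_blocks counts 512-byte units (true on Linux, unspecified by POSIX).
    allocated = blocks * 512 if blocks is not None else None
    return {
        "size": st.st_size,
        "blocks": blocks,
        "blksize": blksize,
        "allocated_bytes": allocated,
        # Heuristic only: fewer allocated bytes than the logical size suggests holes.
        "maybe_sparse": allocated is not None and allocated < st.st_size,
    }
```

A scanner could call block_info on each file and treat "maybe_sparse" as a hint only, since compression or delayed allocation can also make the allocated size smaller than the logical size.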

I'm perfectly happy with an answer such as "We won't provide this, write
a C driver" or "You can modify the Erlang sources, here is how to build
without ClearCase...".

If this is not the right mailing list for requesting enhancements, I apologise.

Thanks,
Ville

On Wed, Jun 10, 2009 at 4:21 AM, Richard O'Keefe<ok@REDACTED> wrote:
> In "UNIX" systems (BSD, System V, Solaris, Linux, MacOS) blocks in
> a file that do not exist are read back as all bytes zero.  So if
> you want to copy a file without introducing unnecessary bytes,
> you have to check for all-zero blocks (and it does not matter whether
> an all-zero block was real or faked).  That check can be done in
> application logic IF you know what the block size of the file that
> you are *writing* actually is.
>
> The Single Unix Specification is quite explicit:
>
> blksize_t st_blksize
>    A file system-specific preferred I/O block size for this object.
>    In some file system types, this may vary from file to file.
> blkcnt_t st_blocks
>    Number of blocks allocated for this object.
>
> Reminder: st_blocks says how many blocks were allocated for the
> _original_ file.  The copy might be on another file system or for
> some other reason have a different block size from the original.
>
> You could see this as a crude form of data deduplication.
>
> Actually, st_blksize is the recommended size of an I/O *transfer*,
> not necessarily the allocation unit on disc.  Considering that the
> hardware block size on IDE discs is defined by the interface to be
> 512 bytes, it would probably be sufficient for a program to check
> 512 bytes at a time.
>
>
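
The zero-check described in the quoted message could be sketched roughly as follows, in Python. This is an illustration under assumptions, not how file:copy works: the name sparse_copy and its parameters are hypothetical, and the 512-byte check size simply follows the suggestion above. Whether a skipped range actually becomes a hole depends on the destination file system's allocation unit.

```python
import os

def sparse_copy(src, dst, check_size=512, buf_size=64 * 1024):
    """Copy src to dst, seeking over all-zero pieces instead of writing them."""
    zeros = bytes(check_size)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            buf = fin.read(buf_size)
            if not buf:
                break
            # Examine the buffer check_size bytes at a time.
            for off in range(0, len(buf), check_size):
                piece = buf[off:off + check_size]
                if piece == zeros[:len(piece)]:
                    # Skip ahead; the file system may leave a hole here.
                    fout.seek(len(piece), os.SEEK_CUR)
                else:
                    fout.write(piece)
        # A trailing seek does not extend the file, so set the final length explicitly.
        fout.truncate(fout.tell())
```

As the quoted text notes, it does not matter whether an all-zero piece in the source was a real block or a hole: both read back as zeros, so the copy has the same contents either way.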