Millions of (small) text files in a folder



We would like to store millions of text files in a Linux filesystem, with the purpose of being able to zip up and serve an arbitrary collection as a service. We’ve tried other solutions, like a key/value database, but our requirements for concurrency and parallelism make using the native filesystem the best choice.

The most straightforward way is to store all files in a folder:

$ ls text_files/
1.txt
2.txt
3.txt

which should be possible on an EXT4 file system, which has no limit on the number of files in a folder.

The two FS processes will be:

  1. Write text file from web scrape (shouldn’t be affected by number of files in folder).
  2. Zip selected files, given by list of filenames.

My question is, will storing up to ten million files in a folder affect the performance of the above operations, or general system performance, any differently than making a tree of subfolders for the files to live in?

asked Dec 15 ’17 at 16:16 by user1717828

  • Related: How to fix intermittant “No space left on device” errors during mv when device has plenty of space. Using dir_index, which is often enabled by default, will speed up lookups but may limit the number of files per directory.
    – Mark Plotnick
    Dec 15 ’17 at 17:14

  • Why not try it quickly on a virtual machine and see what it’s like? With bash it’s trivial to populate a folder with a million text files with random characters inside. I feel like you’ll get really useful information that way, in addition to what you’ll learn here.
    – JoshuaD
    Dec 15 ’17 at 19:49

  • @JoshuaD: If you populate it all at once on a fresh FS, you’re likely to have all the inodes contiguous on disk, so ls -l or anything else that stats every inode in the directory (e.g. bash globbing / tab completion) will be artificially faster than after some wear and tear (delete some files, write some new ones). ext4 might do better with this than XFS, because XFS dynamically allocates space for inodes vs. data, so you can end up with inodes more scattered, I think. (But that’s a pure guess based on very little detailed knowledge; I’ve barely used ext4). Go with abc/def/subdirs.
    – Peter Cordes
    Dec 16 ’17 at 4:41

  • Yea, I don’t think the test I suggested will be able to tell the OP “this will work”, but it could definitely quickly tell him “this will not work”, which is useful.
    – JoshuaD
    Dec 16 ’17 at 9:07

  • “but our requirements for concurrency and parallelism make using the native filesystem the best choice” What did you try? Offhand, I’d think even a lower-end RDBMS such as MySQL and a Java servlet creating the zip files on the fly with ZipOutputStream would beat just about any free Linux native filesystem – I doubt you want to pay for IBM’s GPFS. The loop to process a JDBC result set and make that zip stream is probably merely 6-8 lines of Java code.
    – Andrew Henle
    Dec 16 ’17 at 12:59


5 Answers

Accepted answer (10 votes)

The ls command, or even TAB-completion or wildcard expansion by the shell, will normally present their results in alphanumeric order. This requires reading the entire directory listing and sorting it. With ten million files in a single directory, this sorting operation will take a non-negligible amount of time.

If you can resist the urge to use TAB-completion and, for example, write the names of the files to be zipped out in full, there should be no problems.

Another problem with wildcards is that the expansion may produce more filenames than will fit on a maximum-length command line. The typical maximum command line length will be more than adequate for most situations, but when we’re talking about millions of files in a single directory, this is no longer a safe assumption. When the maximum command line length is exceeded during wildcard expansion, most shells will simply fail the entire command line without executing it.

This can be solved by doing your wildcard operations using the find command:

find <directory> -name '<wildcard expression>' -exec <command> {} +

or a similar syntax whenever possible. The find ... -exec ... + construct automatically takes the maximum command line length into account and executes the command as many times as required, fitting as many filenames as possible onto each command line.
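
As a concrete sketch of that pattern (the archive name collection.zip is just an example), zipping every .txt file in the flat text_files/ directory from the question without hitting the argument-length limit could look like this:

# collection.zip is an example name. find batches the filenames itself, so no
# single zip invocation exceeds the argument-length limit; zip appends to the
# archive on each subsequent invocation, so one archive accumulates all files.
find text_files/ -maxdepth 1 -type f -name '*.txt' -exec zip -q collection.zip {} +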

answered Dec 15 ’17 at 17:11 by telcoM

  • Modern filesystems use B, B+ or similar trees to keep directory entries. en.wikipedia.org/wiki/HTree
    – dimm
    Dec 15 ’17 at 19:47

  • Yes… but if the shell or the ls command won’t get to know that the directory listing is already sorted, they are going to take the time to run the sorting algorithm anyway. And besides, the userspace may be using a localized sorting order (LC_COLLATE) that may be different from what the filesystem might do internally.
    – telcoM
    Dec 15 ’17 at 20:02

Answer (17 votes)

This is perilously close to an opinion-based question/answer but I’ll try to provide some facts with my opinions.

  1. If you have a very large number of files in a folder, any shell-based operation that tries to enumerate them (e.g. mv * /somewhere/else) may fail to expand the wildcard successfully, or the result may be too large to use.
  2. ls will take longer to enumerate a very large number of files than a small number of files.
  3. The filesystem will be able to handle millions of files in a single directory, but people will probably struggle.

One recommendation is to split the filename into two, three or four character chunks and use those as subdirectories. For example, somefilename.txt might be stored as som/efi/somefilename.txt. If you are using numeric names then split from right to left instead of left to right so that there is a more even distribution. For example 12345.txt might be stored as 345/12/12345.txt.
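
As a rough sketch of that layout (the chunk length and the text_files base directory are illustrative choices, not requirements), filing a document under its two three-character prefixes could look like this in bash:

# Shard somefilename.txt into som/efi/somefilename.txt using bash substring
# expansion; text_files is just an example base directory.
f=somefilename.txt
d="${f:0:3}/${f:3:3}"        # -> som/efi
mkdir -p "text_files/$d"
mv "$f" "text_files/$d/$f"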

You can use the equivalent of zip -j zipfile.zip path1/file1 path2/file2 ... to avoid including the intermediate subdirectory paths in the ZIP file.

If you are serving up these files from a webserver (I’m not entirely sure whether that’s relevant) it is trivial to hide this structure in favour of a virtual directory with rewrite rules in Apache2. I would assume the same is true for Nginx.

answered Dec 15 ’17 at 17:03 by roaima (last edited Dec 15 ’17 at 19:32)

  • The * expansion will succeed unless you run out of memory, but unless you raise the stacksize limit (on Linux) or use a shell where mv is builtin or can be builtin (ksh93, zsh), the execve() system call may fail with a E2BIG error.
    – Stéphane Chazelas
    Dec 15 ’17 at 17:49

  • @StéphaneChazelas yes ok, my choice of words might have been better, but the net effect for the user is much the same. I’ll see if I can alter the words slightly without getting bogged down in complexity.
    – roaima
    Dec 15 ’17 at 19:06

  • Just curious how you would uncompress that zip file if you avoid including the intermediate subdirectory paths in it, without running into the issues you discuss?
    – Octopus
    Dec 15 ’17 at 21:58

  • @Octopus the OP states that the zip file will contain “selected files, given by list of filenames”.
    – roaima
    Dec 15 ’17 at 22:01

  • I’d recommend using zip -j - ... and piping the output stream directly to the client’s network connection over zip -j zipfile.zip .... Writing an actual zipfile to disk means the data path is read from disk->compress->write to disk->read from disk->send to client. That can up to triple your disk IO requirements over read from disk->compress->send to client.
    – Andrew Henle
    Dec 17 ’17 at 12:15

Answer (4 votes)

I run a website which handles a database for movies, TV and video games. For each of these there are multiple images with TV containing dozens of images per show (i.e. episode snapshots etc).

There ends up being a lot of image files. Somewhere in the 250,000+ range. These are all stored in a mounted block storage device where access time is reasonable.

My first attempt at storing the images was in a single folder as /mnt/images/UUID.jpg

I ran into the following challenges.

  • ls via a remote terminal would just hang. The process would go zombie and CTRL+C would not break it.
  • Before I reached that point, any ls command would quickly fill the output buffer and CTRL+C would not stop the endless scrolling.
  • Zipping 250,000 files from a single folder took about 2 hours. You must run the zip command detached from the terminal otherwise any interruption in connection means you have to start over again.
  • I wouldn’t risk trying to use the zip file on Windows.
  • The folder quickly became a no humans allowed zone.

I ended up having to store the files in subfolders, using the creation time to build the path, such as /mnt/images/YYYY/MM/DD/UUID.jpg. This resolved all the above problems and allowed me to create zip files that targeted a date.
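
A minimal sketch of that scheme (the file names, paths and the uuidgen call here are illustrative, not part of the original setup):

# Illustrative only: store a new image under its creation date,
# then archive everything for one day.
d=$(date +%Y/%m/%d)
mkdir -p "/mnt/images/$d"
cp snapshot.jpg "/mnt/images/$d/$(uuidgen).jpg"
# Zip everything stored on a given date:
zip -qr images-2017-12-15.zip /mnt/images/2017/12/15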

If the only identifier you have for a file is a number, and these numbers tend to run in sequence, why not group them by 100000, 10000 and 1000?

For example, if you have a file named 384295.txt, the path would be:

/mnt/file/300000/80000/4000/295.txt

If you know you’ll reach a few million, use 0 prefixes for 1,000,000:

/mnt/file/000000/300000/80000/4000/295.txt
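
A sketch of how such a path could be derived from the numeric name with plain shell arithmetic (the /mnt/file base directory follows the example above and is illustrative):

# Map a numeric ID such as 384295 to /mnt/file/300000/80000/4000/295.txt
id=384295
d1=$(( id / 100000 * 100000 ))          # 300000
d2=$(( id % 100000 / 10000 * 10000 ))   # 80000
d3=$(( id % 10000 / 1000 * 1000 ))      # 4000
path="/mnt/file/$d1/$d2/$d3/$(( id % 1000 )).txt"
mkdir -p "${path%/*}"                   # create the directory chain first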

answered Dec 16 ’17 at 16:52 by cgTag

Answer (1 vote)

    Write text file from web scrape (shouldn’t be affected by number of files in folder).

Creating a new file requires scanning the directory file for a stretch of empty space large enough for the new directory entry. If no such space is found, the entry is placed at the end of the directory file. As the number of files in a directory increases, the time to scan the directory also increases.

As long as the directory files remain in the system cache, the performance hit from this won’t be bad, but if the data is released, reading the directory file (usually highly fragmented) from disk could consume quite a bit of time. An SSD improves this, but for a directory with millions of files, there could still be a noticeable performance hit.

    Zip selected files, given by list of filenames.

This is also likely to require additional time in a directory with millions of files. In a filesystem with hashed directory entries (like EXT4), this difference is minimal.
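
For reference, a small sketch of that zip-from-a-list operation (filenames.txt is a placeholder for the list): Info-ZIP’s zip can read the names to archive from standard input with -@, so command-line length is never a concern and the per-file directory lookups discussed above remain the main cost:

# zip reads one filename per line from stdin when given -@;
# filenames.txt is a placeholder for the selected-file list.
zip -q collection.zip -@ < filenames.txt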

    will storing up to ten million files in a folder affect the performance of the above operations, or general system performance, any differently than making a tree of subfolders for the files to live in?

A tree of subfolders has none of the above performance drawbacks. In addition, if the underlying filesystem is changed to one without hashed directory entries, the tree approach will still work well.


Answer (1 vote)

First: prevent ls from sorting, with ls -U; you could also update your ~/.bashrc to have alias ls='ls -U' or similar.

For your large file set, you can try this out as follows:

  • create a set of test files

  • see if many filenames cause issues

  • use xargs parameter-batching and zip’s (default) behaviour of adding files to a zip to avoid problems.

This worked well:

# create ~ 100k files
seq 1 99999 | sed "s/\(.*\)/a_somewhat_long_filename_as_a_prefix_to_exercise_zip_parameter_processing_\1.txt/" | xargs touch
# see if zip can handle such a list of names
zip -q /tmp/bar.zip ./*
    bash: /usr/bin/zip: Argument list too long
# use xargs to batch sets of filenames to zip
find . -type f | xargs zip -q /tmp/foo.zip
l /tmp/foo.zip
    28692 -rw-r--r-- 1 jmullee jmullee 29377592 2017-12-16 20:12 /tmp/foo.zip


        Your Answer

        StackExchange.ready(function() {
        var channelOptions = {
        tags: “”.split(” “),
        id: “106”
        };
        initTagRenderer(“”.split(” “), “”.split(” “), channelOptions);

        StackExchange.using(“externalEditor”, function() {
        // Have to fire editor after snippets, if snippets enabled
        if (StackExchange.settings.snippets.snippetsEnabled) {
        StackExchange.using(“snippets”, function() {
        createEditor();
        });
        }
        else {
        createEditor();
        }
        });

        function createEditor() {
        StackExchange.prepareEditor({
        heartbeatType: ‘answer’,
        convertImagesToLinks: false,
        noModals: false,
        showLowRepImageUploadWarning: true,
        reputationToPostImages: null,
        bindNavPrevention: true,
        postfix: “”,
        onDemand: true,
        discardSelector: “.discard-answer”
        ,immediatelyShowMarkdownHelp:true
        });

        }
        });

         
        draft saved
        draft discarded

        StackExchange.ready(
        function () {
        StackExchange.openid.initPostLogin(‘.new-post-login’, ‘https%3a%2f%2funix.stackexchange.com%2fquestions%2f411091%2fmillions-of-small-text-files-in-a-folder%23new-answer’, ‘question_page’);
        }
        );

        Post as a guest

        5 Answers
        5

        active

        oldest

        votes

        5 Answers
        5

        active

        oldest

        votes

        active

        oldest

        votes

        active

        oldest

        votes

        up vote
        10
        down vote

        accepted

        The ls command, or even TAB-completion or wildcard expansion by the shell, will normally present their results in alphanumeric order. This requires reading the entire directory listing and sorting it. With ten million files in a single directory, this sorting operation will take a non-negligible amount of time.

        If you can resist the urge of TAB-completion and e.g. write the names of files to be zipped in full, there should be no problems.

        Another problem with wildcards might be wildcard expansion possibly producing more filenames than will fit on a maximum-length command line. The typical maximum command line length will be more than adequate for most situations, but when we’re talking about millions of files in a single directory, this is no longer a safe assumption. When a maximum command line length is exceeded in wildcard expansion, most shells will simply fail the entire command line without executing it.

        This can be solved by doing your wildcard operations using the find command:

        find <directory> -name '<wildcard expression>' -exec <command> {} +
        

        or a similar syntax whenever possible. The find ... -exec ... + will automatically take into account the maximum command line length, and will execute the command as many times as required while fitting the maximal amount of filenames to each command line.

        share|improve this answer

        • Modern filesystems use B, B+ or similar trees to keep directory entries. en.wikipedia.org/wiki/HTree
          – dimm
          Dec 15 ’17 at 19:47

        • 4

          Yes… but if the shell or the ls command won’t get to know that the directory listing is already sorted, they are going to take the time to run the sorting algorithm anyway. And besides, the userspace may be using a localized sorting order (LC_COLLATE) that may be different from what the filesystem might do internally.
          – telcoM
          Dec 15 ’17 at 20:02

        up vote
        10
        down vote

        accepted

        The ls command, or even TAB-completion or wildcard expansion by the shell, will normally present their results in alphanumeric order. This requires reading the entire directory listing and sorting it. With ten million files in a single directory, this sorting operation will take a non-negligible amount of time.

        If you can resist the urge of TAB-completion and e.g. write the names of files to be zipped in full, there should be no problems.

        Another problem with wildcards might be wildcard expansion possibly producing more filenames than will fit on a maximum-length command line. The typical maximum command line length will be more than adequate for most situations, but when we’re talking about millions of files in a single directory, this is no longer a safe assumption. When a maximum command line length is exceeded in wildcard expansion, most shells will simply fail the entire command line without executing it.

        This can be solved by doing your wildcard operations using the find command:

        find <directory> -name '<wildcard expression>' -exec <command> {} +
        

        or a similar syntax whenever possible. The find ... -exec ... + will automatically take into account the maximum command line length, and will execute the command as many times as required while fitting the maximal amount of filenames to each command line.

        share|improve this answer

        • Modern filesystems use B, B+ or similar trees to keep directory entries. en.wikipedia.org/wiki/HTree
          – dimm
          Dec 15 ’17 at 19:47

        • 4

          Yes… but if the shell or the ls command won’t get to know that the directory listing is already sorted, they are going to take the time to run the sorting algorithm anyway. And besides, the userspace may be using a localized sorting order (LC_COLLATE) that may be different from what the filesystem might do internally.
          – telcoM
          Dec 15 ’17 at 20:02

        up vote
        10
        down vote

        accepted

        up vote
        10
        down vote

        accepted

        The ls command, or even TAB-completion or wildcard expansion by the shell, will normally present their results in alphanumeric order. This requires reading the entire directory listing and sorting it. With ten million files in a single directory, this sorting operation will take a non-negligible amount of time.

        If you can resist the urge of TAB-completion and e.g. write the names of files to be zipped in full, there should be no problems.

        Another problem with wildcards might be wildcard expansion possibly producing more filenames than will fit on a maximum-length command line. The typical maximum command line length will be more than adequate for most situations, but when we’re talking about millions of files in a single directory, this is no longer a safe assumption. When a maximum command line length is exceeded in wildcard expansion, most shells will simply fail the entire command line without executing it.

        This can be solved by doing your wildcard operations using the find command:

        find <directory> -name '<wildcard expression>' -exec <command> {} +
        

        or a similar syntax whenever possible. The find ... -exec ... + will automatically take into account the maximum command line length, and will execute the command as many times as required while fitting the maximal amount of filenames to each command line.

        share|improve this answer

        The ls command, or even TAB-completion or wildcard expansion by the shell, will normally present their results in alphanumeric order. This requires reading the entire directory listing and sorting it. With ten million files in a single directory, this sorting operation will take a non-negligible amount of time.

        If you can resist the urge of TAB-completion and e.g. write the names of files to be zipped in full, there should be no problems.

        Another problem with wildcards might be wildcard expansion possibly producing more filenames than will fit on a maximum-length command line. The typical maximum command line length will be more than adequate for most situations, but when we’re talking about millions of files in a single directory, this is no longer a safe assumption. When a maximum command line length is exceeded in wildcard expansion, most shells will simply fail the entire command line without executing it.

        This can be solved by doing your wildcard operations using the find command:

        find <directory> -name '<wildcard expression>' -exec <command> {} +
        

        or a similar syntax whenever possible. The find ... -exec ... + will automatically take into account the maximum command line length, and will execute the command as many times as required while fitting the maximal amount of filenames to each command line.

        share|improve this answer

        share|improve this answer

        share|improve this answer

        answered Dec 15 ’17 at 17:11

        telcoM

        10.8k11232

        10.8k11232

        • Modern filesystems use B, B+ or similar trees to keep directory entries. en.wikipedia.org/wiki/HTree
          – dimm
          Dec 15 ’17 at 19:47

        • 4

          Yes… but if the shell or the ls command won’t get to know that the directory listing is already sorted, they are going to take the time to run the sorting algorithm anyway. And besides, the userspace may be using a localized sorting order (LC_COLLATE) that may be different from what the filesystem might do internally.
          – telcoM
          Dec 15 ’17 at 20:02

        • Modern filesystems use B, B+ or similar trees to keep directory entries. en.wikipedia.org/wiki/HTree
          – dimm
          Dec 15 ’17 at 19:47

        • 4

          Yes… but if the shell or the ls command won’t get to know that the directory listing is already sorted, they are going to take the time to run the sorting algorithm anyway. And besides, the userspace may be using a localized sorting order (LC_COLLATE) that may be different from what the filesystem might do internally.
          – telcoM
          Dec 15 ’17 at 20:02

        Modern filesystems use B, B+ or similar trees to keep directory entries. en.wikipedia.org/wiki/HTree
        – dimm
        Dec 15 ’17 at 19:47

        Modern filesystems use B, B+ or similar trees to keep directory entries. en.wikipedia.org/wiki/HTree
        – dimm
        Dec 15 ’17 at 19:47

        4

        4

        Yes… but if the shell or the ls command won’t get to know that the directory listing is already sorted, they are going to take the time to run the sorting algorithm anyway. And besides, the userspace may be using a localized sorting order (LC_COLLATE) that may be different from what the filesystem might do internally.
        – telcoM
        Dec 15 ’17 at 20:02

        Yes… but if the shell or the ls command won’t get to know that the directory listing is already sorted, they are going to take the time to run the sorting algorithm anyway. And besides, the userspace may be using a localized sorting order (LC_COLLATE) that may be different from what the filesystem might do internally.
        – telcoM
        Dec 15 ’17 at 20:02

        up vote
        17
        down vote

        This is perilously close to an opinion-based question/answer but I’ll try to provide some facts with my opinions.

        1. If you have a very large number of files in a folder, any shell-based operation that tries to enumerate them (e.g. mv * /somewhere/else) may fail to expand the wildcard successfully, or the result may be too large to use.
        2. ls will take longer to enumerate a very large number of files than a small number of files.
        3. The filesystem will be able to handle millions of files in a single directory, but people will probably struggle.

        One recommendation is to split the filename into two, three or four character chunks and use those as subdirectories. For example, somefilename.txt might be stored as som/efi/somefilename.txt. If you are using numeric names then split from right to left instead of left to right so that there is a more even distribution. For example 12345.txt might be stored as 345/12/12345.txt.

        You can use the equivalent of zip -j zipfile.zip path1/file1 path2/file2 ... to avoid including the intermediate subdirectory paths in the ZIP file.

        If you are serving up these files from a webserver (I’m not entirely sure whether that’s relevant) it is trivial to hide this structure in favour of a virtual directory with rewrite rules in Apache2. I would assume the same is true for Nginx.

        share|improve this answer

        • The * expansion will succeed unless you run out of memory, but unless you raise the stacksize limit (on Linux) or use a shell where mv is builtin or can be builtin (ksh93, zsh), the execve() system call may fail with a E2BIG error.
          – Stéphane Chazelas
          Dec 15 ’17 at 17:49

        • @StéphaneChazelas yes ok, my choice of words might have been better, but the net effect for the user is much the same. I’ll see if I can alter the words slightly without getting bogged down in complexity.
          – roaima
          Dec 15 ’17 at 19:06

        • Just curious how you would uncompress that zip file if you avoid including the intermediate subdirectory paths in it, without running into the issues you discuss?
          – Octopus
          Dec 15 ’17 at 21:58

        • 1

          @Octopus the OP states that the zip file will contain “selected files, given by list of filenames“.
          – roaima
          Dec 15 ’17 at 22:01

        • I’d recommend using zip -j - ... and piping the output stream directly to the client’s network connection over zip -j zipfile.zip .... Writing an actual zipfile to disk means the data path is read from disk->compress->write to disk->read from disk->send to client. That can up to triple your disk IO requirements over read from disk->compress->send to client.
          – Andrew Henle
          Dec 17 ’17 at 12:15

        up vote
        17
        down vote

        This is perilously close to an opinion-based question/answer but I’ll try to provide some facts with my opinions.

        1. If you have a very large number of files in a folder, any shell-based operation that tries to enumerate them (e.g. mv * /somewhere/else) may fail to expand the wildcard successfully, or the result may be too large to use.
        2. ls will take longer to enumerate a very large number of files than a small number of files.
        3. The filesystem will be able to handle millions of files in a single directory, but people will probably struggle.

        One recommendation is to split the filename into two, three or four character chunks and use those as subdirectories. For example, somefilename.txt might be stored as som/efi/somefilename.txt. If you are using numeric names then split from right to left instead of left to right so that there is a more even distribution. For example 12345.txt might be stored as 345/12/12345.txt.

        You can use the equivalent of zip -j zipfile.zip path1/file1 path2/file2 ... to avoid including the intermediate subdirectory paths in the ZIP file.

        If you are serving up these files from a webserver (I’m not entirely sure whether that’s relevant) it is trivial to hide this structure in favour of a virtual directory with rewrite rules in Apache2. I would assume the same is true for Nginx.

        share|improve this answer

        • The * expansion will succeed unless you run out of memory, but unless you raise the stacksize limit (on Linux) or use a shell where mv is builtin or can be builtin (ksh93, zsh), the execve() system call may fail with a E2BIG error.
          – Stéphane Chazelas
          Dec 15 ’17 at 17:49

        • @StéphaneChazelas yes ok, my choice of words might have been better, but the net effect for the user is much the same. I’ll see if I can alter the words slightly without getting bogged down in complexity.
          – roaima
          Dec 15 ’17 at 19:06

        • Just curious how you would uncompress that zip file if you avoid including the intermediate subdirectory paths in it, without running into the issues you discuss?
          – Octopus
          Dec 15 ’17 at 21:58

        • 1

          @Octopus the OP states that the zip file will contain “selected files, given by list of filenames“.
          – roaima
          Dec 15 ’17 at 22:01

        • I’d recommend using zip -j - ... and piping the output stream directly to the client’s network connection over zip -j zipfile.zip .... Writing an actual zipfile to disk means the data path is read from disk->compress->write to disk->read from disk->send to client. That can up to triple your disk IO requirements over read from disk->compress->send to client.
          – Andrew Henle
          Dec 17 ’17 at 12:15

        up vote
        17
        down vote

        up vote
        17
        down vote

        This is perilously close to an opinion-based question/answer but I’ll try to provide some facts with my opinions.

        1. If you have a very large number of files in a folder, any shell-based operation that tries to enumerate them (e.g. mv * /somewhere/else) may fail to expand the wildcard successfully, or the result may be too large to use.
        2. ls will take longer to enumerate a very large number of files than a small number of files.
        3. The filesystem will be able to handle millions of files in a single directory, but people will probably struggle.

        One recommendation is to split the filename into two, three or four character chunks and use those as subdirectories. For example, somefilename.txt might be stored as som/efi/somefilename.txt. If you are using numeric names then split from right to left instead of left to right so that there is a more even distribution. For example 12345.txt might be stored as 345/12/12345.txt.

        You can use the equivalent of zip -j zipfile.zip path1/file1 path2/file2 ... to avoid including the intermediate subdirectory paths in the ZIP file.

        If you are serving up these files from a webserver (I’m not entirely sure whether that’s relevant) it is trivial to hide this structure in favour of a virtual directory with rewrite rules in Apache2. I would assume the same is true for Nginx.

        share|improve this answer

        This is perilously close to an opinion-based question/answer but I’ll try to provide some facts with my opinions.

        1. If you have a very large number of files in a folder, any shell-based operation that tries to enumerate them (e.g. mv * /somewhere/else) may fail to expand the wildcard successfully, or the result may be too large to use.
        2. ls will take longer to enumerate a very large number of files than a small number of files.
        3. The filesystem will be able to handle millions of files in a single directory, but people will probably struggle.

        One recommendation is to split the filename into two, three or four character chunks and use those as subdirectories. For example, somefilename.txt might be stored as som/efi/somefilename.txt. If you are using numeric names then split from right to left instead of left to right so that there is a more even distribution. For example 12345.txt might be stored as 345/12/12345.txt.

        You can use the equivalent of zip -j zipfile.zip path1/file1 path2/file2 ... to avoid including the intermediate subdirectory paths in the ZIP file.

        If you are serving up these files from a webserver (I’m not entirely sure whether that’s relevant) it is trivial to hide this structure in favour of a virtual directory with rewrite rules in Apache2. I would assume the same is true for Nginx.

        share|improve this answer

        share|improve this answer

        share|improve this answer

        edited Dec 15 ’17 at 19:32

        answered Dec 15 ’17 at 17:03

        roaima

        39.8k546109

        39.8k546109

        • The * expansion will succeed unless you run out of memory, but unless you raise the stacksize limit (on Linux) or use a shell where mv is builtin or can be builtin (ksh93, zsh), the execve() system call may fail with a E2BIG error.
          – Stéphane Chazelas
          Dec 15 ’17 at 17:49

        • @StéphaneChazelas yes ok, my choice of words might have been better, but the net effect for the user is much the same. I’ll see if I can alter the words slightly without getting bogged down in complexity.
          – roaima
          Dec 15 ’17 at 19:06

        • Just curious how you would uncompress that zip file if you avoid including the intermediate subdirectory paths in it, without running into the issues you discuss?
          – Octopus
          Dec 15 ’17 at 21:58

        • 1

          @Octopus the OP states that the zip file will contain “selected files, given by list of filenames“.
          – roaima
          Dec 15 ’17 at 22:01

        • I’d recommend using zip -j - ... and piping the output stream directly to the client’s network connection over zip -j zipfile.zip .... Writing an actual zipfile to disk means the data path is read from disk->compress->write to disk->read from disk->send to client. That can up to triple your disk IO requirements over read from disk->compress->send to client.
          – Andrew Henle
          Dec 17 ’17 at 12:15

        • The * expansion will succeed unless you run out of memory, but unless you raise the stacksize limit (on Linux) or use a shell where mv is builtin or can be builtin (ksh93, zsh), the execve() system call may fail with a E2BIG error.
          – Stéphane Chazelas
          Dec 15 ’17 at 17:49

        • @StéphaneChazelas yes ok, my choice of words might have been better, but the net effect for the user is much the same. I’ll see if I can alter the words slightly without getting bogged down in complexity.
          – roaima
          Dec 15 ’17 at 19:06

        • Just curious how you would uncompress that zip file if you avoid including the intermediate subdirectory paths in it, without running into the issues you discuss?
          – Octopus
          Dec 15 ’17 at 21:58

        • 1

          @Octopus the OP states that the zip file will contain “selected files, given by list of filenames“.
          – roaima
          Dec 15 ’17 at 22:01

        • I’d recommend using zip -j - ... and piping the output stream directly to the client’s network connection over zip -j zipfile.zip .... Writing an actual zipfile to disk means the data path is read from disk->compress->write to disk->read from disk->send to client. That can up to triple your disk IO requirements over read from disk->compress->send to client.
          – Andrew Henle
          Dec 17 ’17 at 12:15

        The * expansion will succeed unless you run out of memory, but unless you raise the stacksize limit (on Linux) or use a shell where mv is builtin or can be builtin (ksh93, zsh), the execve() system call may fail with a E2BIG error.
        – Stéphane Chazelas
        Dec 15 ’17 at 17:49

        The * expansion will succeed unless you run out of memory, but unless you raise the stacksize limit (on Linux) or use a shell where mv is builtin or can be builtin (ksh93, zsh), the execve() system call may fail with a E2BIG error.
        – Stéphane Chazelas
        Dec 15 ’17 at 17:49

        @StéphaneChazelas yes ok, my choice of words might have been better, but the net effect for the user is much the same. I’ll see if I can alter the words slightly without getting bogged down in complexity.
        – roaima
        Dec 15 ’17 at 19:06

        @StéphaneChazelas yes ok, my choice of words might have been better, but the net effect for the user is much the same. I’ll see if I can alter the words slightly without getting bogged down in complexity.
        – roaima
        Dec 15 ’17 at 19:06

        Just curious how you would uncompress that zip file if you avoid including the intermediate subdirectory paths in it, without running into the issues you discuss?
        – Octopus
        Dec 15 ’17 at 21:58

        Just curious how you would uncompress that zip file if you avoid including the intermediate subdirectory paths in it, without running into the issues you discuss?
        – Octopus
        Dec 15 ’17 at 21:58

        1

        1

        @Octopus the OP states that the zip file will contain “selected files, given by list of filenames“.
        – roaima
        Dec 15 ’17 at 22:01

        @Octopus the OP states that the zip file will contain “selected files, given by list of filenames“.
        – roaima
        Dec 15 ’17 at 22:01

        I’d recommend using zip -j - ... and piping the output stream directly to the client’s network connection over zip -j zipfile.zip .... Writing an actual zipfile to disk means the data path is read from disk->compress->write to disk->read from disk->send to client. That can up to triple your disk IO requirements over read from disk->compress->send to client.
        – Andrew Henle
        Dec 17 ’17 at 12:15

        I’d recommend using zip -j - ... and piping the output stream directly to the client’s network connection over zip -j zipfile.zip .... Writing an actual zipfile to disk means the data path is read from disk->compress->write to disk->read from disk->send to client. That can up to triple your disk IO requirements over read from disk->compress->send to client.
        – Andrew Henle
        Dec 17 ’17 at 12:15

        up vote
        4
        down vote

        I run a website which handles a database for movies, TV and video games. For each of these there are multiple images with TV containing dozens of images per show (i.e. episode snapshots etc).

        There ends up being a lot of image files. Somewhere in the 250,000+ range. These are all stored in a mounted block storage device where access time is reasonable.

        My first attempt at storing the images was in a single folder as /mnt/images/UUID.jpg

        I ran into the following challenges.

        • ls via a remote terminal would just hang. The process would go zombie and CTRL+C would not break it.
        • before I reach that point any ls command would quickly fill the output buffer and CTRL+C would not stop the endless scrolling.
        • Zipping 250,000 files from a single folder took about 2 hours. You must run the zip command detached from the terminal otherwise any interruption in connection means you have to start over again.
        • I wouldn’t risk trying to use the zip file on Windows.
        • The folder quickly became a no humans allowed zone.

        I ended up having to store the files in subfolders using the creation time to create the path. Such as /mnt/images/YYYY/MM/DD/UUID.jpg. This resolved all the above problems, and allowed me to create zip files that targeted a date.

        If the only identifier for a file you have is a numeric number, and these numbers tend to run in sequence. Why not group them by 100000, 10000 and 1000.

        For example, if you have a file named 384295.txt the path would be:

        /mnt/file/300000/80000/4000/295.txt
        

        If you know you’ll reach a few million. Use 0 prefixes for 1,000,000

        /mnt/file/000000/300000/80000/4000/295.txt
        

        share|improve this answer

          up vote
          4
          down vote

          I run a website which handles a database for movies, TV and video games. For each of these there are multiple images with TV containing dozens of images per show (i.e. episode snapshots etc).

          There ends up being a lot of image files. Somewhere in the 250,000+ range. These are all stored in a mounted block storage device where access time is reasonable.

          My first attempt at storing the images was in a single folder as /mnt/images/UUID.jpg

          I ran into the following challenges.

          • ls via a remote terminal would just hang. The process would go zombie and CTRL+C would not break it.
          • before I reach that point any ls command would quickly fill the output buffer and CTRL+C would not stop the endless scrolling.
          • Zipping 250,000 files from a single folder took about 2 hours. You must run the zip command detached from the terminal otherwise any interruption in connection means you have to start over again.
          • I wouldn’t risk trying to use the zip file on Windows.
          • The folder quickly became a no humans allowed zone.

          I ended up having to store the files in subfolders using the creation time to create the path. Such as /mnt/images/YYYY/MM/DD/UUID.jpg. This resolved all the above problems, and allowed me to create zip files that targeted a date.

          If the only identifier for a file you have is a numeric number, and these numbers tend to run in sequence. Why not group them by 100000, 10000 and 1000.

          For example, if you have a file named 384295.txt the path would be:

          /mnt/file/300000/80000/4000/295.txt
          

          If you know you’ll reach a few million. Use 0 prefixes for 1,000,000

          /mnt/file/000000/300000/80000/4000/295.txt
          

          share|improve this answer

            up vote
            4
            down vote

            up vote
            4
            down vote

            I run a website which handles a database for movies, TV and video games. For each of these there are multiple images with TV containing dozens of images per show (i.e. episode snapshots etc).

            There ends up being a lot of image files. Somewhere in the 250,000+ range. These are all stored in a mounted block storage device where access time is reasonable.

            My first attempt at storing the images was in a single folder as /mnt/images/UUID.jpg

            I ran into the following challenges.

            • ls via a remote terminal would just hang. The process would go zombie and CTRL+C would not break it.
            • before I reach that point any ls command would quickly fill the output buffer and CTRL+C would not stop the endless scrolling.
            • Zipping 250,000 files from a single folder took about 2 hours. You must run the zip command detached from the terminal otherwise any interruption in connection means you have to start over again.
            • I wouldn’t risk trying to use the zip file on Windows.
            • The folder quickly became a no humans allowed zone.

            I ended up having to store the files in subfolders using the creation time to create the path. Such as /mnt/images/YYYY/MM/DD/UUID.jpg. This resolved all the above problems, and allowed me to create zip files that targeted a date.

            If the only identifier for a file you have is a numeric number, and these numbers tend to run in sequence. Why not group them by 100000, 10000 and 1000.

            For example, if you have a file named 384295.txt the path would be:

            /mnt/file/300000/80000/4000/295.txt
            

            If you know you’ll reach a few million. Use 0 prefixes for 1,000,000

            /mnt/file/000000/300000/80000/4000/295.txt
            

            share|improve this answer


            answered Dec 16 ’17 at 16:52

            cgTag

1412

                up vote
                1
                down vote

                Write text file from web scrape (shouldn’t be affected by number of files in folder).

Creating a new file requires scanning the directory file for a free slot large enough to hold the new directory entry. If no such slot is found, the entry is placed at the end of the directory file. As the number of files in a directory increases, so does the time needed to scan it.

As long as the directory data remains in the system cache, the performance hit from this won't be bad, but if that data is evicted, reading the (usually highly fragmented) directory file back from disk can consume quite a bit of time. An SSD improves this, but for a directory with millions of files there could still be a noticeable performance hit.
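
One rough way to observe this cache effect on a test box (a sketch only, not from the original answer; requires root, and the /mnt/file path is an assumption):

# First listing is likely served from the cached directory data.
time ls -f /mnt/file > /dev/null
# Drop the page cache and dentry/inode caches, then force a re-read from disk.
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
time ls -f /mnt/file > /dev/null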

                Zip selected files, given by list of filenames.

                This is also likely to require additional time in a directory with millions of files. In a file-system with hashed directory entries (like EXT4), this difference is minimal.

                will storing up to ten million files in a folder affect the performance of the above operations, or general system performance, any differently than making a tree of subfolders for the files to live in?

A tree of subfolders has none of the above performance drawbacks. In addition, if the underlying file system is ever changed to one without hashed directory entries, the tree approach will still work well.
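
As one way to build such a tree from the scraper (a sketch only, not from the original answer; the /mnt/scrape path and two-level depth are assumptions), each file can be bucketed by a hash prefix of its name so that no single directory grows unboundedly:

# Bucket by the first four hex characters of an md5 of the file name.
name="384295.txt"
h=$(printf '%s' "$name" | md5sum | cut -c1-4)
dir="/mnt/scrape/${h:0:2}/${h:2:2}"
mkdir -p "$dir"
printf 'scraped content\n' > "$dir/$name"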

                share|improve this answer

                    answered Dec 16 ’17 at 8:55

                    Peter

1066

                        up vote
                        1
                        down vote

Firstly: prevent ls from sorting by using ls -U; you could update your ~/.bashrc with alias ls='ls -U' or similar.

For your large file set, you can test this as follows:

• create a set of test files

• see if that many filenames cause issues

• use xargs parameter batching and zip's default behaviour of appending files to an existing archive to avoid problems.

                        This worked well:

# create ~100k files (the sed backreference \1 puts the sequence number into each name)
seq 1 99999 | sed "s/\(.*\)/a_somewhat_long_filename_as_a_prefix_to_exercise_zip_parameter_processing_\1.txt/" | xargs touch
# see if zip can handle such a list of names on one command line
zip -q /tmp/bar.zip ./*
    bash: /usr/bin/zip: Argument list too long
# use xargs to batch sets of filenames for zip (zip appends to an existing archive by default)
find . -type f | xargs zip -q /tmp/foo.zip
# 'l' here is a long-listing shell alias (e.g. ls -ls)
l /tmp/foo.zip
    28692 -rw-r--r-- 1 jmullee jmullee 29377592 2017-12-16 20:12 /tmp/foo.zip
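
If filenames may contain spaces or newlines, a null-delimited variant of the same batching (a sketch, not from the original answer) avoids word-splitting problems:

# Same idea, but robust to whitespace in names.
find . -type f -print0 | xargs -0 zip -q /tmp/foo.zip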
                        

                        share|improve this answer

                            answered Dec 16 ’17 at 20:20

                            jmullee

35815

                                 