Overview
This article explains how content is indexed and searched when it is stored in a ZIP file or a multifile document.
Solution
The content of saved ZIP files that contain text-based files, such as Word or PDF files, can be indexed and searched. However, the rules below determine how much of that content is actually indexed.
In IDOL and Smart Search, the ZIP extension can be included or excluded from the search index.
Please note that ZIP is not included in the default list of accepted file extensions for indexing. An administrator can add it in Advanced Vault Settings: Configuration > Search > Full-Text Search > File Extensions to Index.
Files in a ZIP file are extracted and indexed, but the amount of indexed content is limited because a ZIP file is treated as a single file and not, for example, as a multifile document. This means that one 100 KB limit applies to the ZIP file as a whole: the limit is shared among all the files in it rather than applied to each zipped file separately (see below). A ZIP file can contain any of the supported file types.
For Smart Search and IDOL, the content limit per file is 100 KB of plain text. Content from zipped files is processed in lexicographical order of the file names (after extraction), and files are processed in that order until the 100 KB limit is reached. In practice, files whose names come first in lexicographical order are almost certainly indexed. However, if the ZIP file holds a large amount of data that compresses very well, such as plain text, the extracted content can easily exceed the limit, and most likely not everything in that ZIP file will be indexed.
For example, ten files of 10 KB each are all indexed in full, but with ten files of 30 KB each only the first 100 KB is indexed: the first three files in full (90 KB) and the first 10 KB of the fourth.
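The shared limit described above can be modeled with a short Python sketch. It is only an illustration of the rule, not product code: the function name, the file names, and the assumption that sizes refer to extracted plain text in KB are all hypothetical, and Python's default string ordering may differ from the exact ordering the search engine uses.

```python
def indexed_kb_per_file(extracted_sizes_kb, limit_kb=100):
    """Return how many KB of each file fit into one shared limit.

    extracted_sizes_kb maps a file name to the size of its extracted
    plain text in KB. Hypothetical model of the rule, not product code.
    """
    remaining = limit_kb
    indexed = {}
    # sorted() orders names by Unicode code point; the engine's exact
    # ordering rules (case, locale) are an assumption here.
    for name in sorted(extracted_sizes_kb):
        take = min(extracted_sizes_kb[name], remaining)
        indexed[name] = take
        remaining -= take
    return indexed


# The two cases from the example: ten 10 KB files and ten 30 KB files.
ten_small = {f"file{i:02d}.txt": 10 for i in range(1, 11)}
ten_large = {f"file{i:02d}.txt": 30 for i in range(1, 11)}

print(indexed_kb_per_file(ten_small))
print(indexed_kb_per_file(ten_large))
```

In the second case the sketch assigns 30 KB to each of the first three files, 10 KB to the fourth, and nothing to the remaining six.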
In a multifile document, Smart Search indexes up to 100 KB from each file until a total of 900 KB is reached. In IDOL, you can change these values with the Maximum Length of Single-File Content and Maximum Length of File Contents settings, which are found in the admin tool under Configuration > Search > Indexes > SS > Additional Options > Limits for Indexed Data. Use caution, however, as excessive values may cause slowness.
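The multifile rule differs from the ZIP rule in that each file has its own cap in addition to the total. A minimal Python sketch of that behavior follows; the function and parameter names are hypothetical, the defaults mirror the Smart Search values stated above, and the order in which the files of a multifile document are processed is assumed to be the order of the input list.

```python
def multifile_indexed_kb(file_sizes_kb, per_file_limit_kb=100,
                         total_limit_kb=900):
    """Return the KB indexed from each file of a multifile document.

    Hypothetical model: each file contributes at most per_file_limit_kb,
    and indexing stops once total_limit_kb has been used.
    """
    remaining = total_limit_kb
    indexed = []
    for size in file_sizes_kb:
        # A file is capped by its own size, the per-file limit,
        # and whatever is left of the total budget.
        take = min(size, per_file_limit_kb, remaining)
        indexed.append(take)
        remaining -= take
    return indexed


# Twelve files with 150 KB of plain text each.
result = multifile_indexed_kb([150] * 12)
print(result)
```

With these inputs the first nine files contribute 100 KB each and the last three contribute nothing, because the 900 KB total is already used up.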
The dtSearch engine handles ZIP files in the same way as Smart Search and IDOL, except that its limit is 2 MB of indexed plain text per file. Otherwise, all the rules above apply.
