Using IDM to detect partial contents of files
The primary use case for IDM is to detect file contents (as distinguished from binary files, such as audio or video files). You can use IDM to match partial contents of files, that is, content in common with the indexed document. File contents include text-based content of any document type from which the Remote IDM Indexer can extract the contents: Microsoft Office documents (Word, Excel, PowerPoint), PDFs, and many more.
The service does not consider the file format or file size when it creates the cryptographic hash for the index. A document might contain much more content, but the service detects only the file contents that are indexed as part of the Indexed Document Profile. For example, consider a situation where you index a one-page document, and that one-page document is included as part of a 100-page document. The 100-page document is considered a match (100% content in common) because its content matches 100% of the one-page document.
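
The containment behavior in this example can be illustrated with a minimal sketch. It is not the service's actual indexing algorithm, which is not described here. It assumes a simple scheme in which normalized text is split into overlapping character windows and each window is hashed with SHA-256. The function names (`normalize`, `build_index`, `content_in_common`), the window size, and the lowercasing step are all hypothetical.

```python
import hashlib
import string

WINDOW = 30  # hypothetical window size, in normalized characters


def normalize(text: str) -> str:
    """Remove punctuation and whitespace (lowercasing is an added assumption)."""
    drop = set(string.punctuation) | set(string.whitespace)
    return "".join(ch for ch in text.lower() if ch not in drop)


def build_index(text: str) -> set[str]:
    """Hash every overlapping window of normalized text; only hashes are kept."""
    norm = normalize(text)
    return {
        hashlib.sha256(norm[i:i + WINDOW].encode("utf-8")).hexdigest()
        for i in range(max(len(norm) - WINDOW + 1, 1))
    }


def content_in_common(index: set[str], candidate_text: str) -> float:
    """Fraction of the indexed hashes that also occur in the candidate."""
    return len(index & build_index(candidate_text)) / len(index)


# A short indexed document embedded in a much longer one.
one_page = "The quarterly revenue forecast for the northern region is confidential. " * 5
hundred_pages = "Introduction. " * 500 + one_page + "Appendix. " * 500

index = build_index(one_page)
print(content_in_common(index, hundred_pages))  # 1.0
```

Because the score is measured against the indexed document rather than the candidate, the extra content in the longer document does not lower it. The index in this sketch also holds only hashes, not the text itself.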
Note that the index does not contain actual document content.
The table "Requirements for using IDM to detect content by partial file matching" summarizes the requirements for matching partial file contents using IDM.
| Requirement | Description |
|---|---|
| File formats from which you can extract the contents | The Remote IDM Indexer must be able to process the file format and extract the document content. If the service can extract the document contents, it creates an index for partial file matching. |
| Unencapsulated file | The source document cannot be encapsulated in an archive file when the source document is indexed. If a document in the source is encapsulated in an archive file, the service treats the archive file like a binary file and creates an index for exact file matching. |
| Minimum amount of text | The source document must contain at least 300 characters of normalized text before the extracted content is indexed. Normalization involves the removal of punctuation and whitespace. If a document contains fewer than 50 normalized characters, the service performs an exact file match against the file binary. Note that the exact length varies depending on the file contents and encoding. |
| Maximum amount of text (for content extraction) | The maximum size of a document that can be processed for content extraction is 30,000,000 bytes. |
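
The requirements in the table can be read as a decision sequence. The following sketch is one hedged interpretation, not the service's implementation. The names `MatchMode` and `choose_match_mode` are hypothetical. The minimum character count is passed in by the caller rather than hard-coded, because the exact length varies with file contents and encoding. Treating non-extractable files as exact-match candidates is inferred from the way archive files are handled, and the behavior for documents over the size limit is not stated, so the sketch reports it as unspecified.

```python
from enum import Enum

MAX_EXTRACTION_BYTES = 30_000_000  # size limit for content extraction (from the table)


class MatchMode(Enum):
    PARTIAL = "partial file matching"
    EXACT = "exact file matching"
    UNSPECIFIED = "not covered by the table"


def choose_match_mode(size_bytes: int, in_archive: bool, extractable: bool,
                      normalized_chars: int, min_chars: int) -> MatchMode:
    """Pick the kind of index a source document would get (illustrative only)."""
    if in_archive:
        # Archive files are treated like binary files.
        return MatchMode.EXACT
    if not extractable:
        # Assumption: content that cannot be extracted is handled as binary.
        return MatchMode.EXACT
    if size_bytes > MAX_EXTRACTION_BYTES:
        # The table gives the limit but not the fallback behavior.
        return MatchMode.UNSPECIFIED
    if normalized_chars < min_chars:
        # Too little normalized text to index the content.
        return MatchMode.EXACT
    return MatchMode.PARTIAL


# Example: a 2 MB document with 12,000 normalized characters.
mode = choose_match_mode(size_bytes=2_000_000, in_archive=False,
                         extractable=True, normalized_chars=12_000,
                         min_chars=300)
print(mode.value)  # partial file matching
```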