Upgrading attachments indexers
Version 1.3.0 is ready to be released with some major improvements to the full-text search and the full-text indexer. A couple of packaging issues are still being worked on, but the code won’t change significantly from the 1.3.0 tags on the master branches on GitHub, for both manitou-mail-mdx and manitou-mail-ui.
The list of changes from the previous version is currently visible in the NEWS file in the sources.
It should be noted that 1.3.0 will require some adjustments in how attachments in various formats are indexed with the help of user-supplied scripts.
Up to now, indexer plugins were used to index the contents of attached files in PDF or DOC, or even HTML formats.
Starting from 1.3.0, this is no longer desirable: such scripts should instead be integrated through a new method called “words extractors” and declared in the main manitou-mdx configuration file. Example:
index_words_extractors = application/pdf: /opt/scripts/pdf2text \
   application/msword: /opt/scripts/word2text
Each user-supplied script receives the attachment contents, in the attachment's own format, on its standard input, and should write the extracted words, encoded in UTF-8, to its standard output.
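Before declaring a script in the configuration, it can help to check that it respects this contract. The following is only a sketch of such a check: `check_extractor` is a hypothetical helper, not something shipped with manitou-mdx, and it assumes that `mktemp` and `iconv` are available on the system.

```shell
#!/bin/sh
# Hypothetical helper (not part of manitou-mdx): smoke-test a words extractor.
# Usage: check_extractor SAMPLE_FILE COMMAND [ARGS...]
# Returns 0 if COMMAND succeeds and its output is valid UTF-8.
check_extractor() {
    sample=$1
    shift
    out=$(mktemp) || return 1
    # Feed the sample attachment on standard input and capture standard output.
    if ! "$@" <"$sample" >"$out"; then
        echo "extractor exited with an error" >&2
        rm -f -- "$out"
        return 1
    fi
    # iconv fails on byte sequences that are not valid UTF-8.
    if ! iconv -f UTF-8 -t UTF-8 <"$out" >/dev/null 2>&1; then
        echo "output is not valid UTF-8" >&2
        rm -f -- "$out"
        return 1
    fi
    rm -f -- "$out"
    return 0
}
```

For instance, `check_extractor sample.pdf /opt/scripts/pdf2text` would exercise the PDF script from the configuration example, where `sample.pdf` stands for any test file at hand.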
For installs that didn’t index attachments with plugins, nothing changes in this respect. After upgrading to 1.3.0, HTML contents will simply start being indexed, since this is now done internally by default, so no manual action is required.
On the other hand, for installs that used indexer plugins, a preliminary step before upgrading to 1.3.0 is to convert them to words extractor scripts. This is not particularly difficult. As an example, here is a ready-to-use script that extracts words from MS-Word files with antiword:
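#!/bin/sh
# Store the attachment read from standard input in a temporary .doc file
t=$(tempfile --suffix=.doc) || exit 1
# Remove the temporary file on any exit path
trap "rm -f -- '$t'" EXIT
cat >"$t"
# Write the extracted text to standard output
antiword -i1 "$t" || exit 1
rm -f -- "$t"
trap - EXIT
exit 0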
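The same pattern can be applied to the PDF entry of the configuration example. The sketch below is an assumption about how /opt/scripts/pdf2text could be written, not the script actually used by any install: it relies on the pdftotext command (from poppler or xpdf) being installed, and it uses mktemp instead of tempfile, since tempfile may not exist outside Debian-based systems. The extractor is written as a function named `pdf2text` (a name chosen here for illustration).

```shell
#!/bin/sh
# Sketch of a words extractor for application/pdf, modeled on the MS-Word script.
# Assumption: pdftotext (poppler or xpdf) is installed and in the PATH.
pdf2text() {
    t=$(mktemp) || return 1
    # Save the attachment received on standard input.
    cat >"$t"
    # Convert to UTF-8 text; the final "-" sends the text to standard output.
    pdftotext -q -enc UTF-8 "$t" -
    status=$?
    # Always remove the temporary file, then report pdftotext's exit status.
    rm -f -- "$t"
    return "$status"
}
```

To use it as a standalone script, the file would end with a call to `pdf2text` on its own line, so that the script's exit status is the one returned by the function.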
This is a preliminary step because rebuilding the inverted word index is recommended when upgrading to 1.3.0, and doing so involves reindexing attachments as well.
Up to version 1.2, that was not possible with indexer plugins, which is one of the reasons why plugins are deprecated as a way to index attachment contents. In addition, the lack of integration with the words vectors cache was a performance drag, and 1.3 solves that as well with its words extractors method.