See also Implementation of the inverted word index for how the word index is stored in the database, and the definition of tables:
The word index can be queried in SQL. Here is the code of a PL/pgSQL function that retrieves the IDs of messages containing a set of words. The words are given as an array of text. The function is already in the Manitou-Mail database for versions 1.1.0 and above
-- Input: an array of words to search within the entire index -- Output: the mail_id of the matching messages, as a set. CREATE OR REPLACE FUNCTION wordsearch(in_words text[]) RETURNS SETOF INTEGER AS $$ DECLARE var_nb_words INTEGER := array_upper(in_words,1); var_part_no INTEGER; cnt INTEGER; b1 INTEGER; b2 INTEGER; len INTEGER; i INTEGER; j INTEGER; var_vect bytea; and_vect bytea; -- vectors ANDed together var_nz_offset INTEGER; BEGIN FOR var_part_no IN (SELECT part_no FROM inverted_word_index WHERE word_id IN (SELECT word_id FROM words WHERE wordtext=any(in_words)) GROUP BY part_no HAVING COUNT(*)=var_nb_words ORDER BY part_no) LOOP cnt:=0; FOR var_vect,var_nz_offset IN SELECT mailvec,nz_offset FROM inverted_word_index WHERE word_id IN (SELECT word_id FROM words WHERE wordtext=any(in_words)) AND part_no=var_part_no LOOP IF (var_nz_offset>0) THEN var_vect:=repeat(E'\\000', var_nz_offset)::bytea || var_vect; END IF; IF (cnt=0) THEN and_vect:=var_vect; -- first vector ELSE -- next vectors -- reduce result if necessary IF (LENGTH(and_vect) > LENGTH(var_vect)) THEN and_vect:=SUBSTRING(and_vect FOR LENGTH(var_vect)); END IF; len:=LENGTH(and_vect)-1; FOR i IN 0..len LOOP b1:=get_byte(and_vect, i); b2:=get_byte(var_vect, i); IF (b1&b2 <> b1) THEN SELECT set_byte(and_vect, i, b1&b2) INTO and_vect; END IF; END LOOP; END IF; cnt:=cnt+1; END LOOP; -- on vectors -- extract the set of mail_id's for this part_no from the vector len:=LENGTH(and_vect)-1; IF (len>=0) THEN -- len might be NULL OR -1 if no result at all FOR i IN 0..len LOOP b1:=get_byte(and_vect,i); FOR j IN 0..7 LOOP IF ((b1&(1<<j))!=0) THEN RETURN NEXT var_part_no*16384+(i*8)+j+1; -- hit END IF; END LOOP; END LOOP; END IF; END LOOP; -- on part_no RETURN; END; $$ LANGUAGE plpgsql;
The following formula:
var_part_no*16384+(i*8)+j+1
that translates a bit in the vector to a mail_id relies on the fact that the index is split into parts indexing each one at most 16384 messages. This size of parts could actually be increased for larger databases (think millions of messages), and in this case, this code should be updated accordingly.
The simplest form of usage would be to search for one word with no other criteria:
SELECT wordsearch(array['foobar']);
But the result of wordsearch() can be joined with other tables and filtered with criteria. For exemple, the following SQL retrieves the messages that contain 'foo' and 'bar', excluding those in the trashcan
SELECT m.mail_id FROM mail m JOIN (SELECT * FROM wordsearch(array['foo','bar']) AS id) s ON (m.mail_id=s.id) WHERE m.status&16=0;