====== Presence and count of specific mail headers ======
This query extracts, for each of the header fields from a fixed list, the number of occurrences of all distinct values of the field.
SELECT
split_part(substr(lines, 1+position(E'\n'||field in lines), 200), E'\n', 1) as ct,
count(*)
FROM header, (values ('X-Priority'), ('Importance'), ('Precedence'), ('Priority'), ('X-MSMail-Priority'), ('X-MS-Priority')) as h(field)
WHERE position(E'\n'||field IN lines)>0
group by 1
Sample output:
ct | count
---------------------------+-------
Importance: high | 32
Importance: High | 130
Importance: low | 1
Importance: Medium | 3
Importance: normal | 44
Importance: Normal | 802
Precedence: bulk | 19987
Precedence: bulk | 1
Precedence: fm-user | 3
Precedence: junk | 3
Precedence: list | 4026
Priority: non-urgent | 11
Priority: Non-Urgent | 1
Priority: normal | 464
Priority: urgent | 603
Priority: Urgent | 12
X-MSMail-Priority: High | 121
X-MSMail-Priority: Low | 1
X-MSMail-Priority: Lowest | 11
X-MSMail-Priority: Medium | 3
X-MSMail-Priority: Middle | 8
X-MSMail-Priority: Normal | 923
X-Priority: 0 | 1
X-Priority: 1 | 102
X-Priority: 1 (High) | 2
X-Priority: 1 (Highest) | 120
X-Priority: 2 | 6
X-Priority: 2 (High) | 1
X-Priority: 3 | 2892
X-Priority: 3 (Normal) | 635
X-Priority: 5 | 14
X-Priority: Normal | 5
This alternative implementation uses regular expressions and differs in that it doesn't limit header values to 200 characters or any other fixed length.
select FIELD||':'||arr[1], count(*)
from (select FIELD, regexp_matches(lines, '(?:^|\n)' || FIELD || ':\s*([^\n]*)', 'gi') as arr
from header,
(VALUES ('X-Priority'), ('Importance'), ('Precedence'), ('Priority'),
('X-MSMail-Priority'), ('X-MS-Priority')) AS h(FIELD)
where strpos(lines,FIELD)>0) l
GROUP BY 1 ORDER BY 1
In this version, the '' strpos(lines,FIELD)>0 '' condition is not essential: it's introduced only as a first-pass filter to eliminate the headers that don't contain anywhere any of the searched fields.
====== Duplicate messages ======
This query finds each message that share the exact same headers than another message with a lower mail_id, which means that it's a duplicate.
select h1.mail_id from header h1, header h2 where h1.lines=h2.lines and h1.mail_id > h2.mail_id
A stricter version, comparing the md5 hashes of bodies (text and html parts) in addition to the headers:
select b2.mail_id from body b1, body b2, header h1, header h2
where b1.mail_id < b2.mail_id
and h1.mail_id = b1.mail_id
and h2.mail_id = b2.mail_id
and md5(h1.lines) = md5(h2.lines)
and md5(b1.bodytext) is not distinct from md5(b2.bodytext)
and md5(b1.bodyhtml) is not distinct from md5(b2.bodyhtml);
The ''IS NOT DISTINCT'' comparator behaves as expected when ''bodytext'' or ''bodyhtml'' is NULL, as opposed
to the simple equality operator, for which ''NULL=NULL'' is false
====== Hierarchical view of tags ======
Output a list of tags sorted by hierarchy level and names,
with an indentation to represent the hierarchy.
WITH RECURSIVE tagr(a,_tag_id,name,level) as (
select
array[row_number() over (order by name)] as a,
tag_id,
name,
1 as level
from tags where parent_id is null
UNION ALL
select
tagr.a || row_number() over (order by tags.name),
tag_id,
tags.name,
tagr.level+1
FROM tags JOIN tagr ON tagr._tag_id=tags.parent_id
)
select repeat(' ', level-1) || name from tagr order by a;