Unraveling the Ball of Thread

Warning: GeekFactor 7++


As I started evaluating forum software for the RETS.org user community, I ran across the need to export the old mailing list archives. The export was easy, as the entire history is still stored in an mbox flat file - so in essence getting the mail was already done.

I ran into the issue that keeping the email in a threaded state was going to need a nice little algorithm. Some Googling led me to Jamie Zawinski’s write-up on message threading, which outlines the basic principles, all of which come down to interpreting the email headers:

  • First, try to match the In-Reply-To header
  • Second, try to match a References header
  • Finally, as a last-ditch attempt, match reply (RE:) subjects against existing message subjects.
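
To make the three rules concrete, here is a rough sketch in Python (not the PHP of the actual script). The names `find_parent_id`, `known_ids` and `subject_index` are made up for illustration, and the choice to walk References from last to first (nearest ancestor first) is an assumption about how one might apply the rules, not a description of the script itself.

```python
import re
from email import message_from_string

MSG_ID = re.compile(r"<[^<>\s]+>")                       # e.g. <abc@example.org>
REPLY_PREFIX = re.compile(r"^\s*(re\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject):
    """Strip leading 'Re:' prefixes, collapse whitespace, lowercase."""
    return " ".join(REPLY_PREFIX.sub("", subject or "").split()).lower()


def find_parent_id(msg, known_ids, subject_index):
    """Return the Message-ID of the likely parent, or None.

    known_ids:     set of Message-IDs already seen (hypothetical structure)
    subject_index: dict of normalized subject -> Message-ID (hypothetical)
    """
    # 1. In-Reply-To names the direct parent.
    for mid in MSG_ID.findall(str(msg.get("In-Reply-To", ""))):
        if mid in known_ids:
            return mid
    # 2. References lists ancestors oldest first, so try the last one first.
    for mid in reversed(MSG_ID.findall(str(msg.get("References", "")))):
        if mid in known_ids:
            return mid
    # 3. Last ditch: a "Re:" subject that matches an existing subject.
    subject = str(msg.get("Subject", ""))
    if REPLY_PREFIX.match(subject):
        return subject_index.get(normalize_subject(subject))
    return None


if __name__ == "__main__":
    known = {"<a@x>"}
    subjects = {"hello world": "<a@x>"}
    reply = message_from_string("Subject: RE: Hello World\n\nbody")
    print(find_parent_id(reply, known, subjects))
```

The subject fallback only fires for messages that actually carry a reply prefix, so two unrelated top-level posts with the same subject are not glued together.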

Sitting down and working with this concept, I came up with a PHP email threading script. It can connect to a mail server and pull down the messages or, more conveniently, read the mbox file directly from the local file system. It then parses all the mail messages and threads them together. The end result is an array that can be traversed along the threaded hierarchy, with each entry referencing back to the raw email message.
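
For a feel of what that pipeline looks like, here is a minimal sketch in Python using the standard `mailbox` module (again, not the PHP script itself). The function names `load_mbox`, `build_threads` and `walk` are hypothetical, the subject fallback is left out for brevity, and the result is a pair of plain structures (root IDs plus a parent-to-children map) rather than whatever array layout the real script uses.

```python
import mailbox
import re
from collections import defaultdict

MSG_ID = re.compile(r"<[^<>\s]+>")


def ids_in(msg, header):
    """All Message-IDs found in one header of a message."""
    return MSG_ID.findall(str(msg.get(header, "")))


def load_mbox(path):
    """Return {message_id: message} for every message in an mbox file."""
    by_id = {}
    for msg in mailbox.mbox(path):
        ids = ids_in(msg, "Message-ID")
        if ids:
            by_id[ids[0]] = msg      # keep a reference to the raw message
    return by_id


def build_threads(by_id):
    """Link each message to its nearest ancestor present in the archive.

    Returns (roots, children): roots is a list of top-level Message-IDs,
    children maps a Message-ID to the IDs of its direct replies.
    """
    children = defaultdict(list)
    roots = []
    for mid, msg in by_id.items():
        candidates = ids_in(msg, "In-Reply-To")
        candidates += reversed(ids_in(msg, "References"))
        parent = next((c for c in candidates if c in by_id and c != mid), None)
        if parent is None:
            roots.append(mid)
        else:
            children[parent].append(mid)
    return roots, children


def walk(roots, children, depth=0):
    """Yield (depth, message_id) in threaded order."""
    for mid in roots:
        yield depth, mid
        yield from walk(children.get(mid, []), children, depth + 1)
```

Usage would be something like `by_id = load_mbox("archive.mbox")` (the file name is a placeholder), then `for depth, mid in walk(*build_threads(by_id)): print("  " * depth + by_id[mid]["Subject"])`. Because every message is indexed before any linking happens, the order of messages in the file does not matter.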

I think the end result works better (IMHO) than the built-in Mailman threading, as I am catching a lot of the ‘References’ and Subject nuances described in the JWZ whitepaper.

The threading could probably use some tweaking, as I am sure there are some corner cases that are not handled well. In particular, I don’t really do any comparison on date received. There was only one case in the file I was working with where a reply came before its parent email. The code handles this through the way it builds the parent thread array, but it won’t necessarily handle the case where a message has no parent IDs or references to existing emails.
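
One possible way to deal with both corner cases, sketched in Python under my own assumptions rather than taken from the script: collect every message ID first so arrival order stops mattering, and when a message points at a parent that never shows up in the archive, create a stand-in root for it so that sibling replies still group together. This is similar in spirit to the empty containers in the JWZ write-up. The function name `link` and the `(message_id, parent_id)` pair format are hypothetical.

```python
from collections import defaultdict


def link(pairs):
    """Group messages under parents, tolerating odd ordering and gaps.

    pairs: iterable of (message_id, parent_id_or_None) in arrival order.
    Returns (roots, children, placeholders).
    """
    pairs = list(pairs)
    seen = {mid for mid, _ in pairs}      # index everything before linking
    children = defaultdict(list)
    roots, placeholders = [], []
    for mid, parent in pairs:
        if parent is None:
            roots.append(mid)
            continue
        if parent not in seen and parent not in children:
            # Parent is referenced but missing from the archive:
            # add a stand-in root once, the first time it is needed.
            placeholders.append(parent)
            roots.append(parent)
        children[parent].append(mid)
    return roots, dict(children), placeholders


if __name__ == "__main__":
    roots, children, placeholders = link([
        ("<2>", "<1>"),       # reply arrives before its parent
        ("<1>", None),
        ("<4>", "<gone>"),    # parent was never archived
        ("<5>", "<gone>"),
    ])
    print(roots, children, placeholders)
```

A message with no In-Reply-To and no References at all still ends up as its own root here; only the subject matching can rescue that case.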

Download the sample code here.

- Enjoy
