The ifile Tutorial

Before using ifile, you need to train it. Let's say that you have three folders, "spam," "ifile" and "friends," and the following directory structure:

/
|
+--spam
|   |
|   +--1
|   +--2
|   +--3
|
+--ifile
|   |
|   +--1
|   +--2
|   +--3
|
+--friends
    |
    +--1
    +--2
    +--3

The following commands build the ifile database in ~/.idata (use the -b option to specify a different location for the database):

ifile -h -i spam /spam/*
ifile -h -i ifile /ifile/*
ifile -h -i friends /friends/*

The -h option strips all headers except "Subject:", "From:" and "To:". I find that -h improves ifile's classification performance, but you may find otherwise for your own mail collection.

Note that we have made the argument to -i the same as the corresponding folder name. This is not necessary. The argument to -i can be any word you want to use to identify a category of e-mails. It must not contain whitespace characters (space, tab, line feed, etc.).

Use "ifile --help" to find out what these and other options mean.
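
If you train from a script, it helps to check the category name before calling ifile. The sketch below is only an illustration: the function name training_command and the folders dictionary are my own and not part of ifile. It builds the same three command lines shown above and rejects labels that contain whitespace. It only prints the commands; to run them you would pass each list to subprocess.run.

```python
import shlex


def training_command(label, message_paths, strip_headers=True):
    """Return the argument list for one `ifile -i` training call.

    Hypothetical helper, not part of ifile itself.
    """
    # The argument to -i must be a single word with no whitespace.
    if not label or any(ch.isspace() for ch in label):
        raise ValueError(f"invalid category label: {label!r}")
    argv = ["ifile"]
    if strip_headers:
        argv.append("-h")  # same -h option used in the tutorial commands
    argv += ["-i", label]
    argv += list(message_paths)
    return argv


# Assumed layout, copied from the example directory tree.
folders = {
    "spam": ["/spam/1", "/spam/2", "/spam/3"],
    "ifile": ["/ifile/1", "/ifile/2", "/ifile/3"],
    "friends": ["/friends/1", "/friends/2", "/friends/3"],
}

commands = [training_command(label, paths) for label, paths in folders.items()]
for argv in commands:
    print(shlex.join(argv))
```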

At this point, your ~/.idata file should look something like this:

spam ifile friends 
662 1020 6451 
3 3 3 
jrennie 9 0:3 1:18 2:16 
mindspring 6 1:7 2:5 
make 9 0:5 1:3 
yahoo 9 0:1 1:22 2:2 
...

The first line is the space-separated list of folders. Their ordering specifies a numbering (spam=0, ifile=1, friends=2). The second line is a token count for each folder (e.g. 662 tokens observed in the three spam messages). The third line is an e-mail count for each folder (e.g. 3 e-mails for each of spam, ifile and friends).

Each following line specifies statistics for a word. The format of a line is "<word> <age> <folder>:<count> [<folder>:<count> ...]", where <folder> is the folder number given by the ordering on the first line. Folders with a count of zero are not listed. So, the line beginning with "jrennie" indicates that "jrennie" appeared 3 times in "spam" e-mails, 18 times in "ifile" e-mails and 16 times in "friends" e-mails. The "age" is the number of e-mails that have been processed since the word was added to the database. Very infrequent words are pruned from the database to keep its size down.
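
To make the format concrete, here is a small sketch of a reader for it. It assumes the file looks exactly as described above (three header lines, then one line per word); parse_idata is a name I made up for this example, and a real database may contain details this sketch does not handle.

```python
def parse_idata(text):
    """Parse database text in the layout described in this tutorial."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    folders = lines[0].split()  # position in this list = folder number
    tokens = dict(zip(folders, (int(n) for n in lines[1].split())))
    emails = dict(zip(folders, (int(n) for n in lines[2].split())))
    words = {}
    for line in lines[3:]:
        word, age, *pairs = line.split()
        counts = {}
        for pair in pairs:
            number, count = pair.split(":")
            counts[folders[int(number)]] = int(count)
        words[word] = {"age": int(age), "counts": counts}
    return {"folders": folders, "tokens": tokens, "emails": emails, "words": words}


# The example database from above, without the trailing "..." line.
SAMPLE = """\
spam ifile friends
662 1020 6451
3 3 3
jrennie 9 0:3 1:18 2:16
mindspring 6 1:7 2:5
make 9 0:5 1:3
yahoo 9 0:1 1:22 2:2
"""

db = parse_idata(SAMPLE)
print(db["words"]["jrennie"]["counts"])  # {'spam': 3, 'ifile': 18, 'friends': 16}
# Folders with a zero count are simply absent from the line.
print(db["words"]["make"]["counts"].get("friends", 0))  # 0
```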

Now that you have a database, you might want to filter some e-mails. Say you have the following incoming e-mails:

/
|
+--inbox
    | 
    +--1
    +--2
    +--3

To find out which folder ifile thinks each of these e-mails belongs in, run:

ifile -c -q /inbox/1
ifile -c -q /inbox/2
ifile -c -q /inbox/3

Let's say that 1 is about ifile, 2 is spam and 3 is from a friend. Assuming ifile does its job correctly, you'll see output like this:

/inbox/1 ifile
/inbox/2 spam
/inbox/3 friends

With so little training data, ifile is unlikely to get the labels correct, but you should get the idea :-)
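
If you want to act on these suggestions from a script, you need to split each output line into a path and a folder. The sketch below assumes the output has exactly the "<path> <folder>" shape shown above; parse_suggestions is a hypothetical helper, not something ifile provides. It splits on the last run of whitespace, which is safe because folder names cannot contain whitespace.

```python
def parse_suggestions(output):
    """Map each message path to the folder suggested for it.

    Assumes one "<path> <folder>" pair per line, as in the example output.
    """
    suggestions = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split from the right so a path containing spaces stays intact.
        path, folder = line.rsplit(None, 1)
        suggestions[path] = folder
    return suggestions


example_output = """\
/inbox/1 ifile
/inbox/2 spam
/inbox/3 friends
"""

for path, folder in parse_suggestions(example_output).items():
    print(f"{path} -> {folder}")
```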

Now, if you move the e-mails to the folders suggested by ifile, you'll want to update the database accordingly. You can do this with the -i option, as before. Or, you can simply use -Q in place of -q above. This automatically adds the e-mail's statistics to the database under the folder ifile suggests.

Now, assume for a moment that e-mail 1 was actually spam. We have already added it to the "ifile" category in the database and put it in the ifile folder. We need to move it to the spam folder and update the ifile database accordingly. We can update the database with the following command:

ifile -d ifile -i spam /inbox/1

This removes the e-mail's statistics from "ifile" and adds them to "spam."
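
To see what such a correction means for the numbers in the database, here is a rough sketch of the bookkeeping. It is not ifile's actual code: the relabel function, the in-memory dictionary layout and the three-word "message" are all assumptions made for illustration, and it ignores word ages, pruning and how ifile really splits a message into tokens.

```python
from collections import Counter


def relabel(db, message_tokens, old, new):
    """Move one message's counts from folder `old` to folder `new`.

    Illustrative only; `db` uses a made-up in-memory layout.
    """
    per_word = Counter(message_tokens)
    for word, n in per_word.items():
        counts = db["words"].setdefault(word, {})
        remaining = counts.get(old, 0) - n
        if remaining > 0:
            counts[old] = remaining
        else:
            counts.pop(old, None)  # zero counts are not listed
        counts[new] = counts.get(new, 0) + n
    total = sum(per_word.values())
    db["tokens"][old] -= total
    db["tokens"][new] += total
    db["emails"][old] -= 1
    db["emails"][new] += 1
    return db


# Counts taken from the example database; the message itself is invented.
db = {
    "tokens": {"spam": 662, "ifile": 1020, "friends": 6451},
    "emails": {"spam": 3, "ifile": 3, "friends": 3},
    "words": {
        "make": {"spam": 5, "ifile": 3},
        "yahoo": {"spam": 1, "ifile": 22, "friends": 2},
    },
}

relabel(db, ["make", "yahoo", "make"], old="ifile", new="spam")
print(db["emails"])
print(db["words"]["make"])
```

After the call, the "ifile" column has lost one e-mail and three tokens, and the "spam" column has gained them.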

Written by Jason Rennie.

Last modified: Wed Jan 18 14:19:30 2006