Analyze your email metadata

21 Oct 2013

Downloading email headers

First, you need to be logged in to Immersion [1]. To download the email headers in JSON format, visit immersion.media.mit.edu/downloademails; if you are not logged in, you will be redirected to the login page. The downloaded file is named allemails.json and contains the metadata of every email you have exchanged. A sample email entry is shown below, as it looks once loaded into Python; the file contains many more such entries.

{ 'fromField': ['Deepak Jagdish', '[email protected]'],
  'toField': [['Daniel Smilkov', '[email protected]'],['Cesar Hidalgo', '[email protected]']],
  'dateField': 1372743719,
  'isSent': False,
  'threadid': '1439426117975266137',
}

The toField contains both TO and CC entries.
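
To make the fields concrete, here is a minimal sketch that reads one entry. The addresses are made-up placeholders, and it assumes that dateField is a Unix timestamp in seconds, which the sample suggests but this post does not state.

```python
from datetime import datetime, timezone

# A hand-written entry in the same shape as the sample above.
# The names and addresses are placeholders for illustration only.
entry = {
  'fromField': ['Alice', 'alice@example.com'],
  'toField': [['Bob', 'bob@example.com'], ['Carol', 'carol@example.com']],
  'dateField': 1372743719,
  'isSent': False,
  'threadid': '1439426117975266137',
}

# Each person is a [name, address] pair; index 1 is the address.
senderAddress = entry['fromField'][1]
recipientAddresses = [person[1] for person in entry['toField']]

# Assumption: dateField counts seconds since the Unix epoch.
sentAt = datetime.fromtimestamp(entry['dateField'], tz=timezone.utc)
```

Under that assumption, the sample timestamp falls on 2 July 2013 (UTC).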

1. Being logged in to Immersion means having a login cookie in your browser.

Cleaning the data

The first thing to do when analyzing data is to clean it. We will walk through a simple Python script that parses your email headers and filters out the invalid ones [2].

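import json

def filterEmails(emails):
  # Keep only non-empty entries that have both a FROM and a TO field.
  return [email for email in emails if email and email.get('toField') and email.get('fromField')]

# The with-block closes the file once the headers are loaded.
with open('/PATH_TO_JSON_FILE/allemails.json') as f:
  emails = filterEmails(json.load(f))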

2. Several things can make an email invalid: a missing FROM or TO field, an invalid timestamp, or simply being an empty object.
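
The filter in the script checks the FROM and TO fields but not the timestamp. A stricter check could look like the sketch below. The helper isValidEmail is hypothetical and not part of the original script, and "valid timestamp" is taken here to mean a positive number, which is an assumption.

```python
def isValidEmail(email):
  # Hypothetical helper, not part of the original script.
  if not isinstance(email, dict):
    return False  # None or any other non-object entry
  if not email.get('toField') or not email.get('fromField'):
    return False  # missing or empty TO / FROM field
  date = email.get('dateField')
  # Assumption: a valid timestamp is a positive number of seconds.
  # bool is excluded explicitly because True and False are ints in Python.
  return isinstance(date, (int, float)) and not isinstance(date, bool) and date > 0
```

It could replace the condition inside the list comprehension, e.g. [email for email in emails if isValidEmail(email)].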

Furthermore, we want to filter out anything that is a mailing list (e.g. company-wide lists, project-based lists) or a promotion list (e.g. Facebook and LinkedIn updates). A simple but effective way to do this is to keep email address A only if we have both sent at least K emails to A and received at least K emails from A. To do this, we first count the number of sent and received emails for every email address. We will use Counter, a nice dictionary-based structure for counting:

from collections import Counter
def getSentRcvCounters(emails):
  sentCounter, rcvCounter = Counter(), Counter()
  for email in emails:
    if email['isSent']:
      for person in email['toField']:
        sentCounter[person[1]] += 1
    else:
      person = email['fromField']
      rcvCounter[person[1]] += 1
  return sentCounter, rcvCounter
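
Two properties of Counter make the code above short: keys do not need to be initialised before incrementing, and looking up an address that was never counted returns 0 instead of raising an error. A small sketch with made-up addresses:

```python
from collections import Counter

rcvCounter = Counter()
for address in ['a@example.com', 'b@example.com', 'a@example.com']:
  rcvCounter[address] += 1  # no need to initialise the key first

countA = rcvCounter['a@example.com']
countUnknown = rcvCounter['nobody@example.com']       # missing key gives 0, no KeyError
stillAbsent = 'nobody@example.com' not in rcvCounter  # the lookup did not add the key
top = rcvCounter.most_common(1)                       # list of (address, count) pairs
```

This is why later code can read rcvCounter[person] for someone we have only ever sent email to.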

Now, from the sent and received statistics, we can easily obtain the set of filtered email addresses, which we will refer to as collaborators from now on:

def getCollaborators(emails, K):
  sentCounter, rcvCounter = getSentRcvCounters(emails)
  return set([person for person in sentCounter if sentCounter[person] >= K and rcvCounter[person] >= K])
collaborators = getCollaborators(emails,5) # obtain the collaborators for K=5

We will use this set (collaborators) to make sure that all later results involve only the email addresses of collaborators, and not mailing lists or promotion lists.
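
To see how the threshold K behaves, here is a toy sketch of the same rule. The log format, the addresses and the helper collaboratorsFor are invented for illustration; real data would go through getCollaborators instead.

```python
from collections import Counter

# Toy data: (address, isSent) pairs standing in for parsed emails.
# All addresses are made up.
toyLog = (
  [('friend@example.com', True)] * 3 + [('friend@example.com', False)] * 4 +
  [('newsletter@example.com', False)] * 50 +  # only ever received
  [('stranger@example.com', True)] * 6        # never replied
)

sentCounter = Counter(addr for addr, isSent in toyLog if isSent)
rcvCounter = Counter(addr for addr, isSent in toyLog if not isSent)

def collaboratorsFor(K):
  # Same rule as getCollaborators: at least K sent AND at least K received.
  return set(p for p in sentCounter if sentCounter[p] >= K and rcvCounter[p] >= K)

loose = collaboratorsFor(3)   # friend qualifies: 3 sent, 4 received
strict = collaboratorsFor(4)  # nobody qualifies: friend has only 3 sent
```

The newsletter is dropped for any K of at least 1 because we never send to it, while a larger K also drops people we exchange only a few emails with.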

Analyzing the email metadata

At this point, we are ready to ask some interesting questions about our own metadata.

Let’s find the most “private” collaborators, i.e. people with whom we have a high likelihood of exchanging a private (one-to-one) email without cc’ing anyone else. In other words, we want to construct a list of tuples, each consisting of an email address and the probability of exchanging a private email, e.g. ('[email protected]',0.657). The probability is computed by dividing the number of private emails by the total number of emails exchanged with that person.

def getPrivateContacts(emails, collaborators):
  # counter: all emails exchanged per address; pvtCounter: private ones only
  counter, pvtCounter = Counter(), Counter()
  for email in emails:
    if email is None: continue
    isPrivate = len(email['toField']) <= 1
    if email['isSent']:
      for person in email['toField']:
        counter[person[1]] += 1
        if isPrivate: pvtCounter[person[1]] += 1
    else:
      person = email['fromField']
      counter[person[1]] += 1
      if isPrivate: pvtCounter[person[1]] += 1
  people = [(person, float(pvtCounter[person])/counter[person]) for person in pvtCounter if person in collaborators]
  people.sort(key=lambda x: x[1], reverse=True)
  return people

Let’s print the top 20 private collaborators:

for person, score in getPrivateContacts(emails, collaborators)[:20]:
  print('%s\t%s' % (person, score))  # works in both Python 2 and Python 3
print('-----------------')

It is easy to see that there is a problem :-). Exchanging 80 private emails out of 100 gives the same probability as exchanging 4 out of 5 (0.8). However, the second estimate is much noisier than the first: it is not unlikely to get 4 heads out of 5 tosses of an unbiased coin and incorrectly conclude that there is an 80% chance of getting heads. To correct for this uncertainty, we will use the Wilson score interval, which gives a confidence interval around the estimated probability. The lower bound of the 99% confidence interval is 0.28 for 4/5, whereas it is 0.68 for 80/100. This tells us that we are 99% confident that the actual probability is more than 0.28 in the 4/5 case, and more than 0.68 in the 80/100 case. Here is an implementation of the lower bound of the score interval. This code snippet must be inserted before the implementation of getPrivateContacts.

from math import sqrt
def getLowerBound(pos, n):
  if n == 0: return 0
  phat = float(pos)/n
  z = 2.576 #1.96 is 95% confidence, 2.576 is 99% confidence
  return (phat + z*z/(2*n) - z * sqrt((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n)
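
The two lower bounds quoted above can be reproduced with a short check. The sketch below restates the same formula under a different name (wilsonLowerBound, chosen here only so the block is self-contained) and takes z as a parameter for illustration.

```python
from math import sqrt

def wilsonLowerBound(pos, n, z=2.576):
  # Same formula as getLowerBound; z defaults to the 99% confidence value.
  if n == 0:
    return 0.0
  phat = float(pos) / n
  return (phat + z*z/(2*n) - z * sqrt((phat*(1-phat) + z*z/(4*n))/n)) / (1 + z*z/n)

small = wilsonLowerBound(4, 5)     # roughly 0.28
large = wilsonLowerBound(80, 100)  # roughly 0.68
```

Both cases have the same raw estimate of 0.8, but the larger sample gets a much higher lower bound, which is what we want when ranking.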

Now we can change the implementation of getPrivateContacts to account for this uncertainty by changing this line

people = [(person, float(pvtCounter[person])/counter[person]) for person in pvtCounter if person in collaborators]

to this line

people = [(person, getLowerBound(pvtCounter[person],counter[person])) for person in pvtCounter if person in collaborators]

Similarly, we can find the people with whom we have the most asymmetric relationship, i.e. those for whom there is a high likelihood that an email we send gets no email back, or vice versa. To estimate this probability, we first need to identify the direction of the relationship. If the direction is outgoing (we send more emails to person A than we receive), we estimate the probability of not receiving an email for every email we send, i.e. (#sent-#rcv) / #sent. Conversely, if we receive more emails than we send, we estimate the probability of not emailing back for every email we receive, i.e. (#rcv-#sent) / #rcv. We will again use the Wilson score interval to account for the uncertainty:

def getAsymmetricContacts(emails, collaborators):
  sentCounter, rcvCounter = getSentRcvCounters(emails)
  people = []
  for person in sentCounter:
    if person not in collaborators: continue
    sent, rcv = sentCounter[person], rcvCounter[person]
    if (sent > rcv):
      people.append((person, getLowerBound(sent-rcv, sent), 'ME-->'))
    else:
      people.append((person, getLowerBound(rcv-sent, rcv), 'ME<--'))
  people.sort(key=lambda x: x[1], reverse=True)
  return people

Let’s print the top 20 asymmetric collaborators:

for person, score, direction in getAsymmetricContacts(emails, collaborators)[:20]:
  print('%s\t%s\t%s' % (person, score, direction))  # works in both Python 2 and Python 3
print('-----------------')
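
To see what the score measures before the Wilson correction is applied, the raw ratios described above can be worked out by hand. The helper rawAsymmetry is hypothetical and only illustrates the arithmetic; the counts are invented.

```python
def rawAsymmetry(sent, rcv):
  # Hypothetical helper: the plain ratio, before any Wilson correction.
  if sent > rcv:
    return float(sent - rcv) / sent, 'ME-->'
  if rcv == 0:
    return 0.0, 'ME<--'  # no emails at all; avoid dividing by zero
  return float(rcv - sent) / rcv, 'ME<--'

outgoing = rawAsymmetry(10, 2)  # 8 of our 10 emails got no reply
incoming = rawAsymmetry(3, 12)  # we did not answer 9 of their 12 emails
balanced = rawAsymmetry(7, 7)   # perfectly symmetric exchange
```

getAsymmetricContacts passes the same numerators and denominators to getLowerBound, so a ratio based on few emails is ranked lower than the same ratio based on many.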

We leave it as an exercise for the reader to implement getSymmetricContacts, i.e. to find the contacts with whom we have the most symmetric communication [3].

You can find the entire Python file here.

3. HINT: For an outgoing direction, compute the probability that we receive an email back for every email we send. The incoming direction should be obvious at this point.