
I've been doing a lot of work with text data lately and finally bent the knee to information theory. Here's a quick implementation of Shannon entropy - a measure of the average information content per symbol of a string over an alphabet $X$:

$$H(X) = -\sum_{x \in X} p(x)\log_2 p(x)$$
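With the base-2 logarithm used here (and in the code below), $H$ is measured in bits. For intuition, a fair coin with two equally likely outcomes gives

$$H = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1 \text{ bit},$$

while a source that always emits the same symbol has $H = 0$: a certain outcome carries no information.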

Implementing $H$

from __future__ import division   # true division on Python 2, so count / total gives a float
from collections import Counter
import pandas as pd
import math

%matplotlib inline

# Example string to analyse
text = 'aabcddddefffg'

Get symbol frequencies:

print 'Input: %s' % text
print

def symbol_freq(text):
    # Count each symbol, then convert counts to relative frequencies.
    symbol_counts = Counter(text)
    symbols = pd.DataFrame.from_dict(symbol_counts, orient='index')
    symbols.columns = ['count']
    symbols['freq'] = symbols['count'] / symbols['count'].sum()
    return symbols

symbols = symbol_freq(text)
ax = symbols['freq'].sort_values().plot(kind='bar', width=0.8, rot=0,
                                        title='Symbol Frequencies')

[Figure: bar chart of symbol frequencies, titled 'Symbol Frequencies']
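The underlying table works out roughly as follows (counts are exact, frequencies are counts divided by 13, shown rounded):

print(symbols.sort_index())

#   a: 2 (~0.154)   b: 1 (~0.077)   c: 1 (~0.077)   d: 4 (~0.308)
#   e: 1 (~0.077)   f: 3 (~0.231)   g: 1 (~0.077)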

Shannon entropy $H$

def symbol_entropy(freq):
    # Contribution of a single symbol: -p * log2(p).
    # Assumes freq > 0, which holds here since frequencies come from observed counts.
    return -freq * math.log(freq, 2)

def entropy(text):
    # Sum the per-symbol contributions over the observed alphabet.
    symbols = symbol_freq(text)
    return symbols['freq'].apply(symbol_entropy).sum()
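Two quick sanity checks on degenerate strings before applying it to the example text:

print(entropy('aaaa'))   # ~0.0: a single repeated symbol carries no information
print(entropy('abab'))   # ~1.0: two equally likely symbols, one bit per symbol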

# Entropy of the example string, in bits
text_entropy = entropy(text)

print 'Shannon entropy:', text_entropy
print 'Minimum number of bits required to encode each symbol:', int(math.ceil(text_entropy))
Shannon entropy: 2.56544837182
Minimum number of bits required to encode each symbol: 3
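As a quick hand check: the string has 13 characters with counts a:2, b:1, c:1, d:4, e:1, f:3, g:1, so

$$H = -\left(\tfrac{2}{13}\log_2\tfrac{2}{13} + \tfrac{4}{13}\log_2\tfrac{4}{13} + \tfrac{3}{13}\log_2\tfrac{3}{13} + 4\cdot\tfrac{1}{13}\log_2\tfrac{1}{13}\right) \approx 2.565 \text{ bits.}$$

The entropy is a lower bound on the average number of bits per symbol that any lossless encoding of this source can achieve; rounding up to 3 matches the $\lceil\log_2 7\rceil = 3$ bits per symbol a simple fixed-length code over the 7 distinct symbols would use.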

Alternate implementations:
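A minimal pandas-free sketch using only the standard library (the entropy_stdlib name is just a placeholder):

from __future__ import division
from collections import Counter
import math

def entropy_stdlib(text):
    # Shannon entropy in bits: -sum(p * log2(p)) over the observed symbols.
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log(n / total, 2) for n in counts.values())

print(entropy_stdlib('aabcddddefffg'))   # ~2.565, same as above

If SciPy is available, scipy.stats.entropy(list(Counter(text).values()), base=2) should give the same value, since it normalises a vector of counts into probabilities before computing the entropy.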