Random Acts of Pizza

This paper presents an analysis of Kaggle's “Random Acts of Pizza” dataset:

This competition contains a dataset with 5671 textual requests for pizza from the Reddit community Random Acts of Pizza together with their outcome (successful/unsuccessful) and meta-data. Participants must create an algorithm capable of predicting which requests will garner a cheesy (but sincere!) act of kindness. (https://www.kaggle.com/c/random-acts-of-pizza)

While the dataset contains many attributes (for example, the time of day when the request was made), this paper will focus on the text of the request and its outcome (did the requester receive a pizza?). The paper will proceed as follows: import the data, prepare the data, perform some exploratory data analysis, propose and conduct various data mining methods, and present conclusions.

Ultimately, an accurate prediction model could not be determined; that is, no attempt will be made to classify a single request as successful or unsuccessful. However, some heuristics will be given which should improve a requester's chance of receiving a random act of pizza.

The data provided by Kaggle is in JSON format. Here is a snapshot of the input data -- I will be using the fields request_text_edit_aware, request_title, and requester_received_pizza:

Here is where my data is located:

In [11]:
cd "D:\BQ\School\DePaul\CSC478 Programming Data Mining Applications\csc478_qualls_project"
D:\BQ\School\DePaul\CSC478 Programming Data Mining Applications\csc478_qualls_project

Show snapshot of the JSON data:

In [12]:
# got this trick from http://stackoverflow.com/questions/11854847/display-an-image-from-a-file-in-an-ipython-notebook

from IPython.core.display import Image 
Image(filename='json.jpg')
Out[12]:

Read the file...

In [13]:
import json
import numpy as np
json_data = json.loads(open("train.json").read())
print type(json_data)  # what did we get back? (answer: list)
print len(json_data)   # how many elements in the list? (answer: 4040)
<type 'list'>
4040
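
As a quick check of the record layout (the image above shows the same information), we can list the keys of the first record and pull out one of the fields of interest; output is omitted here:

print sorted(json_data[0].keys())        # all fields available in a record
print json_data[0][u'request_title']     # one of the three fields named above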

Extract the fields of interest...

In [14]:
text_list = []
outcome_list = []

for i in range(len(json_data)):
    text_list.append(json_data[i][u'request_text_edit_aware'])
    outcome_list.append(json_data[i][u'requester_received_pizza'])

# save a copy for use later in this document
original = list(text_list)  # list function makes a copy of a list

Show a sample of the fields of interest...

In [15]:
for i in range(5):
    print i
    print "TEXT: ", text_list[i]
    print "OUTCOME: ", outcome_list[i]
    print

document_count = len(text_list)
print
print "There are %d documents." % document_count
0
TEXT:  Hi I am in need of food for my 4 children we are a military family that has really hit hard times and we have exahusted all means of help just to be able to feed my family and make it through another night is all i ask i know our blessing is coming so whatever u can find in your heart to give is greatly appreciated
OUTCOME:  False

1
TEXT:  I spent the last money I had on gas today. Im broke until next Thursday :(
OUTCOME:  False

2
TEXT:  My girlfriend decided it would be a good idea to get off at Perth bus station when she was coming to visit me and has since had to spend all her money on a taxi to get to me here in Dundee. Any chance some kind soul would get us some pizza since we don't have any cash anymore?
OUTCOME:  False

3
TEXT:  It's cold, I'n hungry, and to be completely honest I'm broke. My mum said we're having leftovers for dinner. A random pizza arriving would be nice.

Edit: We had leftovers.
OUTCOME:  False

4
TEXT:  hey guys:
 I love this sub. I think it's great. (Except the sob stories. I miss when this place was fun!) Anywho, I've given a pizza out before so thought I would try my luck at getting one. My friend, who lives an hour away and our schedules do not let us see each other too much, decided to come down and visit me for the night! I would love to be able to be a good host and order her a pizza to go with some beer!

Again, no sob story. Just looking to share a pizza with an old friend :)
OUTCOME:  False


There are 4040 documents.

Convert to lower case, expand contractions, and remove remaining punctuation.

In [16]:
contractions_dict = {"ain't" : "am not" \
    , "aren't"    : "are not"       \
    , "can't"     : "cannot"        \
    , "couldn't"  : "could not"     \
    , "didn't"    : "did not"       \
    , "doesn't"   : "does not"      \
    , "don't"     : "do not"        \
    , "hadn't"    : "had not"       \
    , "hasn't"    : "has not"       \
    , "haven't"   : "have not"      \
    , "he'd"      : "he would"      \
    , "he'll"     : "he will"       \
    , "he's"      : "he is"         \
    , "how'd"     : "how did"       \
    , "how's"     : "how is"        \
    , "i'd"       : "i would"       \
    , "i'll"      : "i will"        \
    , "i'm"       : "i am"          \
    , "i've"      : "i have"        \
    , "isn't"     : "is not"        \
    , "it'd"      : "it would"      \
    , "it's"      : "it is"         \
    , "let's"     : "let us"        \
    , "o'clock"   : "of the clock"  \
    , "she'd"     : "she would"     \
    , "she'll"    : "she will"      \
    , "she's"     : "she is"        \
    , "shouldn't" : "should not"    \
    , "that'd"    : "that would"    \
    , "that's"    : "that is"       \
    , "they'd"    : "they would"    \
    , "they'll"   : "they will"     \
    , "they're"   : "they are"      \
    , "wasn't"    : "was not"       \
    , "we'd"      : "we would"      \
    , "we'll"     : "we will"       \
    , "we're"     : "we are"        \
    , "weren't"   : "were not"      \
    , "we've"     : "we have"       \
    , "what'll"   : "what will"     \
    , "what's"    : "what is"       \
    , "what've"   : "what have"     \
    , "who'll"    : "who will"      \
    , "who's"     : "who is"        \
    , "why's"     : "why is"        \
    , "won't"     : "will not"      \
    , "wouldn't"  : "would not"     \
    , "would've"  : "would have"    \
    , "y'all"     : "you all"       \
    , "you'd"     : "you had"       \
    , "you'll"    : "you will"      \
    , "you're"    : "you are"       \
    , "you've"    : "you have"      \
    }
In [17]:
# see http://stackoverflow.com/questions/12437667/how-to-replace-punctuation-in-a-string-python
# I want to replace punctuation with a blank
import string
replace_punctuation = string.maketrans(string.punctuation, ' '*len(string.punctuation))

for d in range(document_count):
    row = text_list[d]  # for convenience only
    
    # for encode, see http://stackoverflow.com/questions/3224268/python-unicode-encode-error
    row = row.encode('ascii', 'ignore') 

    row = row.lower()  # convert to lower case
    for k, v in contractions_dict.iteritems():
        row = row.replace(k, v)
    row = row.translate(replace_punctuation)
    text_list[d] = row

for i in range(5):
    print i
    print "TEXT: ", text_list[i]
    print
0
TEXT:  hi i am in need of food for my 4 children we are a military family that has really hit hard times and we have exahusted all means of help just to be able to feed my family and make it through another night is all i ask i know our blessing is coming so whatever u can find in your heart to give is greatly appreciated

1
TEXT:  i spent the last money i had on gas today  im broke until next thursday   

2
TEXT:  my girlfriend decided it would be a good idea to get off at perth bus station when she was coming to visit me and has since had to spend all her money on a taxi to get to me here in dundee  any chance some kind soul would get us some pizza since we do not have any cash anymore 

3
TEXT:  it is cold  i n hungry  and to be completely honest i am broke  my mum said we are having leftovers for dinner  a random pizza arriving would be nice 

edit  we had leftovers 

4
TEXT:  hey guys 
 i love this sub  i think it is great   except the sob stories  i miss when this place was fun   anywho  i have given a pizza out before so thought i would try my luck at getting one  my friend  who lives an hour away and our schedules do not let us see each other too much  decided to come down and visit me for the night  i would love to be able to be a good host and order her a pizza to go with some beer 

again  no sob story  just looking to share a pizza with an old friend   


Import and Test Stemming Package

In [18]:
# stemming package obtained from https://pypi.python.org/pypi/stemming/1.0
from stemming.porter2 import stem 

# test
print [stem(x) for x in ["compute", "computes", "computer", "computers", "computation", "computational"]] 
print [stem(x) for x in ["dive", "divers", "diversity", "diversify", "diversified", "diversification"]]
['comput', 'comput', 'comput', 'comput', 'comput', 'comput']
['dive', 'diver', 'divers', 'diversifi', 'diversifi', 'diversif']

Load Stopwords from File

In [19]:
# Stopwords can be found at https://pypi.python.org/pypi/stop-words/2014.5.26
# But it is just a collection of stopwords as .txt files. 
# There is no python code included.

stopwords = np.genfromtxt("stop-words-2014.5.26/stop_words/stop-words/english.txt", dtype="|S10")
print stopwords
['a' 'about' 'above' 'after' 'again' 'against' 'all' 'am' 'an' 'and' 'any'
 'are' "aren't" 'as' 'at' 'be' 'because' 'been' 'before' 'being' 'below'
 'between' 'both' 'but' 'by' "can't" 'cannot' 'could' "couldn't" 'did'
 "didn't" 'do' 'does' "doesn't" 'doing' "don't" 'down' 'during' 'each'
 'few' 'for' 'from' 'further' 'had' "hadn't" 'has' "hasn't" 'have'
 "haven't" 'having' 'he' "he'd" "he'll" "he's" 'her' 'here' "here's" 'hers'
 'herself' 'him' 'himself' 'his' 'how' "how's" 'i' "i'd" "i'll" "i'm"
 "i've" 'if' 'in' 'into' 'is' "isn't" 'it' "it's" 'its' 'itself' "let's"
 'me' 'more' 'most' "mustn't" 'my' 'myself' 'no' 'nor' 'not' 'of' 'off'
 'on' 'once' 'only' 'or' 'other' 'ought' 'our' 'ours' 'ourselves' 'out'
 'over' 'own' 'same' "shan't" 'she' "she'd" "she'll" "she's" 'should'
 "shouldn't" 'so' 'some' 'such' 'than' 'that' "that's" 'the' 'their'
 'theirs' 'them' 'themselves' 'then' 'there' "there's" 'these' 'they'
 "they'd" "they'll" "they're" "they've" 'this' 'those' 'through' 'to' 'too'
 'under' 'until' 'up' 'very' 'was' "wasn't" 'we' "we'd" "we'll" "we're"
 "we've" 'were' "weren't" 'what' "what's" 'when' "when's" 'where' "where's"
 'which' 'while' 'who' "who's" 'whom' 'why' "why's" 'with' "won't" 'would'
 "wouldn't" 'you' "you'd" "you'll" "you're" "you've" 'your' 'yours'
 'yourself' 'yourselves']
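
A small aside: the membership test "x not in stopwords" used in the next cell scans the whole numpy array for every word. Converting the array to a Python set would give constant-time lookups; a minimal sketch (stopword_set could then be used in place of stopwords below):

stopword_set = set(stopwords)  # set membership is O(1) instead of a linear scan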

Do it: remove stop words and then stem the remaining words

In [20]:
for d in range(document_count):
    row = text_list[d]  # for convenience only
    words = row.split()  # must split because stem function works on one word only
    row = [stem(x) for x in words if x not in stopwords]
    row = ' '.join(row)  # join will undo the split
    text_list[d] = row
    
for i in range(5):
    print i
    print "TEXT: ", text_list[i]
    print
0
TEXT:  hi need food 4 children militari famili realli hit hard time exahust mean help just abl feed famili make anoth night ask know bless come whatev u can find heart give great appreci

1
TEXT:  spent last money gas today im broke next thursday

2
TEXT:  girlfriend decid good idea get perth bus station come visit sinc spend money taxi get dunde chanc kind soul get us pizza sinc cash anymor

3
TEXT:  cold n hungri complet honest broke mum said leftov dinner random pizza arriv nice edit leftov

4
TEXT:  hey guy love sub think great except sob stori miss place fun anywho given pizza thought tri luck get one friend live hour away schedul let us see much decid come visit night love abl good host order pizza go beer sob stori just look share pizza old friend


Create a list of distinct terms. (A single RAOP request may still have duplicate terms.)

In [21]:
# sets do not have duplicates
term_set = set()
for d in range(document_count):
    words = text_list[d].split()
    deduped = set(words)  # removes duplicates
    term_set = term_set.union(deduped)  # union() joins sets
In [22]:
sorted_term_list = sorted(term_set)  # sorted() always returns a list
print sorted_term_list[:10]  # sample from the top
print sorted_term_list[-10:]  # sample from the bottom
['0', '00', '000', '0000', '0011011001111000', '0072', '00pm', '012468', '02', '024856']
['zone', 'zonku', 'zoo', 'zoolog', 'zrssbvz', 'zsk5c3o', 'zucchini', 'zuuri', 'zw', 'zza']

I will often need a count of distinct terms (and documents).

In [23]:
term_count = len(sorted_term_list)
print "There are %d documents and %d distinct terms." % (document_count, term_count)
There are 4040 documents and 8976 distinct terms.

Create a dictionary which maps each term to its index in sorted_term_list

In [24]:
term_dict = {}
for i in range(term_count):
    term_dict[sorted_term_list[i]] = i
    
# test dictionary
print term_dict["pizza"]  
print sorted_term_list[term_dict["pizza"]]  # inverse -- should show pizza
6014
pizza

Build the term document matrix

In [25]:
# build the term-document matrix
# each row is a (stemmed) term
# each column is a document (RAOP request)
# each cell is the number of times that (stemmed) term appears in that document

TD = np.zeros((term_count, document_count))
for d in range(document_count):
    row = text_list[d]  # convenience
    terms = row.split()
    for i in range(len(terms)):
        t = term_dict[terms[i]]
        TD[t][d] += 1
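
An aside on scale: TD is a dense 8976-by-4040 array that is mostly zeros. That is manageable here, but for a larger corpus a sparse matrix would be the more economical choice. A minimal sketch, assuming scipy is installed (TD_sparse is my own name for the alternative matrix):

from scipy.sparse import lil_matrix

TD_sparse = lil_matrix((term_count, document_count))
for d in range(document_count):
    for w in text_list[d].split():
        TD_sparse[term_dict[w], d] += 1  # only nonzero cells are stored
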
In [26]:
# checking...
def inspect(d):  # d = document number
    print "Document %d: " % d
    print text_list[d]
    total = 0
    distinct = 0
    for t in range(term_count):
        total += TD[t][d]
        if (TD[t][d] > 0):
            distinct += 1
    print "Document %d had %d distinct terms and %d total terms." % (d, distinct, total)
    print
    
inspect(3)
inspect(4)
Document 3: 
cold n hungri complet honest broke mum said leftov dinner random pizza arriv nice edit leftov
Document 3 had 15 distinct terms and 16 total terms.

Document 4: 
hey guy love sub think great except sob stori miss place fun anywho given pizza thought tri luck get one friend live hour away schedul let us see much decid come visit night love abl good host order pizza go beer sob stori just look share pizza old friend
Document 4 had 43 distinct terms and 49 total terms.


Create the document frequency (DF) vector

In [27]:
# In how many different documents does each term occur?
# (This is actually called the df in Mobasher's slide #10)

DF = np.zeros((term_count))
for t in range(term_count):
    DF[t] = np.sum(1 for x in TD[t] if x > 0)
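
The same vector can also be produced in a single vectorized step, avoiding the Python-level loop; a minimal equivalent sketch:

DF_vectorized = (TD > 0).sum(axis=1)  # count the nonzero cells in each row of TD
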
In [28]:
# function returns the number of documents containing term
def docs_containing(word):
    return DF[term_dict[word]]

for word in ("pizza", "mom", "haiku"):
    print "%-s is found in %d documents." % (word, docs_containing(word))
pizza is found in 2534 documents.
mom is found in 159 documents.
haiku is found in 6 documents.

This will make subsequent output a little cleaner...

In [29]:
np.set_printoptions(precision=3)

Determine the number of successful (true) and unsuccessful (false) outcomes...

In [30]:
outcome_true = np.sum(1 for x in outcome_list if x == True)
outcome_false = np.sum(1 for x in outcome_list if x == False)
print "There were %d successful requests and %d unsuccessful requests." % \
    (outcome_true, outcome_false)
There were 994 successful requests and 3046 unsuccessful requests.

In [31]:
# This line configures matplotlib to show figures embedded in the notebook, 
# instead of opening a new window for each figure. More about that later. 
# If you are using an old version of IPython, try using '%pylab inline' instead.

%matplotlib inline
In [32]:
import matplotlib.pyplot as plt
In [33]:
# make a square figure and axes
plt.figure(1, figsize=(6,6))
ax = plt.axes([0.1, 0.1, 0.8, 0.8])

# The slices will be ordered and plotted counter-clockwise.
labels = 'Successful', 'Unsuccessful'
fracs = [outcome_true, outcome_false]

plt.pie(fracs, labels=labels, autopct='%1.1f%%', shadow=False, startangle=0, colors=('y','c'))
# With the default startangle of 0 (used here), the first slice
# ('Successful') starts on the positive x-axis.  Setting startangle=90
# would rotate everything counter-clockwise by 90 degrees so that
# plotting starts on the positive y-axis.

plt.title('RAOP Outcome', bbox={'facecolor':'0.8', 'pad':5})

plt.show()

This is a crosstab function which I wrote and will use shortly...

In [34]:
# function to produce crosstab
import collections
def crosstab(vecA, vecB):
    keysA = collections.Counter(vecA).keys()
    keysB = collections.Counter(vecB).keys()
    keysA.sort()
    keysB.sort()

    # use a dictionary to convert values to indices
    dictA = {}
    for i in range(len(keysA)):
        dictA[keysA[i]] = i
    
    dictB = {}
    for i in range(len(keysB)):
        dictB[keysB[i]] = i
    
    # array which will hold crosstab frequencies
    freqs = np.zeros((len(keysA), len(keysB)))

    # count 'em
    pairs = len(vecA)
    for i in range(pairs):
        freqs[dictA[vecA[i]]][dictB[vecB[i]]] += 1

    # reminder: print statement ending in comma suppresses newline
    # print column headings
    for clm in range(len(keysB)):
        print "\t%s" % keysB[clm],
    print
    for row in range(len(keysA)):
        print keysA[row],
        for clm in range(len(keysB)):
            print "\t%d" % freqs[row][clm],
        print
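
The function is not used until later, but as a quick illustration of the calling convention, here is a hypothetical example that would cross-tabulate outcome against a crude long/short request indicator (the 30-term cutoff is arbitrary):

# hypothetical usage example -- any two equal-length categorical vectors will do
length_category = ['long' if len(t.split()) > 30 else 'short' for t in text_list]
crosstab(outcome_list, length_category)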

Examine document length

In [35]:
# document length by outcome?
termsInWinningDocuments = np.zeros(outcome_true)
w = 0
termsInLosingDocuments = np.zeros(outcome_false)
l = 0
for d in range(document_count):
    count = 0
    for t in range(term_count):
        if (TD[t][d] > 0):
            # count += 1
            count += TD[t][d]
    if (outcome_list[d] == True):
        termsInWinningDocuments[w] = count
        w += 1
    else:
        termsInLosingDocuments[l] = count
        l += 1
In [36]:
print "DOCUMENT LENGTH IN LOSING DOCUMENTS:"
print "  N = %d" % len(termsInLosingDocuments)
print "  Min = %d" % termsInLosingDocuments.min()
print "  Max = %d" % termsInLosingDocuments.max()
print "  Mean = %.1f" % termsInLosingDocuments.mean()
print "  Median = %.1f" % np.median(termsInLosingDocuments)
print 
print "DOCUMENT LENGTH IN WINNING DOCUMENTS:"
print "  N = %d" % len(termsInWinningDocuments)
print "  Min = %d" % termsInWinningDocuments.min()
print "  Max = %d" % termsInWinningDocuments.max()
print "  Mean = %.1f" % termsInWinningDocuments.mean()
print "  Median = %.1f" % np.median(termsInWinningDocuments)
DOCUMENT LENGTH IN LOSING DOCUMENTS:
  N = 3046
  Min = 0
  Max = 419
  Mean = 36.5
  Median = 28.0

DOCUMENT LENGTH IN WINNING DOCUMENTS:
  N = 994
  Min = 0
  Max = 392
  Mean = 45.8
  Median = 36.0

In [37]:
# use normed=True for relative histogram (vs. absolute)

plt.hist(termsInLosingDocuments, normed=True, bins=25, color='r', alpha=0.5, label="Losing")
plt.hist(termsInWinningDocuments, normed=True, bins=25, color='b', alpha=0.5, label="Winning")
plt.xlabel("terms")
plt.ylabel("probability")
plt.title("Document Length by Outcome")
plt.legend()
plt.show()

Determine distinct terms per document by outcome: possible proxy for intelligence / sincerity?

In [38]:
# distinct terms per document by outcome?
distinctTermsInWinningDocuments = np.zeros(outcome_true)
w = 0
distinctTermsInLosingDocuments = np.zeros(outcome_false)
l = 0
for d in range(document_count):
    count = 0
    for t in range(term_count):
        if (TD[t][d] > 0):
            count += 1
    if (outcome_list[d] == True):
        distinctTermsInWinningDocuments[w] = count
        w += 1
    else:
        distinctTermsInLosingDocuments[l] = count
        l += 1
In [39]:
print "DISTINCT TERMS PER DOCUMENT IN LOSING DOCUMENTS:"
print "  N = %d" % len(distinctTermsInLosingDocuments)
print "  Min = %d" % distinctTermsInLosingDocuments.min()
print "  Max = %d" % distinctTermsInLosingDocuments.max()
print "  Mean = %.1f" % distinctTermsInLosingDocuments.mean()
print "  Median = %.1f" % np.median(distinctTermsInLosingDocuments)
print 
print "DISTINCT TERMS PER DOCUMENT IN WINNING DOCUMENTS:"
print "  N = %d" % len(distinctTermsInWinningDocuments)
print "  Min = %d" % distinctTermsInWinningDocuments.min()
print "  Max = %d" % distinctTermsInWinningDocuments.max()
print "  Mean = %.1f" % distinctTermsInWinningDocuments.mean()
print "  Median = %.1f" % np.median(distinctTermsInWinningDocuments)
DISTINCT TERMS PER DOCUMENT IN LOSING DOCUMENTS:
  N = 3046
  Min = 0
  Max = 290
  Mean = 32.2
  Median = 26.0

DISTINCT TERMS PER DOCUMENT IN WINNING DOCUMENTS:
  N = 994
  Min = 0
  Max = 247
  Mean = 39.9
  Median = 34.0

In [40]:
# use normed=True for relative histogram (vs. absolute)

plt.hist(distinctTermsInLosingDocuments, normed=True, bins=25, color='r', alpha=0.5, label="Losing")
plt.hist(distinctTermsInWinningDocuments, normed=True, bins=25, color='b', alpha=0.5, label="Winning")
plt.xlabel("distinct terms")
plt.ylabel("probability")
plt.title("Distinct Terms in Document by Outcome")
plt.legend()
plt.show()

For each term, determine how many "winning" documents and "losing" documents have that term.

In [41]:
winning = np.zeros(term_count)
losing = np.zeros(term_count)
for t in range(term_count):
    for d in range(document_count):
        if (TD[t][d] > 0):
            if (outcome_list[d] == True):
                winning[t] += 1
            else:
                losing[t] += 1
In [43]:
print "Helpful terms (at least 10 wins, and percent of wins >= 40%"
print "-----------------------------------------------------------"
print "term              win    lose    pct"
for t in range(term_count):
    term = sorted_term_list[t]
    win = winning[t]
    lose = losing[t]
    pct = 1.0 * win / (win + lose)
    if (win >=10 and pct >= .4):
        print "%-15s\t%5d\t%5d\t%4.2f" % (sorted_term_list[t], win, lose, pct)

print
print "Harmful terms (least 1 win but percent of wins <= 15%"
print "-----------------------------------------------------"
print "term              win    lose    pct"
for t in range(term_count):
    term = sorted_term_list[t]
    win = winning[t]
    lose = losing[t]
    pct = 1.0 * win / (win + lose)
    if (win >=1 and pct <= .15):
        print "%-15s\t%5d\t%5d\t%4.2f" % (sorted_term_list[t], win, lose, pct)
Helpful terms (at least 10 wins and percent of wins >= 40%)
-----------------------------------------------------------
term              win    lose    pct
24             	   12	   16	0.43
40             	   12	   14	0.46
7              	   36	   33	0.52
accid          	   11	   16	0.41
activ          	   10	   13	0.43
aid            	   18	   23	0.44
avail          	   15	   22	0.41
awhil          	   12	   16	0.43
babi           	   21	   29	0.42
bean           	   21	   30	0.41
buck           	   22	   25	0.47
certain        	   13	   19	0.41
chain          	   10	   14	0.42
cheap          	   16	   24	0.40
chicago        	   10	   12	0.45
constant       	   11	    9	0.55
cover          	   25	   28	0.47
cup            	   10	   11	0.48
deal           	   32	   40	0.44
engin          	   10	   15	0.40
especi         	   17	   23	0.42
exchang        	   19	   20	0.49
father         	   18	   21	0.46
fee            	   16	   19	0.46
form           	   11	   16	0.41
generos        	   11	   15	0.42
heat           	   14	   13	0.52
hire           	   10	   12	0.45
hunt           	   10	   12	0.45
hurt           	   21	   27	0.44
imgur          	   68	   95	0.42
includ         	   23	   29	0.44
incred         	   14	   14	0.50
item           	   10	   11	0.48
jpg            	   43	   49	0.47
landlord       	   12	   12	0.50
larg           	   24	   35	0.41
learn          	   13	   17	0.43
leftov         	   15	   17	0.47
mail           	   17	   25	0.40
marri          	   10	    9	0.53
mention        	   22	   25	0.47
oatmeal        	   10	   15	0.40
quick          	   13	   14	0.48
rather         	   25	   36	0.41
rice           	   58	   66	0.47
second         	   19	   25	0.43
spare          	   23	   32	0.42
stretch        	   16	   16	0.50
stupid         	   12	   17	0.41
sunday         	   20	   29	0.41
surpris        	   39	   58	0.40
surviv         	   19	   26	0.42
tight          	   37	   49	0.43
total          	   34	   47	0.42
tough          	   25	   28	0.47
unexpect       	   15	   18	0.45
updat          	   12	   10	0.55
visit          	   24	   36	0.40

Harmful terms (at least 1 win but percent of wins <= 15%)
-----------------------------------------------------
term              win    lose    pct
28th           	    1	    6	0.14
4th            	    2	   15	0.12
acquir         	    1	    6	0.14
agre           	    1	    8	0.11
altern         	    2	   14	0.12
america        	    1	   10	0.09
angel          	    1	   12	0.08
anim           	    1	   10	0.09
appl           	    1	    7	0.12
asleep         	    1	   10	0.09
assum          	    1	    9	0.10
atlanta        	    1	   10	0.09
attent         	    1	    6	0.14
await          	    1	    6	0.14
awkward        	    1	    7	0.12
bail           	    1	    9	0.10
ball           	    1	    6	0.14
bay            	    1	    7	0.12
becam          	    1	    8	0.11
bro            	    1	    7	0.12
build          	    3	   20	0.13
bunch          	    4	   32	0.11
cafeteria      	    1	   14	0.07
camera         	    1	    7	0.12
canada         	    5	   35	0.12
chair          	    1	    6	0.14
chase          	    1	    6	0.14
cheaper        	    1	    7	0.12
clear          	    4	   23	0.15
closest        	    1	   10	0.09
complic        	    1	    8	0.11
contract       	    4	   23	0.15
convinc        	    1	   10	0.09
cool           	    8	   50	0.14
countri        	    3	   20	0.13
crap           	    2	   15	0.12
creat          	    2	   14	0.12
ct             	    1	    7	0.12
cuz            	    1	    6	0.14
daili          	    1	    6	0.14
dark           	    2	   12	0.14
dead           	    2	   12	0.14
dear           	    1	   11	0.08
delight        	    1	    7	0.12
digit          	    1	   12	0.08
disappear      	    1	    8	0.11
dota           	    1	    6	0.14
drag           	    1	    7	0.12
drawn          	    1	    7	0.12
dude           	    2	   16	0.11
ear            	    1	    8	0.11
eas            	    1	    7	0.12
economi        	    1	   17	0.06
electron       	    1	   12	0.08
emot           	    1	   11	0.08
episod         	    1	   10	0.09
equal          	    1	    8	0.11
facebook       	    3	   23	0.12
fast           	    4	   23	0.15
favourit       	    1	    7	0.12
femal          	    1	    6	0.14
finger         	    1	    9	0.10
five           	    3	   18	0.14
fl             	    1	    7	0.12
float          	    1	    9	0.10
floor          	    2	   12	0.14
frustrat       	    1	    9	0.10
georgia        	    2	   12	0.14
given          	    5	   32	0.14
grandfath      	    1	    7	0.12
grub           	    1	    7	0.12
grumbl         	    1	    7	0.12
hamburg        	    1	    7	0.12
hang           	    4	   32	0.11
heck           	    1	    9	0.10
hip            	    1	    6	0.14
homeless       	    5	   30	0.14
homework       	    1	   14	0.07
hotel          	    2	   16	0.11
hurrican       	    1	    8	0.11
imag           	    1	    7	0.12
increas        	    1	    7	0.12
intern         	    1	   10	0.09
ireland        	    1	    6	0.14
joy            	    1	    9	0.10
kansa          	    1	    7	0.12
labor          	    2	   12	0.14
level          	    1	    9	0.10
librari        	    2	   12	0.14
licens         	    2	   12	0.14
ll             	    5	   29	0.15
london         	    1	   10	0.09
louisvill      	    1	    7	0.12
lover          	    1	   15	0.06
male           	    2	   12	0.14
map            	    1	    6	0.14
math           	    1	    7	0.12
menu           	    1	    7	0.12
metro          	    1	    7	0.12
miser          	    1	    6	0.14
ms             	    2	   13	0.13
munchi         	    1	    6	0.14
musician       	    1	    6	0.14
nearbi         	    3	   18	0.14
necessari      	    3	   18	0.14
needless       	    1	   18	0.05
netflix        	    1	    7	0.12
notic          	    4	   30	0.12
novemb         	    1	    9	0.10
nowher         	    1	   12	0.08
ny             	    2	   19	0.10
occasion       	    1	    6	0.14
older          	    2	   16	0.11
origin         	    2	   16	0.11
pa             	    1	   11	0.08
page           	    2	   28	0.07
pathet         	    1	   10	0.09
pawn           	    1	    7	0.12
penni          	    1	   17	0.06
per            	    2	   13	0.13
perhap         	    2	   12	0.14
pipe           	    1	    6	0.14
play           	    8	   57	0.12
pre            	    1	   10	0.09
prefer         	    6	   38	0.14
pressur        	    1	    6	0.14
print          	    1	    9	0.10
randomactsofpizza	    2	   12	0.14
regist         	    2	   14	0.12
relief         	    1	    8	0.11
repeat         	    1	    6	0.14
research       	    1	    8	0.11
result         	    4	   23	0.15
roomat         	    1	    9	0.10
sale           	    2	   12	0.14
scarc          	    1	    9	0.10
screw          	    4	   24	0.14
seven          	    1	    6	0.14
shelter        	    1	   10	0.09
shes           	    1	    8	0.11
sibl           	    2	   12	0.14
sit            	   13	   79	0.14
smoke          	    1	   22	0.04
sooner         	    1	    6	0.14
southern       	    2	   13	0.13
standard       	    1	    7	0.12
strain         	    1	    7	0.12
street         	    4	   26	0.13
stuf           	    1	   14	0.07
stumbl         	    1	    6	0.14
task           	    1	    6	0.14
teacher        	    1	   11	0.08
tech           	    1	    9	0.10
teenag         	    1	    7	0.12
ten            	    1	   11	0.08
terribl        	    6	   43	0.12
text           	    2	   14	0.12
there          	    1	    6	0.14
threw          	    1	    7	0.12
tide           	    1	   12	0.08
til            	    6	   40	0.13
tini           	    2	   14	0.12
topic          	    1	    7	0.12
toy            	    1	    6	0.14
tucson         	    1	    7	0.12
uk             	    5	   29	0.15
uni            	    2	   12	0.14
unlik          	    1	    6	0.14
usernam        	    1	   14	0.07
valu           	    1	    7	0.12
via            	    3	   23	0.12
washington     	    2	   16	0.11
wi             	    1	    6	0.14
wine           	    1	    6	0.14
wipe           	    1	    8	0.11
wisconsin      	    1	    6	0.14
yummi          	    2	   12	0.14
za             	    2	   12	0.14

Observations

→ The following terms might help your chances: accident, aid, chicago, father, generosity, landlord, leftovers, rice, tough. These terms might hurt your chances: america, angel, atlanta, bail, canada, facebook, hang, homeless, homework, kansas, pawn, street, washington, wisconsin.

K Nearest Neighbors (KNN) is often used when analyzing text data, in particular to classify documents into predefined categories, such as successful and unsuccessful in the current case. In this section we will attempt that method.

Create the Inverse Document Frequency (IDF) vector

In [44]:
import math

IDF = [math.log(1.0 * document_count / x, 2) for x in DF]

np.set_printoptions(precision=3)
print IDF[:10]
[7.073248982030639, 7.336283387864433, 8.658211482751796, 11.980139577639159, 11.980139577639159, 11.980139577639159, 11.980139577639159, 11.980139577639159, 11.980139577639159, 11.980139577639159]
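
As a sanity check, the IDF of "pizza" should equal log2(4040/2534), roughly 0.67, since we saw earlier that "pizza" occurs in 2534 of the 4040 documents:

print IDF[term_dict["pizza"]]     # value from the list comprehension above
print math.log(4040.0 / 2534, 2)  # expected: about 0.67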

Create the tfidf matrix

In [45]:
TFIDF = np.copy(TD)
for t in range(term_count):
    for d in range(document_count):
        TFIDF[t][d] = TD[t][d] * IDF[t]
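
The same matrix can be produced without the double loop by letting numpy broadcast the IDF vector across the columns of TD; a minimal equivalent sketch (TFIDF_vectorized is my own name):

TFIDF_vectorized = TD * np.array(IDF)[:, np.newaxis]  # multiply every row t of TD by IDF[t]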

KNN requires a distance function: I will use cosine distance (one minus cosine similarity)

In [46]:
# cosine similarity function
# taken from my homework2
def cosineSimilarity(v1, v2):
    dotProduct = np.dot(v1, v2)
    ss1 = (v1 ** 2).sum()
    ss2 = (v2 ** 2).sum()
    if (ss1 == 0 or ss2 == 0):
        answer = 0  # do not divide by zero!
    else:
        answer = dotProduct / ((ss1 ** 0.5) * (ss2 ** 0.5))
    return answer

a = np.array([2, 11, 5])
b = np.array([7, 3, 15])

# test my cosineSimilarity function
# given (2, 11, 5), (7, 3, 15)
# expect ((2*7)+(11*3)+(5*15)) / (sqrt(2**2 + 11**2 + 5**2) * sqrt(7**2 + 3**2 + 15**2)) = 0.59
cs = cosineSimilarity(a, b)
print cs



def cosineDistance(v1, v2):
    cs = cosineSimilarity(v1, v2)
    return 1 - cs

# test my cosineDistance function
# given (2, 11, 5), (7, 3, 15)
# expect 1 - 0.59 = .41
cd = cosineDistance(a, b)
print cd
0.592135342502
0.407864657498

Transpose the tfidf matrix to create a document-term (DT) matrix

In [47]:
DT = TFIDF.T
print DT.shape
(4040L, 8976L)

In [48]:
# checking...I know document 4 has term "sob" twice

t = term_dict["sob"]
print t             # index of "sob" in sorted_term_list
print TD[t][4]      # sob occurs twice in document 4
print DF[t]         # sob occurs in 119 documents
print document_count
print IDF[t]        # calc follows
print math.log(4040.0/119, 2)  # should match IDF -- it does
print TFIDF[t][4]
print DT[4][t]
print math.log(4040.0/119, 2) * 2  # should match TFIDF and DT -- it does
7316
2.0
119.0
4040
5.08532181433
5.08532181433
10.1706436287
10.1706436287
10.1706436287

Split the data into 80% training and 20% testing

In [49]:
training_percent = 0.8
training_size = int(training_percent * len(DT))
print "training_size=", training_size

training_matrix = DT[:training_size,:]
testing_matrix = DT[training_size:,:]
print "training_matrix.shape=", training_matrix.shape
print "testing_matrix.shape=", testing_matrix.shape

training_outcome_list = outcome_list[:training_size]
testing_outcome_list = outcome_list[training_size:]

print
print "type(training_outcome_list)=", type(training_outcome_list)
print "len(training_outcome_list)=", len(training_outcome_list)
t = np.sum(1 for x in training_outcome_list if x == True)
f = np.sum(1 for x in training_outcome_list if x == False)
print "%.1f%% of training dataset had successful outcome." % (100.0 * t / (t + f))
print
print "type(testing_outcome_list)=", type(testing_outcome_list)
print "len(testing_outcome_list)=", len(testing_outcome_list)
t = np.sum(1 for x in testing_outcome_list if x == True)
f = np.sum(1 for x in testing_outcome_list if x == False)
print "%.1f%% of testing dataset had successful outcome." % (100.0 * t / (t + f))
training_size= 3232
training_matrix.shape= (3232L, 8976L)
testing_matrix.shape= (808L, 8976L)

type(training_outcome_list)= <type 'list'>
len(training_outcome_list)= 3232
24.7% of training dataset had successful outcome.

type(testing_outcome_list)= <type 'list'>
len(testing_outcome_list)= 808
24.3% of testing dataset had successful outcome.
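
One caveat: this is a purely sequential split (the first 80% of rows become the training set). The similar class proportions above suggest that is acceptable here, but if there were any concern that the file is ordered by date or outcome, a shuffled split would be safer. A minimal sketch, using an arbitrary fixed seed:

np.random.seed(0)  # arbitrary seed, chosen only for reproducibility
perm = np.random.permutation(len(DT))
DT_shuffled = DT[perm]
outcome_shuffled = [outcome_list[i] for i in perm]
# then slice DT_shuffled and outcome_shuffled exactly as above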

Definition of knnClassifier function (taken from MLA, pages 21-22)

In [50]:
import operator  # see MLA pages 21-22
def knnClassifier(trainingMatrix, labels, instance, K, distanceFunction):
    rows = trainingMatrix.shape[0]
    distance = np.empty(rows)
    for i in range(rows):
        distance[i] = distanceFunction(trainingMatrix[i], instance)
    labeledDistances = np.rec.fromarrays((distance, labels))
    sortedLabeledDistances = labeledDistances.argsort()  # returns array of indexes indicating sort order
    classCount = {}
    for i in range(K):
        label = labeledDistances[sortedLabeledDistances[i]][1]
        classCount[label] = classCount.get(label, 0) + 1
    sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
    predictedClass = sortedClassCount[0][0]
    topKNeighbors = sortedLabeledDistances[0:K]  # slice to get index of K nearest neighbors
    return predictedClass, topKNeighbors

Test of knnClassifier function

In [51]:
# Test my knnClassifier function
# I will use the data found in MLA, pages 21-24
group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = np.array(['A', 'A', 'B', 'B'])
print group
print labels
print

print "test using cosineDistance function:"
label, topKNeighbors = knnClassifier(group, labels, np.array([0.9, 0.1]), 3, cosineDistance)
print label, topKNeighbors
label, topKNeighbors = knnClassifier(group, labels, np.array([0.1, 0.9]), 3, cosineDistance)
print label, topKNeighbors
[[ 1.   1.1]
 [ 1.   1. ]
 [ 0.   0. ]
 [ 0.   0.1]]
['A' 'A' 'B' 'B']

test using cosineDistance function:
A [1 0 3]
A [3 0 1]

Try it for different values of K

In [52]:
def classificationAccuracy(trainingMatrix, trainingClasses, testingMatrix, testingClasses, K, distanceFunction):
    testCases = testingMatrix.shape[0]
    correct = 0
    for i in range(testCases):
        testInstance = testingMatrix[i]
        predictedClass, topKNeighbors = knnClassifier(trainingMatrix, trainingClasses, testInstance, K, distanceFunction)
        actualClass = testingClasses[i]
        # print "Instance #%d: Predicted=%d, Actual=%d" % (i, predictedClass, actualClass)
        if (predictedClass == actualClass):
            correct = correct + 1
    accuracy = 1.0 * correct / testCases
    return accuracy
In [53]:
for K in (1, 2, 3, 10, 15, 20):
    accuracy = classificationAccuracy(training_matrix, training_outcome_list, \
        testing_matrix, testing_outcome_list, K, cosineDistance)
    print "K = %d, accuracy = %5.3f" % (K, accuracy)
K = 1, accuracy = 0.666
K = 2, accuracy = 0.745
K = 3, accuracy = 0.717
K = 10, accuracy = 0.751
K = 15, accuracy = 0.752
K = 20, accuracy = 0.754

Discussion of KNN

Well, that was certainly discouraging! At first glance an accuracy rate of 75% sounds pretty good, until one considers that, as we saw in the exploratory data analysis section, 75% of the RAOP requests are unsuccessful. So if we just guess "unsuccessful" all the time, we will be correct about 75% of the time.
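
A quick check of that majority-class baseline on the same test split (a minimal sketch using testing_outcome_list from above) makes the comparison explicit:

baseline_correct = sum(1 for x in testing_outcome_list if x == False)
baseline_accuracy = 1.0 * baseline_correct / len(testing_outcome_list)
print "Always guessing 'unsuccessful' gives accuracy = %5.3f" % baseline_accuracy

Given that 24.3% of the test set was successful, this baseline is about 0.757, essentially the same as the best KNN result above.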

Time to try another method.

Sentiment Analysis Plan

I decided to try some sentiment analysis next. Are positive (entertaining?) requests or negative (desperate?) requests more likely to be successful? My approach will be to determine the sentiment for each request, and then create summary statistics on sentiment by outcome -- pizza given or not.

For sentiment analysis I will use the TextBlob Python package.

Import TextBlob and run a simple test

The simple examples I have included here show that the sentiment analysis algorithm in TextBlob must be quite simplistic.

  • Punctuation matters. ("things are going great!" scores higher than "things are going great.")
  • It would appear that the proximity of negation matters. ("going ok" and "not going ok" get the same score, but "i am happy" and "i am not happy" do not.)
  • Context matters. (Many RAOP requests talk about having money. Here we see that "i have plenty of cash" and "i have no cash" both have a polarity of 0.0. That certainly is a problem in this case!)

Given these findings, my sentiment analysis will be done on the original text of the request, not the version which has been stemmed and has had stopwords removed. ("not" is a stopword, but it clearly can be relevant in sentiment analysis.)

In [54]:
from textblob import TextBlob

# a simple test function which I wrote...
def testTextBlob(phrase):
    blob = TextBlob(phrase)
    sentiment = blob.sentiment
    polarity = sentiment.polarity  # a measure of the mood
    subjectivity = sentiment.subjectivity  # a measure of confidence in the polarity
    print "The phrase \"%s\" has polarity of %.2f and subjectivity of %.2f." % \
        (phrase, polarity, subjectivity)
        
for phrase in ("things are going great!", "things are going great.", 
    "things are going great?", "things are going ok", "things are not going ok",\
    "things are going poorly", "things are going real bad", \
    "i have plenty of cash", "i am out of cash", "i have no cash", \
    "i am happy", "i am not happy"):
    testTextBlob(phrase)
The phrase "things are going great!" has polarity of 1.00 and subjectivity of 0.75.
The phrase "things are going great." has polarity of 0.80 and subjectivity of 0.75.
The phrase "things are going great?" has polarity of 0.80 and subjectivity of 0.75.
The phrase "things are going ok" has polarity of 0.50 and subjectivity of 0.50.
The phrase "things are not going ok" has polarity of 0.50 and subjectivity of 0.50.
The phrase "things are going poorly" has polarity of -0.40 and subjectivity of 0.60.
The phrase "things are going real bad" has polarity of -1.00 and subjectivity of 1.00.
The phrase "i have plenty of cash" has polarity of 0.00 and subjectivity of 0.00.
The phrase "i am out of cash" has polarity of 0.00 and subjectivity of 0.00.
The phrase "i have no cash" has polarity of 0.00 and subjectivity of 0.00.
The phrase "i am happy" has polarity of 0.80 and subjectivity of 1.00.
The phrase "i am not happy" has polarity of -0.40 and subjectivity of 1.00.

What about spelling?

TextBlob can correct spelling, but it would appear that when performing sentiment analysis it is either correcting spelling internally or simply ignoring words it does not recognize, because the sentiment scores below are the same before and after correction.

In [55]:
incorrect = "I havv good speling"

print "BEFORE CORRECTING SPELLING"
testTextBlob(incorrect)

print
print "AFTER CORRECTING SPELLING"
corrected = str(TextBlob(incorrect).correct())
print corrected
testTextBlob(corrected)

print
print "SHORTCUT NOTATION"
print TextBlob(incorrect).correct().sentiment.polarity
BEFORE CORRECTING SPELLING
The phrase "I havv good speling" has polarity of 0.70 and subjectivity of 0.60.

AFTER CORRECTING SPELLING
I have good spelling
The phrase "I have good spelling" has polarity of 0.70 and subjectivity of 0.60.

SHORTCUT NOTATION
0.7

Find polarity score for each request using the original text.

(How's this for a cool example of list comprehension!)

In [56]:
polarity = [TextBlob(x).sentiment.polarity for x in original]  # creates a list
In [57]:
print "DOCUMENT POLARITY (USING ORIGINAL TEXT):"
print "  N = %d" % len(polarity)
print "  Min = %d" % min(polarity)
print "  Max = %d" % max(polarity)
print "  Std = %.2f" % np.std(polarity)
print "  Mean = %.2f" % np.mean(polarity)
print "  Median = %.2f" % np.median(polarity)
DOCUMENT POLARITY (USING ORIGINAL TEXT):
  N = 4040
  Min = -1
  Max = 1
  Std = 0.20
  Mean = 0.13
  Median = 0.11

So let's examine polarity by outcome.

In [58]:
polarityWin = np.zeros(outcome_true)
w = 0
polarityLose = np.zeros(outcome_false)
l = 0
for d in range(document_count):
    if (outcome_list[d] == True):
        polarityWin[w] = polarity[d]
        w += 1
    else:
        polarityLose[l] = polarity[d]
        l += 1

print "DOCUMENT POLARITY IN LOSING DOCUMENTS (USING ORIGINAL TEXT):"
print "  N = %d" % len(polarityLose)
print "  Min = %d" % polarityLose.min()
print "  Max = %d" % polarityLose.max()
print "  Std = %.2f" % polarityLose.std()
print "  Mean = %.2f" % polarityLose.mean()
print "  Median = %.2f" % np.median(polarityLose)
print 
print "DOCUMENT POLARITY IN WINNING DOCUMENTS (USING ORIGINAL TEXT):"
print "  N = %d" % len(polarityWin)
print "  Min = %d" % polarityWin.min()
print "  Max = %d" % polarityWin.max()
print "  Std = %2f" % polarityWin.std()
print "  Mean = %.2f" % polarityWin.mean()
print "  Median = %.2f" % np.median(polarityWin)
DOCUMENT POLARITY IN LOSING DOCUMENTS (USING ORIGINAL TEXT):
  N = 3046
  Min = -1
  Max = 1
  Std = 0.20
  Mean = 0.13
  Median = 0.11

DOCUMENT POLARITY IN WINNING DOCUMENTS (USING ORIGINAL TEXT):
  N = 994
  Min = 0
  Max = 1
  Std = 0.18
  Mean = 0.13
  Median = 0.12

In [59]:
# use normed=True for relative histogram (vs. absolute)

plt.hist(polarityLose, normed=True, bins=20, color='r', alpha=0.5, label="Losing")
plt.hist(polarityWin, normed=True, bins=20, color='b', alpha=0.5, label="Winning")
plt.xlabel("polarity")
plt.ylabel("probability")
plt.title("Sentiment Analysis Polarity by Outcome")
plt.legend()
plt.show()

Discussion of Sentiment Analysis

Well, that was discouraging! But this was not a wasted effort on my part. I recently attended a web conference on SAS' sentiment analysis package. It is very expensive, so people naturally look for cheaper or, better yet, free alternatives; Python was mentioned. It is clear to me that TextBlob's sentiment analysis is very weak compared to SAS'.

n-grams Plan

The next method I will try is n-grams, that is, rather than look at single words, look at pairs or triplets of words. Do requests containing certain pairs or triplets have a higher probability of winning a pizza?
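
For reference, TextBlob's ngrams method (used in the cell below) returns one WordList for every window of N consecutive tokens. A quick illustration on a made-up phrase, output omitted:

demo = TextBlob("i would love a large pizza")  # hypothetical phrase
for ng in demo.ngrams(n=2):
    print ' '.join(ng)  # "i would", "would love", "love a", ...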

In [60]:
def nGramAnalysis(alist, N, minWin, minPct):
    winning_ngrams = {}
    losing_ngrams = {}
    for d in range(len(alist)):
        blob = TextBlob(alist[d])
        ngrams = blob.ngrams(n=N)
        for n in range(len(ngrams)):
            ng = ngrams[n]
            s = ""
            for k in range(len(ng)):  # x,y,z becomes *x*y*z
                s = s + "*" + str(ng[k].encode('ascii', 'ignore'))
            s = s.lower()
            if (outcome_list[d] == True):
                if s in winning_ngrams:
                    winning_ngrams[s] += 1
                else:
                    winning_ngrams[s] = 1
            else:
                if s in losing_ngrams:
                    losing_ngrams[s] += 1
                else:
                    losing_ngrams[s] = 1

    keys = winning_ngrams.keys() + losing_ngrams.keys()
    keys_as_set = set(keys)
    sorted_keys = sorted(keys_as_set) 

    print "N = %d, minimum win count = %d, minimum win percent = %.2f" % (N, minWin, minPct)
    findAny = False
    for i in range(len(sorted_keys)):
        key = sorted_keys[i]
        if (key in winning_ngrams):
            win = winning_ngrams[key]
        else:
            win = 0
        if (key in losing_ngrams):
            lose = losing_ngrams[key]
        else:
            lose = 0
        pct = 1.0 * win / (win + lose)
        if (win >= minWin and pct >= minPct):
            print "  %-25s %d \t%d \t%5.2f" % (key, win, lose, 1.0 * win / (win + lose))
            findAny = True
            
    if (findAny == False):
        print "  (none found)"
In [61]:
print "ORIGINAL TEXT"
nGramAnalysis(original, 1, 10, .5)
nGramAnalysis(original, 2, 10, .4)
nGramAnalysis(original, 2, 10, .5)
nGramAnalysis(original, 3, 10, .4)
nGramAnalysis(original, 3, 10, .5)
print
print "STEM/STOP TEXT"
nGramAnalysis(text_list, 1, 10, .5)
nGramAnalysis(text_list, 2, 10, .4)
nGramAnalysis(text_list, 2, 10, .5)
nGramAnalysis(text_list, 3, 10, .4)
nGramAnalysis(text_list, 3, 10, .5)
ORIGINAL TEXT
N = 1, minimum win count = 10, minimum win percent = 0.50
  *basic                    10 	5 	 0.67
  *bucks                    27 	25 	 0.52
  *checks                   14 	12 	 0.54
  *date                     19 	15 	 0.56
  *disability               10 	8 	 0.56
  *exchange                 23 	20 	 0.53
  *gt                       14 	14 	 0.50
  *heat                     11 	11 	 0.50
  *landlord                 13 	12 	 0.52
  *partner                  11 	8 	 0.58
  *program                  10 	8 	 0.56
  *stopped                  13 	10 	 0.57
  *stretch                  13 	8 	 0.62
  *surviving                10 	6 	 0.62
  *total                    12 	11 	 0.52
N = 2, minimum win count = 10, minimum win percent = 0.40
  *'ll*get                  11 	11 	 0.50
  *'m*currently             20 	25 	 0.44
  *'m*doing                 10 	12 	 0.45
  *'s*just                  23 	25 	 0.48
  *'s*no                    16 	11 	 0.59
  *3*days                   12 	18 	 0.40
  *3*weeks                  11 	13 	 0.46
  *a*single                 15 	22 	 0.41
  *account*is               16 	22 	 0.42
  *after*the                11 	13 	 0.46
  *an*empty                 13 	14 	 0.48
  *and*after                11 	16 	 0.41
  *and*food                 11 	16 	 0.41
  *and*got                  13 	13 	 0.50
  *and*only                 12 	14 	 0.46
  *and*some                 21 	29 	 0.42
  *and*they                 28 	36 	 0.44
  *and*was                  24 	32 	 0.43
  *and*when                 11 	14 	 0.44
  *and*will                 37 	51 	 0.42
  *anything*but             14 	15 	 0.48
  *are*n't                  25 	29 	 0.46
  *around*the               11 	14 	 0.44
  *as*we                    15 	20 	 0.43
  *ask*for                  37 	33 	 0.53
  *because*he               11 	12 	 0.48
  *been*on                  11 	16 	 0.41
  *been*pretty              11 	11 	 0.50
  *been*unemployed          13 	14 	 0.48
  *broke*down               16 	19 	 0.46
  *but*at                   10 	6 	 0.62
  *but*this                 12 	17 	 0.41
  *by*my                    11 	11 	 0.50
  *car*broke                13 	15 	 0.46
  *cheesy*goodness          10 	9 	 0.53
  *could*make               13 	10 	 0.57
  *could*send               12 	9 	 0.57
  *couple*months            10 	6 	 0.62
  *do*a                     10 	10 	 0.50
  *down*to                  24 	34 	 0.41
  *else*i                   10 	12 	 0.45
  *even*a                   10 	11 	 0.48
  *even*though              12 	17 	 0.41
  *exchange*for             14 	8 	 0.64
  *fact*that                10 	13 	 0.43
  *feed*me                  16 	15 	 0.52
  *financial*aid            17 	15 	 0.53
  *food*for                 38 	45 	 0.46
  *for*anything             10 	14 	 0.42
  *for*everything           12 	6 	 0.67
  *for*help                 32 	23 	 0.58
  *for*two                  14 	18 	 0.44
  *forward*i                11 	9 	 0.55
  *found*a                  11 	12 	 0.48
  *friday*i                 20 	20 	 0.50
  *get*by                   10 	13 	 0.43
  *get*it                   14 	18 	 0.44
  *going*through            17 	25 	 0.40
  *had*been                 11 	10 	 0.52
  *had*the                  10 	14 	 0.42
  *hate*asking              11 	10 	 0.52
  *have*about               10 	15 	 0.40
  *have*an                  18 	24 	 0.43
  *have*left                12 	15 	 0.44
  *have*never               10 	12 	 0.45
  *having*to                20 	28 	 0.42
  *hello*i                  16 	16 	 0.50
  *help*from                10 	10 	 0.50
  *i*made                   14 	15 	 0.48
  *i*posted                 12 	12 	 0.50
  *i*went                   17 	22 	 0.44
  *im*not                   11 	12 	 0.48
  *in*exchange              20 	12 	 0.62
  *in*town                  11 	10 	 0.52
  *is*currently             11 	10 	 0.52
  *is*i                     12 	17 	 0.41
  *is*willing               10 	15 	 0.40
  *it*in                    24 	31 	 0.44
  *it*so                    16 	18 	 0.47
  *it*up                    20 	30 	 0.40
  *job*that                 12 	16 	 0.43
  *just*recently            10 	15 	 0.40
  *large*pizza              10 	4 	 0.71
  *last*me                  28 	17 	 0.62
  *last*of                  27 	33 	 0.45
  *last*two                 15 	11 	 0.58
  *life*and                 10 	12 	 0.45
  *like*this                20 	30 	 0.40
  *living*on                17 	24 	 0.41
  *made*a                   10 	10 	 0.50
  *me*from                  10 	12 	 0.45
  *months*now               10 	9 	 0.53
  *much*money               10 	12 	 0.45
  *my*checking              10 	9 	 0.53
  *my*daughter              16 	17 	 0.48
  *my*next                  18 	23 	 0.44
  *my*partner               11 	5 	 0.69
  *my*paycheck              26 	28 	 0.48
  *my*rent                  21 	28 	 0.43
  *my*son                   24 	22 	 0.52
  *n't*start                12 	9 	 0.57
  *next*couple              12 	11 	 0.52
  *next*friday              14 	13 	 0.52
  *not*only                 12 	12 	 0.50
  *now*we                   14 	21 	 0.40
  *of*days                  15 	20 	 0.43
  *of*rice                  11 	14 	 0.44
  *of*this                  37 	47 	 0.44
  *of*what                  11 	16 	 0.41
  *off*and                  10 	14 	 0.42
  *off*i                    11 	10 	 0.52
  *on*i                     10 	5 	 0.67
  *on*our                   18 	25 	 0.42
  *on*thursday              13 	11 	 0.54
  *or*a                     14 	12 	 0.54
  *our*bills                10 	6 	 0.62
  *our*food                 12 	12 	 0.50
  *out*in                   11 	12 	 0.48
  *pay*my                   13 	15 	 0.46
  *paycheck*to              13 	15 	 0.46
  *paying*it                10 	14 	 0.42
  *pictures*of              17 	18 	 0.49
  *pizza*that               16 	16 	 0.50
  *plan*on                  12 	16 	 0.43
  *quite*a                  10 	14 	 0.42
  *raop*i                   17 	21 	 0.45
  *read*this                13 	16 	 0.45
  *really*hard              10 	8 	 0.56
  *recently*and             11 	9 	 0.55
  *rice*and                 33 	23 	 0.59
  *run*out                  19 	24 	 0.44
  *she*has                  29 	37 	 0.44
  *since*we                 10 	15 	 0.40
  *someone*who              11 	12 	 0.48
  *something*for            10 	11 	 0.48
  *spent*the                18 	23 	 0.44
  *surprise*my              10 	11 	 0.48
  *tell*you                 12 	11 	 0.52
  *than*a                   11 	16 	 0.41
  *thanks*reddit            13 	15 	 0.46
  *the*kind                 10 	6 	 0.62
  *the*kindness             11 	8 	 0.58
  *the*person               12 	14 	 0.46
  *them*to                  13 	16 	 0.45
  *them*with                10 	11 	 0.48
  *think*it                 10 	15 	 0.40
  *this*has                 10 	15 	 0.40
  *this*time                10 	14 	 0.42
  *through*a                16 	22 	 0.42
  *thursday*and             10 	13 	 0.43
  *time*and                 23 	34 	 0.40
  *tired*of                 12 	18 	 0.40
  *to*ask                   44 	55 	 0.44
  *to*cheer                 13 	13 	 0.50
  *to*cover                 17 	12 	 0.59
  *to*keep                  34 	50 	 0.40
  *to*last                  27 	26 	 0.51
  *to*paycheck              10 	12 	 0.45
  *to*stay                  22 	26 	 0.46
  *to*surprise              17 	23 	 0.42
  *to*wait                  15 	17 	 0.47
  *too*much                 14 	19 	 0.42
  *unemployed*for           10 	10 	 0.50
  *until*my                 14 	19 	 0.42
  *was*my                   11 	12 	 0.48
  *was*supposed             11 	13 	 0.46
  *we*got                   12 	10 	 0.55
  *well*i                   23 	34 	 0.40
  *wife*and                 28 	35 	 0.44
  *will*gladly              20 	25 	 0.44
  *work*i                   26 	37 	 0.41
  *would*anyone             12 	11 	 0.52
N = 2, minimum win count = 10, minimum win percent = 0.50
  *'ll*get                  11 	11 	 0.50
  *'s*no                    16 	11 	 0.59
  *and*got                  13 	13 	 0.50
  *ask*for                  37 	33 	 0.53
  *been*pretty              11 	11 	 0.50
  *but*at                   10 	6 	 0.62
  *by*my                    11 	11 	 0.50
  *cheesy*goodness          10 	9 	 0.53
  *could*make               13 	10 	 0.57
  *could*send               12 	9 	 0.57
  *couple*months            10 	6 	 0.62
  *do*a                     10 	10 	 0.50
  *exchange*for             14 	8 	 0.64
  *feed*me                  16 	15 	 0.52
  *financial*aid            17 	15 	 0.53
  *for*everything           12 	6 	 0.67
  *for*help                 32 	23 	 0.58
  *forward*i                11 	9 	 0.55
  *friday*i                 20 	20 	 0.50
  *had*been                 11 	10 	 0.52
  *hate*asking              11 	10 	 0.52
  *hello*i                  16 	16 	 0.50
  *help*from                10 	10 	 0.50
  *i*posted                 12 	12 	 0.50
  *in*exchange              20 	12 	 0.62
  *in*town                  11 	10 	 0.52
  *is*currently             11 	10 	 0.52
  *large*pizza              10 	4 	 0.71
  *last*me                  28 	17 	 0.62
  *last*two                 15 	11 	 0.58
  *made*a                   10 	10 	 0.50
  *months*now               10 	9 	 0.53
  *my*checking              10 	9 	 0.53
  *my*partner               11 	5 	 0.69
  *my*son                   24 	22 	 0.52
  *n't*start                12 	9 	 0.57
  *next*couple              12 	11 	 0.52
  *next*friday              14 	13 	 0.52
  *not*only                 12 	12 	 0.50
  *off*i                    11 	10 	 0.52
  *on*i                     10 	5 	 0.67
  *on*thursday              13 	11 	 0.54
  *or*a                     14 	12 	 0.54
  *our*bills                10 	6 	 0.62
  *our*food                 12 	12 	 0.50
  *pizza*that               16 	16 	 0.50
  *really*hard              10 	8 	 0.56
  *recently*and             11 	9 	 0.55
  *rice*and                 33 	23 	 0.59
  *tell*you                 12 	11 	 0.52
  *the*kind                 10 	6 	 0.62
  *the*kindness             11 	8 	 0.58
  *to*cheer                 13 	13 	 0.50
  *to*cover                 17 	12 	 0.59
  *to*last                  27 	26 	 0.51
  *unemployed*for           10 	10 	 0.50
  *we*got                   12 	10 	 0.55
  *would*anyone             12 	11 	 0.52
N = 3, minimum win count = 10, minimum win percent = 0.40
  *'d*like*to               28 	29 	 0.49
  *'ve*been*living          14 	21 	 0.40
  *a*few*days               39 	48 	 0.45
  *a*few*weeks              22 	32 	 0.41
  *able*to*help             11 	15 	 0.42
  *and*a*pizza              16 	22 	 0.42
  *and*i*'ve                28 	41 	 0.41
  *and*now*i                17 	24 	 0.41
  *ask*for*help             11 	5 	 0.69
  *bank*account*is          10 	12 	 0.45
  *been*living*on           12 	11 	 0.52
  *but*i*am                 19 	28 	 0.40
  *car*broke*down           13 	14 	 0.48
  *couple*of*days           14 	20 	 0.41
  *day*and*i                10 	11 	 0.48
  *feed*me*for              10 	2 	 0.83
  *feel*like*i              10 	10 	 0.50
  *for*a*couple             14 	19 	 0.42
  *get*back*on              17 	18 	 0.49
  *i*'d*like                25 	25 	 0.50
  *i*'m*currently           20 	25 	 0.44
  *i*'m*doing               10 	12 	 0.45
  *i*can*provide            13 	19 	 0.41
  *i*get*back               10 	12 	 0.45
  *i*had*a                  19 	24 	 0.44
  *i*had*to                 32 	46 	 0.41
  *i*only*have              13 	18 	 0.42
  *if*anyone*could          28 	42 	 0.40
  *if*you*can               23 	29 	 0.44
  *in*a*few                 16 	19 	 0.46
  *in*exchange*for          13 	6 	 0.68
  *in*my*account            14 	13 	 0.52
  *is*in*the                12 	14 	 0.46
  *is*willing*to            10 	14 	 0.42
  *it*'s*just               16 	19 	 0.46
  *it*forward*i             11 	5 	 0.69
  *it*would*really          10 	13 	 0.43
  *last*of*my               17 	19 	 0.47
  *my*wife*and              21 	26 	 0.45
  *of*rice*and              11 	7 	 0.61
  *paycheck*to*paycheck     10 	12 	 0.45
  *says*it*all              10 	15 	 0.40
  *so*i*'ve                 14 	18 	 0.44
  *so*i*am                  11 	15 	 0.42
  *so*i*do                  15 	21 	 0.42
  *so*much*for              10 	13 	 0.43
  *spent*the*last           11 	13 	 0.46
  *thank*you*so             17 	24 	 0.41
  *that*i*'ve               11 	14 	 0.44
  *the*last*of              23 	32 	 0.42
  *the*last*two             10 	7 	 0.59
  *the*money*to             10 	14 	 0.42
  *there*'s*no              12 	11 	 0.52
  *time*to*read             13 	15 	 0.46
  *to*ask*for               30 	17 	 0.64
  *to*be*able               26 	38 	 0.41
  *to*get*it                11 	7 	 0.61
  *to*last*me               14 	11 	 0.56
  *up*with*a                13 	16 	 0.45
  *was*supposed*to          11 	13 	 0.46
  *we*'re*in                12 	18 	 0.40
  *wife*and*i               21 	19 	 0.53
  *will*be*able             10 	12 	 0.45
  *willing*to*help          31 	40 	 0.44
  *you*can*help             16 	13 	 0.55
  *you*for*your             11 	10 	 0.52
  *you*so*much              16 	24 	 0.40
N = 3, minimum win count = 10, minimum win percent = 0.50
  *ask*for*help             11 	5 	 0.69
  *been*living*on           12 	11 	 0.52
  *feed*me*for              10 	2 	 0.83
  *feel*like*i              10 	10 	 0.50
  *i*'d*like                25 	25 	 0.50
  *in*exchange*for          13 	6 	 0.68
  *in*my*account            14 	13 	 0.52
  *it*forward*i             11 	5 	 0.69
  *of*rice*and              11 	7 	 0.61
  *the*last*two             10 	7 	 0.59
  *there*'s*no              12 	11 	 0.52
  *to*ask*for               30 	17 	 0.64
  *to*get*it                11 	7 	 0.61
  *to*last*me               14 	11 	 0.56
  *wife*and*i               21 	19 	 0.53
  *you*can*help             16 	13 	 0.55
  *you*for*your             11 	10 	 0.52

STEM/STOP TEXT
N = 1, minimum win count = 10, minimum win percent = 0.50
  *7                        40 	37 	 0.52
  *bonus                    11 	2 	 0.85
  *buck                     27 	25 	 0.52
  *constant                 12 	9 	 0.57
  *cup                      12 	11 	 0.52
  *exchang                  23 	20 	 0.53
  *gt                       14 	14 	 0.50
  *heat                     14 	14 	 0.50
  *kidney                   10 	4 	 0.71
  *landlord                 15 	12 	 0.56
  *marri                    10 	9 	 0.53
  *overdraft                11 	8 	 0.58
  *partner                  11 	9 	 0.55
  *stretch                  16 	16 	 0.50
  *updat                    12 	10 	 0.55
N = 2, minimum win count = 10, minimum win percent = 0.40
  *3*week                   11 	14 	 0.44
  *abl*help                 11 	16 	 0.41
  *ask*help                 23 	25 	 0.48
  *back*school              10 	10 	 0.50
  *car*broke                13 	16 	 0.45
  *cheesi*good              10 	10 	 0.50
  *coupl*month              13 	14 	 0.48
  *day*will                 11 	16 	 0.41
  *day*work                 12 	14 	 0.46
  *eat*ramen                12 	18 	 0.40
  *els*need                 11 	13 	 0.46
  *even*though              12 	17 	 0.41
  *financi*aid              17 	16 	 0.52
  *food*fridg               10 	13 	 0.43
  *get*home                 10 	14 	 0.42
  *hate*ask                 14 	13 	 0.52
  *help*someon              14 	10 	 0.58
  *home*work                10 	11 	 0.48
  *http*imgur               85 	118 	 0.42
  *imgur*com                87 	121 	 0.42
  *just*get                 10 	5 	 0.67
  *just*recent              11 	15 	 0.42
  *larg*pizza               11 	6 	 0.65
  *last*two                 18 	13 	 0.58
  *month*now                11 	11 	 0.50
  *much*money               12 	13 	 0.48
  *need*thank               12 	5 	 0.71
  *next*coupl               12 	11 	 0.52
  *next*friday              14 	13 	 0.52
  *paycheck*paycheck        11 	13 	 0.46
  *pizza*last               12 	9 	 0.57
  *realli*hard              10 	8 	 0.56
  *sound*like               12 	11 	 0.52
  *spent*last               17 	20 	 0.46
  *thank*reddit             17 	24 	 0.41
  *thank*time               10 	15 	 0.40
  *time*read                14 	18 	 0.44
  *tonight*will             13 	16 	 0.45
  *us*will                  10 	8 	 0.56
  *week*get                 14 	18 	 0.44
  *week*now                 10 	15 	 0.40
  *will*glad                25 	36 	 0.41
  *will*post                10 	14 	 0.42
N = 2, minimum win count = 10, minimum win percent = 0.50
  *back*school              10 	10 	 0.50
  *cheesi*good              10 	10 	 0.50
  *financi*aid              17 	16 	 0.52
  *hate*ask                 14 	13 	 0.52
  *help*someon              14 	10 	 0.58
  *just*get                 10 	5 	 0.67
  *larg*pizza               11 	6 	 0.65
  *last*two                 18 	13 	 0.58
  *month*now                11 	11 	 0.50
  *need*thank               12 	5 	 0.71
  *next*coupl               12 	11 	 0.52
  *next*friday              14 	13 	 0.52
  *pizza*last               12 	9 	 0.57
  *realli*hard              10 	8 	 0.56
  *sound*like               12 	11 	 0.52
  *us*will                  10 	8 	 0.56
N = 3, minimum win count = 10, minimum win percent = 0.40
  *anyon*will*help          17 	20 	 0.46
  *get*back*feet            15 	20 	 0.43
  *http*imgur*com           85 	118 	 0.42
  *take*time*read           11 	14 	 0.44
  *thank*take*time          11 	16 	 0.41
N = 3, minimum win count = 10, minimum win percent = 0.50
  (none found)

Discussion of n-grams

In the tables above, each n-gram is followed by its count in successful requests, its count in unsuccessful requests, and the resulting win percent (successes divided by total occurrences); a sketch of how these counts can be computed appears after the list below. The probability of success can be improved by, for example:

  1. including an image ("http imgur com")
  2. discussing hardship ("disability", "landlord", "heat", "been unemployed")
  3. discussing money, specifically your lack thereof ("financial aid", "my paycheck", "next friday", "rent", "eat ramen", "paycheck to paycheck")
  4. being polite ("hate asking", "thank take time", "take time read")
  5. and offering to pay it forward ("will gladly", "it forward")
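
The tables above can be reproduced, at least approximately, by tallying each n-gram separately for successful and unsuccessful requests and then filtering by the minimum win count and win percent. The sketch below is a reconstruction under stated assumptions rather than the code that produced the output above: it tokenizes with a simple lowercase whitespace split (the tables were clearly built with a smarter tokenizer, and the stem/stop tables with a stemmer as well), and it assumes the text_list and outcome_list lists built from train.json earlier in this paper.

from collections import defaultdict

def ngram_win_rates(texts, outcomes, n, min_win_count, min_win_percent):
    # tally each distinct n-gram once per request, separately for wins and losses
    wins = defaultdict(int)
    losses = defaultdict(int)
    for text, outcome in zip(texts, outcomes):
        tokens = text.lower().split()                        # naive whitespace tokenization (assumption)
        grams = set(zip(*[tokens[i:] for i in range(n)]))    # distinct n-grams in this request
        for gram in grams:
            if outcome:
                wins[gram] += 1
            else:
                losses[gram] += 1
    # keep n-grams with enough wins and a high enough win percent
    kept = []
    for gram, win_count in sorted(wins.items()):
        loss_count = losses[gram]
        win_percent = float(win_count) / (win_count + loss_count)
        if win_count >= min_win_count and win_percent >= min_win_percent:
            kept.append(("*" + "*".join(gram), win_count, loss_count, win_percent))
    return kept

# for example, the N = 3, minimum win count = 10, minimum win percent = 0.50 table:
for gram, win_count, loss_count, win_percent in ngram_win_rates(text_list, outcome_list, 3, 10, 0.50):
    print "  %-25s %d \t%d \t %.2f" % (gram, win_count, loss_count, win_percent)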

From the above analysis we can draw the following conclusions:

  • Successful requests tend to have more words than do unsuccessful requests. (here)
  • Successful requests tend to have more distinct words than do unsuccessful requests. (here)
  • There are some single words which might help or hurt your chances. (here)
  • KNN classification with tf-idf weighting and cosine similarity provided no lift in accuracy (a brief sketch of this setup follows the list). (here)
  • Sentiment analysis was inconclusive: TextBlob proved inadequate at identifying sentiment, as I demonstrated with some very simple examples. (here)
  • The use of 3-grams was perhaps most revealing. The probability of success can be improved by including an image, discussing your hardship, discussing money (specifically, your lack thereof), being polite, and offering to "pay it forward". (here)
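
For completeness, here is a minimal sketch of the kind of KNN experiment summarized above, using scikit-learn's TfidfVectorizer and a KNeighborsClassifier with a cosine metric. It is not the code used earlier in this paper: the parameter choices (n_neighbors=5, an 80/20 split, English stop words) are illustrative assumptions, and it again relies on the text_list and outcome_list lists built from train.json.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import train_test_split   # sklearn.model_selection in newer releases

# tf-idf representation of the request text (parameters are illustrative)
vectorizer = TfidfVectorizer(stop_words='english', min_df=2)
X = vectorizer.fit_transform(text_list)
y = np.array(outcome_list)

# hold out 20% of the requests for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# KNN with cosine distance; brute-force search is required for the cosine metric
knn = KNeighborsClassifier(n_neighbors=5, metric='cosine', algorithm='brute')
knn.fit(X_train, y_train)

# compare against the baseline of always predicting "no pizza" (the majority class)
print "KNN accuracy:      %.3f" % knn.score(X_test, y_test)
print "Baseline accuracy: %.3f" % (1.0 - y_test.mean())

Cosine distance is the usual choice for tf-idf vectors because it discounts differences in document length.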

This dataset has been the subject of study by many students. In one such study, Althoff, Salehi and Nguyen of Stanford University reached conclusions similar to those shown here.

"...our findings suggests that there are several factors that the user can control that are significantly correlated with success. First, the request should be fairly long allowing the user to introduce herself and her situation. It also helps to put in additional effort to upload a photo. This is often used to increase the level of trust, e.g. by attempting to verify certain claims through the photo such as identity, location, financial situation, or simply an empty fridge. Our findings also suggest that pizza givers value the requesters ambition to give back to the community by forwarding a pizza later (even though some never do). With respect to the request content we found that talking about your friends and partners as well as your leisure activities can have a negative impact on your success rate. Instead it seems advisable to talk more about money, most likely a bad financial situation, and work. It also seems to help to express gratitude and appreciation in your request." (http://web.stanford.edu/class/cs224w/projects2013/cs224w-025-final.pdf, retrieved November 20, 2014)

While I was disappointed that I was unable to find some magic predictive model, I can take solace in the fact that the findings from my solo effort were consistent with those of three guys from Stanford!

Bill Qualls, November 21, 2014