This paper presents an analysis of Kaggle's “Random Acts of Pizza” dataset:
This competition contains a dataset with 5671 textual requests for pizza from the Reddit community Random Acts of Pizza together with their outcome (successful/unsuccessful) and meta-data. Participants must create an algorithm capable of predicting which requests will garner a cheesy (but sincere!) act of kindness. (https://www.kaggle.com/c/random-acts-of-pizza)
While the dataset contains many attributes (for example, time of day when the request was made), this paper will focus on the text of the request and its outcome (did the requester receive a pizza). The paper will proceed as follows: import the data, prepare the data, perform some exploratory data analysis, propose and conduct various data mining methods, and present conclusions.
Ultimately, an accurate prediction model could not be found; that is, no attempt will be made to classify a single request as successful or unsuccessful. However, some heuristics will be given which should improve a requester's chance of receiving a random act of pizza.
The data provided by Kaggle is in JSON format. Here is a snapshot of the input data -- I will be using fields request_text_edit_aware, request_title, and requester_received_pizza:
cd "D:\BQ\School\DePaul\CSC478 Programming Data Mining Applications\csc478_qualls_project"
# got this trick from http://stackoverflow.com/questions/11854847/display-an-image-from-a-file-in-an-ipython-notebook
from IPython.core.display import Image
Image(filename='json.jpg')
import json
import numpy as np
json_data = json.loads(open("train.json").read())
print type(json_data) # what did we get back? (answer: list)
print len(json_data) # how many elements in the list? (answer: 4040)
text_list = []
outcome_list = []
for i in range(len(json_data)):
    text_list.append(json_data[i][u'request_text_edit_aware'])
    outcome_list.append(json_data[i][u'requester_received_pizza'])
# save a copy for use later in this document
original = list(text_list) # list function makes a copy of a list
for i in range(5):
    print i
    print "TEXT: ", text_list[i]
    print "OUTCOME: ", outcome_list[i]
    print
document_count = len(text_list)
print
print "There are %d documents." % document_count
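The extraction above can be sketched on a toy record. This is a minimal, self-contained illustration (shown in Python 3 syntax); the field names come from the Kaggle file, but the request text and outcomes here are invented:

```python
import json

# A toy record shaped like the RAOP training data (field names from the
# Kaggle file; the values are invented for illustration).
sample = '''[
  {"request_text_edit_aware": "Hungry student, any help appreciated.",
   "request_title": "[Request] Pizza in Chicago",
   "requester_received_pizza": true},
  {"request_text_edit_aware": "Just lost my job, would love a pizza.",
   "request_title": "[Request] Tough week",
   "requester_received_pizza": false}
]'''

records = json.loads(sample)  # list of dicts, one per request
text_list = [r["request_text_edit_aware"] for r in records]
outcome_list = [r["requester_received_pizza"] for r in records]
print(len(text_list), outcome_list)
```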
contractions_dict = {"ain't" : "am not" \
, "aren't" : "are not" \
, "can't" : "cannot" \
, "couldn't" : "could not" \
, "didn't" : "did not" \
, "doesn't" : "does not" \
, "don't" : "do not" \
, "hadn't" : "had not" \
, "hasn't" : "has not" \
, "haven't" : "have not" \
, "he'd" : "he would" \
, "he'll" : "he will" \
, "he's" : "he is" \
, "how'd" : "how did" \
, "how's" : "how is" \
, "i'd" : "i would" \
, "i'll" : "i will" \
, "i'm" : "i am" \
, "i've" : "i have" \
, "isn't" : "is not" \
, "it'd" : "it would" \
, "it's" : "it is" \
, "let's" : "let us" \
, "o'clock" : "of the clock" \
, "she'd" : "she would" \
, "she'll" : "she will" \
, "she's" : "she is" \
, "shouldn't" : "should not" \
, "that'd" : "that would" \
, "that's" : "that is" \
, "they'd" : "they would" \
, "they'll" : "they will" \
, "they're" : "they are" \
, "wasn't" : "was not" \
, "we'd" : "we would" \
, "we'll" : "we will" \
, "we're" : "we are" \
, "weren't" : "were not" \
, "we've" : "we have" \
, "what'll" : "what will" \
, "what's" : "what is" \
, "what've" : "what have" \
, "who'll" : "who will" \
, "who's" : "who is" \
, "why's" : "why is" \
, "won't" : "will not" \
, "wouldn't" : "would not" \
, "would've" : "would have" \
, "y'all" : "you all" \
, "you'd" : "you had" \
, "you'll" : "you will" \
, "you're" : "you are" \
, "you've" : "you have" \
}
# see http://stackoverflow.com/questions/12437667/how-to-replace-punctuation-in-a-string-python
# I want to replace punctuation with a blank
import string
replace_punctuation = string.maketrans(string.punctuation, ' '*len(string.punctuation))
for d in range(document_count):
    row = text_list[d] # for convenience only
    # for encode, see http://stackoverflow.com/questions/3224268/python-unicode-encode-error
    row = row.encode('ascii', 'ignore')
    row = row.lower() # convert to lower case
    for k, v in contractions_dict.iteritems():
        row = row.replace(k, v)
    row = row.translate(replace_punctuation)
    text_list[d] = row
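The same cleaning steps can be sketched in Python 3 syntax (where maketrans lives on str rather than the string module). The two-entry contraction dictionary here is only an illustrative subset of the full one above:

```python
import string

# Illustrative subset of the full contractions dictionary.
contractions = {"can't": "cannot", "i'm": "i am"}
# Map every punctuation character to a blank.
strip_punct = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def clean(text):
    text = text.lower()                    # lower case first, keys are lower case
    for k, v in contractions.items():      # expand contractions
        text = text.replace(k, v)
    return text.translate(strip_punct)     # then blank out punctuation

result = clean("I'm hungry -- can't afford pizza!")
print(result)
```

Note the order matters: contractions must be expanded before punctuation is stripped, or the apostrophes are gone before the dictionary keys can match.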
for i in range(5):
    print i
    print "TEXT: ", text_list[i]
    print
# stemming package obtained from https://pypi.python.org/pypi/stemming/1.0
from stemming.porter2 import stem
# test
print [stem(x) for x in ["compute", "computes", "computer", "computers", "computation", "computational"]]
print [stem(x) for x in ["dive", "divers", "diversity", "diversify", "diversified", "diversification"]]
# Stopwords can be found at https://pypi.python.org/pypi/stop-words/2014.5.26
# But it is just a collection of stopwords as .txt files.
# There is no python code included.
stopwords = np.genfromtxt("stop-words-2014.5.26/stop_words/stop-words/english.txt", dtype="|S10")
print stopwords
for d in range(document_count):
    row = text_list[d] # for convenience only
    words = row.split() # must split because stem function works on one word only
    row = [stem(x) for x in words if x not in stopwords]
    row = ' '.join(row) # join will undo the split
    text_list[d] = row
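The split-filter-stem-join pattern can be sketched on its own. The real code uses stemming.porter2.stem; the crude suffix-stripper below is only a stand-in so the sketch is self-contained, and the stop-word set is an invented subset:

```python
# Illustrative subset of a stop-word list.
stopwords = {"i", "a", "the", "of", "and", "to"}

def crude_stem(word):
    # Stand-in for porter2.stem: strip a few common suffixes only.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def filter_and_stem(text):
    # split -> drop stopwords -> stem each word -> join back into one string
    return ' '.join(crude_stem(w) for w in text.split() if w not in stopwords)

result = filter_and_stem("the dogs and cats playing in a park")
print(result)  # → dog cat play in park
```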
for i in range(5):
print i
print "TEXT: ", text_list[i]
print
# sets do not have duplicates
term_set = set()
for d in range(document_count):
    words = text_list[d].split()
    deduped = set(words) # removes duplicates
    term_set = term_set.union(deduped) # union() joins sets
sorted_term_list = sorted(term_set) # sorted() always returns a list
print sorted_term_list[:10] # sample from the top
print sorted_term_list[-10:] # sample from the bottom
term_count = len(sorted_term_list)
print "There are %d documents and %d distinct terms." % (document_count, term_count)
term_dict = {}
for i in range(term_count):
    term_dict[sorted_term_list[i]] = i
# test dictionary
print term_dict["pizza"]
print sorted_term_list[term_dict["pizza"]] # inverse -- should show pizza
# build the term document
# each row is a (stemmed) term
# each column is a document (RAOP request)
# each cell is the number of times that (stemmed) term appears in that document
TD = np.zeros((term_count, document_count))
for d in range(document_count):
    row = text_list[d] # convenience
    terms = row.split()
    for i in range(len(terms)):
        t = term_dict[terms[i]]
        TD[t][d] += 1
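The same construction on a three-document toy corpus shows the matrix layout (invented documents, Python 3 syntax):

```python
import numpy as np

# Toy corpus: rows = terms, columns = documents, cells = raw counts.
docs = ["pizza pizza please", "hungry student pizza", "hungry hungry"]
terms = sorted({w for d in docs for w in d.split()})
term_index = {t: i for i, t in enumerate(terms)}

TD = np.zeros((len(terms), len(docs)), dtype=int)
for d, doc in enumerate(docs):
    for w in doc.split():
        TD[term_index[w], d] += 1

print(terms)
print(TD)
```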
# checking...
def inspect(d): # d = document number
    print "Document %d: " % d
    print text_list[d]
    total = 0
    count = 0
    for t in range(term_count):
        total += TD[t][d]
        if (TD[t][d] > 0):
            count += 1
    print "Document %d had %d distinct terms and %d total terms." % (d, count, total)
    print
inspect(3)
inspect(4)
# In how many different documents does each term occur?
# (This is actually called the document frequency, df, in Mobasher's slide #10.)
DF = np.zeros((term_count))
for t in range(term_count):
    DF[t] = np.sum(1 for x in TD[t] if x > 0)
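The document-frequency loop can also be written as one vectorized step: compare the whole matrix to zero and count nonzero cells per row. A small sketch with an invented matrix:

```python
import numpy as np

# Document frequency from a term-document matrix: for each term (row),
# count how many documents (columns) have a nonzero entry.
TD = np.array([[0, 1, 2],
               [2, 1, 0],
               [1, 0, 0]])
DF = (TD > 0).sum(axis=1)
print(DF)  # → [2 2 1]
```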
# function returns the number of documents containing term
def docs_containing(word):
return DF[term_dict[word]]
for word in ("pizza", "mom", "haiku"):
    print "%-s is found in %d documents." % (word, docs_containing(word))
np.set_printoptions(precision=3)
outcome_true = np.sum(1 for x in outcome_list if x == True)
outcome_false = np.sum(1 for x in outcome_list if x == False)
print "There were %d successful requests and %d unsuccessful requests." % \
    (outcome_true, outcome_false)
# This line configures matplotlib to show figures embedded in the notebook,
# instead of opening a new window for each figure. More about that later.
# If you are using an old version of IPython, try using '%pylab inline' instead.
%matplotlib inline
import matplotlib.pyplot as plt
# make a square figure and axes
plt.figure(1, figsize=(6,6))
ax = plt.axes([0.1, 0.1, 0.8, 0.8])
# The slices will be ordered and plotted counter-clockwise.
labels = 'Successful', 'Unsuccessful'
fracs = [outcome_true, outcome_false]
plt.pie(fracs, labels=labels, autopct='%1.1f%%', shadow=False, startangle=0, colors=('y','c'))
# The default startangle is 0, which starts the first
# (Successful) slice on the positive x-axis.
plt.title('RAOP Outcome', bbox={'facecolor':'0.8', 'pad':5})
plt.show()
# function to produce crosstab
import collections
def crosstab(vecA, vecB):
    keysA = collections.Counter(vecA).keys()
    keysB = collections.Counter(vecB).keys()
    keysA.sort()
    keysB.sort()
    # use a dictionary to convert values to indices
    dictA = {}
    for i in range(len(keysA)):
        dictA[keysA[i]] = i
    dictB = {}
    for i in range(len(keysB)):
        dictB[keysB[i]] = i
    # array which will hold crosstab frequencies
    freqs = np.zeros((len(keysA), len(keysB)))
    # count 'em
    pairs = len(vecA)
    for i in range(pairs):
        freqs[dictA[vecA[i]]][dictB[vecB[i]]] += 1
    # reminder: print statement ending in comma suppresses newline
    # print column headings
    for clm in range(len(keysB)):
        print "\t%s" % keysB[clm],
    print
    for row in range(len(keysA)):
        print keysA[row],
        for clm in range(len(keysB)):
            print "\t%d" % freqs[row][clm],
        print
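The frequencies the crosstab function tabulates can also be gathered by counting (rowValue, columnValue) pairs with a Counter. A small sketch with invented vectors:

```python
from collections import Counter

# Count each (A-value, B-value) pair once per position.
vecA = ["long", "long", "short", "short", "long"]
vecB = [True, False, False, False, True]
freqs = Counter(zip(vecA, vecB))
print(freqs[("long", True)], freqs[("short", False)])  # → 2 2
```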
# document length by outcome?
termsInWinningDocuments = np.zeros(outcome_true)
w = 0
termsInLosingDocuments = np.zeros(outcome_false)
l = 0
for d in range(document_count):
    count = 0
    for t in range(term_count):
        if (TD[t][d] > 0):
            # count += 1
            count += TD[t][d]
    if (outcome_list[d] == True):
        termsInWinningDocuments[w] = count
        w += 1
    else:
        termsInLosingDocuments[l] = count
        l += 1
print "DOCUMENT LENGTH IN LOSING DOCUMENTS:"
print " N = %d" % len(termsInLosingDocuments)
print " Min = %d" % termsInLosingDocuments.min()
print " Max = %d" % termsInLosingDocuments.max()
print " Mean = %.1f" % termsInLosingDocuments.mean()
print " Median = %.1f" % np.median(termsInLosingDocuments)
print
print "DOCUMENT LENGTH IN WINNING DOCUMENTS:"
print " N = %d" % len(termsInWinningDocuments)
print " Min = %d" % termsInWinningDocuments.min()
print " Max = %d" % termsInWinningDocuments.max()
print " Mean = %.1f" % termsInWinningDocuments.mean()
print " Median = %.1f" % np.median(termsInWinningDocuments)
# use normed=True for relative histogram (vs. absolute)
plt.hist(termsInLosingDocuments, normed=True, bins=25, color='r', alpha=0.5, label="Losing")
plt.hist(termsInWinningDocuments, normed=True, bins=25, color='b', alpha=0.5, label="Winning")
plt.xlabel("terms")
plt.ylabel("probability")
plt.title("Document Length by Outcome")
plt.legend()
plt.show()
# distinct terms per document by outcome?
distinctTermsInWinningDocuments = np.zeros(outcome_true)
w = 0
distinctTermsInLosingDocuments = np.zeros(outcome_false)
l = 0
for d in range(document_count):
    count = 0
    for t in range(term_count):
        if (TD[t][d] > 0):
            count += 1
    if (outcome_list[d] == True):
        distinctTermsInWinningDocuments[w] = count
        w += 1
    else:
        distinctTermsInLosingDocuments[l] = count
        l += 1
print "DISTINCT TERMS PER DOCUMENT IN LOSING DOCUMENTS:"
print " N = %d" % len(distinctTermsInLosingDocuments)
print " Min = %d" % distinctTermsInLosingDocuments.min()
print " Max = %d" % distinctTermsInLosingDocuments.max()
print " Mean = %.1f" % distinctTermsInLosingDocuments.mean()
print " Median = %.1f" % np.median(distinctTermsInLosingDocuments)
print
print "DISTINCT TERMS PER DOCUMENT IN WINNING DOCUMENTS:"
print " N = %d" % len(distinctTermsInWinningDocuments)
print " Min = %d" % distinctTermsInWinningDocuments.min()
print " Max = %d" % distinctTermsInWinningDocuments.max()
print " Mean = %.1f" % distinctTermsInWinningDocuments.mean()
print " Median = %.1f" % np.median(distinctTermsInWinningDocuments)
# use normed=True for relative histogram (vs. absolute)
plt.hist(distinctTermsInLosingDocuments, normed=True, bins=25, color='r', alpha=0.5, label="Losing")
plt.hist(distinctTermsInWinningDocuments, normed=True, bins=25, color='b', alpha=0.5, label="Winning")
plt.xlabel("distinct terms")
plt.ylabel("probability")
plt.title("Distinct Terms in Document by Outcome")
plt.legend()
plt.show()
winning = np.zeros(term_count)
losing = np.zeros(term_count)
for t in range(term_count):
    for d in range(document_count):
        if (TD[t][d] > 0):
            if (outcome_list[d] == True):
                winning[t] += 1
            else:
                losing[t] += 1
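The per-term win/lose counts above can also be computed without the nested loops: a boolean mask over documents selects the winning (or losing) columns, then nonzero cells are counted per row. A sketch with an invented matrix:

```python
import numpy as np

# rows = terms, columns = documents; outcome marks winning documents
TD = np.array([[1, 0, 2, 0],
               [0, 3, 0, 1]])
outcome = np.array([True, False, True, False])

winning = (TD[:, outcome] > 0).sum(axis=1)   # docs containing term, won
losing = (TD[:, ~outcome] > 0).sum(axis=1)   # docs containing term, lost
print(winning, losing)  # → [2 0] [0 2]
```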
print "Helpful terms (at least 10 wins, and percent of wins >= 40%)"
print "-----------------------------------------------------------"
print "term win lose pct"
for t in range(term_count):
    term = sorted_term_list[t]
    win = winning[t]
    lose = losing[t]
    pct = 1.0 * win / (win + lose)
    if (win >= 10 and pct >= .4):
        print "%-15s\t%5d\t%5d\t%4.2f" % (term, win, lose, pct)
print
print "Harmful terms (at least 1 win, but percent of wins <= 15%)"
print "-----------------------------------------------------"
print "term win lose pct"
for t in range(term_count):
    term = sorted_term_list[t]
    win = winning[t]
    lose = losing[t]
    pct = 1.0 * win / (win + lose)
    if (win >= 1 and pct <= .15):
        print "%-15s\t%5d\t%5d\t%4.2f" % (term, win, lose, pct)
The following terms might help your chances: accident, aid, chicago, father, generosity, landlord, leftovers, rice, tough. These terms might hurt your chances: america, angel, atlanta, bail, canada, facebook, hang, homeless, homework, kansas, pawn, street, washington, wisconsin.
K Nearest Neighbors is often used when analyzing text data, in particular when attempting to classify data into predefined categories, such as successful and unsuccessful in the current case. In this section we will attempt that method.
import math
IDF = [math.log(1.0 * document_count / x, 2) for x in DF]
np.set_printoptions(precision=3)
print IDF[:10]
TFIDF = np.copy(TD)
for t in range(term_count):
    for d in range(document_count):
        TFIDF[t][d] = TD[t][d] * IDF[t]
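The nested TF-IDF loops reduce to one broadcasted multiply: scale each term row of TD by that term's IDF weight. A sketch with an invented two-term, two-document matrix:

```python
import numpy as np

TD = np.array([[2.0, 0.0],
               [1.0, 1.0]])           # rows = terms, columns = documents
document_count = TD.shape[1]
DF = (TD > 0).sum(axis=1)             # documents containing each term
IDF = np.log2(document_count / DF)    # log base 2, as in the text
TFIDF = TD * IDF[:, np.newaxis]       # broadcast IDF down each row
print(TFIDF)
```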
# cosine similarity function
# taken from my homework2
def cosineSimilarity(v1, v2):
    dotProduct = np.dot(v1, v2)
    ss1 = (v1 ** 2).sum()
    ss2 = (v2 ** 2).sum()
    if (ss1 == 0 or ss2 == 0):
        answer = 0 # do not divide by zero!
    else:
        answer = dotProduct / ((ss1 ** 0.5) * (ss2 ** 0.5))
    return answer
a = np.array([2, 11, 5])
b = np.array([7, 3, 15])
# test my cosineSimilarity function
# given (2, 11, 5), (7, 3, 15)
# expect ((2*7)+(11*3)+(5*15)) / (sqrt(2**2 + 11**2 + 5**2) * sqrt(7**2 + 3**2 + 15**2)) = 0.59
cs = cosineSimilarity(a, b)
print cs
def cosineDistance(v1, v2):
    cs = cosineSimilarity(v1, v2)
    return 1 - cs
# test my cosineDistance function
# given (2, 11, 5), (7, 3, 15)
# expect 1 - 0.59 = .41
cd = cosineDistance(a, b)
print cd
DT = TFIDF.T
print DT.shape
# checking...I know document 4 has term "sob" twice
t = term_dict["sob"]
print t # sob is the 7316th term
print TD[t][4] # sob occurs twice in document 4
print DF[t] # sob occurs in 119 documents
print document_count
print IDF[t] # calc follows
print math.log(4040.0/119, 2) # should match IDF -- it does
print TFIDF[t][4]
print DT[4][t]
print math.log(4040.0/119, 2) * 2 # should match TFIDF and DT -- it does
training_percent = 0.8
training_size = int(training_percent * len(DT))
print "training_size=", training_size
training_matrix = DT[:training_size,:]
testing_matrix = DT[training_size:,:]
print "training_matrix.shape=", training_matrix.shape
print "testing_matrix.shape=", testing_matrix.shape
training_outcome_list = outcome_list[:training_size]
testing_outcome_list = outcome_list[training_size:]
print
print "type(training_outcome_list)=", type(training_outcome_list)
print "len(training_outcome_list)=", len(training_outcome_list)
t = np.sum(1 for x in training_outcome_list if x == True)
f = np.sum(1 for x in training_outcome_list if x == False)
print "%.1f%% of training dataset had successful outcome." % (100.0 * t / (t + f))
print
print "type(testing_outcome_list)=", type(testing_outcome_list)
print "len(testing_outcome_list)=", len(testing_outcome_list)
t = np.sum(1 for x in testing_outcome_list if x == True)
f = np.sum(1 for x in testing_outcome_list if x == False)
print "%.1f%% of testing dataset had successful outcome." % (100.0 * t / (t + f))
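Note the 80/20 split above takes the first rows in file order. If the file order happens to correlate with the outcome, shuffling first is safer; a minimal sketch (with a fixed seed for reproducibility, and n=10 as a stand-in for the real row count):

```python
import numpy as np

rng = np.random.RandomState(42)   # fixed seed so the split is reproducible
n = 10                            # stand-in for the real number of documents
indices = rng.permutation(n)      # shuffled row indices
cut = int(0.8 * n)
train_idx, test_idx = indices[:cut], indices[cut:]
print(len(train_idx), len(test_idx))  # → 8 2
```

The same index arrays would then be used to slice both the matrix and the outcome list, keeping rows and labels aligned.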
import operator # see MLA pages 21-22
def knnClassifier(trainingMatrix, labels, instance, K, distanceFunction):
    rows = trainingMatrix.shape[0]
    distance = np.empty(rows)
    for i in range(rows):
        distance[i] = distanceFunction(trainingMatrix[i], instance)
    labeledDistances = np.rec.fromarrays((distance, labels))
    sortedLabeledDistances = labeledDistances.argsort() # returns array of indexes indicating sort order
    classCount = {}
    for i in range(K):
        label = labeledDistances[sortedLabeledDistances[i]][1]
        classCount[label] = classCount.get(label, 0) + 1
    sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
    predictedClass = sortedClassCount[0][0]
    topKNeighbors = sortedLabeledDistances[0:K] # slice to get index of K nearest neighbors
    return predictedClass, topKNeighbors
# Test my knnClassifier function
# I will use the data found in MLA, page 21-24
group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = np.array(['A', 'A', 'B', 'B'])
print group
print labels
print
print "test using cosineDistance function:"
label, topKNeighbors = knnClassifier(group, labels, np.array([0.9, 0.1]), 3, cosineDistance)
print label, topKNeighbors
label, topKNeighbors = knnClassifier(group, labels, np.array([0.1, 0.9]), 3, cosineDistance)
print label, topKNeighbors
def classificationAccuracy(trainingMatrix, trainingClasses, testingMatrix, testingClasses, K, distanceFunction):
    testCases = testingMatrix.shape[0]
    correct = 0
    for i in range(testCases):
        testInstance = testingMatrix[i]
        predictedClass, topKNeighbors = knnClassifier(trainingMatrix, trainingClasses, testInstance, K, distanceFunction)
        actualClass = testingClasses[i]
        # print "Instance #%d: Predicted=%d, Actual=%d" % (i, predictedClass, actualClass)
        if (predictedClass == actualClass):
            correct = correct + 1
    accuracy = 1.0 * correct / testCases
    return accuracy
for K in (1, 2, 3, 10, 15, 20):
    accuracy = classificationAccuracy(training_matrix, training_outcome_list,
                                      testing_matrix, testing_outcome_list, K, cosineDistance)
    print "K = %d, accuracy = %5.3f" % (K, accuracy)
Well, that was certainly discouraging! At first glance an accuracy rate of 75% sounds pretty good, until one considers that, as we saw in the Exploratory Data Analysis section, 75% of the RAOP requests are unsuccessful. So if we just guess "unsuccessful" all the time, we will be correct 75% of the time.
Time to try another method.
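The point about the 75% baseline can be made concrete: always predicting the majority class is right in a fraction equal to the larger class proportion, so a useful classifier has to beat that number. A sketch with an illustrative 25/75 split:

```python
# Majority-class baseline: always predict the more common outcome.
outcomes = [True] * 25 + [False] * 75   # illustrative 25% success rate
p = sum(outcomes) / len(outcomes)       # proportion of successes
baseline = max(p, 1 - p)                # accuracy of always guessing majority
print(baseline)  # → 0.75
```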
I decided to try some sentiment analysis next. Are positive (entertaining?) requests or negative (desperate?) requests more likely to be successful? My approach will be to determine the sentiment for each request, and then create summary statistics on sentiment by outcome -- pizza given or not.
For sentiment analysis I will use the TextBlob Python package.
The simple examples I have included here show that the sentiment analysis algorithm in TextBlob must be quite simplistic.
Given these findings, my sentiment analysis will be done on the original text of the request, not the version which has been stemmed and has had stopwords removed ("not" is a stopword, but clearly can be relevant in sentiment analysis).
from textblob import TextBlob
# a simple test function which I wrote...
def testTextBlob(phrase):
    blob = TextBlob(phrase)
    sentiment = blob.sentiment
    polarity = sentiment.polarity # a measure of the mood
    subjectivity = sentiment.subjectivity # a measure of confidence in the polarity
    print "The phrase \"%s\" has polarity of %.2f and subjectivity of %.2f." % \
        (phrase, polarity, subjectivity)
for phrase in ("things are going great!", "things are going great.",
               "things are going great?", "things are going ok", "things are not going ok",
               "things are going poorly", "things are going real bad",
               "i have plenty of cash", "i am out of cash", "i have no cash",
               "i am happy", "i am not happy"):
    testTextBlob(phrase)
TextBlob can correct spelling, but it would appear that in the process of performing sentiment analysis it is either correcting spelling or stemming (or both), because the sentiment scores below are the same.
incorrect = "I havv good speling"
print "BEFORE CORRECTING SPELLING"
testTextBlob(incorrect)
print
print "AFTER CORRECTING SPELLING"
corrected = str(TextBlob(incorrect).correct())
print corrected
testTextBlob(corrected)
print
print "SHORTCUT NOTATION"
print TextBlob(incorrect).correct().sentiment.polarity
(How's this for a cool example of list comprehension!)
polarity = [TextBlob(x).sentiment.polarity for x in original] # creates a list
print "DOCUMENT POLARITY (USING ORIGINAL TEXT):"
print " N = %d" % len(polarity)
print " Min = %.2f" % min(polarity)
print " Max = %.2f" % max(polarity)
print " Std = %.2f" % np.std(polarity)
print " Mean = %.2f" % np.mean(polarity)
print " Median = %.2f" % np.median(polarity)
polarityWin = np.zeros(outcome_true)
w = 0
polarityLose = np.zeros(outcome_false)
l = 0
for d in range(document_count):
    if (outcome_list[d] == True):
        polarityWin[w] = polarity[d]
        w += 1
    else:
        polarityLose[l] = polarity[d]
        l += 1
print "DOCUMENT POLARITY IN LOSING DOCUMENTS (USING ORIGINAL TEXT):"
print " N = %d" % len(polarityLose)
print " Min = %.2f" % polarityLose.min()
print " Max = %.2f" % polarityLose.max()
print " Std = %.2f" % polarityLose.std()
print " Mean = %.2f" % polarityLose.mean()
print " Median = %.2f" % np.median(polarityLose)
print
print "DOCUMENT POLARITY IN WINNING DOCUMENTS (USING ORIGINAL TEXT):"
print " N = %d" % len(polarityWin)
print " Min = %.2f" % polarityWin.min()
print " Max = %.2f" % polarityWin.max()
print " Std = %.2f" % polarityWin.std()
print " Mean = %.2f" % polarityWin.mean()
print " Median = %.2f" % np.median(polarityWin)
# use normed=True for relative histogram (vs. absolute)
plt.hist(polarityLose, normed=True, bins=20, color='r', alpha=0.5, label="Losing")
plt.hist(polarityWin, normed=True, bins=20, color='b', alpha=0.5, label="Winning")
plt.xlabel("polarity")
plt.ylabel("probability")
plt.title("Sentiment Analysis Polarity by Outcome")
plt.legend()
plt.show()
Well, that was discouraging! But this was not a wasted effort on my part. I recently attended a web conference on SAS's sentiment analysis package. It is very expensive, so people naturally look for cheaper or, better yet, free alternatives; Python was mentioned. It is clear to me that TextBlob's sentiment analysis is very weak compared to SAS's.
The next method I will try is to use n-grams, that is, rather than look at single words, look at pairs or triplets of words. Do requests containing certain pairs or triplets have a higher probability of winning a pizza?
def nGramAnalysis(alist, N, minWin, minPct):
    winning_ngrams = {}
    losing_ngrams = {}
    for d in range(len(alist)):
        blob = TextBlob(alist[d])
        ngrams = blob.ngrams(n=N)
        for n in range(len(ngrams)):
            ng = ngrams[n]
            s = ""
            for k in range(len(ng)): # x,y,z becomes *x*y*z
                s = s + "*" + str(ng[k].encode('ascii', 'ignore'))
            s = s.lower()
            if (outcome_list[d] == True):
                if s in winning_ngrams:
                    winning_ngrams[s] += 1
                else:
                    winning_ngrams[s] = 1
            else:
                if s in losing_ngrams:
                    losing_ngrams[s] += 1
                else:
                    losing_ngrams[s] = 1
    keys = winning_ngrams.keys() + losing_ngrams.keys()
    keys_as_set = set(keys)
    sorted_keys = sorted(keys_as_set)
    print "N = %d, minimum win count = %d, minimum win percent = %.2f" % (N, minWin, minPct)
    findAny = False
    for i in range(len(sorted_keys)):
        key = sorted_keys[i]
        if (key in winning_ngrams):
            win = winning_ngrams[key]
        else:
            win = 0
        if (key in losing_ngrams):
            lose = losing_ngrams[key]
        else:
            lose = 0
        pct = 1.0 * win / (win + lose)
        if (win >= minWin and pct >= minPct):
            print " %-25s %d \t%d \t%5.2f" % (key, win, lose, pct)
            findAny = True
    if (findAny == False):
        print " (none found)"
print "ORIGINAL TEXT"
nGramAnalysis(original, 1, 10, .5)
nGramAnalysis(original, 2, 10, .4)
nGramAnalysis(original, 2, 10, .5)
nGramAnalysis(original, 3, 10, .4)
nGramAnalysis(original, 3, 10, .5)
print
print "STEM/STOP TEXT"
nGramAnalysis(text_list, 1, 10, .5)
nGramAnalysis(text_list, 2, 10, .4)
nGramAnalysis(text_list, 2, 10, .5)
nGramAnalysis(text_list, 3, 10, .4)
nGramAnalysis(text_list, 3, 10, .5)
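The n-gram extraction itself is essentially a sliding window over the token list, and can be done without TextBlob. A plain-Python sketch that also produces the same *x*y key format used above:

```python
def ngrams(text, n):
    # Slide an n-wide window over the tokens via zip of shifted slices,
    # producing keys in the *x*y form used by nGramAnalysis.
    words = text.lower().split()
    return ["*" + "*".join(gram)
            for gram in zip(*(words[i:] for i in range(n)))]

result = ngrams("Could really use a pizza", 2)
print(result)  # → ['*could*really', '*really*use', '*use*a', '*a*pizza']
```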
From the above analysis we can draw the following conclusions: the probability of success can be improved by (examples):
This dataset has been the subject of study by many students. In one such study, Althoff, Salehi, and Nguyen of Stanford University came to conclusions similar to those shown here.
"...our findings suggests that there are several factors that the user can control that are significantly correlated with success. First, the request should be fairly long allowing the user to introduce herself and her situation. It also helps to put in additional effort to upload a photo. This is often used to increase the level of trust, e.g. by attempting to verify certain claims through the photo such as identity, location, financial situation, or simply an empty fridge. Our findings also suggest that pizza givers value the requesters ambition to give back to the community by forwarding a pizza later (even though some never do). With respect to the request content we found that talking about your friends and partners as well as your leisure activities can have a negative impact on your success rate. Instead it seems advisable to talk more about money, most likely a bad financial situation, and work. It also seems to help to express gratitude and appreciation in your request." (http://web.stanford.edu/class/cs224w/projects2013/cs224w-025-final.pdf, retrieved November 20, 2014)
While I was disappointed that I was unable to find some magic predictive model, I can take solace in the fact that the findings from my solo effort were consistent with those of three guys from Stanford!
Bill Qualls, November 21, 2014