Click here for the previous installment in this series.

So I’m totally new to building Markov bots. This was a fun little experiment in pushing my bounds a little further than they had been when I started, Python-wise. If you want the same experience, I recommend not reading further, and instead doing what I did: read this amazing Stack Overflow explanation of how they work, and then fail at making a Markov bot for a whole frustrating week. It’s gonna suck, but I’m happy I did it.
So first up, the top of the code. We import Set from sets, and we begin to make our function called make_dictionary(). I saved all of the kimono files for each author as harryRaw.txt or walterRaw.txt.

from sets import Set

def make_dictionary(writer_name):
    file_in = writer_name + "Raw.txt"
    z = open(file_in,"r")
    out = ""
This batch of code cycles through every single line in the file. Because of the format of each line — which is “link”,“paragraph” — we need to remove everything that’s not the paragraph.
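For instance, each raw line looks something like the one below, and we just want the paragraph part. Here's a minimal Python 3 sketch of that slice (the link and paragraph are made up, and it skips past the full `","` separator):

```python
# A made-up input line in the "link","paragraph" format described above.
line = '"http://example.com/story","It was a dark and stormy night."\n'

if '","' in line:
    # Jump past the '","' separator, then trim the trailing newline and quote.
    paragraph = line[line.find('","') + 3:].rstrip().rstrip('"')

print(paragraph)  # It was a dark and stormy night.
```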
counter = 0
numbers = Set(['1','2','3','4','5','6','7','8','9','0'])
while(counter < 10000):
    line = z.readline()
    line = line.replace('"","","','')
    if( '","' in line):
        line = line[line.find('","')+1:-1]
        if (len(line)>0):
            if(line[-1] in numbers):
                line = line[:-1]
        if (len(line)>0):
            if(line[-1] in numbers):
                line = line[:-1]
        out = out + line + "\n"
        counter = counter + 1
    else:
        counter = counter + 1
We’re out of the while loop, still in the def. Alright, now we’re going to clean the file up and strip out characters that might screw us up. We’re going to hunt down phrase breaks — commas, periods, exclamation points, etc. — to break these paragraphs up into phrases. To do that, we need to make sure that things like “No. 1” and “Mr. Silver” don’t mess us up. Is there an easier way to do this? Probably! I was just looking to get something that worked. You could totally do this with regular expressions, I think.
# makes sure that periods don't fuck you up
out = out.replace("No. ","No.")
out = out.replace("Nos. ","Nos.")
out = out.replace("(","")
out = out.replace("U.S.","U.S")
out = out.replace("U.K.","U.K")
out = out.replace("W. Bush","W Bush")
out = out.replace("Gov.","Gov")
out = out.replace("Mrs.","Mrs")
out = out.replace("Mr.","Mr")
out = out.replace("Ms.","Ms")
out = out.replace("Dr.","Dr")
out = out.replace("Sen.","Sen")
out = out.replace("Rep.","Rep")
out = out.replace("Jan.","Jan")
out = out.replace("Feb.","Feb")
out = out.replace("Mar.","Mar")
out = out.replace("Apr.","Apr")
out = out.replace("Jun.","Jun")
out = out.replace("Jul.","Jul")
out = out.replace("Aug.","Aug")
out = out.replace("Sept.","Sep")
out = out.replace("Oct.","Oct")
out = out.replace("Nov.","Nov")
out = out.replace("Dec.","Dec")
out = out.replace(")","")
out = out.replace("vs.","vs")
Now this replaces every phrase break with the tilde character and nukes extraneous punctuation:
out = out.replace(". ","~\n")
out = out.replace(".\n","~\n")
out = out.replace("? ","~\n")
out = out.replace("?","~")
out = out.replace("! ","~\n")
out = out.replace("!","~")
out = out.replace("; ","~\n")
out = out.replace(": "," ")
out = out.replace("- "," ")
out = out.replace(",","")
out = out.replace('"',"")
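As mentioned above, regular expressions could do this more compactly. Here's a minimal Python 3 sketch of the same idea (protect a few abbreviations, then split on sentence-ending punctuation) using a made-up input string and a much shorter abbreviation list than the real one:

```python
import re

text = "Mr. Silver met Sen. Smith. Was it raining? Yes! It was."

# Protect abbreviations first, so their periods don't look like phrase breaks.
for abbr in ["Mr.", "Sen.", "No.", "Dr."]:
    text = text.replace(abbr, abbr[:-1])

# Split on sentence-ending punctuation followed by whitespace,
# then strip any punctuation left on the final phrase.
phrases = [p.rstrip(".?!") for p in re.split(r"[.?!;]\s+", text) if p]

print(phrases)  # ['Mr Silver met Sen Smith', 'Was it raining', 'Yes', 'It was']
```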
This cleaned output — which is one phrase per line — is exported to a temporary file called phrase_lined_temp.txt.
s = open("phrase_lined_temp.txt","w")
s.write(out)
s.close()
z.close()
#------- Turn lines into three-word pieces
phrase = open("phrase_lined_temp.txt","r")
count_phrases = str(phrase.read()).count("~")
print count_phrases
phrase.close()
So we’ve got a file where all of the source material is broken down so that each line is a distinct phrase. Now we’re going to turn that into a dictionary.
To do this, we need each three-word phrase in the set on its own separate line. We’re also going to need a database of starter words, so we’re going to grab the first word from each line and save that in a dictionary.
counter = 0
out2 = ""
first_words = "{' ': ["
phrases = open("phrase_lined_temp.txt","r")
while(count_phrases > counter):
    line = phrases.readline()
    n = True
    if(2 > line.count(" ") or line.count("~") == 0):
        n = False
        counter = counter + 1
    else:
        try:
            word1 = line[:line.find(" ")]
            line = line[line.find(" ")+1:]
            word2 = line[:line.find(" ")]
            line = line[line.find(" ")+1:]
            word3 = line[:line.find(" ")]
            line = line[line.find(" ")+1:]
            first_words = first_words + "'"+ word1.replace("'","").replace('{',"").replace('}',"") + " " + word2.replace("'","").replace('{',"").replace('}',"") + "', "
        except:
            n = False
        while(n):
            out2 = out2 + word1 + "," + word2 + "~" + word3 + "\n"
            try:
                word1 = word2
                word2 = word3
                if(" " in line):
                    word3 = line[:line.find(" ")]
                    line = line[line.find(" ")+1:]
                else:
                    word3 = line[:line.find("~")]
                    out2 = out2 + word1 + "," + word2 + "~" + word3 + "\n"
                    out2 = out2 + word2 + "," + word3 + "~" + "...END..." + "\n"
                    n = False
            except:
                n = False
        counter = counter + 1
        if(counter%int(count_phrases/100) == 0):
            print counter/int(count_phrases/100)
So we’re going to write all this to two files. temp_3gram.txt is going to have every three-gram broken down like this: firstword,secondword~thirdword. When the line is over, we also add in a final three-gram with the second-to-last word, the last word, and “...END...” as the last one. Everyone gets a Firstwords.txt file, too.
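The windowing described above can be sketched compactly in Python 3, using split() instead of the find()-based slicing, on a made-up phrase:

```python
phrase = "the quick brown fox"
words = phrase.split()

lines = []
# Slide a three-word window across the phrase...
for i in range(len(words) - 2):
    lines.append(words[i] + "," + words[i + 1] + "~" + words[i + 2])
# ...then cap it with the ...END... marker after the final word pair.
lines.append(words[-2] + "," + words[-1] + "~...END...")

print(lines)  # ['the,quick~brown', 'quick,brown~fox', 'brown,fox~...END...']
```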
out2file = open("temp_3gram.txt","w")
out2file.write(out2)
out2file.close()
firstfile_name = writer_name + "Firstwords.txt"
firstfile = open(firstfile_name,"w")
firstfile.write(first_words[:-2]+"]}")
firstfile.close()
phrases.close()
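To picture what lands in a Firstwords.txt file: it's a Python dict literal with a single ' ' key holding all the starter word pairs. A Python 3 sketch with made-up contents, read back with ast.literal_eval:

```python
import ast

# Hypothetical contents of a writer's Firstwords.txt file: a dict literal
# with one key, ' ', mapping to the list of opening two-word pairs.
first_words_text = "{' ': ['I am', 'The senate', 'Polls show']}"

# literal_eval safely parses the string back into a real dict.
first_words = ast.literal_eval(first_words_text)

print(first_words[' '][0])  # I am
```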
Now we’re going to make that three-gram file into a Python dictionary.
The first two words are made into a key — “FIRSTWORD SECONDWORD” — and the following word is made into the value. So let’s take an example: every time Neil wrote something like “I am X.” Let’s say he used that construction four times: “I am Neil”, “I am American”, “I am Neil”, “I am Groot”. The dictionary would have “I AM” as the key, and “Neil”, “American”, “Neil”, “Groot” as the values.
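Here's that “I am” example as a tiny Python 3 sketch; setdefault() does the same job as the if/else check in the full code:

```python
# Made-up three-gram lines in the firstword,secondword~thirdword format.
threegrams = ["I,am~Neil", "I,am~American", "I,am~Neil", "I,am~Groot"]

d = {}
for line in threegrams:
    # "first,second~third" becomes key "FIRST SECOND" and value "third".
    key = line[:line.find(",")].upper() + " " + line[line.find(",") + 1:line.find("~")].upper()
    value = line[line.find("~") + 1:]
    # Start a new list for an unseen key, or append to the existing one.
    d.setdefault(key, []).append(value)

print(d)  # {'I AM': ['Neil', 'American', 'Neil', 'Groot']}
```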
d = dict()
countfile = open("temp_3gram.txt","r")
threegramcount = str(countfile.read()).count("~")
print threegramcount
countfile.close()
source_file = open("temp_3gram.txt","r")
counter = 0
while(threegramcount > counter):
    line1 = source_file.readline()
    key = line1[:line1.find(",")].upper() +" " + line1[line1.find(",")+1:line1.find("~")].upper()
    value = line1[line1.find("~")+1:].replace("\n","")
    if key in d:
        # append the new word to the existing list at this key
        d[key].append(value)
    else:
        # create a new list at this key
        d[key] = [value]
    counter = counter + 1
    if(counter%int(threegramcount/100) == 0):
        print counter/int(threegramcount/100)
source_file.close()
filename_out = writer_name + "Dict.txt"
lastout = open(filename_out,"w")
lastout.write(str(d).replace("\n",""))
lastout.close()
And that’s the end of the big def! All you need to do then, for instance, is run this:
make_dictionary("neil")
for whoever you please, to generate
neilDict.txt
and that’ll be enough for the next step. Next up: how to generate the Markov bot output.
Click here for the next installment in this series