Python - need a fast algorithm that removes all words in a file that are derivatives of other words
We have a file named wordlist.txt that contains 1,876 KB worth of alphabetized words, all longer than 4 letters, with 1 carriage return between each new two-letter construction (ab, ac, ad, etc.) and carriage returns between the words themselves:
wfile = open("wordlist.txt", "r+")
I want to create a new file that contains only the words that are not derivatives of other, smaller words. For example, the wordlist contains the following words: ["abuser", "abused", "abusers", "abuse", "abuses", ...]. The new file should retain only the word "abuse" because it is the "lowest common denominator" (if you will) among those words. Similarly, the word "rodeo" would be removed because it contains the word "rode".
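In other words, I'm after something with this behavior (remove_derivatives is just a placeholder name for whatever the solution ends up being):

words = ["abuse", "abused", "abuser", "abusers", "abuses", "rode", "rodeo"]
print remove_derivatives(words)  # hypothetical function -- should print ['abuse', 'rode']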
I tried this implementation:
def root_words(wordlist):
    result = []
    base = wordlist[1]
    for word in wordlist:
        if not word.startswith(base):
            # found a word that isn't built on the current base,
            # so record the base and start a new one
            result.append(base)
            print base
            base = word
    result.append(base)
    return result

def main():
    wordlist = []
    wfile = open("wordlist.txt", "r+")
    for line in wfile:
        wordlist.append(line[:-1])  # chop the trailing newline
    wordlist = root_words(wordlist)
    newfile = open("newwordlist.txt", "r+")
    newfile.write(wordlist)
but it froze my computer. Any solutions?
I'd do it like this:
def bases(words):
    base = next(words)
    yield base
    for word in words:
        if word and not word.startswith(base):
            yield word
            base = word

def get_bases(infile, outfile):
    with open(infile) as f_in:
        words = (line.strip() for line in f_in)
        with open(outfile, 'w') as f_out:
            f_out.writelines(word + '\n' for word in bases(words))
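A usage sketch with the filenames from your question (the __main__ guard is just my addition):

if __name__ == '__main__':
    get_bases('wordlist.txt', 'newwordlist.txt')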
This goes through the corncob list of 58,000 words in a fifth of a second on my old laptop. It's old enough to have 1 gig of memory.
$ time python words.py

real    0m0.233s
user    0m0.180s
sys     0m0.012s
It uses iterators everywhere it can to go easy on the memory. You could probably increase performance by slicing off the end of the lines instead of using strip to get rid of the newlines.
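For instance, the word generator could become this (a sketch assuming every line, including the last, ends with a newline):

words = (line[:-1] for line in f_in)  # drop the trailing '\n' instead of calling strip()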
Also note that this relies on the input being sorted and non-empty. That was part of the stated preconditions though, so I don't feel too bad about it ;)