Python - need a fast algorithm that removes all words in a file that are derivatives of other words
We have a file named wordlist.txt that contains 1,876 KB worth of alphabetized words, all longer than 4 letters, with 1 carriage return between each new two-letter construction (ab, ac, ad, etc.) and carriage returns between the words themselves:
wfile = open("wordlist.txt", "r+")
I want to create a new file that contains only the words that are not derivatives of other, smaller words. For example, the wordlist contains the following words: ["abuser", "abused", "abusers", "abuse", "abuses", ...]. The new file should retain only the word "abuse" because it is the "lowest common denominator" (if you will) among those words. Similarly, the word "rodeo" would be removed because it contains the word "rode".
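In other words, I'm after something with this behavior (remove_derivatives is just a placeholder name for whatever the solution ends up being):

words = ["abuse", "abused", "abuser", "abusers", "abuses", "rode", "rodeo"]
print remove_derivatives(words)  # hypothetical function -- should print ['abuse', 'rode']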
I tried this implementation:
def root_words(wordlist):
    result = []
    base = wordlist[1]
    for word in wordlist:
        if not word.startswith(base):
            # found a word that isn't built on the current base,
            # so record the base and start a new one
            result.append(base)
            print base
            base = word
    result.append(base)
    return result

def main():
    wordlist = []
    wfile = open("wordlist.txt", "r+")
    for line in wfile:
        wordlist.append(line[:-1])  # chop the trailing newline
    wordlist = root_words(wordlist)
    newfile = open("newwordlist.txt", "r+")
    newfile.write(wordlist)
but it froze my computer. Any solutions?
I'd do it like this:
def bases(words):
    base = next(words)
    yield base
    for word in words:
        if word and not word.startswith(base):
            yield word
            base = word

def get_bases(infile, outfile):
    with open(infile) as f_in:
        words = (line.strip() for line in f_in)
        with open(outfile, 'w') as f_out:
            f_out.writelines(word + '\n' for word in bases(words))
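A usage sketch with the filenames from your question (the __main__ guard is just my addition):

if __name__ == '__main__':
    get_bases('wordlist.txt', 'newwordlist.txt')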
This goes through the corncob list of 58,000 words in a fifth of a second on my old laptop. It's old enough to have 1 gig of memory.
$ time python words.py

real    0m0.233s
user    0m0.180s
sys     0m0.012s
It uses iterators everywhere it can to go easy on the memory. You could probably increase performance by slicing off the end of the lines instead of using strip to get rid of the newlines.
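For instance, the word generator could become this (a sketch assuming every line, including the last, ends with a newline):

words = (line[:-1] for line in f_in)  # drop the trailing '\n' instead of calling strip()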
Also note that this relies on the input being sorted and non-empty. That was part of the stated preconditions though, so I don't feel too bad about it ;)