performance - Efficient file buffering & scanning methods for large files in Python
The description of my problem is a little bit complicated, so I am going to err on the side of providing more complete information. For the impatient, here is the briefest way I can summarize it:
What is the fastest (least execution time) way to split a text file into all (overlapping) substrings of size n (bounded n, e.g. 36), while throwing out newline characters?
I am writing a module which parses files in the FASTA ASCII-based genome format. These files comprise what is known as the 'hg18' human reference genome, which you can download from the UCSC genome browser (go Slugs!) if you like.
As you will notice, the genome files are composed of chr[1..22].fa and chr[XY].fa, along with a set of other small files which are not used in this module.
Several modules already exist for parsing FASTA files, such as BioPython's SeqIO. (Sorry, I'd post a link, but I don't have the points for it yet.) Unfortunately, every module I've been able to find doesn't do the specific operation I am trying to do.
My module needs to split the genome data ('cagtacgtcagactatacggagcta' might be a line, for instance) into every single overlapping n-length substring. Let me give an example using a very small file (the actual chromosome files are between 355 and 20 million characters long) and n=8:
    >>> import cStringIO
    >>> example_file = cStringIO.StringIO("""\
    >header
    cagtcag
    tfgcacf
    """)
    >>> for read in parse(example_file):
    ...     print read
    ...
    CAGTCAGTF
    AGTCAGTFG
    GTCAGTFGC
    TCAGTFGCA
    CAGTFGCAC
    AGTFGCACF
The function I found to have the absolute best performance out of the methods I could think of is this:
    def parse(file):
        size = 8  # of course in my code this is a function argument
        file.readline()  # skip past the header
        buffer = ''
        for line in file:
            buffer += line.rstrip().upper()
            while len(buffer) >= size:
                yield buffer[:size]
                buffer = buffer[1:]
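For what it's worth, one variant worth considering avoids rebuilding the string with buffer = buffer[1:] on every yield and walks an index instead; whether that actually helps depends on the line length relative to size, so the following is only an illustrative sketch, not the poster's code:

    def parse_by_index(file, size=8):
        # Sketch only: yields the same overlapping windows as parse() above,
        # but advances an index instead of copying the buffer on every yield.
        file.readline()  # skip past the header
        buffer = ''
        for line in file:
            buffer += line.rstrip().upper()
            i = 0
            while i + size <= len(buffer):
                yield buffer[i:i + size]
                i += 1
            buffer = buffer[i:]  # keep only the tail that still needs more input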
This works, but unfortunately it still takes about 1.5 hours (see note below) to parse the human genome this way. Perhaps this is the very best I am going to see with this method (a complete code refactor might be in order, but I'd like to avoid it since this approach has some very specific advantages in other areas of the code), but I thought I would turn this over to the community.

Thanks!
- Note: this time includes a lot of extra calculation, such as computing the opposing strand read and doing hashtable lookups on a hash of approximately 5G in size.
Post-answer conclusion: it turns out that using fileobj.read() and then manipulating the resulting string (string.replace(), etc.) took relatively little time and memory compared to the remainder of the program, so I used that approach. Thanks, everyone!
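That final code isn't shown in the post, but the idea it describes would look roughly like the sketch below; the function name, the default size, and the replace()/upper() details are assumptions, not the poster's actual implementation:

    def parse_whole_file(path, size=8):
        # Sketch of the read-everything approach: slurp the file, drop the
        # header line and all newlines, then slice out overlapping windows.
        with open(path) as f:
            f.readline()  # skip past the header
            data = f.read().replace('\n', '').upper()
        for i in range(len(data) - size + 1):
            yield data[i:i + size]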
Some classic IO-bound changes:
- Use a lower level read operation such as os.read and read into a large fixed buffer (see the sketch after this list).
- Use threading/multiprocessing where one thread/process reads and buffers and the other processes.
- If you have multiple processors/machines, use multiprocessing/MQ to divvy up processing across CPUs, à la map-reduce.
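A minimal sketch of the first suggestion, assuming a plain file on disk; the 1 MB buffer size and the generator shape are illustrative choices, not part of the original answer:

    import os

    def read_chunks(path, bufsize=1024 * 1024):
        # Sketch: low-level reads with os.read into a large, fixed-size buffer.
        fd = os.open(path, os.O_RDONLY)
        try:
            while True:
                chunk = os.read(fd, bufsize)  # up to bufsize bytes per call
                if not chunk:                 # empty result means end of file
                    break
                yield chunk
        finally:
            os.close(fd)

The chunks would still need to be stitched together across line and buffer boundaries before extracting windows, which is part of why this alone is only a modest change rather than a full rewrite.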
Using the lower level read operation wouldn't be that much of a rewrite. The others are pretty large rewrites.