Splitting up visual blocks of text in Java
I have a block of text I'm trying to interpret in Java (or with grep/awk/etc.) that looks like the following:
differently, plaques of rn8 , rn9 mutants , human coronavirus oc43 more divergent of wild-type size, indicating suppressor mu- sars-cov, human coronavirus hku1, , bat coronaviruses tations, in isolation, not noticeably deleterious hku4, hku5, , hku9 (fig. 6b). thus, not mem- -- able effect on viral phenotype. potentially related obser- sented existence of interaction between nsp9 vation mutation a2u, neutral itself, nsp8 (56). hexadecameric complex of sars-cov nsp8 , lethal in combination aacaag insertion (data not nsp7 has been found bind double-stranded rna.
And I'd like to split it into 2 parts: left and right. I'm having trouble coming up with a regex or other method to split a block of text that is visually split, but not obviously so to a programming language. The lengths of the lines are variable.
I've considered looking for the first block and then finding the second by looking for multiple spaces, but I'm not sure that's a robust solution. Any ideas, snippets, pseudo code, links, etc.?
Text source
The text has been run through pdftotext as follows:
    pdftotext -layout mypdf.pdf
Blur the text, and come up with an array of character density per column of text. Then look for gaps and split there.
    import java.util.ArrayList;
    import java.util.List;

    // `text` is assumed to hold the output of pdftotext -layout.
    // Blur: turn single spaces between words into ink so that only wide gaps stay blank.
    String blurredText = text.replaceAll("(?<=\\S) (?=\\S)", ".");
    String[] blurredLines = blurredText.split("\r\n?|\n");
    int maxRowLength = 0;
    for (String blurredLine : blurredLines) {
        maxRowLength = Math.max(maxRowLength, blurredLine.length());
    }
    // Count the non-blank characters in each column.
    int[] columnCounts = new int[maxRowLength];
    for (String blurredLine : blurredLines) {
        for (int i = 0, n = blurredLine.length(); i < n; ++i) {
            if (blurredLine.charAt(i) != ' ') { ++columnCounts[i]; }
        }
    }
    // Look for runs of 0 of at least length 3.
    // Alternatively, you might look for the n longest runs of zeros.
    // Alternatively, you might look for runs of min(columnCounts) to ignore
    // horizontal rules.
    int minBreakLen = 3;  // A tuning parameter.
    List<Integer> breaks = new ArrayList<Integer>();
    for (int i = 0; i <= maxRowLength - minBreakLen; ++i) {
        if (columnCounts[i] != 0) { continue; }
        int runLength = 1;
        while (i + runLength < maxRowLength && 0 == columnCounts[i + runLength]) {
            ++runLength;
        }
        if (runLength >= minBreakLen) {
            breaks.add(i);  // Start column of a blank run.
        }
        i += runLength - 1;  // Skip past the run just examined.
    }
    System.out.println(breaks);
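Once a break column is known, each line still has to be cut into its left and right halves. The following is a minimal sketch of that last step, not part of the snippet above: the class `ColumnSplitter` and its methods `findGutter` and `splitAt` are hypothetical names, and it assumes the page has exactly two columns separated by one gutter that is blank on every line. It picks the longest interior run of empty columns and ignores a blank left margin.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits two-column text at the widest blank gutter. */
final class ColumnSplitter {

    /**
     * Returns the start column of the longest interior run of columns that are
     * blank on every line, or -1 if no run is at least minBreakLen wide.
     */
    static int findGutter(String[] lines, int minBreakLen) {
        int maxLen = 0;
        for (String line : lines) {
            maxLen = Math.max(maxLen, line.length());
        }
        int[] counts = new int[maxLen];
        for (String line : lines) {
            // Blur: a single space between two words counts as ink.
            String blurred = line.replaceAll("(?<=\\S) (?=\\S)", ".");
            for (int i = 0; i < blurred.length(); i++) {
                if (blurred.charAt(i) != ' ') {
                    counts[i]++;
                }
            }
        }
        int bestStart = -1;
        int bestLen = 0;
        for (int i = 0; i < maxLen; i++) {
            if (counts[i] != 0) {
                continue;
            }
            int run = 1;
            while (i + run < maxLen && counts[i + run] == 0) {
                run++;
            }
            // Assumption: runs touching either edge are margins, not gutters.
            boolean interior = i > 0 && i + run < maxLen;
            if (interior && run >= minBreakLen && run > bestLen) {
                bestStart = i;
                bestLen = run;
            }
            i += run - 1; // Skip past the run just examined.
        }
        return bestStart;
    }

    /** Cuts every line at the given column; result[r][0] is left, result[r][1] is right. */
    static String[][] splitAt(String[] lines, int column) {
        String[][] parts = new String[lines.length][2];
        for (int r = 0; r < lines.length; r++) {
            String line = lines[r];
            boolean reaches = line.length() > column;
            parts[r][0] = (reaches ? line.substring(0, column) : line).trim();
            parts[r][1] = (reaches ? line.substring(column) : "").trim();
        }
        return parts;
    }
}
```

Calling `findGutter(lines, 3)` and passing the result to `splitAt` gives one left/right pair per input line; joining all the left parts and then all the right parts restores the reading order. A line that spans both columns, such as a title or a horizontal rule, would hide the gutter, so such lines may need to be filtered out first.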