Splitting up visual blocks of text in Java


I have a block of text that I'm trying to interpret in Java (or with grep/awk/etc.), and it looks like the following:

    differently, plaques of rn8 , rn9 mutants                     , human coronavirus oc43 more divergent
    of wild-type size, indicating suppressor mu-                  sars-cov, human coronavirus hku1, , bat coronaviruses
    tations, in isolation, not noticeably deleterious             hku4, hku5, , hku9 (fig. 6b). thus, not mem-
    --
    able effect on viral phenotype. potentially related obser-    sented existence of interaction between nsp9
    vation mutation a2u, neutral itself,                          nsp8 (56). hexadecameric complex of sars-cov nsp8 ,
    lethal in combination aacaag insertion (data not              nsp7 has been found bind double-stranded rna.

I'd like to split it into two parts: left and right. I'm having trouble coming up with a regex or other method to split a block of text that is visually split but not obviously split to a programming language. The lengths of the lines are variable.

I've considered looking for the first block and then finding the second by looking for multiple spaces, but I'm not sure that's a robust solution. Any ideas, snippets, pseudo code, links, etc.?

Text source:

The text has been run through pdftotext as follows: pdftotext -layout mypdf.pdf

Blur the text and come up with an array of character density per column of text. Then look for gaps and split there.

    // Needed at the top of the file:
    import java.util.ArrayList;
    import java.util.List;

    // `text` is assumed to hold the output of pdftotext -layout.
    // Blur: fill single spaces between words so that only wider gaps stay blank.
    String blurredText = text.replaceAll("(?<=\\S) (?=\\S)", ".");
    String[] blurredLines = blurredText.split("\r\n?|\n");

    int maxRowLength = 0;
    for (String blurredLine : blurredLines) {
      maxRowLength = Math.max(maxRowLength, blurredLine.length());
    }

    // Count the non-space characters in each column.
    int[] columnCounts = new int[maxRowLength];
    for (String blurredLine : blurredLines) {
      for (int i = 0, n = blurredLine.length(); i < n; ++i) {
        if (blurredLine.charAt(i) != ' ') { ++columnCounts[i]; }
      }
    }

    // Look for runs of 0 of at least length 3.
    // Alternatively, you might look for the n longest runs of zeros.
    // Alternatively, you might look for runs of min(columnCounts) to ignore
    // horizontal rules.

    int minBreakLen = 3;  // A tuning parameter.
    List<Integer> breaks = new ArrayList<Integer>();
    for (int i = 0; i < maxRowLength - minBreakLen; ++i) {
      if (columnCounts[i] != 0) { continue; }
      int runLength = 1;
      while (i + runLength < maxRowLength && 0 == columnCounts[i + runLength]) {
        ++runLength;
      }
      if (runLength >= minBreakLen) {
        breaks.add(i);  // Start index of the gap.
      }
      i += runLength - 1;  // Skip the rest of this run.
    }

    System.out.println(breaks);
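
Once a break column is known, each line still has to be cut there to get the left and right parts the question asks for. The following is a minimal sketch of that last step, not a definitive implementation. It assumes exactly two columns and that the right column starts at the same character offset on every line. It also picks the widest interior gap rather than every run of at least three empty columns, which is a variation on the approach above. The names ColumnSplitter, columnDensity, findSplit, splitColumns, and minGap are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class ColumnSplitter {

    /** Counts, per character column, the lines that have a non-space character there. */
    static int[] columnDensity(String[] lines) {
        int width = 0;
        for (String line : lines) {
            width = Math.max(width, line.length());
        }
        int[] counts = new int[width];
        for (String line : lines) {
            // Blur: fill single spaces between words so they do not look like gaps.
            String blurred = line.replaceAll("(?<=\\S) (?=\\S)", ".");
            for (int i = 0; i < blurred.length(); i++) {
                if (blurred.charAt(i) != ' ') {
                    counts[i]++;
                }
            }
        }
        return counts;
    }

    /**
     * Returns the start of the widest run of empty columns that has text on
     * both sides, or -1 if no such run is at least minGap columns wide.
     */
    static int findSplit(int[] counts, int minGap) {
        int bestStart = -1;
        int bestLen = 0;
        int i = 0;
        while (i < counts.length && counts[i] == 0) {
            i++; // skip indentation shared by every line
        }
        while (i < counts.length) {
            if (counts[i] != 0) {
                i++;
                continue;
            }
            int start = i;
            while (i < counts.length && counts[i] == 0) {
                i++;
            }
            // A run that reaches the right edge is trailing space, not a gutter.
            if (i < counts.length && i - start > bestLen) {
                bestLen = i - start;
                bestStart = start;
            }
        }
        return bestLen >= minGap ? bestStart : -1;
    }

    /** Splits text into {left, right}; right is empty when no gutter is found. */
    static String[] splitColumns(String text, int minGap) {
        String[] lines = text.split("\r\n?|\n");
        int split = findSplit(columnDensity(lines), minGap);
        if (split < 0) {
            return new String[] { text, "" };
        }
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (String line : lines) {
            int cut = Math.min(split, line.length());
            left.add(line.substring(0, cut).trim());
            right.add(line.substring(cut).trim());
        }
        return new String[] { String.join("\n", left), String.join("\n", right) };
    }

    public static void main(String[] args) {
        // Made-up sample in which the right column starts at offset 21 on every line.
        String text = String.format("%-21s%s\n", "alpha beta gamma", "one two")
                + String.format("%-21s%s\n", "delta", "three four five")
                + String.format("%-21s%s\n", "epsilon zeta", "six");
        String[] columns = splitColumns(text, 3);
        System.out.println(columns[0]);
        System.out.println();
        System.out.println(columns[1]);
    }
}
```

Words hyphenated across line ends (such as "mu-" / "tations") are left as they are; rejoining them would be a separate pass over each column.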
