Sunday, March 13, 2016

DATA SIMPLIFICATION: Doublet Lists


Over the next few weeks, I will be writing on topics related to my latest book, Data Simplification: Taming Information With Open Source Tools (release date March 17, 2016). I hope I can convince you that this is a book worth reading.


Blog readers can use the discount code: COMP315 for a 30% discount, at checkout.

Yesterday's blog covered lists of single words. Today we'll do doublets.

Doublet lists (lists of two-word terms that occur in common usage or in a body of text) are a highly underutilized resource. The special value of doublets is that single-word terms tend to have multiple meanings, while doublets tend to have a specific meaning.

Here are a few examples:

The word "rose" can mean the past tense of rise, or the flower. The doublet "rose garden" refers specifically to a place where the rose flower grows.

The word "lead" can be a form of the verb "to lead", or it can refer to the metal. The term "lead paint" has a different meaning than "lead violinist". Furthermore, every multiword term of length greater than two can be constructed from overlapping doublets, with each doublet having a specific meaning.

For example, "Lincoln Continental convertible" = "Lincoln Continental" + "Continental convertible". The three words, "Lincoln", "Continental", and "convertible", all have different meanings under different circumstances. But the two doublets, "Lincoln Continental" and "Continental convertible", would be unusual to encounter on their own, and they produce a unique meaning when combined.
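The decomposition of a longer term into overlapping doublets can be sketched in a few lines of Python. This is only an illustration, not code from the book; the function name overlapping_doublets is a hypothetical helper introduced here.

```python
def overlapping_doublets(term):
    """Return the overlapping two-word sequences found in a multiword term."""
    words = term.split()
    # pair each word with the word that immediately follows it
    return [words[i] + " " + words[i + 1] for i in range(len(words) - 1)]

print(overlapping_doublets("Lincoln Continental convertible"))
# prints ['Lincoln Continental', 'Continental convertible']
```

A term of n words yields n - 1 doublets, so a single word yields none.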

Perusal of any nomenclature will reveal that most of the terms included in nomenclatures consist of two or more words. This is because single word terms often lack specificity. For example, in a nomenclature of recipes, you might expect to find, "Eggplant Parmesan" but you may be disappointed if you look for "Eggplant" or "Parmesan". In a taxonomy of neoplasms, available at: http://www.julesberman.info/figs/neocl_f.htm, containing over 120,000 terms, only a few hundred of those terms are single word terms (1).
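One way to check this kind of claim against a nomenclature of your own is to count how many of its terms consist of a single word. The sketch below assumes the terms have already been loaded into a Python list of strings; the function name single_word_fraction and the tiny recipe list are made up for illustration and are not taken from any real nomenclature.

```python
def single_word_fraction(terms):
    """Return the fraction of terms that consist of exactly one word."""
    if not terms:
        return 0.0
    single = sum(1 for term in terms if len(term.split()) == 1)
    return single / len(terms)

# a made-up miniature recipe nomenclature, for illustration only
recipes = ["eggplant parmesan", "chicken soup", "omelet", "rose hip tea"]
print(single_word_fraction(recipes))  # prints 0.25
```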

Lists of doublets, collected from a corpus of text, or from a nomenclature, have a variety of uses in data simplification projects (1-3). We will show examples in Section 5.4, and in "On-the-fly indexing scripts" later in this chapter.

For now, you should know that compiling doublet lists, from any corpus of text, is extremely easy.

Here is a Perl script, doublet_maker.pl, that creates a list of alphabetized doublets occurring in any text file of your choice (filename.txt in this example):
#!/usr/local/bin/perl
open(TEXT,"filename.txt")||die"cannot";
open(OUT,">doublets.txt")||die"cannot";
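undef($/);                     # slurp mode: read the whole file at once
$var = <TEXT>;
$var =~ s/\n/ /g;              # join lines with spaces
$var =~ s/\'s//g;              # drop possessives
$var =~ tr/a-zA-Z\'\- //cd;    # delete all but letters, apostrophes, hyphens, spaces
@words = split(/ +/, $var);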
foreach $thing (@words)
  {
  $doublet = "$oldthing $thing";
  if ($doublet =~ /^[a-z]+ [a-z]+$/)
    {
    $doublethash{$doublet}="";
    }
  $oldthing = $thing;
  }
close TEXT;
@wordarray = sort(keys(%doublethash));
print OUT join("\n",@wordarray);
close OUT;
exit;
Here is an equivalent Python script, doublet_maker.py:
#!/usr/local/bin/python
import re
in_file = open('filename.txt', "r")
out_file = open('doubs.txt', "w")
doubhash = {}
for line in in_file:
  line = line.lower()
  # replace punctuation and digits with spaces
  line = re.sub(r'[.,<>?/;:"\[\]\\{}|=+\-_ ()*&^%$#@!`~0-9]', ' ', line)
  hoparray = line.split()
  for i in range(len(hoparray) - 1):
    doublet = hoparray[i] + " " + hoparray[i + 1]
    if doublet in doubhash:
      continue
    # keep only doublets made of two purely alphabetic words
    if re.search(r'^[a-z]+ [a-z]+$', doublet):
      doubhash[doublet] = ""
in_file.close()
# write the doublets in alphabetical order, one per line
for doublet in sorted(doubhash):
  out_file.write(doublet + '\n')
out_file.close()
Here is an equivalent Ruby script, doublet_maker.rb, that creates a doublet list from the file filename.txt:
#!/usr/local/bin/ruby
intext = File.open("filename.txt", "r")
outtext = File.open("doubs.txt", "w")
doubhash = Hash.new(0)
line_array = Array.new(0)
while record = intext.gets
  oldword = ""
  line_array = record.chomp.strip.split(/\s+/)
  line_array.each do
    |word|
    doublet = [oldword, word].join(" ")
    oldword = word
    next unless (doublet =~ /^[a-z]+\s[a-z]+$/)
    doubhash[doublet] = ""
    end
end
doubhash.keys.sort.each {|k| outtext.puts k }
exit
I have deposited a public domain doublet list, available for download at:

http://www.julesberman.info/doublets.htm

The first few lines of the list are shown:
a bachelor
a background
a bacteremia
a bacteria
a bacterial
a bacterium
a bad
a balance
a balanced
a banana
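
A list like this can be loaded into memory and used as a lookup table. As a rough sketch (not the method described in the cited papers), the Python below reads a doublet file into a set and reports which doublets in a sentence appear in it. It assumes one doublet per line, as in the listing above; the file name doublets.txt and the helper names load_doublets and known_doublets are placeholders chosen for this example.

```python
import re

def load_doublets(path):
    """Read a doublet file (one doublet per line) into a set."""
    with open(path, "r") as handle:
        return {line.strip() for line in handle if line.strip()}

def known_doublets(text, doublet_set):
    """Return the doublets in text that also occur in doublet_set, in text order."""
    # lowercase the text and keep only runs of letters as words
    words = re.findall(r"[a-z]+", text.lower())
    pairs = (words[i] + " " + words[i + 1] for i in range(len(words) - 1))
    return [pair for pair in pairs if pair in doublet_set]

# in practice: reference = load_doublets("doublets.txt")
reference = {"a bacterial", "a balanced", "a banana"}
print(known_doublets("She ate a banana and a balanced breakfast.", reference))
# prints ['a banana', 'a balanced']
```

Holding the list in a set keeps each lookup fast, even for a list with many thousands of doublets.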

- Jules Berman

key words: computer science, data analysis, data repurposing, data simplification, word lists, doublet lists, n-grams, complexity, open source tools, jules j berman

References:

[1] Berman JJ. Automatic extraction of candidate nomenclature terms using the doublet method. BMC Medical Informatics and Decision Making 5:35, 2005.

[2] Berman JJ. Doublet method for very fast autocoding. BMC Medical Informatics and Decision Making 4:16, 2004.

[3] Berman JJ. Nomenclature-based data retrieval without prior annotation: facilitating biomedical data integration with fast doublet matching. In Silico Biol, 5:0029, 2005. Available at: http://www.bioinfo.de/isb/2005/05/0029/, viewed on September 6, 2015.
