Next: About this document ... Up: medMatch: Software for Generating Previous: Download

Description

Once downloaded, medMatch can be unpacked and compiled with the following commands:
tar -xvzf medMatch_xxx.tgz
cd MedMatch_xxx
make
The options of the program are then printed using
./medMatch -h
which returns
medMatch version 1.2, copyright (c) 2006, Bernhard Haubold & Olaf Bininda-Emonds
        distributed under the GNU General Public License
purpose: create a gene index of Medline; search for Approved Symbol & Previous Symbols is case-sensitive;
         search for other patterns is not case-sensitive.
input arguments:
        -p FILE pattern file in HUGO format (cf. http://www.gene.ucl.ac.uk/nomenclature/data/gdlw_index.html)
        [-t FILE text file in Medline format; default: stdin]
        [-o FILE output file; default: stdout]
symbol match arguments:
        [-s NUM minimum length of Approved Symbol or Previous Symbols; default: 5]
        [-u NUM minimum length of Approved Symbol or Previous Symbols in conjunction with -x; default: 3]
        [-v NUM minimum length of Approved Symbol or Previous Symbols in conjunction with -y; default: 2]
word match arguments:
        [-w NUM minimum word count in Approved Name or Previous Names or Aliases; default: 4]
        [-x NUM minimum word count in Approved Name or Previous Names or Aliases; default: 1]
        [-y NUM minimum word count in Approved Name or Previous Names or Aliases; default: 2]
        [-z NUM minimum fraction of Approved Name or Previous Names; default: 0.5]
        [-a NUM minimum word count in Aliases in conjunction with -z applied to Previous Name; default: 1]
        [-A match Aliases case-sensitively]
miscellaneous arguments:
        [-O FILE output file for patterns (debugging); default: patterns not written to file]
        [-d diagnostic output (debugging)
        [-D verbose diagnostic output (debugging)]
        [-h print this help message]
        [-c print match criteria (AN = Approved Name; PN = Previous Names; A = Aliases)
            symbol match
                1 true if symbol length >= 5
                2 true if symbol length >= 3 and >= 1 words present form AN, PN, or A
                3 true if symbol length == 2 and >= 2 words present form AN, PN, or A
            word match (can only be true if symbol match is not true)
                4 true if >= 4 words present from AN
                5 true if >= 4 words present from PN
                6 true if fraction of words present from AN >= 0.5
                7 true if fraction of words present from PN >= 0.5
                8 true if (1 word from AGN or PGN present) AND (Previous Symbols or Aliases present)]
limits:
        maximum length of line in text or pattern file: 1000000
        maximum number of patterns: 40000
This may appear to be a somewhat daunting list, but most of the options can be left at their default values. To carry out a trial run, download the example data and copy it into the directory containing medMatch. Then unpack it by typing
tar -xvzf exampleData.tgz
which generates two new files:
  1. gdlw.txt: contains the HGNC nomenclature of human genes
  2. medsamp2006d.xml: contains sample Medline data
Run medMatch on these two files by typing:
./medMatch -p gdlw.txt -t medsamp2006d.xml
The first ten lines of output look like this:
10846528        TAL1
10846530        CD44
10846539        TIE1
11256901        GCNT2 ERMAP
11322285        SERPINC1
11322288        DMD
11322289        PCNA
11322300        NBR1
11322327        PCNA
11322334        FRRS1 TIE1 TGFBR1
The first column shows the PubMed ID (PMID) of each entry with a hit to a human gene. The matching genes are printed as space-delimited data in the second column. Genes are identified by their approved symbol; hence we learn that, according to medMatch, Medline entry 10846528 references gene TAL1, entry 11256901 references genes GCNT2 and ERMAP, and so on. You can use the program getme to retrieve a particular entry in Medline:
getme -p 10846530 -i medsamp2006d.xml
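
The two-column output described above lends itself to simple post-processing. The following Python sketch is not part of the medMatch distribution; it assumes that the PMID and the gene symbols are separated by whitespace, as in the listing above, and the function names parse_medmatch_output and invert are made up for illustration. It reads the output into a PMID-to-genes mapping and inverts it into a gene-to-PMIDs index:

```python
from collections import defaultdict

def parse_medmatch_output(lines):
    """Map each PMID to the list of gene symbols reported for it.

    Assumes whitespace-separated columns: PMID first, then gene symbols.
    """
    hits = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        pmid, genes = fields[0], fields[1:]
        hits.setdefault(pmid, []).extend(genes)
    return hits

def invert(hits):
    """Build a gene -> PMIDs index from the PMID -> genes mapping."""
    index = defaultdict(list)
    for pmid, genes in hits.items():
        for gene in genes:
            index[gene].append(pmid)
    return dict(index)

# The first ten lines of output shown above.
sample = """\
10846528        TAL1
10846530        CD44
10846539        TIE1
11256901        GCNT2 ERMAP
11322285        SERPINC1
11322288        DMD
11322289        PCNA
11322300        NBR1
11322327        PCNA
11322334        FRRS1 TIE1 TGFBR1
""".splitlines()

hits = parse_medmatch_output(sample)
index = invert(hits)
print(hits["11256901"])  # ['GCNT2', 'ERMAP']
print(index["PCNA"])     # ['11322289', '11322327']
print(index["TIE1"])     # ['10846539', '11322334']
```

In practice the lines would come from a file written with the -o option rather than from an embedded string.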

To make transparent how medMatch arrives at its gene assignments, the program provides the -c switch. Adding it to the medMatch command above gives

./medMatch -p gdlw.txt -t medsamp2006d.xml -c
The first 10 lines of output from this command are now
10846528        TAL1:5
10846530        CD44:2
10846539        TIE1:5
11256901        GCNT2:3 ERMAP:5
11322285        SERPINC1:8
11322288        DMD:5
11322289        PCNA:2
11322300        NBR1:5
11322327        PCNA:2
11322334        FRRS1:5 TIE1:5 TGFBR1:5
This tells us that eight out of thirteen matches were assigned on the basis of criterion 5, i.e. at least four words from the field "Previous Names" were identified in the entry. Three further matches were assigned on the basis of criterion 2, i.e. a match to a "Symbol" consisting of at least three characters together with a match to at least one word from "Approved Name", "Previous Names", or "Aliases". There was also one match each based on criterion 3 (a match to a "Symbol" of length 2 and at least two words present from "Approved Name", "Previous Names", or "Aliases") and criterion 8 (one word from "Approved Name" or "Previous Names" present together with a match to "Previous Symbols" or "Aliases").

In contrast to criterion 5, where a fixed number of word matches is required, criterion 6 (for which the output presented here contains no example) requires that a fraction, 0.5 by default, of all words contained in "Approved Name" have an exact match in the Medline entry concerned. Criterion 7 is defined analogously for "Previous Names".
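
The counts quoted above can be checked mechanically. The following Python sketch is not part of the medMatch distribution; it assumes that each hit in the -c output has the form SYMBOL:CRITERION, as in the listing above, and the function name tally_criteria is made up for illustration:

```python
from collections import Counter

def tally_criteria(lines):
    """Count how many gene assignments were made under each criterion.

    Assumes each hit after the PMID looks like SYMBOL:CRITERION.
    """
    counts = Counter()
    for line in lines:
        for hit in line.split()[1:]:  # skip the PMID column
            symbol, sep, criterion = hit.rpartition(":")
            if not sep:
                continue  # hit carries no criterion label
            counts[int(criterion)] += 1
    return counts

# The first ten lines of -c output shown above.
sample = """\
10846528        TAL1:5
10846530        CD44:2
10846539        TIE1:5
11256901        GCNT2:3 ERMAP:5
11322285        SERPINC1:8
11322288        DMD:5
11322289        PCNA:2
11322300        NBR1:5
11322327        PCNA:2
11322334        FRRS1:5 TIE1:5 TGFBR1:5
""".splitlines()

counts = tally_criteria(sample)
print(sorted(counts.items()))  # [(2, 3), (3, 1), (5, 8), (8, 1)]
print(sum(counts.values()))    # 13
```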

You might want to extract only a subset of genes according to specific match criteria. Say you are only interested in genes assigned on the basis of criterion 1 or criterion 5. Then simply use the program selectCriteria, which is included in this distribution, to post-process the medMatch output:

./medMatch -p gdlw.txt -t medsamp2006d.xml -c | selectCriteria 1 5 -
If the output from medMatch has been saved in someFile, type instead:
selectCriteria 1 5 someFile
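
For readers who want to apply a similar filter inside their own scripts, the following Python sketch illustrates the idea. It is not the implementation of selectCriteria, whose exact output format is not described here. The sketch assumes that a line is kept only if at least one of its hits carries a wanted criterion, and that the remaining hits are dropped; the function names criterion_of and select_criteria are made up for illustration:

```python
def criterion_of(hit):
    """Return the criterion number of a SYMBOL:CRITERION hit, or None."""
    symbol, sep, criterion = hit.rpartition(":")
    return int(criterion) if sep and criterion.isdigit() else None

def select_criteria(lines, wanted):
    """Yield lines restricted to hits whose criterion is in `wanted`.

    Assumption: lines left without any hit are dropped entirely.
    """
    wanted = set(wanted)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        pmid, hits = fields[0], fields[1:]
        kept = [h for h in hits if criterion_of(h) in wanted]
        if kept:
            yield pmid + "\t" + " ".join(kept)

# The first ten lines of -c output shown above.
sample = """\
10846528        TAL1:5
10846530        CD44:2
10846539        TIE1:5
11256901        GCNT2:3 ERMAP:5
11322285        SERPINC1:8
11322288        DMD:5
11322289        PCNA:2
11322300        NBR1:5
11322327        PCNA:2
11322334        FRRS1:5 TIE1:5 TGFBR1:5
""".splitlines()

result = list(select_criteria(sample, [1, 5]))
for line in result:
    print(line)
```

With the sample above, six of the ten lines survive the filter, and entry 11256901 retains only ERMAP:5.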

Bernhard Haubold 2007-03-13