order1vec.pl
Creates first order context vectors or feature vectors. Both of these vectors indicate whether or not a given feature occurs in the given context. The possible features are identified via Perl regular expressions of the form created by nsp2regex.pl.
order1vec.pl [OPTIONS] SVAL2 FEATURE_REGEX
A tokenized, preprocessed and well formatted Senseval-2 instance file showing instances whose context vectors are to be generated.
Context of each instance should be delimited within <context> and </context> tags. It is required that each XML tag in the Senseval-2 file appears on a separate line. Tokens should be space separated.
A file containing Perl regular expressions for features as created by nsp2regex.pl.
Sample FEATURE_REGEX files -
--------------------------------------------------------------------
/\s(<[^>]*>)*time(<[^>]*>)*\s/ @name = time
/\s(<[^>]*>)*task(<[^>]*>)*\s/ @name = task
/\s(<[^>]*>)*believe(<[^>]*>)*\s/ @name = believe
/\s(<[^>]*>)*life(<[^>]*>)*\s/ @name = life
/\s(<[^>]*>)*control(<[^>]*>)*\s/ @name = control
/\s(<[^>]*>)*words(<[^>]*>)*\s/ @name = words
/\s(<[^>]*>)*define(<[^>]*>)*\s/ @name = define
--------------------------------------------------------------------
Explanation -
--------------------------------------------------------------------
/\s(<[^>]*>)*personal(<[^>]*>)*\s(<[^>]*>\s)*(<[^>]*>)*computer(<[^>]*>)*\s/ @name = personal<>computer
/\s(<[^>]*>)*stock(<[^>]*>)*\s(<[^>]*>\s)*(<[^>]*>)*market(<[^>]*>)*\s/ @name = stock<>market
/\s(<[^>]*>)*electronic(<[^>]*>)*\s(<[^>]*>\s)*(<[^>]*>)*systems(<[^>]*>)*\s/ @name = electronic<>systems
/\s(<[^>]*>)*toll(<[^>]*>)*\s(<[^>]*>\s)*(<[^>]*>)*free(<[^>]*>)*\s/ @name = toll<>free
--------------------------------------------------------------------
Shows a bigram feature file in which each feature consists of two tokens separated by a single space or by any number of non-token sequences in <> brackets.
More explanation on feature regex creation is given in the perldoc of the nsp2regex program.
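To make the file layout concrete, the following Python sketch splits a FEATURE_REGEX line into its regex and its @name, then reports which features occur in one context. It is only an illustration, not the code order1vec.pl runs: the helper names parse_feature_line and features_present are hypothetical, and padding the context with a space on each side is an assumption made so that the leading and trailing \s in each regex can match at the context boundaries.

```python
import re

# Two lines in the FEATURE_REGEX layout shown above (unigram features).
FEATURE_LINES = [
    r"/\s(<[^>]*>)*time(<[^>]*>)*\s/ @name = time",
    r"/\s(<[^>]*>)*task(<[^>]*>)*\s/ @name = task",
]

def parse_feature_line(line):
    """Split '/REGEX/ @name = FEATURE' into (compiled regex, feature name)."""
    pattern_part, name = line.rsplit(" @name = ", 1)
    pattern = pattern_part.strip()[1:-1]  # drop the enclosing slashes
    return re.compile(pattern), name.strip()

def features_present(context, feature_lines):
    """Return the names of the features whose regex matches the context."""
    # Assumption: pad with spaces so \s can match at both ends of the context.
    padded = " " + context + " "
    found = []
    for line in feature_lines:
        regex, name = parse_feature_line(line)
        if regex.search(padded):
            found.append(name)
    return found

present = features_present("it is time to finish", FEATURE_LINES)
print(present)
```

Because each regex requires whitespace (optionally followed by <> sequences) on both sides of the token, a context such as "sometimes" does not match the "time" feature.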
NOTE: Null columns are discarded i.e. the features which do not occur in any of the contexts are dropped, and when --transpose option is specified (see below for details), contexts that do not contain any features are dropped as well.
By default, order1vec creates frequency context vectors that show how many times each feature occurs in the context. --binary will instead create binary context vectors where 1 indicates presence of feature and 0 indicates absence of feature in the context.
By default, context vectors will have sparse format. --dense will display output context vectors in dense format.
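The effect of --binary and --dense on a single context vector can be sketched as follows. This is a hedged illustration in Python rather than the program's own code; format_vector and its parameters are hypothetical names, and the sparse layout (index/value pairs with indices starting at 1) follows the output format described later in this document.

```python
def format_vector(counts, binary=False, dense=False):
    """Render one context vector as a line of text.

    counts holds the per-feature frequencies for one context,
    with list position 0 standing for feature 1.
    """
    # Binary mode replaces every non-zero frequency with 1.
    values = [1 if c else 0 for c in counts] if binary else list(counts)
    if dense:
        # Dense mode prints every dimension, zeros included.
        return " ".join(str(v) for v in values)
    # Sparse mode prints "index value" pairs for non-zero entries only.
    pairs = []
    for index, value in enumerate(values, start=1):
        if value:
            pairs.append(f"{index} {value}")
    return " ".join(pairs)

counts = [0, 0, 2, 0, 1]
print(format_vector(counts))                # 3 2 5 1
print(format_vector(counts, binary=True))   # 3 1 5 1
print(format_vector(counts, dense=True))    # 0 0 2 0 1
```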
Creates a RLABELFILE containing row labels for Cluto's --rlabelfile option. Each line in the RLABELFILE shows an instance id of the instance whose context vector is shown on the corresponding line on STDOUT.
Instance ids are extracted from the SVAL2 file by matching regex
/instance id\s*=\s*"IID"/
where 'IID' is an instance id of the <context> that follows this <instance> tag.
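As an illustration of this matching step, the Python sketch below applies the same pattern, with a capture group in place of IID, to a few lines and collects the ids in file order. The function name row_labels and the sample lines are hypothetical; only the regex itself comes from the text above.

```python
import re

# Pattern quoted in the text, with a capture group in place of IID.
INSTANCE_ID = re.compile(r'instance id\s*=\s*"([^"]*)"')

def row_labels(sval2_lines):
    """Collect instance ids in file order, one per matching line."""
    labels = []
    for line in sval2_lines:
        match = INSTANCE_ID.search(line)
        if match:
            labels.append(match.group(1))
    return labels

# Illustrative input only: each XML tag sits on its own line.
sample = [
    '<instance id="line-n.w8_020:7099:">',
    "<context>",
    "he picked up the line",
    "</context>",
    "</instance>",
    '<instance id = "line-n.w7_114:8965:">',
    "<context>",
    "the last line of the page",
    "</context>",
    "</instance>",
]

labels = row_labels(sample)
print("\n".join(labels))
```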
NOTE: When the --transpose option is specified, the contents of the RLABELFILE and the CLABELFILE are swapped.
Creates RCLASSFILE for Cluto's --rclassfile option. Each line in the RCLASSFILE shows true sense id of the instance whose context vector appears on the corresponding line on STDOUT.
Sense ids are extracted from the SVAL2 file by matching regex
/sense\s*id\s*=\s*"SID"\/>/
where SID is the true sense tag of the instance whose IID was most recently extracted by matching
/instance id\s*=\s*"IID"/
This option cannot be specified when the --transpose option is specified.
Creates a CLABELFILE containing column labels for Cluto's --clabelfile option. Each line in the CLABELFILE shows a feature representing corresponding column of the output context vectors.
Features are extracted from the FEATURE_REGEX file by matching string ``@name = FEATURE'' where FEATURE shows the feature name.
NOTE: When the --transpose option is specified, the contents of the RLABELFILE and the CLABELFILE are swapped.
Creates feature vectors instead of the default context vectors. The output is a Latent Semantic Analysis style feature-by-context matrix, instead of the default context-by-feature matrix that is native to SenseClusters. As a result, the contents of the RLABELFILE and CLABELFILE are swapped, i.e. the list of features is output to the RLABELFILE and the list of contexts is output to the CLABELFILE.
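The relationship between the two layouts can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the program's implementation: the function transpose_matrix is a hypothetical name, the matrix is held as plain lists, and contexts without any feature are dropped before transposing, as the NOTE above describes for --transpose mode.

```python
def transpose_matrix(matrix, context_ids, feature_names):
    """Turn a context-by-feature matrix into a feature-by-context matrix.

    Returns (transposed matrix, row labels, column labels).
    """
    # Drop contexts that contain no feature at all.
    kept = [(cid, row) for cid, row in zip(context_ids, matrix) if any(row)]
    kept_ids = [cid for cid, _ in kept]
    rows = [row for _, row in kept]
    # zip(*rows) walks the columns of the original matrix.
    transposed = [list(column) for column in zip(*rows)]
    # After transposing, features label the rows and contexts the columns.
    return transposed, list(feature_names), kept_ids

matrix = [
    [1, 1, 0],  # context c1
    [0, 0, 0],  # context c2 contains no feature
    [0, 2, 1],  # context c3
]
result, rlabels, clabels = transpose_matrix(
    matrix, ["c1", "c2", "c3"], ["time", "task", "life"]
)
print(result)   # [[1, 0], [1, 2], [0, 1]]
print(rlabels)  # ['time', 'task', 'life']
print(clabels)  # ['c1', 'c3']
```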
Creates a TEST_REGEX file containing only those regular expressions from the input FEATURE_REGEX file that matched at least once in the input SVAL2 file. This list can be different from the original list in FEATURE_REGEX when different training data has been used to identify features or when a different scope has been used for training and test data creation.
This option is required when the --transpose option is specified, in order to ensure creation of a compatible TEST_REGEX file that corresponds to the output of order1vec.pl in --transpose mode, so that both the output and the TEST_REGEX can be directly passed as inputs to the order2vec.pl program.
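The selection of TEST_REGEX entries can be pictured with the following Python sketch, which keeps only those FEATURE_REGEX lines whose regex matches at least one context. It is an assumption-laden illustration: select_test_regex is a hypothetical name, and each context is padded with a space on both sides so the boundary \s in the regexes can match.

```python
import re

def select_test_regex(feature_lines, contexts):
    """Keep the FEATURE_REGEX lines that match in at least one context."""
    kept = []
    for line in feature_lines:
        # A line looks like '/REGEX/ @name = FEATURE'; take the REGEX part.
        pattern = line.rsplit(" @name = ", 1)[0].strip()[1:-1]
        regex = re.compile(pattern)
        if any(regex.search(" " + context + " ") for context in contexts):
            kept.append(line)
    return kept

feature_lines = [
    r"/\s(<[^>]*>)*time(<[^>]*>)*\s/ @name = time",
    r"/\s(<[^>]*>)*task(<[^>]*>)*\s/ @name = task",
    r"/\s(<[^>]*>)*life(<[^>]*>)*\s/ @name = life",
]
contexts = ["the task at hand", "a new life"]

test_regex = select_test_regex(feature_lines, contexts)
for line in test_regex:
    print(line)
```

Here the "time" entry is left out because it matches in neither context, so the kept lines line up with the feature dimensions that actually appear in the output.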
Displays the name of a system-generated KEY file on the first line of STDOUT. The KEY file preserves the instance ids and sense tags of the instances in the given SVAL2 file. This information is used automatically by some of the clustering and evaluation programs in SenseClusters that operate on purely numeric instance formats. Select this option if you plan to run SenseClusters' clustering code.
This option cannot be specified when the --transpose option is specified, as no KEY file is generated in --transpose mode.
Specifies a file containing one or more Perl regexes that define the target word. By default, a file named target.regex is assumed to exist in the current directory.
This will exclude the target word from features if the target word (as specified by the --target option or default target.regex file) appears in the FEATURE_REGEX file. In other words, the feature dimensions of the output context vectors will not include the target word even if target word is listed in the FEATURE_REGEX file.
Displays this message.
Displays the version information.
When --transpose is not specified, order1vec automatically generates a KEY file that preserves the instance ids and sense tags of the SVAL2 instances.
Each line in the KEY file shows the instance id and one or more sense tags of the instance represented by the context vector on the corresponding line of STDOUT, i.e. the ith line in the KEY file shows the instance and sense ids of the ith instance in the SVAL2 file, which is also the ith vector displayed on STDOUT.
Sample KEY file looks like
<instance id="line-n.w8_020:7099:"/> <sense id="phone"/>
<instance id="line-n.w8_132:15431:"/> <sense id="phone"/>
<instance id="line-n.w8_027:13762:"/> <sense id="phone"/>
<instance id="line-n.w7_114:8965:"/> <sense id="text"/>
<instance id="line-n.w7_065:1553:"/> <sense id="product"/>
<instance id="line-n.w9_4:9437:"/> <sense id="product"/>
Or
<instance id="line-n.w8_020:7099:"/> <sense id="NOTAG"/>
<instance id="line-n.w7_111:238:"/> <sense id="NOTAG"/>
<instance id="line-n.w7_011:12078:"/> <sense id="NOTAG"/>
<instance id="line-n.w7_095:17576:"/> <sense id="NOTAG"/>
<instance id="line-n.w7_080:10129:"/> <sense id="NOTAG"/>
<instance id="line-n.w9_4:2358:"/> <sense id="NOTAG"/>
when the sense ids of instances are not available in the input SVAL2 file.
Or
<instance id="hard-a.sjm-180_1:"/> <sense id="HARD1"/> <sense id="HARD2"/>
<instance id="hard-a.br-l15:"/> <sense id="HARD1"/>
<instance id="hard-a.sjm-242_12:"/> <sense id="HARD2"/>
<instance id="hard-a.sjm-070_4:"/> <sense id="HARD1"/> <sense id="HARD3"/>
<instance id="hard-a.sjm-168_4:"/> <sense id="HARD3"/>
when some instances have multiple sense tags.
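A downstream script that needs the ids and sense tags back could read KEY lines along the lines of this Python sketch. The function name parse_key_line is hypothetical, and the sketch assumes each KEY line holds exactly one <instance> tag followed by one or more <sense> tags, as in the samples above.

```python
import re

INSTANCE = re.compile(r'<instance id="([^"]*)"/>')
SENSE = re.compile(r'<sense id="([^"]*)"/>')

def parse_key_line(line):
    """Return (instance id, list of sense tags) for one KEY file line."""
    instance = INSTANCE.search(line)
    if instance is None:
        raise ValueError("no <instance> tag found in KEY line")
    return instance.group(1), SENSE.findall(line)

key_lines = [
    '<instance id="hard-a.sjm-180_1:"/> <sense id="HARD1"/> <sense id="HARD2"/>',
    '<instance id="hard-a.br-l15:"/> <sense id="HARD1"/>',
    '<instance id="line-n.w8_020:7099:"/> <sense id="NOTAG"/>',
]

parsed = [parse_key_line(line) for line in key_lines]
for instance_id, senses in parsed:
    print(instance_id, senses)
```

Since the ith KEY line corresponds to the ith output vector, the position of each parsed entry in the list doubles as the row number of its vector.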
By default (unless --dense is specified), output vectors will be created in sparse format.
The first line on stdout will show 3 numbers separated by blanks as
N M NNZ
where
N = Number of instances in SVAL2 file
M = Number of features from the FEATURE_REGEX file that were found at least once in the SVAL2 file
NNZ = Total number of non-zero entries in all sparse vectors
Each line thereafter shows a single sparse context vector: the ith line after the first line shows the context vector of the ith instance in the given SVAL2 file.
Each sparse vector is a list of pairs of numbers separated by space such that the first number in a pair is the index of a non-zero value in the vector and the second number is a non-zero value itself corresponding to that index.
12 18 31
1 1 2 1
1 1 2 2 3 2 4 1
4 1
5 1 6 2
5 2 6 3 7 1 8 2 9 1
9 1
7 1
8 1 10 1
4 2 11 3 12 2 13 4 14 1
15 1
14 1 15 1
3 1 8 1 16 4 17 4 18 4
Note that,
Feature indices start from 1, to be consistent with Cluto's matrix format standard.
If --binary is set ON, all non-zero values will have value 1 showing mere presence of feature in the context rather than the frequency counts.
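The sparse layout can be read back with a few lines of code. The Python sketch below parses the header and the index/value pairs and expands each vector into a dense list; parse_sparse is a hypothetical helper, and the sample is just the first three vectors of the sample above with the header adjusted to match (3 vectors, 4 features, 7 non-zero entries).

```python
def parse_sparse(text):
    """Parse sparse order1vec-style output into a list of dense rows."""
    lines = text.splitlines()
    n, m, nnz = (int(token) for token in lines[0].split())
    body = lines[1:n + 1]
    if len(body) != n:
        raise ValueError("fewer vector lines than the header announces")
    rows = []
    seen = 0
    for line in body:
        numbers = [int(token) for token in line.split()]
        row = [0] * m
        # Numbers alternate: index, value, index, value, ...
        for index, value in zip(numbers[0::2], numbers[1::2]):
            row[index - 1] = value  # indices in the file start at 1
            seen += 1
        rows.append(row)
    if seen != nnz:
        raise ValueError("non-zero count does not match the header")
    return rows

SAMPLE = """3 4 7
1 1 2 1
1 1 2 2 3 2 4 1
4 1
"""

dense_rows = parse_sparse(SAMPLE)
for row in dense_rows:
    print(" ".join(str(value) for value in row))
```

Checking the pair count against NNZ is a cheap way to notice a truncated or hand-edited file before it reaches a clustering program.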
When --dense option is selected, order1vec will create output in dense vector format.
First line on STDOUT will show exactly two numbers separated by space. The first number indicates the number of vectors and the second number indicates the number of features (dimensions of the context vectors).
Each line thereafter shows a single context vector such that ith line after the 1st line shows the context vector of the ith instance in the SVAL2 file.
12 18
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 2 3 1 2 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0
0 0 0 2 0 0 0 0 0 0 3 2 4 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0
0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 4 4 4
Shows the same context vectors as shown in Sample Sparse Format, but in dense format.
Note that
When --showkey is selected, the output is exactly the same as described above, except that the first line shows the KEY file name that is required by the SenseClusters programs.
e.g.
<keyfile name="KEY"/>
12 18 31
1 1 2 1
1 1 2 2 3 2 4 1
4 1
5 1 6 2
5 2 6 3 7 1 8 2 9 1
9 1
7 1
8 1 10 1
4 2 11 3 12 2 13 4 14 1
15 1
14 1 15 1
3 1 8 1 16 4 17 4 18 4
Shows same vectors as shown in Sample Sparse Output when --showkey is ON. Value of KEY shown in the <keyfile> tag will be the system generated KEY file name.
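A program that consumes this output has to cope with the optional <keyfile> line before the numeric header. The Python sketch below separates the two; split_keyfile_header is a hypothetical name, and "example.key" merely stands in for whatever name the system generates.

```python
import re

KEYFILE = re.compile(r'<keyfile name="([^"]*)"/>')

def split_keyfile_header(lines):
    """Return (KEY file name or None, remaining matrix lines)."""
    if lines:
        match = KEYFILE.match(lines[0].strip())
        if match:
            return match.group(1), lines[1:]
    # No <keyfile> line: the output starts directly with the header.
    return None, lines

with_key = ['<keyfile name="example.key"/>', "2 2 2", "1 1", "2 1"]
without_key = ["2 2 2", "1 1", "2 1"]

print(split_keyfile_header(with_key))
print(split_keyfile_header(without_key))
```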
Note that --testregex TEST_REGEX is a required option when --transpose is specified.
By default (unless --dense is specified), output vectors will be created in sparse format.
The first line on stdout will show 3 numbers separated by blanks as
N M NNZ
where
N = Number of features from the FEATURE_REGEX file that were found at least once in the SVAL2 file
M = Number of instances in SVAL2 file, for which at least one feature was identified
NNZ = Total number of non-zero entries in all sparse vectors
Each line thereafter shows a single sparse feature vector: the ith line after the first line shows the feature vector of the ith feature in the created TEST_REGEX file.
Each sparse vector is a list of pairs of numbers separated by space such that the first number in a pair is the index of a non-zero value in the vector and the second number is a non-zero value itself corresponding to that index.
18 12 31
1 1 2 1
1 1 2 2
2 2 12 1
2 1 3 1 9 2
4 1 5 2
4 2 5 3
5 1 7 1
5 2 8 1 12 1
5 1 6 1
8 1
9 3
9 2
9 4
9 1 11 1
10 1 11 1
12 4
12 4
12 4
Note that,
Context indices start from 1, to be consistent with Cluto's matrix format standard.
If --binary is set ON, all non-zero values will have value 1 showing mere presence of feature in the context rather than the frequency counts.
When --dense option is selected, order1vec will create output in dense vector format.
First line on STDOUT will show exactly two numbers separated by space. The first number indicates the number of vectors and the second number indicates the number of contexts (dimensions of the feature vectors).
Each line thereafter shows a single feature vector such that the ith line after the 1st line shows the feature vector of the ith feature in the created TEST_REGEX file.
18 12
1 1 0 0 0 0 0 0 0 0 0 0
1 2 0 0 0 0 0 0 0 0 0 0
0 2 0 0 0 0 0 0 0 0 0 1
0 1 1 0 0 0 0 0 2 0 0 0
0 0 0 1 2 0 0 0 0 0 0 0
0 0 0 2 3 0 0 0 0 0 0 0
0 0 0 0 1 0 1 0 0 0 0 0
0 0 0 0 2 0 0 1 0 0 0 1
0 0 0 0 1 1 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 3 0 0 0
0 0 0 0 0 0 0 0 2 0 0 0
0 0 0 0 0 0 0 0 4 0 0 0
0 0 0 0 0 0 0 0 1 0 1 0
0 0 0 0 0 0 0 0 0 1 1 0
0 0 0 0 0 0 0 0 0 0 0 4
0 0 0 0 0 0 0 0 0 0 0 4
0 0 0 0 0 0 0 0 0 0 0 4
Shows the same feature vectors as the sparse sample for --transpose mode, but in dense format.
Note that
PDL - http://search.cpan.org/dist/PDL/
Math::SparseVector - http://search.cpan.org/dist/Math-SparseVector/
This program behaves unpredictably if the input file is not in Senseval-2 format. No error message is given, and the program still produces numeric output, but that output has no real meaning. A check should be added to make sure the input file is in Senseval-2 format.
Ted Pedersen, University of Minnesota, Duluth
Amruta Purandare, University of Minnesota, Duluth
Anagha Kulkarni, University of Minnesota, Duluth
Mahesh Joshi, University of Minnesota, Duluth
Copyright (c) 2002-2006,
Amruta Purandare, University of Pittsburgh. amruta@cs.pitt.edu
Ted Pedersen, University of Minnesota, Duluth. tpederse@umn.edu
Anagha Kulkarni, University of Minnesota, Duluth kulk020@d.umn.edu
Mahesh Joshi, University of Minnesota, Duluth josh031@d.umn.edu
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to
The Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.