This program lemmatizes a lexicon, using different methods to find a sentence that gives each term a context. You can choose among the following methods:
- querying Wikipedia
- querying StartPage
- looking for a sentence in a given corpus

Basic Usage

Command without any additional option

java -jar LexiconTagger.jar -lexicon [Path] -ouput [Path] [-addOptionsHere...]

Help option

To show the help message, add the "-help" flag option to any command, or launch LexiconTagger without any options.

java -jar LexiconTagger.jar


java -jar LexiconTagger.jar [option1] [option...] -help

Without context

To get a first result quickly, use the "-contextfree" option. It overrides the other context options (wikipedia, startpage and corpora).

java -jar LexiconTagger.jar -lexicon [Path] -ouput [Path] -contextfree

Using an external corpus

LexiconTagger can tag a French and English lexicon, so you have to specify a corpus for each language. Corpora are expected in txt format (tokenized full text without tags). If your corpus isn't tokenized, add the "-tokenize" flag option. Here is a usage example:

java -jar LexiconTagger.jar -lexicon [Path] -ouput [Path] -corpusFR [path] -corpusEN [path] [-tokenize]

If you have a two-language lexicon but only one corpus, you can specify a single corpus path. It will then be used only for the language it belongs to.

Using Wikipedia and/or StartPage

Because these options need an internet connection, make sure your proxy is configured if necessary (see the "Configure proxy" section). To search for context sentences on Wikipedia or StartPage, you only need to add the corresponding flag options. Make sure the "contextfree" and "help" options are not present, as they override these flags.

java -jar LexiconTagger.jar -lexicon [Path] -ouput [Path] -wikipedia -startpage

One complete basic usage example

java -jar LexiconTagger.jar -lexicon ../bioLexicon.xml -ouput ../taggedBioLexicon.xml -wikipedia -startpage -corpusFR ../corpora/bioCorpusFR.txt -corpusEN ../corpora/bioCorpusEN.txt -tsv

Advanced Usage

Limit the number of lemmatized terms

You can add a limit with the "-n" option in order to lemmatize only the first n terms of your lexicon. This is useful for testing.

java -jar LexiconTagger.jar -lexicon [Path] -ouput [Path] -contextfree -n 100

Enable tsv outputs

The "-tsv" flag option adds one tab-separated (TSV) output file per language. These outputs are simpler and are useful if, for instance, you work with the TermSuite software.

java -jar LexiconTagger.jar -lexicon [Path] -ouput [Path] -contextfree -tsv

Configure proxy

If you or your company use a proxy and you want to use the wikipedia and startpage flag options, you have to use the "proxy" and "port" options together. Here is an example:

java -jar LexiconTagger.jar -lexicon [Path] -ouput [Path] -wikipedia -startpage -proxy http://proxy.com -port 8080
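Proxy settings are often stored as a single URL that includes the port (for example in an environment variable such as http_proxy), whereas the example above passes the address and the port separately. The following is a minimal sketch of a shell helper that splits such a URL into the two values. The function name proxy_flags is hypothetical and not part of LexiconTagger, and the sketch assumes the URL has the form scheme://host:port.

```shell
#!/bin/sh
# Hypothetical helper, not part of LexiconTagger: split "scheme://host:port"
# into the two values passed to the "-proxy" and "-port" options.
proxy_flags() {
    url=${1%/}          # drop one trailing slash, if any
    port=${url##*:}     # everything after the last colon
    host=${url%:*}      # everything before the last colon
    case $port in
        ''|*[!0-9]*)
            # The part after the last colon is not a number: no usable port.
            echo "no numeric port found in: $1" >&2
            return 1
            ;;
    esac
    printf '%s\n' "-proxy $host -port $port"
}

# Possible usage (unquoted on purpose, so the result splits into four words):
#   java -jar LexiconTagger.jar -lexicon [Path] -ouput [Path] \
#       -wikipedia -startpage $(proxy_flags "$http_proxy")
```

The helper refuses URLs without a numeric port instead of guessing a default, since both options have to be given together.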

Convert an XML result to TSV

You may have used LexiconTagger to tag your lexicon with every option but forgotten to add the "tsv" output flag option. If so, you can use LexiconTagger to generate the TSV outputs from an already tagged lexicon without running the tagging again.

To do so you will have to use the "-xml2tsv" flag option like this:

java -jar LexiconTagger.jar -lexicon [Path] -ouput ../lexicon/taggedLexicon.xml -xml2tsv

This flag overrides all other options except "help", and creates the TSV files in the same directory as the file given to the "ouput" option. With this example you will get two TSV files: "../lexicon/FR_taggedLexicon.tsv" and "../lexicon/EN_taggedLexicon.tsv".
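The naming pattern in this example can be sketched as a small shell helper that prints the TSV paths to expect for a given output path. The function name expected_tsv_paths is hypothetical, and the sketch only assumes the pattern shown in the example above (language code, underscore, output file name with the .xml extension replaced by .tsv).

```shell
#!/bin/sh
# Hypothetical helper, not part of LexiconTagger: print the TSV paths that
# follow the <LANG>_<name>.tsv pattern for a given XML output path.
expected_tsv_paths() {
    output_path=$1
    dir=$(dirname "$output_path")            # directory of the XML output
    base=$(basename "$output_path" .xml)     # file name without ".xml"
    for lang in FR EN; do
        printf '%s/%s_%s.tsv\n' "$dir" "$lang" "$base"
    done
}

# Possible usage:
#   expected_tsv_paths ../lexicon/taggedLexicon.xml
```

This can be handy in a script that needs to check that both TSV files exist after the conversion.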

Use different models

There are three models used for each language:
1. model for OpenNLP tokenizer "-modelTokenFR | -modelTokenEN"
2. model for OpenNLP sentence detector "-modelSentFR | -modelSentEN"
3. model for Mate 3.3 lemmatizer "-modelLemmaFR | -modelLemmaEN"

Here is an example of how to specify a different model only for French tokenization and sentence detection (for any model that is not specified, the program uses the default one):

java -jar LexiconTagger.jar -lexicon [Path] -ouput [Path] -contextfree -modelTokenFR [new model path] -modelSentFR [new model path]

Use LexiconTagger for another language

LexiconTagger was initially made for a French and/or English lexicon, but it is possible to use it for another language. In that case, the wikipedia option is not available.

To do so, you have to pass the models for the new language through the model options of one of the supported languages, and pass the corpus through the corpus option of that same language. The output will carry the name of that supported language, but the result will be fine. Here is an example for Italian (IT):

java -jar LexiconTagger.jar -lexicon ../lexiconIT.xml -ouput ../taggedLexiconIT.xml -modelSentFR ../model/modelSentIT -modelTokenFR ../model/modelTokenIT -modelLemmaFR ../model/modelLemmaIT -corpusFR ../corpora/corpusIT.txt -tokenize -tsv -startpage 

This will produce two files:
- ../taggedLexiconIT.xml
- ../FR_taggedLexiconIT.tsv

You then only need to rename the TSV output to "IT_taggedLexiconIT.tsv".
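The renaming step can be sketched as a small shell helper. The function name rename_tsv_language is hypothetical and not part of LexiconTagger. The sketch assumes the TSV file name starts with a language code followed by an underscore, as in the example above.

```shell
#!/bin/sh
# Hypothetical helper, not part of LexiconTagger: replace the language prefix
# of a TSV output (the text up to the first underscore) with another code.
rename_tsv_language() {
    tsv_path=$1     # e.g. ../FR_taggedLexiconIT.tsv
    new_lang=$2     # e.g. IT
    dir=$(dirname "$tsv_path")
    file=$(basename "$tsv_path")
    # ${file#*_} removes the shortest leading match of "*_", i.e. "FR_" here.
    mv -- "$tsv_path" "$dir/${new_lang}_${file#*_}"
}

# Possible usage:
#   rename_tsv_language ../FR_taggedLexiconIT.tsv IT
```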


gael dot guibon at gmail.com gael dot guibon at inist.fr istex at inist.fr