sole-exception script

usage: sole-exception [-h] [-l LABEL] [--work-data WORK_DATA] [-v] [-m]
                      [-r RAM] [-t {cbeta,latin,pagel}]
                      DATABASE CORPUS CATALOGUE OUTPUT

Generate sole exception data and reports for unclassified works against
benchmark corpora.

positional arguments:
  DATABASE              Path to database file.
  CORPUS                Path to corpus.
  CATALOGUE             Path to catalogue file.
  OUTPUT                Path to directory where results will be written

options:
  -h, --help            show this help message and exit
  -l LABEL, --label LABEL
                        Label for unclassified works to analyse (default:
                        grey)
  --work-data WORK_DATA
                        Path to CSV file containing additional data for each
                        unclassified work (default: None)
  -v, --verbose         Display debug information; multiple -v options
                        increase the verbosity. (default: None)
  -m, --memory          Use RAM for temporary database storage.
                        
                        This may cause an out of memory error, in which case
                        run the command without this switch. (default: False)
  -r RAM, --ram RAM     Number of gigabytes of RAM to use. (default: 3)
  -t {cbeta,latin,pagel}, --tokenizer {cbeta,latin,pagel}
                        Type of tokenizer to use. The "cbeta" tokenizer is
                        suitable for the Chinese CBETA corpus (tokens are
                        single characters or workaround clusters within square
                        brackets). The "pagel" tokenizer is for use with the
                        transliterated Tibetan corpus (tokens are sets of word
                        characters plus some punctuation used to transliterate
                        characters). (default: cbeta)

The catalogue must have at least two labels, one of which must be the label
specified by the --label option.
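
The label requirement above can be checked before a long run. The following is
a minimal sketch, not part of the script itself. It assumes the catalogue is a
plain-text file with one whitespace-separated "work label" pair per line; that
format, the helper names and the sample work names are assumptions made for
illustration.

```python
from collections import Counter


def count_labels(catalogue_text):
    """Count works per label in catalogue text.

    Assumes one whitespace-separated "work label" pair per line
    (an assumption about the catalogue format).
    """
    counts = Counter()
    for line in catalogue_text.splitlines():
        parts = line.split()
        if len(parts) >= 2:  # skip blank lines and unlabelled works
            counts[parts[1]] += 1
    return counts


def check_catalogue(catalogue_text, label="grey"):
    """Return True if there are at least two labels, one being `label`."""
    counts = count_labels(catalogue_text)
    return len(counts) >= 2 and label in counts


# Hypothetical catalogue content: two benchmark works and one grey work.
sample = "T0001 benchmark-a\nT0002 benchmark-a\nT0003 grey\n"
print(check_catalogue(sample))
```

The default of "grey" mirrors the default of the --label option shown in the
usage text.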

If the supplied output directory already contains base data files (i.e., files
other than the reports), these will not be regenerated.

The --work-data option allows for extra columns of data to be added to the
report tables, immediately following the "work" column. A CSV file referenced
by this option must have a header row, with one of the fields called "work"
with values matching the names of the unclassified works in the catalogue. The
other labelled fields will be added as columns with the same name in the
report tables.
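
As an illustration of the file layout described above, the sketch below writes
a small work-data CSV with Python's standard csv module. Only the "work" header
is required by the description; the "translator" and "fascicles" columns, the
work names and the helper name are hypothetical.

```python
import csv
import io


def write_work_data(fh, rows, extra_fields):
    """Write rows to fh as CSV with "work" as the first header field."""
    writer = csv.DictWriter(fh, fieldnames=["work"] + list(extra_fields))
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical unclassified works and extra columns.
rows = [
    {"work": "T0003", "translator": "unknown", "fascicles": 2},
    {"work": "T0004", "translator": "disputed", "fascicles": 1},
]

# An in-memory buffer keeps the example self-contained; to write a real
# file, use open(path, "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
write_work_data(buffer, rows, ["translator", "fascicles"])
print(buffer.getvalue())
```

The values in the "work" column would need to match the names of the
unclassified works in the catalogue for the extra columns to be matched up.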

The data this command generates are kept in the specified output directory,
and will be reused if the command is run again with the same output directory.
The reports are output into the "reports" subdirectory of the output
directory.
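
Because the generated data is reused when the output directory is unchanged, it
can help to build the command line in one place and rerun it as needed. The
sketch below assembles an argument list using only the options shown in the
usage text; the helper name and the paths are hypothetical.

```python
import shlex


def build_command(database, corpus, catalogue, output, label=None,
                  work_data=None, memory=False, ram=None, tokenizer=None):
    """Return the sole-exception argument list for the given settings."""
    args = ["sole-exception"]
    if label is not None:
        args += ["--label", label]
    if work_data is not None:
        args += ["--work-data", work_data]
    if memory:
        args.append("--memory")
    if ram is not None:
        args += ["--ram", str(ram)]
    if tokenizer is not None:
        args += ["--tokenizer", tokenizer]
    # Positional arguments follow the options, in the documented order.
    args += [database, corpus, catalogue, output]
    return args


# Hypothetical paths, for illustration only.
cmd = build_command("tacl.db", "corpus", "catalogue.txt", "output",
                    label="grey", tokenizer="cbeta")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True). Running it
a second time with the same output directory would reuse the existing base data
rather than regenerate it.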