usage: sole-exception [-h] [-l LABEL] [--work-data WORK_DATA] [-v] [-m]
                      [-r RAM] [-t {cbeta,latin,pagel}]
                      DATABASE CORPUS CATALOGUE OUTPUT
Generate sole exception data and reports for unclassified works against
benchmark corpora.
positional arguments:
DATABASE Path to database file.
CORPUS Path to corpus.
CATALOGUE Path to catalogue file.
OUTPUT Path to directory where results will be written.
options:
-h, --help show this help message and exit
-l LABEL, --label LABEL
Label for unclassified works to analyse (default:
grey)
--work-data WORK_DATA
Path to CSV file containing additional data for each
unclassified work (default: None)
-v, --verbose Display debug information; multiple -v options
increase the verbosity. (default: None)
-m, --memory Use RAM for temporary database storage.
This may cause an out of memory error, in which case
run the command without this switch. (default: False)
-r RAM, --ram RAM Number of gigabytes of RAM to use. (default: 3)
-t {cbeta,latin,pagel}, --tokenizer {cbeta,latin,pagel}
Type of tokenizer to use. The "cbeta" tokenizer is
suitable for the Chinese CBETA corpus (tokens are
single characters or workaround clusters within square
brackets). The "pagel" tokenizer is for use with the
transliterated Tibetan corpus (tokens are sets of word
characters plus some punctuation used to transliterate
characters). (default: cbeta)
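The "cbeta" token description above can be illustrated with a small regular expression. This is only a sketch based on that wording, not the tokenizer the command actually uses; the pattern and the function name tokenize_cbeta_like are assumptions.

```python
import re

# Assumed pattern: a bracketed cluster such as "[口*兄]" is one token;
# otherwise each non-whitespace character is its own token.
TOKEN_PATTERN = re.compile(r"\[[^\]]*\]|\S")


def tokenize_cbeta_like(text):
    """Split text into single characters and square-bracket clusters."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize_cbeta_like("一[口*兄]二")
print(tokens)
```

The bracket alternative is listed first so that a cluster is consumed whole rather than split into its individual characters.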
The catalogue must have at least two labels, one of which must be the label
specified by the --label option.
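The label requirement can be checked before running the command. The sketch below assumes the catalogue has already been loaded into a mapping from work name to label; the function name, the mapping and the work names are hypothetical, and the catalogue file format itself is not described here.

```python
def check_catalogue_labels(catalogue, label="grey"):
    """Return the set of labels, raising ValueError if the rules are not met."""
    labels = set(catalogue.values())
    if len(labels) < 2:
        raise ValueError("catalogue must have at least two labels")
    if label not in labels:
        raise ValueError(f"catalogue has no works labelled {label!r}")
    return labels


# Hypothetical works and labels, for illustration only.
catalogue = {"work-a": "grey", "work-b": "benchmark-1", "work-c": "benchmark-1"}
labels = check_catalogue_labels(catalogue)
print(sorted(labels))
```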
If the supplied output directory already contains base data files (i.e., not the
reports), these will not be regenerated.
The --work-data option allows for extra columns of data to be added to the
report tables, immediately following the "work" column. A CSV file referenced
by this option must have a header row, with one of the fields called "work"
with values matching the names of the unclassified works in the catalogue. The
other labelled fields will be added as columns with the same name in the
report tables.
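A file suitable for --work-data could be produced along the following lines. This is a minimal sketch: the work names and the extra fields "title" and "date" are invented for illustration, and only the "work" header field is required by the description above.

```python
import csv
import io

# Hypothetical work names and extra fields ("title", "date").
rows = [
    {"work": "work-a", "title": "First work", "date": "400"},
    {"work": "work-b", "title": "Second work", "date": "450"},
]

# Written to an in-memory buffer here; open a real file with
# open(path, "w", newline="") to produce the CSV on disk.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["work", "title", "date"], lineterminator="\n"
)
writer.writeheader()
writer.writerows(rows)
work_data_csv = buffer.getvalue()
print(work_data_csv)
```

With a file like this, the report tables would be expected to gain "title" and "date" columns after the "work" column.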
The data this command generates are kept in the specified output directory,
and will be reused if the command is run again with the same output directory.
The reports are output into the "reports" subdirectory of the output
directory.
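Putting the usage together, an invocation might be assembled as below. This is a sketch only: every file and directory name is a placeholder, and the command is built but not executed. Only options listed in the help text above are used.

```python
import shlex
from pathlib import Path

# Placeholder paths; substitute real ones.
output_dir = Path("output")
command = [
    "sole-exception",
    "--label", "grey",
    "--tokenizer", "cbeta",
    "--work-data", "work-data.csv",
    "corpus.db",        # DATABASE
    "corpus",           # CORPUS
    "catalogue.txt",    # CATALOGUE
    str(output_dir),    # OUTPUT
]
command_line = shlex.join(command)
print(command_line)

# Reports are described as being written to this subdirectory.
reports_dir = output_dir / "reports"
print(reports_dir)
```

Passing the list to subprocess.run(command, check=True) would run it; rerunning with the same output directory should then reuse the previously generated data.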