usage: int-all [-h] [--min_size MINIMUM] [--max_size MAXIMUM] [-v] [-m]
               [-r RAM] [-t {cbeta,latin,pagel}]
               DATABASE CORPUS CATALOGUE DIRECTORY TRACKING

Produces intersect results files for every pair of labelled texts in the
supplied catalogue.

positional arguments:
  DATABASE              Path to database file.
  CORPUS                Path to corpus.
  CATALOGUE             Path to catalogue file.
  DIRECTORY             Path to output directory.
  TRACKING              Path to tracking file.

options:
  -h, --help            show this help message and exit
  --min_size MINIMUM    Minimum size of n-grams to generate, if DATABASE is
                        "memory". (default: 1)
  --max_size MAXIMUM    Maximum size of n-grams to generate, if DATABASE is
                        "memory". (default: 10)
  -v, --verbose         Display debug information; multiple -v options
                        increase the verbosity. (default: None)
  -m, --memory          Use RAM for temporary database storage. This may
                        cause an out of memory error, in which case run the
                        command without this switch. (default: False)
  -r RAM, --ram RAM     Number of gigabytes of RAM to use. (default: 3)
  -t {cbeta,latin,pagel}, --tokenizer {cbeta,latin,pagel}
                        Type of tokenizer to use. The "cbeta" tokenizer is
                        suitable for the Chinese CBETA corpus (tokens are
                        single characters or workaround clusters within
                        square brackets). The "pagel" tokenizer is for use
                        with the transliterated Tibetan corpus (tokens are
                        sets of word characters plus some punctuation used
                        to transliterate characters). (default: cbeta)

This process can take an extremely long time if the number of works in the
catalogue is large. Because the process tracks which intersections have
already been computed, it can be killed and later rerun with the same
tracking file and output directory, and it will resume where it left off.
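A resumable run might look like this (all file and directory names below are
illustrative, not prescribed by the command):

```shell
# Initial run: writes one results file per pair of labelled texts into
# results/ and records each completed pair in tracking.txt.
int-all work.db corpus/ catalogue.txt results/ tracking.txt

# If the process is interrupted, rerun the identical command with the same
# tracking file and output directory; completed pairs are skipped.
int-all work.db corpus/ catalogue.txt results/ tracking.txt
```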

If DATABASE is "memory", individual in-memory databases will be created for
each pair. This can be much more performant than using a single database that
includes data from all of the works in the corpus.
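A sketch of such an invocation, assuming illustrative file names and using the
options documented above:

```shell
# Passing the literal word "memory" in place of a database path builds a
# fresh in-memory database per pair; --min_size and --max_size bound the
# n-gram sizes generated, and -r caps the RAM used.
int-all --min_size 2 --max_size 8 -r 4 -t cbeta \
    memory corpus/ catalogue.txt results/ tracking.txt
```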
Results are extended and reduced.