paternity script

usage: paternity [-h] [-m] [-r RAM] [-t {cbeta,latin,pagel}]
                 DATABASE CATALOGUE PARENT_LABEL CHILD_LABEL UNRELATED_LABEL
                 MAXIMUM DIRECTORY

Generates a series of results files giving the n-grams in common between one
corpus and each work in a second corpus, that are not present in a third
corpus.

positional arguments:
  DATABASE              Path to database file.
  CATALOGUE             Path to catalogue file.
  PARENT_LABEL          Label of corpus whose n-grams are being matched with
                        the works in the CHILD_LABEL corpus.
  CHILD_LABEL           Label of corpus whose individual works are to be
                        compared.
  UNRELATED_LABEL       Label of corpus that provides n-grams to be excluded
                        from the matches between CHILD and PARENT corpora.
  MAXIMUM               Maximum number of works in the child corpus that each
                        result n-gram may be found in.
  DIRECTORY             Directory to output to. It must not already exist.

options:
  -h, --help            show this help message and exit
  -m, --memory          Use RAM for temporary database storage.
                        
                        This may cause an out of memory error, in which case
                        run the command without this switch. (default: False)
  -r RAM, --ram RAM     Number of gigabytes of RAM to use. (default: 3)
  -t {cbeta,latin,pagel}, --tokenizer {cbeta,latin,pagel}
                        Type of tokenizer to use. The "cbeta" tokenizer is
                        suitable for the Chinese CBETA corpus (tokens are
                        single characters or workaround clusters within square
                        brackets). The "pagel" tokenizer is for use with the
                        transliterated Tibetan corpus (tokens are sets of word
                        characters plus some punctuation used to transliterate
                        characters). (default: cbeta)

This script performs a 'paternity test' for each work in a corpus by finding
n-grams that it shares with a second corpus that are not found within a third
corpus. In the case of authorship attribution, these three corpora may be
described as:

  A. A benchmark corpus of works for a given figure.

  B. A group of works suspected of belonging to, or somehow aligning
     with, corpus A.

  C. A contrast corpus of works that count as definitively not related
     to A.

For each work in B (Bx), a results file is generated giving
(A asymmetric diff C) supplied intersect Bx: the n-grams of A that do not
occur in C and that also occur in Bx.

These results are then filtered to include only those n-grams that occur in at
most MAXIMUM works within the child corpus.
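
The set operations described above can be illustrated with a minimal Python
sketch. It is not the script's implementation (which works against the
database); it assumes each corpus has already been reduced to plain sets of
n-gram strings, and the function name, parameter names and sample data are
hypothetical.

```python
from collections import Counter


def paternity(parent, unrelated, child_works, maximum):
    """Illustrative only: return, per child work, the filtered shared n-grams.

    parent, unrelated: sets of n-grams (corpora A and C).
    child_works: dict mapping each work name in corpus B to its set of n-grams.
    maximum: largest number of child works an n-gram may occur in.
    """
    # (A asymmetric diff C): n-grams in the parent that never occur in the
    # unrelated corpus.
    candidates = parent - unrelated
    # Intersect those candidates with each child work Bx.
    shared = {work: candidates & ngrams for work, ngrams in child_works.items()}
    # Count how many child works each shared n-gram occurs in.
    counts = Counter(ngram for ngrams in shared.values() for ngram in ngrams)
    # Keep only n-grams found in at most `maximum` child works.
    return {
        work: {ngram for ngram in ngrams if counts[ngram] <= maximum}
        for work, ngrams in shared.items()
    }


# Hypothetical sample data, not taken from any real corpus.
parent = {"ab", "bc", "cd", "de"}
unrelated = {"cd"}
child_works = {
    "B1": {"ab", "bc", "xy"},
    "B2": {"ab", "de"},
    "B3": {"ab", "cd"},
}
result = paternity(parent, unrelated, child_works, maximum=2)
```

In this sample, "cd" is dropped because it occurs in the unrelated corpus, and
"ab" is dropped because it occurs in three child works, which exceeds the
maximum of 2.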

Three CSV results files are created in the specified output directory:

  * parent-minus-unrelated.csv - n-grams from the parent corpus
    that do not occur in the unrelated corpus

  * child.csv - all n-grams from the child corpus

  * parent-child.csv - n-grams shared between the previous two
    results, as filtered by the maximum number of works specified