usage: paternity [-h] [-m] [-r RAM] [-t {cbeta,latin,pagel}]
                 DATABASE CATALOGUE PARENT_LABEL CHILD_LABEL UNRELATED_LABEL
                 MAXIMUM DIRECTORY
Generates a series of results files giving the n-grams in common between one
corpus and each work in a second corpus, that are not present in a third
corpus.
positional arguments:
  DATABASE         Path to database file.
  CATALOGUE        Path to catalogue file.
  PARENT_LABEL     Label of corpus whose n-grams are being matched with
                   the works in the CHILD_LABEL corpus.
  CHILD_LABEL      Label of corpus whose individual works are to be
                   compared.
  UNRELATED_LABEL  Label of corpus that provides n-grams to be excluded
                   from the matches between CHILD and PARENT corpora.
  MAXIMUM          Maximum number of works in the child corpus that each
                   result n-gram may be found in.
  DIRECTORY        Directory to output to. It must not already exist.
options:
  -h, --help         show this help message and exit
  -m, --memory       Use RAM for temporary database storage.
                     This may cause an out of memory error, in which case
                     run the command without this switch. (default: False)
  -r RAM, --ram RAM  Number of gigabytes of RAM to use. (default: 3)
  -t {cbeta,latin,pagel}, --tokenizer {cbeta,latin,pagel}
                     Type of tokenizer to use. The "cbeta" tokenizer is
                     suitable for the Chinese CBETA corpus (tokens are
                     single characters or workaround clusters within square
                     brackets). The "pagel" tokenizer is for use with the
                     transliterated Tibetan corpus (tokens are sets of word
                     characters plus some punctuation used to transliterate
                     characters). (default: cbeta)
This script performs a 'paternity test' for each work in a corpus by finding
n-grams that it shares with a second corpus that are not found within a third
corpus. In the case of authorship attribution, these three corpora may be
described as:
A. A benchmark corpus of works for a given figure.
B. A group of works suspected of belonging to, or somehow aligning
with, corpus A.
C. A contrast corpus of works that count as definitively not related
to A.
The algorithm is that for each work in B (Bx), a results file is generated
giving (A asymmetric diff C) intersect Bx.
These results are then filtered to include only those n-grams that occur in at
most MAXIMUM (the user-supplied number) works within the child corpus.
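The set logic described above can be sketched in Python as follows. This is an
illustrative sketch only, not the script's actual implementation: it assumes
each corpus has already been reduced to plain sets of n-gram strings, and the
function name paternity_sketch and all of its parameter names are hypothetical.

```python
from collections import Counter


def paternity_sketch(parent, child_works, unrelated, maximum):
    """Illustrative only; names and data shapes are assumptions.

    parent, unrelated: sets of n-grams for corpora A and C.
    child_works: dict mapping each work name in corpus B to its set of n-grams.
    maximum: largest number of child works an n-gram may occur in.
    """
    # (A asymmetric diff C): parent n-grams absent from the unrelated corpus.
    parent_only = parent - unrelated
    # Intersect that difference with each child work Bx.
    shared = {work: parent_only & ngrams for work, ngrams in child_works.items()}
    # Count how many child works each shared n-gram occurs in.
    spread = Counter(ngram for ngrams in shared.values() for ngram in ngrams)
    # Keep only n-grams occurring in at most `maximum` child works.
    return {
        work: {ngram for ngram in ngrams if spread[ngram] <= maximum}
        for work, ngrams in shared.items()
    }


parent = {"ab", "bc", "cd", "de"}
unrelated = {"cd"}
child_works = {
    "B1": {"ab", "bc", "xx"},
    "B2": {"ab", "cd"},
    "B3": {"ab", "de"},
}
result = paternity_sketch(parent, child_works, unrelated, maximum=2)
```

In this toy data "cd" is dropped because it occurs in the unrelated corpus,
and "ab" is dropped because it occurs in three child works, which exceeds the
maximum of 2.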
Three CSV results files are created in the specified output directory:
* parent-minus-unrelated.csv - n-grams from the parent corpus
that do not occur in the unrelated corpus
* child.csv - all n-grams from the child corpus
* parent-child.csv - n-grams shared between the previous two
results, as filtered by the maximum number of works specified