Fusion-in-Decoder (FiD)

The FiD model was first described in Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering (G. Izacard, E. Grave, 2020); the original implementation is available in the authors' repository. The implementation we provide uses the RAG models as a backbone; thus, the options to use when running a FiD model are documented in the RAG README, as well as on the corresponding project page.

Simply swap --model rag with --model fid, and you’re good to go!
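As a quick sketch (the task, model file, and hyperparameters here are illustrative placeholders, not a vetted configuration), a FiD fine-tuning run might look like:

```
parlai train_model \
  --model fid --task wizard_of_wikipedia --model-file /tmp/my_fid_model \
  --generation-model bart --rag-retriever-type dpr --n-docs 5 \
  --batchsize 4 --fp16 true
```

All of the retrieval and generation flags below behave just as they do for RAG.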

DictionaryAgent Options

BPEHelper Arguments

| Argument | Description |
| --- | --- |
| `--bpe-vocab` | Path to pre-trained tokenizer vocab. |
| `--bpe-merge` | Path to pre-trained tokenizer merge. |
| `--bpe-dropout` | Use BPE dropout during training. |

FidAgent Options

optional arguments

| Argument | Description |
| --- | --- |
| `--gpu-beam-blocking` | Set to use CUDA kernel for beam search ngram blocking. Default: False. |
| `--verbose-topk` | Return the top-k logits in the act message, if verbose mode is set. Default: -1. |

TorchRankerAgent

| Argument | Description |
| --- | --- |
| `--candidates`, `--cands` | The source of candidates during training (see `TorchRankerAgent._build_candidates()` for details). Choices: batch, inline, fixed, batch-all-cands. Default: inline. |
| `--eval-candidates`, `--ecands` | The source of candidates during evaluation (defaults to the same value as `--candidates` if no flag is given). Choices: batch, inline, fixed, vocab, batch-all-cands. Default: inline. |
| `--interactive-candidates`, `--icands` | The source of candidates during interactive mode. Since batchsize == 1 in interactive mode, batch candidates cannot be used. Choices: fixed, inline, vocab. Default: fixed. |
| `--repeat-blocking-heuristic` | Block repeating previous utterances. Helpful for many models that score repeats highly, so switched on by default. Default: True. |
| `--fixed-candidates-path`, `--fcp` | A text file of fixed candidates to use for all examples, one candidate per line. |
| `--fixed-candidate-vecs` | One of "reuse", "replace", or a path to a file with vectors corresponding to the candidates at `--fixed-candidates-path`. The default path is `/path/to/model-file.<cands_name>`, where `<cands_name>` is the name of the file (not the full path) passed via `--fixed-candidates-path`. By default, this file is created once and reused. To replace it, use the "replace" option. Default: reuse. |
| `--encode-candidate-vecs` | Cache and save the encoding of the candidate vecs. Useful when interacting with the model in real time, or when evaluating on a fixed candidate set where the encoding of the candidates is independent of the input. Default: True. |
| `--init-model` | Initialize model with weights from this file. |
| `--train-predict` | Get predictions and calculate mean rank during the train step. Turning this on may slow down training. Default: False. |
| `--cap-num-predictions` | Limit the number of predictions in `output.text_candidates`. Default: 100. |
| `--ignore-bad-candidates` | Ignore examples for which the label is not present in the label candidates. The default behavior results in a RuntimeError. Default: False. |
| `--rank-top-k` | Return the top k ranked results if k > 0; otherwise sort every candidate according to the ranking. Default: -1. |
| `--return-cand-scores` | Return sorted candidate scores from `eval_step`. Default: False. |

Transformer Arguments

| Argument | Description |
| --- | --- |
| `--use-memories` | Use memories: must implement `_vectorize_memories` to use this. Default: False. |
| `--wrap-memory-encoder` | Wrap memory encoder with MLP. Default: False. |
| `--memory-attention` | Similarity measure for the basic attention mechanism when using a transformer to encode memories. Choices: cosine, dot, sqrt. Default: sqrt. |
| `--normalize-sent-emb` | Default: False. |
| `--share-encoders` | Default: True. |
| `--learn-embeddings` | Learn embeddings. Default: True. |
| `--data-parallel` | Use model in data parallel; requires multiple GPUs. Default: False. |
| `--reduction-type` | Type of reduction at the end of the transformer. Choices: first, max, mean. Default: mean. |

Polyencoder Arguments

| Argument | Description |
| --- | --- |
| `--polyencoder-type` | Type of polyencoder: either compute vectors using codes + attention, or simply take the first N vectors. Choices: codes, n_first. Default: codes. Recommended: codes. |
| `--poly-n-codes` | Number of vectors used to represent the context; in the case of n_first, the number of vectors that are considered. Default: 64. Recommended: 64. |
| `--poly-attention-type` | Type of the top aggregation layer of the poly-encoder (where the candidate representation is the key). Choices: basic, sqrt, multihead. Default: basic. Recommended: basic. |
| `--poly-attention-num-heads` | If poly-attention-type is multihead, the number of heads. Default: 4. |
| `--codes-attention-type` | Type of the codes attention. Choices: basic, sqrt, multihead. Default: basic. Recommended: basic. |
| `--codes-attention-num-heads` | If codes-attention-type is multihead, the number of heads. Default: 4. |

Transformer Arguments

| Argument | Description |
| --- | --- |
| `--embedding-size`, `--esz` | Size of all embedding layers. Must be a multiple of `--n-heads`. Default: 300. |
| `--n-layers`, `--nl` | Number of transformer layers. Default: 2. |
| `--ffn-size`, `--hid` | Hidden size of the FFN layers. Default: 300. |
| `--dropout` | Dropout used around embeddings and before layer normalizations. This is used in Vaswani 2017 and works well on large datasets. Default: 0.0. |
| `--attention-dropout` | Dropout used after attention softmax. This is not used in Vaswani 2017. Default: 0.0. |
| `--relu-dropout` | Dropout used after the ReLU in the FFN. Not used in Vaswani 2017, but used in Tensor2Tensor. Default: 0.0. |
| `--n-heads` | Number of multihead attention heads. Default: 2. |
| `--learn-positional-embeddings` | If off, sinusoidal embeddings are used. If on, position embeddings are learned from scratch. Default: False. |
| `--embeddings-scale` | Default: True. |
| `--n-segments` | Number of segments supported by the model. If zero, no segment or language embeddings are used. Default: 0. |
| `--variant` | Chooses locations of layer norms, etc. prelayernorm is used to match some fairseq models. Choices: xlm, prelayernorm, bart, aiayn. Default: aiayn. Recommended: xlm. |
| `--activation` | Nonlinear activation to use. AIAYN uses relu, but more recent papers prefer gelu. Choices: gelu, relu. Default: relu. Recommended: gelu. |
| `--output-scaling` | Scale the output of every transformer by this quantity. Default: 1.0. |
| `--share-word-embeddings` | Share the word embeddings table for candidate and context in the memory network. Default: True. |
| `--n-encoder-layers`, `--nel` | Overrides `--n-layers` for asymmetrical transformers. Default: -1. |
| `--n-decoder-layers`, `--ndl` | Overrides `--n-layers` for asymmetrical transformers. Default: -1. |
| `--model-parallel` | Shard the layers across multiple GPUs. Default: False. |
| `--checkpoint-activations` | Recompute activations on the backward pass to conserve memory. Default: False. |

RAG Model Args

| Argument | Description |
| --- | --- |
| `--generation-model` | Which generation model to use. Choices: transformer/generator, bart, t5. Default: bart. |
| `--query-model` | Which query model to use for DPR. Choices: bert, bert_from_parlai_rag, dropout_poly. Default: bert. |
| `--rag-model-type` | Which RAG model decoding to use. Choices: token, sequence, turn. Default: token. |
| `--thorough` | Whether to use thorough decoding for RAG sequence. Default: False. |
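For instance (illustrative values only), the generator and the DPR query encoder are chosen together:

```
parlai train_model --model fid --task wizard_of_wikipedia \
  --model-file /tmp/fid_t5 \
  --generation-model t5 --t5-model-arch t5-base --query-model bert
```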

Modified RAG Args

| Argument | Description |
| --- | --- |
| `--n-extra-positions` | Specify > 0 to include extra positions in the encoder, in which retrieved knowledge will go. In this setup, knowledge is appended instead of prepended. Default: 0. |
| `--gold-knowledge-passage-key` | Key in the observation dict that indicates the gold knowledge passage. Specify, along with `--debug`, to compute passage retrieval metrics at train/test time. Default: checked_sentence. |
| `--gold-knowledge-title-key` | Key in the observation dict that indicates the gold knowledge passage title. Specify, along with `--debug`, to compute passage retrieval metrics at train/test time. Default: title. |

RAG Retriever Args

| Argument | Description |
| --- | --- |
| `--rag-retriever-query` | What to use as the query for retrieval: one_turn retrieves only on the last turn of dialogue; full_history retrieves based on the full dialogue history. Choices: one_turn, full_history. Default: full_history. |
| `--rag-retriever-type` | Which retriever to use. Choices: dpr, tfidf, dpr_then_poly, poly_faiss, search_engine, search_term_faiss, observation_echo_retriever. Default: dpr. |
| `--retriever-debug-index` | Load the specified small index, for debugging. Choices: None, none, exact, compressed. |
| `--n-docs` | How many documents to retrieve. Default: 5. |
| `--min-doc-token-length` | Minimum amount of information to retain from a document. Useful if the encoder does not use much BPE token context. Default: 64. |
| `--max-doc-token-length` | Maximum amount of information to retain from a document. Default: 256. |
| `--rag-query-truncate` | Max token length of the query for retrieval. Default: 512. |
| `--print-docs` | Whether to print docs; usually useful during interactive mode. Default: False. |
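A hypothetical invocation swapping the default DPR retriever for TFIDF with a larger document budget:

```
parlai train_model --model fid --task wizard_of_wikipedia \
  --model-file /tmp/fid_tfidf \
  --rag-retriever-type tfidf --n-docs 10 --max-doc-token-length 256
```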

RAG Dense Passage Retriever Args

| Argument | Description |
| --- | --- |
| `--path-to-index` | Path to FAISS index. Default: zoo:hallucination/wiki_index_compressed/compressed_pq. |
| `--path-to-dense-embeddings` | Path to the dense embeddings directory used to build the index. The default (None) assumes embeddings and index are in the same directory. |
| `--dpr-model-file` | Path to DPR model. Default: zoo:hallucination/multiset_dpr/hf_bert_base.cp. |
| `--path-to-dpr-passages` | Path to DPR passages, used to build the index. Default: zoo:hallucination/wiki_passages/psgs_w100.tsv. |
| `--retriever-embedding-size` | Embedding size of the dense retriever. Default: 768. |

RAG TFIDF Retriever Args

| Argument | Description |
| --- | --- |
| `--tfidf-max-doc-paragraphs` | If > 0, limit documents to this many paragraphs. Default: -1. |
| `--tfidf-model-path` | Optionally override the TFIDF model. Default: zoo:wikipedia_full/tfidf_retriever/model. |

RAG DPR-POLY Retriever Args

| Argument | Description |
| --- | --- |
| `--dpr-num-docs` | In two-stage retrieval, how many DPR documents to retrieve. Default: 25. |
| `--poly-score-initial-lambda` | In two-stage retrieval, how much weight to give to the poly scores. Note: this is a learned parameter; specify the initial value here. Default: 0.5. |
| `--polyencoder-init-model` | Which model to initialize the polyencoder with. Specify wikito or reddit to use models from the ParlAI zoo; otherwise, provide a path to a trained polyencoder. Default: wikito. |

RAG PolyFAISS retriever args

| Argument | Description |
| --- | --- |
| `--poly-faiss-model-file` | Path to the poly-encoder used in poly-faiss retrieval. |

RAG ReGReT args

| Argument | Description |
| --- | --- |
| `--regret` | Retrieve, Generate, Retrieve, Tune: retrieve, generate, then retrieve again, and finally tune (refine). Default: False. |
| `--regret-intermediate-maxlen` | Maximum length of the intermediate ReGReT generation. Default: 32. |
| `--regret-model-file` | Path to the model for the initial round of retrieval. |
| `--regret-dict-file` | Path to the dict file for the model for the initial round of retrieval. |
| `--regret-override-index` | Overrides the index used with the ReGReT model when using separate models; i.e., the initial round of retrieval uses the same index as specified for the second round of retrieval. Default: False. |

RAG Indexer Args

| Argument | Description |
| --- | --- |
| `--indexer-type` | Granularity of the RAG indexer. Choose compressed to save on RAM, at the possible expense of accuracy. Choices: exact, compressed. Default: compressed. |
| `--indexer-buffer-size` | Buffer size for adding vectors to the index. Default: 65536. |
| `--compressed-indexer-factory` | If specified, builds the compressed indexer from a FAISS index factory string; see https://github.com/facebookresearch/faiss/wiki/The-index-factory for details. Default: IVF4096_HNSW128,PQ128. |
| `--compressed-indexer-nprobe` | How many centroids to search in the compressed indexer; see https://github.com/facebookresearch/faiss/wiki/Faiss-indexes#cell-probe-methods-indexivf-indexes for details. Default: 64. |
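As a sketch (illustrative values), an exact index trades RAM for retrieval accuracy, while a compressed index can be tuned by raising nprobe:

```
parlai eval_model --model-file /tmp/fid_bart --task wizard_of_wikipedia \
  --indexer-type compressed --compressed-indexer-nprobe 128
```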

RAG-Turn Args

| Argument | Description |
| --- | --- |
| `--rag-turn-n-turns` | How many turns to split retrieval into. The most recent text is split by the delimiter; all turns after the (n-1)th turn are combined. Default: 2. |
| `--rag-turn-marginalize` | How to marginalize RAG-Turn. Choices: doc_only, doc_then_turn. Default: doc_then_turn. |
| `--rag-turn-discount-factor` | Discount factor for turns beyond the most recent one; exponential discounting is used. Only considered if 0 < factor < 1.0. Default: 1.0. |

Torch Generator Agent

| Argument | Description |
| --- | --- |
| `--beam-size` | Beam size; if 1, greedy search. Default: 1. |
| `--beam-min-length` | Minimum length of predictions generated by beam search. Default: 1. |
| `--beam-context-block-ngram` | Size of n-grams to block in beam search from the context; val <= 0 implies no blocking. Default: -1. |
| `--beam-block-ngram` | Size of n-grams to block in beam search; val <= 0 implies no blocking. Default: -1. |
| `--beam-block-full-context` | Block n-grams from the full history context. Specify False to block only up to m tokens in the past, where m is the truncation parameter for the agent. Default: True. |
| `--beam-length-penalty` | Applies a length penalty. Set to 0 for no penalty. Default: 0.65. |
| `--inference` | Generation algorithm. Choices: beam, nucleus, delayedbeam, greedy, delayednucleusbeam, topk, factual_nucleus. Default: greedy. |
| `--topk` | K used in top-k sampling. Default: 10. |
| `--topp` | P used in nucleus sampling. Default: 0.9. |
| `--beam-delay` | Used in delayedbeam search. Default: 30. |
| `--lambda-decay` | Decay factor in factual nucleus sampling. Default: 0.9. |
| `--omega-bound` | Lower bound in factual nucleus sampling. Default: 0.3. |
| `--p-reset` | Whether to reset the p value in factual nucleus sampling at full stops. Default: True. |
| `--beam-block-list-filename` | Load a text file of hard blocks for beam search to never say. |
| `--temperature` | Temperature applied during decoding. Default: 1.0. |
| `--compute-tokenized-bleu` | If true, compute tokenized BLEU scores. Default: False. |
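For example (illustrative values), to switch from the default greedy decoding to beam search with n-gram blocking:

```
parlai interactive --model-file /tmp/fid_bart \
  --inference beam --beam-size 5 --beam-min-length 20 \
  --beam-block-ngram 3 --beam-context-block-ngram 3
```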

TorchAgent Arguments

| Argument | Description |
| --- | --- |
| `--interactive-mode`, `--i` | Whether to run in full interactive mode, i.e., generating text or retrieving from a full set of candidates, as needed for real dialogue. During training or quick validation (e.g., PPL for generation, or ranking a few candidates for ranking models), you may want this off. Scripts typically set their preferred default behavior at the start, e.g., eval scripts. Default: False. |
| `--embedding-type`, `--emb` | Strategy for initializing word embeddings. The default is random, but embeddings can also be preinitialized from GloVe or fastText. Preinitialized embeddings can also be fixed so they are not updated during training. Choices: random, glove, glove-fixed, fasttext, fasttext-fixed, fasttext_cc, fasttext_cc-fixed. Default: random. |
| `--embedding-projection`, `--embp` | If pretrained embeddings have a different dimensionality than your embedding size, the strategy for projecting to the correct size. If the dimensions are the same, this is ignored unless you append "-force" to your choice. Default: random. |
| `--fp16` | Use fp16 computations. Default: False. |
| `--fp16-impl` | Implementation of FP16 to use. Choices: safe, mem_efficient. Default: safe. |
| `--rank-candidates`, `--rc` | Whether the model should parse candidates for ranking. Default: False. |
| `--truncate`, `--tr` | Truncate input lengths to increase speed / use less memory. Default: -1. |
| `--text-truncate` | Text input truncation length; if not specified, defaults to `--truncate`. |
| `--label-truncate` | Label truncation length; if not specified, defaults to `--truncate`. |
| `--history-reversed` | Reverse the history. Default: False. |
| `--history-size`, `--histsz` | Number of past dialog utterances to remember. Default: -1. |
| `--person-tokens`, `--pt` | Add person tokens to the history: adds p1 in front of input text and p2 in front of past labels (when available) or past utterances generated by the model. These are added to the dictionary during initialization. Default: False. |
| `--split-lines` | Split the dialogue history on newlines and save in separate vectors. Default: False. |
| `--delimiter` | Join history lines with this token; defaults to newline. Default: `\n`. |
| `--special-tok-lst` | Comma-separated list of special tokens. In case of ambiguous parses from special tokens, the ordering provided in this arg sets precedence. |
| `-gpu`, `--gpu` | Which GPU to use. Default: -1. |
| `--no-cuda` | Disable GPUs even if available; otherwise, GPUs are used if available on the device. Default: False. |

Optimizer Arguments

| Argument | Description |
| --- | --- |
| `--optimizer`, `--opt` | Optimizer choice. Choices: adadelta, adagrad, adam, adamw, sparseadam, adamax, asgd, sgd, radam, rprop, rmsprop, optimizer, nadam, lbfgs, mem_eff_adam, adafactor. Default: sgd. |
| `--learningrate`, `--lr` | Learning rate. Default: 1. |
| `--gradient-clip`, `--clip` | Gradient clipping using the l2 norm. Default: 0.1. |
| `--adafactor-eps` | Epsilon values for the Adafactor optimizer: regularization constants for the squared gradient and parameter scale, respectively. Default: 1e-30,1e-3. Recommended: 1e-30,1e-3. |
| `--momentum`, `--mom` | If applicable, momentum value for the optimizer. Default: 0. |
| `--nesterov` | If applicable, whether to use Nesterov momentum. Default: True. |
| `--nus`, `--nu` | If applicable, nu value(s) for the optimizer. Can be a single value like 0.7 or a comma-separated tuple like 0.7,1.0. Default: 0.7. |
| `--betas`, `--beta` | If applicable, beta value(s) for the optimizer. Can be a single value like 0.9 or a comma-separated tuple like 0.9,0.999. Default: 0.9,0.999. |
| `--weight-decay`, `--wdecay` | Weight decay on the weights. |
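A hypothetical override of the sgd/lr=1 defaults, which are rarely appropriate for fine-tuning a large pre-trained generator (values illustrative):

```
parlai train_model --model fid --task wizard_of_wikipedia \
  --model-file /tmp/fid_bart \
  --optimizer adam --learningrate 1e-5 --betas 0.9,0.999 --gradient-clip 0.1
```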

BPEHelper Arguments

| Argument | Description |
| --- | --- |
| `--bpe-vocab` | Path to pre-trained tokenizer vocab. |
| `--bpe-merge` | Path to pre-trained tokenizer merge. |
| `--bpe-dropout` | Use BPE dropout during training. |

Learning Rate Scheduler

| Argument | Description |
| --- | --- |
| `--lr-scheduler` | Learning rate scheduler. Choices: reduceonplateau, none, fixed, invsqrt, cosine, linear. Default: reduceonplateau. |
| `--lr-scheduler-patience` | LR scheduler patience, in number of validation runs. If using the fixed scheduler, LR decays every `--lr-scheduler-patience` validations. Default: 3. |
| `--lr-scheduler-decay` | Decay factor for the LR scheduler, i.e., how much the LR is multiplied by when it is lowered. Default: 0.5. |
| `--invsqrt-lr-decay-gamma` | Constant used only to find the LR multiplier for the invsqrt scheduler. Must be set for `--lr-scheduler invsqrt`. Default: -1. |

T5 Args

| Argument | Description |
| --- | --- |
| `--t5-model-arch` | Choices: t5-small, t5-base, t5-large, t5-3b, t5-11b, google/flan-t5-small, google/flan-t5-base, google/flan-t5-large, google/flan-t5-xl, google/flan-t5-xxl. Default: t5-base. |
| `--t5-model-parallel` | Use HF model parallel. Default: False. |
| `--t5-dropout` | Dropout for T5. Default: 0.0. |
| `--t5-generation-config` | Task-specific generation config for T5. Choices: summarization, translation_en_to_de, translation_en_to_fr, translation_en_to_ro. |
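To use T5 as the FiD generator (illustrative; `--t5-model-parallel` can help fit the larger architectures across GPUs):

```
parlai train_model --model fid --task wizard_of_wikipedia \
  --model-file /tmp/fid_t5 \
  --generation-model t5 --t5-model-arch t5-large --t5-model-parallel true
```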

RagAgent Options

optional arguments

| Argument | Description |
| --- | --- |
| `--gpu-beam-blocking` | Set to use CUDA kernel for beam search ngram blocking. Default: False. |
| `--verbose-topk` | Return the top-k logits in the act message, if verbose mode is set. Default: -1. |
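A minimal sketch of a RAG-Token training run (the task and paths are placeholders):

```
parlai train_model --model rag --task wizard_of_wikipedia \
  --model-file /tmp/rag_token \
  --rag-model-type token --generation-model bart
```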

All remaining RagAgent arguments (the TorchRankerAgent, Transformer, Polyencoder, RAG model/retriever/indexer, RAG-Turn, Torch Generator Agent, TorchAgent, optimizer, BPEHelper, learning rate scheduler, and T5 argument groups) are identical to the FidAgent options listed above.

SearchQueryFAISSIndexFiDAgent Options

optional arguments

| Argument | Description |
| --- | --- |
| `--gpu-beam-blocking` | Set to use CUDA kernel for beam search ngram blocking. Default: False. |
| `--verbose-topk` | Return the top-k logits in the act message, if verbose mode is set. Default: -1. |
| `--woi-doc-chunk-size` | Document chunk size (in characters). Default: 500. |
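A sketch of running this agent, assuming it is addressed by its module path (the exact model string may differ across ParlAI versions, so check your install):

```
parlai train_model \
  --model parlai.agents.fid.fid:SearchQueryFAISSIndexFiDAgent \
  --task wizard_of_wikipedia --model-file /tmp/sq_faiss_fid \
  --rag-retriever-type search_term_faiss --woi-doc-chunk-size 500
```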

All remaining SearchQueryFAISSIndexFiDAgent argument groups, from TorchRankerAgent through T5 Args, are identical to the FidAgent options listed above. The following group is specific to the search-query agents.

Search Query FiD Params

| Argument | Description |
| --- | --- |
| `--search-query-generator-model-file` | Path to a query generator model. |
| `--search-query-generator-inference` | Generation algorithm for the search query generator model. Default: greedy. |
| `--search-query-generator-beam-min-length` | The beam_min_length opt for the search query generator model. Default: 1. |
| `--search-query-generator-beam-size` | The beam_size opt for the search query generator model. Default: 1. |
| `--search-query-generator-text-truncate` | Truncates the input to the search query generator model. Default: 512. |
| `--splitted-chunk-length` | The number of tokens in each document split. Default: 256. |
| `--doc-chunk-split-mode` | Split the docs by whitespace (word) or dict tokens (token). Choices: word, token. Default: word. |
| `--n-ranked-doc-chunks` | Number of document chunks to keep if the document is too long and must be split. Default: 1. |
| `--doc-chunks-ranker` | How to rank doc chunks. Choices: tfidf, head, woi_chunk_retrieved_docs. Default: head. |
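Illustrative use of the query generator parameters; the query generator model file is a placeholder path, not a vetted checkpoint:

```
parlai eval_model --model-file /tmp/sq_faiss_fid --task wizard_of_wikipedia \
  --search-query-generator-model-file /path/to/query_generator \
  --search-query-generator-beam-min-length 2 \
  --doc-chunks-ranker tfidf --n-ranked-doc-chunks 1
```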

SearchQueryFiDAgent Options

optional arguments

| Argument | Description |
| --- | --- |
| `--gpu-beam-blocking` | Set to use CUDA kernel for beam search ngram blocking. Default: False. |
| `--verbose-topk` | Return the top-k logits in the act message, if verbose mode is set. Default: -1. |
| `--woi-doc-chunk-size` | Document chunk size (in characters). Default: 500. |
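A sketch under the same module-path assumption as above; note that the search_engine retriever additionally requires a running search server (see the BlenderBot2 project for details):

```
parlai interactive \
  --model parlai.agents.fid.fid:SearchQueryFiDAgent \
  --model-file /tmp/search_fid \
  --rag-retriever-type search_engine \
  --doc-chunks-ranker woi_chunk_retrieved_docs --woi-doc-chunk-size 500
```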

TorchRankerAgent

Argument

Description

--candidates, --cands

The source of candidates during training (see TorchRankerAgent._build_candidates() for details).

Choices: batch, inline, fixed, batch-all-cands.

Default: inline.

--eval-candidates, --ecands

The source of candidates during evaluation (defaults to the samevalue as –candidates if no flag is given)

Choices: batch, inline, fixed, vocab, batch-all-cands.

Default: inline.

--interactive-candidates, --icands

The source of candidates during interactive mode. Since batchsize == 1 in interactive mode, we cannot use batch candidates.

Choices: fixed, inline, vocab.

Default: fixed.

--repeat-blocking-heuristic

Block repeating previous utterances. Helpful for many models that score repeats highly, so switched on by default.

Default: True.

--fixed-candidates-path, --fcp

A text file of fixed candidates to use for all examples, one candidate per line

--fixed-candidate-vecs

One of “reuse”, “replace”, or a path to a file with vectors corresponding to the candidates at --fixed-candidates-path. The default path is /path/to/model-file.<cands_name>, where <cands_name> is the name of the file (not the full path) passed via --fixed-candidates-path. By default, this file is created once and reused. To replace it, use the “replace” option.

Default: reuse.

--encode-candidate-vecs

Cache and save the encoding of the candidate vecs. This might be used when interacting with the model in real time or evaluating on a fixed candidate set, when the encoding of the candidates is independent of the input.

Default: True.

--init-model

Initialize model with weights from this file.

--train-predict

Get predictions and calculate mean rank during the train step. Turning this on may slow down training.

Default: False.

--cap-num-predictions

Limit to the number of predictions in output.text_candidates

Default: 100.

--ignore-bad-candidates

Ignore examples for which the label is not present in the label candidates. Default behavior results in RuntimeError.

Default: False.

--rank-top-k

Ranking returns the top k results if k > 0; otherwise, every candidate is sorted according to the ranking.

Default: -1.

--return-cand-scores

Return sorted candidate scores from eval_step

Default: False.
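
Taken together, --fixed-candidates-path, --encode-candidate-vecs, and --rank-top-k amount to: load a fixed candidate list, encode it once and cache the vectors, then score the cached vectors against each context. A minimal sketch with a stand-in random encoder (all names here are illustrative, not ParlAI's API):

```python
import torch

torch.manual_seed(0)

def encode(texts: list) -> torch.Tensor:
    # Stand-in for a real candidate/context encoder.
    return torch.randn(len(texts), 16)

candidates = ["hello there", "goodbye", "how are you"]  # --fixed-candidates-path
cand_vecs = encode(candidates)       # encoded once and cached
                                     # (--encode-candidate-vecs True)
context_vec = encode(["hi"])         # shape (1, 16)
scores = context_vec @ cand_vecs.T   # one score per candidate
top = scores.topk(2, dim=-1).indices[0]   # --rank-top-k 2
print([candidates[i] for i in top])
```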

Transformer Arguments

Argument

Description

--use-memories

Use memories: must implement the function _vectorize_memories to use this

Default: False.

--wrap-memory-encoder

Wrap memory encoder with MLP

Default: False.

--memory-attention

Similarity for basic attention mechanism when using transformer to encode memories

Choices: cosine, dot, sqrt.

Default: sqrt.

--normalize-sent-emb

Default: False.

--share-encoders

Default: True.

--learn-embeddings

Learn embeddings

Default: True.

--data-parallel

Use model in data parallel, requires multiple gpus

Default: False.

--reduction-type

Type of reduction at the end of transformer

Choices: first, max, mean.

Default: mean.
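
--reduction-type controls how the per-token encoder states collapse into a single vector. A shape-level sketch of the three choices (a real implementation also masks padding tokens before reducing):

```python
import torch

hidden = torch.randn(2, 7, 300)   # (batch, seq_len, --embedding-size)

first = hidden[:, 0]              # 'first': take the first token's state
maxed = hidden.max(dim=1).values  # 'max': elementwise max over tokens
mean = hidden.mean(dim=1)         # 'mean': average over tokens (default)
print(first.shape, maxed.shape, mean.shape)  # all torch.Size([2, 300])
```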

Polyencoder Arguments

Argument

Description

--polyencoder-type

Type of polyencoder: either we compute vectors using codes + attention, or we simply take the first N vectors.

Choices: codes, n_first.

Default: codes. Recommended: codes.

--poly-n-codes

Number of vectors used to represent the context. In the case of n_first, this is the number of vectors that are considered.

Default: 64. Recommended: 64.

--poly-attention-type

Type of the top aggregation layer of the poly-encoder (where the candidate representation is the key).

Choices: basic, sqrt, multihead.

Default: basic. Recommended: basic.

--poly-attention-num-heads

In case poly-attention-type is multihead, specify the number of heads

Default: 4.

--codes-attention-type

Type of attention used for the codes.

Choices: basic, sqrt, multihead.

Default: basic. Recommended: basic.

--codes-attention-num-heads

In case codes-attention-type is multihead, specify the number of heads

Default: 4.
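
In the codes variant, --poly-n-codes learned code vectors attend over the context tokens to produce that many context vectors, and the candidate representation then attends over those to yield a final score. A compact, shape-level sketch of the dot-product ('basic') attention path, written from the paper's description rather than ParlAI's module:

```python
import torch

B, T, D, N = 2, 12, 300, 64      # batch, context tokens, dim, --poly-n-codes
ctx = torch.randn(B, T, D)       # context token states
codes = torch.randn(N, D)        # learned codes

# Each code attends over the context tokens -> N context vectors per example.
attn = torch.softmax(ctx @ codes.T, dim=1)              # (B, T, N)
ctx_vecs = attn.transpose(1, 2) @ ctx                   # (B, N, D)

# The candidate attends over the N context vectors ('basic' top attention).
cand = torch.randn(B, D)
w = torch.softmax((ctx_vecs @ cand.unsqueeze(-1)).squeeze(-1), dim=-1)  # (B, N)
final_ctx = (w.unsqueeze(-1) * ctx_vecs).sum(dim=1)     # (B, D)
print((final_ctx * cand).sum(dim=-1))                   # one score per example
```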

Transformer Arguments

Argument

Description

--embedding-size, --esz

Size of all embedding layers. Must be a multiple of --n-heads.

Default: 300.

--n-layers, --nl

Number of transformer layers.

Default: 2.

--ffn-size, --hid

Hidden size of the FFN layers

Default: 300.

--dropout

Dropout used around embeddings and before layer normalizations. This is used in Vaswani 2017 and works well on large datasets.

Default: 0.0.

--attention-dropout

Dropout used after attention softmax. This is not used in Vaswani 2017.

Default: 0.0.

--relu-dropout

Dropout used after the ReLU in the FFN. Not used in Vaswani 2017, but used in Tensor2Tensor.

Default: 0.0.

--n-heads

Number of multihead attention heads

Default: 2.

--learn-positional-embeddings

If off, sinusoidal embeddings are used. If on, position embeddings are learned from scratch.

Default: False.

--embeddings-scale

Default: True.

--n-segments

The number of segments that support the model. If zero, no segment or language embeddings (langs_embedding) are used.

Default: 0.

--variant

Chooses locations of layer norms, etc. The prelayernorm variant is used to match some fairseq models.

Choices: xlm, prelayernorm, bart, aiayn.

Default: aiayn. Recommended: xlm.

--activation

Nonlinear activation to use. AIAYN uses relu, but more recent papers prefer gelu.

Choices: gelu, relu.

Default: relu. Recommended: gelu.

--output-scaling

Scale the output of every transformer by this quantity.

Default: 1.0.

--share-word-embeddings

Share the word embeddings table for candidate and context in the memory network.

Default: True.

--n-encoder-layers, --nel

This will override --n-layers for asymmetrical transformers.

Default: -1.

--n-decoder-layers, --ndl

This will override --n-layers for asymmetrical transformers.

Default: -1.

--model-parallel

Shard the layers across multiple GPUs.

Default: False.

--checkpoint-activations

Recompute activations on backward pass to conserve memory.

Default: False.
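
With --learn-positional-embeddings off, positions use the fixed sinusoidal encoding of Vaswani et al. (2017): even channels get sines and odd channels get cosines at geometrically increasing wavelengths. A self-contained sketch:

```python
import torch

def sinusoidal_positions(n_pos: int, dim: int) -> torch.Tensor:
    pos = torch.arange(n_pos, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, dim, 2, dtype=torch.float32)
    angles = pos / torch.pow(10000.0, i / dim)   # (n_pos, dim // 2)
    out = torch.zeros(n_pos, dim)
    out[:, 0::2] = torch.sin(angles)
    out[:, 1::2] = torch.cos(angles)
    return out

print(sinusoidal_positions(128, 300).shape)  # (n_positions, --embedding-size)
```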

RAG Model Args

Argument

Description

--generation-model

Which generation model to use

Choices: transformer/generator, bart, t5.

Default: bart.

--query-model

Which query model to use for DPR.

Choices: bert, bert_from_parlai_rag, dropout_poly.

Default: bert.

--rag-model-type

Which rag model decoding to use.

Choices: token, sequence, turn.

Default: token.

--thorough

Whether to use thorough decoding for rag sequence.

Default: False.

Modified RAG Args

Argument

Description

--n-extra-positions

Specify > 0 to include extra positions in the encoder, in which retrieved knowledge will go. In this setup, knowledge is appended instead of prepended.

Default: 0.

--gold-knowledge-passage-key

Key in the observation dict that indicates the gold knowledge passage. Specify, along with --debug, to compute passage retrieval metrics at train/test time.

Default: checked_sentence.

--gold-knowledge-title-key

Key in the observation dict that indicates the gold knowledge passage title. Specify, along with --debug, to compute passage retrieval metrics at train/test time.

Default: title.

RAG Retriever Args

Argument

Description

--rag-retriever-query

What to use as the query for retrieval. one_turn retrieves only on the last turn of dialogue; full_history retrieves based on the full dialogue history.

Choices: one_turn, full_history.

Default: full_history.

--rag-retriever-type

Which retriever to use

Choices: dpr, tfidf, dpr_then_poly, poly_faiss, search_engine, search_term_faiss, observation_echo_retriever.

Default: dpr.

--retriever-debug-index

Load specified small index, for debugging.

Choices: None, none, exact, compressed.

--n-docs

How many documents to retrieve

Default: 5.

--min-doc-token-length

Minimum amount of information to retain from a document. Useful to define if the encoder does not use a lot of BPE token context.

Default: 64.

--max-doc-token-length

Maximum amount of information to retain from document.

Default: 256.

--rag-query-truncate

Max token length of query for retrieval.

Default: 512.

--print-docs

Whether to print docs; usually useful during interactive mode.

Default: False.
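
The difference between the two --rag-retriever-query modes is simply how much history forms the query string:

```python
history = ["Who wrote Dune?", "Frank Herbert.", "When was it published?"]

one_turn = history[-1]              # 'one_turn': last utterance only
full_history = "\n".join(history)   # 'full_history' (default): whole dialogue
print(repr(one_turn))
print(repr(full_history))
```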

RAG Dense Passage Retriever Args

Argument

Description

--path-to-index

Path to FAISS Index.

Default: zoo:hallucination/wiki_index_compressed/compressed_pq.

--path-to-dense-embeddings

Path to dense embeddings directory used to build index. Default None will assume embeddings and index are in the same directory.

--dpr-model-file

Path to DPR Model.

Default: zoo:hallucination/multiset_dpr/hf_bert_base.cp.

--path-to-dpr-passages

Path to DPR passages, used to build index.

Default: zoo:hallucination/wiki_passages/psgs_w100.tsv.

--retriever-embedding-size

Embedding size of dense retriever

Default: 768.

RAG TFIDF Retriever Args

Argument

Description

--tfidf-max-doc-paragraphs

If > 0, limit documents to this many paragraphs

Default: -1.

--tfidf-model-path

Optionally override TFIDF model.

Default: zoo:wikipedia_full/tfidf_retriever/model.

RAG DPR-POLY Retriever Args

Argument

Description

--dpr-num-docs

In two stage retrieval, how many DPR documents to retrieve

Default: 25.

--poly-score-initial-lambda

In two stage retrieval, how much weight to give to the poly scores. Note: this is a learned parameter; specify the initial value here.

Default: 0.5.

--polyencoder-init-model

Which init model to initialize polyencoder with. Specify wikito or reddit to use models from the ParlAI zoo; otherwise, provide a path to a trained polyencoder

Default: wikito.

RAG PolyFAISS retriever args

Argument

Description

--poly-faiss-model-file

Path to poly-encoder for use in poly-faiss retrieval.

RAG ReGReT args

Argument

Description

--regret

Retrieve, Generate, Retrieve, Tune. Retrieve, generate, then retrieve again, and finally tune (refine).

Default: False.

--regret-intermediate-maxlen

Maximum length in intermediate regret generation

Default: 32.

--regret-model-file

Path to model for initial round of retrieval.

--regret-dict-file

Path to dict file for model for initial round of retrieval.

--regret-override-index

Overrides the index used with the ReGReT model, if using separate models; i.e., the initial round of retrieval uses the same index as specified for the second round of retrieval.

Default: False.

RAG Indexer Args

Argument

Description

--indexer-type

Granularity of RAG Indexer. Choose compressed to save on RAM costs, at the possible expense of accuracy.

Choices: exact, compressed.

Default: compressed.

--indexer-buffer-size

Buffer size for adding vectors to the index

Default: 65536.

--compressed-indexer-factory

If specified, builds the compressed indexer from a FAISS index factory string. See https://github.com/facebookresearch/faiss/wiki/The-index-factory for details.

Default: IVF4096_HNSW128,PQ128.

--compressed-indexer-nprobe

How many centroids to search in compressed indexer. See https://github.com/facebookresearch/faiss/wiki/Faiss-indexes#cell-probe-methods-indexivf-indexes for details

Default: 64.
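
The default --compressed-indexer-factory string is handed to FAISS's index factory. A runnable, scaled-down illustration on random vectors (the documented default, IVF4096_HNSW128,PQ128, needs far more training data than this toy example):

```python
import faiss
import numpy as np

d = 768                                    # --retriever-embedding-size
xb = np.random.rand(20_000, d).astype("float32")

# Same recipe as IVF4096_HNSW128,PQ128, scaled down to train quickly.
index = faiss.index_factory(d, "IVF256,PQ64")
index.train(xb)
index.add(xb)

# --compressed-indexer-nprobe: how many inverted lists to probe per query.
faiss.extract_index_ivf(index).nprobe = 64
distances, ids = index.search(xb[:2], 5)
print(ids)
```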

RAG-Turn Args

Argument

Description

--rag-turn-n-turns

How many turns to split up retrieval into. The most recent text is split by the delimiter; all turns after the (n-1)th turn are combined.

Default: 2.

--rag-turn-marginalize

How to marginalize rag-turn.

Choices: doc_only, doc_then_turn.

Default: doc_then_turn.

--rag-turn-discount-factor

Discount factor for turns beyond most recent one. We employ exponential discounting. Only considered if 0 < factor < 1.0.

Default: 1.0.
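
The turn discount is exponential: the most recent turn gets weight 1 and each older turn is multiplied by the factor once more. A toy illustration (how the weights enter the marginalization is governed by --rag-turn-marginalize):

```python
def turn_weights(n_turns: int, discount: float) -> list:
    # Most recent turn first; weight decays exponentially with age.
    return [discount ** age for age in range(n_turns)]

print(turn_weights(4, 0.5))  # [1.0, 0.5, 0.25, 0.125]
print(turn_weights(4, 1.0))  # default 1.0: no discounting
```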

Torch Generator Agent

Argument

Description

--beam-size

Beam size, if 1 then greedy search

Default: 1.

--beam-min-length

Minimum length of prediction to be generated by the beam search

Default: 1.

--beam-context-block-ngram

Size of n-grams to block in beam search from the context. val <= 0 implies no blocking.

Default: -1.

--beam-block-ngram

Size of n-grams to block in beam search. val <= 0 implies no blocking.

Default: -1.

--beam-block-full-context

Block n-grams from the full history context. Specify False to block up to m tokens in the past, where m is the truncation parameter for the agent.

Default: True.

--beam-length-penalty

Applies a length penalty. Set to 0 for no penalty.

Default: 0.65.

--inference

Generation algorithm

Choices: beam, nucleus, delayedbeam, greedy, delayednucleusbeam, topk, factual_nucleus.

Default: greedy.

--topk

K used in Top K sampling

Default: 10.

--topp

P used in nucleus sampling

Default: 0.9.

--beam-delay

Used in delayedbeam search

Default: 30.

--lambda-decay

Decay factor in factual nucleus sampling

Default: 0.9.

--omega-bound

Lower bound in factual nucleus sampling

Default: 0.3.

--p-reset

Whether to reset p value in factual nucleus at full stops

Default: True.

--beam-block-list-filename

Load a text file of hard blocks for beam search to never say.

--temperature

Temperature applied during decoding.

Default: 1.0.

--compute-tokenized-bleu

If true, compute tokenized bleu scores

Default: False.
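
Among the sampling-based --inference modes, topk samples from the --topk highest-probability tokens, while nucleus keeps the smallest set of tokens whose cumulative probability exceeds --topp. A self-contained sketch of one nucleus sampling step:

```python
import torch

def nucleus_sample(logits: torch.Tensor, p: float = 0.9) -> int:
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Smallest prefix whose mass exceeds p (always keep the top token).
    cutoff = int((cumulative < p).sum().item()) + 1
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(kept, 1).item()
    return int(sorted_ids[choice])

torch.manual_seed(0)
print(nucleus_sample(torch.randn(50), p=0.9))
```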

TorchAgent Arguments

Argument

Description

--interactive-mode, --i

Whether in full interactive mode or not, which means generating text or retrieving from a full set of candidates, which is necessary to actually do full dialogue. However, during training or quick validation (e.g. PPL for generation or ranking a few candidates for ranking models) you might want these set to off. Typically, scripts can set their preferred default behavior at the start, e.g. eval scripts.

Default: False.

--embedding-type, --emb

Choose between different strategies for initializing word embeddings. Default is random, but can also preinitialize from Glove or Fasttext. Preinitialized embeddings can also be fixed so they are not updated during training.

Choices: random, glove, glove-fixed, fasttext, fasttext-fixed, fasttext_cc, fasttext_cc-fixed.

Default: random.

--embedding-projection, --embp

If pretrained embeddings have a different dimensionality than your embedding size, strategy for projecting to the correct size. If the dimensions are the same, this is ignored unless you append “-force” to your choice.

Default: random.

--fp16

Use fp16 computations.

Default: False.

--fp16-impl

Implementation of FP16 to use

Choices: safe, mem_efficient.

Default: safe.

--rank-candidates, --rc

Whether the model should parse candidates for ranking.

Default: False.

--truncate, --tr

Truncate input lengths to increase speed / use less memory.

Default: -1.

--text-truncate

Text input truncation length: if not specified, this will default to truncate

--label-truncate

Label truncation length: if not specified, this will default to truncate

--history-reversed

Reverse the history

Default: False.

--history-size, --histsz

Number of past dialog utterances to remember.

Default: -1.

--person-tokens, --pt

Add person tokens to history. Adds p1 in front of input text and p2 in front of past labels (when available) or past utterances generated by the model. These tokens are added to the dictionary during initialization.

Default: False.

--split-lines

Split the dialogue history on newlines and save in separate vectors

Default: False.

--delimiter

Join history lines with this token, defaults to newline

Default: \n.

--special-tok-lst

Comma separated list of special tokens. In case of ambiguous parses from special tokens, the ordering provided in this arg sets precedence.

-gpu, --gpu

Which GPU to use

Default: -1.

--no-cuda

Disable GPUs even if available. Otherwise, GPUs will be used if available on the device.

Default: False.
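
The history flags compose roughly as: keep the last --history-size utterances, optionally prefix person tokens, and join with --delimiter (or keep separate vectors with --split-lines). A simplified sketch; the alternating-speaker assumption and the token strings here are illustrative:

```python
def flatten_history(utterances: list, history_size: int = -1,
                    person_tokens: bool = False, delimiter: str = "\n") -> str:
    kept = utterances if history_size < 0 else utterances[-history_size:]
    if person_tokens:
        # Simplified: assume strictly alternating speakers.
        kept = [("__p2__ " if i % 2 else "__p1__ ") + u
                for i, u in enumerate(kept)]
    return delimiter.join(kept)

turns = ["hi", "hello!", "how are you?"]
print(flatten_history(turns, history_size=2, person_tokens=True))
```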

Optimizer Arguments

Argument

Description

--optimizer, --opt

Optimizer choice.

Choices: adadelta, adagrad, adam, adamw, sparseadam, adamax, asgd, sgd, radam, rprop, rmsprop, optimizer, nadam, lbfgs, mem_eff_adam, adafactor.

Default: sgd.

--learningrate, --lr

Learning rate

Default: 1.

--gradient-clip, --clip

Gradient clipping using l2 norm

Default: 0.1.

--adafactor-eps

Epsilon values for adafactor optimizer: regularization constants for square gradient and parameter scale respectively

Default: 1e-30,1e-3. Recommended: 1e-30,1e-3.

--momentum, --mom

If applicable, momentum value for optimizer.

Default: 0.

--nesterov

If applicable, whether to use nesterov momentum.

Default: True.

--nus, --nu

If applicable, nu value(s) for optimizer. Can use a single value like 0.7 or a comma-separated tuple like 0.7,1.0.

Default: 0.7.

--betas, --beta

If applicable, beta value(s) for optimizer. Can use a single value like 0.9 or a comma-separated tuple like 0.9,0.999.

Default: 0.9,0.999.

--weight-decay, --wdecay

Weight decay on the weights.

BPEHelper Arguments

Argument

Description

--bpe-vocab

Path to pre-trained tokenizer vocab

--bpe-merge

Path to pre-trained tokenizer merge

--bpe-dropout

Use BPE dropout during training.

Learning Rate Scheduler

Argument

Description

--lr-scheduler

Learning rate scheduler.

Choices: reduceonplateau, none, fixed, invsqrt, cosine, linear.

Default: reduceonplateau.

--lr-scheduler-patience

LR scheduler patience, in number of validation runs. If using the fixed scheduler, LR is decayed every --lr-scheduler-patience validations.

Default: 3.

--lr-scheduler-decay

Decay factor for LR scheduler, or how much LR is multiplied by when it is lowered.

Default: 0.5.

--invsqrt-lr-decay-gamma

Constant used only to find the LR multiplier for the invsqrt scheduler. Must be set for --lr-scheduler invsqrt.

Default: -1.

T5 Args

Argument

Description

--t5-model-arch

Choices: t5-small, t5-base, t5-large, t5-3b, t5-11b, google/flan-t5-small, google/flan-t5-base, google/flan-t5-large, google/flan-t5-xl, google/flan-t5-xxl.

Default: t5-base.

--t5-model-parallel

Use HF model parallel

Default: False.

--t5-dropout

Dropout for T5

Default: 0.0.

--t5-generation-config

Task specific generation config for T5

Choices: summarization, translation_en_to_de, translation_en_to_fr, translation_en_to_ro.

Search Query FiD Params

Argument

Description

--search-query-generator-model-file

Path to a query generator model.

--search-query-generator-inference

Generation algorithm for the search query generator model

Default: greedy.

--search-query-generator-beam-min-length

The beam_min_length opt for the search query generator model

Default: 1.

--search-query-generator-beam-size

The beam_size opt for the search query generator model

Default: 1.

--search-query-generator-text-truncate

Truncates the input to the search query generator model

Default: 512.

--splitted-chunk-length

The number of tokens in each document split

Default: 256.

--doc-chunk-split-mode

Split the docs by white space (word) or dict tokens.

Choices: word, token.

Default: word.

--n-ranked-doc-chunks

Number of document chunks to keep if a document is too long and has to be split.

Default: 1.

--doc-chunks-ranker

How to rank doc chunks.

Choices: tfidf, head, woi_chunk_retrieved_docs.

Default: head.

SearchQuerySearchEngineFiDAgent Options

SearchQuerySearchEngineFiDAgent accepts the same options as SearchQueryFiDAgent, documented above; they are not repeated here. In addition, it takes the following search engine parameters.

Search Engine FiD Params

Argument

Description

--search-server

A search server address.

WizIntGoldDocRetrieverFiDAgent Options

WizIntGoldDocRetrieverFiDAgent accepts the same options as SearchQueryFiDAgent, documented above; they are not repeated here.