Image+Seq2Seq

The Image+Seq2Seq agent is a model that combines image features with a sequence-to-sequence transformer generator. It is a core component of the dodecaDialogue task.

Basic Examples

Train an Image+Seq2Seq model on an image captioning task:

python parlai/scripts/train_model.py -m image_seq2seq -t flickr30k --image-mode resnext101_32x48d_wsl -mf /tmp/model

Train an Image+Seq2Seq model on a dialogue task:

python parlai/scripts/train_model.py -m image_seq2seq -t convai2 -mf /tmp/model

Multi-task train an Image+Seq2Seq model on a dialogue and captioning task:

python parlai/scripts/train_model.py -m image_seq2seq -t flickr30k,convai2 -mf /tmp/model --image-mode resnext101_32x48d_wsl

DictionaryAgent Options

BPEHelper Arguments

Argument

Description

--bpe-vocab

Path to pre-trained tokenizer vocab

--bpe-merge

Path to pre-trained tokenizer merge

--bpe-dropout

Use BPE dropout during training.

ImageSeq2seqAgent Options

optional arguments

Argument

Description

--gpu-beam-blocking

Set to use CUDA kernel for beam search ngram blocking

Default: False.

--verbose-topk

Return the topk logits in the act message, if verbose mode is set.

Default: -1.

Transformer Arguments

Argument

Description

--embedding-size, --esz

Size of all embedding layers. Must be a multiple of --n-heads.

Default: 300.
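The divisibility requirement comes from multi-head attention splitting the embedding evenly across heads. A minimal sketch of that check, where head_dim is a hypothetical helper written for illustration and not a ParlAI function:

```python
def head_dim(embedding_size: int, n_heads: int) -> int:
    """Return the per-head dimension, or raise if the split is uneven.

    Hypothetical helper for illustration only; not part of ParlAI.
    """
    if n_heads <= 0:
        raise ValueError("n_heads must be positive")
    if embedding_size % n_heads != 0:
        raise ValueError(
            f"embedding size {embedding_size} is not a multiple of {n_heads} heads"
        )
    return embedding_size // n_heads


# With the defaults listed on this page (--embedding-size 300, --n-heads 2),
# each attention head works on 150 dimensions.
print(head_dim(300, 2))
```

A combination such as --embedding-size 300 with --n-heads 7 would fail this check, because 300 is not divisible by 7.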

--n-layers, --nl

Number of transformer layers.

Default: 2.

--ffn-size, --hid

Hidden size of the FFN layers

Default: 300.

--dropout

Dropout used around embeddings and before layer normalizations. This is used in Vaswani 2017 and works well on large datasets.

Default: 0.0.

--attention-dropout

Dropout used after attention softmax. This is not used in Vaswani 2017.

Default: 0.0.

--relu-dropout

Dropout used after the ReLU in the FFN. Not used in Vaswani 2017, but used in Tensor2Tensor.

Default: 0.0.

--n-heads

Number of multihead attention heads

Default: 2.

--learn-positional-embeddings

If off, sinusoidal embeddings are used. If on, position embeddings are learned from scratch.

Default: False.
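For readers unfamiliar with the non-learned option, the sketch below builds a sinusoidal position table following the formulation in Vaswani 2017 (sine on even indices, cosine on odd indices). It is written in plain Python for illustration; the function name is hypothetical, and the exact layout used inside ParlAI may differ.

```python
import math
from typing import List


def sinusoidal_embeddings(n_positions: int, dim: int) -> List[List[float]]:
    """Build a fixed (non-learned) position table of shape [n_positions][dim]."""
    table = []
    for pos in range(n_positions):
        row = [0.0] * dim
        # i walks over the even indices; each (i, i + 1) pair shares one frequency.
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            row[i] = math.sin(angle)
            if i + 1 < dim:
                row[i + 1] = math.cos(angle)
        table.append(row)
    return table


table = sinusoidal_embeddings(n_positions=4, dim=6)
print(len(table), len(table[0]))
```

Because the table depends only on position and dimension, it adds no trainable parameters, whereas learned position embeddings add one trainable vector per position.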

--embeddings-scale

Default: True.

--n-segments

Number of segments the model supports. If zero, no segment embeddings (and no langs_embedding) are used.

Default: 0.

--variant

Chooses locations of layer norms, etc. prelayernorm is used to match some fairseq models

Choices: xlm, prelayernorm, bart, aiayn.

Default: aiayn. Recommended: xlm.

--activation

Nonlinear activation to use. AIAYN uses relu, but more recent papers prefer gelu.

Choices: gelu, relu.

Default: relu. Recommended: gelu.

--output-scaling

Scale the output of every transformer by this quantity.

Default: 1.0.

--share-word-embeddings

Share word embeddings table for candidate and context in the memory network

Default: True.

--n-encoder-layers, --nel

Overrides --n-layers for the encoder in asymmetrical transformers

Default: -1.

--n-decoder-layers, --ndl

Overrides --n-layers for the decoder in asymmetrical transformers

Default: -1.
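One way to read the two override flags together with --n-layers is sketched below. It assumes that a non-positive value (such as the default of -1) means "fall back to --n-layers"; resolve_layers is a hypothetical name used only for this illustration.

```python
from typing import Tuple


def resolve_layers(
    n_layers: int, n_encoder_layers: int = -1, n_decoder_layers: int = -1
) -> Tuple[int, int]:
    """Return (encoder_layers, decoder_layers).

    Assumption: a value <= 0 means "not set", so --n-layers is used instead.
    """
    enc = n_encoder_layers if n_encoder_layers > 0 else n_layers
    dec = n_decoder_layers if n_decoder_layers > 0 else n_layers
    return enc, dec


# Symmetric model: both stacks follow --n-layers.
print(resolve_layers(2))
# Asymmetric model: a deeper encoder with a shallow decoder.
print(resolve_layers(2, n_encoder_layers=8))
```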

--model-parallel

Shard the layers across multiple GPUs.

Default: False.

--checkpoint-activations

Recompute activations on backward pass to conserve memory.

Default: False.

Torch Generator Agent

Argument

Description

--beam-size

Beam size, if 1 then greedy search

Default: 1.

--beam-min-length

Minimum length of prediction to be generated by the beam search

Default: 1.

--beam-context-block-ngram

Size of n-grams to block in beam search from the context. A value <= 0 disables blocking.

Default: -1.

--beam-block-ngram

Size of n-grams to block in beam search. A value <= 0 disables blocking.

Default: -1.
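The idea behind both n-gram blocking flags can be sketched as follows: before extending a hypothesis, find every token that would complete an n-gram already present in some source sequence (the hypothesis itself for --beam-block-ngram, the context for --beam-context-block-ngram). This is an illustrative pure-Python version with hypothetical function names, not ParlAI's implementation.

```python
from typing import Optional, Sequence, Set, Tuple


def seen_ngrams(tokens: Sequence[int], n: int) -> Set[Tuple[int, ...]]:
    """All contiguous n-grams in tokens (empty if tokens is shorter than n)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def blocked_next_tokens(
    hypothesis: Sequence[int], n: int, source: Optional[Sequence[int]] = None
) -> Set[int]:
    """Tokens that would repeat an n-gram from source if appended to hypothesis.

    source defaults to the hypothesis itself (self-blocking); pass the
    context tokens to sketch context blocking instead.
    """
    if n <= 0:  # a value <= 0 disables blocking
        return set()
    if len(hypothesis) < n - 1:  # not enough tokens to form a prefix yet
        return set()
    src = hypothesis if source is None else source
    prefix = tuple(hypothesis[len(hypothesis) - (n - 1):])
    return {ng[-1] for ng in seen_ngrams(src, n) if ng[:-1] == prefix}


# The hypothesis ends in (1, 2); the trigram (1, 2, 3) was already generated,
# so token 3 would be blocked at the next step.
print(blocked_next_tokens([1, 2, 3, 1, 2], n=3))
```

In a beam search, the scores of the returned tokens would typically be set to a very low value so that those continuations are never selected.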

--beam-block-full-context

Block n-grams from the full history context. Specify False to block only up to m tokens in the past, where m is the agent's truncation parameter.

Default: True.

--beam-length-penalty

Applies a length penalty. Set to 0 for no penalty.

Default: 0.65.
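To illustrate why a value of 0 means "no penalty", the sketch below rescales a hypothesis's cumulative log-probability by a length-dependent factor raised to the penalty. The specific factor ((1 + length) / 6) is an assumption made for this example; consult the ParlAI source for the exact form it uses.

```python
def rescore(log_prob: float, length: int, length_penalty: float = 0.65) -> float:
    """Length-normalized score for a finished hypothesis.

    Assumed form: log_prob / ((1 + length) / 6) ** length_penalty.
    With length_penalty == 0 the divisor is 1, so the score is unchanged.
    """
    return log_prob / (((1 + length) / 6) ** length_penalty)


# Two hypotheses with the same total log-probability: the longer one is
# favored once the penalty is applied.
print(rescore(-6.0, 10), rescore(-6.0, 20))
```

Since log-probabilities are negative, dividing by a factor that grows with length moves longer hypotheses closer to zero, which counteracts the tendency of beam search to prefer short outputs.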

--inference

Generation algorithm

Choices: beam, nucleus, delayedbeam, greedy, delayednucleusbeam, topk, factual_nucleus.

Default: greedy.

--topk

K used in Top K sampling

Default: 10.

--topp

P used in nucleus sampling

Default: 0.9.
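The roles of K and P in the two sampling schemes can be sketched on a toy next-token distribution. The helpers below are hypothetical and operate on plain dictionaries for readability; a real implementation works on tensors of logits.

```python
from typing import Dict


def top_k_filter(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def nucleus_filter(probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the smallest set of top tokens whose mass reaches p, then renormalize."""
    kept, mass = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        mass += prob
        if mass >= p:
            break
    return {tok: prob / mass for tok, prob in kept}


probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(sorted(top_k_filter(probs, k=2)))      # tokens kept by top-k
print(sorted(nucleus_filter(probs, p=0.9)))  # tokens kept by nucleus sampling
```

Top-k always keeps a fixed number of candidates, while nucleus sampling keeps more candidates when the distribution is flat and fewer when it is peaked. The next token is then sampled from the filtered distribution.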

--beam-delay

Used in delayedbeam search

Default: 30.

--lambda-decay

Decay factor in factual nucleus sampling

Default: 0.9.

--omega-bound

Lower bound in factual nucleus sampling

Default: 0.3.

--p-reset

Whether to reset p value in factual nucleus at full stops

Default: True.

--beam-block-list-filename

Load a text file of hard blocks for beam search to never say.

--temperature

Temperature to add during decoding

Default: 1.0.
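A common way to apply a decoding temperature, and the one assumed in this sketch, is to divide the logits by the temperature before the softmax. The function name is hypothetical.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over logits / temperature (assumed scaling; temperature must be > 0)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
# Lower temperatures sharpen the distribution; higher temperatures flatten it.
for t in (0.5, 1.0, 2.0):
    print(t, softmax_with_temperature(logits, t))
```

At a temperature of 1.0 the distribution is unchanged, which matches the default listed above.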

--compute-tokenized-bleu

If true, compute tokenized bleu scores

Default: False.

TorchAgent Arguments

Argument

Description

--interactive-mode, --i

Whether in full interactive mode or not, which means generating text or retrieving from a full set of candidates, which is necessary to actually do full dialogue. However, during training or quick validation (e.g. PPL for generation or ranking a few candidates for ranking models) you might want these set to off. Typically, scripts can set their preferred default behavior at the start, e.g. eval scripts.

Default: False.

--embedding-type, --emb

Choose between different strategies for initializing word embeddings. Default is random, but can also preinitialize from Glove or Fasttext. Preinitialized embeddings can also be fixed so they are not updated during training.

Choices: random, glove, glove-fixed, fasttext, fasttext-fixed, fasttext_cc, fasttext_cc-fixed.

Default: random.

--embedding-projection, --embp

If pretrained embeddings have a different dimensionality than your embedding size, strategy for projecting to the correct size. If the dimensions are the same, this is ignored unless you append “-force” to your choice.

Default: random.

--fp16

Use fp16 computations.

Default: False.

--fp16-impl

Implementation of FP16 to use

Choices: safe, mem_efficient.

Default: safe.

--rank-candidates, --rc

Whether the model should parse candidates for ranking.

Default: False.

--truncate, --tr

Truncate input lengths to increase speed / use less memory.

Default: -1.

--text-truncate

Text input truncation length. If not specified, defaults to the value of --truncate.

--label-truncate

Label truncation length. If not specified, defaults to the value of --truncate.

--history-reversed

Reverse the history

Default: False.

--history-size, --histsz

Number of past dialog utterances to remember.

Default: -1.

--person-tokens, --pt

Add person tokens to history. Adds p1 in front of input text and p2 in front of past labels (when available) or past utterances generated by the model. These tokens are added to the dictionary during initialization.

Default: False.

--split-lines

Split the dialogue history on newlines and save in separate vectors

Default: False.

--delimiter

Join history lines with this token, defaults to newline

Default: \n.
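How --history-size and --delimiter interact can be sketched as below. The sketch assumes that a non-positive history size (such as the default of -1) keeps every past utterance; build_context is a hypothetical helper, not a ParlAI function.

```python
from typing import List


def build_context(
    utterances: List[str], history_size: int = -1, delimiter: str = "\n"
) -> str:
    """Join the most recent utterances into one context string.

    Assumption: history_size <= 0 means "remember everything".
    """
    kept = utterances if history_size <= 0 else utterances[-history_size:]
    return delimiter.join(kept)


dialogue = ["hi there", "hello, how are you?", "fine, thanks"]
print(repr(build_context(dialogue)))
print(repr(build_context(dialogue, history_size=2, delimiter=" | ")))
```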

--special-tok-lst

Comma separated list of special tokens. In case of ambiguous parses from special tokens, the ordering provided in this arg sets precedence.

-gpu, --gpu

Which GPU to use

Default: -1.

--no-cuda

Disable GPUs even if available. Otherwise, GPUs are used if available on the device.

Default: False.

Optimizer Arguments

Argument

Description

--optimizer, --opt

Optimizer choice.

Choices: adadelta, adagrad, adam, adamw, sparseadam, adamax, asgd, sgd, radam, rprop, rmsprop, optimizer, nadam, lbfgs, mem_eff_adam, adafactor.

Default: sgd.

--learningrate, --lr

Learning rate

Default: 1.

--gradient-clip, --clip

Gradient clipping using l2 norm

Default: 0.1.
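Clipping by L2 norm rescales the whole gradient vector when its norm exceeds the threshold, rather than clipping each element independently. A minimal sketch over a flat list of gradient values, assuming a non-positive threshold disables clipping; the function name is hypothetical:

```python
import math
from typing import List


def clip_by_l2_norm(grads: List[float], max_norm: float) -> List[float]:
    """Scale grads so their L2 norm is at most max_norm.

    Assumption: max_norm <= 0 means clipping is turned off.
    """
    total = math.sqrt(sum(g * g for g in grads))
    if max_norm <= 0 or total <= max_norm:
        return list(grads)
    scale = max_norm / total
    return [g * scale for g in grads]


# A gradient of norm 5 is scaled down to norm 0.1 (the default threshold);
# its direction is preserved.
print(clip_by_l2_norm([3.0, 4.0], max_norm=0.1))
```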

--adafactor-eps

Epsilon values for adafactor optimizer: regularization constants for square gradient and parameter scale respectively

Default: 1e-30,1e-3. Recommended: 1e-30,1e-3.

--momentum, --mom

If applicable, momentum value for optimizer.

Default: 0.

--nesterov

If applicable, whether to use nesterov momentum.

Default: True.

--nus, --nu

If applicable, nu value(s) for optimizer. Can be a single value like 0.7 or a comma-separated tuple like 0.7,1.0

Default: 0.7.

--betas, --beta

If applicable, beta value(s) for optimizer. Can be a single value like 0.9 or a comma-separated tuple like 0.9,0.999

Default: 0.9,0.999.

--weight-decay, --wdecay

Weight decay on the weights.

BPEHelper Arguments

Argument

Description

--bpe-vocab

Path to pre-trained tokenizer vocab

--bpe-merge

Path to pre-trained tokenizer merge

--bpe-dropout

Use BPE dropout during training.

Learning Rate Scheduler

Argument

Description

--lr-scheduler

Learning rate scheduler.

Choices: reduceonplateau, none, fixed, invsqrt, cosine, linear.

Default: reduceonplateau.

--lr-scheduler-patience

LR scheduler patience, in number of validation runs. If using the fixed scheduler, LR is decayed every --lr-scheduler-patience validations.

Default: 3.

--lr-scheduler-decay

Decay factor for LR scheduler, or how much LR is multiplied by when it is lowered.

Default: 0.5.
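The interplay of patience and decay for the reduceonplateau scheduler can be sketched as follows. The sketch assumes semantics similar to PyTorch's ReduceLROnPlateau in "min" mode, without thresholds or cooldown; PlateauSchedule is a hypothetical class used only for illustration.

```python
class PlateauSchedule:
    """Multiply the LR by decay once validation loss has failed to improve
    for more than patience consecutive validation runs."""

    def __init__(self, lr: float, patience: int = 3, decay: float = 0.5):
        self.lr = lr
        self.patience = patience
        self.decay = decay
        self.best = float("inf")
        self.bad_runs = 0

    def step(self, val_loss: float) -> float:
        if val_loss < self.best:
            self.best = val_loss
            self.bad_runs = 0
        else:
            self.bad_runs += 1
            if self.bad_runs > self.patience:
                self.lr *= self.decay
                self.bad_runs = 0
        return self.lr


sched = PlateauSchedule(lr=1.0)
for loss in [1.0, 1.1, 1.1, 1.1, 1.1]:
    print(loss, sched.step(loss))
```

With the defaults (patience 3, decay 0.5), the learning rate is halved on the fourth consecutive validation run without improvement.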

--invsqrt-lr-decay-gamma

Constant used only to find the lr multiplier for the invsqrt scheduler. Must be set for --lr-scheduler invsqrt.

Default: -1.
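As an illustration of how a gamma constant can shape an inverse-square-root schedule, the sketch below computes a learning-rate multiplier that starts at 1.0 and decays with the step count. The formula is an assumption chosen for this example, not a statement of ParlAI's exact implementation, and invsqrt_multiplier is a hypothetical name.

```python
import math


def invsqrt_multiplier(step: int, gamma: float) -> float:
    """Assumed form: sqrt(max(1, gamma)) / sqrt(max(1, gamma + step)).

    The multiplier is 1.0 at step 0 and falls off as 1 / sqrt(step) for
    large step counts; a larger gamma delays the decay.
    """
    return math.sqrt(max(1.0, gamma)) / math.sqrt(max(1.0, gamma + step))


for step in (0, 300, 900):
    print(step, invsqrt_multiplier(step, gamma=100))
```

Under this assumed form, the effective learning rate at a given step would be the base --learningrate times the multiplier.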

Image args

Argument

Description

--image-features-dim

Dimensionality of image features

Default: 2048.

--image-encoder-num-layers

Number of linear layers to encode image features with

Default: 1. Recommended: 1.

--n-image-tokens

Number of tokens that the image encoding will consist of. Specify to spread image encoding over multiple tokens

Default: 1.

--n-image-channels

Number of channels that the image encoding will consist of. Specify if incoming image is multidimensional

Default: 1.

Image Encoder Args

Argument

Description

--include-image-token

If true, include image token (or no image token) for each example

Default: True. Recommended: True.

--image-fusion-type

Which fusion type to use

Choices: early, late.

Default: late.

TransformerGeneratorAgent Options

optional arguments

Argument

Description

--gpu-beam-blocking

Set to use CUDA kernel for beam search ngram blocking

Default: False.

--verbose-topk

Return the topk logits in the act message, if verbose mode is set.

Default: -1.

Transformer Arguments

Argument

Description

--embedding-size, --esz

Size of all embedding layers. Must be a multiple of --n-heads.

Default: 300.

--n-layers, --nl

Number of transformer layers.

Default: 2.

--ffn-size, --hid

Hidden size of the FFN layers

Default: 300.

--dropout

Dropout used around embeddings and before layer normalizations. This is used in Vaswani 2017 and works well on large datasets.

Default: 0.0.

--attention-dropout

Dropout used after attention softmax. This is not used in Vaswani 2017.

Default: 0.0.

--relu-dropout

Dropout used after the ReLU in the FFN. Not used in Vaswani 2017, but used in Tensor2Tensor.

Default: 0.0.

--n-heads

Number of multihead attention heads

Default: 2.

--learn-positional-embeddings

If off, sinusoidal embeddings are used. If on, position embeddings are learned from scratch.

Default: False.

--embeddings-scale

Default: True.

--n-segments

Number of segments the model supports. If zero, no segment embeddings (and no langs_embedding) are used.

Default: 0.

--variant

Chooses locations of layer norms, etc. prelayernorm is used to match some fairseq models

Choices: xlm, prelayernorm, bart, aiayn.

Default: aiayn. Recommended: xlm.

--activation

Nonlinear activation to use. AIAYN uses relu, but more recent papers prefer gelu.

Choices: gelu, relu.

Default: relu. Recommended: gelu.

--output-scaling

Scale the output of every transformer by this quantity.

Default: 1.0.

--share-word-embeddings

Share word embeddings table for candidate and context in the memory network

Default: True.

--n-encoder-layers, --nel

Overrides --n-layers for the encoder in asymmetrical transformers

Default: -1.

--n-decoder-layers, --ndl

Overrides --n-layers for the decoder in asymmetrical transformers

Default: -1.

--model-parallel

Shard the layers across multiple GPUs.

Default: False.

--checkpoint-activations

Recompute activations on backward pass to conserve memory.

Default: False.

Torch Generator Agent

Argument

Description

--beam-size

Beam size, if 1 then greedy search

Default: 1.

--beam-min-length

Minimum length of prediction to be generated by the beam search

Default: 1.

--beam-context-block-ngram

Size of n-grams to block in beam search from the context. A value <= 0 disables blocking.

Default: -1.

--beam-block-ngram

Size of n-grams to block in beam search. A value <= 0 disables blocking.

Default: -1.

--beam-block-full-context

Block n-grams from the full history context. Specify False to block only up to m tokens in the past, where m is the agent's truncation parameter.

Default: True.

--beam-length-penalty

Applies a length penalty. Set to 0 for no penalty.

Default: 0.65.

--inference

Generation algorithm

Choices: beam, nucleus, delayedbeam, greedy, delayednucleusbeam, topk, factual_nucleus.

Default: greedy.

--topk

K used in Top K sampling

Default: 10.

--topp

P used in nucleus sampling

Default: 0.9.

--beam-delay

Used in delayedbeam search

Default: 30.

--lambda-decay

Decay factor in factual nucleus sampling

Default: 0.9.

--omega-bound

Lower bound in factual nucleus sampling

Default: 0.3.

--p-reset

Whether to reset p value in factual nucleus at full stops

Default: True.

--beam-block-list-filename

Load a text file of hard blocks for beam search to never say.

--temperature

Temperature to add during decoding

Default: 1.0.

--compute-tokenized-bleu

If true, compute tokenized bleu scores

Default: False.

TorchAgent Arguments

Argument

Description

--interactive-mode, --i

Whether in full interactive mode or not, which means generating text or retrieving from a full set of candidates, which is necessary to actually do full dialogue. However, during training or quick validation (e.g. PPL for generation or ranking a few candidates for ranking models) you might want these set to off. Typically, scripts can set their preferred default behavior at the start, e.g. eval scripts.

Default: False.

--embedding-type, --emb

Choose between different strategies for initializing word embeddings. Default is random, but can also preinitialize from Glove or Fasttext. Preinitialized embeddings can also be fixed so they are not updated during training.

Choices: random, glove, glove-fixed, fasttext, fasttext-fixed, fasttext_cc, fasttext_cc-fixed.

Default: random.

--embedding-projection, --embp

If pretrained embeddings have a different dimensionality than your embedding size, strategy for projecting to the correct size. If the dimensions are the same, this is ignored unless you append “-force” to your choice.

Default: random.

--fp16

Use fp16 computations.

Default: False.

--fp16-impl

Implementation of FP16 to use

Choices: safe, mem_efficient.

Default: safe.

--rank-candidates, --rc

Whether the model should parse candidates for ranking.

Default: False.

--truncate, --tr

Truncate input lengths to increase speed / use less memory.

Default: -1.

--text-truncate

Text input truncation length. If not specified, defaults to the value of --truncate.

--label-truncate

Label truncation length. If not specified, defaults to the value of --truncate.

--history-reversed

Reverse the history

Default: False.

--history-size, --histsz

Number of past dialog utterances to remember.

Default: -1.

--person-tokens, --pt

Add person tokens to history. Adds p1 in front of input text and p2 in front of past labels (when available) or past utterances generated by the model. These tokens are added to the dictionary during initialization.

Default: False.

--split-lines

Split the dialogue history on newlines and save in separate vectors

Default: False.

--delimiter

Join history lines with this token, defaults to newline

Default: \n.

--special-tok-lst

Comma separated list of special tokens. In case of ambiguous parses from special tokens, the ordering provided in this arg sets precedence.

-gpu, --gpu

Which GPU to use

Default: -1.

--no-cuda

Disable GPUs even if available. Otherwise, GPUs are used if available on the device.

Default: False.

Optimizer Arguments

Argument

Description

--optimizer, --opt

Optimizer choice.

Choices: adadelta, adagrad, adam, adamw, sparseadam, adamax, asgd, sgd, radam, rprop, rmsprop, optimizer, nadam, lbfgs, mem_eff_adam, adafactor.

Default: sgd.

--learningrate, --lr

Learning rate

Default: 1.

--gradient-clip, --clip

Gradient clipping using l2 norm

Default: 0.1.

--adafactor-eps

Epsilon values for adafactor optimizer: regularization constants for square gradient and parameter scale respectively

Default: 1e-30,1e-3. Recommended: 1e-30,1e-3.

--momentum, --mom

If applicable, momentum value for optimizer.

Default: 0.

--nesterov

If applicable, whether to use nesterov momentum.

Default: True.

--nus, --nu

If applicable, nu value(s) for optimizer. Can be a single value like 0.7 or a comma-separated tuple like 0.7,1.0

Default: 0.7.

--betas, --beta

If applicable, beta value(s) for optimizer. Can be a single value like 0.9 or a comma-separated tuple like 0.9,0.999

Default: 0.9,0.999.

--weight-decay, --wdecay

Weight decay on the weights.

BPEHelper Arguments

Argument

Description

--bpe-vocab

Path to pre-trained tokenizer vocab

--bpe-merge

Path to pre-trained tokenizer merge

--bpe-dropout

Use BPE dropout during training.

Learning Rate Scheduler

Argument

Description

--lr-scheduler

Learning rate scheduler.

Choices: reduceonplateau, none, fixed, invsqrt, cosine, linear.

Default: reduceonplateau.

--lr-scheduler-patience

LR scheduler patience, in number of validation runs. If using the fixed scheduler, LR is decayed every --lr-scheduler-patience validations.

Default: 3.

--lr-scheduler-decay

Decay factor for LR scheduler, or how much LR is multiplied by when it is lowered.

Default: 0.5.

--invsqrt-lr-decay-gamma

Constant used only to find the lr multiplier for the invsqrt scheduler. Must be set for --lr-scheduler invsqrt.

Default: -1.