Utilities for downloading and building data.

These can be replaced if your particular file system does not support them.

class parlai.core.build_data.DownloadableFile(url, file_name, hashcode, zipped=True, from_google=False)[source]

Bases: object

A class used to abstract any file that has to be downloaded online.

Any task that needs to download a file must define a list RESOURCES whose elements are objects of this class.

This class provides the following functionality:

  • Download a file from a URL / Google Drive

  • Untar the file if zipped

  • Checksum for the downloaded file

  • Send HEAD request to validate URL or Google Drive link

An object of this class needs to be created with:

  • url <string> : URL or Google Drive id to download from

  • file_name <string> : Name to give the downloaded file

  • hashcode <string> : SHA256 hashcode of the downloaded file

  • zipped <boolean> : False if the file is not compressed

  • from_google <boolean> : True if the file is from Google Drive

__init__(url, file_name, hashcode, zipped=True, from_google=False)[source]

Checksum on a given file.


dpath – path to the downloaded file.
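The checksum step can be illustrated with a minimal sketch, assuming the check simply compares the file's SHA256 hex digest against the stored hashcode. The names `sha256_of_file` and `verify_checksum` are hypothetical and are not part of `parlai.core.build_data`.

```python
import hashlib


def sha256_of_file(dpath, chunk_size=65536):
    """Return the hex SHA256 digest of the file at dpath (hypothetical helper)."""
    sha256 = hashlib.sha256()
    with open(dpath, "rb") as f:
        # Read in fixed-size blocks so large downloads need not fit in memory.
        for block in iter(lambda: f.read(chunk_size), b""):
            sha256.update(block)
    return sha256.hexdigest()


def verify_checksum(dpath, expected_hashcode):
    """Raise ValueError if the file's digest differs from the expected one."""
    actual = sha256_of_file(dpath)
    if actual != expected_hashcode:
        raise ValueError(
            f"Checksum mismatch for {dpath}: expected {expected_hashcode}, got {actual}"
        )
```

Reading in blocks rather than loading the whole file keeps memory use flat for large dataset archives.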


Performs a HEAD request to check if the URL / Google Drive ID is live.

parlai.core.build_data.built(path, version_string=None)[source]

Check whether the ‘.built’ flag has been set for that task.

If a version_string is provided, this has to match, or the version is regarded as not built.
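The version check can be sketched as follows. This is an illustration only: it assumes the ‘.built’ flag is a small text file holding a timestamp on its first line and an optional version string on its second, and the `_sketch` function names are hypothetical. A writer is included so the example is self-contained.

```python
import datetime
import os


def mark_done_sketch(path, version_string=None):
    """Write a '.built' flag: a timestamp, then an optional version line."""
    with open(os.path.join(path, ".built"), "w") as f:
        f.write(str(datetime.datetime.today()))
        if version_string:
            f.write("\n" + version_string)


def built_sketch(path, version_string=None):
    """Return True if the flag exists and, when requested, the version matches."""
    flag = os.path.join(path, ".built")
    if not os.path.isfile(flag):
        return False
    if version_string is None:
        return True
    with open(flag) as f:
        lines = f.read().split("\n")
    # Assumed layout: line 0 is the timestamp, line 1 is the version.
    return len(lines) > 1 and lines[1] == version_string
```

With this layout, bumping the version string in a build script is enough to make an older download count as not built.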

parlai.core.build_data.mark_done(path, version_string=None)[source]

Mark this path as prebuilt.

Marks the path as done by adding a ‘.built’ file with the current timestamp plus a version description string if specified.

  • path (str) – The file path to mark as built.

  • version_string (str) – The version of this dataset.

…, path, fname, redownload=False, num_retries=5)[source]

Download file using requests.

If redownload is False (the default), the tar file is not downloaded again when it is already present.
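The skip-if-present and retry behaviour can be sketched as below. This is not the library's implementation: `download_with_retries` is a hypothetical name, `fetch` is a hypothetical callable that returns the response body as bytes (so the sketch does not depend on a particular HTTP client), and the exponential backoff between attempts is an assumption.

```python
import os
import time


def download_with_retries(fetch, url, path, fname, redownload=False,
                          num_retries=5, sleep=time.sleep):
    """Save fetch(url) to path/fname, retrying on I/O errors."""
    outfile = os.path.join(path, fname)
    if os.path.isfile(outfile) and not redownload:
        return outfile  # already present, nothing to do

    last_error = None
    for attempt in range(num_retries):
        try:
            data = fetch(url)
            with open(outfile, "wb") as f:
                f.write(data)
            return outfile
        except OSError as err:  # ConnectionError and TimeoutError are OSErrors
            last_error = err
            if attempt < num_retries - 1:
                sleep(2 ** attempt)  # assumed backoff: 1s, 2s, 4s, ...
    raise RuntimeError(
        f"Failed to download {url} after {num_retries} attempts"
    ) from last_error
```

Passing `fetch` and `sleep` as arguments keeps the retry logic testable without network access or real delays.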


Make the directory and any nonexistent parent directories (mkdir -p).


Remove the given directory, if it exists.
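The two directory helpers described above map closely onto the Python standard library. A minimal sketch, with hypothetical `_sketch` names, might look like this:

```python
import os
import shutil


def make_dir_sketch(path):
    """Create path and any missing parents, like `mkdir -p`."""
    # exist_ok=True makes the call a no-op if the directory already exists.
    os.makedirs(path, exist_ok=True)


def remove_dir_sketch(path):
    """Remove the directory tree at path; do nothing if it is absent."""
    # ignore_errors=True suppresses the error raised for a missing directory.
    shutil.rmtree(path, ignore_errors=True)
```

Both calls are idempotent, so a build script can run them repeatedly without checking the directory state first.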

parlai.core.build_data.untar(path, fname, delete=True, flatten_tar=False)[source]

Unpack the given archive file to the same directory.

  • path (str) – The folder containing the archive. Will contain the contents.

  • fname (str) – The filename of the archive file.

  • delete (bool) – If true, the archive will be deleted after extraction.
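As an illustration of the unpack-then-delete behaviour, here is a minimal sketch built on `shutil.unpack_archive`, which picks the archive format from the file extension. The name `untar_sketch` is hypothetical, and the `flatten_tar` option is not modelled.

```python
import os
import shutil


def untar_sketch(path, fname, delete=True):
    """Unpack path/fname into path, optionally deleting the archive after."""
    fullpath = os.path.join(path, fname)
    # The archive format (.tar.gz, .zip, ...) is inferred from the extension.
    shutil.unpack_archive(fullpath, path)
    if delete:
        os.remove(fullpath)
```

Archives from untrusted sources deserve extra care, since member paths inside a tar file can point outside the target directory.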

parlai.core.build_data.ungzip(path, fname, deleteGZip=True)[source]

Unzips the given gzip compressed file to the same directory.

  • path (str) – The folder containing the archive. Will contain the contents.

  • fname (str) – The filename of the archive file.

  • deleteGZip (bool) – If true, the compressed file will be deleted after extraction.
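A minimal sketch of the same idea for gzip files is shown below. It assumes the output file is named by stripping the `.gz` suffix; `ungzip_sketch` and its `delete_gzip` parameter are hypothetical names for illustration.

```python
import gzip
import os
import shutil


def ungzip_sketch(path, fname, delete_gzip=True):
    """Decompress path/fname next to itself and return the output path."""
    if not fname.endswith(".gz"):
        raise ValueError(f"Expected a .gz file, got {fname}")
    gz_path = os.path.join(path, fname)
    out_path = os.path.join(path, fname[:-3])  # drop the ".gz" suffix
    # Stream the data so large files are never held in memory at once.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    if delete_gzip:
        os.remove(gz_path)
    return out_path
```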

parlai.core.build_data.download_from_google_drive(gd_id, destination)[source]

Use the requests package to download a file from Google Drive.

parlai.core.build_data.download_models(opt, fnames, model_folder, version='v1.0', path='aws', use_model_type=False, flatten_tar=False)[source]

Download models into the ParlAI model zoo from a url.

  • fnames – list of filenames to download

  • model_folder – models will be downloaded into models/model_folder/model_type

  • path – url for downloading models; defaults to downloading from AWS

  • use_model_type – whether models are categorized by type in AWS

parlai.core.build_data.modelzoo_path(datapath, path)[source]

Map pretrained model filenames to their paths on disk.

If path starts with ‘models:’, then we remap it to the model zoo path within the data directory (default is ParlAI/data/models). We download models from the model zoo if they are not here yet.
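The prefix remapping can be sketched as follows. This covers only the path translation, not the download step, and it assumes the model zoo lives in a `models` subdirectory of the data directory; `modelzoo_path_sketch` is a hypothetical name.

```python
import os


def modelzoo_path_sketch(datapath, path):
    """Translate 'models:<rest>' into <datapath>/models/<rest>."""
    prefix = "models:"
    if path is None or not path.startswith(prefix):
        return path  # ordinary paths are returned unchanged
    return os.path.join(datapath, "models", path[len(prefix):])
```

For example, under these assumptions `modelzoo_path_sketch("ParlAI/data", "models:foo/model")` resolves to a location under `ParlAI/data/models`.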

parlai.core.build_data.download_multiprocess(urls, path, num_processes=32, chunk_size=100, dest_filenames=None, error_path=None)[source]

Download items in parallel (e.g. for an image + dialogue task).

WARNING: may have issues with OS X.

  • urls – Array of urls to download

  • path – directory to save items in

  • num_processes – number of processes to use

  • chunk_size – chunk size to use

  • dest_filenames – optional array of filenames, of the same length as urls. Images will be saved as path + dest_filename

  • error_path – where to save error logs


Returns an array of tuples of (destination filename, HTTP status code, error message if any). Note that upon failure, the file may not actually be created.
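The chunked, parallel pattern and the shape of the result can be sketched as below. Several things here are assumptions rather than the library's behaviour: the sketch uses a thread pool from `multiprocessing.pool` instead of separate processes, `fetch` is a hypothetical callable that downloads one URL and returns its HTTP status code, and a failed item is reported with a status of `None`.

```python
from multiprocessing.pool import ThreadPool


def _download_chunk(args):
    """Download one chunk of (url, dest) pairs, collecting a result per item."""
    fetch, items = args
    results = []
    for url, dest in items:
        try:
            status = fetch(url, dest)
            results.append((dest, status, None))
        except Exception as err:  # record the failure instead of aborting
            results.append((dest, None, str(err)))
    return results


def parallel_download_sketch(fetch, urls, dest_filenames,
                             num_workers=4, chunk_size=2):
    """Return a list of (dest filename, status code, error message) tuples."""
    items = list(zip(urls, dest_filenames))
    # Split the work into fixed-size chunks, one task per chunk.
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    with ThreadPool(num_workers) as pool:
        per_chunk = pool.map(_download_chunk, [(fetch, c) for c in chunks])
    # pool.map preserves input order, so results line up with urls.
    return [result for chunk in per_chunk for result in chunk]
```

Catching errors per item means one bad URL produces an error tuple rather than stopping the whole batch.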