Terminology and Concepts
[Outdated] A new version of this will be uploaded soon
Authors: Amanpreet Singh
To develop on top of MMF, it is necessary to understand the concepts and terminology used in the MMF codebase. MMF has been carefully designed from the ground up to be a multi-tasking framework. This means that with MMF you can train on multiple tasks/datasets
together.
To achieve this, MMF has a few opinions about the architecture of your research project.
While being generic means MMF abstracts a lot of concepts in its modules, it is
easy to develop on top of MMF once a developer understands these simple concepts.
The major concepts and terminology in MMF that one needs to know in order to develop
on top of MMF are as follows:
Tasks and Datasets
In MMF, we have divided datasets into a set category of tasks. Thus, a task corresponds
to a collection of datasets that belong to it. For example, VQA 2.0, VizWiz and TextVQA
all belong to the VQA task. Each task and dataset has been assigned a unique key which is used
to refer to it in the command line arguments.
Following table shows the tasks and their datasets:
| Task | Key | Datasets |
|------|-----|----------|
| VQA | `vqa` | VQA 2.0, VizWiz, TextVQA, VisualGenome, CLEVR |
| Dialog | `dialog` | VisualDialog |
| Caption | `captioning` | MS COCO |
Following table shows the inverse of the above table, datasets along with their tasks and keys:
| Dataset | Key | Task | Notes |
|---------|-----|------|-------|
| VQA 2.0 | `vqa2` | `vqa` | |
| TextVQA | `textvqa` | `vqa` | |
| VizWiz | `vizwiz` | `vqa` | |
| VisualDialog | `visdial` | `dialog` | Coming soon! |
| VisualGenome | `visual_genome` | `vqa` | |
| CLEVR | `clevr` | `vqa` | |
| MS COCO | `coco` | `captioning` | |
Models
Reference implementations for state-of-the-art models have been included to act as
a base for reproducing research papers and as a starting point for new research. MMF has
also served as the base for several past research papers.
Similar to tasks and datasets, each model has been registered with a unique key for easy
reference in configuration and command line arguments. Following table shows each model's
key and the datasets it can be run on.
| Model | Key | Datasets |
|-------|-----|----------|
| LoRRA | `lorra` | `vqa2`, `textvqa`, `vizwiz` |
| Pythia | `pythia` | `textvqa`, `vizwiz`, `vqa2`, `visual_genome` |
| BAN | `ban` | `textvqa`, `vizwiz`, `vqa2` |
| BUTD | `butd` | `coco` |
| CNN LSTM | `cnn_lstm` | `clevr` |
Note
BAN support is preliminary and hasn’t been properly fine-tuned yet.
Registry
Registry acts as a central source of truth for MMF. Inspired by Redux's global store,
useful information needed by the MMF ecosystem is registered in the registry. The registry can be
considered a general-purpose storage for information which is needed by multiple parts
of the framework, and it acts as the source of that information wherever it is needed.
The registry also registers models, tasks, datasets etc. based on the unique keys mentioned above.
The registry's functions can be used as decorators over the classes which need to be registered
(e.g. models).
The registry object can be imported as follows:

```python
from mmf.common.registry import registry
```

Find more details about the Registry class in its documentation, common/registry.
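As a minimal sketch of how the registry is typically used (the `my_model` key and `my_useful_info` entry below are made up for illustration; `register_model` is one of the registry's decorator functions):

```python
from mmf.common.registry import registry
from mmf.models.base_model import BaseModel


# Registering a class: the decorator stores MyModel in the registry under
# the (hypothetical) key "my_model" so configs and CLI arguments can refer to it
@registry.register_model("my_model")
class MyModel(BaseModel):
    def __init__(self, config):
        super().__init__(config)


# General-purpose storage: register any value under a key...
registry.register("my_useful_info", {"num_labels": 10})

# ...and retrieve it anywhere else in the framework
info = registry.get("my_useful_info")
```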
Configuration
As is necessary with research, most of the parameters/settings in MMF are
configurable. MMF-specific default values (such as training parameters) are present
in mmf/common/defaults/configs/base.yaml
with detailed comments delineating the usage of each parameter.
For ease of usage and modularity, the configuration for each dataset is kept separately in
mmf/common/defaults/configs/datasets/[task]/[dataset].yaml, where you can get the [task]
value for a dataset from the tables in the Tasks and Datasets section.
The most dynamic part, the model configurations, are also kept separate and are the ones which
need to be defined by users creating their own models. We include
configurations for the models in MMF's model zoo. For each model,
there is a separate configuration for each dataset it can work on; see
configs/vqa/vqa2/pythia.yaml for an example. The configurations in
the configs folder are divided using the scheme configs/[task]/[dataset]/[model].yaml.
It is possible to include other configs into your config using the includes
directive. Thus, in an MMF config you can include vqa2's config like this:

```yaml
includes:
- common/defaults/configs/datasets/vqa/vqa2.yaml
```
Since there is a separate config per dataset, this concept extends naturally
to multi-tasking: include multiple dataset configs, as in the sketch below.
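For example, a config that trains on both vqa2 and textvqa might start like the following sketch, assuming textvqa's default config follows the [task]/[dataset].yaml scheme described above:

```yaml
includes:
- common/defaults/configs/datasets/vqa/vqa2.yaml
- common/defaults/configs/datasets/vqa/textvqa.yaml
```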
The base.yaml file mentioned above is always included and provides sane defaults
for most of the training parameters. You can then specify the config of the model
that you want to train using the --config [config_path]
option. The final config can be
retrieved using registry.get('config')
anywhere in your codebase, and you can access
attributes from these configs using dot notation. For example, if you want
the value of maximum iterations, you can get it via registry.get('config').training.max_updates.
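In code, that lookup is a short sketch like this (the key names match the example above):

```python
from mmf.common.registry import registry

# The final merged configuration is registered under the "config" key
config = registry.get("config")

# Dot notation access into nested config blocks
max_updates = config.training.max_updates
```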
The values in the configuration can be overridden using two formats:
- Individual override: for example, if you want to use DataParallel to train on multiple GPUs,
you can override the default value of False by passing training.data_parallel True
at the end of your command. This will override that option on the fly.
- DemJSON-based override: the above option gets clunky when you are running
hyperparameter sweeps over model parameters. To avoid this, you can update a whole block
using a demjson string. For example, to enable early stopping and also update the patience, you
can pass --config_override "{training: {should_early_stop: True, patience: 5000}}".
This demjson string is easier to generate programmatically than the individual
overrides.
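Putting the two styles together, a command might look like the sketch below; the entry-point script is a placeholder since it depends on your setup, while the flags and overrides are the ones described above:

```bash
# <run_script> is a hypothetical placeholder for your MMF entry point
python <run_script> --config configs/vqa/vqa2/pythia.yaml \
    --config_override "{training: {should_early_stop: True, patience: 5000}}" \
    training.data_parallel True
```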
Note
It is always helpful to verify your config overrides against the final configuration
values that are printed, to make sure you overrode the correct keys.
Processors
The main aim of processors is to keep the data processing pipelines as similar as
possible across datasets and to allow code reusability. Processors take in
a dict with keys corresponding to the data they need and return a dict with the
processed data. This helps keep processors independent of the rest of the logic
by fixing the signatures they require. Processors are used in all of the datasets
to hand off the data processing needs. Learn more about processors in the
documentation for processors, and see the sketch below for the general pattern.
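As a sketch of that pattern, here is a hypothetical processor that lowercases text; it assumes BaseProcessor lives in mmf.datasets.processors.processors and that register_processor is the registration decorator:

```python
from mmf.common.registry import registry
from mmf.datasets.processors.processors import BaseProcessor


# A hypothetical processor illustrating the dict-in/dict-out contract
@registry.register_processor("lowercase_text")
class LowercaseTextProcessor(BaseProcessor):
    def __init__(self, config, *args, **kwargs):
        self.config = config

    def __call__(self, item):
        # `item` is a dict with the keys this processor needs;
        # the return value is a dict with the processed data
        return {"text": item["text"].lower()}
```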
Sample List
SampleList has been inspired by BoxList in maskrcnn-benchmark, but is more generic.
All datasets integrated with MMF need to return a Sample, which will be collated into a
SampleList. SampleList comes with a lot of handy functions which
enable easy batching and access of things. For example, a Sample is a dict with
some keys; in a SampleList, the values for these keys will be smartly clubbed
together based on whether they are tensors or lists, and assigned back to that dict.
So the end user gets these keys clubbed nicely together and can use them in their model.
Models integrated with MMF receive a SampleList as an argument, which
keeps the trainer unopinionated about both the models and the datasets. Learn more
about Sample and SampleList in their documentation.
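A minimal sketch of this clubbing behavior, assuming a SampleList can be constructed directly from a list of Samples as the batch collator does:

```python
import torch

from mmf.common.sample import Sample, SampleList

# Two samples with the same keys, as a dataset's __getitem__ would return
first = Sample()
first.text = torch.tensor([1, 2, 3])

second = Sample()
second.text = torch.tensor([4, 5, 6])

# Tensor values for the shared "text" key are clubbed along a batch dimension
sample_list = SampleList([first, second])
print(sample_list.text.shape)  # torch.Size([2, 3])
```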
Tutorial: Adding a dataset
[Outdated] A new version of this will be uploaded soon
This is a tutorial on how to add a new dataset to MMF.
MMF is agnostic to the kind of datasets that can be added to it. At a high level, adding a dataset requires 4 main components:
- Dataset Builder
- Default Configuration
- Dataset Class
- Dataset’s Metrics
In most cases, you should be able to inherit from one of the existing datasets for easy integration. Let's start with the dataset builder.
Dataset Builder
The builder creates and returns an instance of mmf.datasets.base_dataset.BaseDataset, which inherits from torch.utils.data.dataset.Dataset.
Any builder class in MMF needs to inherit from mmf.datasets.base_dataset_builder.BaseDatasetBuilder. BaseDatasetBuilder requires the
user to implement the following methods after inheriting the class:

__init__(self)
Inside this function, call super().__init__("name") where "name" should be your dataset's name, like "vqa2".
load(self, config, dataset_type, *args, **kwargs)
This function loads the dataset, builds an object of a class inheriting BaseDataset which contains your dataset logic, and returns it.
build(self, config, dataset_type, *args, **kwargs)
This function actually builds the data required for initializing the dataset for the first time. For example, if you need to download some data for your dataset, it should all be done inside this function.
Finally, you need to register your dataset builder with a key to the registry using mmf.common.registry.registry.register_builder("key").
That's it; that's all you require for inheriting BaseDatasetBuilder.
Let's write this down using the example of the CLEVR dataset.
```python
import os
import zipfile

from mmf.common.registry import registry
from mmf.datasets.base_dataset_builder import BaseDatasetBuilder
# Let's assume for now that we have a dataset class called CLEVRDataset
from mmf.datasets.builders.clevr.dataset import CLEVRDataset
from mmf.utils.general import download_file, get_mmf_root


@registry.register_builder("clevr")
class CLEVRBuilder(BaseDatasetBuilder):
    DOWNLOAD_URL = "https://s3-us-west-1.amazonaws.com/clevr/CLEVR_v1.0.zip"

    def __init__(self):
        # __init__ should call super().__init__ with the key for the dataset
        super().__init__("clevr")
        self.writer = registry.get("writer")
        # Assign the dataset class
        self.dataset_class = CLEVRDataset

    def build(self, config, dataset):
        download_folder = os.path.join(
            get_mmf_root(), config.data_dir, config.data_folder
        )

        file_name = self.DOWNLOAD_URL.split("/")[-1]
        local_filename = os.path.join(download_folder, file_name)

        extraction_folder = os.path.join(
            download_folder, ".".join(file_name.split(".")[:-1])
        )
        self.data_folder = extraction_folder

        # If the zip file is already present, or there are already some files
        # inside the extraction folder, we don't continue the download process
        if os.path.exists(local_filename):
            return

        if os.path.exists(extraction_folder) and \
                len(os.listdir(extraction_folder)) != 0:
            return

        self.writer.write("Downloading the CLEVR dataset now")
        download_file(self.DOWNLOAD_URL, output_dir=download_folder)
        self.writer.write("Downloaded. Extracting now. This can take time.")

        with zipfile.ZipFile(local_filename, "r") as zip_ref:
            zip_ref.extractall(download_folder)

    def load(self, config, dataset, *args, **kwargs):
        # Load the dataset using the CLEVRDataset class
        self.dataset = CLEVRDataset(
            config, dataset, data_folder=self.data_folder
        )
        return self.dataset

    def update_registry_for_model(self, config):
        # Register both vocab sizes (question and answer) to the registry for
        # easy access from the models. update_registry_for_model, if present,
        # is automatically called by MMF
        registry.register(
            self.dataset_name + "_text_vocab_size",
            self.dataset.text_processor.get_vocab_size(),
        )
        registry.register(
            self.dataset_name + "_num_final_outputs",
            self.dataset.answer_processor.get_vocab_size(),
        )
```
Default Configuration
Some things to note about MMF’s configuration:
- Each dataset in MMF has its own default configuration, usually under the structure
mmf/common/defaults/configs/datasets/[task]/[dataset].yaml, where task is the task your dataset belongs to.
- These dataset configurations can then be included by the user in their end config using the
includes directive.
- This allows easy multi-tasking and management of configurations, and the user can also
easily override the default configurations in their own config.
So, for the CLEVR dataset, we will also need to create a default configuration.
The config node is directly passed to your builder, which you can then pass on to your dataset
for any configuration needed for building it.
Basic structure for a dataset configuration looks like below:

```yaml
dataset_config:
  [dataset]:
    ... your config here
```
Here is the default configuration needed for CLEVR, based on our dataset and builder class above:
```yaml
dataset_config:
  # You can specify any attributes you want, and you will get them as attributes
  # inside the config passed to the dataset. Check the dataset implementation below.
  clevr:
    # Where your data is stored
    data_dir: ${env.data_dir}
    data_folder: CLEVR_v1.0
    # Any attributes that you require to build your dataset but that are
    # configurable. For CLEVR, we have attributes that can be passed to the
    # vocab building class
    build_attributes:
      min_count: 1
      split_regex: " "
      keep:
        - ";"
        - ","
      remove:
        - "?"
        - "."
    processors:
      # The processors will be assigned to the dataset automatically by MMF.
      # For example, if the key is text_processor, you can access that processor
      # inside the dataset object using self.text_processor
      text_processor:
        type: vocab
        params:
          max_length: 10
          vocab:
            type: random
            vocab_file: vocabs/clevr_question_vocab.txt
          # You can also specify a processor here
          preprocessor:
            type: simple_sentence
            params: {}
      answer_processor:
        # Add your processor for answer processing here
        type: multi_hot_answer_from_vocab
        params:
          num_answers: 1
          # Vocab file is relative to [data_dir]/[data_folder]
          vocab_file: vocabs/clevr_answer_vocab.txt
          preprocessor:
            type: simple_word
            params: {}
```
For processors, check mmf.datasets.processors
to understand how to create a processor and which processors are
already available in MMF.
Dataset Class
The next step is to actually build a dataset class which inherits BaseDataset so it can interact with PyTorch
dataloaders. Follow the steps below to inherit and create your dataset's class.
- Inherit mmf.datasets.base_dataset.BaseDataset.
- Implement __init__(self, config, dataset). Call the parent's init using super().__init__("name", config, dataset)
where "name" is the string representing the name of your dataset.
- Implement __getitem__(self, idx), the normal __getitem__(self, idx) you would implement
for a torch dataset. This needs to return an object of class Sample.
- Implement the __len__(self) method, which returns the size of your dataset.
- [Optional] Implement load_item(self, idx) if you need to load something or do something
else with the data, and then call it inside __getitem__.
```python
import json
import os

import numpy as np
import torch
from PIL import Image

from mmf.common.registry import registry
from mmf.common.sample import Sample
from mmf.datasets.base_dataset import BaseDataset
from mmf.utils.general import get_mmf_root
from mmf.utils.text import VocabFromText, tokenize


class CLEVRDataset(BaseDataset):
    def __init__(self, config, dataset, data_folder=None, *args, **kwargs):
        super().__init__("clevr", config, dataset)
        self._data_folder = data_folder
        self._data_dir = os.path.join(get_mmf_root(), config.data_dir)

        if not self._data_folder:
            self._data_folder = os.path.join(self._data_dir, config.data_folder)

        if not os.path.exists(self._data_folder):
            raise RuntimeError(
                "Data folder {} for CLEVR is not present".format(self._data_folder)
            )

        # Check if the folder was actually extracted in the subfolder
        if config.data_folder in os.listdir(self._data_folder):
            self._data_folder = os.path.join(self._data_folder, config.data_folder)

        if len(os.listdir(self._data_folder)) == 0:
            raise RuntimeError("CLEVR dataset folder is empty")

        self._load()

    def _load(self):
        self.image_path = os.path.join(self._data_folder, "images", self._dataset_type)

        with open(
            os.path.join(
                self._data_folder,
                "questions",
                "CLEVR_{}_questions.json".format(self._dataset_type),
            )
        ) as f:
            self.questions = json.load(f)["questions"]
            self._build_vocab(self.questions, "question")
            self._build_vocab(self.questions, "answer")

    def __len__(self):
        # __len__ tells how many samples there are
        return len(self.questions)

    def _get_vocab_path(self, attribute):
        return os.path.join(
            self._data_dir, "vocabs",
            "{}_{}_vocab.txt".format(self.dataset_name, attribute)
        )

    def _build_vocab(self, questions, attribute):
        # This function builds the vocab for questions and answers; it is not
        # required for this tutorial
        ...

    def __getitem__(self, idx):
        # __getitem__ is like your normal __getitem__ in a PyTorch Dataset:
        # based on the index, return a sample. Check the VQA2Dataset
        # implementation if you want to see how to do caching in MMF
        data = self.questions[idx]

        # Each call to __getitem__ from the dataloader returns a Sample class
        # object which is collated by our special batch collator into a
        # SampleList, which is basically an attribute-based batch in layman's terms
        current_sample = Sample()

        question = data["question"]
        tokens = tokenize(question, keep=[";", ","], remove=["?", "."])

        # These processors are directly assigned as attributes to the dataset
        # based on the config we created above
        processed = self.text_processor({"tokens": tokens})
        # Add the question as a text attribute to the sample
        current_sample.text = processed["text"]

        processed = self.answer_processor({"answers": [data["answer"]]})
        # Now add the answers and then the targets. In MMF, we normally use
        # "targets" for what the model's final output should be compared against
        current_sample.answers = processed["answers"]
        current_sample.targets = processed["answers_scores"]

        image_path = os.path.join(self.image_path, data["image_filename"])
        image = np.true_divide(Image.open(image_path).convert("RGB"), 255)
        image = image.astype(np.float32)
        # Process and add the image as a tensor
        current_sample.image = torch.from_numpy(image.transpose(2, 0, 1))

        # Return your sample and MMF will automatically convert it to a
        # SampleList before passing it to the model
        return current_sample
```
Metrics
For your dataset to be compatible out of the box, it is a good practice to also add the metrics your dataset requires.
All metrics for now go inside MMF/modules/metrics.py
. All metrics inherit BaseMetric
and implement a function calculate
with signature calculate(self, sample_list, model_output, *args, **kwargs)
where sample_list
(SampleList
) is the current batch and
model_output
is a dict return by your model for current sample_list
. Normally, you should define the keys you want inside
model_output
and sample_list
. Finally, you should register your metric to registry using @registry.register_metric('[key]')
where ‘[key]’ is the key for your metric. Here is a sample implementation of accuracy metric used in CLEVR dataset:
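Below is a minimal sketch of such an accuracy metric for the CLEVR dataset; it assumes the model returns its logits under a "scores" key and that sample_list carries the "targets" produced by the answer processor, as in the CLEVR dataset class above:

```python
import torch

from mmf.common.registry import registry
from mmf.modules.metrics import BaseMetric


@registry.register_metric("accuracy")
class Accuracy(BaseMetric):
    def __init__(self):
        super().__init__("accuracy")

    def calculate(self, sample_list, model_output, *args, **kwargs):
        # "scores" and "targets" are assumed keys here; define whatever keys
        # your model and dataset actually agree on
        output = model_output["scores"]
        expected = sample_list["targets"]

        # Reduce logits and multi-hot targets to label indices
        if output.dim() == 2:
            output = torch.max(output, 1)[1]
        if expected.dim() == 2:
            expected = torch.max(expected, 1)[1]

        correct = (expected == output.squeeze()).sum().float()
        return correct / expected.size(0)
```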
These are the common steps you need to follow when you are adding a dataset to MMF.