```python
optim_wrapper = dict(
    constructor='CustomConstructor',
    type='OptimWrapper',  # Specify the type of OptimWrapper
    optimizer=dict(  # optimizer configuration
        type='AdamW',
        lr=0.0001,
        betas=(0.9, 0.999),
        weight_decay=0.05),
    paramwise_cfg={
        'decay_rate': 0.95,
        'decay_type': 'layer_wise',
        'num_layers': 6
    })
```

</div>
  </td>
</tr>
</thead>
</table>

```{note}
For high-level tasks such as detection and classification, MMCV needs `optimizer_config` to build `OptimizerHook`, which is not necessary in MMEngine.
```

The `optim_wrapper` used in this tutorial is as follows:

```python
from torch.optim import SGD

optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9)
optim_wrapper = dict(optimizer=optimizer)
```

### Prepare hooks

**Prepare hooks in MMCV**

The commonly used hooks configuration in MMCV is as follows:

```python
# learning rate scheduler config
lr_config = dict(policy='step', step=[2, 3])
# configuration of optimizer
optimizer_config = dict(grad_clip=None)
# configuration of saving checkpoints periodically
checkpoint_config = dict(interval=1)
# save log periodically and multiple hooks can be used simultaneously
log_config = dict(interval=100, hooks=[dict(type='TextLoggerHook')])
# register hooks to runner and those hooks will be invoked automatically
runner.register_training_hooks(
    lr_config=lr_config,
    optimizer_config=optimizer_config,
    checkpoint_config=checkpoint_config,
    log_config=log_config)
```

Among them:

- `lr_config` is used for `LrUpdaterHook`
- `optimizer_config` is used for `OptimizerHook`
- `checkpoint_config` is used for `CheckpointHook`
- `log_config` is used for `LoggerHook`

Besides the hooks mentioned above, MMCV `Runner` builds `IterTimerHook` automatically. In MMCV, the training hooks are registered after the runner is instantiated, while MMEngine `Runner` initializes its hooks during instantiation.

**Prepare hooks in MMEngine**

MMEngine `Runner` registers several hooks that were commonly used in MMCV as default hooks:

- [RuntimeInfoHook](mmengine.hooks.RuntimeInfoHook)
- [IterTimerHook](mmengine.hooks.IterTimerHook)
- [DistSamplerSeedHook](mmengine.hooks.DistSamplerSeedHook)
- [LoggerHook](mmengine.hooks.LoggerHook)
- [CheckpointHook](mmengine.hooks.CheckpointHook)
- [ParamSchedulerHook](mmengine.hooks.ParamSchedulerHook)

Compared with the MMCV example:

- `LrUpdaterHook` corresponds to `ParamSchedulerHook`. Find more details in [migrate scheduler](./param_scheduler.md)
- MMEngine optimizes the model in [train_step](mmengine.model.BaseModel.train_step), so `OptimizerHook` is no longer needed
- MMEngine takes `CheckpointHook` as a default hook
- MMEngine takes `LoggerHook` as a default hook

Therefore, we can achieve the same effect as the MMCV example as long as we configure the [param_scheduler](../tutorials/param_scheduler.md) correctly.

We can also register custom hooks in the MMEngine runner. Find more details in the [runner tutorial](../tutorials/runner.md) and [migrate hook](./hook.md).

<table class="docutils">
<thead>
  <tr>
    <th>Commonly used hooks in MMCV</th>
    <th>Default hooks in MMEngine</th>
<tbody>
  <tr>
  <td valign="top" class='two-column-table-wrapper' width="50%"><div style="overflow-x: auto">

```python
# Configure training hooks
# Configure LrUpdaterHook
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[8, 11])

# Configure OptimizerHook
optimizer_config = dict(grad_clip=None)

# Configure LoggerHook
log_config = dict(  # LoggerHook
    interval=50,
    hooks=[
        dict(type='TextLoggerHook'),
        # dict(type='TensorboardLoggerHook')
    ])

# Configure CheckpointHook
checkpoint_config = dict(interval=1)
```

</div>
  </td>
  <td valign="top" class='two-column-table-wrapper' width="50%"><div style="overflow-x: auto">

```python
# Configure parameter scheduler
param_scheduler = [
    dict(
        type='LinearLR', start_factor=0.001, by_epoch=False, begin=0, end=500),
    dict(
        type='MultiStepLR',
        begin=0,
        end=12,
        by_epoch=True,
        milestones=[8, 11],
        gamma=0.1)
]

# Configure default hooks
default_hooks = dict(
    timer=dict(type='IterTimerHook'),
    logger=dict(type='LoggerHook', interval=50),
    param_scheduler=dict(type='ParamSchedulerHook'),
    checkpoint=dict(type='CheckpointHook', interval=1),
    sampler_seed=dict(type='DistSamplerSeedHook'),
    visualization=dict(type='DetVisualizationHook'))
```

</div>
  </td>
</tr>
</thead>
</table>
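
The correspondence shown in the table can be sketched as a small conversion helper. This is only an illustration, not part of MMCV or MMEngine: `lr_config_to_param_scheduler` is a hypothetical name, it handles only the `step` policy with a list of milestones and optional linear warmup, and it assumes warmup is counted in iterations, as in the table.

```python
def lr_config_to_param_scheduler(lr_config, max_epochs, gamma=0.1):
    """Translate an MMCV-style step `lr_config` into an MMEngine-style list.

    Hypothetical helper for illustration only.
    """
    if lr_config.get('policy') != 'step':
        raise ValueError('this sketch only handles policy="step"')

    schedulers = []
    if lr_config.get('warmup') == 'linear':
        # Warmup is counted in iterations, hence by_epoch=False.
        schedulers.append(
            dict(
                type='LinearLR',
                start_factor=lr_config['warmup_ratio'],
                by_epoch=False,
                begin=0,
                end=lr_config['warmup_iters']))

    # The decay milestones are counted in epochs.
    schedulers.append(
        dict(
            type='MultiStepLR',
            begin=0,
            end=max_epochs,
            by_epoch=True,
            milestones=list(lr_config['step']),
            gamma=gamma))
    return schedulers


mmcv_lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[8, 11])
param_scheduler = lr_config_to_param_scheduler(mmcv_lr_config, max_epochs=12)
```

Applied to the MMCV config from the left column, the helper yields the two-entry `param_scheduler` list from the right column.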

The parameter scheduler used in this tutorial is as follows:

```python
param_scheduler = dict(type='MultiStepLR', milestones=[2, 3], gamma=0.1)
```
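
To see what this schedule does, the following standalone sketch computes the learning rate for each epoch. It assumes the base learning rate of 0.1 from the `SGD` optimizer above and the usual multi-step rule (multiply by `gamma` each time a milestone is reached). `multistep_lr` is a hypothetical helper, not an MMEngine function.

```python
from bisect import bisect_right


def multistep_lr(base_lr, milestones, gamma, epoch):
    """Learning rate at `epoch` when decaying by `gamma` at each milestone."""
    # Count how many milestones have been reached by this epoch.
    num_decays = bisect_right(sorted(milestones), epoch)
    return base_lr * gamma ** num_decays


# Four epochs, matching max_epochs=4 used in this tutorial.
lrs = [multistep_lr(0.1, [2, 3], 0.1, epoch) for epoch in range(4)]
```

Under these assumptions the learning rate stays at about 0.1 for epochs 0 and 1, drops to about 0.01 at epoch 2, and to about 0.001 at epoch 3.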

### Prepare testing/validation components

MMCV implements validation through `EvalHook`, which is not covered in detail here. Because validation is a common part of training, MMEngine abstracts it into two independent modules: [Evaluator](../tutorials/evaluation.md) and [ValLoop](../tutorials/runner.md). We can customize the metric or the validation process by defining a new [metric](mmengine.evaluator.BaseMetric) or a new [loop](mmengine.runner.ValLoop).

```python
import torch
from mmengine.evaluator import BaseMetric
from mmengine.registry import METRICS

@METRICS.register_module(force=True)
class ToyAccuracyMetric(BaseMetric):

    def process(self, label, pred) -> None:
        self.results.append((label[1], pred, len(label[1])))

    def compute_metrics(self, results: list) -> dict:
        num_sample = 0
        acc = 0
        for label, pred, batch_size in results:
            acc += (label == torch.stack(pred)).sum()
            num_sample += batch_size
        return dict(Accuracy=acc / num_sample)
```

After defining the metric, we should also configure the evaluator and loop for `Runner`. The example used in this tutorial is as follows:

```python
val_evaluator = dict(type='ToyAccuracyMetric')
val_cfg = dict(type='ValLoop')
```
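
The accumulate-then-reduce pattern behind `process` and `compute_metrics` can be illustrated without PyTorch or MMEngine. The class below is a plain-Python stand-in written for this explanation. It does not subclass `BaseMetric`, and it assumes labels and predictions arrive as lists of integers.

```python
class PlainAccuracy:
    """Stand-in that mimics the process/compute_metrics split."""

    def __init__(self):
        self.results = []

    def process(self, labels, preds):
        # Called once per batch: store only what the final reduction needs.
        correct = sum(int(label == pred) for label, pred in zip(labels, preds))
        self.results.append((correct, len(labels)))

    def compute_metrics(self, results):
        # Called once after all batches: reduce the stored per-batch results.
        total_correct = sum(correct for correct, _ in results)
        num_samples = sum(size for _, size in results)
        return dict(Accuracy=total_correct / num_samples)


metric = PlainAccuracy()
metric.process([0, 1, 1], [0, 1, 0])  # 2 of 3 correct
metric.process([1, 0], [1, 0])        # 2 of 2 correct
accuracy = metric.compute_metrics(metric.results)
```

Keeping per-batch state small in `process` and deferring the reduction to `compute_metrics` is what lets the same metric work across many batches.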

<table class="docutils">
<thead>
  <tr>
    <th>Configure validation in MMCV</th>
    <th>Configure validation in MMEngine</th>
<tbody>
  <tr>
  <td valign="top" class='two-column-table-wrapper' width="50%"><div style="overflow-x: auto">

```python
eval_cfg = cfg.get('evaluation', {})
eval_cfg['by_epoch'] = cfg.runner['type'] != 'IterBasedRunner'
eval_hook = DistEvalHook if distributed else EvalHook
runner.register_hook(
    eval_hook(val_dataloader, **eval_cfg), priority='LOW')
```

</div>
  </td>
  <td valign="top" class='two-column-table-wrapper' width="50%"><div style="overflow-x: auto">

```python
val_dataloader = val_dataloader
val_evaluator = dict(type='ToyAccuracyMetric')
val_cfg = dict(type='ValLoop')
```

</div>
  </td>
</tr>
</thead>
</table>

### Build Runner

**Building Runner in MMCV**

```python
runner = EpochBasedRunner(
    model=model,
    optimizer=optimizer,
    work_dir=work_dir,
    logger=logger,
    max_epochs=4
)
```

**Building Runner in MMEngine**

The choice of `EpochBasedRunner` and the `max_epochs` argument in MMCV move to `train_cfg` in MMEngine. The parameters configurable in `train_cfg` are listed below:

- `by_epoch`: `True` is equivalent to `EpochBasedRunner`; `False` is equivalent to `IterBasedRunner`
- `max_epochs`/`max_iters`: equivalent to `max_epochs` and `max_iters` in MMCV
- `val_interval`: equivalent to the evaluation `interval` in MMCV

```python
from mmengine.runner import Runner

runner = Runner(
    model=model,  # model to be optimized
    work_dir='./work_dir',  # working directory
    randomness=randomness,  # random seed
    env_cfg=env_cfg,  # environment config
    launcher='none',  # launcher for distributed training
    optim_wrapper=optim_wrapper,  # configure optimizer wrapper
    param_scheduler=param_scheduler,  # configure parameter scheduler
    train_dataloader=train_dataloader,  # configure train dataloader
    train_cfg=dict(by_epoch=True, max_epochs=4, val_interval=1),  # Configure training loop
    val_dataloader=val_dataloader,  # Configure validation dataloader
    val_evaluator=val_evaluator,  # Configure evaluator and metrics
    val_cfg=val_cfg)  # Configure validation loop
```
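
The mapping from an MMCV runner config to `train_cfg` can be sketched as follows. `runner_cfg_to_train_cfg` is a hypothetical helper written for this guide, and the MMCV-style dict `dict(type='EpochBasedRunner', max_epochs=4)` is an assumed input format. Only the three keys listed above are covered.

```python
def runner_cfg_to_train_cfg(runner_cfg, eval_interval=1):
    """Map an MMCV-style runner dict to an MMEngine-style `train_cfg`.

    Hypothetical helper for illustration only.
    """
    runner_type = runner_cfg['type']
    if runner_type == 'EpochBasedRunner':
        return dict(
            by_epoch=True,
            max_epochs=runner_cfg['max_epochs'],
            val_interval=eval_interval)
    if runner_type == 'IterBasedRunner':
        return dict(
            by_epoch=False,
            max_iters=runner_cfg['max_iters'],
            val_interval=eval_interval)
    raise ValueError(f'unsupported runner type: {runner_type}')


epoch_cfg = runner_cfg_to_train_cfg(
    dict(type='EpochBasedRunner', max_epochs=4))
iter_cfg = runner_cfg_to_train_cfg(
    dict(type='IterBasedRunner', max_iters=1000), eval_interval=100)
```

Here `epoch_cfg` matches the `train_cfg` passed to `Runner` above.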

### Load checkpoint

**Loading checkpoint in MMCV**

```python
if cfg.resume_from:
    runner.resume(cfg.resume_from)
elif cfg.load_from:
    runner.load_checkpoint(cfg.load_from)
```

**Loading checkpoint in MMEngine**

```python
runner = Runner(
    ...
    load_from='/path/to/checkpoint',
    resume=True
)
```

<table class="docutils">
<thead>
  <tr>
    <th>Configuration of loading checkpoint in MMCV</th>
    <th>Configuration of loading checkpoint in MMEngine</th>
<tbody>
<tr>
  <td valign="top" class='two-column-table-wrapper' width="50%"><div style="overflow-x: auto">

```python
load_from = 'path/to/ckpt'
```

</div>
  </td>
  <td valign="top" class='two-column-table-wrapper' width="50%"><div style="overflow-x: auto">

```python
load_from = 'path/to/ckpt'
resume = False
```

</div>
  </td>
</tr>
<tr>
  <td valign="top" class='two-column-table-wrapper' width="50%"><div style="overflow-x: auto">

```python
resume_from = 'path/to/ckpt'
```

</div>
  </td>
  <td valign="top" class='two-column-table-wrapper' width="50%"><div style="overflow-x: auto">

```python
load_from = 'path/to/ckpt'
resume = True
```

</div>
  </td>
  </tr>
</thead>
</table>
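
The table can be condensed into a small conversion sketch. `to_mmengine_ckpt_cfg` is a hypothetical helper, not an MMEngine API. It mirrors the MMCV `if`/`elif` shown earlier, so `resume_from` takes precedence over `load_from`.

```python
def to_mmengine_ckpt_cfg(resume_from=None, load_from=None):
    """Map MMCV `resume_from`/`load_from` to MMEngine `load_from` + `resume`.

    Hypothetical helper for illustration only.
    """
    if resume_from:
        # Resuming restores the training state from the checkpoint.
        return dict(load_from=resume_from, resume=True)
    if load_from:
        # Loading only initializes from the checkpoint.
        return dict(load_from=load_from, resume=False)
    return dict(load_from=None, resume=False)


resume_cfg = to_mmengine_ckpt_cfg(resume_from='path/to/ckpt')
load_cfg = to_mmengine_ckpt_cfg(load_from='path/to/ckpt')
```

The resulting dicts can be unpacked into the `Runner` constructor arguments of the same names.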

### Training process

**Training process in MMCV**

First resume from or load the checkpoint, then start training.

```python
if cfg.resume_from:
    runner.resume(cfg.resume_from)
elif cfg.load_from:
    runner.load_checkpoint(cfg.load_from)
runner.run(data_loaders, cfg.workflow)
```

**Training process in MMEngine**

The process above is handled by `Runner.__init__` and `Runner.train`:

```python
runner.train()
```

### Testing process

Since MMCV Runner does not integrate the test function, we need to implement the test scripts by ourselves.

For MMEngine Runner, as long as we have configured the `test_dataloader`, `test_cfg` and `test_evaluator` for the `Runner`, we can call `Runner.test` to start the testing process.

**`work_dir` is the same as for training**

```python
runner = Runner(
    model=model,
    work_dir='./work_dir',
    randomness=randomness,
    env_cfg=env_cfg,
    optim_wrapper=optim_wrapper,
    train_dataloader=train_dataloader,
    train_cfg=dict(by_epoch=True, max_epochs=5, val_interval=1),
    val_dataloader=val_dataloader,
    val_evaluator=val_evaluator,
    val_cfg=val_cfg,
    test_dataloader=val_dataloader,
    test_evaluator=val_evaluator,
    test_cfg=dict(type='TestLoop'),
)
runner.test()
```

**`work_dir` is different from training; configure `load_from` manually**

```python
runner = Runner(
    model=model,
    work_dir='./test_work_dir',
    load_from='./work_dir/epoch_5.pth',  # set load_from additionally
    randomness=randomness,
    env_cfg=env_cfg,
    launcher='none',
    optim_wrapper=optim_wrapper,
    train_dataloader=train_dataloader,
    train_cfg=dict(by_epoch=True, max_epochs=5, val_interval=1),
    val_dataloader=val_dataloader,
    val_evaluator=val_evaluator,
    val_cfg=val_cfg,
    test_dataloader=val_dataloader,
    test_evaluator=val_evaluator,
    test_cfg=dict(type='TestLoop'),
)
runner.test()
```

### Customize training process

If we want to customize the training/validation process in MMCV, we need to override `Runner.train` or `Runner.val` in a custom `Runner`. Take `Runner.train` as an example: to train on the same batch twice in each iteration, we can override it like this:

```python
import time

from mmcv.runner import EpochBasedRunner


class CustomRunner(EpochBasedRunner):
    def train(self, data_loader, **kwargs):
        self.model.train()
        self.mode = 'train'
        self.data_loader = data_loader
        self._max_iters = self._max_epochs * len(self.data_loader)
        self.call_hook('before_train_epoch')
        time.sleep(2)  # Prevent possible deadlock during epoch transition
        for i, data_batch in enumerate(self.data_loader):
            self.data_batch = data_batch
            self._inner_iter = i
            for _ in range(2):  # run the same batch twice
                self.call_hook('before_train_iter')
                self.run_iter(data_batch, train_mode=True, **kwargs)
                self.call_hook('after_train_iter')
            del self.data_batch
            self._iter += 1

        self.call_hook('after_train_epoch')
        self._epoch += 1
```

In MMEngine, we customize the train loop instead:

```python
from mmengine.registry import LOOPS
from mmengine.runner import EpochBasedTrainLoop


@LOOPS.register_module()
class CustomEpochBasedTrainLoop(EpochBasedTrainLoop):
    def run_iter(self, idx, data_batch) -> None:
        for _ in range(2):
            super().run_iter(idx, data_batch)
```

Then we set `type` to `CustomEpochBasedTrainLoop` in `train_cfg`. Note that `by_epoch` and `type` cannot be configured at the same time. When `by_epoch` is configured instead of `type`, the type of the training loop is inferred from it, for example `EpochBasedTrainLoop` for `by_epoch=True`.

```python
runner = Runner(
    model=model,
    work_dir='./test_work_dir',
    randomness=randomness,
    env_cfg=env_cfg,
    launcher='none',
    optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.001, momentum=0.9)),
    train_dataloader=train_dataloader,
    train_cfg=dict(
        type='CustomEpochBasedTrainLoop',
        max_epochs=5,
        val_interval=1),
    val_dataloader=val_dataloader,
    val_evaluator=val_evaluator,
    val_cfg=val_cfg,
    test_dataloader=val_dataloader,
    test_evaluator=val_evaluator,
    test_cfg=dict(type='TestLoop'),
)
runner.train()
```
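
The rule that `by_epoch` and `type` are mutually exclusive can be sketched in plain Python. `infer_train_loop_type` is a hypothetical function that only approximates the behavior described above; the actual checks live inside MMEngine's `Runner` and may differ in detail. The `IterBasedTrainLoop` name for `by_epoch=False` is assumed from MMEngine's loop naming.

```python
def infer_train_loop_type(train_cfg):
    """Return the loop type name implied by a `train_cfg` dict.

    Hypothetical sketch of the rule, not MMEngine's implementation.
    """
    if 'type' in train_cfg and 'by_epoch' in train_cfg:
        raise RuntimeError(
            '`type` and `by_epoch` cannot be configured at the same time')
    if 'type' in train_cfg:
        # An explicit type, such as a custom registered loop, wins.
        return train_cfg['type']
    # Otherwise the loop type is inferred from `by_epoch`.
    if train_cfg['by_epoch']:
        return 'EpochBasedTrainLoop'
    return 'IterBasedTrainLoop'


custom = infer_train_loop_type(
    dict(type='CustomEpochBasedTrainLoop', max_epochs=5, val_interval=1))
default = infer_train_loop_type(
    dict(by_epoch=True, max_epochs=5, val_interval=1))
```

This is why the `train_cfg` above drops `by_epoch` once `type='CustomEpochBasedTrainLoop'` is set.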

For more complicated migration needs of `Runner`, you can refer to the [runner tutorials](../tutorials/runner.md) and [runner design](../design/runner.md).