tructure.

    In the flattened `fx.GraphModule`, each `nn.Module` forward call has been traced as
    a sequence of `fx.Node`s. All these `fx.Node`s are flattened and reside in the same
    `fx.GraphModule`.

    This pass generates a new `fx.GraphModule`. It groups the flattened `fx.Node`s that belong
    to the same `nn.Module` forward call into a sub `fx.GraphModule`. It then replaces the
    sequence of flattened `fx.Node`s with a single `call_module` node, which is linked with
    the sub `fx.GraphModule` by `node.target`. The sub `fx.GraphModule` is registered as a
    submodule of the new `fx.GraphModule`.

    The process is done based on information from the `nn_module_stack` metadata of each node, i.e.
    `node.meta["nn_module_stack"]`. For more implementation details, see [NOTE: Modularize Pass Implementation].
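
    The grouping idea can be sketched without torch. The snippet below is a simplified,
    hypothetical model and not the implementation of this pass: each flattened node is a
    `(name, stack)` pair, where `stack` is a tuple of module names from outermost to innermost,
    standing in for the information in `nn_module_stack`. The helper name `group_by_module`
    and the node names are assumptions made for illustration.

    ```python
    from itertools import groupby

    # Hypothetical stand-in for flattened fx nodes: (node name, module stack).
    # The stack lists enclosing module names from outermost to innermost.
    flat_nodes = [
        ("embedding", ("layer", "layer.embedding")),
        ("relu", ("layer", "layer.relu")),
        ("t", ("linear",)),
        ("addmm", ("linear",)),
    ]


    def group_by_module(nodes, depth=0):
        # Group consecutive nodes that share the module at `depth` of their stack.
        def key(node):
            stack = node[1]
            return stack[depth] if depth < len(stack) else None

        grouped = []
        for module, members in groupby(nodes, key=key):
            members = list(members)
            if module is None:
                # Nodes that belong directly to the current module stay flat.
                grouped.extend(name for name, _ in members)
            else:
                # Recurse so that nested modules become nested groups.
                grouped.append((module, group_by_module(members, depth + 1)))
        return grouped


    top_level = group_by_module(flat_nodes)
    ```

    In this model each `(module, members)` pair plays the role of one `call_module` node
    together with its sub `fx.GraphModule`.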

    An fx submodule under this context can typically be interpreted in three different ways:

        1. As an embodiment of an nn.Module class, which is considered stateless.
        Its execution path can vary depending on the configuration of module initialization,
        which should also be part of the inputs.

        2. As a representation of an nn.Module instance. It maintains the state initialized in the module.
        The execution path can vary based on actual input data.

        3. As a captured call of an nn.Module instance, where the execution path
        is fixed.

    The generality decreases along this list. Within the scope of this function, the pass
    creates fx submodules according to the third interpretation.

    The first interpretation is the most general case. It requires complex analysis and additional
    metadata and code information to construct its general form. Consider an example nn.Module
    that generates arbitrary submodules based on an initialization configuration file. It's impractical
    to extract this logic for the generated fx submodule to function with arbitrary configuration.

    The second interpretation demands less analysis and is more robust than the
    first. In most use cases, it's equivalent to the third. It differs only in exceptional situations
    where a complex nn.Module instance is called multiple times, each time with inputs
    that lead to a different execution path.

    The third interpretation is the most specific case. It requires the least
    analysis and creates the most stable representation. The drawback is that it
    generates more redundancy than the other two. If needed, a subsequent post-processing
    pass can be applied to consolidate completely identical functions and reduce duplication.
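
    Such a consolidation step could look roughly like the sketch below. It is a hypothetical
    model, not part of this pass: each captured call is reduced to a `(name, body)` pair, and
    two calls count as identical only if their bodies compare equal. A real pass would also
    need to compare attributes, shapes and parameters.

    ```python
    # Hypothetical model: each captured call is a (name, body) pair, where the
    # body is the tuple of op names recorded for that call.
    captured_calls = [
        ("block_call_1", ("linear", "relu")),
        ("block_call_2", ("linear", "relu")),
        ("head_call_1", ("linear",)),
    ]

    canonical = {}  # body -> name of the first call seen with that body
    remap = {}      # every call name -> the canonical name it can reuse
    for name, body in captured_calls:
        remap[name] = canonical.setdefault(body, name)
    ```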

    ### Known constraints
    Two successive calls to the same module instance are indistinguishable and will be
    conflated into one. This is due to limitations of the current fx metadata `nn_module_stack`.
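
    The effect can be illustrated with a simplified, hypothetical model (plain tuples instead
    of fx nodes, with made-up node names): when the same instance is called twice in a row,
    every node carries the same module entry, so grouping consecutive nodes by that entry
    yields a single group.

    ```python
    from itertools import groupby

    # Hypothetical model: the same module instance `block` is called twice in a
    # row, producing two nodes per call. Both calls carry the same module entry.
    flat_nodes = [
        ("linear_1", "block"),
        ("relu_1", "block"),
        ("linear_2", "block"),  # the second call starts here, but nothing marks it
        ("relu_2", "block"),
    ]

    groups = [
        (module, [name for name, _ in members])
        for module, members in groupby(flat_nodes, key=lambda node: node[1])
    ]
    ```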

    [NOTE: Modularize pass ordering]
    This pass groups fx nodes into subgraphs that reside within the `call_module` fx node.
    Other fx passes (including some outside the exporter) might not recognize `call_module`.
    They may assume that all nodes are flattened. Hence it is recommended to invoke this pass
    as the last fx pass before ONNX export. If not for this consideration, this pass could
    be placed anywhere earlier in the pipeline.

    Example:

        >>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_ONNX)
        >>> import torch
        >>> from torch.onnx._internal.fx import passes
        >>> from torch.onnx._internal.diagnostics import infra
        >>>
        >>> class CustomModule(torch.nn.Module):
        >>>     def __init__(self):
        >>>         super().__init__()
        >>>         self.embedding = torch.nn.Embedding(10, 32)
        >>>         self.relu = torch.nn.ReLU()
        >>>
        >>>     def forward(self, x):
        >>>         out = self.embedding(x)
        >>>         out = self.relu(out)
        >>>         return out
        >>>
        >>> class TestModule(torch.nn.Module):
        >>>     def __init__(self):
        >>>         super().__init__()
        >>>         self.layer = CustomModule()
        >>>         self.linear = torch.nn.Linear(32, 10)
        >>>
        >>>     def forward(self, x):
        >>>         out = self.layer(x)
        >>>         out = self.linear(out)
        >>>         return out
        >>>
        >>> gm, _ = torch._dynamo.export(TestModule(), aten_graph=True)(torch.tensor([0, 1, 2]))
        >>> gm.print_readable()

        >>> gm = passes.Modularize(infra.DiagnosticContext("test_context", "1.0"), gm).run()
        >>> gm.print_readable()

    r