.TH "GCLOUD_AI_ENDPOINTS_DEPLOY\-MODEL" 1


.SH "NAME"
.HP
gcloud ai endpoints deploy\-model \- deploy a model to an existing Vertex AI endpoint


.SH "SYNOPSIS"
.HP
\f5gcloud ai endpoints deploy\-model\fR (\fIENDPOINT\fR\ :\ \fB\-\-region\fR=\fIREGION\fR) \fB\-\-display\-name\fR=\fIDISPLAY_NAME\fR \fB\-\-model\fR=\fIMODEL\fR [\fB\-\-accelerator\fR=[\fIcount\fR=\fICOUNT\fR],[\fItype\fR=\fITYPE\fR]] [\fB\-\-autoscaling\-metric\-specs\fR=[\fIMETRIC\-NAME\fR=\fITARGET\fR,...]] [\fB\-\-deployed\-model\-id\fR=\fIDEPLOYED_MODEL_ID\fR] [\fB\-\-disable\-container\-logging\fR] [\fB\-\-enable\-access\-logging\fR] [\fB\-\-gpu\-partition\-size\fR=\fIGPU_PARTITION_SIZE\fR] [\fB\-\-machine\-type\fR=\fIMACHINE_TYPE\fR] [\fB\-\-max\-replica\-count\fR=\fIMAX_REPLICA_COUNT\fR] [\fB\-\-min\-replica\-count\fR=\fIMIN_REPLICA_COUNT\fR] [\fB\-\-required\-replica\-count\fR=\fIREQUIRED_REPLICA_COUNT\fR] [\fB\-\-reservation\-affinity\fR=[\fIkey\fR=\fIKEY\fR],[\fIreservation\-affinity\-type\fR=\fIRESERVATION\-AFFINITY\-TYPE\fR],[\fIvalues\fR=\fIVALUES\fR]] [\fB\-\-service\-account\fR=\fISERVICE_ACCOUNT\fR] [\fB\-\-spot\fR] [\fB\-\-traffic\-split\fR=[\fIDEPLOYED_MODEL_ID\fR=\fIVALUE\fR,...]] [\fIGCLOUD_WIDE_FLAG\ ...\fR]


.SH "EXAMPLES"

To deploy a model \f5\fI456\fR\fR to an endpoint \f5\fI123\fR\fR under project
\f5\fIexample\fR\fR in region \f5\fIus\-central1\fR\fR, run:

.RS 2m
$ gcloud ai endpoints deploy\-model 123 \-\-project=example \e
    \-\-region=us\-central1 \-\-model=456 \e
    \-\-display\-name=my_deployed_model
.RE


.SH "POSITIONAL ARGUMENTS"

.RS 2m
.TP 2m

Endpoint resource \- The endpoint to deploy a model to. The arguments in this
group can be used to specify the attributes of this resource. (NOTE) Some
attributes are not given arguments in this group but can be set in other ways.

To set the \f5project\fR attribute:
.RS 2m
.IP "\(em" 2m
provide the argument \f5endpoint\fR on the command line with a fully specified
name;
.IP "\(em" 2m
provide the argument \f5\-\-project\fR on the command line;
.IP "\(em" 2m
set the property \f5core/project\fR.
.RE
.sp

This must be specified.


.RS 2m
.TP 2m
\fIENDPOINT\fR

ID of the endpoint or fully qualified identifier for the endpoint.

To set the \f5name\fR attribute:
.RS 2m
.IP "\(bu" 2m
provide the argument \f5endpoint\fR on the command line.
.RE
.sp

This positional argument must be specified if any of the other arguments in this
group are specified.

.TP 2m
\fB\-\-region\fR=\fIREGION\fR

Cloud region for the endpoint.

To set the \f5region\fR attribute:
.RS 2m
.IP "\(bu" 2m
provide the argument \f5endpoint\fR on the command line with a fully specified
name;
.IP "\(bu" 2m
provide the argument \f5\-\-region\fR on the command line;
.IP "\(bu" 2m
set the property \f5ai/region\fR;
.IP "\(bu" 2m
choose one from the prompted list of available regions.
.RE
.sp


.RE
.RE
.sp

.SH "REQUIRED FLAGS"

.RS 2m
.TP 2m
\fB\-\-display\-name\fR=\fIDISPLAY_NAME\fR

Display name of the deployed model.

.TP 2m
\fB\-\-model\fR=\fIMODEL\fR

ID of the uploaded model.


.RE
.sp

.SH "OPTIONAL FLAGS"

.RS 2m
.TP 2m
\fB\-\-accelerator\fR=[\fIcount\fR=\fICOUNT\fR],[\fItype\fR=\fITYPE\fR]

Manage the accelerator config for GPU serving. When deploying a model with
Compute Engine Machine Types, a GPU accelerator may also be selected.

.RS 2m
.TP 2m
\fBtype\fR
The type of the accelerator. Choices are 'nvidia\-a100\-80gb', 'nvidia\-b200',
\'nvidia\-gb200', 'nvidia\-h100\-80gb', 'nvidia\-h100\-mega\-80gb',
\'nvidia\-h200\-141gb', 'nvidia\-l4', 'nvidia\-rtx\-pro\-6000',
\'nvidia\-tesla\-a100', 'nvidia\-tesla\-k80', 'nvidia\-tesla\-p100',
\'nvidia\-tesla\-p4', 'nvidia\-tesla\-t4', 'nvidia\-tesla\-v100'.

.TP 2m
\fBcount\fR
The number of accelerators to attach to each machine running the job. This is
usually 1. If not specified, the default value is 1.

For example: \f5\-\-accelerator=type=nvidia\-tesla\-k80,count=1\fR

.RE
.sp
.TP 2m
\fB\-\-autoscaling\-metric\-specs\fR=[\fIMETRIC\-NAME\fR=\fITARGET\fR,...]

Metric specifications that control autoscaling behavior. At most one entry is
allowed per metric.

.RS 2m
.TP 2m
\fBMETRIC\-NAME\fR
Resource metric name. Choices are 'cpu\-usage', 'dcgm\-fi\-dev\-gpu\-util',
\'gpu\-duty\-cycle', 'request\-counts\-per\-minute',
\'vllm\-gpu\-cache\-usage\-perc', 'vllm\-num\-requests\-waiting'.

.TP 2m
\fBTARGET\fR
Target value for the given metric. For \f5cpu\-usage\fR, \f5gpu\-duty\-cycle\fR,
\f5dcgm\-fi\-dev\-gpu\-util\fR, and \f5vllm\-gpu\-cache\-usage\-perc\fR, the
target is the target resource utilization in percentage (1% \- 100%). For
\f5request\-counts\-per\-minute\fR, the target is the number of requests per
minute per replica. For \f5vllm\-num\-requests\-waiting\fR, the target is the
number of pending requests allowed on the replica.

For example, to set target CPU usage to 70% and target requests to 600 per
minute per replica:
\f5\-\-autoscaling\-metric\-specs=cpu\-usage=70,request\-counts\-per\-minute=600\fR

.RE
.sp
.TP 2m
\fB\-\-deployed\-model\-id\fR=\fIDEPLOYED_MODEL_ID\fR

User\-specified ID of the deployed\-model.

.TP 2m
\fB\-\-disable\-container\-logging\fR

For custom\-trained Models and AutoML Tabular Models, the container of the
deployed model instances will send \f5stderr\fR and \f5stdout\fR streams to
Cloud Logging by default. Please note that the logs incur cost, which are
subject to Cloud Logging pricing (https://cloud.google.com/stackdriver/pricing).

User can disable container logging by setting this flag to true.

.TP 2m
\fB\-\-enable\-access\-logging\fR

If true, online prediction access logs are sent to Cloud Logging.

These logs are standard server access logs, containing information like
timestamp and latency for each prediction request.

.TP 2m
\fB\-\-gpu\-partition\-size\fR=\fIGPU_PARTITION_SIZE\fR

The partition size of the GPU accelerator. This can be used to partition a
single GPU into multiple smaller GPU instances. See
https://cloud.google.com/kubernetes\-engine/docs/how\-to/gpus\-multi#multi\-instance_gpu_partitions
for more details.

.TP 2m
\fB\-\-machine\-type\fR=\fIMACHINE_TYPE\fR

The machine resources to be used for each node of this deployment. For available
machine types, see
https://cloud.google.com/ai\-platform\-unified/docs/predictions/machine\-types.

.TP 2m
\fB\-\-max\-replica\-count\fR=\fIMAX_REPLICA_COUNT\fR

Maximum number of machine replicas for the deployment resources the model will
be deployed on.

.TP 2m
\fB\-\-min\-replica\-count\fR=\fIMIN_REPLICA_COUNT\fR

Minimum number of machine replicas for the deployment resources the model will
be deployed on. For normal deployments, the value must be equal to or larger
than 1. If the value is 0, the deployment will be enrolled in the
scale\-to\-zero feature. If not specified and the uploaded models use dedicated
resources, the default value is 1.

NOTE: DeploymentResourcePools (model\-cohosting) is currently not supported for
scale\-to\-zero deployments.

.TP 2m
\fB\-\-required\-replica\-count\fR=\fIREQUIRED_REPLICA_COUNT\fR

Required number of machine replicas for the deployment resources the model will
be considered successfully deployed. This value must be greater than or equal to
1 and less than or equal to min\-replica\-count.

.TP 2m
\fB\-\-reservation\-affinity\fR=[\fIkey\fR=\fIKEY\fR],[\fIreservation\-affinity\-type\fR=\fIRESERVATION\-AFFINITY\-TYPE\fR],[\fIvalues\fR=\fIVALUES\fR]

A ReservationAffinity can be used to configure a Vertex AI resource (e.g., a
DeployedModel) to draw its Compute Engine resources from a Shared Reservation,
or exclusively from on\-demand capacity.

.TP 2m
\fB\-\-service\-account\fR=\fISERVICE_ACCOUNT\fR

Service account that the deployed model's container runs as. Specify the email
address of the service account. If this service account is not specified, the
container runs as a service account that doesn't have access to the resource
project.

.TP 2m
\fB\-\-spot\fR

If true, schedule the deployment workload on Spot VMs.

.TP 2m
\fB\-\-traffic\-split\fR=[\fIDEPLOYED_MODEL_ID\fR=\fIVALUE\fR,...]

List of pairs of deployed model id and value to set as traffic split.


.RE
.sp

.SH "GCLOUD WIDE FLAGS"

These flags are available to all commands: \-\-access\-token\-file, \-\-account,
\-\-billing\-project, \-\-configuration, \-\-flags\-file, \-\-flatten,
\-\-format, \-\-help, \-\-impersonate\-service\-account, \-\-log\-http,
\-\-project, \-\-quiet, \-\-trace\-token, \-\-user\-output\-enabled,
\-\-verbosity.

Run \fB$ gcloud help\fR for details.


.SH "NOTES"

These variants are also available:

.RS 2m
$ gcloud alpha ai endpoints deploy\-model
.RE

.RS 2m
$ gcloud beta ai endpoints deploy\-model
.RE