Scaling Best Practices - Self-Managed Repository Integrations
To estimate the number of container replicas required for your workload, please contact Mend.io Professional Services.
The following guidance is designed for large deployments of 10,000 or more repositories.
Cluster Worker Nodes
Recommended per node compute resources:
8 CPU cores, 64 GB RAM
1 TB SSD storage
AWS: r6i.2xlarge (or equivalent)
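If you run on AWS EKS, an eksctl managed node group matching this profile might look as follows. This is a minimal sketch under that assumption; the cluster name, region, and node count are placeholders and should be sized to your replica estimate.

```yaml
# Minimal eksctl sketch (assumption: AWS EKS; names, region, and node count are placeholders)
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: mend-integrations      # placeholder
  region: us-east-1            # placeholder
managedNodeGroups:
  - name: mend-workers         # placeholder
    instanceType: r6i.2xlarge  # 8 vCPU / 64 GB RAM
    volumeSize: 1000           # ~1 TB SSD per node
    volumeType: gp3
    desiredCapacity: 3         # placeholder; size to your workload
```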
Controller
Scaling out the controller requires an ingress load balancer configured to round-robin requests across controller replicas. Controllers are stateless. Do not configure sticky sessions.
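As an illustration, a Kubernetes Service in front of the Controller replicas leaves session affinity at its default (None) so the ingress load balancer can round-robin requests across pods. The name, labels, and ports below are placeholders.

```yaml
# Excerpt of a Service fronting the Controller replicas.
# sessionAffinity: None (the Kubernetes default) avoids sticky sessions.
apiVersion: v1
kind: Service
metadata:
  name: controller            # placeholder
spec:
  selector:
    app: controller           # placeholder label
  sessionAffinity: None
  ports:
    - port: 80
      targetPort: 8080        # placeholder container port
```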
Container request/limit sizes:
Request:
CPU: 2 core (2000m)
Memory: 6GB RAM (6G)
Limit:
CPU: 2 core (2000m)
Memory: 6GB RAM (6G)
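Expressed as a Kubernetes container spec, these values look as follows; this is a sketch, and the container name is a placeholder.

```yaml
# Excerpt from the Controller container spec (container name is a placeholder)
containers:
  - name: controller
    resources:
      requests:
        cpu: "2000m"
        memory: "6G"
      limits:
        cpu: "2000m"
        memory: "6G"
```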
JVM
Specify the following JAVA_OPTS settings:
Environment Variable | Value |
JAVA_OPTS | -Xms4G -Xmx4G |
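In a Kubernetes manifest, this corresponds to a container environment entry such as:

```yaml
# Excerpt: Controller container environment (v23.7.1 or later)
env:
  - name: JAVA_OPTS
    value: "-Xms4G -Xmx4G"
```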
To monitor and validate Scanner horizontal scaling, track pending and active scans via the scan queue statistics API.
Monitor controller logs for JVM out-of-memory errors.
Prior to v23.7.1, setting JVM options requires modifying the Controller shell.sh launch script; contact Mend Professional Services for specific guidance. Upgrading to the latest version is recommended.
Specifying 4GB for the JVM heap space leaves ~2GB for the system and other JVM memory requirements.
Scanner
Scanners clone repositories and cache open source artifacts resolved from package manager manifest files. Storage requirements are determined by the size of the cloned repositories and the number of dependencies they contain, and vary greatly from customer to customer and repository to repository.
Scanner container memory and storage limits should match the requirements of the largest build machine used to build the largest repositories being scanned. In the absence of that information, use the following values and adjust as necessary based on scanner performance.
Container request/limit sizes:
Request:
CPU: 0.5 core (500m)
Memory: 2.5GB RAM (2500M)
Ephemeral Storage: 250GB (250G)
Limit:
CPU: 1 core (1000m)
Memory: 5GB RAM (5G)
Ephemeral Storage: 500GB (500G)
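As a sketch, the equivalent Kubernetes container spec, with storage expressed via the ephemeral-storage resource; the container name is a placeholder.

```yaml
# Excerpt from the Scanner container spec (container name is a placeholder)
containers:
  - name: scanner
    resources:
      requests:
        cpu: "500m"
        memory: "2500M"
        ephemeral-storage: "250G"
      limits:
        cpu: "1000m"
        memory: "5G"
        ephemeral-storage: "500G"
```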
AutoScale
DO NOT configure your container orchestrator (e.g., Kubernetes) to automatically scale out the number of Scanner replicas based on CPU or memory utilization. These system metrics do not indicate that additional scanners are needed. The only metric useful for scaling out is the number of pending scans from the Controller scan queue statistics API mentioned above.
JVM
Do not adjust the default JVM options.
Git Connector
The Scanner uses JGit by default to clone repositories. JGit has several limitations, including file size constraints and a lack of support for shallow cloning.
For large-scale deployments, switch to the Git Connector to take advantage of shallow Git clones, which reduce ephemeral storage requirements during scans.
If the Git Connector is enabled and your environment requires custom certificate authorities or a proxy, Git itself must be configured for them. See our Custom Certificate guidance for more information; for proxy support, see the Git documentation.
To enable the Git Connector, set the following environment variable:
Environment Variable | Value |
WS_GIT_CONNECTOR | true |
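For example, as an environment entry in the Scanner container spec:

```yaml
# Excerpt: Scanner container environment
env:
  - name: WS_GIT_CONNECTOR
    value: "true"
```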
Remediate
Remediate scales out using a Server/Worker model. The Remediate Server is stateful and manages an in-memory job queue. There can only be one instance of a Remediate Server per SCM integration cluster. Remediate Workers pull jobs off the Server queue and perform the R/R work. Workers are stateless and can scale out as required. Worker/Server modes are controlled by environment variables. See the product documentation for more information.
Server request/limit sizes:
Request:
CPU: 0.6 core (600m)
Memory: 1GB (1Gi)
Disk: 20GB
Limit:
CPU: 1.0 core (1000m)
Memory: 4GB (4Gi)
Disk: 20GB
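A corresponding container spec sketch; the container name is a placeholder, and representing the disk figure as Kubernetes ephemeral-storage is an assumption.

```yaml
# Excerpt from the Remediate Server container spec (container name is a placeholder;
# disk is expressed as ephemeral-storage, which is an assumption)
containers:
  - name: remediate-server
    resources:
      requests:
        cpu: "600m"
        memory: "1Gi"
        ephemeral-storage: "20G"
      limits:
        cpu: "1000m"
        memory: "4Gi"
        ephemeral-storage: "20G"
```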
Worker request/limit sizes:
Request:
CPU: 0.6 core (600m)
Memory: 1GB (1Gi)
Disk: 20GB
Limit:
CPU: 1.0 core (1000m)
Memory: 2GB (2Gi)
Disk: 40GB (minimum, depending on repo size)
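And the Worker equivalent, with the same assumptions; size ephemeral storage above the 40GB minimum based on the repositories being remediated.

```yaml
# Excerpt from the Remediate Worker container spec (container name is a placeholder;
# disk is expressed as ephemeral-storage, which is an assumption)
containers:
  - name: remediate-worker
    resources:
      requests:
        cpu: "600m"
        memory: "1Gi"
        ephemeral-storage: "20G"
      limits:
        cpu: "1000m"
        memory: "2Gi"
        ephemeral-storage: "40G"   # minimum; increase for large repositories
```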
JVM - N/A
Remediate is a Node.js application and does not require JVM settings.
Other Settings
To reduce the load on your SCM system and avoid hitting API rate limits, reduce the Remediate Server cron schedule from the default (hourly) to daily at midnight UTC:
Add the following environment variable to the Remediate-Server pod:
Environment Variable | Value |
SCHEDULER_CRON | 0 0 * * * |
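For example, in the Remediate-Server container spec:

```yaml
# Excerpt: Remediate-Server container environment
env:
  - name: SCHEDULER_CRON
    value: "0 0 * * *"   # daily at 00:00 UTC
```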
Add the following environment variable to the Worker service if you experience hung Remediate Workers:
Environment Variable | Value |
REMEDIATE_JOB_TIMEOUT_MINUTES | 60 |
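For example, in the Worker container spec:

```yaml
# Excerpt: Remediate Worker container environment
env:
  - name: REMEDIATE_JOB_TIMEOUT_MINUTES
    value: "60"   # job timeout in minutes
```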
Deploy Redis for caching across your Remediate container pool.
Monitor the Remediate status API, and track the number of queued jobs over time. Use this information to determine the number of R/R workers required to handle your workload. See the Remediate docs for more detail on this status API.