Troubleshooting Mend Self-Hosted Repo Integrations
Introduction
Users of the repository integrations should know how to troubleshoot them, because errors do occur. A good understanding of the Controller workflow, the Unified Agent process, and the Remediate container makes troubleshooting much easier. The following sections describe what to look for when troubleshooting and the steps to take to resolve common issues.
Controller
The controller is the central component responsible for orchestrating all activity for the Mend Repository Integration. It receives and processes inbound webhook events from the SCM system, and initiates all activities including:
repository onboarding
queuing scans
scan results management
remediation
The controller container(s) are stateless and use the Mend cloud for persistent storage. All inbound webhook events are queued in the Mend cloud and processed asynchronously by multiple Controller worker threads. Queue metrics are not currently available to customers, but Mend DevOps and Support monitor them, and internal alerts are triggered when queue issues arise. Customers will be notified if that occurs.
Valid SCM push events will be queued in the Mend cloud as Pending scans. Scanner(s) poll the scan queue for Pending scans, updating the status to Active when taking ownership. The controller exposes a statistics API for monitoring the Pending and Active scans, which is only available to the Self-Hosted Repository Integrations.
After scans complete and are processed in the Mend cloud, results are queued for retrieval by the Controller. Worker threads pull the scan results and update CheckRuns (or Pipelines) and Issues (for base branch scans).
For remediation activity, if enabled, relevant webhook events are forwarded to the Remediate Server for automated Pull Request creation and management.
Startup Checks
Upon initialization, the Mend Controller runs a series of checks to confirm that the rest of the integration has started properly. As a rule, if one check fails, the Repo Integration skips the remaining checks and errors out. It performs the following checks:
Check | Action to take if the Check Failed |
---|---|
Activation Key Parsing | A new activation key should be generated. |
Mend API Connectivity | Check connection to Mend Servers. |
Mend Credentials | Check connection to Mend Servers. |
Queue Implementation | Check connection to Mend Servers. Open a support ticket to determine if there is something wrong with integration queue. |
Cache Implementation | Check Cache settings and validate everything is set properly. This should not error out if default settings are used. |
SCM API Connectivity | Check API URL endpoint provided before generating the activation key. |
GitHub App Permissions | Check the App Permissions that were set when creating the GitHub Application. |
Controller to Remediate Connectivity | This may fail if the Remediate container has not fully started at the time of this check. This does not automatically mean there is a problem, and further testing should be conducted. |
Remediate to Controller Connectivity | This may fail if the Remediate container has not fully started at the time of this check. This does not automatically mean there is a problem, and further testing should be conducted. |
Error Messages
Search the Controller logs for the following strings when looking for errors:
Failed to store object in Mend cache
Failed to handle queue message
Failed to handle webhook due to an unexpected exception
Failed to handle publish issues request due to an unexpected exception
Failed to retrieve queue message
Failed to read message from queue
Failed to call Mend persistence queue
Failed to connect
Failed to get GH repository
Failed to get GH CheckRun
Failed to update GH CheckRun
Error processing git repository
Error while retrieving repo settings
Read timed out / Read time out
Connection reset
Out of memory
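The search described above can be scripted. The following is a minimal sketch, not an official Mend tool: it assumes the Controller logs have been saved as plain text, and the names `ERROR_STRINGS` and `find_errors` are illustrative. Matching is case-insensitive, which is a convenience choice rather than something the integration requires.

```python
import re
from typing import Iterable, Iterator, Tuple

# Search strings copied from the list above.
ERROR_STRINGS = [
    "Failed to store object in Mend cache",
    "Failed to handle queue message",
    "Failed to handle webhook due to an unexpected exception",
    "Failed to handle publish issues request due to an unexpected exception",
    "Failed to retrieve queue message",
    "Failed to read message from queue",
    "Failed to call Mend persistence queue",
    "Failed to connect",
    "Failed to get GH repository",
    "Failed to get GH CheckRun",
    "Failed to update GH CheckRun",
    "Error processing git repository",
    "Error while retrieving repo settings",
    "Read timed out",
    "Read time out",
    "Connection reset",
    "Out of memory",
]

# One alternation over all strings, escaped so they are matched literally.
ERROR_RE = re.compile("|".join(re.escape(s) for s in ERROR_STRINGS), re.IGNORECASE)


def find_errors(lines: Iterable[str]) -> Iterator[Tuple[int, str, str]]:
    """Yield (line number, matched text, full line) for each line containing a known string."""
    for number, line in enumerate(lines, start=1):
        match = ERROR_RE.search(line)
        if match:
            yield number, match.group(0), line.rstrip("\n")
```

To use it, open the saved log file (for example a hypothetical `controller.log`) and pass the file handle to `find_errors`; each result identifies the line to investigate further.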
Examples
2023/06/20 17:00:59.796/GMT [ERROR] com.wss.common.imp.queue.whitesource.connector.WssQueueConnector - [CTX=94ba...93421] Mend Queue - Failed to call Mend persistence queue. Action:addQueueMessage, url: https://customer.whitesourcesoftware.com/api, ResponseCode: 405
This means that an HTTP 405 (Method Not Allowed) error was returned when attempting to create a message in the persistence queue (hosted on Mend Servers).
2023/10/20 15:44:22.802/UTC [ERROR] com.wss.bolt.webhook.handler.WebhookHandlerModule - [CTX=f3f35161...3cac1;EVENT=check_run;REPO=.reponame] Failed to handle webhook due to an unexpected exception.
This typically occurs when there was a major error processing the repository. In general, this does not provide much context, but is a good way of knowing whether repository processing failed. Further investigation into the integration logs is recommended.
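Because both examples share the same layout (timestamp, level, logger, bracketed context, message), the context fields such as CTX, EVENT, and REPO can be extracted programmatically to correlate entries. The sketch below infers the format from the two sample lines only; it is an assumption that other Controller log lines follow the same layout, and `parse_line` is an illustrative name.

```python
import re
from typing import Dict, Optional

# Layout inferred from the sample lines above:
# <date> <time>/<tz> [LEVEL] <logger> - [KEY=value;KEY=value] <message>
LINE_RE = re.compile(
    r"^(?P<timestamp>\S+ \S+?)/(?P<tz>\w+) "
    r"\[(?P<level>\w+)\] "
    r"(?P<logger>\S+) - "
    r"\[(?P<context>[^\]]*)\] "
    r"(?P<message>.*)$"
)


def parse_line(line: str) -> Optional[Dict[str, object]]:
    """Split one log line into its fields, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record: Dict[str, object] = match.groupdict()
    # Turn "CTX=...;EVENT=...;REPO=..." into a dictionary.
    context = {}
    for pair in match.group("context").split(";"):
        key, sep, value = pair.partition("=")
        if sep:
            context[key] = value
    record["context"] = context
    return record
```

Grouping parsed records by their CTX value is one way to follow a single webhook event through the log.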
Scanner
The scanner generates both container logs and Unified Agent logs, but the two serve different purposes. The Scanner container process downloads the repository, resolves host rules if any are present, builds the project with the PSB (Pre-Step Builder), and then calls specific parts of the Mend Unified Agent to run a scan. The Unified Agent logs are generated by this process; however, they are not printed to the container's STDOUT unless the following environment variable is set: EXTERNAL_LOG_IN_CONSOLE=true.
Container Logs
Error Messages
Failed to clone
Could not read from remote repository - This could be due to connectivity issues, the app not being installed on the repository, permissions issues, and more.
2023-11-09 17:13:32.596/UTC [DEBUG] com.wss.github.scanner.manager.impl.HostRulesManagerImpl - [CTX=33474654f6b...00333;SCAN_CTX=0ab9a5...9c6;SCAN_ID=723150] PSB_RESULT_JSON={"result":[{"pm":"npm","tags":["NO_LOCKFILE"],"connectivity":[],"registryConfig":[],"preStep":[{"success":false,"ref":"package.json","errorType":"INSTALL_ERROR","errorMessage":"code EBADENGINE\nengine Unsupported engine\nengine Not compatible with your version of node/npm: nmp-dependency@0.0.1\nnotsup Not compatible with your version of node/npm: npm-mend-regression-test@0.0.1\nnotsup Required: {\"node\":\"16.x\"}\nnotsup Actual: {\"npm\":\"9.5.1\",\"node\":\"v18.14.2\"}"}]}],"logUrl":""}
Error messages like this one usually mean that the PSB failed. In this specific example, the project requires Node 16.x ("Required"), but the version available to the PSB is Node v18.14.2 with npm 9.5.1 ("Actual"), so the install step was rejected with EBADENGINE.
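The PSB result is embedded in the log line as JSON, so the failed pre-steps can be pulled out without reading the raw line. This is a minimal sketch that relies only on the fields visible in the sample above (`result`, `pm`, `preStep`, `success`, `ref`, `errorType`, `errorMessage`); whether other fields or shapes can occur is not covered here, and `failed_pre_steps` is an illustrative name.

```python
import json
from typing import Dict, List

MARKER = "PSB_RESULT_JSON="


def failed_pre_steps(line: str) -> List[Dict[str, str]]:
    """Return one entry per unsuccessful preStep found in a PSB_RESULT_JSON log line."""
    _, found, payload = line.partition(MARKER)
    if not found:
        return []
    # raw_decode tolerates any text that might follow the JSON object.
    result, _ = json.JSONDecoder().raw_decode(payload.lstrip())
    failures = []
    for entry in result.get("result", []):
        for step in entry.get("preStep", []):
            if not step.get("success", False):
                failures.append({
                    "pm": entry.get("pm", ""),
                    "ref": step.get("ref", ""),
                    "errorType": step.get("errorType", ""),
                    "errorMessage": step.get("errorMessage", ""),
                })
    return failures
```

Printing `errorMessage` for each returned entry shows the package manager output with its line breaks restored, which is easier to read than the escaped form in the log.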
Unified Agent Logs
Read the Unified Agent logs as you would any other Unified Agent log. It is therefore important to be familiar with the Unified Agent and how it runs.
Error Messages
[ERROR] Plugin org.apache.maven.plugins:maven-dependency-plugin:2.8 or one of its dependencies could not be resolved: Failed to read artifact descriptor for org.apache.maven.plugins:maven-dependency-plugin:jar:2.8
This means that Maven could not run the “maven-dependency-plugin”, which usually means that the plugin could not be downloaded. Troubleshooting should include checking internet connectivity and determining whether host rules were set properly, if they are required.
[DEBUG] org.whitesource.utils.command.Command - Command result error lines for 'poetry run pip download -r /tmp/ws-scm/Case122366/.ws-temp-PAZUIT-requirements.txt
This typically means that the package manager command (in this instance poetry) returned an error. Subsequent logs will show what was printed out when this command was run, and users should analyze these to determine what went wrong when resolving the dependency tree. Issues could include: no internet connectivity, package not found, and host rules not resolved properly.
Some dependency managers failed, but some succeeded. Updating Mend with successful and failed managers. Failed commands: {"python":[{"cmd":"poetry install --no-dev","exitCode":1},{"cmd":"poetry run pip freeze","exitCode":1},{"cmd":"poetry show --tree --no-dev","exitCode":1}]}
This is typically printed out at the end of the scanner process, and is a good way to quickly get information on what went wrong in a scan.
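The "Failed commands" summary carries a JSON object keyed by dependency manager, so it can be turned into a short report. The sketch below assumes the summary appears on a single line in the form shown above; `failed_commands` is an illustrative name.

```python
import json
from typing import Dict, List, Tuple

MARKER = "Failed commands:"


def failed_commands(line: str) -> Dict[str, List[Tuple[str, int]]]:
    """Map each dependency manager to its failed (command, exit code) pairs."""
    _, found, payload = line.partition(MARKER)
    if not found:
        return {}
    # raw_decode stops at the end of the JSON object and ignores trailing text.
    data, _ = json.JSONDecoder().raw_decode(payload.lstrip())
    return {
        manager: [(cmd.get("cmd"), cmd.get("exitCode")) for cmd in commands]
        for manager, commands in data.items()
    }
```

The first failed command for a manager is usually the best place to start searching the preceding Unified Agent log output.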
Remediate
The Remediate container is responsible for opening pull requests to update dependencies and for executing the Renovate process. The container can run one of two processes: Remediate Server or Remediate Worker. The server handles a queue of repositories that need to be resolved by the Remediate Worker, while the worker uses Remediate, Renovate, or both to determine which pull requests to open.
Error Messages
The Remediate container is an active participant in the repo integration flow, since it receives a copy of every webhook the controller gets from the SCM. For that reason, the Remediate logs can help diagnose issues that are not directly related to Remediate or Renovate. For example, the logs print the current Mend configuration for a specific repo (including settings inherited from a global config and defaults) as it appears in the controller, which is where the Remediate container pulls the configuration from. This makes it possible to validate an integrated repo's configuration in a readable form.
To make the most of the Remediate logs for troubleshooting, debug-level logging must be turned on. This can be done by setting the environment variable LOG_LEVEL=DEBUG before starting the containers. The logs include a "level": 20 element for each entry they output at debug level.
{"name":"renovate", … ,"repository":"<ORG/REPO NAME>","msg":"Retrieving repo configuration from controller", …}
{"name":"renovate", … ,"repository":"<ORG/REPO NAME>","config":{
… <FULL CONFIG DETAILS> …
},"msg":"Repo config from controller", …}
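The configuration entry shown above can be extracted from a saved Remediate log with a short script. This is a sketch under one stated assumption: each log entry is written as a single-line JSON object (the sample above is wrapped for readability, so verify this against your own logs). The function name `repo_configs` is illustrative.

```python
import json
from typing import Any, Iterable, Iterator, Optional, Tuple

CONFIG_MSG = "Repo config from controller"


def repo_configs(
    lines: Iterable[str], repository: Optional[str] = None
) -> Iterator[Tuple[Any, Any]]:
    """Yield (repository, config) for each 'Repo config from controller' entry."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # not a JSON log entry
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or multi-line entry, skipped in this sketch
        if entry.get("msg") != CONFIG_MSG:
            continue
        if repository is not None and entry.get("repository") != repository:
            continue
        yield entry.get("repository"), entry.get("config")
```

Passing the result through `json.dumps(config, indent=2)` gives an indented view of the configuration for a given repository.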
Another useful log entry is the "msg":"http statistics" block. This entry shows summary information that helps validate HTTP(S) connectivity to external endpoints. It is particularly useful for verifying whether hostRules processing resulted in Remediate and/or Renovate reaching out to the correct HTTP(S) endpoints, and whether any errors (or delays) resulted from those interactions. The example below shows statistics while reaching out to an Artifactory instance:
{
… ,
"repository": "<ORG/REPO NAME>",
"urls": {
"https://api.github.com/graphql (POST,200)": 1,
"https://api.github.com/repos/<ORG/REPO NAME>/branches/main/protection (GET,403)": 1,
"https://api.github.com/repos/<ORG/REPO NAME>/pulls (GET,200)": 1,
"https://registry.npmjs.org/axios (GET,304)": 1,
"https://registry.npmjs.org/express (GET,304)": 1,
"https://registry.npmjs.org/grunt (GET,304)": 1,
"https://wsdev.jfrog.io/artifactory/api/npm/jramirez_npm_local_test/@jorgerdemocorp-mend-selfhosted%2Fultrathin_vulnerable_npm (GET,200)": 1
},
"hostStats": {
"api.github.com": {
"requestCount": 3,
"requestAvgMs": 295,
"queueAvgMs": 0
},
"registry.npmjs.org": {
"requestCount": 3,
"requestAvgMs": 76,
"queueAvgMs": 0
},
"wsdev.jfrog.io": {
"requestCount": 1,
"requestAvgMs": 125,
"queueAvgMs": 0
}
},
"totalRequests": 7,
"msg": "http statistics",
…
}
Note that the "urls" block also reports the HTTP response code returned from calling each URL. This is especially useful for determining whether authentication occurred properly, whether there was an unhandled redirect, or whether other issues occurred.
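Since each key in the "urls" block ends with the method and response code in parentheses, requests that returned an error status can be listed automatically. The sketch below assumes the entry has already been parsed into a dictionary and that the keys follow the `<url> (<METHOD>,<status>)` pattern seen in the example; treating every status of 400 or higher as worth reviewing is a choice made for this illustration, and `problem_requests` is an illustrative name.

```python
import re
from typing import Any, Dict, List, Tuple

# Keys look like: "https://host/path (GET,403)"
URL_KEY_RE = re.compile(r"^(?P<url>.+) \((?P<method>[A-Z]+),(?P<status>\d{3})\)$")


def problem_requests(entry: Dict[str, Any]) -> List[Tuple[str, str, int, int]]:
    """Return (method, url, status, count) for every URL with a status of 400 or higher."""
    problems = []
    for key, count in entry.get("urls", {}).items():
        match = URL_KEY_RE.match(key)
        if match is None:
            continue  # key does not follow the expected pattern
        status = int(match.group("status"))
        if status >= 400:
            problems.append((match.group("method"), match.group("url"), status, count))
    return problems
```

Applied to the example above, this would report the 403 returned by the branch protection endpoint, which can then be checked against the permissions granted to the integration.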
Non-Critical Errors
The following errors can be benign and are normally considered false positives:
java.util.NoSuchElementException: No line found
at java.util.Scanner.nextLine(Scanner.java:1540)
at com.wss.bolt.AppRunner.start(AppRunner.java:79)
at com.wss.bolt.AppRunner.main(AppRunner.java:26)
This occurs at the beginning of every single controller container startup and can be completely ignored if it shows up immediately after the container checks.
Called SCM API [RESPONSE_CODE=404;RESPONSE_TIME=132;REQUEST_TYPE=GET;REQUEST_URL=https://customer.github.com/api/v3/repos/ORG_NAME/REPO_NAME/contents/.whitesource]
This can occur when the integration checks whether a repository is onboarded. The 404 response code means that a .whitesource file is not present; it does not necessarily mean that one is required. If a .whitesource file does exist in the repository and this message still occurs, further troubleshooting is required.
User is not allowed to perform this action
This can occur if the GitHub application is not installed on the repository. Unless the repository has in fact been onboarded, this should not be an issue.
What to Provide to Support
When opening tickets with Support, providing the appropriate information is essential to receiving a speedy resolution. Here are some examples of what to provide for different case types:
Case Type | What to Provide |
---|---|
Failed scans | Scanner and Unified Agent logs for the relevant CTX |
Scans not triggered and checkRuns are not created | Controller logs and repository name |
checkRun not getting updated | Controller logs and CTX |
“Oops something went wrong” message | Controller logs and CTX |
Issues are not created or updated | Controller logs, CTX, CVE and/or library name |