Why I hate proxies and logging is important - Tue, Jun 18, 2024
As part of my work we were experimenting with kubeflow, especially kubeflow pipelines, to find out how we could automate our machine learning workflows. For that reason I installed kubeflow on a local kind cluster, following the installation guide, in order to run some basic examples.
The problem
Since I was behind a corporate proxy and I already had some painful experiences with proxies in the past, I was surprised that the kind and kubeflow installation went smoothly. However, when I tried to run the hello world example I got the following error from one of the Pods that make up the kubeflow pipeline:
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7f40106e1e50>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /simple/kfp/
ERROR: Could not find a version that satisfies the requirement kfp==2.7.0 (from versions: none)
ERROR: No matching distribution found for kfp==2.7.0
The cause of the problem was easy to find in the pip output above: the Pod could not reach the internet to download the kfp package, because direct internet access was not possible and all traffic had to go through the proxy.
My (imperfect) solution
Luckily, the kubeflow pipelines SDK offers a way to set environment variables on a pipeline task:
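from kfp import dsl

@dsl.pipeline
def hello_pipeline():
    # http_proxy and https_proxy hold the proxy URLs; say_hello is the component from the hello world example
    say_hello().set_env_variable('http_proxy', http_proxy).set_env_variable('https_proxy', https_proxy)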
This solved the pip installation problem, but when the pipeline was executed I got another error in the same Pod:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x944f7d]
goroutine 1 [running]:
github.com/kubeflow/pipelines/backend/src/v2/metadata.(*Client).PublishExecution(0xc0006f7cb0, {0x2315528, 0x32a80a0}, 0x0, 0x0, {0x0, 0x0, 0x22f63a0?}, 0x4)
/go/src/github.com/kubeflow/pipelines/backend/src/v2/metadata/client.go:446 +0x5d
github.com/kubeflow/pipelines/backend/src/v2/component.(*LauncherV2).publish(0xc000655ae0?, {0x2315528?, 0x32a80a0?}, 0x1e3e344?, 0x31?, {0x0?, 0x1?, 0x1?}, 0x655b38?)
/go/src/github.com/kubeflow/pipelines/backend/src/v2/component/launcher_v2.go:271 +0x90
github.com/kubeflow/pipelines/backend/src/v2/component.(*LauncherV2).Execute.func2()
/go/src/github.com/kubeflow/pipelines/backend/src/v2/component/launcher_v2.go:144 +0x5d
github.com/kubeflow/pipelines/backend/src/v2/component.(*LauncherV2).Execute(0xc000814000, {0x2315528?, 0x32a80a0})
/go/src/github.com/kubeflow/pipelines/backend/src/v2/component/launcher_v2.go:156 +0x9ee
main.run()
/go/src/github.com/kubeflow/pipelines/backend/src/v2/cmd/launcher-v2/main.go:98 +0x3ec
main.main()
/go/src/github.com/kubeflow/pipelines/backend/src/v2/cmd/launcher-v2/main.go:47 +0x13
time="2024-06-18T06:24:33.145Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 2
So this was a plain segmentation violation without much context information, which is exactly why I think that expressive logging really is important.
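To illustrate what "expressive" could mean here, the following is a minimal sketch in Python, not the actual launcher code (which is written in Go). The zero arguments in the trace suggest that an empty value reached PublishExecution, so the sketch checks for that case and reports the target and the proxy settings before failing. The names publish_execution and PublishError and the target address are hypothetical.

```python
import logging
import os

logger = logging.getLogger("launcher_sketch")


class PublishError(RuntimeError):
    """Raised when a result cannot be published (hypothetical error type)."""


def publish_execution(execution, target):
    """Hypothetical publish step that fails with context instead of crashing."""
    if execution is None:
        # Collect the settings that are most likely to matter for a connection problem.
        proxy_vars = {
            name: os.environ.get(name)
            for name in ("http_proxy", "https_proxy", "no_proxy")
        }
        message = (
            f"cannot publish: no execution record was created; "
            f"target={target!r}, proxy environment={proxy_vars}"
        )
        logger.error(message)
        raise PublishError(message)
    logger.info("publishing execution %r to %s", execution, target)
    return f"published {execution} to {target}"


# The address is an assumption for illustration only.
try:
    publish_execution(None, "metadata-grpc-service:8080")
except PublishError as error:
    print(error)
```

An error message like this would have pointed at the proxy settings right away instead of requiring a debugging session.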
Following my own advice, I cloned the repository and started looking at the source code to find out what could have caused the problem. After a long and cumbersome search that involved creating a debug Pod, analyzing the code and repeating the whole exercise in a proxy-free environment, I found out what the problem was:
After the pipeline was run, the metadata client tried to publish the result using the internal cluster service metadata-grpc-service.
Since I set the http_proxy and https_proxy variables but forgot to set the no_proxy variable (hence my imperfect solution), the metadata client tried to connect to the metadata-grpc-service via the proxy, which of course did not work.
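The effect of a missing no_proxy value can be checked with a few lines of Python. This is only a sketch: it uses the bypass rules of Python's urllib, not those of the Go launcher, which may differ in details. The proxy URL, the kubeflow namespace, the port and the no_proxy entries are assumptions for illustration.

```python
from urllib.request import proxy_bypass_environment

# Assumed proxy settings; the proxy URL is a placeholder.
proxies_without_no_proxy = {
    "http": "http://proxy.example.com:3128",
    "https": "http://proxy.example.com:3128",
}
# Same settings plus a no_proxy list that covers cluster-internal names.
proxies_with_no_proxy = dict(
    proxies_without_no_proxy,
    no=".svc.cluster.local,localhost,127.0.0.1",
)

# Assumed fully qualified name and port of the internal service.
host = "metadata-grpc-service.kubeflow.svc.cluster.local:8080"


def goes_through_proxy(host, proxies):
    """Return True if a request to host would be sent to the proxy."""
    return not proxy_bypass_environment(host, proxies)


print(goes_through_proxy(host, proxies_without_no_proxy))  # True
print(goes_through_proxy(host, proxies_with_no_proxy))     # False
```

Without a no_proxy entry every host, including cluster-internal services, is routed to the proxy, which usually cannot resolve or reach them.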
After adding the no_proxy variable with the appropriate value, the pipeline ran successfully.
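A minimal sketch of how all three variables could be set together is shown below. The helpers proxy_env and apply_env, the FakeTask stand-in, the proxy URL and the no_proxy entries are assumptions for illustration; the right no_proxy value depends on the cluster. The only SDK call relied on is set_env_variable, as used in the pipeline above.

```python
from typing import Dict, Iterable


def proxy_env(http_proxy: str, https_proxy: str,
              no_proxy_hosts: Iterable[str]) -> Dict[str, str]:
    """Build the three proxy variables so that no_proxy cannot be forgotten."""
    return {
        "http_proxy": http_proxy,
        "https_proxy": https_proxy,
        "no_proxy": ",".join(no_proxy_hosts),
    }


def apply_env(task, env: Dict[str, str]):
    """Set every variable on a task that offers set_env_variable(name, value)."""
    for name, value in env.items():
        task.set_env_variable(name, value)
    return task


class FakeTask:
    """Stand-in for a pipeline task so the sketch runs without a cluster."""

    def __init__(self):
        self.env = {}

    def set_env_variable(self, name, value):
        self.env[name] = value
        return self


# Placeholder proxy URL and assumed cluster-internal names.
env = proxy_env(
    "http://proxy.example.com:3128",
    "http://proxy.example.com:3128",
    ["metadata-grpc-service", ".svc.cluster.local", "localhost", "127.0.0.1"],
)
task = apply_env(FakeTask(), env)
print(task.env["no_proxy"])
```

Inside the pipeline function, apply_env(say_hello(), env) would take the place of the chained set_env_variable calls.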
Conclusion
The obvious conclusion is that proxies can make your life pretty miserable. Also, people make mistakes, and so do I. But the real conclusion is that expressive and extensive logging is important: a segmentation violation with no context information is not an ideal starting point for a root cause analysis. Besides that, the rule that you should look at the source code once again proved to be true.