Dependency graphs and pip: a difficult set of relationships
A dive into how component-detection tooling analyzes your project's Python environment
Overview
Across build environments in every company, you're bound to find traces of Python code and tooling built by engineers, data scientists, and analysts alike. With Python offering excellent flexibility while remaining easy to use, it's no surprise that it sits at #2 in GitHub's top languages ranking[1]. But a thriving ecosystem brings its own set of challenges for organizations looking to understand which dependencies they're bringing along for the ride.
A personal introduction
I've been working on dependency analysis and open source vulnerability remediation tooling for the past three years, since I joined the Open Source Engineering team at Microsoft. I'm now the primary maintainer of the microsoft/component-detection tooling, which is used at scale to perform build-time dependency analysis across nearly every build at Microsoft, powering things like automated vulnerability reporting and SBOM generation. I'm also a big fan of Google's deps.dev and osv.dev, which can help you perform a similar analysis on your code bases. All of these projects are open source and looking for contributors, so if you're interested in the space we'd love to have your help!
Package manifests
If you're not familiar with the phrase, a package manifest is (ideally) a file generated by your language's build tooling which contains the list of packages you're using in the project. It usually contains a full list of explicitly requested components and their dependencies, such as npm's package-lock.json, but can also take a more limited form such as a requirements.txt file, where only a few top-level packages need to be defined. As an example, the following requirements.txt contains only one package, jupyterlab, but actually pulls in almost 90 packages during pip install!
jupyterlab == 4.2.1
Let’s take a quick look at just a few of the different ways you’re able to specify which open source packages you want to use in the Python project you’re working on:
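For illustration (a representative snippet, not taken from any particular project), the same jupyterlab pin from the requirements.txt above could instead live in a PEP 621 pyproject.toml, which modern build backends read directly:

```toml
# pyproject.toml - PEP 621 project metadata
[project]
name = "example-project"   # hypothetical project name
version = "0.1.0"
dependencies = [
    "jupyterlab == 4.2.1",
]
```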
Even more options
- Pipfile.lock (Pipenv)
- Hatch's dependency configuration: https://hatch.pypa.io/1.7/config/dependency/
- uv: https://astral.sh/blog/uv
Approaching Python dependencies the component-detection way
Original implementation
The PythonCommandService is the main driver of dependency parsing from setup.py and requirements.txt files in the original implementation. It performs a basic text parse of each file, splitting out the list of dependency specifiers provided by the build environment. However, as noted above, this only yields top-level dependencies (unless the file was automatically generated by tooling such as pip-compile).
So how do you actually build an accurate dependency graph from so little information? The approach taken was a brute-force one: we would generate the graph ourselves by downloading the associated .whl and metadata files from the PyPI index and performing a "best guess" approximation of which packages were going to be fetched. If you're interested in the specifics, you can peek through PythonResolver to get a better idea of how this is done.
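To give a flavor of the idea, here is a heavily simplified sketch (not the actual PythonResolver implementation): walk the requirements breadth-first, query the PyPI JSON API for each package, and pick the highest release satisfying the first specifier encountered. Environment markers and extras are ignored here, which is exactly the kind of corner the real resolver has to handle.

```python
# A simplified "best guess" resolver sketch, NOT the real PythonResolver.
import requests
from packaging.requirements import Requirement
from packaging.version import InvalidVersion, Version


def best_guess_resolve(roots: list[str]) -> dict[str, str]:
    resolved: dict[str, str] = {}
    queue = [Requirement(r) for r in roots]
    while queue:
        req = queue.pop(0)
        if req.name in resolved:
            continue  # first-seen specifier wins in this naive model
        data = requests.get(f"https://pypi.org/pypi/{req.name}/json").json()
        candidates = []
        for release in data.get("releases", {}):
            try:
                version = Version(release)
            except InvalidVersion:
                continue
            if version in req.specifier:
                candidates.append(version)
        if not candidates:
            continue  # resolution failed; the real code warns and guesses
        resolved[req.name] = str(max(candidates))
        # Caveat: the JSON API only exposes requires_dist for the *latest*
        # release, one of many reasons this approach stays approximate.
        for dep in data["info"].get("requires_dist") or []:
            queue.append(Requirement(dep))
    return resolved
```

Run against a real project, the resolver's PyPI traffic and its failure modes show up in the build logs like this: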
Getting Python data from "https://pypi.org/pypi/dill/json"
Getting Python data from "https://pypi.org/pypi/colorama/json"
Getting Python data from "https://files.pythonhosted.org/packages/42/3f/669429ce58de2c22d8d2c542752e137ec4b9885fff398d3eceb1a7f5acb4/pep8-1.7.1-py2.py3-none-any.whl"
Getting Python data from "https://files.pythonhosted.org/packages/6a/85/1116939099463333051082c2af435abd2facd4559cc335a4a2d595ae8ce5/prospector-1.10.3-py3-none-any.whl"
Getting Python data from "https://pypi.org/pypi/GitPython/json"
Getting Python data from "https://pypi.org/pypi/PyYAML/json"
Getting Python data from "https://pypi.org/pypi/dodgy/json"
##[warning]Candidate version (flake8 7.1.0 - pip) for flake8 already exists in map and the version is NOT valid.
##[warning]Specifiers: <6.0.0 for package prospector caused this.
Getting Python data from "https://files.pythonhosted.org/packages/cf/a0/b881b63a17a59d9d07f5c0cc91a29182c8e8a9aa2bde5b3b2b16519c02f4/flake8-5.0.4-py2.py3-none-any.whl"
##[warning]Version Resolution for flake8 failed, assuming last valid version is used.
Getting Python data from "https://pypi.org/pypi/pep8-naming/json"
##[warning]Candidate version (pyflakes 3.2.0 - pip) for pyflakes already exists in map and the version is NOT valid.
##[warning]Specifiers: >=2.2.0,<3 for package prospector caused this.
##[warning]Version Resolution for pyflakes failed, assuming last valid version is used.
Doing so required an approximate rewrite of pip's dependency resolution logic, which supports a myriad of edge cases: dependency versions that vary by Python version or host OS, forcing different resolution algorithms, and more.
This strategy was difficult to maintain and remains a major source of bug reports in the component-detection repository: of the open issues, approximately 15-20% relate to our original Python parsing implementation. As environments become more complex and additional security requirements such as internal artifact feed configuration are introduced, doing detection this way just doesn't cut it, and it was time to look for an alternative.
The latest and greatest: pip installation reports
In v22.2 of the pip tooling, a new feature was added that changed the game for package detection: the installation report. Most developers can quietly ignore this new option, but for anyone building SBOM tools, it comes much closer to the standard of package manifests generated by other package managers such as npm and Yarn.
So, how does it work? First, let's take a quick look at what these files consist of and how to generate them. The new format is documented at Installation Report - pip documentation, but the real gold nugget we're looking for is the array of InstallationReportItem objects in the report's install field, which contains the necessary package metadata.
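Producing a report is a single pip invocation. For example, assuming a requirements.txt in the working directory, the following resolves everything and writes the report without installing a single package:

```
python -m pip install -r requirements.txt --dry-run --ignore-installed --report report.json
```

The resulting report.json looks something like this: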
{
"version": "1",
"pip_version": "22.2",
"install": [
{
"download_info": {
"url": "https://files.pythonhosted.org/packages/a4/0c/fbaa7319dcb5eecd3484686eb5a5c5702a6445adb566f01aee6de3369bc4/pydantic-1.9.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"archive_info": {
"hashes": {
"sha256": "18f3e912f9ad1bdec27fb06b8198a2ccc32f201e24174cec1b3424dda605a310"
}
}
},
"is_direct": false,
"is_yanked": false,
"requested": true,
"metadata": {
"name": "pydantic",
"version": "1.9.1",
"requires_dist": [
"typing-extensions (>=3.7.4.3)",
"dataclasses (>=0.6) ; python_version < \"3.7\"",
"python-dotenv (>=0.10.4) ; extra == 'dotenv'",
"email-validator (>=1.0.3) ; extra == 'email'"
],
"requires_python": ">=3.6.1",
"provides_extra": [
"dotenv",
"email"
]
}
},
{
"download_info": {
"url": "https://github.com/pypa/packaging",
"vcs_info": {
"vcs": "git",
"requested_revision": "main",
"commit_id": "4f42225e91a0be634625c09e84dd29ea82b85e27"
}
},
"is_direct": true,
"is_yanked": false,
"requested": true,
"metadata": {
"name": "packaging",
"version": "21.4.dev0",
"requires_dist": [
"pyparsing (!=3.0.5,>=2.0.2)"
],
"requires_python": ">=3.7"
}
},
{
"download_info": {
"url": "https://files.pythonhosted.org/packages/75/e1/932e06004039dd670c9d5e1df0cd606bf46e29a28e65d5bb28e894ea29c9/typing_extensions-4.2.0-py3-none-any.whl",
"archive_info": {
"hashes": {
"sha256": "6657594ee297170d19f67d55c05852a874e7eb634f4f753dbd667855e07c1708"
}
}
},
"is_direct": false,
"requested": false,
"metadata": {
"name": "typing_extensions",
"version": "4.2.0",
"requires_python": ">=3.7"
}
},
... [truncated for readability]
],
"environment": {
... [truncated for readability]
}
}
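Once the report exists, reconstructing the dependency graph becomes a straightforward parsing exercise. Here's a minimal sketch (not component-detection's actual implementation) that turns the install array into an adjacency list keyed by package name:

```python
# A minimal sketch: build a name -> dependencies mapping from a pip
# installation report. Not the actual component-detection implementation.
import json

from packaging.requirements import Requirement
from packaging.utils import canonicalize_name


def graph_from_report(path: str) -> dict[str, list[str]]:
    with open(path) as f:
        report = json.load(f)

    items = report["install"]
    # every package pip resolved for this environment
    installed = {canonicalize_name(i["metadata"]["name"]) for i in items}

    graph: dict[str, list[str]] = {}
    for item in items:
        meta = item["metadata"]
        deps = []
        for spec in meta.get("requires_dist") or []:
            dep = canonicalize_name(Requirement(spec).name)
            # Keep only edges to packages actually present in the report;
            # unused extras and unmatched environment markers drop out.
            if dep in installed:
                deps.append(dep)
        graph[canonicalize_name(meta["name"])] = deps
    return graph


graph = graph_from_report("report.json")
```

Helpfully, the requested flag on each item distinguishes explicitly requested packages from transitive ones, so tooling can mark direct versus indirect dependencies without any guesswork.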
Pitfalls and limitations
As with any problem, each approach has tradeoffs that must be considered. By using pip installation reports we avoid several of the largest Python pitfalls commonly seen in basic SBOM tooling, where only direct dependencies are returned due to the limited package manifests, but that comes with its own problems.
The primary concern for developers adding this kind of tooling to their CI builds is that Python needs to be installed on the system with a recent version of the pip tooling. That may seem like common sense, but remember that Python 2 reached end of life in 2020, and in March 2022 there were still almost 2.5 million downloads of Python 2.7[2]. If you're coming at this from an organizational standpoint, it may be time to discuss requiring that Python versions be kept consistently updated, but this introduces its own challenges. From the component-detection standpoint, we will continue to support the versions of pip (>= 22.2) where installation reports can be generated.
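If you're wiring this into automation yourself, it's worth guarding on pip's version up front. A minimal sketch, assuming pip is installed in the environment being inspected (the 22.2 floor reflects when --report was introduced):

```python
# A minimal guard: skip report-based detection when the environment's
# pip predates the installation report feature.
from importlib.metadata import version

from packaging.version import Version


def supports_installation_report() -> bool:
    return Version(version("pip")) >= Version("22.2")
```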
A secondary concern is the amount of time taken to generate the installation report. SBOM tooling should not drastically increase CI times; otherwise the negatives start to outweigh the benefits in high-velocity environments.
The pip tooling must be told to avoid looking at any already-installed dependencies using the --ignore-installed flag (otherwise the install field will be empty), meaning the command will contact PyPI (or your configured feed) for each dependency to grab its requirements, almost like running the full pip install. It doesn't take quite as long, since nothing is written to disk, but as of June 2024 there is still an open discussion, pip #825: Parallel downloads, trying to get parallelization implemented. It's not a straightforward contribution (see examples like https://github.com/pypa/pip/issues/8187), but doing so would certainly decrease run times for CIs everywhere. Until it's officially supported, component-detection remains at the mercy of how complex your project is.
Potential tooling improvements
Updating the pip installation report format
With v23.0 of the pip tooling, the installation report format was declared stable and given its initial release, version 1, but that doesn't mean there can't be changes. The format itself is only two major pip versions old, and there are bound to be updates over time as the ecosystem evolves.
But what are the top-of-mind improvements that could make dependency graph creation even more robust? One source of additional data could be a field such as `required_by`, containing the list of packages that bring in the specified dependency. This would be a small improvement, letting consumers parse the graph only once, though the time taken here is trivial compared to dependency resolution.
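Purely as a hypothetical illustration (no such field exists in version 1 of the format), a report item could then carry its reverse edges directly, using packages from the earlier logs:

```json
{
  "metadata": {
    "name": "pyflakes",
    "version": "3.2.0"
  },
  "required_by": ["flake8", "prospector"]
}
```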
Have an idea of your own? Feel free to share it in the comments.
Introducing pip and PyPI functionality to avoid large file downloads
Building new functionality into a package manager used as extensively as pip is no small feat, but it is being developed by pip contributors. The open PR perform 1-3 HTTP requests for each wheel using fast-deps #12208 discusses a potential improvement that should decrease dependency resolution times by ~50%.
In the issue Investigate usage of --use-feature=fast-deps in pip install --report #1182, I discuss how the tooling could leverage the pip maintainers' work since the introduction of PEP 658 – Serve Distribution Metadata in the Simple Repository API to bring decreased detection times. Pip maintainer time is extremely limited, and it may be some time before this work is fully completed.
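For the curious, the experimental flag can already be combined with report generation today, with the usual caveat that opt-in features can change between pip releases:

```
python -m pip install -r requirements.txt --dry-run --ignore-installed \
    --use-feature=fast-deps --report report.json
```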
Wrapping up
Python is a complicated tooling ecosystem, and there have been great strides in recent years to make the tooling more accessible to package dependency analyzers. There's more to be done in the space, and component-detection will continue to adopt the latest improvements.