3.0. Package
What is a Python package?
A Python package is a structured collection of Python modules, which allows for a convenient way to organize and share code. Among the various formats a package can take, the wheel format (.whl) stands out. Wheels are a built package format that can significantly speed up the installation process for Python software, compared to distributing source code and requiring the user to build it themselves.
Why do you need to create a Python package?
Creating a Python package offers multiple benefits, particularly for developers looking to distribute their code effectively:
- As a Library: Packaging your code as a library enables you to share reusable components across different projects. This is common in the Python ecosystem, with examples like
numpy
,pandas
, andtensorflow
being shared as libraries. - As an Application: Packaging also plays a crucial role in deploying applications. It simplifies the distribution and installation process, ensuring your software can be easily executed on various systems, including web or mobile platforms.
Additionally, creating a package can enhance the maintainability of your code, enforce good coding practices by encouraging modular design, and facilitate version control and dependency management.
Which tool should you use to create a Python package?
The Python ecosystem provides several tools for packaging, each with its unique features and advantages. While the choice can seem overwhelming, as humorously depicted in the xkcd comic on Python environments, Poetry emerges as a standout option. Poetry simplifies dependency management and packaging, offering an intuitive interface for developers.
To get started with Poetry for packaging, you can use the following commands:
- Initiate a Poetry package:
poetry init
- Start developing the package:
poetry install
- Build a package with Poetry:
poetry build --format wheel
At the end of the build process, a .whl
file is generated in the dist
folder with the name and version of the project from pyproject.toml
.
For those seeking alternatives, tools like PDM, Hatch, and Pipenv offer different approaches to package management and development, each with its own set of features designed to cater to various needs within the Python community.
Do you recommend Conda for your AI/ML project?
Although Conda is a popular choice among data scientists for its ability to manage complex dependencies, it's important to be aware of its limitations. Challenges such as slow performance, a complex dependency resolver, and confusing channel management can hinder productivity. Moreover, Conda's integration with the Python ecosystem, especially with new standards like pyproject.toml
, is limited. For managing complex dependencies in AI/ML projects, Docker containers present a robust alternative, offering better isolation and compatibility across environments.
How can you install new dependencies with Poetry?
Please refer to this section of the course.
Which metadata should you provide to your Python package?
Including detailed metadata in your pyproject.toml
file is crucial for defining your package's identity and dependencies. This file should contain essential information such as the package name, version, authors, and dependencies. Here's an example that outlines the basic structure and content for your package's metadata:
# Example metadata for a Python package
[tool.poetry]
name = "bikes"
version = "1.0.0"
description = "Predict the number of bikes available."
repository = "https://github.com/yourusername/mlops-python-package"
documentation = "https://yourusername.github.io/mlops-python-package/"
authors = ["Your Name <youremail@example.com>"]
readme = "README.md"
license = "CC BY"
keywords = ["mlops", "python", "package"]
packages = [{ include = "bikes", from = "src" }]
[tool.poetry.dependencies]
python = "^3.10"
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
This information not only aids users in understanding what your package does but also facilitates its discovery and integration into other projects.
Where should you add the source code of your Python package?
For a clean and efficient project structure, placing your package's source code in a src
directory is recommended. This approach, known as the src
layout, separates your package's code from other project files, such as tests and documentation, reducing the risk of import clashes and making it easier to package and distribute your code.
Here's how you can set up this structure:
mkdir -p src/bikes
touch src/bikes/__init__.py
The presence of an __init__.py
file within a directory indicates to Python that this directory should be treated as a package, making it possible for other parts of your project or external projects to import its modules.
Should you publish your Python package? On which platform should you publish it?
Deciding whether to publish your Python package depends on your goals. If you aim to share your work with the broader community or need a convenient way to distribute your code across projects or teams, publishing is a great option. The Python Package Index (PyPI) is the primary repository for public Python packages, making it an ideal platform for reaching a wide audience.
For private packages or when sharing within a limited group or organization, platforms like AWS CodeArtifact or GCP Artifact Registry offer secure hosting and management of your packages.
To publish a package using Poetry, you can use the command:
poetry publish
This will upload your package to PyPI, making it available for installation via pip
by the Python community.