As part of the broader PyTorch community, Facebook AI and AWS engineers have partnered to develop new libraries targeted at large-scale elastic and fault-tolerant model training and high-performance PyTorch model deployment. These libraries enable the community to efficiently productionize AI models at scale and push the state of the art on model exploration as model architectures continue to increase in size and complexity. Today, we are sharing new details on these libraries.
TorchServe (experimental)
Available now, TorchServe is an easy-to-use, open source framework for deploying PyTorch models for high-performance inference. It is cloud and environment agnostic, and it includes features such as multi-model serving, logging, metrics for monitoring, and the creation of RESTful endpoints for application integration. With these features, TorchServe provides a clear path to deploying PyTorch models to production at scale. To get started, visit the AWS News blog.
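To make the RESTful endpoints concrete, here is a minimal sketch of a client that calls a TorchServe prediction endpoint using only the Python standard library. It assumes a TorchServe instance is already running with its inference API at the default local address (port 8080) and that a model has been registered; the helper names `prediction_url` and `predict` are ours for illustration and are not part of TorchServe.

```python
import urllib.request
from urllib.parse import quote

# Assumption: TorchServe's inference API is listening at its default local address.
DEFAULT_INFERENCE_URL = "http://localhost:8080"


def prediction_url(model_name, base_url=DEFAULT_INFERENCE_URL):
    """Build the prediction endpoint URL for a registered model."""
    # The model name is URL-encoded so that unusual characters stay in one path segment.
    return base_url.rstrip("/") + "/predictions/" + quote(model_name, safe="")


def predict(model_name, payload, base_url=DEFAULT_INFERENCE_URL, timeout=10.0):
    """POST raw bytes (for example, an encoded image) and return the raw response body."""
    request = urllib.request.Request(
        prediction_url(model_name, base_url),
        data=payload,
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

With a server running, a call might look like `predict("my_model", image_bytes)`, where `my_model` is a placeholder for whatever name the model was registered under. The format of the response body depends on the model's handler, so the sketch returns it undecoded.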
TorchElastic Integration with Kubernetes (experimental)
The integration of Kubernetes and TorchElastic allows PyTorch developers to train machine learning models on a cluster of compute nodes that can change dynamically without disrupting the training job. The built-in fault tolerance of TorchElastic allows training to continue even if nodes go down during the training process, whether because of server maintenance events, network issues, or the preemption of a server node (e.g., in the case of spot instances). TorchElastic provides the primitives and interfaces for developers to write a distributed PyTorch job that can run elastically on multiple machines, and the Kubernetes integration means developers do not have to manually manage the pods and services that TorchElastic training jobs require. This library is now available.
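Surviving node failures generally depends on the training script being able to pick up where it left off after a restart. The following is a minimal sketch of that checkpoint-and-resume pattern in plain Python. It does not use the TorchElastic API; the function names, the JSON checkpoint format, and the `fail_at_epoch` parameter (which simulates a node going down) are all assumptions made for illustration.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0, "weights": 0.0}


def save_checkpoint(state, path):
    """Write to a temp file, then rename, so a crash never leaves a partial checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def train(path, total_epochs, fail_at_epoch=None):
    """Run (or resume) training; optionally simulate a node failure at a given epoch."""
    state = load_checkpoint(path)
    # Start from the first epoch that has not been completed and saved.
    for epoch in range(state["epoch"], total_epochs):
        if fail_at_epoch is not None and epoch == fail_at_epoch:
            raise RuntimeError("simulated node failure")
        state["weights"] += 1.0  # stand-in for one epoch of real training
        state["epoch"] = epoch + 1
        save_checkpoint(state, path)
    return state
```

If a run is interrupted partway through, calling `train` again with the same checkpoint path continues from the last completed epoch instead of starting over. In a real job the state would hold model and optimizer parameters, and the checkpoint would live on storage that replacement nodes can reach.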
We are excited to share TorchServe and TorchElastic to enable the community to train and deploy models more flexibly and at scale. These libraries, which are included as part of the PyTorch 1.5 release, will be maintained by Facebook and AWS in partnership with the broader community. We look forward to continuing to serve the PyTorch open source community with new capabilities.