Amazon EC2 P3 Instances

Accelerate machine learning and high performance computing applications with powerful GPUs

Amazon EC2 P3 instances deliver high performance compute in the cloud with up to 8 NVIDIA® V100 Tensor Core GPUs and up to 100 Gbps of networking throughput for machine learning and HPC applications. These instances deliver up to one petaflop of mixed-precision performance per instance to significantly accelerate machine learning and high performance computing applications. Amazon EC2 P3 instances have been proven to reduce machine learning training times from days to minutes, as well as increase the number of simulations completed for high performance computing by 3-4x.

With up to 4x the network bandwidth of P3.16xlarge instances, Amazon EC2 P3dn.24xlarge instances are the latest addition to the P3 family, optimized for distributed machine learning and HPC applications. These instances provide up to 100 Gbps of networking throughput, 96 custom Intel® Xeon® Scalable (Skylake) vCPUs, 8 NVIDIA® V100 Tensor Core GPUs with 32 GiB of memory each, and 1.8 TB of local NVMe-based SSD storage. P3dn.24xlarge instances also support Elastic Fabric Adapter (EFA), which accelerates distributed machine learning applications that use the NVIDIA Collective Communications Library (NCCL). EFA can scale to thousands of GPUs, significantly improving the throughput and scalability of deep learning training and leading to faster results.


Benefits

Reduce machine learning training time from days to minutes

For data scientists, researchers, and developers who need to speed up ML applications, Amazon EC2 P3 instances are the fastest in the cloud for ML training. Amazon EC2 P3 instances feature up to eight latest-generation NVIDIA V100 Tensor Core GPUs and deliver up to one petaflop of mixed-precision performance to significantly accelerate ML workloads. Faster model training can enable data scientists and machine learning engineers to iterate faster, train more models, and increase accuracy.
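The one-petaflop figure can be sanity-checked with simple arithmetic. The sketch below assumes NVIDIA's published peak of 125 teraflops of mixed-precision Tensor Core throughput per V100 (SXM2). That per-GPU number is not stated on this page, and sustained training throughput is typically lower than peak.

```python
# Assumption: 125 TFLOPS is NVIDIA's published peak mixed-precision
# Tensor Core throughput for a single V100 (SXM2). It is not from this page.
V100_PEAK_TFLOPS = 125.0


def peak_petaflops(gpu_count: int, tflops_per_gpu: float = V100_PEAK_TFLOPS) -> float:
    """Return the aggregate peak throughput, in petaflops, of gpu_count GPUs."""
    if gpu_count < 0:
        raise ValueError("gpu_count must be non-negative")
    return gpu_count * tflops_per_gpu / 1000.0


# GPU counts offered across the P3 instance sizes.
for gpus in (1, 4, 8):
    print(f"{gpus} GPU(s): {peak_petaflops(gpus):.3f} petaflops peak")
```

Under that assumption, eight GPUs reach the quoted one petaflop, while the 1-GPU and 4-GPU sizes scale down proportionally.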

The industry's most cost-effective solution for ML training

One of the most powerful GPU instances in the cloud, combined with flexible pricing plans, results in an exceptionally cost-effective solution for machine learning training. As with Amazon EC2 instances in general, P3 instances are available as On-Demand Instances, Reserved Instances, or Spot Instances. Spot Instances take advantage of unused EC2 instance capacity and can lower your Amazon EC2 costs by up to 70% compared with On-Demand prices.

Flexible, powerful, high performance computing

Unlike on-premises systems, running high performance computing on Amazon EC2 P3 instances offers virtually unlimited capacity to scale out your infrastructure, and the flexibility to change resources easily and as often as your workload demands. You can configure your resources to meet the demands of your application and launch an HPC cluster in minutes, paying for only what you use.

Start building immediately

Use pre-packaged Docker images to deploy deep learning environments in minutes. The images contain the required deep learning framework libraries (currently TensorFlow and Apache MXNet) and tools and are fully tested. You can easily add your own libraries and tools on top of these images for a higher degree of control over monitoring, compliance, and data processing. In addition, Amazon EC2 P3 instances work seamlessly together with Amazon SageMaker to provide a powerful and intuitive complete machine learning platform. Amazon SageMaker is a fully-managed machine learning platform that enables you to quickly and easily build, train, and deploy machine learning models. Furthermore, Amazon EC2 P3 instances can be integrated with AWS Deep Learning Amazon Machine Images (AMIs) that are pre-installed with popular deep learning frameworks. This makes it faster and easier to get started with machine learning training and inference.

Scalable multi-node machine learning training

You can use multiple Amazon EC2 P3 instances with up to 100 Gbps of networking throughput to rapidly train machine learning models. Higher networking throughput enables developers to remove data transfer bottlenecks and efficiently scale out their model training jobs across multiple P3 instances. Customers have been able to train ResNet-50, a common image classification model, to industry standard accuracy in just 18 minutes using 16 P3 instances. This level of performance was previously unattainable by the vast majority of ML customers as it required a large CapEx investment to build out on-premises GPU clusters. With P3 instances and their availability via an On-Demand usage model, this level of performance is now accessible to all developers and machine learning engineers. In addition, P3dn.24xlarge instances support Elastic Fabric Adapter (EFA), which enables ML applications that use the NVIDIA Collective Communications Library (NCCL) to scale to thousands of GPUs.

Support for all major machine learning frameworks

Amazon EC2 P3 instances support all major machine learning frameworks including TensorFlow, PyTorch, Apache MXNet, Caffe, Caffe2, Microsoft Cognitive Toolkit (CNTK), Chainer, Theano, Keras, Gluon, and Torch. You have the flexibility to choose the framework that works best for your application.

Customer stories

Airbnb

Airbnb is using machine learning to optimize search recommendations and improve dynamic pricing guidance for hosts, both of which translate to increased booking conversions. With Amazon EC2 P3 instances, Airbnb can run training workloads faster, go through more iterations, build better machine learning models and reduce costs.

Celgene

Celgene is a global biotechnology company that is developing targeted therapies that match treatment with the patient. The company runs its HPC workloads for next-generation genomic sequencing and chemical simulations on Amazon EC2 P3 instances. With this compute power, Celgene can train deep learning models to distinguish between malignant cells and benign cells. Before using P3 instances, it took two months to run large-scale computational jobs; now it takes just four hours. AWS technology has enabled Celgene to accelerate development of drug therapies for cancer and inflammatory diseases.

Hyperconnect specializes in applying new technologies based on machine learning to image and video processing and was the first company to develop webRTC for mobile platforms.

“Hyperconnect uses AI-based image classification on its video communication app to recognize the current environment wherein a user is situated. We reduced our ML model training time from more than a week to less than a day by migrating from on-premises workstations to multiple Amazon EC2 P3 instances using Horovod. By using PyTorch as our machine learning framework, we were able to quickly develop models and leverage the libraries available in the open source community.”

Sungjoo Ha, Director of AI Lab - Hyperconnect


NerdWallet is a personal finance startup that provides tools and advice that make it easy for customers to pay off debt, choose the best financial products and services, and tackle major life goals like buying a house or saving for retirement. The company relies heavily on data science and machine learning (ML) to connect customers with personalized financial products.

The use of Amazon SageMaker and Amazon EC2 P3 instances with NVIDIA V100 Tensor Core GPUs has improved NerdWallet’s flexibility and performance and has reduced the time required for data scientists to train ML models. “It used to take us months to launch and iterate on models; now it only takes days.”

Ryan Kirkman, Senior Engineering Manager - NerdWallet


A leader in quality systems solutions, Aon’s PathWise is a cloud-based SaaS application suite geared toward enterprise risk-management modeling that delivers speed, reliability, security, and on-demand service to an array of customers.

“Aon’s PathWise Solutions Group provides a risk management solution that enables our customers to leverage the latest technology to rapidly solve today’s key insurance challenges such as managing and testing hedge strategies, regulatory and economic forecasting, and budgeting. PathWise has been running on AWS in production since 2011, and today uses Amazon EC2 P-Series instances to accelerate the computations needed to solve these challenges for our customers all over the world in an ever advancing and evolving market.”

Van Beach, Global Head of Life Solutions, Aon Pathwise Strategy and Technology Group


Pinterest

Pinterest uses mixed-precision training on P3 instances on AWS to speed up training of deep learning models, and also uses these instances for faster inference with these models, to enable a fast and unique discovery experience for users. Pinterest uses PinSage, built with PyTorch on AWS. This AI model groups images together based on certain themes. With 3 billion images on the platform, there are 18 billion different associations that connect images. These associations help Pinterest contextualize themes and styles and produce more personalized user experiences.

Salesforce

Salesforce is using machine learning to power Einstein Vision, enabling developers to harness the power of image recognition for use cases such as visual search, brand detection, and product identification. Amazon EC2 P3 instances enable developers to train deep learning models much faster so that they can achieve their machine learning goals quickly.

Schrodinger

Schrodinger uses high performance computing (HPC) to develop predictive models to extend the scale of discovery and optimization and give their customers the ability to bring lifesaving drugs to market more quickly. Amazon EC2 P3 instances allow Schrodinger to perform four times as many simulations in a day as they could with P2 instances.

Subtle Medical is a healthcare technology company working to improve medical imaging efficiency and patient experience with innovative deep-learning solutions. Its team is made up of renowned imaging scientists, radiologists, and AI experts from Stanford, MIT, MD Anderson, and more.

“Hospitals and imaging centers want to adopt this solution without burdening their IT departments to acquire GPU expertise and build and maintain costly data centers or mini-clouds. They want to be successful with their deployments with the least amount of effort and investment… AWS makes this possible.”

Enhao Gong, Founder and CEO - Subtle Medical


Western Digital

Western Digital uses HPC to run tens of thousands of simulations for materials sciences, heat flows, magnetics and data transfer to improve disk drive and storage solution performance and quality. Based on early testing, P3 instances allow engineering teams to run simulations at least three times faster than previously deployed solutions.  

Amazon EC2 P3 instances & Amazon SageMaker

The fastest way to train and run machine learning models

Amazon SageMaker is a fully-managed service for building, training, and deploying machine learning models. When used together with Amazon EC2 P3 instances, customers can easily scale to tens, hundreds, or thousands of GPUs to train a model quickly at any scale without worrying about setting up clusters and data pipelines. You can also easily access Amazon Virtual Private Cloud (Amazon VPC) resources for training and hosting workflows in Amazon SageMaker. With this feature, you can use Amazon Simple Storage Service (Amazon S3) buckets that are only accessible through your VPC to store training data, as well as to store and host the model artifacts derived from the training process. In addition to S3, models can access all other AWS resources contained within the VPC.

Build

Amazon SageMaker makes it easy to build machine learning models and get them ready for training. It provides everything that you need to quickly connect to your training data, and to select and optimize the best algorithm and framework for your application. Amazon SageMaker includes hosted Jupyter notebooks that make it easy to explore and visualize your training data stored in Amazon S3.  You can also use the notebook instance to write code to create model training jobs, deploy models to Amazon SageMaker hosting, and test or validate your models.

Train

You can begin training your model with a single click in the console or with an API call. Amazon SageMaker is pre-configured with the latest versions of TensorFlow and Apache MXNet, and with CUDA 9 library support for optimal performance with NVIDIA GPUs. In addition, hyperparameter optimization can automatically tune your model by intelligently adjusting different combinations of model parameters to quickly arrive at the most accurate predictions. For larger scale needs, you can scale to tens of instances to support faster model building.
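As a rough illustration of starting training "with an API call", the sketch below assembles the arguments for SageMaker's CreateTrainingJob API and targets a P3-backed instance type. The job name, image URI, role ARN, and S3 path are hypothetical placeholders, and the helper function itself is an illustrative assumption rather than part of any AWS SDK. The block only builds the request; submitting it requires boto3 and AWS credentials.

```python
from typing import Any, Dict


def build_training_job_request(
    job_name: str,
    image_uri: str,
    role_arn: str,
    output_s3_uri: str,
    instance_type: str = "ml.p3.16xlarge",
    instance_count: int = 1,
    volume_size_gb: int = 100,
    max_runtime_seconds: int = 3600,
) -> Dict[str, Any]:
    """Assemble keyword arguments for the SageMaker CreateTrainingJob API.

    This helper is hypothetical; only the dictionary keys come from the API.
    """
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": volume_size_gb,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_seconds},
    }


# All values below are placeholders for illustration only.
request = build_training_job_request(
    job_name="example-p3-training-job",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/example-training:latest",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    output_s3_uri="s3://example-bucket/model-artifacts/",
    instance_count=2,
)
print(request["ResourceConfig"])

# To submit the job (needs boto3 and valid AWS credentials):
#   import boto3
#   boto3.client("sagemaker").create_training_job(**request)
```

Keeping the request construction separate from the network call makes the configuration easy to inspect or test before any billable resources are launched.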

Deploy

After training, you can deploy your model with one click on auto-scaling Amazon EC2 instances across multiple Availability Zones. In production, Amazon SageMaker manages the compute infrastructure on your behalf to perform health checks, apply security patches, and conduct other routine maintenance, all with built-in Amazon CloudWatch monitoring and logging.

 

Amazon EC2 P3 instances & AWS Deep Learning AMIs

Pre-configured development environments to quickly start building deep learning applications

An alternative to Amazon SageMaker for developers who have more customized requirements, the AWS Deep Learning AMIs provide machine learning practitioners and researchers with the infrastructure and tools to accelerate deep learning in the cloud, at any scale. You can quickly launch Amazon EC2 P3 instances pre-installed with popular deep learning frameworks such as TensorFlow, PyTorch, Apache MXNet, Microsoft Cognitive Toolkit, Caffe, Caffe2, Theano, Torch, Chainer, Gluon, and Keras to train sophisticated, custom AI models, experiment with new algorithms, or learn new skills and techniques.

Amazon EC2 P3 instances & high performance computing

Solve large computational problems and gain new insights using the power of HPC on AWS

Amazon EC2 P3 instances are an ideal platform to run engineering simulations, computational finance, seismic analysis, molecular modeling, genomics, rendering, and other GPU compute workloads. High performance computing (HPC) allows scientists and engineers to solve these complex, compute-intensive problems. HPC applications often require high network performance, fast storage, large amounts of memory, high compute capabilities, or all of the above. AWS enables you to increase the speed of research and reduce time-to-results by running HPC in the cloud and scaling to larger numbers of parallel tasks than would be practical in most on-premises environments. For example, P3dn.24xlarge instances support Elastic Fabric Adapter (EFA), which enables HPC applications using the Message Passing Interface (MPI) to scale to thousands of GPUs. AWS helps to reduce costs by providing solutions optimized for specific applications, without the need for large capital investments.

Support for NVIDIA RTX Virtual Workstation

NVIDIA RTX Virtual Workstation AMIs deliver high graphics performance using powerful P3 instances with NVIDIA Volta V100 GPUs running in the AWS cloud. These AMIs have the latest NVIDIA GPU graphics software preinstalled along with the latest RTX drivers and NVIDIA ISV certifications with support for up to four 4K desktop resolutions. P3 instances with NVIDIA V100 GPUs combined with RTX vWS deliver a high performance workstation in the cloud with up to 32 GiB of GPU memory, fast ray tracing, and AI-powered rendering.

The new AMIs are available on the AWS Marketplace with support for Windows Server 2016 and Windows Server 2019.

Amazon EC2 P3dn.24xlarge instances

New, faster, more powerful, and larger instance size optimized for distributed machine learning and high performance computing

Amazon EC2 P3dn.24xlarge instances are the fastest, most powerful, and largest P3 instance size available and provide up to 100 Gbps of networking throughput, 8 NVIDIA® V100 Tensor Core GPUs with 32 GiB of memory each, 96 custom Intel® Xeon® Scalable (Skylake) vCPUs, and 1.8 TB of local NVMe-based SSD storage. The faster networking, new processors, doubling of GPU memory, and additional vCPUs enable developers to significantly lower the time to train their ML models or run more HPC simulations by scaling out their jobs across several instances (e.g., 16, 32, or 64 instances). Machine learning models require a large amount of data for training. In addition to increasing the throughput of passing data between instances, the additional network throughput of P3dn.24xlarge instances can also be used to speed up access to large amounts of training data by connecting to Amazon S3 or shared file system solutions such as Amazon EFS.

Remove bottlenecks and reduce machine learning training time

With 100 Gbps of networking throughput, developers can efficiently use a large number of P3dn.24xlarge instances for distributed training and significantly lower the time to train their models. The 96 vCPUs of AWS-custom Intel Skylake processors with AVX-512 instructions operating at 2.5 GHz help optimize the pre-processing of data. In addition, P3dn.24xlarge instances use the AWS Nitro System, a combination of dedicated hardware and lightweight hypervisor, which delivers practically all of the compute and memory resources of the host hardware to your instances. P3dn.24xlarge instances also support Elastic Fabric Adapter, which enables ML applications using the NVIDIA Collective Communications Library (NCCL) to scale to thousands of GPUs.

Lower TCO by optimizing GPU utilization

Enhanced networking using the latest version of the Elastic Network Adapter with up to 100 Gbps of aggregate network bandwidth can be used not only to share data across several P3dn.24xlarge instances, but also for high-throughput data access via Amazon S3 or shared file system solutions such as Amazon EFS. High-throughput data access is crucial to optimize the utilization of GPUs and deliver maximum performance from the compute instances.
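To make the effect of network bandwidth on data access concrete, the sketch below estimates the ideal time to move a training dataset at the 25 Gbps of a p3.16xlarge and the 100 Gbps of a p3dn.24xlarge. The 1,000 GB dataset size and the `efficiency` parameter are illustrative assumptions. Real transfers from Amazon S3 or Amazon EFS depend on many other factors, so treat the results as lower bounds.

```python
def transfer_seconds(dataset_gb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Ideal time in seconds to move dataset_gb gigabytes over a link_gbps link.

    efficiency (0 < efficiency <= 1) is a hypothetical derating factor for
    protocol overhead; 1.0 means the full line rate is achieved.
    """
    if dataset_gb < 0:
        raise ValueError("dataset_gb must be non-negative")
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    gigabits = dataset_gb * 8  # 1 byte = 8 bits
    return gigabits / (link_gbps * efficiency)


DATASET_GB = 1000  # illustrative dataset size, not from the source

# Network bandwidth figures from the product details table.
for name, gbps in (("p3.16xlarge", 25), ("p3dn.24xlarge", 100)):
    seconds = transfer_seconds(DATASET_GB, gbps)
    print(f"{name}: {seconds:.0f} s to read {DATASET_GB} GB at {gbps} Gbps")
```

At full line rate the 4x bandwidth difference translates directly into a 4x shorter transfer, which is time the GPUs would otherwise spend waiting for data.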

Support larger and more complex models

P3dn.24xlarge instances offer NVIDIA V100 Tensor Core GPUs with 32 GiB of memory each, which give you the flexibility to train larger and more advanced machine learning models and to process larger batches of data, such as 4K images for image classification and object detection systems.

Amazon EC2 P3 instance product details

| Instance Size | GPUs (Tesla V100) | GPU Peer to Peer | GPU Memory (GB) | vCPUs | Memory (GB) | Network Bandwidth | EBS Bandwidth | On-Demand Price/hr* | 1-yr Reserved Instance Effective Hourly* | 3-yr Reserved Instance Effective Hourly* |
|---|---|---|---|---|---|---|---|---|---|---|
| p3.2xlarge | 1 | N/A | 16 | 8 | 61 | Up to 10 Gbps | 1.5 Gbps | $3.06 | $1.99 | $1.05 |
| p3.8xlarge | 4 | NVLink | 64 | 32 | 244 | 10 Gbps | 7 Gbps | $12.24 | $7.96 | $4.19 |
| p3.16xlarge | 8 | NVLink | 128 | 64 | 488 | 25 Gbps | 14 Gbps | $24.48 | $15.91 | $8.39 |
| p3dn.24xlarge | 8 | NVLink | 256 | 96 | 768 | 100 Gbps | 19 Gbps | $31.218 | $18.30 | $9.64 |

* - Prices shown are for Linux/Unix in the US East (Northern Virginia) AWS Region and rounded to the nearest cent. For full pricing details, see the Amazon EC2 pricing page.
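The relative savings implied by the listed prices can be worked out directly. The sketch below computes the percentage discount of each Reserved Instance effective hourly rate against the On-Demand rate, using only the figures from the table above. The dictionary layout and function name are illustrative choices.

```python
# (On-Demand, 1-yr RI effective, 3-yr RI effective) hourly prices in USD,
# copied from the product details table.
PRICES = {
    "p3.2xlarge": (3.06, 1.99, 1.05),
    "p3.8xlarge": (12.24, 7.96, 4.19),
    "p3.16xlarge": (24.48, 15.91, 8.39),
    "p3dn.24xlarge": (31.218, 18.30, 9.64),
}


def savings_percent(on_demand: float, effective: float) -> float:
    """Percentage saved by paying `effective` instead of `on_demand` per hour."""
    if on_demand <= 0:
        raise ValueError("on_demand must be positive")
    return (1 - effective / on_demand) * 100


for size, (on_demand, one_year, three_year) in PRICES.items():
    print(
        f"{size}: 1-yr RI saves {savings_percent(on_demand, one_year):.1f}%, "
        f"3-yr RI saves {savings_percent(on_demand, three_year):.1f}%"
    )
```

For the listed prices, the 1-year rates come out roughly a third below On-Demand and the 3-year rates roughly two thirds below.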

Customers can purchase P3 instances as On-Demand Instances, Reserved Instances, Spot Instances, and Dedicated Hosts.

Billing by the second

One of the many advantages of cloud computing is the elastic nature of provisioning or deprovisioning resources as you need them. By billing usage down to the second, we enable customers to increase their elasticity, save money, and optimize their allocation of resources toward achieving their machine learning goals.
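A short calculation shows what per-second billing means for a brief job. The sketch below prices a hypothetical 18-minute run on a p3.16xlarge at the listed On-Demand rate of $24.48 per hour and compares it with what the same run would cost if usage were rounded up to whole hours. The 60-second minimum charge is an assumption based on how EC2 per-second billing generally works for Linux instances; check the Amazon EC2 pricing page for the terms that apply to you.

```python
import math


def per_second_cost(hourly_rate: float, runtime_seconds: float, minimum_seconds: int = 60) -> float:
    """Cost in USD when usage is billed per second.

    minimum_seconds is an assumed minimum charge per instance launch.
    """
    if hourly_rate < 0 or runtime_seconds < 0:
        raise ValueError("hourly_rate and runtime_seconds must be non-negative")
    billed_seconds = max(math.ceil(runtime_seconds), minimum_seconds)
    return hourly_rate * billed_seconds / 3600


def per_hour_cost(hourly_rate: float, runtime_seconds: float) -> float:
    """Cost in USD if usage were rounded up to whole hours (for comparison)."""
    if hourly_rate < 0 or runtime_seconds < 0:
        raise ValueError("hourly_rate and runtime_seconds must be non-negative")
    billed_hours = max(math.ceil(runtime_seconds / 3600), 1)
    return hourly_rate * billed_hours


RATE = 24.48       # p3.16xlarge On-Demand price per hour, from the table
RUNTIME = 18 * 60  # hypothetical 18-minute job, in seconds

print(f"Billed per second: ${per_second_cost(RATE, RUNTIME):.2f}")
print(f"Billed per hour:   ${per_hour_cost(RATE, RUNTIME):.2f}")
```

The shorter and more frequent the jobs, the larger the gap between the two billing models becomes.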

Reserved Instance pricing

Reserved Instances provide you with a significant discount (up to 75%) compared to On-Demand Instance pricing. In addition, when Reserved Instances are assigned to a specific Availability Zone, they provide a capacity reservation, giving you additional confidence in your ability to launch instances when you need them.

Spot pricing

With Spot Instances, you pay the Spot price that's in effect for the time period that your instances are running. Spot Instance prices are set by Amazon EC2 and adjust gradually based on long-term trends in supply and demand for Spot Instance capacity. Spot Instances are available at a discount of up to 90% off compared to On-Demand pricing.

The broadest global availability

P3 instances global availability

Amazon EC2 P3.2xlarge, P3.8xlarge and P3.16xlarge instances are available in 14 AWS Regions so that customers have the flexibility to train and deploy their machine learning models wherever their data is stored. Available regions for P3 are the US East (N. Virginia), US East (Ohio), US West (Oregon), Canada (Central), Europe (Ireland), Europe (Frankfurt), Europe (London), Asia Pacific (Tokyo), Asia Pacific (Seoul), Asia Pacific (Sydney), Asia Pacific (Singapore), China (Beijing), China (Ningxia), and GovCloud (US-West) AWS Regions.

P3dn.24xlarge instances are available in the Asia Pacific (Tokyo), Europe (Ireland), US East (N. Virginia), US West (Oregon), GovCloud (US-West), and GovCloud (US-East) AWS regions.

Get started with Amazon EC2 P3 instances for machine learning

To get started within minutes, learn more about Amazon SageMaker or use the AWS Deep Learning AMI, pre-installed with popular deep learning frameworks such as Caffe2 and MXNet. Alternatively, you can also use the NVIDIA AMI with GPU driver and CUDA toolkit pre-installed.

Blogs, articles, and webinars

Broadcast Date: December 19, 2018

Level: 200

Computer vision deals with how computers can be trained to gain a high-level understanding from digital images or videos. The history of computer vision dates back to the 1960s, but recent advancements in processing technology have enabled applications such as navigation of autonomous vehicles. This tech talk will review the different steps required to build, train, and deploy a machine learning model for computer vision. We will compare and contrast the training of computer vision models using different Amazon EC2 instances and highlight how significant time savings can be achieved by using Amazon EC2 P3 instances.

Broadcast Date: July 31, 2018

Level: 200

Organizations are tackling exponentially complex questions across advanced scientific, energy, high tech, and medical fields. Machine learning (ML) makes it possible to quickly explore the multitude of scenarios and generate the best answers, ranging from image, video, and speech recognition to autonomous vehicle systems and weather prediction. For data scientists, researchers, and developers who want to speed up development of their ML applications, Amazon EC2 P3 instances are the most powerful, cost-effective, and versatile GPU compute instances available in the cloud.
