- Unary: the traditional request-response pattern.
- Server streaming: the client sends a single request and the server returns a stream of responses.
- Client streaming: the client sends a stream of requests and the server returns a single response.
- Bidirectional streaming: both sides send data concurrently over independent streams.
- The programming model for unary calls resembles an ordinary function call.
- For server streaming (a streamed return value), the client needs to explicitly call Finish on the reader to mark the end of the exchange (see the sketch after this list).
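A minimal client-side sketch of the first two modes, assuming hypothetical generated modules echo_pb2/echo_pb2_grpc for an Echo service with a unary UnaryEcho method and a server-streaming ServerStreamingEcho method. The reader/Finish detail above applies to the C++ synchronous API; in Python the response stream is simply iterated until the server completes it.

```python
import grpc

# echo_pb2 / echo_pb2_grpc are assumed to be generated from a hypothetical
# echo.proto declaring UnaryEcho (unary) and ServerStreamingEcho (server streaming).
import echo_pb2
import echo_pb2_grpc

with grpc.insecure_channel("localhost:50051") as channel:
    stub = echo_pb2_grpc.EchoStub(channel)

    # Unary: one request in, one response out, much like a normal function call.
    reply = stub.UnaryEcho(echo_pb2.EchoRequest(message="hello"))
    print(reply.message)

    # Server streaming: one request in, an iterator of responses out;
    # the call ends when the server finishes the stream.
    for reply in stub.ServerStreamingEcho(echo_pb2.EchoRequest(message="hello")):
        print(reply.message)
```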
This is Jiong Gong (@jgong5) from the Intel team working on PyTorch optimization for CPU. In this post, I'd like to give an update on the recent progress of the CPU backend of TorchInductor, the new DL compiler of PyTorch. Designed to support multiple device backends, TorchInductor provides backends for both CPU and NVIDIA GPU. There has been great progress on GPU backend optimization for training workloads (see this for details). On the CPU side, since a significant portion of DL workloads running on CPU are inference, we started off by optimizing CPU inference as our first step. We began the effort in early October from a low performance baseline at that point in time (see tables 1-3 below [^1]), and we are pleased to share the improvements we have made since then.
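For reference, here is a minimal sketch of routing a model through TorchInductor on CPU using the torch.compile entry point (which dispatches to TorchInductor by default); the toy model and input shapes are placeholders.

```python
import torch
import torch.nn as nn

# Placeholder model; any eager-mode nn.Module works the same way.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8)).eval()
example_input = torch.randn(16, 64)

# torch.compile traces the model with TorchDynamo and lowers it through
# TorchInductor, which generates C++/OpenMP kernels for CPU tensors.
compiled_model = torch.compile(model)

with torch.no_grad():
    out = compiled_model(example_input)
```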
We are pleased to announce the release of Intel® Extension for PyTorch* 1.13.0-cpu, which accompanies PyTorch 1.13. This release is highlighted by quite a few usability features that help users get good performance and accuracy on CPU with less effort. We also added a couple of performance features, as always. Check out the feature summary below.
- Automatic channels last format conversion: channels last conversion is now applied to PyTorch modules with ipex.optimize by default. Users don't have to explicitly convert input and weight for CV models.
- Code-free optimization (experimental): ipex.optimize is automatically applied to PyTorch modules without the need of code changes when the PyTorch program is started with the IPEX launcher via the new --auto-ipex option.
- Graph capture mode of ipex.optimize (experimental): a new boolean flag graph_mode (default off) was added to ipex.optimize; when turned on, it converts the eager-mode PyTorch module into graph(s) to get the best of graph optimization.
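A rough sketch of how the new graph_mode flag might be used; the model here is a placeholder, and depending on the IPEX version a sample_input argument may also be needed for graph capture.

```python
import torch
import torch.nn as nn
import intel_extension_for_pytorch as ipex

# Placeholder model; any eager-mode nn.Module can be used.
model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 10)).eval()

# graph_mode=True asks ipex.optimize to capture the module into graph(s)
# on top of the usual operator and weight-layout optimizations.
model = ipex.optimize(model, dtype=torch.bfloat16, graph_mode=True)

with torch.no_grad(), torch.cpu.amp.autocast():
    model(torch.randn(32, 128))
```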
Below is the preamble of the Python wrapper code that TorchInductor generates for a compiled graph:

from ctypes import c_void_p, c_long
import torch
import random
from torch import empty_strided, as_strided, device
from torchinductor.codecache import AsyncCompile

aten = torch.ops.aten
async_compile = AsyncCompile()
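The examples that follow show how Intel® Extension for PyTorch* is applied in user code. The first one covers FP32 inference, where conv-batchnorm folding is applied through torch.fx before the model is handed to ipex.optimize: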
import os
import torch
model = Model()
model.eval()
data = torch.rand(<shape>)
# Applying torch.fx.experimental.optimization.fuse against the model performs conv-batchnorm folding for better performance.
import torch.fx.experimental.optimization as optimization
model = optimization.fuse(model, inplace=True)
#################### code changes ####################
import intel_extension_for_pytorch as ipex
model = ipex.optimize(model)
######################################################
# Run inference.
with torch.no_grad():
    model(data)
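For BFloat16 inference, the model is converted to the channels last memory format and passed to ipex.optimize with dtype=torch.bfloat16; the forward pass then runs under CPU autocast: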
...
import torch
...
model = Model()
model = model.to(memory_format=torch.channels_last)
model.eval()
data = torch.rand(<shape>)
#################### code changes ####################
import intel_extension_for_pytorch as ipex
model = ipex.optimize(model, dtype=torch.bfloat16)
######################################################
# Run inference under BF16 autocast.
with torch.no_grad(), torch.cpu.amp.autocast():
    model(data)
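For BFloat16 training, the optimizer is passed to ipex.optimize together with the model so that both are prepared for mixed-precision training: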
...
import torch
...
model = Model()
model = model.to(memory_format=torch.channels_last)
criterion = ...
optimizer = ...
model.train()
#################### code changes ####################
import intel_extension_for_pytorch as ipex
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)
######################################################
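For comparison, the earlier intel_pytorch_extension package exposed a different API: BF16 auto-mixed-precision was enabled globally and the model was moved to ipex.DEVICE. Training setup looked like this: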
import torch
# Step 1: Register IPEX optimizations
import intel_pytorch_extension as ipex
from my_models import SomeModel
# Step 2: Enable BF16 auto-mixed-precision
ipex.enable_auto_mixed_precision(mixed_dtype=torch.bfloat16)
data_loader = …
# Step 3: Enable IPEX optimizations
model = SomeModel().to(ipex.DEVICE)
opt = torch.optim.SGD(model.parameters(), ...)
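The corresponding inference flow with that earlier API moves the model to ipex.DEVICE, switches it to eval mode, and scripts it with TorchScript: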
import torch
# Step 1: Register IPEX optimizations
import intel_pytorch_extension as ipex
from my_models import SomeModel
# Step 2: Enable BF16 auto-mixed-precision
ipex.enable_auto_mixed_precision(mixed_dtype=torch.bfloat16)
# Step 3: Enable IPEX optimizations
model = SomeModel().to(ipex.DEVICE).eval()
model = torch.jit.script(model)