There are two distinct levels of behavior, whether spans are created and whether headers are propagated, and the expected behavior should be highlighted for both of them.
There are seven high-level scenarios to consider for when we should enable distributed tracing in the SDK libraries.
To be clear, a default implementation is one where we ship the plugin with azure.core. An example of a tracing implementation is OpenCensus.
Also to be clear, by "current span context" I always mean the span context at runtime, not the span context carried in a message.
Customer Action | Spans? | Header? |
---|---|---|
The customer does not have any implementations installed and does not specify an implementation via the settings. | No | No |
The customer does have a default tracing implementation installed but does not import it. The customer does not specify an implementation via the settings. | No | No |
The customer does have a default tracing implementation installed but does not import it. The customer does specify an implementation via the settings. | Yes | Yes |
The customer does have a default tracing implementation installed and imported. That implementation is not OpenCensus. The customer does not specify an implementation via the settings. | No | No |
The customer does have OpenCensus installed and imported. The customer does not specify an implementation via the settings. | Yes | Yes |
The customer does not have any implementations installed but specifies an implementation via the settings by passing in their own plugin. | Yes | Yes |
The customer has settings such that tracing should happen, but they specify that they only want to propagate the headers and not trace. | No | Yes |
To be clear, the implementation the customer specifies through the settings takes top priority.
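As a rough illustration of the table and the priority rule above, the selection logic could be sketched as follows. This is a hypothetical sketch: the function name `resolve_tracer` and its arguments are illustrative stand-ins, not real azure.core APIs.

```python
# Hypothetical sketch of the selection order described above: an implementation
# passed through the settings always wins; otherwise a default plugin is only
# picked up implicitly when it wraps OpenCensus and has been imported.

def resolve_tracer(settings_impl, detected_impl, is_opencensus):
    """Return the tracing implementation to use, or None to disable tracing.

    settings_impl: implementation explicitly set by the customer (top priority).
    detected_impl: a default plugin that is both installed and imported, or None.
    is_opencensus: whether the detected plugin is the OpenCensus one.
    """
    if settings_impl is not None:          # explicit setting always wins
        return settings_impl
    if detected_impl is not None and is_opencensus:
        return detected_impl               # only OpenCensus is used implicitly
    return None                            # no spans, no headers

# Explicit setting beats an installed default; a non-OpenCensus default
# that is merely installed and imported does not enable tracing by itself.
choice = resolve_tracer("my_plugin", "opencensus_impl", True)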
The three request types are regular requests, paging, and long-running operations. The following is how the trace should look for each of them as of Preview 2. The code that the customer has to write to generate the trace, along with the trace exported to JSON, is in the same gist.
An example of a regular request is when the user calls keyclient.create_key using Key Vault. There should be a span created for the method and a span for every network request. In the case of retries, each retry should appear as its own span. If a request fails, we see a clear retry span, as in the following.
Each HTTP request span should have the following labels:

- http.method
- http.status_code
- http.url
- http.user_agent
- component
- x-ms-client-request-id
- x-ms-request-id
An example is shown with the exact label names below.
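As a hedged sketch, the labels above might be assembled like this. The label keys come from the list above, but the dictionary shapes used for the request and response are illustrative stand-ins, not real azure.core pipeline types.

```python
def request_span_attributes(request, response):
    """Collect the per-request span labels listed above.

    The `request`/`response` dict shapes are assumptions for illustration.
    """
    return {
        "http.method": request["method"],
        "http.status_code": response["status"],
        "http.url": request["url"],
        "http.user_agent": request["headers"].get("User-Agent", ""),
        "component": "http",
        "x-ms-client-request-id": request["headers"].get("x-ms-client-request-id", ""),
        "x-ms-request-id": response["headers"].get("x-ms-request-id", ""),
    }

attrs = request_span_attributes(
    {"method": "PUT",
     "url": "https://example.vault.azure.net/keys/k",
     "headers": {"User-Agent": "azsdk-python", "x-ms-client-request-id": "abc"}},
    {"status": 200, "headers": {"x-ms-request-id": "xyz"}},
)
```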
Each HTTP request's headers should carry the span context. The actual propagation standard depends on the implementation.
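For instance, a minimal sketch of header injection, assuming the W3C Trace Context traceparent format; a real implementation would delegate the wire format to the configured tracing plugin, which may use a different standard.

```python
def inject_traceparent(headers, trace_id, span_id, sampled=True):
    """Write the current span context into the outgoing request headers.

    Assumes the W3C `traceparent` layout: version-traceid-spanid-flags.
    """
    flags = "01" if sampled else "00"
    headers["traceparent"] = "00-{}-{}-{}".format(trace_id, span_id, flags)
    return headers

headers = inject_traceparent(
    {}, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"
)
```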
Paging is a little more interesting because the SDK makes requests lazily. An example of paging is keyclient.list_keys, which returns an iterator without making a network request. Each time the customer iterates and there are more keys in the next page, the SDK should fetch that page. Therefore, there should be a span for the top-level function, which does not make any network calls, and a span for each network call made to fetch a page. There should, however, not be an overarching parent span for them. An example of such a trace is shown below.
As can be seen from the trace, keyclient.list_keys calls another function, and neither of them makes a network call. Because the customer implicitly triggers the network call that fetches the next page, the fetches happen at different points in time. Before fetching the first page and each subsequent page, the customer spends a tenth of a second doing something else, and that behavior is well represented in the trace.
In later previews, each span that represents a network call to fetch a page will be linked to the original trigger function, which in the example above is keyclient.list_keys. This feature will not be in Preview 2.
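The lazy span behavior described above can be sketched with a toy tracer. The `span` context manager, `list_keys`, and the page lists are all stand-ins for illustration, not real azure.core or Key Vault APIs; the point is only that the top-level span closes before any page-fetch span opens, and a fetch span is created only when the customer iterates.

```python
import contextlib

@contextlib.contextmanager
def span(name, log):
    # Toy tracer: record span boundaries in a list instead of exporting them.
    log.append("start " + name)
    yield
    log.append("end " + name)

def list_keys(pages, log):
    # Span for the top-level method: no network calls happen here.
    with span("list_keys", log):
        pass  # returns lazily; nothing is fetched yet

    def iterate():
        for i, page in enumerate(pages):
            # One span per page fetch, opened only when iteration reaches it.
            # Note it is NOT a child of the list_keys span, which has closed.
            with span("fetch page %d" % i, log):
                for item in page:
                    yield item

    return iterate()

log = []
keys = list(list_keys([["k1", "k2"], ["k3"]], log))
```

Consuming the iterator yields the keys in order, and the log shows one list_keys span followed by one span per fetched page.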
The span attributes for spans that represent network calls are to be the same as those for regular requests.
Long-running operations are requests where the SDK customer asks a service to perform an action that takes a significant amount of time. The server responds to the SDK with a "check back later", which initiates a series of polling requests to check the status of the original long-running operation. An example of such an operation is fileclient.copy_file_from_url in azure.storage.files. For such cases there should be a span created for every network request; that is, there should be a span whenever the polling method queries the status. If the SDK customer calls the waiting method, there should be a span created for the duration of that method. An example can be seen below.
It can be seen clearly that in this particular case, after the initial call, wait was called, and the user received the resource after a final bit of polling. In the future, the spans related to a single polling instance will be linked, but not in Preview 2.
The span attributes for spans that represent network calls are to be the same as those for regular requests.
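The polling span shape described above can be sketched as follows. The `span` helper, `wait_for_copy`, and the status list are hypothetical stand-ins, not real azure.storage.files APIs: one span covers the customer's call to the waiting method, and one span is opened for each status-check request inside it.

```python
import contextlib

@contextlib.contextmanager
def span(name, log):
    # Toy tracer: record span names instead of exporting them.
    log.append(name)
    yield

def wait_for_copy(statuses, log):
    # Span for the duration of the waiting method.
    with span("wait", log):
        while True:
            # One span per polling request; popping a status stands in
            # for the network call that queries the operation's state.
            with span("poll status", log):
                status = statuses.pop(0)
            if status == "succeeded":
                return status

log = []
result = wait_for_copy(["pending", "pending", "succeeded"], log)
```

With two "pending" responses before success, the log shows a single wait span and three poll-status spans.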
There are two cases where tracing the lifetime of a message could be interesting: Event Hubs and Storage Queues. Event Hubs uses AMQP, which defines message metadata that can store the span context from the time the message was created. AMQP defines two different schemes, annotations and properties, and the community has yet to reach a consensus on where the span context should be attached. Messages in Storage Queues can be arbitrary bytes with no standard metadata. This means that for both cases, there will be no support for tracing the lifetime of a message as of Preview 2.
When the specs for Event Hubs are defined, it will be important to consider how the lifetime of messages should be traced. For individual messages, it is easy: to send, we attach the current span context to the message; when an individual message is received, the span context from that message is linked to the current span context in the SDK. The problem arises when we consider receiving messages in batches. For sending, we could put the current span context in the metadata of every message. For receiving batches, there are two things to consider.
- The first is that there is a cost to parsing the span context from the headers, and that cost becomes large when there might be millions of messages to parse.
- The second is the case where each message in the batch has a different span context. Say the messages each carry a number and the SDK customer is trying to average them: what should the trace behavior be? If there were no cost to parsing and linking, the span context from each message could be linked to the current span context in the SDK.
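The send-side stamping and the receive-side trade-off above can be sketched like this. The message shape and the "span-context" property key are assumptions for illustration, not a settled AMQP convention (which, as noted, the community has not agreed on).

```python
def send(messages, current_ctx):
    # On send: stamp every message in the batch with the sender's
    # span context under an application-defined property key (assumed name).
    for msg in messages:
        msg["properties"]["span-context"] = current_ctx
    return messages

def receive_batch(messages, link):
    # On receive: optionally parse each message's stored context and collect
    # it for linking to the current SDK span context. Skipping this step
    # (link=False) avoids the per-message parsing cost for huge batches.
    links = []
    if link:
        for msg in messages:
            ctx = msg["properties"].get("span-context")
            if ctx is not None:
                links.append(ctx)
    return links

batch = send([{"properties": {}}, {"properties": {}}], "ctx-a")
```

With linking enabled, every message's context is parsed and collected; with it disabled, both the parsing cost and the links are skipped entirely, which is the trade-off for million-message batches.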