@nguerrera
Last active July 25, 2016 22:36
Archive of PDB embedding proposal

This proposal addresses #5397, which requests a feature for embedding source code inside of a PDB.

I am committed to implementing this with whatever changes fall out from the review if it is approved. I have an initial implementation in a WIP PR (#12353) that I will evolve based on feedback here. Some details here are new based on recent offline feedback and not yet matched by the implementation.

Scenarios

Recap from #5397

  • During the build, source code is auto-generated and then compiled. This auto-generated source does not exist on the source control server and is often not preserved as a build artifact. Even if it is preserved, it can't be indexed on a symbol server, which makes it difficult to acquire at debug time.
  • A company is OK from an IP standpoint to release source for some of their projects, but their source control system is behind a firewall. Their IT security policies prevent giving any external access to the source control system, which prevents typical usage of source server. They already provide PDBs to customers, and by including source in the PDBs the customer's debugging experience improves with minimal additional work.
  • An Open Source project is doing all their development on GitHub and they currently use source server to distribute source, but they don't like the additional configuration necessary in VS to enable it. By distributing the source in the PDB they eliminate this additional configuration.

Also

  • See #12390, which requests embedding PDBs in PE files and argues for the power of combining that with this.
  • Binary analysis is often chosen due to the ease of acquiring binaries over integrating into someone else's build, but comes at the cost of precision. This is a step towards having tools that can be pointed at binaries but analyze source, which was my primary motivation for contributing to this. There's more that I want to see in that direction: e.g. serialized compilation options, reference MVIDs in PDB -- ultimately enough to reproduce the compilation from a binary. Access to generated code was just one piece of that, but it overlaps with the use cases noted above and provides substantial value on its own.

Command Line Usage

Since common usage will already leverage a source server and only require generated code to be embedded, we need to be able to specify the files to embed individually.

Proposal: Add a new /embed switch for vbc.exe and csc.exe:

  • /embed: embeds all source files in the PDB.

  • /embed:<file list>: embeds specific files in the PDB.

  • <file list> shall be parsed exactly as /additionalfile with semicolon separation and wildcard expansion.

  • If specific source files are to be embedded, they need to be specified as source files in the usual way AND passed to /embed.

NOTE: Some care should be taken in the compiler not to read the same files twice. The approach we landed on in design review is that if the /embed argument and source argument expand to the exact same full path (without normalization applied and case-sensitively), then we will not re-read the text of the source file. However, in the edge case, different spellings of the same file on the command line can lead to reading the same file more than once. They may also lead to repeated document entries in the PDB unless the difference is eliminated by the path normalization or the language-specific case-sensitivity policy applied by the underlying debug document table. An earlier version of this proposal attempted to address these issues by having distinct mechanisms for embedding source files (without repeating their paths) and additional files. However, it was decided in design review that the complexity added to the command line and API was not worth the marginal gain.

  • It is not an error to pass a file that does not represent source in the compilation to /embed. Such files will simply be added to the PDB, which is a deliberate feature.
  • It is an error to pass /embed without /debug: we can't embed text in the PDB if we're not emitting a PDB.
  • All files passed to /embed shall be included in the PDB regardless of whether or not there are sequence points targeting them.
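
The read-once rule from the NOTE above can be sketched as an exact string comparison. This is only an illustration, not the compiler's implementation: the class and method names are hypothetical, and it assumes both lists already hold fully expanded paths.

```csharp
using System;
using System.Collections.Generic;

static class EmbedReadPlan
{
    // Returns the /embed paths that still need their own read from disk, i.e.
    // those that do not match a source path exactly (ordinal comparison,
    // case-sensitive, no normalization).
    public static List<string> PathsNeedingSeparateRead(
        IEnumerable<string> sourcePaths,
        IEnumerable<string> embedPaths)
    {
        var alreadyRead = new HashSet<string>(sourcePaths, StringComparer.Ordinal);
        var result = new List<string>();
        foreach (var path in embedPaths)
        {
            if (!alreadyRead.Contains(path))
            {
                result.Add(path);
            }
        }
        return result;
    }

    static void Main()
    {
        var sources = new[] { @"C:\src\a.cs", @"C:\src\gen.cs" };
        var embeds = new[] { @"C:\src\gen.cs", @"C:\SRC\A.cs", @"C:\src\notes.txt" };

        // gen.cs matches exactly and is skipped. The differently spelled A.cs
        // and the non-source notes.txt are both reported as needing a read.
        foreach (var path in PathsNeedingSeparateRead(sources, embeds))
        {
            Console.WriteLine(path);
        }
    }
}
```

The differently cased spelling is deliberately treated as a separate file here, which is the edge case the NOTE accepts.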

Examples

  • Embed no sources in PDB (default)
csc /debug+ *.cs 
  • Embed all sources in PDB
csc /debug+ /embed *.cs
  • Embed only some sources in PDB
csc /debug+ src\*.cs /embed:generated\*.cs

#line directives

There is also a scenario where debugging requires external files that are not part of the compilation and are lined up to the actual source code via #line directives.

Proposal: A file targeted by a #line directive shall be embedded in the PDB if either the target file or the referencing source file is embedded.

Example

source.cs

class P {
    static void Main() {
#line 1 "example.xyz"
          System.Console.WriteLine("Hello World");
    }
}

example.xyz

print "Hello World"
  • Compile source.cs and embed only example.xyz in pdb
  • Here we're explicitly asking to embed only example.xyz
csc source.cs /embed:example.xyz /debug+
  • Compile source.cs and embed both source.cs and example.xyz in pdb
  • Here we're asking to embed all source, and source.cs further pulls in example.xyz via #line.
csc source.cs /embed /debug+
  • Compile source.cs and embed source.cs and example.xyz in pdb
  • Here we're explicitly asking to embed source.cs, which further pulls in example.xyz via #line.
csc source.cs /embed:source.cs /debug+
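
One way to picture the #line rule is as a single expansion step over the requested files. The sketch below is an assumption-laden illustration rather than the compiler's logic: the names are hypothetical, the regular expression is a simplified stand-in for real directive parsing, and paths are compared exactly as written.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LineDirectiveEmbedding
{
    // Simplified pattern for directives such as: #line 1 "example.xyz"
    static readonly Regex LineDirective = new Regex(
        @"^[ \t]*#[ \t]*line[ \t]+\d+[ \t]+""([^""]+)""",
        RegexOptions.Multiline);

    // sourceTextByPath: every source file in the compilation (path -> text).
    // explicitlyEmbedded: the paths requested via /embed.
    public static SortedSet<string> FilesToEmbed(
        IDictionary<string, string> sourceTextByPath,
        IEnumerable<string> explicitlyEmbedded)
    {
        var result = new SortedSet<string>(explicitlyEmbedded, StringComparer.Ordinal);

        // Iterate over a snapshot so that adding #line targets is safe.
        foreach (var path in new List<string>(result))
        {
            string text;
            if (!sourceTextByPath.TryGetValue(path, out text))
            {
                continue; // not a source file, so there are no directives to follow
            }
            foreach (Match match in LineDirective.Matches(text))
            {
                result.Add(match.Groups[1].Value);
            }
        }
        return result;
    }

    static void Main()
    {
        var sources = new Dictionary<string, string>
        {
            ["source.cs"] = "class P {\n    static void Main() {\n#line 1 \"example.xyz\"\n    }\n}\n",
        };

        // Mirrors: csc source.cs /embed:source.cs /debug+
        Console.WriteLine(string.Join(", ", FilesToEmbed(sources, new[] { "source.cs" })));
    }
}
```

Requesting only example.xyz leaves the set at that one file, because the referencing source.cs is not pulled in by its target.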

Source Generators

This feature would pair nicely with https://github.com/dotnet/roslyn/blob/features/source-generators/docs/features/generators.md if/when both land, allowing generator output to be debugged without any requirement to acquire (or regenerate) the output by some other means.

We might choose to handle embedding source generator output in one of 3 ways:

  1. Always embed generator output if a PDB is being emitted.
  2. Add a way to decorate a generator as opting in (or out) of having its output embedded.
  3. Add a command-line switch to control whether generator output is embedded.

After much discussion about an earlier version of this proposal, there was a strong desire to keep the command-line interface minimal, so I think (1) or (2) should be preferred. I personally think always embedding generator output is the best option as it means that generators get good debuggability with no fuss. We could always add a command-line or generator API opt-out later if there was anyone pushing back on embedding the generator output.

I propose that we open a separate follow-up issue to track how to integrate these two features after both have arrived in a common branch and discuss 1-3 or other alternatives there.

Command Line API

Proposal: Add a property to Microsoft.CodeAnalysis.CommandLineArguments to indicate a list of files to be embedded in the PDB.

public class CommandLineArguments {
    // ...
    // New property: the files to be embedded in the PDB.
    public IEnumerable<CommandLineSourceFile> EmbeddedFiles { get; }
}

Note that if /embed is specified without arguments it is surfaced here by appending the full set of source files to this list and not via a separate API.

Emit API

It should be possible to embed source and additional text via public API without routing through the command-line compiler interface.

Proposal: Add an optional parameter to Microsoft.CodeAnalysis.Compilation.Emit() that specifies a list of (path + text) pairs to embed (with the usual technique for preserving binary compat).

namespace Microsoft.CodeAnalysis {
    public abstract class Compilation {
        // ...
        public EmitResult Emit(
            // Existing parameters 
            Stream peStream,
            Stream pdbStream = null,
            Stream xmlDocumentationStream = null,
            Stream win32Resources = null,
            IEnumerable<ResourceDescription> manifestResources = null,
            EmitOptions options = null,
            IMethodSymbol debugEntryPoint = null,

            // New parameter: specify the texts (with their paths) to embed
            IEnumerable<EmbeddedText> embeddedTexts = null,

            // Existing parameter
            CancellationToken cancellationToken = default(CancellationToken));
    }

    // new type
    public struct EmbeddedText : IEquatable<EmbeddedText> {
        // both args must be non-null, filePath must be non-empty, and text must have an encoding.
        public EmbeddedText(string filePath, SourceText text);
        public string FilePath { get; }
        public SourceText Text { get; }
        public bool IsDefault { get; } // to check if initialized

        // plus the usual GetHashCode, Equals, ==, != overrides/overloads
    }
}

Note that it is the caller's responsibility to gather source and non-source text as appropriate. Text will line up with corresponding source/sequence points by the existing mechanism for de-duping debug documents generated by source trees, #line, and #pragma checksum: i.e. paths will be normalized and then compared case-insensitively for VB and case-sensitively for C#.

Compression

Files beyond a trivial size should be compressed in the PDB. GZIP format (http://www.gzip.org/zlib/rfc-gzip.html) will be used. Tiny files do not benefit from compression, and compressing them can waste cycles and even make them bigger, so we should have a threshold at which we start to compress.
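
A minimal sketch of the threshold idea, using GZipStream from System.IO.Compression. The threshold value and all names are hypothetical, since the proposal does not fix a number.

```csharp
using System;
using System.IO;
using System.IO.Compression;

static class EmbeddedSourceCompression
{
    // Hypothetical cut-off in bytes; the proposal leaves the value open.
    const int CompressionThreshold = 200;

    // Returns the bytes to store and reports whether they were GZIP-compressed.
    public static byte[] MaybeCompress(byte[] content, out bool compressed)
    {
        if (content.Length < CompressionThreshold)
        {
            compressed = false;
            return content; // tiny file: store as-is
        }

        using (var output = new MemoryStream())
        {
            // Dispose the GZipStream before reading the buffer so that the
            // GZIP trailer is flushed into the output.
            using (var gzip = new GZipStream(output, CompressionLevel.Optimal, leaveOpen: true))
            {
                gzip.Write(content, 0, content.Length);
            }
            compressed = true;
            return output.ToArray();
        }
    }

    static void Main()
    {
        bool compressed;
        byte[] stored = MaybeCompress(new byte[10000], out compressed);
        Console.WriteLine("compressed: " + compressed + ", stored bytes: " + stored.Length);
    }
}
```

A real implementation might also fall back to the raw bytes when the compressed result is not smaller, but that is a design choice the proposal does not state.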

Encoding

Files should be persisted in their original encoding as denoted by SourceText.Encoding.

Portable PDB Representation

In portable PDBs, we will put the embedded source as a custom debug info entry (with a new GUID allocated for it) parented by the document entry. A leading uint16 indicates the format (0 = raw bytes, uncompressed; 1 = GZIP). The portable PDB specification has been updated accordingly: dotnet/corefx#10017
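
The blob layout described above (a uint16 format indicator followed by the content) could be written and read roughly as follows. This is a sketch under assumptions: the names are hypothetical, and little-endian byte order for the uint16 is assumed rather than taken from the proposal.

```csharp
using System;
using System.IO;
using System.IO.Compression;

static class EmbeddedSourceBlob
{
    const ushort FormatRaw = 0;  // raw bytes, uncompressed
    const ushort FormatGzip = 1; // GZIP-compressed bytes

    // Prepends the format indicator (assumed little-endian) to the payload.
    public static byte[] Write(byte[] payload, bool payloadIsGzip)
    {
        ushort format = payloadIsGzip ? FormatGzip : FormatRaw;
        var blob = new byte[2 + payload.Length];
        blob[0] = (byte)(format & 0xFF);
        blob[1] = (byte)(format >> 8);
        Buffer.BlockCopy(payload, 0, blob, 2, payload.Length);
        return blob;
    }

    // Returns the original file bytes, decompressing when the indicator says GZIP.
    public static byte[] Read(byte[] blob)
    {
        if (blob.Length < 2)
        {
            throw new InvalidDataException("Blob is too short to hold a format indicator.");
        }

        int format = blob[0] | (blob[1] << 8);
        var content = new MemoryStream(blob, 2, blob.Length - 2);
        using (var output = new MemoryStream())
        {
            switch (format)
            {
                case FormatRaw:
                    content.CopyTo(output);
                    break;
                case FormatGzip:
                    using (var gzip = new GZipStream(content, CompressionMode.Decompress))
                    {
                        gzip.CopyTo(output);
                    }
                    break;
                default:
                    throw new InvalidDataException("Unknown format indicator: " + format);
            }
            return output.ToArray();
        }
    }
}
```

Keeping the indicator separate from the payload means a reader never has to guess whether the bytes are compressed.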

Windows PDB Representation

The traditional Windows PDB already had a provision for embedded source, which we will use via ISymUnmanagedDocumentWriter::SetSource.

The corresponding method for reading back the embedded source returned E_NOTIMPL until recently, but I have made the change to implement it and an update to the nuget package is pending.

Since this is an existing mechanism (though we don't believe it is used in practice since there hasn't been compiler or debugger support), we do not have a place to put a compression/format indicator. It was decided in discussion with the diasymreader owner and other stakeholders that we can reasonably expect callers to sniff for the GZIP signature (0x1F, 0x8B). The rationale there is that the sequence is neither valid ASCII nor valid UTF-8, and cannot be the start of any Unicode BOM.
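
A reader of Windows PDB embedded source might sniff for the signature along these lines. The class and method names are hypothetical, and the sketch works on a byte array that the caller has already read back from the PDB.

```csharp
using System;
using System.IO;
using System.IO.Compression;

static class WindowsPdbEmbeddedSource
{
    // True when the bytes start with the GZIP signature (0x1F, 0x8B).
    public static bool HasGzipSignature(byte[] bytes)
    {
        return bytes != null && bytes.Length >= 2 && bytes[0] == 0x1F && bytes[1] == 0x8B;
    }

    // Returns the stored bytes unchanged unless they carry the GZIP signature,
    // in which case they are decompressed first.
    public static byte[] Decode(byte[] stored)
    {
        if (!HasGzipSignature(stored))
        {
            return stored;
        }

        using (var gzip = new GZipStream(new MemoryStream(stored), CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            gzip.CopyTo(output);
            return output.ToArray();
        }
    }
}
```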
