Tarek Mahmoud Sayed tarekgh

  • Microsoft
  • Redmond
@tarekgh
tarekgh / TokenizerInterfaceChangeProposal.md
Created February 11, 2024 23:18
Tokenizer Interface Change Proposal

Tokenizer Interface Changes Proposal

This document captures thoughts and ideas for changes to the Tokenizer interface. The goal is to add or change APIs so that callers have more control over memory allocation and performance. The Tokenizer class is the main class used for tokenization, and the model will use it to encode and decode the input and output. Once this class has the right shape, we will know exactly which other changes are needed in the other main interfaces, such as Model and PreTokenizer.

    public class Tokenizer
    {
        public Tokenizer(Model model, PreTokenizer? preTokenizer = null, Normalizer? normalizer = null) { }
        public Model Model { get; }
@tarekgh
tarekgh / Tiktoken.md
Last active January 25, 2024 22:21
Tiktoken Tokenizer Proposal

This document outlines the proposal for integrating the Tiktoken Tokenizer into ML.NET. ML.NET currently features a tokenizers library for text, catering to tokenization needs for NLP tasks. Incorporating support for Tiktoken would be a valuable addition to the library, enhancing its capabilities to support AI models like GPT-4.

Usage Example

    Tokenizer tokenizer = await Tokenizer.CreateByModelNameAsync("gpt-4");

    // Encoding to Ids
@tarekgh
tarekgh / CustomEnvironmentVariablesConfigurationProvider.cs
Created September 13, 2023 16:36
Workaround to allow using `.` in the environment variables configuration
// Define the environment variable with three underscores (___) wherever you want a dot.
// Logging__LogLevel__Microsoft___Hosting___Lifetime=Information <-- Logging:LogLevel:Microsoft.Hosting.Lifetime=Information
public class CustomEnvironmentVariablesConfigurationProvider : EnvironmentVariablesConfigurationProvider
{
internal const string DefaultDotReplacement = ":_";
private string _dotReplacement;
public CustomEnvironmentVariablesConfigurationProvider(string? dotReplacement = DefaultDotReplacement) : base()
{
_dotReplacement = dotReplacement ?? DefaultDotReplacement;
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
using System;
using System.Runtime.CompilerServices;
namespace NetFxCasing
{
public static class NetFxUpperCaseInvariant
{
@tarekgh
tarekgh / ActivityProposedAdditions.md
Last active February 11, 2020 17:36
Activity Proposed Additions

OpenTelemetry has introduced the Span type, which is very similar to System.Diagnostics.Activity. In general, .NET libraries are expected to use System.Diagnostics APIs to publish tracing data. We need to support the scenario in which .NET libraries export the published data through OpenTelemetry. Because the OpenTelemetry exporting APIs work with the Span class, we need to ensure that every feature supported by the Span class can also be achieved with the Activity class, and fill any gaps between Span and Activity.

Although Activity has some properties that Span lacks (e.g. the Parent property), Span also supports some properties that do not exist in Activity. This document lists these missing properties and proposes the APIs we need to add to Activity. Most of the proposed additions here are almost identical to what OpenTel

// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.
namespace System
{
//
// - SystemDateTime represents the system date and time. It is tied to the behavior of the running OS.
// - It can handle leap seconds, i.e. when there is a leap second in the system, it can report it as second 60.
// - When subtracting two SystemDateTime instances, leap seconds are accounted for in the resulting TimeSpan.
@tarekgh
tarekgh / test.cpp
Created May 3, 2019 17:02
ICU C++ Sample
//
// c++ test.cpp -o test -std=c++0x `pkg-config --libs --cflags icu-uc icu-io` -I/usr/include/x86_64-linux-gnu -licuio -licui18n -licuuc -licudata
//
#include <stdio.h>
#include <unicode/ucol.h>
#include <unicode/usearch.h>
int main()
{
@tarekgh
tarekgh / BdfFont.cs
Last active May 4, 2019 00:24
BDF Initial Loader
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.
using System;
using System.IO;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
@tarekgh
tarekgh / CaseFolding.cs
Last active May 4, 2019 00:25
CaseFolding project showing how to generate and access case folding tables in a size-optimized way
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.
using System;
using System.IO;
using System.Globalization;
using System.Collections.Generic;
using System.Runtime.CompilerServices;