ljnmedium

## add_data.py
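# Encode the text of each record into dense and sparse vectors
texts = [b['content'] for b in batch]
values = embedd_model.encode(texts)
sparse_values = sparsed_model.encode(texts)

# Create unique IDs
ids = [str(b['metadata']['id']) for b in batch]
# Assumption: `metas` (undefined in the original snippet) holds each record's metadata
metas = [b['metadata'] for b in batch]

# Add all to upsert list
to_upsert = [
    {'id': i, 'values': v, 'metadata': m, 'sparse_values': sv}
    for i, v, m, sv in zip(ids, values, metas, sparse_values)
]

# Upsert/insert these records to pinecone (`index` comes from pinecone_setup.py)
index.upsert(vectors=to_upsert)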
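
The snippet above assembles a single batch. As a minimal sketch of the loop that could surround it, the helpers below split records into fixed-size batches, build the upsert payload, and send each batch. The names `chunked`, `build_records`, and `upsert_in_batches` and the default batch size of 100 are hypothetical and not part of the original gist. `index` is assumed to be any object exposing `upsert(vectors=...)`, and the two encoder callables stand in for `embedd_model.encode` and `sparsed_model.encode`.

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterable, Iterator, List, Sequence


def chunked(items: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def build_records(
    batch: Sequence[Dict[str, Any]],
    dense: Sequence[Sequence[float]],
    sparse: Sequence[Any],
) -> List[Dict[str, Any]]:
    """Pair each record with its dense and sparse vectors."""
    if not (len(batch) == len(dense) == len(sparse)):
        raise ValueError("batch, dense and sparse must have the same length")
    return [
        {
            "id": str(rec["metadata"]["id"]),
            "values": list(vec),  # plain list, in case the encoder returns arrays
            "metadata": rec["metadata"],
            "sparse_values": sp,
        }
        for rec, vec, sp in zip(batch, dense, sparse)
    ]


def upsert_in_batches(
    index: Any,
    records: Iterable[Dict[str, Any]],
    encode_dense: Callable[[List[str]], Sequence[Sequence[float]]],
    encode_sparse: Callable[[List[str]], Sequence[Any]],
    batch_size: int = 100,  # hypothetical default, tune for your payload size
) -> int:
    """Encode and upsert `records` batch by batch; return how many were sent."""
    sent = 0
    for batch in chunked(records, batch_size):
        texts = [rec["content"] for rec in batch]
        payload = build_records(batch, encode_dense(texts), encode_sparse(texts))
        index.upsert(vectors=payload)
        sent += len(payload)
    return sent
```

Keeping the payload builder separate from the network call makes it possible to check the record layout with a stub index before touching a real one.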

## managing_index.py
index.describe_index_stats()
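
As a rough sketch of how the statistics could be inspected after an upsert, the helper below summarizes a stats payload. It assumes the response has already been converted to a plain dict with `dimension`, `total_vector_count`, and `namespaces` keys; that shape and the function name `summarize_stats` are assumptions for illustration and may differ between client versions.

```python
from typing import Any, Dict, Mapping


def summarize_stats(stats: Mapping[str, Any]) -> Dict[str, Any]:
    """Reduce an index-stats dict to the fields most useful for a sanity check."""
    namespaces = stats.get("namespaces") or {}
    per_namespace = {
        name: int(info.get("vector_count", 0)) for name, info in namespaces.items()
    }
    return {
        "dimension": stats.get("dimension"),
        "total_vector_count": int(stats.get("total_vector_count", 0)),
        "per_namespace": per_namespace,
    }
```

Comparing `total_vector_count` before and after a load is a quick way to confirm that the expected number of records arrived.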

## pinecone_setup.py
import pinecone
# Connect to Pinecone (pinecone-client v2 style) and open the index
pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")
index = pinecone.Index("projet_esg")

## table_qa_models.md

      
Created September 29, 2023 07:23

| | Direct query: information appearing in the text (entity extraction, summarization, finding relevant paragraphs, etc.) | Indirect query: inferred information (mathematical calculation, comparison, conclusion, etc.) |
|---|---|---|
| Simple text: text containing descriptions, excluding tables | Complexity: +, Accuracy: +++ | Complexity: ++, Accuracy: +++ |
| Complex text: Text | | |


## llm_size.md

      
Created September 28, 2023 09:45

| Provider | Model | Number of parameters |
|---|---|---|
| Meta (with Microsoft) | Llama 2 | 7B, 13B, 70B |
| Meta | LLaMA | 7B, 13B, 32.5B, 65.2B |
| Technology Innovation Institute (UAE) | Falcon LLM | 7B, 40B |
| Stanford CRFM | Alpaca | 7B |
| Google | Flan-T5 | 80M, 250M, 780M, 3B, 11B |
| MosaicML | MPT | 7B, 30B |
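
To relate the parameter counts above to hardware needs, a back-of-the-envelope estimate multiplies the number of parameters by the bytes used per parameter. The sketch below covers model weights only; activations, KV cache, and optimizer state are deliberately left out, and the function name `weight_memory_gb` is hypothetical.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(n_params: float, dtype: str = "fp16") -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    if dtype not in BYTES_PER_PARAM:
        raise ValueError(f"unknown dtype: {dtype}")
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
        print(name, weight_memory_gb(params, "fp16"), "GB in fp16")
```

Under these assumptions a 7B model needs about 14 GB in fp16, which is why the smaller variants are the usual starting point for on-premise experiments.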


## providers_llm.md

      
Last active September 28, 2023 09:43

| Provider | Model | Cost for input | Cost for output | Cost per request |
|---|---|---|---|---|
| OpenAI | text-davinci-004 | $0.03 / 1K tokens | $0.06 / 1K tokens | 0 |
| OpenAI | text-davinci-003 | $0.02 / 1K tokens | $0.02 / 1K tokens | 0 |
| OpenAI | text-davinci-002 | $0.002 / 1K tokens | $0.002 / 1K tokens | 0 |
| OpenAI | gpt-3.5-turbo | $0.002 / 1K tokens | $0.002 / 1K tokens | 0 |


[Cohere](https://cohere.com/pri
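
Per-token prices like those listed above translate into a per-request cost by scaling the input and output token counts separately. The following is a minimal sketch using two rows copied from the table; the function name `request_cost` and the example token counts are hypothetical, and the prices should be checked against the provider's current price list before use.

```python
from decimal import Decimal

# USD per 1K tokens as (input, output), copied from the table in this gist.
PRICES_PER_1K = {
    "text-davinci-003": (Decimal("0.02"), Decimal("0.02")),
    "gpt-3.5-turbo": (Decimal("0.002"), Decimal("0.002")),
}


def request_cost(
    model: str,
    input_tokens: int,
    output_tokens: int,
    per_request: Decimal = Decimal("0"),
) -> Decimal:
    """Cost in USD of one request: token charges plus any flat per-request fee."""
    price_in, price_out = PRICES_PER_1K[model]
    token_cost = (price_in * input_tokens + price_out * output_tokens) / 1000
    return token_cost + per_request


if __name__ == "__main__":
    # Example: a 1,000-token prompt with a 500-token completion.
    print(request_cost("gpt-3.5-turbo", 1000, 500))
```

`Decimal` is used instead of floats so that small per-token prices add up without rounding noise.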


## alex_key_feature.md

      
Created September 28, 2023 09:34

API access solution - third-party model.
On-premise solution - open-source model.


R&D development
The low initial cost, both in terms of time and money, allows us to quickly reach a Minimum Viable Product (MVP). The procedure for model parameter optimization and MLOps is overseen by a third-party e


## tab.md

      
Created July 13, 2023 15:56

    | | start | length | label | text

  
## pipeline.md

      
Created July 12, 2023 13:15

| Task | Model version | Comments |
|---|---|---|
| Voice Activity Detection | Multilingual MarbleNet | Other versions exist, trained on telephone conversations or on English data only. |
| Speaker Embeddings | TitaNet Large | A smaller version of the model exists. |
| Multiscale Clustering | Diarization MSDD Telephonic | Specifically trained on telephone conversations, which makes it suitable for similar use cases. |


## conclu.md

      
Created July 12, 2023 13:12

| Model | Parameter Name | Value |
|---|---|---|
| General | Input sample rate (Hz) | 16 000 |
| | Batch size | 16 |
| VAD | Window length (s) | 0.8 |
| | Shift length (s) | 0.04 |
| | Pad onset (s) | 0.1 |
| | Pad offset (s) | -0.05 |
| Speaker embedding | Window length (s) | [1.5,1.25,1.0,0.75,0.5] |
| | Shift length (s) | [0.75,0.625,0.5,0.375,0.25] |
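
The speaker-embedding window and shift lengths above define five scales, each with a shift equal to half its window. To illustrate what that means for a given audio clip, the sketch below lists the full windows that fit at each scale. It is a simplified illustration, not the diarization toolkit's own segmentation: the function name `segment_bounds` is hypothetical, values are assumed to be in seconds, and any shorter trailing segment is simply dropped here.

```python
from typing import List, Tuple

# Multiscale settings copied from the parameter table (assumed to be seconds).
WINDOW_LENGTHS_S = [1.5, 1.25, 1.0, 0.75, 0.5]
SHIFT_LENGTHS_S = [0.75, 0.625, 0.5, 0.375, 0.25]


def segment_bounds(
    duration_s: float, window_s: float, shift_s: float
) -> List[Tuple[float, float]]:
    """Return (start, end) pairs, in seconds, of every full window that fits."""
    # Work in integer milliseconds to avoid floating-point drift.
    dur, win, hop = (round(x * 1000) for x in (duration_s, window_s, shift_s))
    if win <= 0 or hop <= 0:
        raise ValueError("window and shift must be positive")
    starts = range(0, dur - win + 1, hop)  # empty when the clip is shorter than a window
    return [(s / 1000, (s + win) / 1000) for s in starts]


if __name__ == "__main__":
    for win, hop in zip(WINDOW_LENGTHS_S, SHIFT_LENGTHS_S):
        n = len(segment_bounds(3.0, win, hop))
        print(f"window={win}s shift={hop}s -> {n} segments")
```

Shorter windows yield more segments and finer temporal resolution, while longer windows give each embedding more speech to work with; combining the scales is what the multiscale approach trades on.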