Dan Nguyen dannguyen

## README.openai-structured-output-demo.md

      
              9 files
            
          
              11 forks
            
          
              3 comments
            
          
              123 stars
            
          
                dannguyen
                / README.openai-structured-output-demo.md
            
            
              Last active
              October 29, 2024 23:25
            
              
                A basic test of OpenAI's Structured Output feature against financial disclosure reports and a newspaper's police blotter. Code examples use the Python SDK and pydantic for the schema definition.
              
          
    Extracting financial disclosure reports and police blotter narratives using OpenAI's Structured Output


tl;dr this demo shows how to call OpenAI's gpt-4o-mini model, provide it with URL of a screenshot of a document, and extract data that follows a schema you define. The results are pretty solid even with little effort in defining the data — and no effort doing data prep. OpenAI's API could be a cost-efficient tool for large scale data gathering projects involving public documents.

OpenAI announced Structured Outputs for its API, a feature that allows users to specify the fields and schema of extracted data, and guarantees that the JSON output will follow that specification.
For example, given a Congressional financial disclosure report, with assets defined in a table like this:

  
## skimschema.py
#!/usr/bin/env python3
"""
skimschema.py
==============

Create an excel file of transposed data rows, for easy browsing of
a data file's contents (csvs only for now)


Longer description

## bq-sfpd-query.sql
SELECT
    unique_key
    , pddistrict AS pd_district
    , DATE(timestamp) AS incident_date
    , category
    , descript AS description
    , dayofweek AS day_of_week
    , resolution
    , UPPER(address) AS address
    , longitude

## fetch_ghstars.md

      
              4 files
            
          
              0 forks
            
          
              0 comments
            
          
              13 stars
            
          
                dannguyen
                / fetch_ghstars.md
            
            
              Last active
              August 17, 2024 04:14
            
              
                fetch_ghstars.py: quick CLI script to fetch from Github API all of a user's starred repos and save it as raw JSON and wrangled CSV
              
          
    fetch_ghstars.py: quick CLI script to fetch and collate  from Github API all of a user's starred repos


Requires Python 3.6+
Creates a subdir 'ghstars-USERNAME' at the current working directory
the raw JSON of each page request is saved as: 01.json, 02.json 0n.json
A flattened, filtered CSV is also created: wrangled.csv

Example usage:

  
## aws-transcribe-2020-10-biden-palin.md

      
              3 files
            
          
              1 fork
            
          
              1 comment
            
          
              0 stars
            
          
                dannguyen
                / aws-transcribe-2020-10-biden-palin.md
            
            
              Last active
              February 10, 2021 01:29
            
              
                i only created this gist to respond to someone responding to my older aws-transcribe-via-cli gist
              
          
    Amazon Transcribe (real-time) streaming sample, with speakers identified (2020-10-09)

Note: This gist refers this older gist that shows the AWS transcribe API:
https://gist.github.com/dannguyen/9b8c51f5bb853209f19f1a0f18f0f74c
I went into the AWS console for Transcription, which has an interface for real-time transcription here:
https://console.aws.amazon.com/transcribe/home?region=us-east-1#realTimeTranscription
Then I used my phone to play out this snippet of the 2008 VP presidential debate, featuring speech from Biden and Palin:
https://twitter.com/dancow/status/1313951588428517385

  
## csvflatten-hamlet-tables.md

      
              1 file
            
          
              0 forks
            
          
              0 comments
            
          
              0 stars
            
          
                dannguyen
                / csvflatten-hamlet-tables.md
            
            
              Created
              October 4, 2020 19:46
            
          
fieldname
value


act
1


scene
5


speaker
Horatio


lines
Propose the oath, my lord.


~~~~~~~~~


act
1


scene
5


speaker
Hamlet


## README-xsv-split-windows.md

      
              13 files
            
          
              0 forks
            
          
              0 comments
            
          
              2 stars
            
          
                dannguyen
                / README-xsv-split-windows.md
            
            
              Last active
              August 27, 2020 07:00
            
              
                How to install and use xsv to split a large CSV file (Windows)
              
          
    How to use xsv (in Windows) to split up a CSV file too big for Excel

I wrote these instructions on how to install and use xsv – a powerful CSV-handling command-line tool, because someone asked how to deal with a data file that was too big to open in Excel or even Notepad. I didn't know how familiar the person was with installing/running downloadable .exe files or with Powershell, so I've tried to include some general instructions that hopefully are useful to even novices.
This mini-guide is not at all meant to be exhaustive as it basically shows just one of xsv's many useful functions. But if you're new to the idea of using command-line tools to do things, hopefully this can be a friendly intro to it.

Here's an example of a CSV that, at 3 million rows, is too big for Excel to open: https://burntsushi.net/stuff/worldcitiespop.csv

  
## bash-prompt.md

      
              1 file
            
          
              0 forks
            
          
              1 comment
            
          
              0 stars
            
          
                dannguyen
                / bash-prompt.md
            
            
              Last active
              August 19, 2020 00:05
            
              
                my bash prompt with a ghost and stuff
              
          
    this goes in my bash profile:
XRESET='\[\033[00m\]'
PROMPT_PATH="\[\033[0;33m\]\W${XRESET} \[\033[1;37m\]\$${XRESET}"
PROMPT_GHOST="༼ つ\[\033[1;33m\]°${XRESET}\[\033[1;31m\]︻\[\033[1;33m\]゜${XRESET}༽つ🐕"

export PS1="${PROMPT_GHOST} ${PROMPT_PATH} "

  
## normalize-ascii-google-sheet-README.md

      
              3 files
            
          
              1 fork
            
          
              0 comments
            
          
              0 stars
            
          
                dannguyen
                / normalize-ascii-google-sheet-README.md
            
            
              Last active
              August 25, 2020 22:17
            
              
                A modified Google App Script hack to normalize Vietnamese characters into ASCII
              
          
    Example Google Apps Script functions to normalize non-ASCII characters and insert a timestamp, when a new row is created.
Basically I sloppily added various Vietnamese accented characters to this Gist example: https://gist.github.com/akora/51b2933a2554776d7144#gistcomment-2936646
Blogpost about Apps Script in general, onEdit and timestamps here: http://blog.danwin.com/how-to-automatically-timestamp-a-new-row-in-google-sheets-using-apps-script/


## DANS SECRET STUFF.md

      
              1 file
            
          
              0 forks
            
          
              0 comments
            
          
              0 stars
            
          
                dannguyen
                / DANS SECRET STUFF.md
            
            
              Created
              July 16, 2020 22:16
            
              
                DANS SECRET STUFF
              
          
    test test test
	#!/usr/bin/env python3
	"""
	skimschema.py
	==============

	Create an excel file of transposed data rows, for easy browsing of
	a data file's contents (csvs only for now)


	Longer description
	SELECT
	unique_key
	, pddistrict AS pd_district
	, DATE(timestamp) AS incident_date
	, category
	, descript AS description
	, dayofweek AS day_of_week
	, resolution
	, UPPER(address) AS address
	, longitude
fieldname	value
act	1
scene	5
speaker	Horatio
lines	Propose the oath, my lord.
~~~~~~~~~
act	1
scene	5
speaker	Hamlet