Dan Nguyen dannguyen
Here's a link to the Drug Enforcement Administration Office of Diversion Control's ARCOS Registrant Handbook
$ curl -Lo arcos_all_washpost.tsv.gz \
    https://d2ty8gaf6rmowa.cloudfront.net/dea-pain-pill-database/bulk/arcos_all_washpost.tsv.gz
$ gunzip arcos_all_washpost.tsv.gz
AWS Textract -- sample document image and data from the official demo
AWS Textract is now out of closed beta. You can read the features page here, and you can also read about its limits here (e.g. no handwriting). Basically, if you've ever had to deal with the hell of getting structured data out of a PDF (scanned image or not), Textract is aiming for your business:
This short gist contains some of my brief observations about Textract and its demo, as well as direct links to the most relevant and important files, such as the Textract demo sample image and the resulting data files from Textract's API. If you have an AWS account, I h
[Ignore this gist, check out the GitHub repo] Testing AWS Textract's ability to correctly extract data tables from a difficult FBI stats report PDF
tl;dr: pretty good table structure overall, given the issues with the original PDF. However, there were inexplicable and critical data errors, as if Textract converted the PDF to an image, OCRed it, and then attempted to extract the data tables.
Amazon Textract was announced about 6 months ago but was made public today (May 29). If you have an AWS account, you can check out Textract's point-and-click demo, which allows you to upload an image or PDF for T
|ID,Posted at,Screen name,Text|
|1123212586919419905,2019-04-30 13:08:43 +0000,shinya1720777,"妙にテンションの高いまーちゃんにあおられた|
|1123212591109672961,2019-04-30 13:08:44 +0000,inesteiixeira,RT @lunaaaaa20: acho que a coisa mais linda é acompanhar o crescimento da pessoa que gostamos e contribuir para isso|
|1123212591109689348,2019-04-30 13:08:44 +0000,BTS20520283,#BBMAsTopSocial BTS @BTS_twt kalp|
|1123212591139045381,2019-04-30 13:08:44 +0000,Dipendr80247123,"RT @YL511: #البتكوين|
|📌اش الحكاية :|
Simple scraper of the ASPX search form for U.S. Congress House financial disclosure results
The following script, given someone's last name, prints a CSV of financial disclosure PDFs (the first 20, for simplicity's sake) as found on the House Financial Disclosure Reports. It's meant to be a proof-of-concept of how to scrape ASPX (and other "stateful" websites) using plain old requests -- without too much inconvenience -- rather than resorting to something heavy like the Selenium webdriver.
The search page can be found here: http://clerk.house.gov/public_disc/financial-search.aspx
Here's a screenshot of what it does when you search via web browser:
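A minimal sketch of the general ASPX round trip described above, not the gist's actual script: fetch the search page, copy every hidden input (such as `__VIEWSTATE` and `__EVENTVALIDATION`) into the form payload, add the search term, and POST it back. The visible field names `LastName` and `btnSearch` are hypothetical placeholders; the real names would have to be read from the page's HTML. The `session` argument is assumed to be a `requests.Session`.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs from every <input type="hidden"> tag."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def extract_hidden_fields(html):
    """Return the hidden state fields an ASPX page expects to get back."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields


def build_search_payload(html, last_name):
    """Combine the page's hidden state with the visible search fields."""
    payload = extract_hidden_fields(html)
    # Hypothetical field names: inspect the real form to find the actual ones.
    payload["LastName"] = last_name
    payload["btnSearch"] = "Search"
    return payload


def search(session, url, last_name):
    """GET the form, then POST it back with state intact.

    `session` is assumed to be a requests.Session, so cookies carry over.
    """
    landing = session.get(url)
    landing.raise_for_status()
    response = session.post(url, data=build_search_payload(landing.text, last_name))
    response.raise_for_status()
    return response.text
```

Usage would look something like `search(requests.Session(), "http://clerk.house.gov/public_disc/financial-search.aspx", "Smith")`, followed by parsing the returned HTML for the PDF links.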
Demo of Google text-to-speech Wavenet API on a NYT article
Was curious if Google's text-to-speech API might be good enough for generating audio versions of stories on-the-fly. Google has offered traditional computer voices for a while, but last year made available their premium WaveNet voices, which are trained using audio recorded from human speakers, and are purportedly capable of mimicking natural-sounding inflection and rhythm.
Pretty good...but I honestly can't tell the difference between the standard voice and the WaveNet version, at least when it comes to intonation and inflection. The first 2 grafs of this NYT story, roughly 85 words/560 characters, took less than 2 seconds to process. The result in both cases is a 37-second audio file.
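For anyone wanting to repeat the standard-vs-WaveNet comparison, here is a rough sketch against the Cloud Text-to-Speech REST endpoint (`text:synthesize`), using only the standard library. It is not the code behind this demo. The voice names `en-US-Wavenet-D` and `en-US-Standard-D`, the MP3 encoding, and authenticating with an API key in the query string are all assumptions; check them against your own project's setup.

```python
import base64
import json
import urllib.request

SYNTHESIZE_URL = "https://texttospeech.googleapis.com/v1/text:synthesize"


def build_request_body(text, voice_name="en-US-Wavenet-D", language_code="en-US"):
    """Build the JSON body for one synthesis request.

    The default voice name is an assumption; swap in e.g. "en-US-Standard-D"
    to compare a standard voice against a WaveNet one.
    """
    return {
        "input": {"text": text},
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": {"audioEncoding": "MP3"},
    }


def decode_audio(response_json):
    """The API returns the audio as base64 text in `audioContent`."""
    return base64.b64decode(response_json["audioContent"])


def synthesize(text, api_key, voice_name="en-US-Wavenet-D"):
    """POST the text and return raw MP3 bytes (assumes API-key auth is enabled)."""
    request = urllib.request.Request(
        f"{SYNTHESIZE_URL}?key={api_key}",
        data=json.dumps(build_request_body(text, voice_name)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return decode_audio(json.load(response))
```

Calling `synthesize()` twice with the same text, once per voice name, and writing each result to its own `.mp3` file gives the two clips to compare side by side.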
- The M