Yakov Shafranovich yakovsh

@yakovsh
yakovsh / 2006_06_01-fixing_utf8_with_regex.md
Last active Jan 17, 2016
Fixing Malformed UTF-8 via Regex

I have been struggling with a weird problem on one of my sites that prevented the site from functioning. One of the XML files used by the site is supposed to arrive in UTF-8, but unfortunately it contained some extra characters that were not encoded properly. After looking at this site [http://perl-xml.sourceforge.net/faq/#encoding_conversion], I came up with a short regular expression of my own that converts any malformed UTF-8 characters to XML/HTML numbered entities:

s/([\x80-\xFF])/'&#' . ord($1) . ';'/gse;
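
To see what a substitution of this kind does, here is a minimal, self-contained sketch. It assumes the intent is to replace every byte outside the ASCII range with a decimal numeric character reference; the sub name `escape_high_bytes` and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every byte in the range 0x80-0xFF with a
# decimal numeric character reference built from its byte value.
sub escape_high_bytes {
    my ($text) = @_;
    $text =~ s/([\x80-\xFF])/'&#' . ord($1) . ';'/ge;
    return $text;
}

# Sample input: "caf" followed by a stray Latin-1 e-acute byte (0xE9).
my $fixed = escape_high_bytes("caf\xE9 <ok>");
print "$fixed\n";   # prints: caf&#233; <ok>
```

Note that on a raw byte string this also rewrites the individual bytes of valid multi-byte UTF-8 sequences, so it is best suited to input that is mostly ASCII with a few stray Latin-1 bytes.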

On a related note, another issue that came up a while back is the use of an ampersand without encoding it as "&amp;". Here is another regex to solve that issue (I don't remember the site I got it from):

s/&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)/&amp;/g;
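
As a quick check of how the ampersand substitution behaves, the following sketch runs it over a string that mixes bare ampersands with ampersands that already start an entity. The sub name `escape_bare_ampersands` and the sample input are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: escape only those ampersands that are not already
# the start of a named or numeric entity (something of the form "&...;").
sub escape_bare_ampersands {
    my ($text) = @_;
    $text =~ s/&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)/&amp;/g;
    return $text;
}

print escape_bare_ampersands('AT&T &amp; B&#38;O &copy; x&y'), "\n";
# prints: AT&amp;T &amp; B&#38;O &copy; x&amp;y
```

The negative lookahead accepts any run of up to eight word characters followed by a semicolon as an entity, so it does not check that a named entity is actually defined.
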
@yakovsh
yakovsh / 2008_10_26-fix_utf8_input.pl
Last active Jan 17, 2016
Fixing "Input is not proper UTF-8, indicate encoding" Error
# Quick way to fix the following error in Perl:
#
# :1: parser error : Input is not proper UTF-8, indicate encoding !
# Bytes: 0xA0 0x20 0xA0 0x3C
#
# Use this command:
#
use Encode;
$string1 = decode("UTF-8", $input);
@yakovsh
yakovsh / 2007_01_16-postgres.vb
Last active Jan 17, 2016
Using PostgreSQL on Windows with ADO and VB
' The problem with PostgreSQL is lack of documentation for Windows interfaces. Visual Basic uses
' the ADO library to connect to the PostgreSQL ODBC driver, which in turn connects to the server.
'
' This example covers a unique requirement - the network has over 300 individual desktop machines,
' all of which must be able to access the planned PostgreSQL server via Access, VBA or VB6.
' However, they do not want to go and set up a data source name (DSN) on each machine separately
' (installing ODBC is easier via the Windows deployment tools). Unfortunately, the ODBC driver has
' absolutely zero documentation as to how to set up an ADO connection WITHOUT a DSN. After some prolonged
' tries and failures, we both were finally able to come up with a solution which I am posting here for
' others to benefit from.
@yakovsh
yakovsh / 2009_02_09-cleanup2.pl
Last active Jan 17, 2016
Cleaning Up Bad HTML in Perl, Take 2
# Here is another way to clean up bad HTML with Perl, and convert it to XML:
# This approach relies on the HTML::DOMbo module to do the actual conversion
# between HTML and XML, and HTML::TreeBuilder for parsing.
use HTML::DOMbo;
use HTML::TreeBuilder;
use XML::LibXML;
$html_code = '';
@yakovsh
yakovsh / 2008_10_24-cleanup.pl
Last active Jan 17, 2016
Cleaning Up Bad HTML in Perl
#
# Here is a short way to clean up bad HTML input and convert it to XML with Perl:
#
use HTML::TreeBuilder;
use XML::LibXML;
$html_code = '';
my $builder = HTML::TreeBuilder->new();
@yakovsh
yakovsh / 2009_05_06-delete_s3_bucket.pl
Last active Jan 17, 2016
Deleting Amazon S3 Bucket with A Lot of Files
#!/usr/bin/perl
#
# Here is a short script that can mass delete files in an Amazon S3 bucket. It is limited to 1,000 keys at a time.
#
use Net::Amazon::S3;
my $s3 = Net::Amazon::S3->new({
aws_access_key_id => 'ACCESS_ID',
aws_secret_access_key => 'ACCESS_KEY',
@yakovsh
yakovsh / 2009_02_11-xml2json.pl
Last active Jan 17, 2016
Converting JSON to XML with Perl
# Recently I had to work with Google AJAX API data, which is returned as JSON. For my purposes, the data needed to be in XML.
# While there is a CPAN module called XML2JSON which is designed to do that, for some reason it chokes on my input.
# Instead, I adapted a much simpler technique from the Google::Data::JSON module as follows.
use JSON::Any;
use XML::Simple;
my $convertor = JSON::Any->new();
my $data = $convertor->decode($json);
my $xml = XMLout($data);
@yakovsh
yakovsh / 2008_12_28-unicode_in_s3.pl
Last active Jan 17, 2016
Handling Unicode Data in Amazon S3 Headers
# During a recent project, I ran into an issue when handling Unicode data in metadata headers in Amazon S3.
# Apparently, Amazon adds "?UTF-8?B?" in front of any Unicode data and "?=" at the end of the data.
# I could not find any existing standard that describes this or why it is done, but I surmise this probably
# has to do with Base-64 encoding and how it handles Unicode.
#
# As per @rawnsley:
# apparently this is because HTTP headers must only be encoded in ASCII: http://stackoverflow.com/a/4410331/671393
#
# An easy Perl hack to get around this is as follows (assuming you are using the MIME::Base64 module):
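
The wrapper described above matches the "encoded-word" syntax of RFC 2047 (`=?charset?encoding?text?=`). A minimal sketch of unwrapping such a value with MIME::Base64 and Encode might look like the following; it is not the original snippet, and the sub name `decode_s3_header` and the sample value are hypothetical. The leading `=` is treated as optional to match the prefix quoted above.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);
use Encode qw(decode);

# Hypothetical helper: strip the "?UTF-8?B?" ... "?=" wrapper from a header
# value, Base64-decode the payload, and decode the bytes as UTF-8.
# Values without the wrapper are returned unchanged.
sub decode_s3_header {
    my ($value) = @_;
    if ($value =~ /^=?\?UTF-8\?B\?(.*?)\?=$/i) {
        return decode('UTF-8', decode_base64($1));
    }
    return $value;
}

# "aMOpbGxv" is the Base64 form of the UTF-8 bytes for "h", e-acute, "llo".
my $text = decode_s3_header('=?UTF-8?B?aMOpbGxv?=');
binmode(STDOUT, ':encoding(UTF-8)');
print "$text\n";
```

Returning plain values untouched means the same call can be applied to every metadata header, whether or not it was encoded.
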
@yakovsh
yakovsh / 2004_08_30-xml_in_html.xsl
Last active Jan 17, 2016
Display XML in HTML files (XSLT)
<!--
While working with XSLT templates, I came across an interesting problem. I am using an XSLT template
to transform an XML file into HTML. However, for debugging purposes I need to see the original XML
and since the generation process is done on a web server (like Resin does), it is not easy to get it.
The solution: display the original XML file inside the output HTML itself. As it turns out, this was
not easy since it requires changing all "<" and ">" to use entities like "&lt;" and "&gt;". In XSLT,
the solution looks as follows (another solution would be to use JavaScript to escape this client-side)
-->
<xsl:template match="*">
@yakovsh
yakovsh / 2005_01_13-visited-links.css
Last active Jan 17, 2016
Making Visited Links Look the Same as Unvisited Links