@iurisilvio
Last active August 29, 2015 14:15

Usage: python parser.py --input /your/input/directory --output /your/output/directory --sleep 5 --step 500

Output example:

---
layout: 'single-product'
categories: 'SaksFifthAvenue-UK Kids Toys-and-Books'
merchantName: 'Saks Fifth Avenue - UK'
manufacturer_name: 'Janod'
sku_number: '0405148128850'
product_id: '110013244684007288214612306030'
name: 'Barbecue Trolley'
primary: 'Kids'
secondary: 'Toys and Books'
product: 'http://click.linksynergy.com/link?id=v3EaLjWOvJQ&offerid=268285.110013244684007288214612306030&type=15&murl=http%3A%2F%2Fwww.saksfifthavenue.com%2Fmain%2FProductDetail.jsp%3FFOLDER%253C%253Efolder_id%3D2534374306439561%26PRODUCT%253C%253Eprd_id%3D845524446623895'
productImage: 'http://image.s5a.com/is/image/saks/0405148128850_396x528.jpg'
short: 'Your budding chef will cook up a storm on this rolling barbecue trolley, complete with one magnetic spatula, one magnetic barbecue fork, one piece of pork, two sausages, one fish, three tomatoes and one piece of beef.;Wheeled bottom;12.8" X 12.8" X 17.3";Recommended for ages 18 months and up;Assembly required;Wood;Wipe clean;Imported'
long: 'Your budding chef will cook up a storm on this rolling barbecue trolley, complete with one magnetic spatula, one magnetic barbecue fork, one piece of pork, two sausages, one fish, three tomatoes and one piece of beef.;Wheeled bottom;12.8" X 12.8" X 17.3";Recommended for ages 18 months and up;Assembly required;Wood;Wipe clean;Imported'
currency: 'GBP'
type: 'amount'
sale: '65.91'
retail: '65.91'
brand: 'Janod'
information: '5 - 14 business days'
availability: 'in stock'
keywords: 'Janod'
pixel: 'http://ad.linksynergy.com/fs-bin/show?id=v3EaLjWOvJQ&bids=268285.110013244684007288214612306030&type=15&subid=0'
class_id: '60'
Misc: 'No'
Age: 'Adult'
---
from __future__ import unicode_literals

import os
import re
import time
import unicodedata
from collections import OrderedDict
from datetime import date
from optparse import OptionParser
from xml.dom.pulldom import START_ELEMENT, parse


def slugify(value):
    """
    Normalizes string to ASCII and removes every character that is not
    alphanumeric, an underscore, whitespace, or a hyphen.
    """
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')
    value = unicode(re.sub('[^\w\s-]', '', value).strip())
    return value


def traverse(node, properties):
    # Text nodes have no attributes, so guard against None.
    if node.attributes is not None:
        for k, v in node.attributes.items():
            properties[k] = v
    data = []
    for child in node.childNodes:
        if hasattr(child, "data"):
            # Text content: keep it unless an attribute already used this name.
            if node.tagName not in properties:
                data.append(child.data)
        else:
            traverse(child, properties)
    if data:
        properties[node.tagName] = "".join(data)


def parse_xml(filename):
    doc = parse(filename)
    merchant_name = None
    for event, node in doc:
        if event == START_ELEMENT and node.localName == "merchantName":
            doc.expandNode(node)
            merchant_name = node.childNodes[0].toxml()
        if event == START_ELEMENT and node.localName == "product":
            properties = OrderedDict()
            doc.expandNode(node)
            traverse(node, properties)
            # Fall back to the merchant name when no manufacturer is given.
            if "manufacturer_name" not in properties:
                properties["manufacturer_name"] = merchant_name
            categories = [merchant_name.replace(" ", "")]
            primary = properties.get("primary")
            if primary:
                categories.append(primary.replace(" ", "-"))
            secondary = properties.get("secondary")
            if secondary:
                categories.append(secondary.replace(" ", "-"))
            categories = [slugify(c) for c in categories]
            yield merchant_name, properties["name"], categories, properties.items()


def main(input_dir, output_dir, step=None, sleep=None, quiet=False):
    def _log(message):
        if not quiet:
            print message

    def _w(f, key, *values):
        # Write "key: 'v1 v2 ...'" with single quotes stripped from values.
        f.write("%s: '" % key)
        for n, v in enumerate(values):
            if n > 0:
                f.write(" ")
            f.write(v.replace("'", "").encode("utf-8"))
        f.write("'\n")

    if not os.path.isdir(output_dir):
        os.makedirs(output_dir)
    for filename in os.listdir(input_dir):
        if not filename.endswith(".xml"):
            continue
        _log("Parsing %s..." % filename)
        n = 0
        for merchant, product, categories, attributes in parse_xml(os.path.join(input_dir, filename)):
            output_filename = os.path.join(output_dir,
                "%s-%s.markdown" % (date.today().strftime("%Y-%m-%d"), slugify(product)))
            if os.path.exists(output_filename):
                # skip if file already exists
                continue
            n += 1
            with open(output_filename, "w") as f:
                f.write("---\n")
                _w(f, "layout", "single-product")
                _w(f, "categories", *categories)
                _w(f, "merchantName", merchant)
                for k, v in attributes:
                    _w(f, k, v)
                f.write("---\n")
            # Report progress and optionally pause every `step` products.
            if step and n % step == 0:
                if not quiet:
                    _log("%d products generated" % n)
                if sleep:
                    time.sleep(sleep)
    _log("OK")


if __name__ == "__main__":
    parser = OptionParser()
    parser.add_option("-i", "--input", dest="input")
    parser.add_option("-o", "--output", dest="output")
    parser.add_option("--sleep", dest="sleep", type="int")
    parser.add_option("--step", dest="step", type="int", default=100)
    parser.add_option("-q", "--quiet", dest="quiet", action="store_true", default=False)
    options, args = parser.parse_args()
    main(options.input, options.output, options.step, options.sleep, options.quiet)

@iurisilvio
Author

I misspelled manufacturerName. I fixed it to check the manufacturer_name attribute.

I added a check to skip duplicate titles based on filename.

You just need the default Python on Ubuntu 14.04 (Python 2.7). You don't need anything extra.
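
The duplicate check described above can be sketched roughly as follows. This is a minimal illustration, not the script itself: `output_path` and `write_once` are hypothetical helper names, and the file contents are placeholders. Because the date is part of the filename, only products with the same name generated on the same day collide.

```python
import os
from datetime import date


def output_path(output_dir, slug, day):
    # Same naming scheme as the script: YYYY-MM-DD-<slug>.markdown
    name = "%s-%s.markdown" % (day.strftime("%Y-%m-%d"), slug)
    return os.path.join(output_dir, name)


def write_once(path, text):
    """Write text to path unless the file already exists.

    Returns True if the file was written, False if it was skipped.
    """
    if os.path.exists(path):
        return False
    with open(path, "w") as f:
        f.write(text)
    return True
```

Calling `write_once` twice with the same path writes the first time and skips the second, so the first product with a given title wins.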

@iurisilvio
Author

Your stack trace is caused by an encoding issue. I'll fix it soon.

@iurisilvio
Author

Your issue was fixed. I didn't catch this problem before because I tested with only a subset of your demo file.

I was able to generate all files from your demo.xml.
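
The stack trace itself is not shown in this thread, so the following is only a guess at the class of problem: text containing non-ASCII characters being written to a byte-oriented file without an explicit encoding. A minimal sketch, assuming a hypothetical `write_value` helper and a made-up product name, encodes each line as UTF-8 before writing:

```python
def write_value(f, key, value):
    # Build the line as text, then encode explicitly so non-ASCII
    # characters do not depend on the default (ASCII) codec.
    line = u"%s: '%s'\n" % (key, value)
    f.write(line.encode("utf-8"))
```

Here `f` is assumed to be a file opened in binary mode (or any object whose `write` accepts bytes).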

@jcyin1

jcyin1 commented Feb 21, 2015

I tested this out and it doesn't seem to get all of the info in each row. For each row I need something like this:

/---
layout: single-product
merchantName: Saks Fifth Avenue
product_id: 110004345291240158904534387623
name: Bailey Weekender Bag
sku_number: 0415766562790
manufacturer_name: Storksak
part_number: 1689949378855549
primary: Kids
secondary: Baby (0 24 Months)~~Baby Bags
product: http://click.linksynergy.com/link?id=v3EaLjWOvJQ&offerid=268285.110004345291240158904534387623&type=15&murl=http%3A%2F%2Fwww.saksfifthavenue.com%2Fmain%2FProductDetail.jsp%3FFOLDER%253C%253Efolder_id%3D2534374306418231%26PRODUCT%253C%253Eprd_id%3D845524446685088
productImage: http://image.s5a.com/is/image/saks/0415766562790_396x528.jpg
short: A roomy, versatile tote in durable cotton canvas outfitted with all the necessities for moms on the go.;Double top handles, 8" drop;Adjustable shoulder strap, 16"-25" drop;Top zip closure;Two outside magnetic pockets;Two inside pockets;Three elasticized compartments;Changing pad;Wipe-clean canvas;11"W X 17.5"H X 6"D;Imported
long: A roomy, versatile tote in durable cotton canvas outfitted with all the necessities for moms on the go.;Double top handles, 8" drop;Adjustable shoulder strap, 16"-25" drop;Top zip closure;Two outside magnetic pockets;Two inside pockets;Three elasticized compartments;Changing pad;Wipe-clean canvas;11"W X 17.5"H X 6"D;Imported
currency: GBP
type: amount
sale: 164.77
retail: 164.77
brand: Storksak
information: 5 - 14 business days
availability: available
keywords: Storksak
pixel: http://ad.linksynergy.com/fs-bin/show?id=v3EaLjWOvJQ&bids=268285.110004345291240158904534387623&type=15&subid=0
class_ids: 60
misc: No
color: BLUE, CORAL
age: Kids
/---

btw I need the triple dash - - - on top and below

@jcyin1

jcyin1 commented Feb 21, 2015

Can you please add the triple dash to the top and bottom of the outputted file?

/---

/---

Without the forward slash please.

Thanks

@iurisilvio
Author

Done!

@jcyin1

jcyin1 commented Feb 21, 2015

Hey btw, here is an xml file without the manufacturer_name. I tested the script against this file and it somehow broke with the following error:

Traceback (most recent call last):
  File "parser.py", line 83, in <module>
    main(options.input, options.output)
  File "parser.py", line 59, in main
    for merchant, product, attributes in parse_xml(os.path.join(input_dir, filename)):
  File "parser.py", line 46, in parse_xml
    properties.append(("manufacturer_name", merchant_name))
AttributeError: 'OrderedDict' object has no attribute 'append'

Here is the new xml link:
https://drive.google.com/file/d/0B6h9HPRdfghjTElBM3I0bnN1VGc/view?usp=sharing

@iurisilvio
Author

I already fixed this problem. Get the latest version from the gist here. :)
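
For reference, the traceback comes from calling a list method on a dictionary. A minimal sketch of the difference, using values taken from the examples in this thread:

```python
from collections import OrderedDict

properties = OrderedDict()
properties["name"] = "Bailey Weekender Bag"

# A list of pairs supports append(), but an OrderedDict does not.
try:
    properties.append(("manufacturer_name", "Saks Fifth Avenue - UK"))
except AttributeError as exc:
    error = str(exc)

# Assigning by key works and preserves insertion order.
if "manufacturer_name" not in properties:
    properties["manufacturer_name"] = "Saks Fifth Avenue - UK"
```

After this runs, `properties` holds `name` followed by `manufacturer_name`, which is the fallback behaviour the current script uses when the attribute is missing.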

@iurisilvio
Author

I ran your new XML and it works great.

@jcyin1

jcyin1 commented Feb 21, 2015

Sorry for another issue, I forgot to ask you, can you please include this in the elements list:

'categories: the_value_from_merchantName_attribute'

below 'layout: single-product'

Also, is it possible to have the script keep going even if an attribute is missing? I'm worried that if an XML file later on has a few missing elements, the script will stop generating the subsequent files.

e.g: if 'secondary' is missing, just proceed with the other attributes and generate the file anyway.

Thanks

@iurisilvio
Author

The only required field in my code is "name". I will output all existing XML attributes and properties.

I added the categories line: categories: Saks Fifth Avenue - UK

I'm not sure if it is what you expected. Is this the same value as in the merchantName line?
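
The distinction between the required "name" field and optional fields can be sketched like this. `pick_fields` is a hypothetical helper for illustration only; the script does the equivalent inline with `properties["name"]` and `properties.get(...)`.

```python
def pick_fields(properties, optional=("primary", "secondary")):
    # Required: indexing raises KeyError if "name" is missing.
    name = properties["name"]
    # Optional: .get() returns None for missing keys, so they are skipped.
    extras = [properties[k] for k in optional if properties.get(k)]
    return name, extras
```

A product without "secondary" still produces output; only a product without "name" raises an error.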

@jcyin1

jcyin1 commented Feb 21, 2015

Yep, that's what I need. But I ran into an issue where each attribute value needs opening and closing single quotes (' ').

So e.g: categories: 'Saks Fifth Avenue - UK' like this.

Can you please add this? After this I'll test the code again and I think it will be done soon :)

Cheers

@iurisilvio
Author

Is it for all values or only for the categories attribute? I added quotes to the categories attribute, but it is easy to do for all values if you want.

@jcyin1

jcyin1 commented Feb 21, 2015

It's for every single value, thanks a lot.

@iurisilvio
Author

Done! I added a backslash to escape single quotes in values, but I don't know if it is the right way to do it. Your system probably has its own way to escape single quotes.

@jcyin1

jcyin1 commented Feb 21, 2015

I've been testing and I think any backslash or anything else will just break. Can you just strip away any single quotes inside the values? I don't really mind if they're not there anymore.

Thanks

@iurisilvio
Author

That is easy... Done!
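
A minimal sketch of the quoting rule agreed on above: every value is wrapped in single quotes, and any single quotes inside a value are dropped rather than escaped. `front_matter_line` is a hypothetical name, and it returns the line as a string instead of writing it to a file as the script does.

```python
def front_matter_line(key, *values):
    # Strip single quotes from each value, then join multiple values
    # with a space inside one pair of quotes.
    cleaned = " ".join(v.replace("'", "") for v in values)
    return "%s: '%s'\n" % (key, cleaned)
```

For example, a made-up name such as `Kid's Trolley` would be written as `name: 'Kids Trolley'`.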

@jcyin1

jcyin1 commented Feb 21, 2015

Just one final thing, can you please remove any spaces in the value for 'categories'?

Thanks

@jcyin1

jcyin1 commented Feb 21, 2015

Okay, I really don't get it. I tried the same script with another XML file with the same setup but different values:

https://drive.google.com/file/d/0B6h9HPRdfghjNFZvTlFQWHFiM0E/view?usp=sharing

Why do I get this error?

Traceback (most recent call last):
  File "parser.py", line 93, in <module>
    main(options.input, options.output)
  File "parser.py", line 69, in main
    for merchant, product, attributes in parse_xml(os.path.join(input_dir, filename)):
  File "parser.py", line 46, in parse_xml
    traverse(node, properties)
  File "parser.py", line 31, in traverse
    traverse(child, properties)
  File "parser.py", line 31, in traverse
    traverse(child, properties)
  File "parser.py", line 31, in traverse
    traverse(child, properties)
  File "parser.py", line 22, in traverse
    for k, v in node.attributes.items():
AttributeError: 'NoneType' object has no attribute 'items'

@iurisilvio
Author

Done. I fixed your last issue.

@jcyin1

jcyin1 commented Feb 21, 2015

What was the issue related to though? I mean what if I try with another XML file that has the exact same layout but just with different values? Will that break as well?

@iurisilvio
Author

No, it will not break again. It was a bug handling empty data.
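
The traceback shows `node.attributes` being `None`. In Python's DOM implementation, nodes that are not elements (text nodes, for example) report `attributes` as `None`, so a recursive walk has to check before iterating. The sketch below uses `xml.dom.minidom` directly with a made-up XML snippet; `collect_attributes` is a hypothetical, simplified stand-in for the script's `traverse`, and the exact input that triggered the original error is not known here.

```python
from xml.dom.minidom import parseString

doc = parseString(
    "<product name='Barbecue Trolley'><primary>Kids</primary></product>"
)
element = doc.documentElement
text = element.firstChild.firstChild  # the "Kids" text node


def collect_attributes(node, properties):
    # Text nodes have attributes set to None, so guard before iterating.
    if node.attributes is not None:
        for k, v in node.attributes.items():
            properties[k] = v
    for child in node.childNodes:
        collect_attributes(child, properties)
    return properties
```

With the guard in place, walking the whole tree collects only the element attributes and passes over text nodes without raising.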

@jcyin1

jcyin1 commented Feb 21, 2015

Okay cheers, I'll test it a bit more and then I'll get some sleep. It's five in the morning here. After that I'll test with all of our XML files and if it all goes well, I'll send you the bounty and a tip :)

Cheers

@iurisilvio
Author

Thanks! Please send me a message in the Bountify comments or email me at iurisilvio at gmail, because GitHub does not send notifications for comments here. :/

@jcyin1

jcyin1 commented Feb 22, 2015

Hey, I've been testing. Is it possible to append the 'primary' and 'secondary' values after the 'merchantName' value in categories, and replace any spaces with a dash ('-')?

So the result will become like this:

categories: 'SaksFifthAvenue-UK' 'Kids' 'Baby-(0-24-Months)~~Baby-Bags'

Thanks

@iurisilvio
Author

I added primary and secondary with dashes to categories, only if they exist.

@jcyin1

jcyin1 commented Feb 22, 2015

Sorry I made a mistake in the single '' quote placement.

Instead of:
categories: 'SaksFifthAvenue-UK' 'Kids' 'Baby-(0-24-Months)~~Baby-Bags'

It should be:
categories: 'SaksFifthAvenue-UK Kids Baby-(0-24-Months)~~Baby-Bags'

Thanks

@iurisilvio
Author

Fixed.

categories: 'SaksFifthAvenue-UK Kids Toys-and-Books'
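
The way the categories value is assembled can be sketched as below. `build_categories` and `slug` are hypothetical names; `slug` applies the same character filter as the script's `slugify` but omits its ASCII normalization step.

```python
import re


def slug(value):
    # Drop everything except word characters, whitespace, and hyphens.
    return re.sub(r"[^\w\s-]", "", value).strip()


def build_categories(merchant, primary=None, secondary=None):
    # Merchant name: spaces removed. Primary/secondary: spaces -> dashes,
    # and each is added only if present.
    parts = [merchant.replace(" ", "")]
    for value in (primary, secondary):
        if value:
            parts.append(value.replace(" ", "-"))
    return " ".join(slug(p) for p in parts)
```

Note that the character filter also removes punctuation such as parentheses and tildes, so a secondary value like `Baby (0 24 Months)~~Baby Bags` comes out as `Baby-0-24-MonthsBaby-Bags` rather than keeping `(`, `)`, and `~~`.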

@jcyin1

jcyin1 commented Feb 22, 2015

I get

  File "parser.py", line 67
    f.write("'\n")
    ^
SyntaxError: invalid syntax

@iurisilvio
Author

Oops, I pasted the wrong code. :) Fixed now.
