@hunj
Created June 25, 2015 00:59
Strip the domain from each URL in a sitemap XML file, leaving only the path.
def path_strip(input_file, domain, output_file)
  raise "domain must be string form" unless domain.is_a? String
  raise "invalid input file name" unless input_file.is_a? String
  raise "invalid output file name" unless output_file.is_a? String

  file = File.open(input_file, "r")
  data = file.read
  file.close
  data_lines = data.lines

  cleared_arr = []
  result_file = File.open(output_file, "w")
  num = 0

  data_lines.each do |line|
    if line =~ /<loc>http:\/\/#{Regexp.quote(domain)}\/.*<\/loc>/
      num += 1
      result_file.puts "link_#{num},#{line[5..-8].sub("http://#{domain}/", '')}"
    end
  end
  result_file.close
  p num
end

# example:
path_strip "./sitemap.xml", "hunj.github.io", "./result.csv"
@imjching

Calling File.open twice is alright.

You can refactor lines 6 to 9 to this:

data_lines = File.readlines(input_file)
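
To see that the two forms are equivalent, here is a small sketch (the file name `sample.txt` is only a placeholder I made up for the demonstration). `File.readlines` returns the same array of lines as reading the whole file and calling `lines` on it, newlines included, with no explicit open and close:

```ruby
# Placeholder file, created only for this demonstration.
File.write("sample.txt", "first\nsecond\n")

# The long way: open, read everything, split into lines.
manual = File.open("sample.txt", "r") { |f| f.read }.lines

# The short way: one call that opens, reads, and closes.
short = File.readlines("sample.txt")

p manual == short # both are ["first\n", "second\n"]

File.delete("sample.txt")
```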

You could also avoid creating unnecessary variables with something like this (a shorter version):

def path_strip(input_file, domain, output_file)
  raise "domain must be string form" unless domain.is_a? String
  raise "invalid input file name" unless input_file.is_a? String
  raise "invalid output file name" unless output_file.is_a? String

  data = File.read(input_file).scan(/<loc>http:\/\/#{Regexp.quote(domain)}\/.*<\/loc>/)

  data.map!.with_index do |x, index|
    # scan results have no trailing newline, so the closing "</loc>" is the last 6 characters
    "link_#{index + 1},#{x[5..-7].sub("http://#{domain}/", '')}"
  end

  File.write(output_file, data.join("\n"))
  p data.length
end

# example:
path_strip "./sitemap.xml", "hunj.github.io", "./result.csv"

By the way, in your code (L12, L21) you open the output File, write the processed lines to it one by one, then close the File instance. That approach is only worth it if you have a lot of lines (and you think they would take up a lot of memory).
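
If memory really is a concern, a streaming version might look roughly like the sketch below. It assumes the same sitemap layout as above (at most one `<loc>` element per line). The name `path_strip_streaming` and the capture group are my own for illustration and are not part of the original gist. `File.foreach` yields one line at a time, and the block form of `File.open` closes the output file automatically:

```ruby
# Hypothetical streaming variant: only one input line is held in memory at a time.
def path_strip_streaming(input_file, domain, output_file)
  # Capture everything between "http://<domain>/" and "</loc>".
  pattern = %r{<loc>http://#{Regexp.quote(domain)}/(.*)</loc>}
  num = 0

  # The block form of File.open closes the file when the block exits.
  File.open(output_file, "w") do |out|
    File.foreach(input_file) do |line|
      match = pattern.match(line)
      next unless match

      num += 1
      out.puts "link_#{num},#{match[1]}"
    end
  end

  num
end
```

Using a capture group instead of fixed string offsets also means indented `<loc>` lines are handled, which the `line[5..-8]` slice does not do.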

An alternative would be to store the processed lines in a String (by concatenation) or an Array, and write them to output_file in one go. This is usually faster, but it needs enough memory to hold all of the data.

Cheers,
Jay
