Skip to content

Instantly share code, notes, and snippets.

@malcolmsparks
Created March 29, 2020 18:01
Show Gist options
  • Star 2 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save malcolmsparks/6a9aab2465d2d7ea841cb9187a6d17fd to your computer and use it in GitHub Desktop.
Save malcolmsparks/6a9aab2465d2d7ea841cb9187a6d17fd to your computer and use it in GitHub Desktop.
Parsing the RFC 7231 Accept-Charset header with reap
;; Accept-Charset = *( "," OWS ) ( ( charset / "*" ) [ weight ] ) *( OWS
;; "," [ OWS ( ( charset / "*" ) [ weight ] ) ] )
(let [parser
(let [charset-with-weight
(p/sequence-group
(p/alternatives
(p/pattern-parser
(re-pattern charset))
(p/pattern-parser
(re-pattern (re/re-str \*))))
(p/optionally (weight)))]
(p/cons
(p/first
(p/sequence-group
(p/ignore
(p/pattern-parser
(re-pattern (re/re-compose "(?:%s)*" (re/re-concat \, OWS)))))
charset-with-weight))
(p/zero-or-more
(p/first
(p/sequence-group
(p/ignore
(p/pattern-parser
(re-pattern (re/re-compose "%s%s" OWS ","))))
(p/optionally
(p/first
(p/sequence-group
(p/ignore
(p/pattern-parser
(re-pattern OWS)))
charset-with-weight))))))))]
(criterium.core/quick-bench
(let [matcher (re/input ", \t, , , UTF-8;q=0.8,shift_JIS;q=0.4,a,b")]
(parser matcher))))
@malcolmsparks
Copy link
Author

Performance:

Evaluation count : 110616 in 6 samples of 18436 calls.
             Execution time mean : 5.626027 µs
    Execution time std-deviation : 118.634985 ns
   Execution time lower quantile : 5.490290 µs ( 2.5%)
   Execution time upper quantile : 5.769129 µs (97.5%)
                   Overhead used : 8.246236 ns

@malcolmsparks
Copy link
Author

Result:

(("UTF-8" 0.8) ("shift_JIS" 0.4) ("a") ("b"))

@malcolmsparks
Copy link
Author

This variant provides a nice result:

;; Accept-Charset = *( "," OWS ) ( ( charset / "*" ) [ weight ] ) *( OWS
;;  "," [ OWS ( ( charset / "*" ) [ weight ] ) ] )
(let [parser
      (let [charset-with-weight
            (p/as-map
             (p/sequence-group
              (p/alternatives
               (p/as-entry
                :charset
                (p/pattern-parser
                 (re-pattern charset)))
               (p/pattern-parser
                (re-pattern (re/re-str \*))))
              (p/optionally
               (p/as-entry
                :weight (weight)))))]
        (p/cons
         (p/first
          (p/sequence-group
           (p/ignore
            (p/pattern-parser
             (re-pattern (re/re-compose "(?:%s)*" (re/re-concat \, OWS)))))
           charset-with-weight))
         (p/zero-or-more
          (p/first
           (p/sequence-group
            (p/ignore
             (p/pattern-parser
              (re-pattern (re/re-compose "%s%s" OWS ","))))
            (p/optionally
             (p/first
              (p/sequence-group
               (p/ignore
                (p/pattern-parser
                 (re-pattern OWS)))
               charset-with-weight))))))))]
  (criterium.core/quick-bench
   (let [matcher (re/input ", \t, , , UTF-8;q=0.8,shift_JIS;q=0.4,a,b")]
     (parser matcher))))
=>
({:charset "UTF-8", :weight 0.8}
 {:charset "shift_JIS", :weight 0.4}
 {:charset "a", :weight nil}
 {:charset "b", :weight nil})

at a marginal cost:

Evaluation count : 91548 in 6 samples of 15258 calls.
             Execution time mean : 6.791045 µs
    Execution time std-deviation : 124.832798 ns
   Execution time lower quantile : 6.656914 µs ( 2.5%)
   Execution time upper quantile : 6.965258 µs (97.5%)
                   Overhead used : 8.246236 ns

@malcolmsparks
Copy link
Author

Reap is here: https://github.com/juxt/reap

A blog is in the works.

@souenzzo
Copy link

How about use instaparse to parse the specification as is?

@malcolmsparks
Copy link
Author

Hi @souenzzo - we did once try to compare the regex approach used by reap (an earlier incarnation in jinx) with instaparse. A primitive performance comparison showed reap is 2-3 orders of magnitude faster (because Java regular expressions compile to hightly-optimized interpreters).

user=> (time (dotimes [_ 2000] (insta test-str :start :IRI)))
"Elapsed time: 18674.490482 msecs"
nil
user=> (time (dotimes [_ 2000] (jinx.patterns/re-group-by-name (jinx.patterns/matched jinx.patterns/IRI "https://jon:pither@juxt.pro/malcolm?foo#bar") "host")))
"Elapsed time: 27.511034 msecs"

@malcolmsparks
Copy link
Author

malcolmsparks commented Apr 1, 2020

Also, even if you use instaparse, you still have most of the work ahead of you if you want to parse a non-trivial request header. The purpose of reap is to provide these parsers for you, out of the box.

@malcolmsparks
Copy link
Author

Here's the latest Accept header

;; Accept = [ ( "," / ( media-range [ accept-params ] ) ) *( OWS "," [
;;  OWS ( media-range [ accept-params ] ) ] ) ]

(defn accept []
  (let [media-range-parameter
        (p/first
         (p/sequence-group
          (p/ignore
           (p/pattern-parser
            (re-pattern (re/re-concat OWS \; OWS))))
          (parameter)))

        accept-params (accept-params)

        ;; The reason why we can't just use `media-range` is that we
        ;; need to resolve the ambiguity whereby the "q" parameter
        ;; separates media type parameters from Accept extension
        ;; parameters. This is more fully discussed in RFC 7231
        ;; Section 5.3.2.
        ;;
        ;; The trick is to create a parameter parser modelled on
        ;; `zero-or-more` which attempts to match `accept-params` for
        ;; each parameter. Since `accept-params` matches on a leading
        ;; `weight`, a weight parameter will be detected and cause the
        ;; loop to end.
        parameters-weight-accept-params
        (fn [matcher]
          (loop [matcher matcher
                 result {:parameters []}]
            (if-let [accept-params (accept-params matcher)]
              (merge result accept-params)
              (if-let [match (media-range-parameter matcher)]
                (recur matcher (update result :parameters conj match))
                result))))]
    (p/optionally
     (p/cons
      (p/alternatives
       (p/ignore
        (p/pattern-parser #","))
       (p/comp
        #(apply merge %)
        (p/sequence-group
         (media-range-without-parameters)
         parameters-weight-accept-params)))
      (p/zero-or-more
       (p/first
        (p/sequence-group
         (p/ignore
          (p/pattern-parser
           (re-pattern
            (re/re-concat OWS ","))))
         (p/optionally
          (p/first
           (p/sequence-group
            (p/ignore
             (p/pattern-parser
              (re-pattern OWS)))
            (p/comp
             #(apply merge %)
             (p/sequence-group
              (media-range-without-parameters)
              parameters-weight-accept-params))))))))))))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment