re4lfl0w/how_to_extract_exact_url.md

## how_to_extract_exact_url.md

      
Question: how to extract only the leading URL


I have a regular-expression question. It looks like it should be simple, but I can't get it to work.


http://naver.com/path/ab1.html


http://naver.com/path/ab1.htmlhttp


http://naver.com/path/ab1.htmlhttp://daum.net/aaa/1.html


Is there a way to extract only the leading http://naver.com/path/ab1.html from each of these?


import re

s1 = 'http://naver.com/path/ab1.htmlhttp'
s2 = 'http://naver.com/path/ab1.htmlhttp://daum.net/aaa/1.html'

http_regex = re.compile(r'''
    #     ^                     # Anchor to start of string.
        (?:https?://)?
        (?!.{256})            # Reject if 256 or more characters follow (length cap).
        (?:                   # One or more sub-domains.
          [a-z0-9]            # Subdomain begins with alpha-num.
          (?:                 # Optionally more than one char.
            [a-z0-9-]{0,61}   # Middle part may have dashes.
            [a-z0-9]          # Starts and ends with alpha-num.
          )?                  # Subdomain length from 1 to 63.
          \.                  # Required dot separates subdomains.
        )+                    # End one or more sub-domains.
        (?:                   # Top level domain (length from 1 to 63).
          [a-z]{1,63}         # Either traditional-tld-label = 1*63(ALPHA).
        | xn--[a-z0-9]{1,59}  # Or an idn-label = Restricted-A-Label.
        )                     # End top level domain.
        /[-a-zA-Z0-9/.]+(?!http)  # Greedy path run, then "not followed by http".
    #     $                     # Anchor to end of string.
    ''', re.X | re.I | re.M)

print(http_regex.findall(s1))
# ['http://naver.com/path/ab1.htmlhttp']
print(http_regex.findall(s2))
# ['http://naver.com/path/ab1.htmlhttp', 'daum.net/aaa/1.html']

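The greedy character class at the end of the pattern keeps going and ends up matching the trailing http as well.
How can I solve this?
I could extract a match with this regex and then run a second regex over the result to keep only the leading part, but is there no way to do it with a single regex?
For example, grabbing everything and then cutting off whatever comes from the next http onward would work, but I can't get it done with one regex.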
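
One possible single-pass approach, sketched here under assumptions rather than offered as a definitive answer: rather than checking `(?!http)` once after a greedy run, check it before every character, so the match stops at the point where the next `http` begins. The names `LEADING_URL` and `first_url` are made up for this illustration, and the sketch assumes the first URL never legitimately contains the text `http` after its own scheme (a path such as `/httpd/` would be cut short). It also skips the domain-label validation from the pattern above.

```python
import re

# Hypothetical pattern: a scheme, then one or more characters, each of which
# must NOT be the start of another "http". The per-character lookahead makes
# the match stop right before a glued-on "http" or "http://...".
LEADING_URL = re.compile(r'https?://(?:(?!http).)+', re.I)


def first_url(text):
    """Return the URL at the start of text, or None if text does not start with one."""
    m = LEADING_URL.match(text)  # match() anchors at the start of the string
    return m.group(0) if m else None


samples = [
    'http://naver.com/path/ab1.html',
    'http://naver.com/path/ab1.htmlhttp',
    'http://naver.com/path/ab1.htmlhttp://daum.net/aaa/1.html',
]

for s in samples:
    print(first_url(s))
# Each line prints: http://naver.com/path/ab1.html
```

A lazy quantifier with a lookahead, `https?://.+?(?=http|$)`, should behave the same way on these three inputs. Which form reads better is largely a matter of taste.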