get_urls_in_a_url

To extract all URLs in a web page, you can use get_urls.py (github.com/KamarajuKusumanchi). Sample usage

$ get_urls.py https://news.ycombinator.com/item?id=25271676
...
https://github.com/dddrrreee/cs140e-20win/
https://cs140e.sergio.bz/syllabus/
https://tc.gts3.org/cs3210/2020/spring/lab.html
https://github.com/dddrrreee/cs140e-20win/
http://ggp.stanford.edu/
...

The important snippet is

import re

import requests
from bs4 import BeautifulSoup


def get_urls(url):
    # Fetch the page and parse its HTML.
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Keep only anchors whose href is an absolute http:// or https:// link.
    urls = [
        x.get('href')
        for x in soup.find_all(name='a', attrs={'href': re.compile(r'^https?://')})
    ]
    return urls
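The href filter can be exercised offline by feeding BeautifulSoup a literal HTML string instead of a fetched page. This is a minimal sketch with hypothetical URLs; it shows that relative links and non-http schemes (such as mailto:) are dropped by the regex:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical page with a mix of link types.
html = '''<html><body>
<a href="https://example.com/a">absolute https</a>
<a href="http://example.org/b">absolute http</a>
<a href="/relative/link">relative</a>
<a href="mailto:someone@example.com">mail</a>
</body></html>'''

soup = BeautifulSoup(html, 'html.parser')
# Same filter as in get_urls: only absolute http/https hrefs survive.
urls = [
    x.get('href')
    for x in soup.find_all(name='a', attrs={'href': re.compile(r'^https?://')})
]
print(urls)
```

Running this prints only the two absolute links, confirming that the regex anchors on the scheme at the start of the href.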
get_urls_in_a_url.txt · Last modified: 2020/12/30 19:03 by raju