get_urls_in_a_url
To extract all URLs in a web page, you can use get_urls.py (github.com/KamarajuKusumanchi). Sample usage:
  $ get_urls.py https://news.ycombinator.com/item?id=25271676
  ...
  https://github.com/dddrrreee/cs140e-20win/
  https://cs140e.sergio.bz/syllabus/
  https://tc.gts3.org/cs3210/2020/spring/lab.html
  https://github.com/dddrrreee/cs140e-20win/
  http://ggp.stanford.edu/
  ...
The important snippet is
  import re

  import requests
  from bs4 import BeautifulSoup

  def get_urls(url):
      # Fetch the page and collect every <a href="..."> that is an
      # absolute http:// or https:// link.  The regex is '^https?://';
      # the looser 'https*://' would also match e.g. 'httpss://'.
      response = requests.get(url)
      soup = BeautifulSoup(response.text, 'html.parser')
      urls = [
          x.get('href')
          for x in soup.find_all(name='a', attrs={'href': re.compile('^https?://')})
      ]
      return urls
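The same filtering idea can be exercised without a network request by parsing a static HTML string. This is a minimal sketch, not part of the original script; the function name extract_urls and the sample markup are illustrative.

```python
import re

from bs4 import BeautifulSoup

def extract_urls(html):
    # Same filtering as get_urls(), applied to an HTML string:
    # keep only <a> tags whose href is an absolute http(s) URL.
    soup = BeautifulSoup(html, 'html.parser')
    return [
        a.get('href')
        for a in soup.find_all(name='a', attrs={'href': re.compile('^https?://')})
    ]

html = """
<a href="https://example.com/page">absolute https link</a>
<a href="http://example.com/">absolute http link</a>
<a href="/relative/path">relative link, skipped</a>
<a>anchor without href, skipped</a>
"""
print(extract_urls(html))
# ['https://example.com/page', 'http://example.com/']
```

Relative links and anchors without an href attribute are dropped by the attrs filter, which is why the result contains only the two absolute URLs.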
get_urls_in_a_url.txt · Last modified: 2020/12/30 19:03 by raju