
Python Code: Get All Links from a Website

Overview


In this script, we use the re module to extract all the links from a website's HTML.

One of the most powerful functions in the re module is re.findall().

While re.search() finds only the first match for a pattern, re.findall() finds *all*
the matches and returns them as a list of strings, with each string representing one match.
If the pattern contains more than one capturing group, each match is returned as a tuple
of groups instead.
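
As a small illustration of the difference, the sketch below runs both functions over a made-up string. The sample text and the patterns are assumptions chosen only for demonstration.

```python
import re

# Hypothetical sample text, used only for illustration
text = "cat bat rat"

# re.search() stops at the first match and returns a match object (or None)
first = re.search(r"[a-z]at", text)
print(first.group())  # cat

# re.findall() scans the whole string and returns every match as a list
all_matches = re.findall(r"[a-z]at", text)
print(all_matches)  # ['cat', 'bat', 'rat']

# With two capturing groups, each match becomes a tuple of the groups
pairs = re.findall(r"([a-z])(at)", text)
print(pairs)  # [('c', 'at'), ('b', 'at'), ('r', 'at')]
```

The last call shows why the number of capturing groups matters when you expect a plain list of strings.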

Get All Links from a Website


This example gets all the links from a website's HTML code.

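To find all the links, this example uses the urllib.request module (urllib2 in Python 2)
together with the re module.

import re
import urllib.request

# URL to scrape (placeholder; replace it with the site you want)
url = "http://www.example.com"

# connect to the URL
website = urllib.request.urlopen(url)

# read the HTML code and decode the bytes to text
html = website.read().decode("utf-8", errors="replace")

# use re.findall to get all double-quoted http(s) and ftp(s) links;
# the scheme group is non-capturing, so each match is a single string
links = re.findall(r'"((?:http|ftp)s?://.*?)"', html)

print(links)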
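
To see what this kind of pattern does and does not pick up without making a network request, here is a minimal sketch that applies it to a made-up HTML snippet. The snippet and its URLs are hypothetical, and the expression is written with a non-capturing group, (?:...), so that each match comes back as a single string.

```python
import re

# Hypothetical HTML snippet, used only for illustration
sample_html = '''
<a href="http://example.com/page">absolute http</a>
<a href="https://example.com/secure">absolute https</a>
<a href="ftp://example.com/file.txt">ftp</a>
<a href="/about">relative link</a>
<a href='http://example.com/single'>single-quoted</a>
'''

# Matches double-quoted strings that start with http://, https://, ftp:// or ftps://
pattern = r'"((?:http|ftp)s?://.*?)"'

found = re.findall(pattern, sample_html)
for link in found:
    print(link)
```

Only the first three links are printed. The relative link and the single-quoted link are skipped, because the expression requires a double quote followed directly by a scheme.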

Happy scraping!
