spider
This module provides a collection of Python classes and functions designed for web content retrieval, data parsing, and asynchronous task handling. It includes functions for setting HTTP request headers, asynchronously fetching web page content, downloading files, and managing an asynchronous task pool based on coroutines. Additionally, the module defines several classes for structuring and processing web page data, as well as for managing sequences of web interactions through a page-based approach.
Functions
install_headers
Function Overview
Sets the User-Agent header used for urllib requests.
Parameters
agent
(str, optional): The user-agent string to use for the HTTP request header. Defaults to a standard Chrome browser user-agent string.
Return Values
None
Notes
Setting a browser-like user-agent helps avoid responses that some sites restrict or alter based on the requesting user-agent.
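A minimal sketch of how a default user-agent can be installed for urllib, assuming the function works through `urllib.request.install_opener`. The names `DEFAULT_AGENT`, `build_opener_with_agent`, and `install_headers_sketch` are hypothetical, and the agent string is only a placeholder for the module's real default.

```python
import urllib.request

# Placeholder value (assumption): the module's actual default string is not shown.
DEFAULT_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_opener_with_agent(agent: str = DEFAULT_AGENT) -> urllib.request.OpenerDirector:
    """Build an opener whose only default header is the given User-Agent."""
    opener = urllib.request.build_opener()
    opener.addheaders = [("User-Agent", agent)]
    return opener


def install_headers_sketch(agent: str = DEFAULT_AGENT) -> None:
    """Install the opener globally so later urlopen() calls send the header."""
    urllib.request.install_opener(build_opener_with_agent(agent))
```

After `install_headers_sketch()` runs, plain `urllib.request.urlopen(...)` calls in the same process go through the installed opener.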
get_web_html_async
Function Overview
Asynchronously fetches the HTML content of a given URL.
Parameters
url
(str): The URL from which to fetch the HTML content.
headers
(Dict[str, str], optional): A dictionary containing HTTP headers to send with the request.
encoding
(str, optional): The encoding to use when decoding the response. Defaults to 'utf-8'.
Return Values
- A tuple containing a TaskStatus value and a string message. The message will be the HTML content if the request is successful, or an error message if it fails.
Notes
This function uses aiohttp to perform an asynchronous HTTP GET request.
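The status-and-message return convention can be illustrated with a small sketch. To keep it dependency-free, it swaps aiohttp for a blocking urllib call run in a worker thread, so it is not the module's implementation. The `TaskStatus` members `SUCCEED` and `FAILED` and the function name `fetch_html_sketch` are assumptions.

```python
import asyncio
import urllib.request
from enum import Enum
from typing import Dict, Optional, Tuple


class TaskStatus(Enum):
    # Stand-in enum (assumption): the real member names are not documented here.
    SUCCEED = "succeed"
    FAILED = "failed"


def _fetch_blocking(url: str, headers: Optional[Dict[str, str]], encoding: str) -> str:
    """Blocking GET with urllib; decodes the body with the given encoding."""
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode(encoding)


async def fetch_html_sketch(
    url: str,
    headers: Optional[Dict[str, str]] = None,
    encoding: str = "utf-8",
) -> Tuple[TaskStatus, str]:
    """Return (status, html) on success or (status, error message) on failure."""
    try:
        # Run the blocking call in a thread so the event loop stays responsive.
        html = await asyncio.to_thread(_fetch_blocking, url, headers, encoding)
    except Exception as exc:
        return TaskStatus.FAILED, f"{type(exc).__name__}: {exc}"
    return TaskStatus.SUCCEED, html
```

Callers would typically branch on the first tuple element and treat the second as either page content or a diagnostic string.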
retrieve_file_async
Function Overview
Asynchronously downloads a file from a given URL and saves it to a specified file path.
Parameters
url
(str): The URL of the file to download.
file_path
(str): The local file path where the downloaded file will be saved.
headers
(Dict[str, str], optional): A dictionary containing HTTP headers to send with the request.
Return Values
- A tuple containing a TaskStatus value and a string message. The message will be the file path if the download is successful, or an error message if it fails.
Notes
This function creates directories if they do not exist before saving the file.
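The directory-creation step and the returned file path can be sketched as follows. This is an illustration under assumptions: it uses urllib in a worker thread instead of the module's own HTTP client, and `TaskStatus.SUCCEED`, `TaskStatus.FAILED`, and `retrieve_file_sketch` are hypothetical names.

```python
import asyncio
import os
import shutil
import urllib.request
from enum import Enum
from typing import Dict, Optional, Tuple


class TaskStatus(Enum):
    # Stand-in enum (assumption): the real member names are not documented here.
    SUCCEED = "succeed"
    FAILED = "failed"


def _download_blocking(url: str, file_path: str, headers: Optional[Dict[str, str]]) -> None:
    """Create missing parent directories, then stream the response to disk."""
    parent = os.path.dirname(file_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request, timeout=30) as response:
        with open(file_path, "wb") as fh:
            shutil.copyfileobj(response, fh)


async def retrieve_file_sketch(
    url: str, file_path: str, headers: Optional[Dict[str, str]] = None
) -> Tuple[TaskStatus, str]:
    """Return (status, file_path) on success or (status, error message) on failure."""
    try:
        await asyncio.to_thread(_download_blocking, url, file_path, headers)
    except Exception as exc:
        return TaskStatus.FAILED, f"{type(exc).__name__}: {exc}"
    return TaskStatus.SUCCEED, file_path
```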
only_sleep
Function Overview
A utility function that sleeps for a specified number of seconds, with an optional random delay up to a maximum value.
Parameters
seconds
(float, optional): The base number of seconds to sleep. Defaults to 1.
rand
(bool, optional): Whether to include a random delay up to max. Defaults to True.
max
(float, optional): The maximum additional seconds to randomly wait. Defaults to 5.
Return Values
True
Notes
This function is typically used to introduce delays between requests to avoid being rate-limited or blocked by a website.
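A minimal sketch of the delay logic, assuming the random part is drawn uniformly from 0 to `max` and added on top of the base delay. The name `only_sleep_sketch` is hypothetical, and the actual distribution used by the module is not documented here.

```python
import random
import time


def only_sleep_sketch(seconds: float = 1, rand: bool = True, max: float = 5) -> bool:
    """Sleep for `seconds`, plus up to `max` extra seconds when `rand` is set.

    The parameter name `max` mirrors the documented signature, so it shadows
    the built-in inside this function.
    """
    extra = random.uniform(0, max) if rand else 0.0
    time.sleep(seconds + extra)
    return True
```

With the documented defaults, a call would pause for somewhere between 1 and 6 seconds under this assumption.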
text_fn
Function Overview
Extracts text from an element or a list of elements, which could be etree._Element objects or strings.
Parameters
x
(Union[str, List[str], etree._Element, List[etree._Element]]): The input element(s) from which to extract text.
Return Values
- The extracted text, which could be a string or a list of strings.
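The dispatch over strings, elements, and lists can be sketched as below. The sketch uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml, and it assumes that an element's text means all of its nested text joined together. The real function may behave differently. `text_fn_sketch` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from typing import List, Union

Node = Union[str, ET.Element]


def text_fn_sketch(x: Union[Node, List[Node]]) -> Union[str, List[str]]:
    """Return text for one item, or a list of texts for a list of items."""
    if isinstance(x, list):
        return [text_fn_sketch(item) for item in x]
    if isinstance(x, str):
        # Strings are passed through unchanged (assumption).
        return x
    # Elements: join the text of the element and all of its descendants.
    return "".join(x.itertext())


root = ET.fromstring("<div><p>Hello <b>world</b></p><p>again</p></div>")
texts = text_fn_sketch(list(root))  # one string per <p> element
```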
Compose
Function Overview
Composes a list of functions into a single function that is the sequential application of the functions in the list.
Parameters
lst
(List[Callable]): A list of functions to compose.
Return Values
- A new function that is the result of the composition.
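Sequential composition can be sketched with `functools.reduce`. The sketch assumes the first function in the list is applied first and that each function takes a single argument. `compose_sketch` is a hypothetical name.

```python
from functools import reduce
from typing import Any, Callable, List


def compose_sketch(lst: List[Callable[[Any], Any]]) -> Callable[[Any], Any]:
    """Return a function that feeds its input through each function in order."""

    def composed(value: Any) -> Any:
        return reduce(lambda acc, fn: fn(acc), lst, value)

    return composed


# Strip whitespace, upper-case, then split on commas.
pipeline = compose_sketch([str.strip, str.upper, lambda s: s.split(",")])
parts = pipeline("  a,b ")  # ["A", "B"]
```

An empty list yields a function that returns its input unchanged.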
Classes
AsyncResult
Class Overview
A data class that represents the result of an asynchronous task.
Members
async_pool
(CoroutinePool): The pool from which the task was executed.
name
(str): The name of the task.
result
(Any, optional): The result of the task. Defaults to TaskStatus.NOT_RETURNED.
Methods
get
- Blocks until the result of the asynchronous task is available and returns it.
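One way a blocking `get` could work is sketched below. `CoroutinePool` is not documented in this section, so the sketch substitutes a hypothetical `TinyPool` that runs an event loop in a background thread. None of its methods (`submit`, `wait`, `close`) are claimed to exist on the real pool.

```python
import asyncio
import threading
from dataclasses import dataclass
from enum import Enum
from typing import Any, Coroutine, Dict


class TaskStatus(Enum):
    NOT_RETURNED = "not returned"


class TinyPool:
    """Hypothetical stand-in for CoroutinePool: one loop in a daemon thread."""

    def __init__(self) -> None:
        self.loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        self._thread.start()
        self._futures: Dict[str, Any] = {}

    def submit(self, name: str, coro: Coroutine) -> "AsyncResultSketch":
        # Schedule the coroutine on the background loop from this thread.
        self._futures[name] = asyncio.run_coroutine_threadsafe(coro, self.loop)
        return AsyncResultSketch(self, name)

    def wait(self, name: str) -> Any:
        # Blocks the calling thread until the coroutine finishes.
        return self._futures[name].result()

    def close(self) -> None:
        self.loop.call_soon_threadsafe(self.loop.stop)
        self._thread.join()
        self.loop.close()


@dataclass
class AsyncResultSketch:
    async_pool: TinyPool
    name: str
    result: Any = TaskStatus.NOT_RETURNED

    def get(self) -> Any:
        """Block until the task's result is available, cache it, and return it."""
        if self.result is TaskStatus.NOT_RETURNED:
            self.result = self.async_pool.wait(self.name)
        return self.result
```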
BasePage
Class Overview
A base class representing a web page, including methods for parsing and storing data.
Members
name
(str): The name of the page.
xpath
(Union[str, List[str]]): The XPath expression(s) used to extract data from the page.
findall_fn
(Callable, optional): An alternative function to extract data using bs4.find_all.
_async_task_pool
(CoroutinePool): The async task pool for executing async tasks.
_headers
(Dict[str, str]): Headers used for web page requests.
result
(List[Any]): The parsed result data.
father_page
(BasePage, optional): The parent page of the current page.
next_pages
(Dict[str, BasePage]): A dictionary containing the next pages linked from the current page.
Methods
add_next_page
- Adds a child page to the current page.
parse
- Parses data from a given list of results and stores it in self.result.
perform
- Performs the parsing and processing of the page.
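The parent and child linkage described by `father_page`, `next_pages`, and `add_next_page` can be sketched with a small stand-in class. `PageSketch` and its `path` helper are hypothetical. The sketch also assumes `parse` simply appends to `result`, and it omits XPath handling, headers, and the task pool.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class PageSketch:
    name: str
    result: List[Any] = field(default_factory=list)
    # Excluded from repr/compare to avoid walking parent/child cycles.
    father_page: Optional["PageSketch"] = field(default=None, repr=False, compare=False)
    next_pages: Dict[str, "PageSketch"] = field(default_factory=dict)

    def add_next_page(self, page: "PageSketch") -> "PageSketch":
        """Register `page` as a child and point it back at this page."""
        page.father_page = self
        self.next_pages[page.name] = page
        return page

    def parse(self, results: List[Any]) -> "PageSketch":
        """Store parsed items on the page (assumption: plain append)."""
        self.result.extend(results)
        return self

    def path(self) -> List[str]:
        """Hypothetical helper: page names from the root down to this page."""
        names: List[str] = []
        page: Optional["PageSketch"] = self
        while page is not None:
            names.append(page.name)
            page = page.father_page
        return names[::-1]


index = PageSketch("index")
detail = index.add_next_page(PageSketch("detail"))
```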
PagePage
Class Overview
A specialized page class that starts by storing a new web page or a list of web pages.
UrlIdxPagesPage
Class Overview
A specialized page class that starts by parsing pages, storing web pages for further parsing, and getting page URLs from a given base URL.
DownloadPage
Class Overview
A page class designed to download files from given URLs and store file paths.
ItemsPage
Class Overview
A page class that parses and stores data from the father page.
Actions
Class Overview
A class that manages pages and performs actions, providing methods to add, perform, and manage page results.
Members
pages
(Dict[str, BasePage]): A dictionary of page objects.
results
(Dict): A dictionary of all results from the pages.
use_thread_listen
(bool): A flag to use a thread for listening to keyboard input.
k2a
(List[Tuple[str, Key2Action]]): A list of key to action mappings for controlling the program via keyboard input.
_headers
(Dict[str, str]): A dictionary of headers for HTTP requests.
_async_task_pool
(CoroutinePool): A coroutine pool for managing asynchronous tasks.
Methods
get_page
- Retrieves a page by name from a given set of pages or a father page's next pages.
add_page
- Adds a page to the pages dictionary with optional before and after functions.
del_page
- Deletes a page from the pages dictionary. (Not implemented)
perform
- Performs all pages to get results, starting necessary threads and async task pools.
close
- Closes the async task pool.
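The register-then-perform flow can be sketched with a small registry. This is only an illustration under assumptions: `ActionsSketch`, `StubPage`, and the `before_func` and `after_func` parameter names are hypothetical. The sketch leaves out keyboard listening, headers, and the coroutine pool, and it assumes hooks receive the page object.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Hook = Optional[Callable[[Any], None]]


class StubPage:
    """Hypothetical page stand-in with a name and a perform() method."""

    def __init__(self, name: str, data: List[Any]) -> None:
        self.name = name
        self.data = data

    def perform(self) -> List[Any]:
        return list(self.data)


class ActionsSketch:
    def __init__(self) -> None:
        self.pages: Dict[str, StubPage] = {}
        self.results: Dict[str, List[Any]] = {}
        self._hooks: Dict[str, Tuple[Hook, Hook]] = {}

    def add_page(self, page: StubPage, before_func: Hook = None, after_func: Hook = None) -> StubPage:
        """Register a page by name together with its optional hooks."""
        self.pages[page.name] = page
        self._hooks[page.name] = (before_func, after_func)
        return page

    def perform(self) -> Dict[str, List[Any]]:
        """Run every page in insertion order and collect results by page name."""
        for name, page in self.pages.items():
            before, after = self._hooks[name]
            if before is not None:
                before(page)
            self.results[name] = page.perform()
            if after is not None:
                after(page)
        return self.results


log: List[str] = []
actions = ActionsSketch()
actions.add_page(
    StubPage("index", [1, 2]),
    before_func=lambda p: log.append(f"before {p.name}"),
    after_func=lambda p: log.append(f"after {p.name}"),
)
collected = actions.perform()
```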