spider

This module provides a collection of Python classes and functions designed for web content retrieval, data parsing, and asynchronous task handling. It includes functions for setting HTTP request headers, asynchronously fetching web page content, downloading files, and managing an asynchronous task pool based on coroutines. Additionally, the module defines several classes for structuring and processing web page data, as well as for managing sequences of web interactions through a page-based approach.

Functions

install_headers

Function Overview

Sets the user-agent header for urllib requests.

Parameters

  • agent (str, optional): The user-agent string to use for the HTTP request header. Defaults to a standard Chrome browser user-agent string.

Return Values

None

Notes

Setting a browser-like user-agent helps avoid responses that servers restrict or alter for non-browser clients.
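
Below is a minimal sketch of how such a helper can be written with the standard library; the default agent string shown here is illustrative rather than the module's actual default.

```python
import urllib.request

_DEFAULT_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36")

def install_headers(agent: str = _DEFAULT_AGENT) -> None:
    # Install a global opener whose requests carry the given user-agent,
    # so subsequent urllib.request.urlopen calls send it automatically.
    opener = urllib.request.build_opener()
    opener.addheaders = [("User-agent", agent)]
    urllib.request.install_opener(opener)
```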

get_web_html_async

Function Overview

Asynchronously fetches the HTML content of a given URL.

Parameters

  • url (str): The URL from which to fetch the HTML content.
  • headers (Dict[str, str], optional): A dictionary containing HTTP headers to send with the request.
  • encoding (str, optional): The encoding to use when decoding the response. Defaults to 'utf-8'.

Return Values

  • A tuple containing a TaskStatus value and a string message. The message will be the HTML content if the request is successful, or an error message if it fails.

Notes

This function uses aiohttp to perform an asynchronous HTTP GET request.
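
A rough sketch of the call shape, assuming aiohttp and a stand-in TaskStatus enum (the real module defines its own status values):

```python
import asyncio
from enum import Enum, auto
from typing import Dict, Optional, Tuple

import aiohttp

class TaskStatus(Enum):  # stand-in for the module's TaskStatus
    SUCCESS = auto()
    ERROR = auto()

async def get_web_html_async(url: str,
                             headers: Optional[Dict[str, str]] = None,
                             encoding: str = "utf-8") -> Tuple[TaskStatus, str]:
    # Fetch the page body and decode it; report failures through the status value.
    try:
        async with aiohttp.ClientSession(headers=headers) as session:
            async with session.get(url) as resp:
                body = await resp.read()
                return TaskStatus.SUCCESS, body.decode(encoding, errors="replace")
    except Exception as exc:
        return TaskStatus.ERROR, f"request failed: {exc}"

# status, html = asyncio.run(get_web_html_async("https://example.com"))
```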

retrieve_file_async

Function Overview

Asynchronously downloads a file from a given URL and saves it to a specified file path.

Parameters

  • url (str): The URL of the file to download.
  • file_path (str): The local file path where the downloaded file will be saved.
  • headers (Dict[str, str], optional): A dictionary containing HTTP headers to send with the request.

Return Values

  • A tuple containing a TaskStatus value and a string message. The message will be the file path if the download is successful, or an error message if it fails.

Notes

This function creates directories if they do not exist before saving the file.
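
A comparable sketch for the download path, reusing the stand-in TaskStatus from the fetch example above; the directory handling mirrors the note about missing directories:

```python
import os
from typing import Dict, Optional, Tuple

import aiohttp

async def retrieve_file_async(url: str, file_path: str,
                              headers: Optional[Dict[str, str]] = None):
    # Download the resource and write it to file_path, creating the parent
    # directory first; return (status, message) in the module's style.
    try:
        os.makedirs(os.path.dirname(file_path) or ".", exist_ok=True)
        async with aiohttp.ClientSession(headers=headers) as session:
            async with session.get(url) as resp:
                with open(file_path, "wb") as fh:
                    fh.write(await resp.read())
        return TaskStatus.SUCCESS, file_path
    except Exception as exc:
        return TaskStatus.ERROR, f"download failed: {exc}"
```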

only_sleep

Function Overview

A utility function that sleeps for a specified number of seconds, with an optional random delay up to a maximum value.

Parameters

  • seconds (float, optional): The base number of seconds to sleep. Defaults to 1.
  • rand (bool, optional): Whether to include a random delay up to max. Defaults to True.
  • max (float, optional): The maximum additional seconds to randomly wait. Defaults to 5.

Return Values

  • True; the function always returns True after sleeping.

Notes

This function is typically used to introduce delays between requests to avoid being rate-limited or blocked by a website.
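
A plausible implementation, under the assumption that the extra delay is drawn uniformly between 0 and max:

```python
import random
import time

def only_sleep(seconds: float = 1, rand: bool = True, max: float = 5) -> bool:
    # Sleep for the base duration plus an optional random extra delay,
    # then return True so the call can be chained in a page pipeline.
    extra = random.uniform(0, max) if rand else 0
    time.sleep(seconds + extra)
    return True
```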

text_fn

Function Overview

Extracts text from an element or a list of elements, which could be etree._Element objects or strings.

Parameters

  • x (Union[str, List[str], etree._Element, List[etree._Element]]): The input element(s) from which to extract text.

Return Values

  • The extracted text, which could be a string or a list of strings.
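
One plausible shape for this helper, assuming lxml; whether the real function reads .text or joins an element's full text content is an implementation detail of the module:

```python
from typing import List, Union
from lxml import etree

def text_fn(x: Union[str, List, etree._Element]) -> Union[str, List]:
    # Recurse over lists, return the text of lxml elements,
    # and pass plain strings through unchanged.
    if isinstance(x, list):
        return [text_fn(item) for item in x]
    if isinstance(x, etree._Element):
        return x.text or ""
    return x
```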

Compose

Function Overview

Composes a list of functions into a single function that applies them in sequence.

Parameters

  • lst (List[Callable]): A list of functions to compose.

Return Values

  • A new function that is the result of the composition.
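
A minimal sketch, assuming left-to-right application (the first function in the list runs first):

```python
from functools import reduce
from typing import Callable, List

def Compose(lst: List[Callable]) -> Callable:
    # Compose([f, g])(x) == g(f(x)): feed each function's output
    # into the next one in the list.
    def composed(x):
        return reduce(lambda acc, fn: fn(acc), lst, x)
    return composed

# Example: strip whitespace, then upper-case.
# Compose([str.strip, str.upper])("  spider  ")  ->  "SPIDER"
```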

Classes

AsyncResult

Class Overview

A data class that represents the result of an asynchronous task.

Members

  • async_pool (CoroutinePool): The pool from which the task was executed.
  • name (str): The name of the task.
  • result (Any, optional): The result of the task. Defaults to TaskStatus.NOT_RETURNED.

Methods

get

  • Blocks until the result of the asynchronous task is available and returns it.

BasePage

Class Overview

A base class representing a web page, including methods for parsing and storing data.

Members

  • name (str): The name of the page.
  • xpath (Union[str, List[str]]): The XPath expression(s) used to extract data from the page.
  • findall_fn (Callable, optional): An alternative function to extract data using bs4.find_all.
  • _async_task_pool (CoroutinePool): The async task pool for executing async tasks.
  • _headers (Dict[str, str]): Headers used for web page requests.
  • result (List[Any]): The parsed result data.
  • father_page (BasePage, optional): The parent page of the current page.
  • next_pages (Dict[str, BasePage]): A dictionary containing the next pages linked from the current page.

Methods

add_next_page

  • Adds a child page to the current page.

parse

  • Parses data from a given list of results and stores it in self.result.

perform

  • Performs the parsing and processing of the page.
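
A hypothetical usage sketch; the constructor arguments beyond name and xpath are assumptions, but add_next_page is the documented way to link pages:

```python
# Collect post links on a listing page, then hand them to a child page.
links = BasePage(name="post_links", xpath="//div[@class='post']//a/@href")
titles = BasePage(name="post_titles", xpath="//h1/text()")
links.add_next_page(titles)   # titles is now reachable via links.next_pages
```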

PagePage

Class Overview

A specialized page class whose starting step stores a new web page, or a list of web pages, for further processing.

UrlIdxPagesPage

Class Overview

A specialized page class that builds page URLs from a given base URL, fetches them, and stores the resulting web pages for further parsing.

DownloadPage

Class Overview

A page class designed to download files from given URLs and store file paths.

ItemsPage

Class Overview

A page class that parses and stores data from the father page.

Actions

Class Overview

A class that manages pages and performs actions, providing methods to add, perform, and manage page results.

Members

  • pages (Dict[str, BasePage]): A dictionary of page objects.
  • results (Dict): A dictionary of all results from the pages.
  • use_thread_listen (bool): A flag to use a thread for listening to keyboard input.
  • k2a (List[Tuple[str, Key2Action]]): A list of key-to-action mappings for controlling the program via keyboard input.
  • _headers (Dict[str, str]): A dictionary of headers for HTTP requests.
  • _async_task_pool (CoroutinePool): A coroutine pool for managing asynchronous tasks.

Methods

get_page

  • Retrieves a page by name from a given set of pages or a father page's next pages.

add_page

  • Adds a page to the pages dictionary with optional before and after functions.

del_page

  • Deletes a page from the pages dictionary. (Not implemented)

perform

  • Runs all registered pages to collect their results, starting the listener thread and the async task pool as needed.

close

  • Closes the async task pool.
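
A hypothetical end-to-end flow; the constructor call is an assumption, while add_page, perform, and close are the methods documented above:

```python
actions = Actions()
actions.add_page(links)      # a page built as in the BasePage sketch
actions.perform()            # run every page through the coroutine pool
print(actions.results)       # per-page results collected into a dict
actions.close()              # shut down the async task pool
```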