mbapy.file

This module provides utility functions for file operations, including reading and writing files, working with different file formats, and handling file paths.

Functions

get_paths_with_extension -> List[str]

Returns a list of file paths within a given folder that have a specified extension.

Params

folder_path (str): The path of the folder to search for files.
file_extensions (List[str]): A list of file extensions to filter the search by.

Returns

List[str]: A list of file paths that match the specified file extensions.

Notes

None

Example

folder_path = '/path/to/folder'
file_extensions = ['.txt', '.csv']
file_paths = get_paths_with_extension(folder_path, file_extensions)
print(file_paths)
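For readers who want to see what such a search involves, the following is a minimal standard-library sketch, not the mbapy implementation. The helper name find_paths_with_extension is hypothetical, and the recursive, case-sensitive suffix matching is an assumption rather than documented behavior.

```python
import os
from typing import List


def find_paths_with_extension(folder_path: str, file_extensions: List[str]) -> List[str]:
    """Hypothetical helper: collect files under folder_path whose names end with a given extension."""
    suffixes = tuple(file_extensions)  # str.endswith accepts a tuple of suffixes
    matches = []
    # Assumption: subfolders are searched recursively.
    for dirpath, _dirnames, filenames in os.walk(folder_path):
        for name in filenames:
            if name.endswith(suffixes):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

Calling find_paths_with_extension('/path/to/folder', ['.txt', '.csv']) would return the matching paths in sorted order.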

extract_files_from_dir

Move all files in subdirectories to the root directory and add the subdirectory name as a prefix to the file name.

Params

root (str): The root directory path.
file_extensions (List[str], optional): The file extensions to extract, given without the leading '.'. If None, files of all types are extracted.
extract_sub_dir (bool, optional): Whether to recursively extract files from subdirectories. If set to False, only files in the immediate subdirectories will be extracted. Defaults to True.
join_str (str): The string used to join the subdirectory-name prefix and the file name.

Returns

None

Notes

None

Example

root = '/path/to/root'
file_extensions = ['txt', 'csv']
extract_files_from_dir(root, file_extensions, extract_sub_dir=True, join_str='_')
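The flattening behavior described above can be illustrated with a short standard-library sketch. It is not the mbapy implementation: the helper name extract_files_sketch is hypothetical, using the immediate parent directory's name as the prefix is an assumption, and the returned list of new paths exists only to make the result easy to inspect (the documented function returns None).

```python
import os
import shutil
from typing import List, Optional


def extract_files_sketch(root: str, file_extensions: Optional[List[str]] = None,
                         extract_sub_dir: bool = True, join_str: str = '_') -> List[str]:
    """Hypothetical helper: move files from subdirectories into root with a directory-name prefix."""
    wanted = None if file_extensions is None else {e.lstrip('.') for e in file_extensions}
    moves = []
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        if rel == os.curdir:
            continue  # files already in root stay where they are
        depth = len(rel.split(os.sep))
        if depth > 1 and not extract_sub_dir:
            continue  # only immediate subdirectories when recursion is disabled
        for name in filenames:
            ext = os.path.splitext(name)[1].lstrip('.')
            if wanted is None or ext in wanted:
                # Assumption: the prefix is the name of the file's parent directory.
                new_name = os.path.basename(dirpath) + join_str + name
                moves.append((os.path.join(dirpath, name), os.path.join(root, new_name)))
    # Move only after the walk has finished, so the walk never sees moved files.
    for src, dst in moves:
        shutil.move(src, dst)
    return [dst for _src, dst in moves]
```

Collecting the moves first and applying them afterwards keeps the directory walk independent of the changes it causes.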

replace_invalid_path_chr -> str

Replaces any invalid characters in a given path with a specified valid character.

Params

path (str): The path string to be checked for invalid characters.
valid_chrs (str, optional): The valid characters that will replace any invalid characters in the path. Defaults to '_'.

Returns

str: The path string with all invalid characters replaced by the valid character.

Notes

None

Example

path = '/path/with/invalid?characters'
valid_path = replace_invalid_path_chr(path, valid_chrs='_')
print(valid_path)
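The replacement step can be sketched with str.translate. This is an illustration only: the helper name replace_invalid_chars is hypothetical, and the set of characters treated as invalid here (those Windows rejects in file names, apart from path separators) is an assumption, since the set mbapy uses is not stated in this document.

```python
# Assumption: characters that Windows rejects in file names, excluding path separators.
INVALID_CHARS = '<>:"|?*'


def replace_invalid_chars(path: str, valid_chr: str = '_') -> str:
    """Hypothetical helper: replace every character in INVALID_CHARS with valid_chr."""
    table = str.maketrans({c: valid_chr for c in INVALID_CHARS})
    return path.translate(table)


print(replace_invalid_chars('/path/with/invalid?characters'))
# prints: /path/with/invalid_characters
```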

get_valid_file_path -> str

Returns a valid file path by replacing any invalid characters in the given path with a specified valid character and truncating the path to a specified length.

Params

path (str): The path string to be checked for invalid characters.
valid_chrs (str, optional): The valid characters that will replace any invalid characters in the path. Defaults to '_'.
valid_len (int, optional): The maximum length of the valid file path. Defaults to 250.

Returns

str: The valid file path.

Notes

None

Example

path = '/path/with/invalid?characters'
valid_path = get_valid_file_path(path, valid_chrs='_', valid_len=100)
print(valid_path)

opts_file

Reads data from a file or writes data to it, depending on the given mode and way.

Params

path (str): The path to the file.
mode (str, optional): The mode in which the file should be opened. Defaults to 'r'.
encoding (str, optional): The encoding of the file. Defaults to 'utf-8'.
way (str, optional): The way in which the data should be read or written. Defaults to 'lines'.
data (Any, optional): The data to be written to the file. Only applicable in write mode. Defaults to None.

Returns

list or str or dict or None: The data read from the file, or None if the file was opened in write mode and no data was provided.

Notes

None

Example

path = '/path/to/file.txt'
data = ['line 1', 'line 2', 'line 3']
opts_file(path, mode='w', data=data)  # write the lines to the file
read_data = opts_file(path, mode='r')  # read them back
print(read_data)

read_bits -> bytes

Reads a file in binary mode and returns the content as bytes.

Params

path (str): The path to the file.

Returns

bytes: The content of the file as bytes.

Notes

None

Example

path = '/path/to/file.bin'
content = read_bits(path)
print(content)

read_text -> str or List[str]

Reads a file in text mode and returns the content as a string or a list of lines.

Params

path (str): The path to the file.
decode (str, optional): The encoding of the file. Defaults to 'utf-8'.
way (str, optional): The way in which the data should be read. Defaults to 'lines'.

Returns

str or List[str]: The content of the file as a string or a list of lines.

Notes

None

Example

path = '/path/to/file.txt'
content = read_text(path, decode='utf-8', way='lines')
print(content)

detect_byte_coding(bits:bytes) -> str

Detects the character encoding of a given bytes object.

Parameters:
- bits (bytes): The bytes object to be analyzed.

Returns:
- str: The name of the detected encoding.

Example:

detect_byte_coding(b'\xe4\xb8\xad\xe6\x96\x87')

decode_bits_to_str(bits:bytes) -> str

Decodes a bytes object to a string using either the GB2312 or the UTF-8 encoding.

Parameters:
- bits (bytes): The bytes object to decode.

Returns:
- str: The decoded string.

Example:

decode_bits_to_str(b'\xe4\xb8\xad\xe6\x96\x87')
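A two-encoding fallback like this can be sketched as follows. It is not the mbapy implementation: the helper name decode_bytes_sketch is hypothetical, and both the order in which the encodings are tried and the final lossy fallback are assumptions.

```python
def decode_bytes_sketch(bits: bytes) -> str:
    """Hypothetical helper: try UTF-8 first, then GB2312."""
    # Assumption: UTF-8 is tried first because it rejects most non-UTF-8 input.
    for encoding in ('utf-8', 'gb2312'):
        try:
            return bits.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Assumption: if neither encoding fits, decode lossily instead of raising.
    return bits.decode('utf-8', errors='replace')


print(decode_bytes_sketch(b'\xe4\xb8\xad\xe6\x96\x87'))  # prints: 中文
```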

save_json(path:str, obj, encoding:str = 'utf-8', forceUpdate = True) -> None

Saves an object as a JSON file at the specified path.

Parameters:
- path (str): The path where the JSON file will be saved.
- obj: The object to be saved as JSON.
- encoding (str): The encoding of the JSON file. Default is 'utf-8'.
- forceUpdate (bool): Determines whether to overwrite an existing file at the specified path. Default is True.

Returns:
- None

Example:

data = {'name': 'John', 'age': 30}
save_json('data.json', data)

read_json(path:str, encoding:str = 'utf-8', invalidPathReturn = None) -> Union[dict, Any]

Reads a JSON file from the given path and returns the parsed JSON data.

Parameters:
- path (str): The path to the JSON file.
- encoding (str, optional): The encoding of the file. Defaults to 'utf-8'.
- invalidPathReturn (any, optional): The value to return if the path is invalid. Defaults to None.

Returns:
- dict: The parsed JSON data.
- invalidPathReturn (any): The value passed as invalidPathReturn if the path is invalid.

Example:

read_json('data.json')
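The interplay between forceUpdate in save_json and invalidPathReturn in read_json can be illustrated with a small standard-library sketch. The helper names save_json_sketch and read_json_sketch are hypothetical, and treating "invalid path" as "no file exists at that path" is an assumption, not documented mbapy behavior.

```python
import json
import os
from typing import Any


def save_json_sketch(path: str, obj: Any, encoding: str = 'utf-8',
                     force_update: bool = True) -> None:
    """Hypothetical helper: write obj as JSON, optionally keeping an existing file."""
    if os.path.isfile(path) and not force_update:
        return  # leave the existing file untouched
    with open(path, 'w', encoding=encoding) as f:
        json.dump(obj, f, ensure_ascii=False)


def read_json_sketch(path: str, encoding: str = 'utf-8',
                     invalid_path_return: Any = None) -> Any:
    """Hypothetical helper: parse a JSON file, or return a fallback if it does not exist."""
    # Assumption: an "invalid" path means there is no file at that location.
    if not os.path.isfile(path):
        return invalid_path_return
    with open(path, 'r', encoding=encoding) as f:
        return json.load(f)
```

With these helpers, read_json_sketch('missing.json', invalid_path_return={}) returns an empty dict instead of raising, which lets callers supply a default in one call.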

save_excel(path:str, obj:List[List[str]], columns:List[str], encoding:str = 'utf-8', forceUpdate = True) -> bool

Save a list of lists as an Excel file.

Parameters:
- path (str): The path where the Excel file will be saved.
- obj (List[List[str]]): The list of lists to be saved as an Excel file.
- columns (List[str]): The column names for the Excel file.
- encoding (str, optional): The encoding of the Excel file. Defaults to 'utf-8'.
- forceUpdate (bool, optional): If True, the file will be saved even if it already exists. Defaults to True.

Returns:
- bool: True if the file was successfully saved, False otherwise.

Example:

data = [['John', '30'], ['Jane', '25']]
columns = ['Name', 'Age']
save_excel('data.xlsx', data, columns)

read_excel(path:str, sheet_name:str = None, ignore_head:bool = True, ignore_first_col:bool = True, invalid_path_return = None) -> Union[pandas.DataFrame, Any]

Reads an Excel file and returns a pandas DataFrame.

Parameters:
- path (str): The path to the Excel file.
- sheet_name (str, optional): The name of the sheet to read. Defaults to None.
- ignore_head (bool, optional): Whether to ignore the first row (header) of the sheet. Defaults to True.
- ignore_first_col (bool, optional): Whether to ignore the first column of the sheet. Defaults to True.
- invalid_path_return (Any, optional): The value to return if the path is invalid. Defaults to None.

Returns:
- pandas.DataFrame: The DataFrame containing the data from the Excel file.
- invalid_path_return (Any): The value specified if the path is invalid.

Example:

read_excel('data.xlsx')

write_sheets(path:str, sheets:Dict[str, pd.DataFrame]) -> None

Write multiple sheets to an Excel file.

Parameters:
- path (str): The path to the Excel file.
- sheets (Dict[str, pd.DataFrame]): A dictionary mapping sheet names to dataframes.

Returns:
- None

Example:

import pandas as pd

data1 = pd.DataFrame({'Name': ['John', 'Jane'], 'Age': [30, 25]})
data2 = pd.DataFrame({'City': ['New York', 'Los Angeles'], 'Country': ['USA', 'USA']})
sheets = {'Sheet1': data1, 'Sheet2': data2}
write_sheets('data.xlsx', sheets)

update_excel(path:str, sheets:Dict[str, pd.DataFrame] = None) -> Union[Dict[str, pd.DataFrame], None]

Updates an Excel file with the given path by adding or modifying sheets.

Parameters:
- path (str): The path of the Excel file.
- sheets (Dict[str, pd.DataFrame], optional): A dictionary of sheets to add or modify. The keys are sheet names and the values are pandas DataFrame objects. Defaults to None.

Returns:
- Union[Dict[str, pd.DataFrame], None]: If the Excel file exists and sheets is None, returns a dictionary containing all the sheets in the Excel file. Otherwise, returns None.

Raises:
- None

Example:

import pandas as pd

data1 = pd.DataFrame({'Name': ['John', 'Jane'], 'Age': [30, 25]})
data2 = pd.DataFrame({'City': ['New York', 'Los Angeles'], 'Country': ['USA', 'USA']})
sheets = {'Sheet1': data1, 'Sheet2': data2}
update_excel('data.xlsx', sheets)

convert_pdf_to_txt(path: str, backend = 'PyPDF2') -> str

Extracts the text from a PDF file and returns it as a string.

Parameters:
- path: The path to the PDF file.
- backend: The backend library to use for PDF conversion. Defaults to 'PyPDF2'.

Returns:
- The extracted text from the PDF file as a string.

Raises:
- NotImplementedError: If the specified backend is not supported.

Example:

convert_pdf_to_txt('document.pdf')

is_jsonable -> bool

This function checks if the given data is JSON serializable.

Params

data (any): The data to be checked.

Returns

bool: True if the data is JSON serializable, False otherwise.

Notes

The function checks if the data is of type str, int, float, bool, or None. These types are JSON serializable.
If the data is a mapping (e.g. dict), the function recursively checks if all values in the mapping are JSON serializable.
If the data is a sequence (e.g. list, tuple), the function recursively checks if all items in the sequence are JSON serializable.
If the data is of any other type, it is not JSON serializable.

Example

import datetime

data1 = "Hello"
print(is_jsonable(data1))  # Output: True

data2 = {"name": "John", "age": 30}
print(is_jsonable(data2))  # Output: True

data3 = [1, 2, 3, {"name": "John"}]
print(is_jsonable(data3))  # Output: True

data4 = {"name": "John", "age": datetime.datetime.now()}
print(is_jsonable(data4))  # Output: False
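The rules listed in the Notes translate almost directly into a recursive check. The following is a minimal sketch of those rules, not the mbapy source; the name is_jsonable_sketch is hypothetical. It narrows "sequence" to list and tuple so that bytes objects are not accepted, and, following the Notes, it inspects only the values of a mapping, not its keys.

```python
import datetime
from collections.abc import Mapping
from typing import Any


def is_jsonable_sketch(data: Any) -> bool:
    """Hypothetical helper: recursive JSON-serializability check following the Notes."""
    # Scalars that map directly onto JSON values.
    if data is None or isinstance(data, (str, int, float, bool)):
        return True
    # Mappings: every value must itself be serializable (keys are not checked here).
    if isinstance(data, Mapping):
        return all(is_jsonable_sketch(value) for value in data.values())
    # Sequences, narrowed to list and tuple in this sketch.
    if isinstance(data, (list, tuple)):
        return all(is_jsonable_sketch(item) for item in data)
    return False


print(is_jsonable_sketch([1, 2, 3, {"name": "John"}]))               # prints: True
print(is_jsonable_sketch({"when": datetime.datetime(2024, 1, 1)}))   # prints: False
```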

convert_pdf_to_txt -> str

Extracts the text from a PDF file and returns it as a string.

Params

path: The path to the PDF file.
backend: The backend library to use for text extraction. Supported values:
- 'PyPDF2' (default)
- 'pdfminer'

Returns

The extracted text from the PDF file as a string.

Raises

NotImplementedError: If the specified backend is not supported.

Example

text = convert_pdf_to_txt('path/to/pdf/file.pdf')
print(text)

text = convert_pdf_to_txt('path/to/pdf/file.pdf', backend='pdfminer')
print(text)
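The backend selection and the NotImplementedError for unknown backends can be sketched as a simple dispatch. This is an illustration, not the mbapy source: the name convert_pdf_to_txt_sketch is hypothetical, and the sketch assumes a PyPDF2 release that provides PdfReader (2.0 or later) and the pdfminer.six package for the 'pdfminer' backend. Each library is imported only when its backend is requested.

```python
def convert_pdf_to_txt_sketch(path: str, backend: str = 'PyPDF2') -> str:
    """Hypothetical helper: extract text from a PDF with the chosen backend."""
    if backend == 'PyPDF2':
        import PyPDF2  # imported lazily; assumes PyPDF2 >= 2.0
        reader = PyPDF2.PdfReader(path)
        # extract_text() may return None or an empty string for image-only pages.
        return '\n'.join(page.extract_text() or '' for page in reader.pages)
    if backend == 'pdfminer':
        from pdfminer.high_level import extract_text  # assumes pdfminer.six
        return extract_text(path)
    raise NotImplementedError(f'Unsupported backend: {backend!r}')
```

Importing each backend inside its own branch means that only the library actually requested has to be installed.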