mbapy.stats.df

This module provides utility functions for working with pandas DataFrames.

Functions

get_value(df: pd.DataFrame, column: str, mask: np.array) -> list

Get the values of a specific column in a DataFrame based on a boolean mask.

Params

df (pd.DataFrame): The input DataFrame.
column (str): The name of the column.
mask (np.array): The boolean mask to filter the DataFrame.

Returns

list: The values of the specified column that satisfy the mask.

Example

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10]})
mask = np.array([True, False, True, False, True])
get_value(df, 'A', mask)  # Output: [1, 3, 5]
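
The mask-based selection described above can be reproduced with plain pandas. The following is a minimal sketch, not the library's implementation; the helper name get_value_sketch is hypothetical.

```python
import numpy as np
import pandas as pd

def get_value_sketch(df: pd.DataFrame, column: str, mask: np.ndarray) -> list:
    # Hypothetical stand-in: keep rows where mask is True,
    # then return the chosen column as a plain Python list.
    return df.loc[mask, column].tolist()

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10]})
mask = np.array([True, False, True, False, True])
selected = get_value_sketch(df, 'A', mask)
print(selected)  # [1, 3, 5]
```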

pro_bar_data(factors: List[str], tags: List[str], df: pd.DataFrame, **kwargs) -> pd.DataFrame

Calculate the mean, standard error, and count for each combination of factors in a DataFrame.

Params

factors (List[str]): The names of the columns representing the factors.
tags (List[str]): The names of the columns to calculate the statistics for.
df (pd.DataFrame): The input DataFrame.
kwargs (optional): Additional keyword arguments.
- min_sample_N (int): The minimum number of samples required for a combination to be included in the output. Defaults to 1.

Returns

pd.DataFrame: A DataFrame containing the calculated statistics for each combination of factors.

Notes
- The output DataFrame will have the same columns as the input DataFrame, with the addition of columns for the mean, standard error, and count of each tag.

Example

df = pd.DataFrame({'factor1': ['A', 'A', 'B', 'B'], 'factor2': ['X', 'Y', 'X', 'Y'], 'y1': [1, 2, 3, 4], 'y2': [5, 6, 7, 8]})
pro_bar_data(['factor1', 'factor2'], ['y1', 'y2'], df)
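
To make the statistics concrete, here is a minimal sketch of computing the mean, standard error, and count per factor combination with a pandas groupby. It is an illustration under assumptions, not the library's code: the helper name bar_stats_sketch is hypothetical, and the '_SE' and '_N' column suffixes are assumed. The sample data has two observations per group so that the standard error is defined.

```python
import pandas as pd

def bar_stats_sketch(factors, tags, df, min_sample_N=1):
    # Hypothetical helper: one output row per combination of factor values.
    grouped = df.groupby(factors)
    out = grouped[tags].mean()                    # mean of each tag
    for tag in tags:
        out[tag + '_SE'] = grouped[tag].sem()     # standard error (ddof=1)
        out[tag + '_N'] = grouped[tag].count()    # number of samples
    out = out.reset_index()
    # Drop combinations that have fewer than min_sample_N samples.
    enough = (out[[t + '_N' for t in tags]] >= min_sample_N).all(axis=1)
    return out[enough].reset_index(drop=True)

data = pd.DataFrame({'factor1': ['A', 'A', 'B', 'B'], 'y1': [1, 3, 5, 9]})
stats = bar_stats_sketch(['factor1'], ['y1'], data)
print(stats)
```

Setting min_sample_N=3 on this data would return an empty DataFrame, because each group holds only two samples.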

pro_bar_data_R(factors: List[str], tags: List[str], df: pd.DataFrame, suffixs: List[str], **kwargs) -> Callable

A decorator that wraps a function to be applied to each combination of factors in a DataFrame.

Params

factors (List[str]): The names of the columns representing the factors.
tags (List[str]): The names of the columns to apply the function to.
df (pd.DataFrame): The input DataFrame.
suffixs (List[str]): The suffixes to append to the tags in the output DataFrame.
kwargs (optional): Additional keyword arguments.

Returns

Callable: The wrapped function.

Notes
- The wrapped function should take a single argument, which is a numpy array of values for a specific combination of factors.
- The wrapped function should return a list of values, with the length equal to the number of suffixes.

Example

@pro_bar_data_R(['factor1', 'factor2'], ['y1', 'y2'], df, ['_mean', '_SE'])
def calc_stats(values):  
    return [np.mean(values), np.std(values, ddof=1)/np.sqrt(len(values))]

calc_stats(df.loc[(df['factor1'] == 'A') & (df['factor2'] == 'X'), ['y1', 'y2']].values)
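
The contract in the notes (one array in, one value per suffix out) can be illustrated without the decorator. The sketch below is a plain helper, not the library's wrapper; the name apply_per_group_sketch is hypothetical, and the way results are assembled into columns is an assumption.

```python
import numpy as np
import pandas as pd

def apply_per_group_sketch(factors, tags, df, suffixs, func):
    # Hypothetical helper: call func on each tag's values within each
    # factor combination and store the results under tag + suffix.
    rows = []
    for keys, group in df.groupby(factors):
        keys = keys if isinstance(keys, tuple) else (keys,)
        row = dict(zip(factors, keys))
        for tag in tags:
            results = func(group[tag].to_numpy())
            for suffix, value in zip(suffixs, results):
                row[tag + suffix] = value
        rows.append(row)
    return pd.DataFrame(rows)

def mean_and_se(values):
    # Returns one value per suffix: the mean and the standard error.
    return [np.mean(values), np.std(values, ddof=1) / np.sqrt(len(values))]

data = pd.DataFrame({'factor1': ['A', 'A', 'B', 'B'], 'y1': [1, 3, 5, 9]})
result = apply_per_group_sketch(['factor1'], ['y1'], data, ['_mean', '_SE'], mean_and_se)
print(result)
```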

get_df_data(factors: Dict[str, List[str]], tags: List[str], df: pd.DataFrame, include_factors: bool = True) -> pd.DataFrame

Return a subset of the input DataFrame, filtered by the given factors and tags.

Params

factors (Dict[str, List[str]]): A dictionary containing the factors to filter by. The keys are column names in the DataFrame and the values are lists of values to filter by in that column.
tags (List[str]): A list of column names to include in the output DataFrame.
df (pd.DataFrame): The input DataFrame to filter.
include_factors (bool, optional): Whether to include the factors in the output DataFrame. Defaults to True.

Returns

pd.DataFrame: A subset of the input DataFrame, filtered by the given factors and tags.

Example

df = pd.DataFrame({'factor1': ['A', 'A', 'B', 'B'], 'factor2': ['X', 'Y', 'X', 'Y'], 'y1': [1, 2, 3, 4], 'y2': [5, 6, 7, 8]})
get_df_data({'factor1': ['A'], 'factor2': ['X']}, ['y1', 'y2'], df)
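
A filter of this kind can be sketched with Series.isin. This is an illustration only: the helper name filter_by_factors_sketch is hypothetical, and combining the per-column conditions with a logical AND is an assumption about the intended behavior.

```python
import pandas as pd

def filter_by_factors_sketch(factors, tags, df, include_factors=True):
    # Hypothetical helper: a row is kept only if every factor column
    # holds one of the allowed values (conditions combined with AND).
    mask = pd.Series(True, index=df.index)
    for column, allowed in factors.items():
        mask &= df[column].isin(allowed)
    columns = (list(factors) if include_factors else []) + list(tags)
    return df.loc[mask, columns]

df = pd.DataFrame({'factor1': ['A', 'A', 'B', 'B'], 'factor2': ['X', 'Y', 'X', 'Y'],
                   'y1': [1, 2, 3, 4], 'y2': [5, 6, 7, 8]})
subset = filter_by_factors_sketch({'factor1': ['A'], 'factor2': ['X']}, ['y1', 'y2'], df)
print(subset)
```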

sort_df_factors(factors: List[str], tags: List[str], df: pd.DataFrame) -> pd.DataFrame

Sort each combination of factors in a DataFrame.

Params

factors (List[str]): The names of the columns representing the factors.
tags (List[str]): The names of the columns to include in the output DataFrame.
df (pd.DataFrame): The input DataFrame.

Returns

pd.DataFrame: The sorted DataFrame.

Example

df = pd.DataFrame({'factor1': ['A', 'A', 'B', 'B'], 'factor2': ['X', 'Y', 'X', 'Y'], 'y1': [1, 2, 3, 4], 'y2': [5, 6, 7, 8]})
sort_df_factors(['factor2', 'factor1'], ['y1', 'y2'], df)
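
One plausible reading of this function is a sort by the factor columns in the order given. The sketch below uses DataFrame.sort_values under that assumption; the helper name sort_by_factors_sketch is hypothetical and the library's actual ordering rules may differ.

```python
import pandas as pd

def sort_by_factors_sketch(factors, tags, df):
    # Hypothetical helper: sort rows by the factor columns (first factor
    # is the primary key) and keep only the factor and tag columns.
    ordered = df.sort_values(by=factors)
    return ordered[factors + tags].reset_index(drop=True)

df = pd.DataFrame({'factor1': ['A', 'A', 'B', 'B'], 'factor2': ['X', 'Y', 'X', 'Y'],
                   'y1': [1, 2, 3, 4], 'y2': [5, 6, 7, 8]})
sorted_df = sort_by_factors_sketch(['factor2', 'factor1'], ['y1', 'y2'], df)
print(sorted_df)
```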

remove_simi(tag: str, df: pd.DataFrame, sh: float = 1., backend: str = 'numpy-array', tensor = None, device = 'cuda') -> Tuple[pd.DataFrame, List[int]]

Remove similar values from a column in a DataFrame.

Params

tag (str): The name of the column to remove similar values from.
df (pd.DataFrame): The input DataFrame.
sh (float, optional): The threshold for similarity. Values with a difference less than or equal to this threshold will be considered similar. Defaults to 1.
backend (str, optional): The backend to use for the computation. Supported backends are 'numpy-mat', 'numpy-array', 'torch-array', and 'ba-cpp'. Defaults to 'numpy-array'.
tensor (optional): The tensor to use for the computation if the backend is 'torch-array'. Defaults to None.
device (str, optional): The device to use for the computation if the backend is 'torch-array'. Defaults to 'cuda'.

Returns

Tuple[pd.DataFrame, List[int]]: A tuple containing the modified DataFrame and a list of indices of the removed values.

Example

df = pd.DataFrame({'d': [1, 2, 3, 3, 5, 6, 8, 13]})
remove_simi('d', df, 2.1, 'numpy-array')
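
To show what threshold-based removal can look like, here is a minimal pure-pandas sketch. It assumes a greedy rule: values are visited in ascending order and a value is removed when it lies within sh of the last kept value. That rule and the helper name remove_similar_sketch are assumptions; the library's backends may behave differently.

```python
import pandas as pd

def remove_similar_sketch(tag, df, sh=1.0):
    # Hypothetical helper: greedy pass over the values in ascending order.
    ordered = df.sort_values(by=tag, kind='stable')
    kept, removed = [], []
    last_kept = None
    for idx, value in ordered[tag].items():
        if last_kept is not None and value - last_kept <= sh:
            removed.append(idx)   # too close to the last kept value
        else:
            kept.append(idx)
            last_kept = value
    return df.loc[kept], removed

df = pd.DataFrame({'d': [1, 2, 3, 3, 5, 6, 8, 13]})
filtered, removed_idx = remove_similar_sketch('d', df, 2.1)
print(filtered['d'].tolist())  # [1, 5, 8, 13]
print(removed_idx)             # [1, 2, 3, 5]
```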

interp(long_one: pd.Series, short_one: pd.Series) -> np.ndarray

Interpolate a short pandas Series to have the same length as a long pandas Series.

Params

long_one (pd.Series): The long pandas Series.
short_one (pd.Series): The short pandas Series.

Returns

np.ndarray: The values of short_one, interpolated to the length of long_one.

Example

long_one = pd.Series([1, 2, 3, 4, 5])
short_one = pd.Series([1, 3, 5])
interp(long_one, short_one)  # Output: array([1., 2., 3., 4., 5.])
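
The stretching step can be sketched with numpy.interp. The sketch assumes linear interpolation over evenly spaced positions; the helper name interp_sketch is hypothetical and is not the library's implementation.

```python
import numpy as np
import pandas as pd

def interp_sketch(long_one: pd.Series, short_one: pd.Series) -> np.ndarray:
    # Hypothetical helper: place the short series on positions 0..n-1,
    # then sample it at len(long_one) evenly spaced positions.
    x_old = np.arange(len(short_one))
    x_new = np.linspace(0, len(short_one) - 1, num=len(long_one))
    return np.interp(x_new, x_old, short_one.to_numpy(dtype=float))

long_one = pd.Series([1, 2, 3, 4, 5])
short_one = pd.Series([1, 3, 5])
stretched = interp_sketch(long_one, short_one)
print(stretched)  # [1. 2. 3. 4. 5.]
```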

merge_col2row(df: pd.DataFrame, cols: List[str], new_cols_name: str, value_name: str) -> pd.DataFrame

Merge columns in a DataFrame to rows.

Params

df (pd.DataFrame): The input DataFrame.
cols (List[str]): The names of the columns to merge.
new_cols_name (str): The name of the new column that will contain the column names.
value_name (str): The name of the new column that will contain the values.

Returns

pd.DataFrame: The modified DataFrame.

Example

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
merge_col2row(df, ['A', 'B'], 'new_col', 'value_col')
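
Reshaping columns into rows is what pandas calls melting. The sketch below shows the equivalent DataFrame.melt call; keeping the remaining columns (here 'C') as identifier columns is an assumption about the intended output.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
cols = ['A', 'B']

# Columns not being merged are repeated alongside each melted value.
id_cols = [c for c in df.columns if c not in cols]
long_df = df.melt(id_vars=id_cols, value_vars=cols,
                  var_name='new_col', value_name='value_col')
print(long_df)
```

The result has six rows: three for the values of 'A' followed by three for the values of 'B'.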

make_three_line_table(factors: List[str], tags: List[str], df: pd.DataFrame, float_fmt: str = '.3f', t_samples: int = 30) -> pd.DataFrame

Create a three-line table from the input DataFrame for the specified factors and tags.

Params

factors (List[str]): The names of the factor columns to include in the table.
tags (List[str]): The names of the tag columns to include in the table.
df (pd.DataFrame): The input DataFrame.
float_fmt (str, optional): The format for floating-point numbers. Defaults to '.3f'.
t_samples (int, optional): The threshold for the number of samples. Defaults to 30.

Returns

pd.DataFrame: The three-line table.

Notes

Floating-point values are formatted according to float_fmt.
t_samples is the sample-count threshold used when determining the confidence interval.

Example

import pandas as pd
from mbapy.stats.df import make_three_line_table

# Create sample data
data = {
    'factor1': [1, 2, 3, 4],
    'factor2': [5, 6, 7, 8],
    'tag1': [0.1, 0.2, 0.3, 0.4],
    'tag2': [0.5, 0.6, 0.7, 0.8],
    'tag1_SE': [0.01, 0.02, 0.03, 0.04],
    'tag2_SE': [0.05, 0.06, 0.07, 0.08],
    'tag1_N': [20, 25, 30, 35],
    'tag2_N': [40, 45, 50, 55]
}
df = pd.DataFrame(data)

factors = ['factor1', 'factor2']
tags = ['tag1', 'tag2']

# Create three-line table
result = make_three_line_table(factors, tags, df)
print(result)
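
The float_fmt parameter can be illustrated in isolation. The sketch below formats each tag as "mean ± SE" using a format spec such as '.3f'. It is a simplified illustration under assumptions: the helper name mean_se_table_sketch is hypothetical, the cell layout is assumed, and the t_samples confidence-interval logic is not modeled.

```python
import pandas as pd

def mean_se_table_sketch(factors, tags, df, float_fmt='.3f'):
    # Hypothetical helper: one text cell per tag, built from the tag's
    # value and its '<tag>_SE' column, both formatted with float_fmt.
    out = df[factors].copy()
    for tag in tags:
        out[tag] = [f"{m:{float_fmt}} ± {se:{float_fmt}}"
                    for m, se in zip(df[tag], df[tag + '_SE'])]
    return out

sample = pd.DataFrame({
    'factor1': [1, 2],
    'tag1': [0.1, 0.2],
    'tag1_SE': [0.01, 0.02],
})
table = mean_se_table_sketch(['factor1'], ['tag1'], sample)
print(table)
```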

Notes

The functions in this module are designed to work with pandas DataFrames.
Some functions have optional parameters that allow for customization of the behavior.
The examples provided demonstrate the usage of each function.