mbapy.stats.df
This module provides utility functions for working with pandas DataFrames.
Functions
get_value(df: pd.DataFrame, column: str, mask: np.array) -> list
Get the values of a specific column in a DataFrame based on a boolean mask.
Params
- df (pd.DataFrame): The input DataFrame.
- column (str): The name of the column.
- mask (np.array): The boolean mask to filter the DataFrame.
Returns
- list: The values of the specified column that satisfy the mask.
Example
df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10]})
mask = np.array([True, False, True, False, True])
get_value(df, 'A', mask) # Output: [1, 3, 5]
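For readers who want to see what this selection amounts to, here is a minimal sketch in plain pandas. The helper name `get_value_sketch` is hypothetical and is not part of mbapy; it only mirrors the documented behaviour (boolean mask in, list of column values out).

```python
import numpy as np
import pandas as pd

def get_value_sketch(df, column, mask):
    """Hypothetical stand-in: return the values of `column` where `mask` is True."""
    # .loc accepts a boolean array for row selection and a label for the column
    return df.loc[mask, column].tolist()

df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [6, 7, 8, 9, 10]})
mask = np.array([True, False, True, False, True])
values = get_value_sketch(df, "A", mask)
print(values)
```

The mask must have one entry per row of the DataFrame.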
pro_bar_data(factors: List[str], tags: List[str], df: pd.DataFrame, **kwargs) -> pd.DataFrame
Calculate the mean, standard error, and count for each combination of factors in a DataFrame.
Params
- factors (List[str]): The names of the columns representing the factors.
- tags (List[str]): The names of the columns to calculate the statistics for.
- df (pd.DataFrame): The input DataFrame.
- kwargs (optional): Additional keyword arguments.
- min_sample_N (int): The minimum number of samples required for a combination to be included in the output. Defaults to 1.
Returns
- pd.DataFrame: A DataFrame containing the calculated statistics for each combination of factors.
Notes
- The output DataFrame will have the same columns as the input DataFrame, with the addition of columns for the mean, standard error, and count of each tag.
Example
df = pd.DataFrame({'factor1': ['A', 'A', 'B', 'B'], 'factor2': ['X', 'Y', 'X', 'Y'], 'y1': [1, 2, 3, 4], 'y2': [5, 6, 7, 8]})
pro_bar_data(['factor1', 'factor2'], ['y1', 'y2'], df)
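The statistics described above can be approximated with a pandas `groupby`. This is a sketch under stated assumptions, not the mbapy implementation: the helper name `bar_stats_sketch` is hypothetical, the standard error is taken as the sample standard deviation (ddof=1) divided by the square root of N, and the output column names (`tag`, `tag_SE`, `tag_N`) are inferred from the make_three_line_table example later in this document.

```python
import numpy as np
import pandas as pd

def bar_stats_sketch(factors, tags, df, min_sample_N=1):
    """Hypothetical stand-in: mean, standard error and count per factor combination."""
    grouped = df.groupby(factors, sort=True)
    parts = []
    for tag in tags:
        # "sem" is the standard error of the mean with ddof=1
        stats = grouped[tag].agg(["mean", "sem", "count"])
        stats.columns = [tag, f"{tag}_SE", f"{tag}_N"]  # assumed naming
        parts.append(stats)
    out = pd.concat(parts, axis=1).reset_index()
    # Drop combinations that have fewer than min_sample_N samples for any tag
    keep = np.ones(len(out), dtype=bool)
    for tag in tags:
        keep &= (out[f"{tag}_N"] >= min_sample_N).to_numpy()
    return out[keep].reset_index(drop=True)

data = pd.DataFrame({
    "factor1": ["A", "A", "A", "B", "B"],
    "y1": [1.0, 2.0, 3.0, 4.0, 6.0],
})
summary = bar_stats_sketch(["factor1"], ["y1"], data)
print(summary)
```

Note that a combination with a single sample has an undefined (NaN) standard error under this definition.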
pro_bar_data_R(factors: List[str], tags: List[str], df: pd.DataFrame, suffixs: List[str], **kwargs) -> Callable
A decorator that wraps a function to be applied to each combination of factors in a DataFrame.
Params
- factors (List[str]): The names of the columns representing the factors.
- tags (List[str]): The names of the columns to apply the function to.
- df (pd.DataFrame): The input DataFrame.
- suffixs (List[str]): The suffixes to append to the tags in the output DataFrame.
- kwargs (optional): Additional keyword arguments.
Returns
- Callable: The wrapped function.
Notes
- The wrapped function should take a single argument, which is a numpy array of values for a specific combination of factors.
- The wrapped function should return a list of values, with the length equal to the number of suffixes.
Example
@pro_bar_data_R(['factor1', 'factor2'], ['y1', 'y2'], df, ['_mean', '_SE'])
def calc_stats(values):
    return [np.mean(values), np.std(values, ddof=1)/np.sqrt(len(values))]
calc_stats(df.loc[(df['factor1'] == 'A') & (df['factor2'] == 'X'), ['y1', 'y2']].values)
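The contract stated in the notes (one array of values in, one result per suffix out) can be illustrated without the decorator itself. The sketch below uses a plain helper, `apply_per_group`, which is hypothetical and not part of mbapy; the assumption that each result lands in a column named tag + suffix is ours.

```python
import numpy as np
import pandas as pd

def apply_per_group(func, factors, tags, df, suffixs):
    """Hypothetical stand-in: call `func` on each tag's values per factor combination."""
    rows = []
    for key, group in df.groupby(factors, sort=True):
        key = key if isinstance(key, tuple) else (key,)
        row = dict(zip(factors, key))
        for tag in tags:
            results = func(group[tag].to_numpy())
            # The function must return exactly one value per suffix
            assert len(results) == len(suffixs)
            for suffix, value in zip(suffixs, results):
                row[tag + suffix] = value  # assumed column naming
        rows.append(row)
    return pd.DataFrame(rows)

def calc_stats(values):
    return [np.mean(values), np.std(values, ddof=1) / np.sqrt(len(values))]

frame = pd.DataFrame({"factor1": ["A", "A", "B", "B"], "y1": [1.0, 3.0, 5.0, 9.0]})
table = apply_per_group(calc_stats, ["factor1"], ["y1"], frame, ["_mean", "_SE"])
print(table)
```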
get_df_data(factors: Dict[str, List[str]], tags: List[str], df: pd.DataFrame, include_factors: bool = True) -> pd.DataFrame
Return a subset of the input DataFrame, filtered by the given factors and tags.
Params
- factors (Dict[str, List[str]]): A dictionary containing the factors to filter by. The keys are column names in the DataFrame and the values are lists of values to filter by in that column.
- tags (List[str]): A list of column names to include in the output DataFrame.
- df (pd.DataFrame): The input DataFrame to filter.
- include_factors (bool, optional): Whether to include the factors in the output DataFrame. Defaults to True.
Returns
- pd.DataFrame: A subset of the input DataFrame, filtered by the given factors and tags.
Example
df = pd.DataFrame({'factor1': ['A', 'A', 'B', 'B'], 'factor2': ['X', 'Y', 'X', 'Y'], 'y1': [1, 2, 3, 4], 'y2': [5, 6, 7, 8]})
get_df_data({'factor1': ['A'], 'factor2': ['X']}, ['y1', 'y2'], df)
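The filtering described above resembles a chain of `isin` tests. The following sketch assumes that the per-column conditions are combined with a logical AND and that factor columns come first in the output; `filter_sketch` is a hypothetical name and not the mbapy function.

```python
import numpy as np
import pandas as pd

def filter_sketch(factors, tags, df, include_factors=True):
    """Hypothetical stand-in: keep rows whose factor columns hold allowed values."""
    mask = np.ones(len(df), dtype=bool)
    for column, allowed in factors.items():
        # Assumption: a row must satisfy every factor condition
        mask &= df[column].isin(allowed).to_numpy()
    columns = (list(factors) if include_factors else []) + list(tags)
    return df.loc[mask, columns]

df = pd.DataFrame({
    "factor1": ["A", "A", "B", "B"],
    "factor2": ["X", "Y", "X", "Y"],
    "y1": [1, 2, 3, 4],
    "y2": [5, 6, 7, 8],
})
subset = filter_sketch({"factor1": ["A"], "factor2": ["X"]}, ["y1", "y2"], df)
print(subset)
```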
sort_df_factors(factors: List[str], tags: List[str], df: pd.DataFrame) -> pd.DataFrame
Sort each combination of factors in a DataFrame.
Params
- factors (List[str]): The names of the columns representing the factors.
- tags (List[str]): The names of the columns to include in the output DataFrame.
- df (pd.DataFrame): The input DataFrame.
Returns
- pd.DataFrame: The sorted DataFrame.
Example
df = pd.DataFrame({'factor1': ['A', 'A', 'B', 'B'], 'factor2': ['X', 'Y', 'X', 'Y'], 'y1': [1, 2, 3, 4], 'y2': [5, 6, 7, 8]})
sort_df_factors(['factor2', 'factor1'], ['y1', 'y2'], df)
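One plausible reading of this function is a row sort by the factor columns in the order given. The sketch below shows that reading with `DataFrame.sort_values`; the helper name `sort_sketch` is hypothetical, and whether mbapy orders rows in exactly this way is an assumption.

```python
import pandas as pd

def sort_sketch(factors, tags, df):
    """Hypothetical stand-in: sort rows by the factor columns, in the order given."""
    # The first factor is the primary sort key, the second breaks ties, and so on
    return df.sort_values(by=factors)[factors + tags]

df = pd.DataFrame({
    "factor1": ["A", "A", "B", "B"],
    "factor2": ["X", "Y", "X", "Y"],
    "y1": [1, 2, 3, 4],
    "y2": [5, 6, 7, 8],
})
sorted_df = sort_sketch(["factor2", "factor1"], ["y1", "y2"], df)
print(sorted_df)
```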
remove_simi(tag: str, df: pd.DataFrame, sh: float = 1., backend: str = 'numpy-array', tensor = None, device = 'cuda') -> Tuple[pd.DataFrame, List[int]]
Remove similar values from a column in a DataFrame.
Params
- tag (str): The name of the column to remove similar values from.
- df (pd.DataFrame): The input DataFrame.
- sh (float, optional): The threshold for similarity. Values with a difference less than or equal to this threshold will be considered similar. Defaults to 1.
- backend (str, optional): The backend to use for the computation. Supported backends are 'numpy-mat', 'numpy-array', 'torch-array', and 'ba-cpp'. Defaults to 'numpy-array'.
- tensor (optional): The tensor to use for the computation if the backend is 'torch-array'. Defaults to None.
- device (str, optional): The device to use for the computation if the backend is 'torch-array'. Defaults to 'cuda'.
Returns
- Tuple[pd.DataFrame, List[int]]: A tuple containing the modified DataFrame and a list of indices of the removed values.
Example
df = pd.DataFrame({'d': [1, 2, 3, 3, 5, 6, 8, 13]})
remove_simi('d', df, 2.1, 'numpy-array')
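To make the threshold rule concrete, here is a small greedy sketch in pure Python and pandas. It assumes that rows are scanned in order and that a row is dropped when its value lies within `sh` of any value already kept; both the scan order and the helper name `drop_similar_sketch` are assumptions, and none of the backends listed above is reproduced here.

```python
import pandas as pd

def drop_similar_sketch(tag, df, sh=1.0):
    """Hypothetical stand-in: greedily drop rows whose value is within `sh` of a kept value."""
    values = df[tag].to_numpy(dtype=float)
    kept, removed = [], []
    for pos, value in enumerate(values):
        if any(abs(value - values[k]) <= sh for k in kept):
            removed.append(pos)  # too close to something already kept
        else:
            kept.append(pos)
    return df.iloc[kept], removed

df = pd.DataFrame({"d": [1, 2, 3, 3, 5, 6, 8, 13]})
filtered, removed = drop_similar_sketch("d", df, 2.1)
print(filtered["d"].tolist(), removed)
```

The returned list holds row positions, which coincide with index labels only for a default RangeIndex.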
interp(long_one: pd.Series, short_one: pd.Series) -> np.ndarray
Interpolate a short pandas Series to have the same length as a long pandas Series.
Params
- long_one (pd.Series): The long pandas Series.
- short_one (pd.Series): The short pandas Series.
Returns
- np.ndarray: The values of short_one, interpolated to the length of long_one.
Example
long_one = pd.Series([1, 2, 3, 4, 5])
short_one = pd.Series([1, 3, 5])
interp(long_one, short_one) # Output: array([1., 2., 3., 4., 5.])
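Length-matching interpolation of this kind can be sketched with `numpy.interp`. The version below assumes linear interpolation over evenly spaced positions that share the same first and last points; `interp_sketch` is a hypothetical name, and the exact placement of sample points in mbapy may differ.

```python
import numpy as np
import pandas as pd

def interp_sketch(long_one, short_one):
    """Hypothetical stand-in: linearly stretch short_one to len(long_one) points."""
    # Original sample positions of the short series: 0, 1, ..., n-1
    old_x = np.arange(len(short_one))
    # New positions span the same range with as many points as long_one
    new_x = np.linspace(0, len(short_one) - 1, num=len(long_one))
    return np.interp(new_x, old_x, short_one.to_numpy(dtype=float))

long_one = pd.Series([1, 2, 3, 4, 5])
short_one = pd.Series([1, 3, 5])
stretched = interp_sketch(long_one, short_one)
print(stretched)
```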
merge_col2row(df: pd.DataFrame, cols: List[str], new_cols_name: str, value_name: str) -> pd.DataFrame
Merge columns in a DataFrame to rows.
Params
- df (pd.DataFrame): The input DataFrame.
- cols (List[str]): The names of the columns to merge.
- new_cols_name (str): The name of the new column that will contain the column names.
- value_name (str): The name of the new column that will contain the values.
Returns
- pd.DataFrame: The modified DataFrame.
Example
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
merge_col2row(df, ['A', 'B'], 'new_col', 'value_col')
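This wide-to-long reshaping is close to what `DataFrame.melt` does. The sketch below assumes that columns not listed in `cols` are carried along as identifier columns; `merge_sketch` is a hypothetical name, and the row order of the real function is not guaranteed to match.

```python
import pandas as pd

def merge_sketch(df, cols, new_cols_name, value_name):
    """Hypothetical stand-in: stack `cols` into name/value column pairs."""
    # Assumption: every other column is repeated for each stacked row
    id_vars = [c for c in df.columns if c not in cols]
    return df.melt(id_vars=id_vars, value_vars=cols,
                   var_name=new_cols_name, value_name=value_name)

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
long_df = merge_sketch(df, ["A", "B"], "new_col", "value_col")
print(long_df)
```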
make_three_line_table(factors: List[str], tags: List[str], df: pd.DataFrame, float_fmt: str = '.3f', t_samples: int = 30) -> pd.DataFrame
Create a three-line table from the input DataFrame for the specified factors and tags.
Params
- factors (List[str]): The factors to include in the table.
- tags (List[str]): The tags to include in the table.
- df (pd.DataFrame): The input DataFrame containing the data.
- float_fmt (str, optional): The format for floating point numbers. Defaults to '.3f'.
- t_samples (int, optional): The threshold for the number of samples. Defaults to 30.
Returns
- pd.DataFrame: The three-line table.
Notes
- The function calculates the three-line table using the input factors and tags, and the provided data frame.
- It applies formatting to the floating point numbers based on the specified float format.
- It uses a threshold for the number of samples to determine the confidence interval.
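As an illustration of the float formatting mentioned in the notes, the sketch below applies a format specification such as '.3f' to every float column. It covers formatting only; the helper name `format_floats_sketch` is hypothetical, and how the real function lays out the table or derives confidence intervals is not shown or assumed here.

```python
import pandas as pd

def format_floats_sketch(df, float_fmt=".3f"):
    """Hypothetical stand-in: render float columns as strings using float_fmt."""
    out = df.copy()
    for column in out.select_dtypes(include="float").columns:
        # format(0.1, ".3f") gives the string "0.100"
        out[column] = out[column].map(lambda v: format(v, float_fmt))
    return out

raw = pd.DataFrame({"tag1": [0.1, 0.25], "tag1_N": [20, 25]})
formatted = format_floats_sketch(raw)
print(formatted)
```

Integer columns such as the sample counts are left untouched by this sketch.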
Example
import pandas as pd
from mbapy.stats.df import make_three_line_table
# Create sample data
data = {
'factor1': [1, 2, 3, 4],
'factor2': [5, 6, 7, 8],
'tag1': [0.1, 0.2, 0.3, 0.4],
'tag2': [0.5, 0.6, 0.7, 0.8],
'tag1_SE': [0.01, 0.02, 0.03, 0.04],
'tag2_SE': [0.05, 0.06, 0.07, 0.08],
'tag1_N': [20, 25, 30, 35],
'tag2_N': [40, 45, 50, 55]
}
df = pd.DataFrame(data)
factors = ['factor1', 'factor2']
tags = ['tag1', 'tag2']
# Create three-line table
result = make_three_line_table(factors, tags, df)
print(result)
Notes
- The functions in this module are designed to work with pandas DataFrames.
- Some functions have optional parameters that allow for customization of the behavior.
- The examples provided demonstrate the usage of each function.