dcs.analyze

Language: Python

dcs.analyze is a module of the dcs package that contains functions for performing text and numerical analysis on individual columns of pandas.DataFrame objects. Most functions return Python dictionaries containing the calculated statistical metrics and values.

dcs.analyze.analysisForColumn(df, column)

Computes statistics on a pandas.DataFrame column, returning a dictionary containing computed metrics.

The function detects the data type of the selected column (a pandas.Series object) and delegates the actual analysis to the appropriate analysis function:

dcs.analyze.numericalAnalysis() for numerical columns
dcs.analyze.textAnalysis() for string columns
dcs.analyze.dateAnalysis() for datetime columns

Parameters:	df (pandas.DataFrame) – data frame
	column (str) – name of column to analyze
Returns:	dictionary containing statistical metric–value pairs
Return type:	dict
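
The dispatch described above can be sketched roughly as follows. This is an illustration only, not the dcs source: classify_column is a hypothetical helper, and the checks from pandas.api.types are an assumption about how the data type might be detected.

```python
import pandas as pd
from pandas.api import types as ptypes


def classify_column(df, column):
    """Hypothetical helper: name the kind of analysis a column would be routed to."""
    series = df[column]
    # Test datetime first, then numeric; everything else is treated as text.
    if ptypes.is_datetime64_any_dtype(series):
        return "date"
    if ptypes.is_numeric_dtype(series):
        return "numerical"
    return "text"


df = pd.DataFrame({
    "price": [1.5, 2.0, 3.25],
    "name": ["a", "b", "c"],
    "when": pd.to_datetime(["2020-01-01", "2020-06-01", "2021-01-01"]),
})
kinds = {col: classify_column(df, col) for col in df.columns}
print(kinds)
```

With dcs installed, the equivalent call would presumably be dcs.analyze.analysisForColumn(df, "price"), which returns the metrics dictionary rather than a label.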

dcs.analyze.dateAnalysis(series)

Analyzes a pandas.Series of type datetime, returning a dictionary containing computed statistics.

The returned dictionary has the following structure: {metric: value}. The calculated metrics are:

max
min
median
invalid: number of invalid values

The returned dictionary will also contain the general statistical metrics returned by dcs.analyze.genericAnalysis()

Parameters:	series (pandas.Series) – series to analyze
Returns:	dictionary containing statistical metric–value pairs
Return type:	dict
Raises:	ValueError – if the provided pandas.Series is not of datetime data type
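
A minimal sketch of how the date-specific metrics could be computed with plain pandas. date_summary is a hypothetical stand-in, not the dcs implementation, and it assumes that "invalid" counts missing (NaT) values, which the documentation does not spell out.

```python
import pandas as pd
from pandas.api import types as ptypes


def date_summary(series):
    """Hypothetical stand-in for the date-specific metrics listed above."""
    if not ptypes.is_datetime64_any_dtype(series):
        raise ValueError("series is not of datetime data type")
    return {
        "max": series.max(),
        "min": series.min(),
        "median": series.median(),  # missing values are skipped
        "invalid": int(series.isna().sum()),  # assumption: invalid means NaT
    }


dates = pd.Series(pd.to_datetime(["2020-01-01", "2020-01-03", None, "2020-01-02"]))
summary = date_summary(dates)
print(summary)
```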

dcs.analyze.genericAnalysis(series)

Computes various general statistics on a pandas.Series object, returning a dictionary containing computed metrics.

The returned dictionary has the following structure: {metric: value}. The calculated metrics are:

unique_count: total number of unique values
frequencies: a list<tuple<str, int>> object containing top 50 most commonly occurring values and their frequencies
mode: a list of the most frequently occurring value(s)
mode_count: frequency of mode(s)

Parameters:	series (pandas.Series) – series to analyze
Returns:	dictionary containing statistical metric–value pairs
Return type:	dict
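
The general metrics above can be approximated with pandas.Series.value_counts(). The sketch below is an assumption about one reasonable way to derive them; generic_summary and its top_n parameter are hypothetical names, not part of dcs.

```python
import pandas as pd


def generic_summary(series, top_n=50):
    """Hypothetical stand-in for the general metrics listed above."""
    # value_counts() sorts by frequency (descending) and skips missing values.
    counts = series.value_counts()
    mode_count = int(counts.iloc[0]) if len(counts) else 0
    # Several values can tie for most frequent, hence a list of modes.
    modes = [value for value, count in counts.items() if count == mode_count]
    return {
        "unique_count": int(series.nunique()),
        "frequencies": [(str(v), int(c)) for v, c in counts.head(top_n).items()],
        "mode": modes,
        "mode_count": mode_count,
    }


colors = pd.Series(["red", "blue", "red", "green", "blue"])
summary = generic_summary(colors)
print(summary)
```

Here "red" and "blue" tie with two occurrences each, so both appear in mode and mode_count is 2.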

dcs.analyze.numericalAnalysis(series)

Analyzes a pandas.Series of numerical type, returning a dictionary containing computed statistics.

The returned dictionary has the following structure: {metric: value}. In addition to the metrics calculated by pandas.Series.describe() (count, mean, standard deviation, minimum, maximum and quartiles), the calculated metrics are:

range: difference between maximum and minimum
invalid: number of invalid values

The returned dictionary will also contain the general statistical metrics returned by dcs.analyze.genericAnalysis()

Parameters:	series (pandas.Series) – series to analyze
Returns:	dictionary containing statistical metric–value pairs
Return type:	dict
Raises:	`ValueError` – if provided `pandas.Series` not of numerical data type
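
As a rough illustration, the numerical metrics could be assembled from pandas.Series.describe() like this. numeric_summary is a hypothetical stand-in rather than the dcs code, and treating missing (NaN) values as "invalid" is an assumption.

```python
import pandas as pd
from pandas.api import types as ptypes


def numeric_summary(series):
    """Hypothetical stand-in for the numerical metrics listed above."""
    if not ptypes.is_numeric_dtype(series):
        raise ValueError("series is not of numerical data type")
    # describe() yields count, mean, std, min, 25%, 50%, 75% and max.
    summary = series.describe().to_dict()
    summary["range"] = summary["max"] - summary["min"]
    summary["invalid"] = int(series.isna().sum())  # assumption: invalid means NaN
    return summary


values = pd.Series([1.0, 2.0, None, 4.0, 5.0])
summary = numeric_summary(values)
print(summary)
```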

dcs.analyze.textAnalysis(series)

Analyzes a pandas.Series of type str, returning a dictionary containing computed statistics.

The returned dictionary has the following structure: {metric: value}. The calculated metrics are:

word_count_min: minimum number of words in each row
word_count_max: maximum number of words in each row
word_count_average: average number of words in each row
word_length_min: length of shortest word
word_length_max: length of longest word
word_total: total number of words
word_mode: most frequently occurring word
word_mode_frequency: frequency of word_mode
word_frequencies: a list<tuple<str, int>> object containing top 50 words (by frequency) and their counts
invalid: number of invalid values

The returned dictionary will also contain the general statistical metrics returned by dcs.analyze.genericAnalysis()

Parameters:	series (pandas.Series) – series to analyze
Returns:	dictionary containing statistical metric–value pairs
Return type:	dict
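
The word-level metrics can be sketched with collections.Counter. This is an illustration under stated assumptions, not the dcs implementation: word_summary is a hypothetical name, words are assumed to be whitespace-separated tokens, missing rows are assumed to count as "invalid", and the series is assumed to contain at least one word.

```python
from collections import Counter

import pandas as pd


def word_summary(series, top_n=50):
    """Hypothetical stand-in for the word metrics listed above."""
    rows = series.dropna().astype(str)
    # Assumption: a "word" is any whitespace-separated token.
    words_per_row = [row.split() for row in rows]
    row_counts = [len(words) for words in words_per_row]
    all_words = [word for words in words_per_row for word in words]
    frequencies = Counter(all_words).most_common(top_n)
    return {
        "word_count_min": min(row_counts),
        "word_count_max": max(row_counts),
        "word_count_average": sum(row_counts) / len(row_counts),
        "word_length_min": min(len(word) for word in all_words),
        "word_length_max": max(len(word) for word in all_words),
        "word_total": len(all_words),
        "word_mode": frequencies[0][0],
        "word_mode_frequency": frequencies[0][1],
        "word_frequencies": frequencies,
        "invalid": int(series.isna().sum()),  # assumption: invalid means missing
    }


text = pd.Series(["the quick fox", "the dog", None, "a"])
summary = word_summary(text)
print(summary)
```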
