dcs.analyze
Language: Python
dcs.analyze is a module of the dcs package that contains functions for performing text and numerical analysis on individual columns of pandas.DataFrame objects. Most functions return Python dictionaries containing the calculated statistical metrics and values.
-
dcs.analyze.analysisForColumn(df, column)
Computes statistics on a pandas.DataFrame column, returning a dictionary containing the computed metrics.
The function detects the data type of the pandas.Series object and delegates the actual analysis to the appropriate analysis function:
- dcs.analyze.numericalAnalysis() for numerical data
- dcs.analyze.textAnalysis() for string data
- dcs.analyze.dateAnalysis() for datetime data
Parameters:
- df (pandas.DataFrame) – data frame
- column (str) – name of column to analyze
Returns: dictionary containing statistical metric–value pairs
Return type: dict
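The dispatch described for analysisForColumn can be sketched roughly as follows. This is only an illustration of dtype-based delegation, not the dcs implementation: the helper name `pick_analysis` and the sample data frame are hypothetical, and the type checks rely on pandas' own `pandas.api.types` helpers.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def pick_analysis(series: pd.Series) -> str:
    """Hypothetical helper: name the analysis function that matches the dtype."""
    if is_datetime64_any_dtype(series):
        return "dateAnalysis"
    if is_numeric_dtype(series):
        return "numericalAnalysis"
    # Assumption: anything that is neither datetime nor numeric is treated as text.
    return "textAnalysis"


# Hypothetical sample data with one column of each kind.
df = pd.DataFrame(
    {
        "price": [1.5, 2.0, 3.25],
        "name": ["apple", "pear", "fig"],
        "sold": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
    }
)
chosen = {column: pick_analysis(df[column]) for column in df.columns}
```

Checking for datetime before numeric keeps the two branches unambiguous; how dcs orders or implements its own checks is not stated in this reference.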
-
dcs.analyze.dateAnalysis(series)
Analyzes a pandas.Series of type datetime, returning a dictionary containing the computed statistics.
The returned dictionary has the following structure: {metric: value}. The calculated metrics are:
- max
- min
- median
- invalid: number of invalid values
The returned dictionary will also contain the general statistical metrics returned by dcs.analyze.genericAnalysis().
Parameters: series (pandas.Series) – series to analyze
Returns: dictionary containing statistical metric–value pairs
Return type: dict
Raises: ValueError – if the provided pandas.Series is not of datetime data type
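To make the listed metrics concrete, here is a minimal sketch of how they might be computed with plain pandas. The function `date_summary` is a hypothetical stand-in, not dcs code, and it assumes that "invalid" means missing (NaT) entries, which this reference does not define.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def date_summary(series: pd.Series) -> dict:
    """Hypothetical stand-in for the date metrics listed above."""
    if not is_datetime64_any_dtype(series):
        raise ValueError("series is not of datetime data type")
    valid = series.dropna()
    return {
        "max": valid.max(),
        "min": valid.min(),
        # The 0.5 quantile is the median of the valid timestamps.
        "median": valid.quantile(0.5),
        # Assumption: "invalid" counts missing (NaT) entries.
        "invalid": int(series.isna().sum()),
    }


dates = pd.Series(pd.to_datetime(["2021-03-01", None, "2021-03-05", "2021-03-03"]))
summary = date_summary(dates)
```

Passing a non-datetime series to this sketch raises ValueError, mirroring the documented behaviour.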
-
dcs.analyze.genericAnalysis(series)
Computes various general statistics on a pandas.Series object, returning a dictionary containing the computed metrics.
The returned dictionary has the following structure: {metric: value}. The calculated metrics are:
- unique_count: total number of unique values
- frequencies: a list<tuple<str, int>> object containing the 50 most commonly occurring values and their frequencies
- mode: a list of the most frequently occurring value(s)
- mode_count: frequency of mode(s)
Parameters: series (pandas.Series) – series to analyze
Returns: dictionary containing statistical metric–value pairs
Return type: dict
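The general metrics above map fairly directly onto `pandas.Series.value_counts()`. The following is a sketch under assumptions, not the dcs implementation: `generic_summary` is a hypothetical name, missing values are assumed to be excluded from every count, and values are converted to `str` only to match the documented list<tuple<str, int>> shape.

```python
import pandas as pd


def generic_summary(series: pd.Series, top: int = 50) -> dict:
    """Hypothetical stand-in for the general metrics listed above."""
    counts = series.value_counts()  # sorted by frequency, missing values dropped
    if counts.empty:
        return {"unique_count": 0, "frequencies": [], "mode": [], "mode_count": 0}
    mode_count = int(counts.iloc[0])
    return {
        "unique_count": int(series.nunique()),
        "frequencies": [(str(value), int(n)) for value, n in counts.head(top).items()],
        # Several values can tie for the highest frequency, hence a list.
        "mode": [value for value, n in counts.items() if n == mode_count],
        "mode_count": mode_count,
    }


colours = pd.Series(["red", "blue", "red", "green", "blue", None])
summary = generic_summary(colours)
```

In this sample "red" and "blue" tie, which is why mode is documented as a list rather than a single value.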
-
dcs.analyze.numericalAnalysis(series)
Analyzes a pandas.Series of numerical type, returning a dictionary containing the computed statistics.
The returned dictionary has the following structure: {metric: value}. In addition to the metrics calculated by pandas.Series.describe(), which include quartiles and various averages, the calculated metrics are:
- range: difference between maximum and minimum
- invalid: number of invalid values
The returned dictionary will also contain the general statistical metrics returned by dcs.analyze.genericAnalysis().
Parameters: series (pandas.Series) – series to analyze
Returns: dictionary containing statistical metric–value pairs
Return type: dict
Raises: ValueError – if the provided pandas.Series is not of numerical data type
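As an illustration of how the extra metrics sit on top of `pandas.Series.describe()`, consider the sketch below. `numerical_summary` is a hypothetical name rather than the dcs function, and "invalid" is assumed to mean missing (NaN) entries.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def numerical_summary(series: pd.Series) -> dict:
    """Hypothetical stand-in: describe() output plus range and invalid."""
    if not is_numeric_dtype(series):
        raise ValueError("series is not of numerical data type")
    # describe() yields count, mean, std, min, 25%, 50%, 75% and max.
    summary = series.describe().to_dict()
    summary["range"] = summary["max"] - summary["min"]
    # Assumption: "invalid" counts missing (NaN) entries.
    summary["invalid"] = int(series.isna().sum())
    return summary


values = pd.Series([2.0, 4.0, None, 10.0])
summary = numerical_summary(values)
```

Because describe() ignores missing values, the count in this sample is 3 while invalid is 1.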
-
dcs.analyze.textAnalysis(series)
Analyzes a pandas.Series of type str, returning a dictionary containing the computed statistics.
The returned dictionary has the following structure: {metric: value}. The calculated metrics are:
- word_count_min: minimum number of words in each row
- word_count_max: maximum number of words in each row
- word_count_average: average number of words in each row
- word_length_min: length of shortest word
- word_length_max: length of longest word
- word_total: total number of words
- word_mode: most frequently occurring word
- word_mode_frequency: frequency of word_mode
- word_frequencies: a list<tuple<str, int>> object containing the top 50 words (by frequency) and their counts
- invalid: number of invalid values
The returned dictionary will also contain the general statistical metrics returned by dcs.analyze.genericAnalysis().
Parameters: series (pandas.Series) – series to analyze
Returns: dictionary containing statistical metric–value pairs
Return type: dict
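The word-level metrics can be illustrated with the standard library alone. The sketch below is not the dcs implementation: `text_summary` is a hypothetical name, words are assumed to be whitespace-delimited, and any non-string entry is assumed to count as invalid. It accepts any iterable of values, so a pandas.Series can be passed as well as a list.

```python
from collections import Counter


def text_summary(values, top=50):
    """Hypothetical stand-in for the word metrics listed above."""
    values = list(values)
    # Assumption: only str entries are valid; each row is split on whitespace.
    rows = [value.split() for value in values if isinstance(value, str)]
    words = [word for row in rows for word in row]
    summary = {"word_total": len(words), "invalid": len(values) - len(rows)}
    if words:
        counts = Counter(words)
        row_sizes = [len(row) for row in rows]
        word_mode, word_mode_frequency = counts.most_common(1)[0]
        summary.update(
            word_count_min=min(row_sizes),
            word_count_max=max(row_sizes),
            word_count_average=sum(row_sizes) / len(row_sizes),
            word_length_min=min(len(word) for word in words),
            word_length_max=max(len(word) for word in words),
            word_mode=word_mode,
            word_mode_frequency=word_mode_frequency,
            word_frequencies=counts.most_common(top),
        )
    return summary


reviews = ["good value", "very good product", None, "good"]
summary = text_summary(reviews)
```

Here the word_count_* metrics are computed per row, while the word_length_* and word_mode metrics are computed over all words in the column.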