dcs.analyze
Language: Python
dcs.analyze is a module of the dcs package that contains functions for performing text and numerical analysis on individual columns of pandas.DataFrame objects. Most functions return Python dictionaries containing the calculated statistical metrics and values.
- dcs.analyze.analysisForColumn(df, column)
Computes statistics on a pandas.DataFrame column, returning a dictionary containing the computed metrics.
The function detects the data type of the pandas.Series object and delegates the actual analysis to the appropriate analysis function:
- dcs.analyze.numericalAnalysis() for numerical data
- dcs.analyze.textAnalysis() for string data
- dcs.analyze.dateAnalysis() for datetime data
Parameters:
- df (pandas.DataFrame) – data frame
- column (str) – name of column to analyze
Returns: dictionary containing statistical metric–value pairs
Return type: dict
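The delegation described for analysisForColumn can be pictured with the following minimal sketch. It is not the dcs implementation: pick_analysis is a hypothetical helper, the dtype checks come from pandas.api.types, and treating every non-numerical, non-datetime column as text is an assumption made for illustration.

```python
import pandas as pd
from pandas.api import types as ptypes


def pick_analysis(series: pd.Series) -> str:
    """Return the name of the analysis a series would be routed to (illustrative only)."""
    if ptypes.is_datetime64_any_dtype(series):
        return "dateAnalysis"
    if ptypes.is_numeric_dtype(series):
        return "numericalAnalysis"
    # Assumption: anything that is neither datetime nor numerical is treated as text.
    return "textAnalysis"


df = pd.DataFrame({
    "price": [1.5, 2.0, 3.25],
    "name": ["a", "b", "c"],
    "when": pd.to_datetime(["2020-01-01", "2020-06-01", "2021-01-01"]),
})

# Map each column name to the analysis it would be delegated to.
routes = {column: pick_analysis(df[column]) for column in df.columns}
```

Here routes maps each column name to the analysis function that would handle it under these assumptions.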
- dcs.analyze.dateAnalysis(series)
Analyzes a pandas.Series of type datetime, returning a dictionary containing the computed statistics.
The returned dictionary has the following structure: {metric: value}. The calculated metrics are:
- max
- min
- median
- invalid: number of invalid values
The returned dictionary will also contain the general statistical metrics returned by dcs.analyze.genericAnalysis().
Parameters: series (pandas.Series) – series to analyze
Returns: dictionary containing statistical metric–value pairs
Return type: dict
Raises: ValueError – if the provided pandas.Series is not of datetime data type
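As a rough illustration of the date metrics listed above, the sketch below computes them directly with pandas. The helper name date_summary is hypothetical, and counting missing values (NaT) as "invalid" is an assumption; the source does not define what dcs treats as invalid.

```python
import pandas as pd
from pandas.api import types as ptypes


def date_summary(series: pd.Series) -> dict:
    """Compute max, min, median and an invalid count for a datetime series (sketch)."""
    if not ptypes.is_datetime64_any_dtype(series):
        raise ValueError("series is not of datetime data type")
    valid = series.dropna()
    return {
        "max": valid.max(),
        "min": valid.min(),
        "median": valid.median(),
        # Assumption: "invalid" is taken to mean missing values (NaT).
        "invalid": int(series.isna().sum()),
    }


dates = pd.Series(pd.to_datetime(["2021-01-01", "2021-01-03", None, "2021-01-02"]))
summary = date_summary(dates)
```

Passing a non-datetime series to this sketch raises ValueError, mirroring the documented behavior.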
- dcs.analyze.genericAnalysis(series)
Computes various general statistics on a pandas.Series object, returning a dictionary containing the computed metrics.
The returned dictionary has the following structure: {metric: value}. The calculated metrics are:
- unique_count: total number of unique values
- frequencies: a list<tuple<str, int>> object containing the top 50 most commonly occurring values and their frequencies
- mode: a list of the most frequently occurring value(s)
- mode_count: frequency of mode(s)
Parameters: series (pandas.Series) – series to analyze
Returns: dictionary containing statistical metric–value pairs
Return type: dict
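The general metrics above could be derived from pandas.Series.value_counts() roughly as follows. This is a sketch, not the dcs code: generic_summary and its top_n parameter are hypothetical names, and excluding missing values from the counts is an assumption (it is the value_counts() default).

```python
import pandas as pd


def generic_summary(series: pd.Series, top_n: int = 50) -> dict:
    """Compute unique_count, frequencies, mode and mode_count for a series (sketch)."""
    # value_counts() sorts by frequency, highest first, and skips missing values.
    counts = series.value_counts()
    mode_count = int(counts.iloc[0]) if len(counts) else 0
    return {
        "unique_count": int(series.nunique()),
        "frequencies": [(str(value), int(count)) for value, count in counts.head(top_n).items()],
        # Every value that ties for the highest frequency is reported as a mode.
        "mode": [value for value, count in counts.items() if count == mode_count],
        "mode_count": mode_count,
    }


colors = pd.Series(["red", "blue", "red", "green", "blue"])
summary = generic_summary(colors)
```

In this example "red" and "blue" tie, so both appear in mode, which is why that metric is a list.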
- dcs.analyze.numericalAnalysis(series)
Analyzes a pandas.Series of numerical type, returning a dictionary containing the computed statistics.
The returned dictionary has the following structure: {metric: value}. In addition to the metrics calculated by pandas.Series.describe() (count, mean, standard deviation, minimum, maximum, and quartiles), the calculated metrics are:
- range: difference between maximum and minimum
- invalid: number of invalid values
The returned dictionary will also contain the general statistical metrics returned by dcs.analyze.genericAnalysis().
Parameters: series (pandas.Series) – series to analyze
Returns: dictionary containing statistical metric–value pairs
Return type: dict
Raises: ValueError – if the provided pandas.Series is not of numerical data type
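A minimal sketch of how the numerical metrics might be assembled on top of pandas.Series.describe() is shown below. The helper name numerical_summary is hypothetical, and counting NaN values as "invalid" is an assumption rather than documented dcs behavior.

```python
import pandas as pd
from pandas.api import types as ptypes


def numerical_summary(series: pd.Series) -> dict:
    """Extend describe() output with range and an invalid count (sketch)."""
    if not ptypes.is_numeric_dtype(series):
        raise ValueError("series is not of numerical data type")
    # describe() yields count, mean, std, min, 25%, 50%, 75% and max.
    stats = series.describe().to_dict()
    stats["range"] = stats["max"] - stats["min"]
    # Assumption: "invalid" is taken to mean missing values (NaN).
    stats["invalid"] = int(series.isna().sum())
    return stats


values = pd.Series([2.0, 4.0, None, 10.0])
summary = numerical_summary(values)
```

The quartile keys keep the labels that describe() uses ("25%", "50%", "75%"); whether dcs renames them is not stated in the source.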
- dcs.analyze.textAnalysis(series)
Analyzes a pandas.Series of type str, returning a dictionary containing the computed statistics.
The returned dictionary has the following structure: {metric: value}. The calculated metrics are:
- word_count_min: minimum number of words in each row
- word_count_max: maximum number of words in each row
- word_count_average: average number of words in each row
- word_length_min: length of shortest word
- word_length_max: length of longest word
- word_total: total number of words
- word_mode: most frequently occurring word
- word_mode_frequency: frequency of word_mode
- word_frequencies: a list<tuple<str, int>> object containing the top 50 words (by frequency) and their counts
- invalid: number of invalid values
The returned dictionary will also contain the general statistical metrics returned by dcs.analyze.genericAnalysis().
Parameters: series (pandas.Series) – series to analyze
Returns: dictionary containing statistical metric–value pairs
Return type: dict
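To make the word-level metrics concrete, here is a plain-Python sketch that computes them for any iterable of rows. The helper name word_summary and its top_n parameter are hypothetical. Splitting on whitespace, case-sensitive matching, and counting non-string rows as "invalid" are all assumptions, since the source does not say how dcs tokenizes text. The sketch also assumes the input contains at least one word.

```python
from collections import Counter


def word_summary(rows, top_n=50):
    """Compute word-level metrics for an iterable of text rows (sketch)."""
    rows = list(rows)
    # Assumption: rows that are not strings (e.g. None or NaN) count as invalid.
    valid = [row for row in rows if isinstance(row, str)]
    # Assumption: words are whitespace-separated and compared case-sensitively.
    words_per_row = [row.split() for row in valid]
    words = [word for row_words in words_per_row for word in row_words]
    row_counts = [len(row_words) for row_words in words_per_row]
    frequencies = Counter(words).most_common(top_n)
    return {
        "word_count_min": min(row_counts),
        "word_count_max": max(row_counts),
        "word_count_average": sum(row_counts) / len(row_counts),
        "word_length_min": min(len(word) for word in words),
        "word_length_max": max(len(word) for word in words),
        "word_total": len(words),
        "word_mode": frequencies[0][0],
        "word_mode_frequency": frequencies[0][1],
        "word_frequencies": frequencies,
        "invalid": len(rows) - len(valid),
    }


rows = ["the quick brown fox", "the lazy dog", None, "the end"]
summary = word_summary(rows)
```

Because the function only iterates over its input, a pandas.Series of strings could be passed in the same way as the list used here.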