dcs.analyze

Language: Python

dcs.analyze is a module of the dcs package that contains functions for performing text and numerical analysis on individual columns of pandas.DataFrame objects. Most functions return Python dictionaries containing the calculated statistical metrics and values.

dcs.analyze.analysisForColumn(df, column)

Computes statistics on a pandas.DataFrame column, returning a dictionary containing the computed metrics.

The function detects the data type of the column (a pandas.Series object) and delegates the actual analysis to the appropriate analysis function for that type.

Parameters:
Returns: dictionary containing statistical metric–value pairs
Return type: dict
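The type-based delegation described above can be sketched as follows. This is only an illustration: the helper name `pick_analysis_kind` and the exact detection rules are assumptions, since this document does not describe how dcs.analyze.analysisForColumn() detects the data type. The sketch relies on the standard `pandas.api.types` checks.

```python
import pandas as pd
from pandas.api import types as ptypes


def pick_analysis_kind(series):
    """Return a label for the kind of analysis a series would be routed to.

    Hypothetical helper, not part of dcs.analyze: the real detection
    rules may differ from the ones assumed here.
    """
    # Check datetime first, then numeric; everything else is treated as text.
    if ptypes.is_datetime64_any_dtype(series):
        return "date"
    if ptypes.is_numeric_dtype(series):
        return "numerical"
    return "text"


df = pd.DataFrame({
    "price": [1.5, 2.0, 3.25],
    "when": pd.to_datetime(["2020-01-01", "2020-06-15", "2021-03-09"]),
    "note": ["red apple", "green pear", "plum"],
})

# Map every column name to the kind of analysis it would receive.
kinds = {name: pick_analysis_kind(df[name]) for name in df.columns}
print(kinds)
```

In this sketch each label would correspond to one of the type-specific functions documented below (dateAnalysis, numericalAnalysis, textAnalysis).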

dcs.analyze.dateAnalysis(series)

Analyzes a pandas.Series of type datetime, returning a dictionary containing computed statistics

The returned dictionary has the following structure: {metric: value}. The calculated metrics are:

  • max
  • min
  • median
  • invalid: number of invalid values

The returned dictionary will also contain the general statistical metrics returned by dcs.analyze.genericAnalysis()

Parameters: series (pandas.Series) – series to analyze
Returns: dictionary containing statistical metric–value pairs
Return type: dict
Raises: ValueError – if the provided pandas.Series is not of datetime data type
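As a rough illustration of the metrics listed for dateAnalysis, the following sketch computes them with plain pandas calls. The function name `date_metrics_sketch` is hypothetical, and treating missing values (NaT) as the "invalid" values is an assumption; the document does not define what counts as invalid. The general metrics from genericAnalysis() are left out.

```python
import pandas as pd


def date_metrics_sketch(series):
    """Hypothetical re-creation of the date metrics; not the dcs code."""
    if not pd.api.types.is_datetime64_any_dtype(series):
        raise ValueError("series is not of datetime data type")
    valid = series.dropna()
    return {
        "max": valid.max(),
        "min": valid.min(),
        "median": valid.median(),
        # Assumption: "invalid" is taken to mean missing (NaT) entries.
        "invalid": int(series.isna().sum()),
    }


dates = pd.Series(
    pd.to_datetime(["2020-01-01", None, "2021-03-09", "2020-06-15"])
)
stats = date_metrics_sketch(dates)
print(stats)
```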

dcs.analyze.genericAnalysis(series)

Computes various general statistics on a pandas.Series object, returning a dictionary containing computed metrics

The returned dictionary has the following structure: {metric: value}. The calculated metrics are:

  • unique_count: total number of unique values
  • frequencies: a list<tuple<str, int>> object containing the 50 most frequently occurring values and their frequencies
  • mode: a list of the most frequently occurring value(s)
  • mode_count: frequency of the mode(s)
Parameters: series (pandas.Series) – series to analyze
Returns: dictionary containing statistical metric–value pairs
Return type: dict
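A minimal sketch of how the four general metrics could be derived with `pandas.Series.value_counts()` is shown below. The function name `generic_metrics_sketch` and the `top_n` parameter are hypothetical, and whether the real function counts missing values is not stated in this document; here they are excluded, which is the `value_counts()` default.

```python
import pandas as pd


def generic_metrics_sketch(series, top_n=50):
    """Hypothetical re-creation of the general metrics; not the dcs code."""
    # value_counts() sorts by frequency (descending) and skips NaN by default.
    counts = series.value_counts()
    mode_count = int(counts.iloc[0]) if len(counts) else 0
    return {
        "unique_count": int(series.nunique()),
        "frequencies": [
            (str(value), int(n)) for value, n in counts.head(top_n).items()
        ],
        # Every value tied for the highest frequency is reported as a mode.
        "mode": [value for value, n in counts.items() if n == mode_count],
        "mode_count": mode_count,
    }


stats = generic_metrics_sketch(pd.Series(["a", "b", "a", "c", "b", "d"]))
print(stats)
```

Returning the mode as a list allows ties to be reported, which matches the description of mode as a list in the metric list above.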

dcs.analyze.numericalAnalysis(series)

Analyzes a pandas.Series of numerical type, returning a dictionary containing computed statistics

The returned dictionary has the following structure: {metric: value}. In addition to the metrics calculated by pandas.Series.describe(), which include the quartiles and the mean, the following metrics are calculated:

  • range: difference between maximum and minimum
  • invalid: number of invalid values

The returned dictionary will also contain the general statistical metrics returned by dcs.analyze.genericAnalysis()

Parameters: series (pandas.Series) – series to analyze
Returns: dictionary containing statistical metric–value pairs
Return type: dict
Raises: ValueError – if the provided pandas.Series is not of numerical data type
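The combination of pandas.Series.describe() with the extra range and invalid metrics might look like the sketch below. The function name `numerical_metrics_sketch` is hypothetical, and counting NaN entries as "invalid" is an assumption. The general metrics from genericAnalysis() are left out for brevity.

```python
import pandas as pd


def numerical_metrics_sketch(series):
    """Hypothetical re-creation of the numerical metrics; not the dcs code."""
    if not pd.api.types.is_numeric_dtype(series):
        raise ValueError("series is not of numerical data type")
    # describe() yields count, mean, std, min, 25%, 50%, 75% and max.
    stats = series.describe().to_dict()
    stats["range"] = stats["max"] - stats["min"]
    # Assumption: "invalid" is taken to mean missing (NaN) entries.
    stats["invalid"] = int(series.isna().sum())
    return stats


stats = numerical_metrics_sketch(pd.Series([2.0, 4.0, None, 10.0]))
print(stats)
```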

dcs.analyze.textAnalysis(series)

Analyzes a pandas.Series of type str, returning a dictionary containing computed statistics

The returned dictionary has the following structure: {metric: value}. The calculated metrics are:

  • word_count_min: minimum number of words per row
  • word_count_max: maximum number of words per row
  • word_count_average: average number of words per row
  • word_length_min: length of shortest word
  • word_length_max: length of longest word
  • word_total: total number of words
  • word_mode: most frequently occurring word
  • word_mode_frequency: frequency of word_mode
  • word_frequencies: a list<tuple<str, int>> object containing the top 50 words (by frequency) and their counts
  • invalid: number of invalid values

The returned dictionary will also contain the general statistical metrics returned by dcs.analyze.genericAnalysis()

Parameters: series (pandas.Series) – series to analyze
Returns: dictionary containing statistical metric–value pairs
Return type: dict
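The word-level metrics listed for textAnalysis can be approximated with `collections.Counter`, as in the sketch below. The function name `text_metrics_sketch` and the `top_n` parameter are hypothetical. Splitting on whitespace and treating missing values as "invalid" are both assumptions, since this document does not say how words are tokenized or what counts as invalid. The general metrics from genericAnalysis() are left out.

```python
from collections import Counter

import pandas as pd


def text_metrics_sketch(series, top_n=50):
    """Hypothetical re-creation of the text metrics; not the dcs code."""
    rows = series.dropna().astype(str)
    # Assumption: words are separated by whitespace.
    words_per_row = [row.split() for row in rows]
    row_counts = [len(words) for words in words_per_row]
    all_words = [word for words in words_per_row for word in words]
    lengths = [len(word) for word in all_words]
    top = Counter(all_words).most_common(top_n)
    return {
        "word_count_min": min(row_counts, default=0),
        "word_count_max": max(row_counts, default=0),
        "word_count_average": (
            sum(row_counts) / len(row_counts) if row_counts else 0.0
        ),
        "word_length_min": min(lengths, default=0),
        "word_length_max": max(lengths, default=0),
        "word_total": len(all_words),
        "word_mode": top[0][0] if top else None,
        "word_mode_frequency": top[0][1] if top else 0,
        "word_frequencies": top,
        # Assumption: "invalid" is taken to mean missing entries.
        "invalid": int(series.isna().sum()),
    }


stats = text_metrics_sketch(
    pd.Series(["red apple", "green apple pie", None, "plum"])
)
print(stats)
```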