dcs.clean

Language: Python

dcs.clean is a module of the dcs package that contains functions for cleaning invalid data in pandas.DataFrame objects as well as performing transformations for machine learning.

dcs.clean.combineColumns(df, columnHeadings, seperator='', newName='merged_column', insertIndex=0)

Combines multiple columns into a new column, concatenating each value using a specified separator

Parameters:
  • df (pandas.DataFrame) – data frame
  • columnHeadings (list<str>) – list of columns to combine
  • seperator (str, optional) – separator character or string
  • newName (str, optional) – name for column containing combined values
  • insertIndex (int, optional) – index to insert new column at
Raises:

ValueError – if columnHeadings parameter doesn’t contain at least two columns
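
The behaviour described above can be approximated in plain pandas. This is a minimal sketch, not the dcs implementation: the helper name combine_columns_sketch and the choice to cast every value to str before joining are assumptions made for illustration.

```python
import pandas as pd


def combine_columns_sketch(df, column_headings, separator="",
                           new_name="merged_column", insert_index=0):
    """Hypothetical stand-in for combineColumns (assumed behaviour)."""
    if len(column_headings) < 2:
        raise ValueError("at least two columns are needed to combine")
    # Cast each value to str, then join the values of every row.
    merged = df[column_headings].astype(str).apply(
        lambda row: separator.join(row), axis=1
    )
    # Insert the merged column at the requested position, in-place.
    df.insert(insert_index, new_name, merged)


people = pd.DataFrame({"first": ["Ada", "Alan"], "last": ["Lovelace", "Turing"]})
combine_columns_sketch(people, ["first", "last"], separator=" ", new_name="full")
```

With insert_index left at 0, the new column becomes the first column of the data frame.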

dcs.clean.deleteRowsWithNA(df, columnIndex)

Drops all rows with missing values in the specified column

The function uses the pandas.DataFrame.dropna() function, before resetting the index of the dataframe with pandas.DataFrame.reset_index()
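
The two pandas calls named above can be combined roughly as follows. This is a sketch under assumptions: delete_rows_with_na_sketch is a hypothetical helper, and discarding the old index with drop=True is a guess about how the index is reset.

```python
import numpy as np
import pandas as pd


def delete_rows_with_na_sketch(df, column_index):
    """Hypothetical stand-in for deleteRowsWithNA (assumed behaviour)."""
    column = df.columns[column_index]
    # Drop rows whose value in the chosen column is missing.
    df.dropna(subset=[column], inplace=True)
    # Renumber the remaining rows 0..n-1 (assumption: old index is discarded).
    df.reset_index(drop=True, inplace=True)


readings = pd.DataFrame({"id": [1, 2, 3], "value": [0.5, np.nan, 1.5]})
delete_rows_with_na_sketch(readings, 1)
```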

Parameters:
  • df (pandas.DataFrame) – data frame
  • columnIndex (int) – index of column
dcs.clean.discretize(df, columnIndex, cutMode, numberOfBins)

Performs in-place discretization on a numeric column

The function has two modes of operation: discretization and quantiling, using the pandas.cut() and pandas.qcut() functions respectively.

Parameters:
  • df (pandas.DataFrame) – data frame
  • columnIndex (int) – index of column to discretize
  • cutMode (str) – ‘quantiling’ or ‘discretization’
  • numberOfBins (int) – arg passed directly into pandas.cut() and pandas.qcut() functions
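
To illustrate the difference between the two modes, the sketch below dispatches to pandas.cut() (equal-width bins) or pandas.qcut() (equal-frequency bins). The helper discretize_sketch and its error handling for unknown modes are assumptions, not the dcs implementation.

```python
import pandas as pd


def discretize_sketch(df, column_index, cut_mode, number_of_bins):
    """Hypothetical stand-in for discretize (assumed behaviour)."""
    column = df.columns[column_index]
    if cut_mode == "quantiling":
        # Bins hold (roughly) the same number of values each.
        df[column] = pd.qcut(df[column], number_of_bins)
    elif cut_mode == "discretization":
        # Bins span equal-width intervals of the value range.
        df[column] = pd.cut(df[column], number_of_bins)
    else:
        raise ValueError("cut_mode must be 'quantiling' or 'discretization'")


ages = pd.DataFrame({"age": [1, 2, 3, 4, 50, 100]})

equal_width = ages.copy()
discretize_sketch(equal_width, 0, "discretization", 2)

equal_count = ages.copy()
discretize_sketch(equal_count, 0, "quantiling", 2)
```

On this skewed sample, the equal-width bins split the six values 5/1, while the quantile bins split them 3/3.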
dcs.clean.executeCommand(df, command)

Executes a Python statement in a pre-configured environment

Danger

Using this function carries direct risk, as any arbitrary command can be executed

The command parameter can be a string containing multiple lines of Python statements. The command is executed in a pre-configured environment with df holding a reference to the data frame, and multiple modules loaded, including pandas and numpy

Parameters:
  • df (pandas.DataFrame) – data frame
  • command (str) – string containing a single Python command, or multiple Python commands delimited by newline
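
A pre-configured environment of this kind can be sketched with Python's built-in exec(). The helper name execute_command_sketch and the aliases pd and np under which the modules are exposed are assumptions; the actual names available inside dcs are not stated here.

```python
import numpy as np
import pandas as pd


def execute_command_sketch(df, command):
    """Hypothetical stand-in for executeCommand (assumed behaviour)."""
    # Names visible to the executed statements (aliases are an assumption).
    environment = {"df": df, "pd": pd, "np": np}
    # Runs arbitrary code: never pass untrusted input.
    exec(command, environment)


totals = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
execute_command_sketch(
    totals,
    "df['total'] = df['a'] + df['b']\n"
    "df['root'] = np.sqrt(df['total'])",
)
```

In this sketch only in-place mutations of df are visible to the caller; a statement that rebinds df (for example df = df.dropna()) changes only the name inside the environment.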
dcs.clean.fillByInterpolation(df, columnIndex, method, order)

Fills in invalid values in the specified column by performing interpolation, in-place

Warning

The function only works on numeric columns and will raise an exception in any other case.

The function makes use of the pandas.Series.interpolate() method.
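
A minimal sketch of interpolation-based filling, assuming that method and order are forwarded to pandas.Series.interpolate(). The helper fill_by_interpolation_sketch and its default arguments are hypothetical; only linear interpolation is exercised here, since methods such as 'polynomial' additionally require SciPy.

```python
import numpy as np
import pandas as pd


def fill_by_interpolation_sketch(df, column_index, method="linear", order=None):
    """Hypothetical stand-in for fillByInterpolation (assumed behaviour)."""
    column = df.columns[column_index]
    # Only forward 'order' when given; linear interpolation does not use it.
    extra = {} if order is None else {"order": order}
    df[column] = df[column].interpolate(method=method, **extra)


temperatures = pd.DataFrame({"celsius": [1.0, np.nan, 3.0, np.nan, 7.0]})
fill_by_interpolation_sketch(temperatures, 0)
```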

Parameters:
dcs.clean.fillDown(df, columnFrom, columnTo, method)

Replaces invalid values in specified columns with the last/next valid value, in-place

Multiple columns can be specified by giving a range of column indices. Therefore the operation can only be performed on a series of adjacent columns. The function makes use of the pandas.Series.fillna() method.

Parameters:
  • df (pandas.DataFrame) – data frame
  • columnFrom (int) – starting index for range of columns
  • columnTo (int) – ending index for range of columns (inclusive)
  • method (str) – ‘bfill’ for backwards fill (next valid value) and ‘pad’ for forward fill (last valid value)
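
The inclusive column range and the two fill directions can be sketched as below. fill_down_sketch is a hypothetical helper; it uses Series.ffill() and Series.bfill(), which are the current pandas spellings of fillna(method='pad') and fillna(method='bfill'), rather than the fillna() call mentioned above.

```python
import numpy as np
import pandas as pd


def fill_down_sketch(df, column_from, column_to, method):
    """Hypothetical stand-in for fillDown (assumed behaviour)."""
    # column_to is inclusive, hence the + 1 in the slice.
    for column in df.columns[column_from:column_to + 1]:
        if method == "pad":
            df[column] = df[column].ffill()   # carry last valid value forward
        elif method == "bfill":
            df[column] = df[column].bfill()   # pull next valid value backward
        else:
            raise ValueError("method must be 'pad' or 'bfill'")


grid = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 5.0, np.nan],
    "c": [np.nan, np.nan, 9.0],
})
fill_down_sketch(grid, 0, 1, "pad")
```

A leading missing value (as in column b) has no earlier valid value, so a forward fill leaves it untouched; column c lies outside the range and is not modified.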
dcs.clean.fillWithAverage(df, columnIndex, metric)

Fills in invalid values in the specified column with an average metric, in-place

Average metrics that can be used to fill with are: mean, median and mode.

Warning

Using the mean or median metric on a non-numeric column will raise an exception.

Parameters:
  • df (pandas.DataFrame) – data frame
  • columnIndex (int) – index of column
  • metric (str) – average metric to use, options are: ‘mean’, ‘median’ and ‘mode’
Returns:

True on success, False on failure

Return type:

bool
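
The three metrics can be sketched as follows. This is an illustration under assumptions: fill_with_average_sketch is a hypothetical helper, taking the first mode when several values tie is a guess, and so is the choice of which situations count as failure.

```python
import numpy as np
import pandas as pd


def fill_with_average_sketch(df, column_index, metric):
    """Hypothetical stand-in for fillWithAverage (assumed behaviour)."""
    column = df.columns[column_index]
    if metric == "mean":
        value = df[column].mean()
    elif metric == "median":
        value = df[column].median()
    elif metric == "mode":
        modes = df[column].mode()
        if modes.empty:
            return False  # nothing to fill with (assumed failure case)
        value = modes.iloc[0]  # assumption: first mode wins on ties
    else:
        return False  # unknown metric (assumed failure case)
    df[column] = df[column].fillna(value)
    return True


scores = pd.DataFrame({"score": [1.0, np.nan, 3.0, 3.0]})
succeeded = fill_with_average_sketch(scores, 0, "median")
```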

dcs.clean.fillWithCustomValue(df, columnIndex, newValue)

Fills in all invalid values in the specified column with a custom specified value, in-place

Parameters:
  • df (pandas.DataFrame) – data frame
  • columnIndex (int) – index of column
  • newValue – value to fill with
dcs.clean.findReplace(df, columnIndex, toReplace, replaceWith, matchRegex)

Finds all values matching the given patterns in the specified column and replaces them with a value

The function supports searching for multiple patterns and uses the pandas.Series.replace() method. Patterns can be strings, which are matched as a whole, or regular expressions (if the matchRegex boolean flag is set to True).

Standard Python regex substitutions are also possible.

Parameters:
  • df (pandas.DataFrame) – data frame
  • columnIndex (int) – index of column
  • toReplace (list<str>) – list of search strings or regular expressions
  • replaceWith (list<str>) – list of replacement strings or regular expressions
  • matchRegex (bool) – must be set to True if supplying list of regular expressions
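
Both matching modes can be illustrated with pandas.Series.replace(). The helper find_replace_sketch is hypothetical and only shows how the parameters might map onto that method; the second call uses a backreference substitution.

```python
import pandas as pd


def find_replace_sketch(df, column_index, to_replace, replace_with, match_regex):
    """Hypothetical stand-in for findReplace (assumed behaviour)."""
    column = df.columns[column_index]
    # With regex=False each search string must equal the whole cell value.
    df[column] = df[column].replace(
        to_replace=to_replace, value=replace_with, regex=match_regex
    )


names = pd.DataFrame({"name": ["Turing, Alan", "N/A", "Lovelace, Ada"]})

# Whole-value match: only the cell that is exactly "N/A" changes.
find_replace_sketch(names, 0, ["N/A"], ["unknown"], False)

# Regex match with a substitution that swaps the two captured groups.
find_replace_sketch(names, 0, [r"^(\w+), (\w+)$"], [r"\2 \1"], True)
```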
dcs.clean.generateDummies(df, columnIndex, inplace)

Generates dummies/indicator variable columns from a specified column (containing categorical data)

The function uses the pandas.get_dummies() function.

Parameters:
  • df (pandas.DataFrame) – data frame
  • columnIndex (int) – index of column
  • inplace (bool) – removes original column if True
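
A minimal sketch built on pandas.get_dummies(). The helper generate_dummies_sketch is hypothetical; placing the indicator columns directly to the right of the source column and prefixing them with the source column name are assumptions about the layout.

```python
import pandas as pd


def generate_dummies_sketch(df, column_index, inplace):
    """Hypothetical stand-in for generateDummies (assumed behaviour)."""
    column = df.columns[column_index]
    dummies = pd.get_dummies(df[column], prefix=column)
    # Insert one indicator column per category, right of the source column.
    for offset, name in enumerate(dummies.columns, start=1):
        df.insert(column_index + offset, name, dummies[name])
    if inplace:
        # Remove the original categorical column.
        df.drop(columns=[column], inplace=True)


shirts = pd.DataFrame({"colour": ["red", "blue", "red"], "size": [1, 2, 3]})
generate_dummies_sketch(shirts, 0, inplace=True)
```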
dcs.clean.insertDuplicateColumn(df, columnIndex)

Duplicates a column, inserting the new column to the right of the original column

Parameters:
  • df (pandas.DataFrame) – data frame
  • columnIndex (int) – index of column
dcs.clean.normalize(df, columnIndex, rangeFrom=0, rangeTo=1)

Performs normalization on a numeric column, in-place

Uniformly scales the values in a numeric data set to fit in the specified range

Warning

Calling the function on a non-numeric column will raise an exception.

Parameters:
  • df (pandas.DataFrame) – data frame
  • columnIndex (int) – index of column
  • rangeFrom (int/float, optional) – range start
  • rangeTo (int/float, optional) – range end
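
Scaling into a range is commonly done with min-max scaling, x' = rangeFrom + (x - min) * (rangeTo - rangeFrom) / (max - min). The sketch below assumes that formula; normalize_sketch is a hypothetical helper and not necessarily how dcs computes the result.

```python
import pandas as pd


def normalize_sketch(df, column_index, range_from=0, range_to=1):
    """Hypothetical stand-in for normalize (assumed min-max scaling)."""
    column = df.columns[column_index]
    values = df[column]
    lowest, highest = values.min(), values.max()
    # Map [lowest, highest] linearly onto [range_from, range_to].
    df[column] = range_from + (values - lowest) * (range_to - range_from) / (highest - lowest)


heights = pd.DataFrame({"cm": [10, 20, 30]})
normalize_sketch(heights, 0)

centred = pd.DataFrame({"cm": [10, 20, 30]})
normalize_sketch(centred, 0, range_from=-1, range_to=1)
```

In this sketch a constant column makes highest - lowest zero, so the result is NaN for every row.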
dcs.clean.splitColumn(df, columnIndex, delimiter, regex=False)

Splits a string column according to a specified delimiter or regular expression.

The split values are put in new columns inserted to the right of the original column

Parameters:
  • df (pandas.DataFrame) – data frame
  • columnIndex (int) – index of column to split
  • delimiter (str) – delimiting character, string or regular expression for splitting each row
  • regex (bool, optional) – must be set to True if delimiter is a regular expression
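
Splitting and re-inserting can be sketched with pandas.Series.str.split(). The helper split_column_sketch is hypothetical, and the names given to the new columns (source name plus a numeric suffix) are an assumption. The regex keyword of str.split() requires pandas 1.4 or later.

```python
import pandas as pd


def split_column_sketch(df, column_index, delimiter, regex=False):
    """Hypothetical stand-in for splitColumn (assumed behaviour)."""
    column = df.columns[column_index]
    # expand=True returns one column per split part.
    parts = df[column].str.split(delimiter, expand=True, regex=regex)
    # Insert the parts to the right of the original column, in order.
    for offset, part in enumerate(parts.columns):
        df.insert(column_index + 1 + offset, f"{column}_{part}", parts[part])


events = pd.DataFrame({"date": ["2020-01-15", "2021-12-31"], "count": [1, 2]})
split_column_sketch(events, 0, "-")
```

Passing regex=True with a delimiter such as r"\d+" splits on runs of digits instead of a literal string.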
dcs.clean.standardize(df, columnIndex)

Performs standardization on a numeric column, in-place

Uniformly scales the values in a numeric data set so that the mean is 0 and standard deviation is 1.

Warning

Calling the function on a non-numeric column will raise an exception.
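
Standardization is usually computed as the z-score, z = (x - mean) / std. The sketch below assumes that formula and uses the pandas default sample standard deviation (ddof=1); whether dcs uses the sample or population form is not stated here, and standardize_sketch is a hypothetical helper.

```python
import pandas as pd


def standardize_sketch(df, column_index):
    """Hypothetical stand-in for standardize (assumed z-score, ddof=1)."""
    column = df.columns[column_index]
    values = df[column]
    # Subtract the mean, then divide by the sample standard deviation.
    df[column] = (values - values.mean()) / values.std()


weights = pd.DataFrame({"kg": [2.0, 4.0, 6.0]})
standardize_sketch(weights, 0)
```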

Parameters:
  • df (pandas.DataFrame) – data frame
  • columnIndex (int) – index of column