dcs.clean

Language: Python

dcs.clean is a module of the dcs package that contains functions for cleaning invalid data in pandas.DataFrame objects as well as performing transformations for machine learning.

dcs.clean.combineColumns(df, columnHeadings, seperator='', newName='merged_column', insertIndex=0)

Combines multiple columns into a new column, concatenating each value using a specified separator.

Parameters:	df (pandas.DataFrame) – data frame
	columnHeadings (list<str>) – list of columns to combine
	seperator (str, optional) – separator character or string
	newName (str, optional) – name for column containing combined values
	insertIndex (int, optional) – index to insert new column at
Raises:	`ValueError` – if columnHeadings parameter doesn’t contain at least two columns
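
The behaviour described above can be approximated with plain pandas. This is a minimal sketch, not the dcs implementation: the helper name `combine_columns_sketch` and the sample data are hypothetical, and it assumes values are converted to strings before joining.

```python
import pandas as pd


def combine_columns_sketch(df, column_headings, separator="", new_name="merged_column", insert_index=0):
    """Hypothetical stand-in for the documented behaviour, built on plain pandas."""
    if len(column_headings) < 2:
        raise ValueError("at least two columns are required")
    # Convert every value to a string, then join the values of each row.
    merged = df[column_headings].astype(str).apply(separator.join, axis=1)
    # Insert the combined column at the requested position, modifying df in place.
    df.insert(insert_index, new_name, merged)


people = pd.DataFrame({"first": ["Ada", "Alan"], "last": ["Lovelace", "Turing"]})
combine_columns_sketch(people, ["first", "last"], separator=" ", new_name="full_name")
```

After the call, `people` has `full_name` as its first column, followed by the two original columns.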

dcs.clean.deleteRowsWithNA(df, columnIndex)

Drops all rows with missing values in the specified column.

The function uses the pandas.DataFrame.dropna() function, before resetting the index of the dataframe with pandas.DataFrame.reset_index().

Parameters:	df (pandas.DataFrame) – data frame
	columnIndex (int) – index of column
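
The two pandas calls named above can be combined as follows. This is only a sketch of the underlying technique: the sample frame is made up, and whether dcs discards the old index (`drop=True`) is an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})
column_index = 0  # positional index of the column to check

# Drop rows whose value in the chosen column is missing, then renumber the rows.
# drop=True (an assumption) discards the old index instead of keeping it as a column.
cleaned = df.dropna(subset=[df.columns[column_index]]).reset_index(drop=True)
```

Only the row with a missing value in column `a` is removed; the missing value in column `b` is left alone because that column was not selected.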

dcs.clean.discretize(df, columnIndex, cutMode, numberOfBins)

Performs in-place discretization on a numeric column.

The function has two modes of operation: discretization and quantiling, using the pandas.cut() and pandas.qcut() functions respectively.

Parameters:	df (pandas.DataFrame) – data frame
	columnIndex (int) – index of column to discretize
	cutMode (str) – ‘quantiling’ or ‘discretization’
	numberOfBins (int) – argument passed directly to the pandas.cut() and pandas.qcut() functions
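
The difference between the two modes is easiest to see on skewed data. The snippet below calls pandas.cut() and pandas.qcut() directly rather than going through dcs; the sample values are illustrative.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 100])

# 'discretization': bins of equal width across the value range.
equal_width = pd.cut(values, 2)

# 'quantiling': bins chosen so that each holds roughly the same number of values.
equal_count = pd.qcut(values, 2)

print(equal_width.value_counts())
print(equal_count.value_counts())
```

With equal-width bins the outlier sits alone in the upper bin, while quantile bins split the five values three to two.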

dcs.clean.executeCommand(df, command)

Executes a Python statement in a pre-configured environment.

Danger

Using this function carries direct risk, because it executes any arbitrary command it is given.

The command parameter can be a string containing multiple lines of Python statements. The command is executed in a pre-configured environment in which df holds a reference to the data frame and several modules, including pandas and numpy, are already loaded.

Parameters:	df (pandas.DataFrame) – data frame
	command (str) – string containing a single Python command, or multiple Python commands delimited by newline
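
One way such a pre-configured environment could be built is with Python's built-in `exec()` and an explicit namespace. The sketch below is hypothetical: the helper name, the aliases `pd` and `np`, and returning the final value of `df` are all assumptions rather than documented dcs behaviour.

```python
import numpy as np
import pandas as pd


def execute_command_sketch(df, command):
    """Hypothetical illustration: run `command` with df, pd and np in scope."""
    # The namespace below is an assumed configuration, not the one dcs uses.
    environment = {"df": df, "pd": pd, "np": np}
    # exec runs arbitrary code, so only pass commands from a trusted source.
    exec(command, environment)
    # Return whatever `df` refers to after the command has run.
    return environment["df"]


frame = pd.DataFrame({"x": [1, 2, 3]})
result = execute_command_sketch(frame, "df['double'] = df['x'] * 2\ndf = df[df['x'] > 1]")
```

The first statement mutates the original frame, and the second rebinds `df` inside the namespace, which is why the sketch reads `df` back out after execution.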

dcs.clean.fillByInterpolation(df, columnIndex, method, order)

Fills in invalid values in the specified column by performing interpolation, in-place.

Warning

The function only works on numeric columns and will raise an exception in any other case.

The function makes use of the pandas.Series.interpolate() method.

Parameters:	df (pandas.DataFrame) – data frame
	columnIndex (int) – index of numeric column
	method (str) – passed directly to `method` kwarg for `pandas.Series.interpolate()`, options include ‘linear’, ‘spline’ and ‘polynomial’
	order (int) – passed directly to `order` kwarg for `pandas.Series.interpolate()`, required for certain methods such as ‘polynomial’
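
As an illustration of the underlying call, the snippet below applies pandas.Series.interpolate() directly to a small made-up series. It uses the ‘linear’ method, which needs no `order`; in pandas the ‘spline’ and ‘polynomial’ methods additionally require SciPy.

```python
import numpy as np
import pandas as pd

series = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# Each missing value is replaced by a point on the straight line
# between its nearest valid neighbours.
filled = series.interpolate(method="linear")
```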

dcs.clean.fillDown(df, columnFrom, columnTo, method)

Replaces invalid values in specified columns with the last/next valid value, in-place.

Multiple columns are specified as a range of column indices, so the operation can only be applied to a run of adjacent columns. The function makes use of the pandas.Series.fillna() method.

Parameters:	df (pandas.DataFrame) – data frame
	columnFrom (int) – starting index for range of columns
	columnTo (int) – ending index for range of columns (inclusive)
	method (str) – ‘bfill’ for backwards fill (next valid value) and ‘pad’ for forward fill (last valid value)
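
A forward fill over an inclusive range of column positions might look like the sketch below. It is not the dcs code: recent pandas releases deprecate the `method` argument of fillna() in favour of the dedicated ffill() and bfill() methods, so the sketch uses those, and the sample frame is made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 5.0, np.nan],
    "c": [7.0, np.nan, np.nan],
})
column_from, column_to = 0, 1  # inclusive range of column positions

for position in range(column_from, column_to + 1):
    name = df.columns[position]
    # ffill() corresponds to 'pad'; swap in bfill() for the 'bfill' method.
    df[name] = df[name].ffill()
```

Column `c` lies outside the range and keeps its missing values. The leading missing value in `b` also remains, because a forward fill has no earlier valid value to copy.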

dcs.clean.fillWithAverage(df, columnIndex, metric)

Fills in invalid values in the specified column with an average metric, in-place.

Average metrics that can be used to fill with are: mean, median and mode.

Warning

Using the mean or median metric on a non-numeric column will raise an exception.

Parameters:	df (pandas.DataFrame) – data frame
	columnIndex (int) – index of column
	metric (str) – average metric to use, options are: ‘mean’, ‘median’ and ‘mode’
Returns:	True on success, False on failure
Return type:	bool
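
The three metrics can differ noticeably on skewed data. The sketch below computes each with plain pandas and fills with the median; taking the first value returned by `mode()` when several values tie is an assumption, as the documentation does not say how dcs resolves ties.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [1.0, 2.0, 2.0, np.nan, 10.0]})
column = df["score"]

# Missing values are skipped when each metric is computed.
fill_values = {
    "mean": column.mean(),
    "median": column.median(),
    "mode": column.mode().iloc[0],  # first mode if there are ties (assumption)
}

df["score"] = column.fillna(fill_values["median"])
```

Here the mean is 3.75 while the median and mode are both 2.0, so the choice of metric changes the filled value.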

dcs.clean.fillWithCustomValue(df, columnIndex, newValue)

Fills in all invalid values in the specified column with a custom specified value, in-place.

Parameters:	df (pandas.DataFrame) – data frame
	columnIndex (int) – index of column
	newValue – value to fill with

dcs.clean.findReplace(df, columnIndex, toReplace, replaceWith, matchRegex)

Finds all values matching the given patterns in the specified column and replaces them with a value.

The function supports searching for multiple patterns and uses the pandas.Series.replace() method. Patterns can be strings, which are matched against the whole value, or regular expressions (if the matchRegex flag is set to True).

Standard Python regex substitutions are also possible.

Parameters:	df (pandas.DataFrame) – data frame
	columnIndex (int) – index of column
	toReplace (list<str>) – list of search strings or regular expressions
	replaceWith (list<str>) – list of replacement strings or regular expressions
	matchRegex (bool) – must be set to True if supplying list of regular expressions
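
The two matching modes map onto pandas.Series.replace() roughly as shown below. The snippet calls pandas directly with made-up data, so it illustrates the underlying method rather than the dcs wrapper.

```python
import pandas as pd

names = pd.Series(["Mr Smith", "Mrs Jones", "n/a"])

# Whole-value matching: only entries equal to a search string are replaced.
exact = names.replace(["n/a"], ["unknown"])

# Regex matching: the replacement may refer to captured groups (\1, \2).
swapped = names.replace([r"^(Mrs?) (\w+)$"], [r"\2, \1"], regex=True)
```

In the regex case the title and surname are captured and written back in reverse order, while `"n/a"` does not match the pattern and is left unchanged.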

dcs.clean.generateDummies(df, columnIndex, inplace)

Generates dummy/indicator variable columns from a specified column (containing categorical data).

The function uses the pandas.get_dummies() function.

Parameters:	df (pandas.DataFrame) – data frame
	columnIndex (int) – index of column
	inplace (bool) – removes original column if `True`
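
A rough equivalent using pandas.get_dummies() directly is sketched below. The column prefix and the position of the new columns are assumptions; the documentation does not state how dcs names or places them.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "colour": ["red", "blue", "red"]})

# One indicator column per distinct category; the prefix is an assumed naming choice.
dummies = pd.get_dummies(df["colour"], prefix="colour")

# Dropping the source column mirrors the documented inplace=True behaviour.
expanded = pd.concat([df.drop(columns="colour"), dummies], axis=1)
```

Each row has exactly one indicator set, marking which category the original value belonged to.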

dcs.clean.insertDuplicateColumn(df, columnIndex)

Duplicates a column, inserting the new column to the right of the original column.

Parameters:	df (pandas.DataFrame) – data frame
	columnIndex (int) – index of column to duplicate

dcs.clean.normalize(df, columnIndex, rangeFrom=0, rangeTo=1)

Performs normalization on a numeric column, in-place.

Uniformly scales the values in a numeric data set to fit in the specified range.

Warning

Calling the function on a non-numeric column will raise an exception.

Parameters:	df (pandas.DataFrame) – data frame
	columnIndex (int) – index of column
	rangeFrom (int/float, optional) – range start
	rangeTo (int/float, optional) – range end
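
Scaling values uniformly into a target range is commonly done with min-max scaling. The sketch below assumes that formula, which the documentation does not spell out, and uses made-up values.

```python
import pandas as pd

values = pd.Series([10.0, 15.0, 20.0])
range_from, range_to = 0, 1

# Min-max scaling (assumed formula): shift so the minimum is 0, then rescale
# so the maximum lands on range_to and the minimum on range_from.
span = values.max() - values.min()
scaled = range_from + (values - values.min()) * (range_to - range_from) / span
```

Note that a column whose values are all identical has a span of zero, so this formula would divide by zero in that case.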

dcs.clean.splitColumn(df, columnIndex, delimiter, regex=False)

Splits a string column according to a specified delimiter or regular expression.

The split values are put in new columns inserted to the right of the original column.

Parameters:	df (pandas.DataFrame) – data frame
	columnIndex (int) – index of column to split
	delimiter (str) – delimiting character, string or regular expression for splitting each row
	regex (bool, optional) – must be set to `True` if delimiter is a regular expression
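
The split-and-insert behaviour can be sketched with pandas string methods. This is an illustration only: the names given to the new columns are hypothetical, and the example covers a plain string delimiter rather than a regular expression.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ada Lovelace", "Alan Turing"], "born": [1815, 1912]})
column_index = 0

# expand=True returns one column per piece instead of a list per row.
parts = df["name"].str.split(" ", expand=True)

# Insert the pieces immediately to the right of the original column.
for offset, piece in enumerate(parts.columns):
    df.insert(column_index + 1 + offset, f"name_{piece}", parts[piece])
```

The original `name` column is kept, and the two new columns sit between it and `born`.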

dcs.clean.standardize(df, columnIndex)

Performs standardization on a numeric column, in-place.

Uniformly scales the values in a numeric data set so that the mean is 0 and standard deviation is 1.

Warning

Calling the function on a non-numeric column will raise an exception.

Parameters:	df (pandas.DataFrame) – data frame
	columnIndex (int) – index of column
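
Standardization is usually computed as a z-score: subtract the mean and divide by the standard deviation. The sketch below assumes the population standard deviation (`ddof=0`); whether dcs uses the population or sample form is not stated in the documentation.

```python
import pandas as pd

values = pd.Series([2.0, 4.0, 6.0])

# z-score: centre on the mean, then scale by the standard deviation.
# ddof=0 selects the population standard deviation (an assumption).
standardized = (values - values.mean()) / values.std(ddof=0)

print(standardized.mean(), standardized.std(ddof=0))
```

With the sample standard deviation (`ddof=1`, the pandas default) the result would be scaled slightly differently, so the convention matters when comparing outputs across tools.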
