dcs.clean
Language: Python
dcs.clean is a module of the dcs package that contains functions for cleaning invalid data in pandas.DataFrame objects, as well as performing transformations for machine learning.
-
dcs.clean.combineColumns(df, columnHeadings, seperator='', newName='merged_column', insertIndex=0)
Combines multiple columns into a new column, concatenating each value using a specified separator.
Parameters:
- df (pandas.DataFrame) – data frame
- columnHeadings (list<str>) – list of columns to combine
- seperator (str, optional) – separator character or string
- newName (str, optional) – name for column containing combined values
- insertIndex (int, optional) – index to insert new column at
Raises: ValueError – if the columnHeadings parameter doesn’t contain at least two columns
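The behaviour described above can be sketched with plain pandas. This is an illustration only, not the dcs source: the helper name combine_columns and the conversion of every value to a string before joining are assumptions.

```python
import pandas as pd

def combine_columns(df, column_headings, separator="", new_name="merged_column", insert_index=0):
    """Hypothetical stand-in for combineColumns (not the dcs implementation)."""
    if len(column_headings) < 2:
        raise ValueError("at least two columns are needed to combine")
    # Start from the first column and append each remaining column in turn.
    merged = df[column_headings[0]].astype(str)
    for heading in column_headings[1:]:
        merged = merged + separator + df[heading].astype(str)
    # Insert the combined column at the requested position, modifying df.
    df.insert(insert_index, new_name, merged)

df = pd.DataFrame({"first": ["Ada", "Alan"], "last": ["Lovelace", "Turing"]})
combine_columns(df, ["first", "last"], separator=" ", new_name="full")
```

After the call, df holds the columns full, first and last, in that order.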
-
dcs.clean.deleteRowsWithNA(df, columnIndex)
Drops all rows with missing values in the specified column.
The function uses pandas.DataFrame.dropna() and then resets the index of the data frame with pandas.DataFrame.reset_index().
Parameters:
- df (pandas.DataFrame) – data frame
- columnIndex (int) – index of column
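A minimal sketch of the two pandas calls named above, applied to a toy data frame. Looking the column up by position and passing drop=True to reset_index() are assumptions about how dcs uses them.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})
column_index = 0  # positional index of the column to check

# Drop rows whose value in the chosen column is missing, then renumber the rows.
column_name = df.columns[column_index]
cleaned = df.dropna(subset=[column_name]).reset_index(drop=True)
```

Only the chosen column is checked, so the missing value in column b does not cause its row to be dropped.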
-
dcs.clean.discretize(df, columnIndex, cutMode, numberOfBins)
Performs in-place discretization on a numeric column.
The function has two modes of operation, discretization and quantiling, which use the pandas.cut() and pandas.qcut() functions respectively.
Parameters:
- df (pandas.DataFrame) – data frame
- columnIndex (int) – index of column to discretize
- cutMode (str) – ‘quantiling’ or ‘discretization’
- numberOfBins (int) – number of bins, passed directly to the pandas.cut() or pandas.qcut() function
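The difference between the two modes is easiest to see on skewed data. The sketch below calls pandas.cut() and pandas.qcut() directly on an illustrative column; it does not go through dcs.clean.discretize.

```python
import pandas as pd

df = pd.DataFrame({"age": [1, 2, 3, 4, 5, 6, 7, 100]})
number_of_bins = 4

# 'discretization': bins of equal width across the value range.
equal_width = pd.cut(df["age"], number_of_bins)

# 'quantiling': bin edges chosen so each bin holds a similar number of rows.
equal_count = pd.qcut(df["age"], number_of_bins)

print(equal_width.value_counts(sort=False))
print(equal_count.value_counts(sort=False))
```

With the outlier at 100, equal-width binning puts seven of the eight values in the first bin, while quantiling places two values in each bin.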
-
dcs.clean.executeCommand(df, command)
Executes a Python statement in a pre-configured environment.
Danger
Using this function carries direct risk, as any arbitrary command can be executed.
The command parameter can be a string containing multiple lines of Python statements. The command is executed in a pre-configured environment with df holding a reference to the data frame, and multiple modules loaded, including pandas and numpy.
Parameters:
- df (pandas.DataFrame) – data frame
- command (str) – string containing a single Python command, or multiple Python commands delimited by newline
-
dcs.clean.fillByInterpolation(df, columnIndex, method, order)
Fills in invalid values in the specified column by performing interpolation, in-place.
Warning
The function only works on numeric columns and will raise an exception in any other case.
The function makes use of the pandas.Series.interpolate() method.
Parameters:
- df (pandas.DataFrame) – data frame
- columnIndex (int) – index of numeric column
- method (str) – passed directly to the method keyword argument of pandas.Series.interpolate(); options include ‘linear’, ‘spline’ and ‘polynomial’
- order (int) – passed directly to the order keyword argument of pandas.Series.interpolate(); required for certain methods such as ‘polynomial’
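As a rough illustration of the underlying pandas call, the sketch below fills gaps in a toy numeric column by linear interpolation. Assigning the result back to the column is an assumption about how the in-place behaviour could be achieved.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"temperature": [1.0, np.nan, 3.0, np.nan, 5.0]})
column_name = df.columns[0]

# Linear interpolation needs no order argument.
df[column_name] = df[column_name].interpolate(method="linear")

# method="polynomial" or method="spline" additionally needs order=...
# and requires SciPy to be installed.
```

Each missing value is replaced by the point on the straight line between its valid neighbours.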
-
dcs.clean.fillDown(df, columnFrom, columnTo, method)
Replaces invalid values in the specified columns with the last/next valid value, in-place.
Multiple columns can be specified by giving a range of column indices, so the operation can only be performed on a series of adjacent columns. The function makes use of the pandas.Series.fillna() method.
Parameters:
- df (pandas.DataFrame) – data frame
- columnFrom (int) – starting index for range of columns
- columnTo (int) – ending index for range of columns (inclusive)
- method (str) – ‘bfill’ for backwards fill (next valid value) and ‘pad’ for forward fill (last valid value)
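A sketch of forward filling over an inclusive range of adjacent columns. It uses ffill() and bfill() rather than fillna(method=...), since recent pandas releases deprecate the method argument of fillna(); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 5.0, np.nan],
    "c": [np.nan, np.nan, 9.0],
})
column_from, column_to = 0, 1  # inclusive range, mirroring columnFrom/columnTo

columns = df.columns[column_from:column_to + 1]
# 'pad': carry the last valid value forward down each column.
df[columns] = df[columns].ffill()
# For 'bfill', df[columns].bfill() would pull the next valid value backward.
```

The leading missing value in column b stays missing because there is no earlier valid value to carry forward, and column c is untouched because it lies outside the range.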
-
dcs.clean.fillWithAverage(df, columnIndex, metric)
Fills in invalid values in the specified column with an average metric, in-place.
Average metrics that can be used to fill with are: mean, median and mode.
Warning
Using the mean or median metric on a non-numeric column will raise an exception.
Parameters:
- df (pandas.DataFrame) – data frame
- columnIndex (int) – index of column
- metric (str) – average metric to use, options are: ‘mean’, ‘median’ and ‘mode’
Returns: True on success, False on failure
Return type: bool
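The three metrics can give quite different fill values. The sketch below computes each one with plain pandas on a toy column; taking the first entry of mode() when there are ties is an assumption, not documented dcs behaviour.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [1.0, 2.0, 2.0, np.nan, 10.0]})
column = df[df.columns[0]]

mean_filled = column.fillna(column.mean())      # fills with 3.75
median_filled = column.fillna(column.median())  # fills with 2.0
# mode() returns a Series because several values can tie; take the first.
mode_filled = column.fillna(column.mode().iloc[0])  # fills with 2.0
```

The mean is pulled upward by the outlier at 10, whereas the median and mode are not.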
-
dcs.clean.fillWithCustomValue(df, columnIndex, newValue)
Fills in all invalid values in the specified column with a custom specified value, in-place.
Parameters:
- df (pandas.DataFrame) – data frame
- columnIndex (int) – index of column
- newValue – value to fill with
-
dcs.clean.findReplace(df, columnIndex, toReplace, replaceWith, matchRegex)
Finds all values matching the given patterns in the specified column and replaces them with a value.
The function supports searching for multiple patterns and uses the pandas.Series.replace() method. Patterns can be strings, which are matched as a whole, or regular expressions (if the matchRegex boolean flag is set to True). Standard Python regex substitutions are also possible.
Parameters:
- df (pandas.DataFrame) – data frame
- columnIndex (int) – index of column
- toReplace (list<str>) – list of search strings or regular expressions
- replaceWith (list<str>) – list of replacement strings or regular expressions
- matchRegex (bool) – must be set to True if supplying list of regular expressions
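The two matching modes can be illustrated with pandas.Series.replace() directly. The sample data and patterns below are made up for the example, and the calls do not go through dcs.clean.findReplace.

```python
import pandas as pd

values = pd.Series(["N/A", "n/a", "12 kg", "7 kg"])

# Whole-value matching: only cells exactly equal to a search string change.
exact = values.replace(["N/A", "n/a"], ["missing", "missing"])

# Regex matching: the capture group is reused in the replacement via \1.
stripped = values.replace([r"^(\d+) kg$"], [r"\1"], regex=True)
```

In the first call "12 kg" is left alone because it is not equal to either search string. In the second call the regex substitution keeps only the captured digits.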
-
dcs.clean.generateDummies(df, columnIndex, inplace)
Generates dummy/indicator variable columns from a specified column (containing categorical data).
The function uses the pandas.get_dummies() function.
Parameters:
- df (pandas.DataFrame) – data frame
- columnIndex (int) – index of column
- inplace (bool) – removes the original column if True
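A sketch of indicator-column generation with pandas.get_dummies(). The column-name prefix and the position of the new columns are assumptions; the source does not say how dcs names or places them.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "colour": ["red", "blue", "red"]})
column_name = df.columns[1]

# One indicator column per distinct category in the source column.
dummies = pd.get_dummies(df[column_name], prefix=column_name)

# Comparable to inplace=True: drop the source column and keep the indicators.
encoded = pd.concat([df.drop(columns=[column_name]), dummies], axis=1)
```

The result has one column per category (colour_blue and colour_red here) alongside the untouched id column.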
-
dcs.clean.insertDuplicateColumn(df, columnIndex)
Duplicates a column, inserting the new column to the right of the original column.
Parameters:
- df (pandas.DataFrame) – data frame
- columnIndex (int) – index of column to duplicate
-
dcs.clean.normalize(df, columnIndex, rangeFrom=0, rangeTo=1)
Performs normalization on a numeric column, in-place.
Uniformly scales the values in a numeric data set to fit in the specified range.
Warning
Calling the function on a non-numeric column will raise an exception.
Parameters:
- df (pandas.DataFrame) – data frame
- columnIndex (int) – index of column
- rangeFrom (int/float, optional) – range start
- rangeTo (int/float, optional) – range end
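Scaling into a target range is commonly done with min-max scaling, sketched below. That dcs uses exactly this formula is an assumption, and min_max_scale is a hypothetical helper name.

```python
import pandas as pd

def min_max_scale(series, range_from=0, range_to=1):
    """Linearly map the series so its minimum and maximum hit the range ends."""
    lowest, highest = series.min(), series.max()
    scaled = (series - lowest) / (highest - lowest)  # now in [0, 1]
    return range_from + scaled * (range_to - range_from)

df = pd.DataFrame({"x": [10.0, 15.0, 20.0]})
df["x"] = min_max_scale(df["x"], 0, 1)

wide = min_max_scale(pd.Series([10.0, 15.0, 20.0]), -1, 1)
```

A column in which every value is identical has a zero range, so this sketch would produce NaN for it; a real implementation would need to decide how to handle that case.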
-
dcs.clean.splitColumn(df, columnIndex, delimiter, regex=False)
Splits a string column according to a specified delimiter or regular expression.
The split values are put in new columns inserted to the right of the original column.
Parameters:
- df (pandas.DataFrame) – data frame
- columnIndex (int) – index of column to split
- delimiter (str) – delimiting character, string or regular expression for splitting each row
- regex (bool, optional) – must be set to True if delimiter is a regular expression
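One way to get this layout with plain pandas is sketched below, using a literal delimiter. The names given to the new columns are an assumption, since the source does not state how dcs names them.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ada Lovelace", "Alan Turing"], "born": [1815, 1912]})
column_index = 0
column_name = df.columns[column_index]

# expand=True returns one column per split part, labelled 0, 1, ...
parts = df[column_name].str.split(" ", expand=True)

# Insert each part immediately to the right of the original column.
for offset in range(parts.shape[1]):
    df.insert(column_index + 1 + offset, f"{column_name}_{offset}", parts[offset])
```

The original column is kept, followed by name_0 and name_1, with born shifted to the right.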
-
dcs.clean.standardize(df, columnIndex)
Performs standardization on a numeric column, in-place.
Uniformly scales the values in a numeric data set so that the mean is 0 and the standard deviation is 1.
Warning
Calling the function on a non-numeric column will raise an exception.
Parameters:
- df (pandas.DataFrame) – data frame
- columnIndex (int) – index of column
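Standardization is usually computed as a z-score, sketched below with plain pandas. Using the population standard deviation (ddof=0) is an assumption; the source does not say which estimator dcs uses.

```python
import pandas as pd

df = pd.DataFrame({"x": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]})

# Subtract the mean, then divide by the standard deviation.
# ddof=0 (population standard deviation) is an assumption for this sketch.
mean = df["x"].mean()
std = df["x"].std(ddof=0)
df["x"] = (df["x"] - mean) / std
```

The sample column has mean 5 and population standard deviation 2, so the value 9 maps to 2 and the value 2 maps to -1.5.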