dcs.load

Language: Python

dcs.load is a module of the dcs package. It contains functions for loading pandas.DataFrame objects from other formats and exporting them, as well as for basic manipulations such as removing rows and renaming columns.

dcs.load.CSVtoDataFrame(filename, header=0, initialSkip=0, sampleSize=100, seed=None, headerIncluded=True)

Initializes a pandas.DataFrame object by reading a CSV file from disk

Note that lines with too few fields (missing commas) are discarded automatically, and lines with too many fields (extra commas) are truncated to the proper length. Text is converted to UTF-8 encoding when the pandas.DataFrame is created.

Parameters:
  • filename (str) – path to CSV file
  • header (int, optional) – line number of row which contains column headers
  • initialSkip (int) – number of lines to skip (beginning of file)
  • sampleSize (int, optional) – percentage of lines to sample, must be between 0 and 100. If set to 100, no sampling will be performed.
  • seed (hashable, optional) – for deterministic/reproducible sampling if sampling enabled
  • headerIncluded (bool, optional) – if False, columns will not be named based on any row and the header parameter will be ignored
Returns:

pandas.DataFrame on success, or None on failure

Return type:

pandas.DataFrame
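
The row handling and percentage sampling described above can be sketched with the standard library alone. This is not the dcs implementation: the helper name read_rows and the sample data are hypothetical, and the sketch assumes the first row sets the expected number of fields.

```python
import csv
import io
import random


def read_rows(text, sample_size=100, seed=None):
    """Parse CSV text, normalise row lengths, and optionally sample rows.

    Hypothetical helper: short rows are dropped, long rows are truncated,
    and sample_size is a percentage between 0 and 100.
    """
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    width = len(header)

    # Discard rows with too few fields; truncate rows with too many.
    body = [row[:width] for row in body if len(row) >= width]

    if sample_size < 100:
        rng = random.Random(seed)  # a fixed seed gives a reproducible sample
        count = round(len(body) * sample_size / 100)
        keep = sorted(rng.sample(range(len(body)), count))
        body = [body[i] for i in keep]
    return header, body


raw = "a,b\n1,2\n3\n4,5,6\n7,8\n"
header, body = read_rows(raw)
```

Here the line "3" is dropped and "4,5,6" is cut down to two fields. Calling read_rows twice with the same seed and a sample_size below 100 selects the same rows.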

dcs.load.JSONtoDataFrame(filename, sampleSize=100, seed=None)

Initializes a pandas.DataFrame object by reading a JSON file from disk

Forces conversion to UTF-8 text encoding when creating pandas.DataFrame.

Parameters:
  • filename (str) – path to JSON file
  • sampleSize (int, optional) – percentage of lines to sample, must be between 0 and 100. If set to 100, no sampling will be performed.
  • seed (int, optional) – for deterministic/reproducible sampling if sampling enabled
Returns:

pandas.DataFrame on success, or None on failure

Return type:

pandas.DataFrame

dcs.load.XLSXtoDataFrame(filename, initialSkip=0, sampleSize=100, seed=None, headerIncluded=True)

Initializes a pandas.DataFrame object by reading an Excel file from disk

The function supports loading both .XLS and .XLSX files.

Parameters:
  • filename (str) – path to Excel file
  • initialSkip (int) – number of lines to skip (beginning of file)
  • sampleSize (int, optional) – percentage of lines to sample, must be between 0 and 100. If set to 100, no sampling will be performed.
  • seed (int, optional) – for deterministic/reproducible sampling if sampling enabled
  • headerIncluded (bool, optional) – if False, columns will not be named based on any row
Returns:

pandas.DataFrame on success, or None on failure

Return type:

pandas.DataFrame

dcs.load.changeColumnDataType(df, column, newDataType, dateFormat=None)

Changes the data type of a pandas.DataFrame column.

The function performs the data type conversion in place, modifying the data frame that is passed in. The new data type must be a string that can be parsed by the numpy.dtype() function. Valid data types include “int”, “float64”, “datetime64” and “str”.

Parameters:
  • df (pandas.DataFrame) – data frame
  • column (list<int>) – column to change data type of
  • newDataType (str) – new data type
  • dateFormat (str, optional) – Python date format string for parsing dates, if converting to datetime column
Returns:

pandas.DataFrame on success or None on failure

Return type:

pandas.DataFrame
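
As a rough illustration of the two kinds of conversion mentioned above, the following sketch uses plain pandas calls. It does not call dcs.load, and how dcs performs the conversion internally is not stated here; the column names and values are made up.

```python
import pandas as pd

df = pd.DataFrame(
    {"price": ["1.5", "2.0"], "day": ["01/02/2020", "15/03/2021"]}
)

# Numeric conversion: the type string is one that numpy.dtype() accepts.
df["price"] = df["price"].astype("float64")

# Date conversion: parse with an explicit Python date format string.
df["day"] = pd.to_datetime(df["day"], format="%d/%m/%Y")
```

Without the format string, "01/02/2020" could be read as either 1 February or 2 January, which is why a dateFormat argument is useful when converting to a datetime column.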

dcs.load.convertEncoding(filename, source='utf-8', destination='utf-8', buffer=1024)

Converts a file on disk to specified text encoding in-place

The function iterates over the file in blocks. The block size in bytes is set by the buffer parameter, which defaults to 1024.

Parameters:
  • filename (str) – path to file
  • source (str, optional) – original text encoding of file
  • destination (str, optional) – text encoding to convert to
  • buffer (int) – conversion iteration block size
Returns:

True if successful, False otherwise

Return type:

bool
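
Block-wise conversion of this kind might look like the sketch below, which uses only the standard library. The function name convert_encoding and the temporary-file strategy are assumptions for illustration, not the dcs implementation.

```python
import codecs
import os
import tempfile


def convert_encoding(path, source="utf-8", destination="utf-8", buffer=1024):
    """Re-encode a file in blocks of `buffer` bytes, then replace the original."""
    temp_path = None
    try:
        decoder = codecs.getincrementaldecoder(source)()
        encoder = codecs.getincrementalencoder(destination)()
        directory = os.path.dirname(os.path.abspath(path))
        with open(path, "rb") as src, tempfile.NamedTemporaryFile(
            "wb", dir=directory, delete=False
        ) as dst:
            temp_path = dst.name
            while True:
                block = src.read(buffer)
                if not block:
                    break
                # The incremental decoder holds back the bytes of a
                # character that is split across two blocks.
                dst.write(encoder.encode(decoder.decode(block)))
            tail = decoder.decode(b"", final=True)
            dst.write(encoder.encode(tail, final=True))
        os.replace(temp_path, path)
        return True
    except (OSError, UnicodeError, LookupError):
        # Remove the partial output and report failure.
        if temp_path is not None and os.path.exists(temp_path):
            os.remove(temp_path)
        return False


with tempfile.TemporaryDirectory() as folder:
    demo = os.path.join(folder, "demo.txt")
    with open(demo, "wb") as handle:
        handle.write("café".encode("latin-1"))
    ok = convert_encoding(demo, source="latin-1", destination="utf-8", buffer=2)
    with open(demo, "rb") as handle:
        converted = handle.read()
```

Writing to a temporary file and swapping it in at the end means a failed conversion leaves the original file untouched.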

dcs.load.dataFrameToJSON(df, rowIndexFrom=None, rowIndexTo=None, columnIndexFrom=None, columnIndexTo=None)

Serializes a pandas.DataFrame object to a JSON string

Tip

By default, the function converts the entire data frame to JSON. To encode only a segment of the data frame, supply the four index parameters.

Note

All four index parameters must either be left as the default value of None or be supplied valid integer index values.

The function uses the pandas.DataFrame.to_json() method to convert the object to JSON, using the ‘split’ format. All dates are encoded as strings in the ISO 8601 date format.

Parameters:
  • df (pandas.DataFrame) – dataframe to convert
  • rowIndexFrom (int, optional) – if serializing partial segment, start index of row interval
  • rowIndexTo (int, optional) – if serializing partial segment, end index of row interval
  • columnIndexFrom (int, optional) – if serializing partial segment, start index of column interval
  • columnIndexTo (int, optional) – if serializing partial segment, end index of column interval
Returns:

JSON string on success, or None on failure

Return type:

str
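
The ‘split’ format and ISO 8601 date encoding can be seen by calling pandas directly, as in this sketch. The positional slice via iloc is an assumption about how a partial segment might be selected; whether dcs treats the end indices as inclusive is not stated, and iloc excludes them.

```python
import json

import pandas as pd

df = pd.DataFrame(
    {
        "name": ["a", "b", "c"],
        "when": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
        "n": [1, 2, 3],
    }
)

# Whole frame: 'split' produces separate "columns", "index" and "data" keys.
full = df.to_json(orient="split", date_format="iso")

# Partial segment: rows 0-1 and columns 1-2, selected by position.
part = df.iloc[0:2, 1:3].to_json(orient="split", date_format="iso")
parsed = json.loads(part)
```

The parsed segment holds only the "when" and "n" columns for the first two rows, with each date stored as an ISO 8601 string.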

dcs.load.duplicateRowsInColumns(df, columnIndices)

Finds the rows in a pandas.DataFrame object that have duplicate values in the specified columns.

A set of rows must have duplicate values in all specified columns in order to be matched by this function. The function returns a subset of the dataframe containing all the original columns but only the matched rows.

Note

The returned data frame is sorted according to the first specified column in order to better show the duplicate values.

Parameters:
  • df (pandas.DataFrame) – data frame
  • columnIndices (list<int>) – indices of columns for search
Returns:

pandas.DataFrame on success or None on failure

Return type:

pandas.DataFrame
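
A comparable result can be obtained with pandas alone, as sketched below. This is an approximation of the behaviour described above rather than the dcs code, and the sample data is invented.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "city": ["Rome", "Oslo", "Rome", "Oslo", "Rome"],
        "year": [2020, 2021, 2020, 2019, 2021],
        "value": [1, 2, 3, 4, 5],
    }
)
column_indices = [0, 1]
names = [df.columns[i] for i in column_indices]

# keep=False flags every member of a duplicate group, not only the repeats.
mask = df.duplicated(subset=names, keep=False)

# Sort by the first specified column so duplicate groups sit together.
duplicates = df[mask].sort_values(by=names[0])
```

Only the two rows with city "Rome" and year 2020 match, because a row must repeat in both columns to count as a duplicate.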

dcs.load.emptyStringToNan(df, columnIndex)

Replaces all instances of ‘’ (empty string) with numpy.NaN in the specified pandas.DataFrame column

Parameters:
  • df (pandas.DataFrame) – data frame
  • columnIndex (int) – integer index of column
dcs.load.guessEncoding(filename)

Guesses encoding of a file.

Uses the chardet library to guess the text encoding of a file, and returns the guess if its confidence is greater than 80%.

Parameters:
  • filename (str) – path to file
Returns:

Python text encoding string, e.g. ‘utf-8’ or ‘latin-1’

Return type:

str

dcs.load.newCellValue(df, columnIndex, rowIndex, newValue)

Modifies the value of a specified cell in a pandas.DataFrame object

Parameters:
  • df (pandas.DataFrame) – data frame
  • columnIndex (int) – integer index of column
  • rowIndex (int) – integer index of row
  • newValue – fill value
dcs.load.outliersTrimmedMeanSd(df, columnIndices, r=2, k=0)

Finds the rows in a pandas.DataFrame object that contain outliers in the specified columns.

A row must have outliers in all specified columns in order to be matched by this function. The function returns a subset of the data frame containing all the original columns but only the matched rows.

Tip

The behaviour of this function can be tweaked with the r and k parameters. The r parameter specifies how many standard deviations away from the mean an outlier must be, and the k parameter can be used to exclude the highest and lowest values (the outliers) when calculating the mean of a column, in order to prevent outliers from skewing the calculation of the mean.

Parameters:
  • df (pandas.DataFrame) – data frame
  • columnIndices (list<int>) – indices of columns to find outliers in
  • r (int/float, optional) – standard deviations from mean
  • k (float, optional) – portion to trim from highest and lowest values, 0 (no trimming) <= k < 0.5 (everything trimmed)
Returns:

pandas.DataFrame on success or None on failure

Return type:

pandas.DataFrame
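
The effect of the r and k parameters can be illustrated on a single list of numbers. The sketch below uses only the standard library and is not the dcs implementation; in particular, it assumes that both the mean and the standard deviation are computed from the trimmed values, which the description above does not confirm. The function name trimmed_outliers is hypothetical.

```python
import statistics


def trimmed_outliers(values, r=2, k=0):
    """Return indices of values more than r standard deviations from the mean.

    Assumption: a fraction k is cut from each end of the sorted values, and
    the mean and standard deviation are computed from what remains.
    """
    ordered = sorted(values)
    cut = int(len(ordered) * k)
    core = ordered[cut:len(ordered) - cut]
    mean = statistics.mean(core)
    sd = statistics.stdev(core)
    return [i for i, v in enumerate(values) if abs(v - mean) > r * sd]


data = [10, 11, 10, 10, 100]
plain = trimmed_outliers(data, r=2, k=0)      # 100 inflates mean and sd
trimmed = trimmed_outliers(data, r=2, k=0.2)  # 100 excluded from the statistics
```

With no trimming, the value 100 pulls the mean and standard deviation up so far that nothing is flagged. With k=0.2 the extremes are left out of the statistics and index 4 is reported.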

dcs.load.removeColumns(df, columnIndices)

Removes specified columns from a pandas.DataFrame object.

Parameters:
  • df (pandas.DataFrame) – data frame
  • columnIndices (list<int>) – indices of columns to remove
dcs.load.removeRows(df, rowIndices)

Removes specified rows from a pandas.DataFrame object.

Note that the function calls the pandas.DataFrame.reset_index() method after removing the rows, so the remaining rows are re-indexed sequentially.

Parameters:
  • df (pandas.DataFrame) – data frame
  • rowIndices (list<int>) – indices of rows to remove
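
The re-indexing described above can be seen in this short pandas sketch. It does not call dcs.load, and passing drop=True to reset_index (which discards the old index instead of keeping it as a column) is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30, 40]})
row_indices = [1, 2]

# Drop the rows by position, then renumber the remaining rows from zero.
remaining = df.drop(df.index[row_indices]).reset_index(drop=True)
```

After the call, the surviving rows (originally at positions 0 and 3) carry the index values 0 and 1, so any previously stored row indices no longer refer to the same rows.
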
dcs.load.renameColumn(df, column, newName)

Renames a pandas.DataFrame column

Note

If there are multiple columns matching the target, all matching columns will be renamed to the new name.

Parameters:
  • df (pandas.DataFrame) – dataframe
  • column – name of column to rename
  • newName – new name
dcs.load.rowsWithInvalidValuesInColumns(df, columnIndices)

Finds the rows in a pandas.DataFrame object that contain invalid values in the specified columns.

A row must have invalid values in all specified columns in order to be matched by this function. The function returns a subset of the dataframe containing all the original columns but only the matched rows.

Parameters:
  • df (pandas.DataFrame) – data frame
  • columnIndices (list<int>) – indices of columns for search
Returns:

pandas.DataFrame on success or None on failure

Return type:

pandas.DataFrame
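
The "all specified columns" rule can be sketched with pandas as follows. What dcs counts as an invalid value is not defined above; this sketch assumes it means missing values (NaN), and the sample data is invented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan],
        "b": [np.nan, 2.0, np.nan],
        "c": ["x", "y", "z"],
    }
)
column_indices = [0, 1]

# A row matches only if every selected column is missing in that row.
mask = df.iloc[:, column_indices].isnull().all(axis=1)
invalid = df[mask]
```

Only the last row is returned: the first two rows each have a valid value in one of the two selected columns.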