dcs.load

Language: Python

dcs.load is a module of the dcs package that contains functions for loading and exporting pandas.DataFrame objects to/from other formats as well as performing basic manipulations such as removing rows and renaming columns.

dcs.load.CSVtoDataFrame(filename, header=0, initialSkip=0, sampleSize=100, seed=None, headerIncluded=True)

Initializes a pandas.DataFrame object by reading a CSV file from disk

Note that lines with too few fields (missing commas) are automatically discarded, and lines with too many fields (extra commas) are automatically truncated to the proper length. The text is converted to UTF-8 encoding when the pandas.DataFrame is created.

Parameters:
    filename (str) – path to CSV file
    header (int, optional) – line number of row which contains column headers
    initialSkip (int) – number of lines to skip (beginning of file)
    sampleSize (int, optional) – percentage of lines to sample, must be between 0 and 100. If set to 100, no sampling will be performed.
    seed (hashable, optional) – for deterministic/reproducible sampling if sampling enabled
    headerIncluded (bool, optional) – if `False`, columns will not be named based on any row. If `True`, `header` parameter will be ignored
Returns:	pandas.DataFrame on success, or None on failure
Return type:	pandas.DataFrame
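
The percentage-based sampling controlled by sampleSize and seed can be pictured with plain pandas. The sketch below is an assumption about the behaviour, not the dcs implementation: `sample_percentage` is a hypothetical helper, and it samples after the file has been read, which may differ from what the library does.

```python
import io

import pandas as pd


def sample_percentage(df, sampleSize=100, seed=None):
    """Return roughly sampleSize percent of the rows of df (hypothetical helper)."""
    if not 0 <= sampleSize <= 100:
        raise ValueError("sampleSize must be between 0 and 100")
    if sampleSize == 100:
        return df  # no sampling requested
    # random_state makes the draw reproducible for a given seed
    sampled = df.sample(frac=sampleSize / 100.0, random_state=seed)
    # restore the original row order and renumber the rows
    return sampled.sort_index().reset_index(drop=True)


csv_text = "id,value\n" + "\n".join(f"{i},{i * i}" for i in range(10))
full = pd.read_csv(io.StringIO(csv_text), header=0)
half_a = sample_percentage(full, sampleSize=50, seed=42)
half_b = sample_percentage(full, sampleSize=50, seed=42)
```

Calling the helper twice with the same seed yields the same rows, which is the reproducibility the seed parameter is meant to provide.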

dcs.load.JSONtoDataFrame(filename, sampleSize=100, seed=None)

Initializes a pandas.DataFrame object by reading a JSON file from disk

Forces conversion to UTF-8 text encoding when creating pandas.DataFrame.

Parameters:
    filename (str) – path to JSON file
    sampleSize (int, optional) – percentage of lines to sample, must be between 0 and 100. If set to 100, no sampling will be performed.
    seed (int, optional) – for deterministic/reproducible sampling if sampling enabled
Returns:	pandas.DataFrame on success, or None on failure
Return type:	pandas.DataFrame

dcs.load.XLSXtoDataFrame(filename, initialSkip=0, sampleSize=100, seed=None, headerIncluded=True)

Initializes a pandas.DataFrame object by reading an Excel file from disk

The function supports loading both .XLS and .XLSX files.

Parameters:
    filename (str) – path to Excel file
    initialSkip (int) – number of lines to skip (beginning of file)
    sampleSize (int, optional) – percentage of lines to sample, must be between 0 and 100. If set to 100, no sampling will be performed.
    seed (int, optional) – for deterministic/reproducible sampling if sampling enabled
    headerIncluded (bool, optional) – if `False`, columns will not be named based on any row
Returns:	pandas.DataFrame on success, or None on failure
Return type:	pandas.DataFrame

dcs.load.changeColumnDataType(df, column, newDataType, dateFormat=None)

Changes the data type of a pandas.DataFrame column.

The function performs the data type conversion in place, modifying the passed in dataframe. The new data type must be a string that can be parsed by the numpy.dtype() function. Valid data types include “int”, “float64”, “datetime64” and “str”.

Parameters:
    df (pandas.DataFrame) – data frame
    column (list<int>) – column to change data type of
    newDataType (str) – new data type
    dateFormat (str, optional) – Python date format string for parsing dates, if converting to datetime column
Returns:	pandas.DataFrame on success or None on failure
Return type:	pandas.DataFrame
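
As an illustration of the conversion described above, the following sketch validates the type string with numpy.dtype() and then converts the column. It is not the dcs code: `change_dtype` is a hypothetical helper, and it assumes the column is addressed by name and that datetime conversion goes through pandas.to_datetime.

```python
import numpy as np
import pandas as pd


def change_dtype(df, column, newDataType, dateFormat=None):
    """Convert df[column] in place (hypothetical helper, not the dcs code)."""
    dtype = np.dtype(newDataType)  # raises TypeError for unparseable strings
    if np.issubdtype(dtype, np.datetime64):
        # dateFormat uses strftime-style directives such as "%d/%m/%Y"
        df[column] = pd.to_datetime(df[column], format=dateFormat)
    else:
        df[column] = df[column].astype(newDataType)
    return df


df = pd.DataFrame({"price": ["1.5", "2.25"], "day": ["31/01/2016", "01/02/2016"]})
change_dtype(df, "price", "float64")
change_dtype(df, "day", "datetime64", dateFormat="%d/%m/%Y")
```

Without an explicit dateFormat, a value such as "01/02/2016" is ambiguous between day-first and month-first, which is the case the dateFormat parameter addresses.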

dcs.load.convertEncoding(filename, source='utf-8', destination='utf-8', buffer=1024)

Converts a file on disk to specified text encoding in-place

The function iterates over the file in blocks whose size in bytes is given by the buffer parameter, which defaults to 1024.

Parameters:
    filename (str) – path to file
    source (str, optional) – original text encoding of file
    destination (str, optional) – text encoding to convert to
    buffer (int) – conversion iteration block size
Returns:	True if successful, False otherwise
Return type:	bool
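
A block-wise, in-place conversion of this kind might look like the sketch below. It is only an approximation under stated assumptions: `convert_encoding` is a hypothetical helper, it writes to a temporary file and then replaces the original, and it reads blocks of characters rather than bytes so that a multi-byte character is never split between two blocks.

```python
import os
import tempfile


def convert_encoding(filename, source="utf-8", destination="utf-8", buffer=1024):
    """Re-encode filename in place, block by block (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(filename))
    temp_name = None
    try:
        with open(filename, "r", encoding=source, newline="") as src:
            with tempfile.NamedTemporaryFile(
                "w", encoding=destination, newline="", dir=directory, delete=False
            ) as dst:
                temp_name = dst.name
                # text-mode read(n) returns up to n characters per block
                for block in iter(lambda: src.read(buffer), ""):
                    dst.write(block)
        os.replace(temp_name, filename)  # swap in the converted file
        return True
    except (OSError, UnicodeError, LookupError):
        # clean up the partial temporary file and report failure
        if temp_name is not None and os.path.exists(temp_name):
            os.remove(temp_name)
        return False
```

Returning False instead of raising mirrors the documented return value. The original file is left untouched when the conversion fails.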

dcs.load.dataFrameToJSON(df, rowIndexFrom=None, rowIndexTo=None, columnIndexFrom=None, columnIndexTo=None)

Serializes a pandas.DataFrame object to a JSON string

Tip

By default, the function converts the entire data frame to JSON, but one can also request partial segments of the data frame to be encoded as JSON, by supplying the four parameters.

Note

All four index parameters must either be left as the default value of None or be supplied valid integer index values.

The function uses the pandas.DataFrame.to_json() method to convert the object to JSON, using the ‘split’ format. All dates are encoded to strings using the ISO8601 date format.

Parameters:
    df (pandas.DataFrame) – dataframe to convert
    rowIndexFrom (int, optional) – if serializing partial segment, start index of row interval
    rowIndexTo (int, optional) – if serializing partial segment, end index of row interval
    columnIndexFrom (int, optional) – if serializing partial segment, start index of column interval
    columnIndexTo (int, optional) – if serializing partial segment, end index of column interval
Returns:	JSON string on success, or None on failure
Return type:	str
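
The 'split' format and the partial-segment behaviour can be reproduced with pandas directly. In the sketch below, `frame_to_json` is a hypothetical helper, and treating the end indices as exclusive (as in Python slicing) is an assumption, since the documentation does not say whether they are inclusive.

```python
import json

import pandas as pd


def frame_to_json(df, rowIndexFrom=None, rowIndexTo=None,
                  columnIndexFrom=None, columnIndexTo=None):
    """Serialize all of df, or a positional slice of it (hypothetical helper)."""
    bounds = (rowIndexFrom, rowIndexTo, columnIndexFrom, columnIndexTo)
    if all(b is not None for b in bounds):
        # assumption: end indices are exclusive, as in Python slicing
        df = df.iloc[rowIndexFrom:rowIndexTo, columnIndexFrom:columnIndexTo]
    elif any(b is not None for b in bounds):
        raise ValueError("supply all four index parameters or none of them")
    # 'split' keeps columns, index and data as three separate JSON keys
    return df.to_json(orient="split", date_format="iso")


frame = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
payload = json.loads(frame_to_json(frame, 0, 2, 1, 3))
```

Here `payload` holds the keys "columns", "index" and "data" for the first two rows of columns "b" and "c".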

dcs.load.duplicateRowsInColumns(df, columnIndices)

Finds the rows in a pandas.DataFrame object that have duplicate values in the specified columns.

A set of rows must have duplicate values in all specified columns in order to be matched by this function. The function returns a subset of the dataframe containing all the original columns but only the matched rows.

Note

The returned data frame is sorted according to the first specified column in order to better show the duplicate values.

Parameters:
    df (pandas.DataFrame) – data frame
    columnIndices (list<int>) – indices of columns for search
Returns:	pandas.DataFrame on success or None on failure
Return type:	pandas.DataFrame
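
The matching rule (duplicates in all specified columns, result sorted by the first one) can be expressed with pandas.DataFrame.duplicated(). This is a sketch of the described behaviour rather than the library's code, and `duplicate_rows` is a hypothetical name.

```python
import pandas as pd


def duplicate_rows(df, columnIndices):
    """Rows whose values repeat across all the given columns (hypothetical helper)."""
    names = [df.columns[i] for i in columnIndices]
    # keep=False marks every member of a duplicate group, not just the repeats
    mask = df.duplicated(subset=names, keep=False)
    return df[mask].sort_values(by=names[0])


people = pd.DataFrame({
    "name": ["Bo", "Al", "Bo", "Al", "Cy"],
    "city": ["Oslo", "Rome", "Oslo", "Paris", "Oslo"],
    "age": [30, 41, 35, 41, 30],
})
matches = duplicate_rows(people, [0, 1])
```

Only the two "Bo"/"Oslo" rows match. The "Al" rows share a name but not a city, so they are not reported.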

dcs.load.emptyStringToNan(df, columnIndex)

Replaces all instances of ‘’ (empty string) with numpy.NaN for a specified pandas.DataFrame column

Parameters:
    df (pandas.DataFrame) – data frame
    columnIndex (int) – integer index of column
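
A minimal sketch of the replacement, assuming the column is looked up by position and the frame is modified in place (`empty_string_to_nan` is a hypothetical helper, not the dcs function):

```python
import numpy as np
import pandas as pd


def empty_string_to_nan(df, columnIndex):
    """Replace '' with NaN in one column, modifying df (hypothetical helper)."""
    name = df.columns[columnIndex]
    df[name] = df[name].replace("", np.nan)


table = pd.DataFrame({"id": [1, 2, 3], "note": ["ok", "", "late"]})
empty_string_to_nan(table, 1)
```

After the call, the empty entry in "note" is reported as missing by `isna()`, while the other column is unchanged.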

dcs.load.guessEncoding(filename)

Guesses encoding of a file.

Uses the chardet library to guess the text encoding of a file, and returns the guess only if chardet's confidence is greater than 80%.

Parameters:	filename (str) – path to file
Returns:	Python text encoding string e.g. ‘utf-8’ and ‘latin-1’
Return type:	str

dcs.load.newCellValue(df, columnIndex, rowIndex, newValue)

Modifies the value of a specified cell in a pandas.DataFrame object

Parameters:
    df (pandas.DataFrame) – data frame
    columnIndex (int) – integer index of column
    rowIndex (int) – integer index of row
    newValue – fill value
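
Note the argument order: the column index comes before the row index. A rough pandas equivalent, with `set_cell` as a hypothetical stand-in for the dcs function, is:

```python
import pandas as pd


def set_cell(df, columnIndex, rowIndex, newValue):
    """Overwrite one cell by position (hypothetical helper)."""
    # DataFrame.iat takes (row, column), the reverse of this argument order
    df.iat[rowIndex, columnIndex] = newValue


grid = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
set_cell(grid, 1, 0, 99)  # column 1 ("y"), row 0
```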

dcs.load.outliersTrimmedMeanSd(df, columnIndices, r=2, k=0)

Finds the rows in a pandas.DataFrame object that are outliers in the specified columns.

A row must hold an outlier value in all specified columns in order to be matched by this function. The function returns a subset of the dataframe containing all the original columns but only the matched rows.

Tip

The behaviour of this function can be tweaked with the r and k parameters. The r parameter specifies how many standard deviations away from the mean an outlier must be, and the k parameter can be used to exclude the highest and lowest values (the outliers) when calculating the mean of a column, in order to prevent outliers from skewing the calculation of the mean.

Parameters:
    df (pandas.DataFrame) – data frame
    columnIndices (list<int>) – indices of columns to find outliers in
    r (int/float, optional) – standard deviations from mean
    k (float, optional) – portion to trim from highest and lowest values, 0 (no trimming) <= k < 0.5 (everything trimmed)
Returns:	pandas.DataFrame on success or None on failure
Return type:	pandas.DataFrame
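
The interplay of r and k is easier to see in code. The sketch below is one plausible reading of the description, not the dcs implementation: `outliers_trimmed_mean_sd` is a hypothetical helper, and it assumes that both the mean and the sample standard deviation are computed from the trimmed values.

```python
import numpy as np
import pandas as pd


def outliers_trimmed_mean_sd(df, columnIndices, r=2, k=0):
    """Rows that are outliers in all the given columns (hypothetical helper)."""
    if not 0 <= k < 0.5:
        raise ValueError("k must satisfy 0 <= k < 0.5")
    mask = pd.Series(True, index=df.index)
    for i in columnIndices:
        values = df.iloc[:, i].dropna().astype(float)
        cut = int(k * len(values))  # number of values dropped from each end
        trimmed = np.sort(values.to_numpy())[cut:len(values) - cut]
        # assumption: mean and sample SD both come from the trimmed values
        mean, sd = trimmed.mean(), trimmed.std(ddof=1)
        # NaN cells compare as False, so they are never reported as outliers
        mask &= (df.iloc[:, i] - mean).abs() > r * sd
    return df[mask]


readings = pd.DataFrame({"temp": [10, 11, 9, 10, 12, 10, 11, 100]})
plain = outliers_trimmed_mean_sd(readings, [0], r=2, k=0)
robust = outliers_trimmed_mean_sd(readings, [0], r=2, k=0.125)
```

With k=0 the extreme reading inflates both the mean and the standard deviation. Trimming one value from each end (k=0.125 on eight rows) gives a tighter band around the typical readings.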

dcs.load.removeColumns(df, columnIndices)

Removes specified columns from a pandas.DataFrame object.

Parameters:
    df (pandas.DataFrame) – data frame
    columnIndices (list<int>) – indices of columns to remove

dcs.load.removeRows(df, rowIndices)

Removes specified rows from a pandas.DataFrame object.

Note that the function calls the pandas.DataFrame.reset_index() method after removing the rows, meaning that the remaining rows are re-indexed to be sequential.

Parameters:
    df (pandas.DataFrame) – data frame
    rowIndices (list<int>) – indices of rows to remove
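
The effect of re-indexing after removal can be seen in this short sketch. `remove_rows` is a hypothetical helper that returns a new frame, and passing `drop=True` to reset_index() is an assumption about how the old index is handled.

```python
import pandas as pd


def remove_rows(df, rowIndices):
    """Drop rows by position and renumber the rest (hypothetical helper)."""
    kept = df.drop(df.index[rowIndices])
    # drop=True discards the old index instead of keeping it as a column
    return kept.reset_index(drop=True)


letters = pd.DataFrame({"letter": list("abcde")})
trimmed = remove_rows(letters, [1, 3])
```

Because the index is renumbered, row indices obtained before a removal no longer point at the same rows afterwards.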

dcs.load.renameColumn(df, column, newName)

Renames a pandas.DataFrame column

Note

If there are multiple columns matching the target, all matching columns will be renamed to the new name.

Parameters:
    df (pandas.DataFrame) – dataframe
    column – name of column to rename
    newName – new name
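
The note about multiple matching columns corresponds to how label-based renaming behaves in pandas. The snippet below uses pandas.DataFrame.rename() directly to show the effect. Whether dcs uses this call internally is an assumption.

```python
import pandas as pd

# a frame with two columns that share the label "a"
wide = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "a"])

# rename maps every label equal to "a", so both columns are renamed
renamed = wide.rename(columns={"a": "x"})
```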

dcs.load.rowsWithInvalidValuesInColumns(df, columnIndices)

Finds the rows in a pandas.DataFrame object that contain invalid values in the specified columns.

A row must have invalid values in all specified columns in order to be matched by this function. The function returns a subset of the dataframe containing all the original columns but only the matched rows.

Parameters:
    df (pandas.DataFrame) – data frame
    columnIndices (list<int>) – indices of columns for search
Returns:	pandas.DataFrame on success or None on failure
Return type:	pandas.DataFrame
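
The documentation does not define "invalid". The sketch below assumes it means missing values (NaN, None or NaT) and shows the "all specified columns" rule. `rows_with_invalid_values` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def rows_with_invalid_values(df, columnIndices):
    """Rows that are missing in every given column (hypothetical helper)."""
    # assumption: "invalid" means missing (NaN, None or NaT)
    invalid = df.iloc[:, columnIndices].isnull()
    return df[invalid.all(axis=1)]


survey = pd.DataFrame({
    "age": [25, np.nan, np.nan],
    "income": [np.nan, 50000, np.nan],
    "city": ["Oslo", "Rome", None],
})
both_missing = rows_with_invalid_values(survey, [0, 1])
```

Only the last row is missing both "age" and "income", so it is the only row matched when both columns are specified.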
