dcs.load
Language: Python
dcs.load is a module of the dcs package that contains functions for loading pandas.DataFrame objects from, and exporting them to, other formats, as well as for performing basic manipulations such as removing rows and renaming columns.
-
dcs.load.CSVtoDataFrame(filename, header=0, initialSkip=0, sampleSize=100, seed=None, headerIncluded=True)
Initializes a pandas.DataFrame object by reading a CSV file from disk.
Note that lines with missing commas/rows will be automatically discarded, and lines with extra commas/rows will be automatically truncated to the proper length. Forces conversion to UTF-8 text encoding when creating the pandas.DataFrame.
Parameters:
- filename (str) – path to CSV file
- header (int, optional) – line number of row which contains column headers
- initialSkip (int) – number of lines to skip (beginning of file)
- sampleSize (int, optional) – percentage of lines to sample, must be between 0 and 100. If set to 100, no sampling will be performed.
- seed (hashable, optional) – for deterministic/reproducible sampling if sampling enabled
- headerIncluded (bool, optional) – if False, columns will not be named based on any row. If True, header parameter will be ignored
Returns: pandas.DataFrame on success, or None on failure
Return type: pandas.DataFrame or None
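The sampleSize and seed parameters describe percentage-based, reproducible sampling. The sketch below shows one way such sampling could be done with pandas.read_csv and a skiprows callable. It is not the dcs implementation: the function name read_csv_sample and its snake_case parameters are hypothetical, and the assumption that the header sits on the first line after the skipped ones is made for illustration only.

```python
import io
import random

import pandas as pd


def read_csv_sample(source, sample_size=100, seed=None, initial_skip=0):
    """Read a CSV, keeping roughly `sample_size` percent of the data rows.

    Hypothetical helper, not part of dcs.load.
    """
    if not 0 < sample_size <= 100:
        raise ValueError("sample_size must be greater than 0 and at most 100")
    if sample_size == 100:
        # No sampling: only skip the leading lines.
        return pd.read_csv(source, skiprows=initial_skip)

    rng = random.Random(seed)  # the same seed gives the same sample
    fraction = sample_size / 100.0

    def skip(line_number):
        if line_number < initial_skip:
            return True  # leading lines are always skipped
        if line_number == initial_skip:
            return False  # assumed header line, always kept
        return rng.random() >= fraction  # keep about `fraction` of the rows

    return pd.read_csv(source, skiprows=skip)


csv_text = "id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(200))
sample = read_csv_sample(io.StringIO(csv_text), sample_size=25, seed=7)
```

Because rows are dropped while the file is parsed, the full file never has to be held in memory, which is the usual reason for sampling at load time.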
-
dcs.load.JSONtoDataFrame(filename, sampleSize=100, seed=None)
Initializes a pandas.DataFrame object by reading a JSON file from disk.
Forces conversion to UTF-8 text encoding when creating the pandas.DataFrame.
Parameters:
- filename (str) – path to JSON file
- sampleSize (int, optional) – percentage of lines to sample, must be between 0 and 100. If set to 100, no sampling will be performed.
- seed (int, optional) – for deterministic/reproducible sampling if sampling enabled
Returns: pandas.DataFrame on success, or None on failure
Return type: pandas.DataFrame or None
-
dcs.load.XLSXtoDataFrame(filename, initialSkip=0, sampleSize=100, seed=None, headerIncluded=True)
Initializes a pandas.DataFrame object by reading an Excel file from disk.
The function supports loading both .XLS and .XLSX files.
Parameters:
- filename (str) – path to Excel file
- initialSkip (int) – number of lines to skip (beginning of file)
- sampleSize (int, optional) – percentage of lines to sample, must be between 0 and 100. If set to 100, no sampling will be performed.
- seed (int, optional) – for deterministic/reproducible sampling if sampling enabled
- headerIncluded (bool, optional) – if False, columns will not be named based on any row
Returns: pandas.DataFrame on success, or None on failure
Return type: pandas.DataFrame or None
-
dcs.load.changeColumnDataType(df, column, newDataType, dateFormat=None)
Changes the data type of a pandas.DataFrame column.
The function performs the data type conversion in place, modifying the passed-in dataframe. The new data type must be a string that can be parsed by the numpy.dtype() function. Valid data types include “int”, “float64”, “datetime64” and “str”.
Parameters:
- df (pandas.DataFrame) – data frame
- column (list<int>) – column to change data type of
- newDataType (str) – new data type
- dateFormat (str, optional) – Python date format string for parsing dates, if converting to datetime column
Returns: pandas.DataFrame on success or None on failure
Return type: pandas.DataFrame or None
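As a rough illustration of an in-place conversion with an optional date format, the sketch below uses pandas.to_datetime for datetime targets and Series.astype otherwise. The helper name change_column_dtype is hypothetical, the column is addressed by label for simplicity, and the error handling is an assumption rather than the behaviour of dcs.load.

```python
import pandas as pd


def change_column_dtype(df, column, new_dtype, date_format=None):
    """Convert df[column] to new_dtype in place; return df, or None on failure.

    Hypothetical helper, not part of dcs.load.
    """
    try:
        if new_dtype.startswith("datetime"):
            # date_format is a strftime-style pattern such as "%d/%m/%Y".
            df[column] = pd.to_datetime(df[column], format=date_format)
        else:
            df[column] = df[column].astype(new_dtype)
    except (ValueError, TypeError):
        return None  # assumed failure signal, mirroring "None on failure"
    return df


frame = pd.DataFrame({"when": ["01/02/2020", "15/03/2021"], "n": ["1", "2"]})
change_column_dtype(frame, "when", "datetime64", date_format="%d/%m/%Y")
change_column_dtype(frame, "n", "int")
```

Passing an explicit date format avoids ambiguity between day-first and month-first strings such as 01/02/2020.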
-
dcs.load.convertEncoding(filename, source='utf-8', destination='utf-8', buffer=1024)
Converts a file on disk to the specified text encoding, in place.
The function iterates over the file in blocks whose size in bytes is given by the buffer parameter, which defaults to 1024.
Returns: True if successful, False otherwise
Return type: bool
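Block-wise re-encoding can be sketched with the standard library alone. The version below is an illustration, not the dcs code: the name convert_encoding is hypothetical, and writing to a temporary file that then replaces the original is one assumed way of achieving the in-place effect. Incremental codecs are used so that a multi-byte character split across two blocks is still decoded correctly.

```python
import codecs
import os
import tempfile


def convert_encoding(path, source="utf-8", destination="utf-8", buffer=1024):
    """Re-encode the file at `path` block by block; return True on success.

    Hypothetical helper, not part of dcs.load.
    """
    temp_name = None
    try:
        decoder = codecs.getincrementaldecoder(source)()
        encoder = codecs.getincrementalencoder(destination)()
        directory = os.path.dirname(os.path.abspath(path))
        with open(path, "rb") as src, tempfile.NamedTemporaryFile(
            "wb", dir=directory, delete=False
        ) as dst:
            temp_name = dst.name
            # Read `buffer` bytes at a time until the file is exhausted.
            for block in iter(lambda: src.read(buffer), b""):
                dst.write(encoder.encode(decoder.decode(block)))
            # Flush any state the codecs are still holding.
            tail = decoder.decode(b"", final=True)
            dst.write(encoder.encode(tail, final=True))
        os.replace(temp_name, path)  # swap the converted file into place
        return True
    except (OSError, UnicodeError, LookupError):
        # Leave the original untouched and clean up the partial copy.
        if temp_name and os.path.exists(temp_name):
            os.remove(temp_name)
        return False
```

If decoding fails partway through, the original file is left as it was, since nothing is overwritten until the whole conversion has succeeded.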
-
dcs.load.dataFrameToJSON(df, rowIndexFrom=None, rowIndexTo=None, columnIndexFrom=None, columnIndexTo=None)
Serializes a pandas.DataFrame object to a JSON string.
Tip
By default, the function converts the entire data frame to JSON, but a partial segment of the data frame can be encoded instead by supplying the four index parameters.
Note
All four index parameters must either be left at the default value of None or be supplied with valid integer index values.
The function uses the pandas.DataFrame.to_json() method to convert the object to JSON, using the ‘split’ format. All dates are encoded as strings in the ISO 8601 date format.
Parameters:
- df (pandas.DataFrame) – dataframe to convert
- rowIndexFrom (int, optional) – if serializing partial segment, start index of row interval
- rowIndexTo (int, optional) – if serializing partial segment, end index of row interval
- columnIndexFrom (int, optional) – if serializing partial segment, start index of column interval
- columnIndexTo (int, optional) – if serializing partial segment, end index of column interval
Returns: JSON string on success, or None on failure
Return type: str or None
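The ‘split’ format and the partial-segment option can be illustrated with a short sketch. The helper frame_to_json is hypothetical, and treating the end indices as inclusive is an assumption, since the text above does not say whether the intervals are closed or half-open.

```python
import json

import pandas as pd


def frame_to_json(df, row_from=None, row_to=None, col_from=None, col_to=None):
    """Serialize all of df, or an index-bounded segment, as 'split' JSON.

    Hypothetical helper, not part of dcs.load.
    """
    bounds = (row_from, row_to, col_from, col_to)
    if any(b is not None for b in bounds):
        # All four bounds are required once any one of them is given.
        if not all(isinstance(b, int) for b in bounds):
            return None
        # Assumption: end indices are inclusive.
        df = df.iloc[row_from:row_to + 1, col_from:col_to + 1]
    return df.to_json(orient="split", date_format="iso")


frame = pd.DataFrame(
    {
        "a": [1, 2, 3],
        "b": ["x", "y", "z"],
        "when": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
    }
)
segment = json.loads(frame_to_json(frame, 1, 2, 0, 1))
```

In the ‘split’ layout the parsed result has three keys, columns, index and data, so column labels are stored once rather than repeated for every row.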
-
dcs.load.duplicateRowsInColumns(df, columnIndices)
Finds the rows in a pandas.DataFrame object that have duplicate values in the specified columns.
A set of rows must have duplicate values in all specified columns in order to be matched by this function. The function returns a subset of the dataframe containing all the original columns but only the matched rows.
Note
The returned data frame is sorted according to the first specified column in order to better show the duplicate values.
Parameters:
- df (pandas.DataFrame) – data frame
- columnIndices (list<int>) – indices of columns for search
Returns: pandas.DataFrame on success or None on failure
Return type: pandas.DataFrame or None
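The matching rule described above (duplicates across all specified columns, sorted by the first one) can be sketched with DataFrame.duplicated. The name duplicate_rows is hypothetical, and the sketch assumes column labels are unique so that positions can be mapped to labels.

```python
import pandas as pd


def duplicate_rows(df, column_indices):
    """Return rows whose values repeat across all the given columns.

    Hypothetical helper, not part of dcs.load.
    """
    subset = [df.columns[i] for i in column_indices]
    # keep=False marks every member of a duplicate group, not just the repeats.
    mask = df.duplicated(subset=subset, keep=False)
    return df[mask].sort_values(by=subset[0])


frame = pd.DataFrame({"a": [1, 2, 1, 3, 2], "b": ["x", "y", "x", "z", "q"]})
both = duplicate_rows(frame, [0, 1])   # rows 0 and 2 match on a and b
only_a = duplicate_rows(frame, [0])    # rows 0, 2, 1 and 4 match on a alone
```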
-
dcs.load.emptyStringToNan(df, columnIndex)
Replaces all instances of ‘’ (empty string) with numpy.NaN for a specified pandas.DataFrame column.
Parameters:
- df (pandas.DataFrame) – data frame
- columnIndex (int) – integer index of column
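A minimal sketch of this replacement, assuming unique column labels, is shown below. The helper name empty_string_to_nan is hypothetical. The lowercase numpy.nan is used because the numpy.NaN alias was removed in NumPy 2.0.

```python
import numpy as np
import pandas as pd


def empty_string_to_nan(df, column_index):
    """Replace '' with NaN in one column of df, in place.

    Hypothetical helper, not part of dcs.load.
    """
    name = df.columns[column_index]
    df[name] = df[name].replace("", np.nan)


frame = pd.DataFrame({"label": ["a", "", "b"], "other": ["", "k", ""]})
empty_string_to_nan(frame, 0)
```

Only the selected column changes; empty strings in the other columns are left alone.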
-
dcs.load.guessEncoding(filename)
Guesses the encoding of a file.
Uses the chardet library to guess the text encoding of a file, and returns a guess if confidence > 80%.
Parameters:
- filename (str) – path to file
Returns: Python text encoding string, e.g. ‘utf-8’ or ‘latin-1’
Return type: str
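The confidence threshold can be sketched around chardet.detect, which returns a dictionary containing "encoding" and "confidence" keys. The helper guess_encoding, its detect and threshold parameters, and the choice to return None below the threshold are all assumptions made for illustration; the text above does not say what is returned for a low-confidence guess.

```python
def guess_encoding(path, detect=None, threshold=0.8):
    """Return the detected encoding name if confidence exceeds `threshold`.

    Hypothetical helper, not part of dcs.load. `detect` defaults to
    chardet.detect and is injectable so the logic can be tried without chardet.
    """
    if detect is None:
        import chardet  # third-party package

        detect = chardet.detect
    with open(path, "rb") as handle:
        result = detect(handle.read())
    if result["encoding"] and result["confidence"] > threshold:
        return result["encoding"]
    return None  # assumed behaviour for a low-confidence guess
```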
-
dcs.load.newCellValue(df, columnIndex, rowIndex, newValue)
Modifies the value of a specified cell in a pandas.DataFrame object.
Parameters:
- df (pandas.DataFrame) – data frame
- columnIndex (int) – integer index of column
- rowIndex (int) – integer index of row
- newValue – fill value
-
dcs.load.outliersTrimmedMeanSd(df, columnIndices, r=2, k=0)
Finds the rows in a pandas.DataFrame object that are outliers in the specified columns.
A row must be an outlier in all specified columns in order to be matched by this function. The function returns a subset of the dataframe containing all the original columns but only the matched rows.
Tip
The behaviour of this function can be tweaked with the r and k parameters. The r parameter specifies how many standard deviations away from the mean a value must be to count as an outlier. The k parameter can be used to exclude the highest and lowest values (the outliers) when calculating the mean of a column, so that the outliers do not skew the mean.
Parameters:
- df (pandas.DataFrame) – data frame
- columnIndices (list<int>) – indices of columns to find outliers in
- r (int/float, optional) – standard deviations from mean
- k (float, optional) – portion to trim from highest and lowest values, 0 (no trimming) <= k < 0.5 (everything trimmed)
Returns: pandas.DataFrame on success or None on failure
Return type: pandas.DataFrame or None
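The roles of r and k can be made concrete with a small sketch. It trims a fraction k of the values from each end of a column, computes the mean and standard deviation of what remains, and flags values more than r standard deviations from that mean. The helper names are hypothetical, and computing the standard deviation on the trimmed values (rather than on the full column) is an assumption; the text above only mentions trimming for the mean.

```python
import numpy as np
import pandas as pd


def trimmed_mean_sd(values, k=0.0):
    """Mean and sample SD after dropping a fraction k from each end."""
    ordered = np.sort(np.asarray(values, dtype=float))
    cut = int(len(ordered) * k)
    kept = ordered[cut:len(ordered) - cut]
    return kept.mean(), kept.std(ddof=1)


def outlier_rows(df, column_indices, r=2, k=0.0):
    """Return rows that are outliers in every one of the given columns.

    Hypothetical helper, not part of dcs.load.
    """
    mask = pd.Series(True, index=df.index)
    for i in column_indices:
        column = df.iloc[:, i]
        mean, sd = trimmed_mean_sd(column.dropna(), k)
        # A value is an outlier if it lies more than r SDs from the mean.
        mask &= (column - mean).abs() > r * sd
    return df[mask]


frame = pd.DataFrame(
    {
        "x": [10, 11, 9, 10, 12, 10, 11, 9, 10, 100],
        "y": [1, 2, 1, 2, 1, 2, 1, 2, 1, 50],
        "z": [50, 1, 2, 1, 2, 1, 2, 1, 2, 1],
    }
)
in_x_and_y = outlier_rows(frame, [0, 1])         # last row only
in_x_and_z = outlier_rows(frame, [0, 2])         # no row is extreme in both
trimmed = outlier_rows(frame, [0], r=2, k=0.1)   # last row only
```

With k above 0 the extreme value no longer inflates the mean and standard deviation, so the threshold tightens around the bulk of the data.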
-
dcs.load.removeColumns(df, columnIndices)
Removes specified columns from a pandas.DataFrame object.
Parameters:
- df (pandas.DataFrame) – data frame
- columnIndices (list<int>) – indices of columns to remove
-
dcs.load.removeRows(df, rowIndices)
Removes specified rows from a pandas.DataFrame object.
Note that the function calls the pandas.DataFrame.reset_index() method after removing the rows, meaning that the rows are re-indexed to be sequential.
Parameters:
- df (pandas.DataFrame) – data frame
- rowIndices (list<int>) – indices of rows to remove
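The effect of re-indexing after removal can be seen in a short sketch. The helper remove_rows is hypothetical, and passing drop=True to reset_index (so that the old index is discarded rather than kept as a column) is an assumption.

```python
import pandas as pd


def remove_rows(df, row_indices):
    """Drop rows by position, then renumber the index from 0, in place.

    Hypothetical helper, not part of dcs.load.
    """
    df.drop(df.index[row_indices], inplace=True)
    df.reset_index(drop=True, inplace=True)  # assumed: old index discarded


frame = pd.DataFrame({"a": [10, 20, 30, 40]})
remove_rows(frame, [1, 2])
# The surviving rows (10 and 40) are now at positions 0 and 1.
```

One consequence is that row indices obtained before a removal no longer refer to the same rows afterwards.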
-
dcs.load.renameColumn(df, column, newName)
Renames a pandas.DataFrame column.
Note
If there are multiple columns matching the target, all matching columns will be renamed to the new name.
Parameters:
- df (pandas.DataFrame) – dataframe
- column – name of column to rename
- newName – new name
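The note about multiple matching columns can be illustrated by rebuilding the column labels. This is a sketch under the assumption that matching is by exact label equality; rename_column is a hypothetical name.

```python
import pandas as pd


def rename_column(df, column, new_name):
    """Rename every column labelled `column` to `new_name`, in place.

    Hypothetical helper, not part of dcs.load.
    """
    df.columns = [new_name if label == column else label for label in df.columns]


frame = pd.DataFrame([[1, 2, 3]], columns=["x", "y", "x"])
rename_column(frame, "x", "z")
```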
-
dcs.load.rowsWithInvalidValuesInColumns(df, columnIndices)
Finds the rows in a pandas.DataFrame object that contain invalid values in the specified columns.
A row must have invalid values in all specified columns in order to be matched by this function. The function returns a subset of the dataframe containing all the original columns but only the matched rows.
Parameters:
- df (pandas.DataFrame) – data frame
- columnIndices (list<int>) – indices of columns for search
Returns: pandas.DataFrame on success or None on failure
Return type: pandas.DataFrame or None
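The text above does not define what counts as an invalid value. Assuming it means a missing value (NaN or None), the "all specified columns" rule could be sketched as follows; rows_with_missing is a hypothetical name.

```python
import numpy as np
import pandas as pd


def rows_with_missing(df, column_indices):
    """Return rows that are missing in every one of the given columns.

    Hypothetical helper, not part of dcs.load. "Invalid" is assumed to
    mean NaN/None here.
    """
    mask = df.iloc[:, column_indices].isnull().all(axis=1)
    return df[mask]


frame = pd.DataFrame(
    {"a": [1, np.nan, np.nan], "b": [np.nan, np.nan, 2], "c": [1, 2, 3]}
)
both_missing = rows_with_missing(frame, [0, 1])  # middle row only
a_missing = rows_with_missing(frame, [0])        # last two rows
```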