Documentation

This section explains the important code files in our repository. For each file, it lists the methods and functions along with their arguments, return values, and purposes.

1. Codebase Structure

The important files in our codebase are listed below:

.
├── api/
│   ├── controllers/
│   │   └── classificationController.js
│   ├── routes/
│   │   └── classificationRoutes.js
│   ├── services/
│   │   ├── classificationService.js
│   │   └── hazardDefinitionService.js
│   └── app.js
├── frontend/
│   ├── app.py
│   └── hazard_definitions.xlsx
└── tools/
    ├── AssociationMatrix/
    │   ├── llamaCTransformerAssocGenerator.py
    │   └── run_assocMat.sh
    ├── ConfusionMatrix/
    │   ├── ConfusionMatrix.py
    │   └── run_confMat.sh
    ├── data/
    │   ├── confusion_matrix.xlsx
    │   ├── hazard_definitions.xlsx
    │   └── scores.xlsx
    └── setup_tools.sh

2. API Documentation

2.1 app.js

This file creates the Express.js application, mounts the classification routes, and exposes the application as the classify function for Google Cloud.

2.2 classificationRoutes.js

This file defines the API routes connecting requests to the controller layer.

2.3 classificationController.js

| Function Name | Full Function Definition | Description |
| --- | --- | --- |
| startClassification | startClassification(req, res) ⇒ Object | Starts the rules-based classification process. Uses keyword matching to identify hazards in the report. |
| refineClassification | refineClassification(req, res) ⇒ Object | Refines the classification using the confirmed and rejected hazards, looking for hazards that list a confirmed hazard among their upstream hazards. |
| startGPTClassification | startGPTClassification(req, res) ⇒ Object | Uses ChatGPT to classify the report, sending the report and the hazard definitions to the ChatGPT API to get the classification. |
| startCombinedClassification | startCombinedClassification(req, res) ⇒ Object | Classifies the report using both the rules-based and GPT classification routes, returning the combined questions. |
| getHazardsByCode | getHazardsByCode(req, res) ⇒ Object | Retrieves the full hazard information for the given hazard codes. |

2.4 classificationService.js

| Function Name | Full Function Definition | Description |
| --- | --- | --- |
| startClassification | startClassification(report) ⇒ Array | Starts the hazard classification process for a given report. Retrieves hazard definitions and generates report-specific questions. |
| startGPTClassification | startGPTClassification(report) ⇒ Array | Starts the GPT-based hazard classification process for a given report. Retrieves hazard definitions and identifies hazards using OpenAI ChatGPT classification. |
| refineClassification | refineClassification(confirmed, rejected) ⇒ Array | Refines the hazard classification based on confirmed and rejected hazards. Retrieves hazard definitions and collates associated hazards. |
| getHazardsByCode | getHazardsByCode(hazardCodes) ⇒ Array | Retrieves hazard information based on hazard codes. |
| startCombinedClassification | startCombinedClassification(report) ⇒ Array | Starts the combined hazard classification process for a given report. Retrieves hazard definitions and generates both GPT and report-specific questions. |
| getReportQuestions | getReportQuestions(reportText, hazardDefinitions) ⇒ Array | Generates report-specific questions based on a given report and hazard definitions. |
| getHazardList | getHazardList(hazardDefinitions, report) ⇒ Array | Retrieves a comprehensive hazard list using the GPT API, based on a given report and hazard definitions. |
| getHazardQuestion | getHazardQuestion(hazardCode, hazardDefinitions) ⇒ string or null | Retrieves the question for a specific hazard code from the hazard definitions. |
| getGPTQuestions | getGPTQuestions(report, hazardDefinitions) ⇒ Array | Uses ChatGPT to identify hazards in a report by sending a request to the ChatGPT API. |
| groupHazardsByCategory | groupHazardsByCategory(hazards) ⇒ Array | Groups hazards by category. |
| getUpstreamQuestions | getUpstreamQuestions(confirmed, rejected, hazardDefinitions) ⇒ Array | Searches the hazard definitions for hazards associated with the input hazards. |
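
The upstream lookup used during refinement can be illustrated with a minimal sketch. It is written in Python to match the tools documented later on this page, whereas the service itself is JavaScript. The record layout (an "upstream" list per hazard code), the function name get_upstream_candidates, and the sample codes are all assumptions made for illustration, not the actual data model.

```python
from typing import Dict, List

# Hypothetical layout: each hazard code maps to a record that lists
# the codes of its upstream hazards. The real schema may differ.
HazardDefinitions = Dict[str, Dict[str, List[str]]]


def get_upstream_candidates(
    confirmed: List[str],
    rejected: List[str],
    hazard_definitions: HazardDefinitions,
) -> List[str]:
    """Return hazards that list at least one confirmed hazard as upstream.

    Hazards the user has already confirmed or rejected are skipped,
    so they are not asked about again.
    """
    confirmed_set = set(confirmed)
    already_answered = confirmed_set | set(rejected)
    candidates = []
    for code, definition in hazard_definitions.items():
        if code in already_answered:
            continue
        upstream = definition.get("upstream", [])
        if confirmed_set.intersection(upstream):
            candidates.append(code)
    return candidates


# Illustrative data only.
definitions: HazardDefinitions = {
    "H1": {"upstream": []},
    "H2": {"upstream": ["H1"]},
    "H3": {"upstream": ["H1"]},
    "H4": {"upstream": ["H3"]},
}

candidates = get_upstream_candidates(["H1"], ["H3"], definitions)
print(candidates)  # ['H2']
```

H3 is excluded because it was rejected, and H4 is excluded because its only upstream hazard was not confirmed.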

2.5 hazardDefinitionService.js

| Function Name | Full Function Definition | Description |
| --- | --- | --- |
| getHazardDefinitions | getHazardDefinitions() ⇒ Array | Accesses the hazard_definitions.json file stored in the hazard_definitions_bucket, parses it, and returns it. |

3. llamaCTransformerAssocGenerator.py

| Function Name | Full Function Definition | Description |
| --- | --- | --- |
| extract_score | extract_score(response) ⇒ int | Extracts the likelihood score from the LLM's response. |
| find_first_missing_pair | find_first_missing_pair(dataframe) ⇒ tuple or None | Finds the first pair of row and column categories in a DataFrame where the corresponding element is missing. |
| run_llm | run_llm(hazard1, hazard2, def1, def2) ⇒ tuple | Runs a likelihood assessment using a large language model (LLM) to evaluate the likelihood that hazard1 causes hazard2. |
| find_invalid_pairs | find_invalid_pairs() ⇒ List(tuple) | Finds pairs of row and column categories in the scores DataFrame where the corresponding element is -1. |
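
The following sketch shows one way the scoring helpers could behave. It makes several assumptions: the score is taken to be the first integer in the LLM response, -1 is used as the marker for an unparseable response (consistent with find_invalid_pairs looking for -1), and a nested dictionary with None for empty cells stands in for the pandas DataFrame so that the example needs only the standard library. The actual parsing rules and prompt format are not documented here.

```python
import re
from typing import Dict, List, Optional, Tuple

# Stand-in for the scores DataFrame: scores[row][column].
# None marks a cell that has not been computed yet (assumption).
Scores = Dict[str, Dict[str, Optional[int]]]


def extract_score(response: str) -> int:
    """Return the first integer in the response, or -1 if none is found."""
    match = re.search(r"\d+", response)
    return int(match.group()) if match else -1


def find_first_missing_pair(scores: Scores) -> Optional[Tuple[str, str]]:
    """Return the first (row, column) whose cell is missing, else None."""
    for row, columns in scores.items():
        for column, value in columns.items():
            if value is None:
                return (row, column)
    return None


def find_invalid_pairs(scores: Scores) -> List[Tuple[str, str]]:
    """Return every (row, column) whose cell holds the invalid marker -1."""
    return [
        (row, column)
        for row, columns in scores.items()
        for column, value in columns.items()
        if value == -1
    ]


# Illustrative data only.
scores: Scores = {
    "H1": {"H1": 0, "H2": 7},
    "H2": {"H1": -1, "H2": None},
}

print(extract_score("Likelihood score: 7 out of 10"))  # 7
print(find_first_missing_pair(scores))  # ('H2', 'H2')
print(find_invalid_pairs(scores))  # [('H2', 'H1')]
```

A resumable loop could call find_first_missing_pair repeatedly, score that pair, and stop once it returns None; pairs reported by find_invalid_pairs could then be retried.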

4. ConfusionMatrix.py

| Function Name | Full Function Definition | Description |
| --- | --- | --- |
| \_\_init\_\_ | \_\_init\_\_(self, model: str, data_file_path: str, output_folder: str) | Initializes the ConfusionMatrixGenerator class. |
| find_category_lines | find_category_lines(self, categories: pd.Series) -> List[int] | Finds the positions of category boundaries in the similarity matrix. |
| visualize_heatmap | visualize_heatmap(self, similarity_matrix: np.ndarray, line_positions_ordered: List[int]) -> None | Visualizes the similarity matrix as a heatmap with ordered category lines. Uses matplotlib. |
| save_similarity_matrix | save_similarity_matrix(self, similarity_matrix: np.ndarray) -> None | Saves the similarity matrix as an Excel file. |
| generate_similarity_pairs | generate_similarity_pairs(self, similarity_matrix: np.ndarray) -> pd.DataFrame | Generates pairs of hazard codes along with their similarity scores, based on a similarity matrix. |
| filter_similarity_pairs | filter_similarity_pairs(self, similarity_pairs_df: pd.DataFrame) -> pd.DataFrame | Filters the similarity pairs based on the similarity score. |
| generate_confused_hazards | generate_confused_hazards(self, similarity_matrix: np.ndarray, similarity_pairs_df: pd.DataFrame) -> pd.DataFrame | Creates a DataFrame with the most similar hazards for each hazard. |
| save_confused_pairs | save_confused_pairs(self, confused_pairs_df: pd.DataFrame) -> None | Saves the confused pairs as Excel and JSON files. |
| add_confused_to_definitions | add_confused_to_definitions(self, confusion_df: pd.DataFrame, hazard_definitions: pd.DataFrame) -> pd.DataFrame | Adds the confused hazards to the hazard definitions. |
| show_plotly_heatmap | show_plotly_heatmap(self, similarity_matrix: np.ndarray, labels_list: List[str]) -> None | Visualizes the similarity matrix as an interactive heatmap using Plotly. |
| run_plotly_heatmap | run_plotly_heatmap(self) -> None | Runs the ConfusionMatrixGenerator to generate and display an interactive heatmap using Plotly. |
| run | run(self) -> None | Runs the ConfusionMatrixGenerator to generate and save the confusion matrix and other outputs. |
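
As an illustration of the boundary and pairing steps, here is a standard-library sketch. It uses plain lists in place of pd.Series, np.ndarray, and pd.DataFrame, and it assumes that hazards are ordered by category, that each unordered pair is reported once, and that filtering keeps pairs at or above a threshold. The 0.8 threshold and the sample values are hypothetical; the cutoff actually used by ConfusionMatrix.py is not stated here.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[str, str, float]


def find_category_lines(categories: Sequence[str]) -> List[int]:
    """Return each index where the category differs from the previous one.

    Assumes the categories are already grouped, so each index marks
    where a dividing line would be drawn on the heatmap.
    """
    return [
        i for i in range(1, len(categories))
        if categories[i] != categories[i - 1]
    ]


def generate_similarity_pairs(
    codes: Sequence[str], matrix: Sequence[Sequence[float]]
) -> List[Pair]:
    """List every unordered pair of hazard codes with its similarity score."""
    pairs = []
    for i in range(len(codes)):
        for j in range(i + 1, len(codes)):  # upper triangle, no self-pairs
            pairs.append((codes[i], codes[j], matrix[i][j]))
    return pairs


def filter_similarity_pairs(pairs: List[Pair], threshold: float = 0.8) -> List[Pair]:
    """Keep pairs scoring at or above the threshold, most similar first."""
    kept = [pair for pair in pairs if pair[2] >= threshold]
    return sorted(kept, key=lambda pair: pair[2], reverse=True)


# Illustrative data only.
codes = ["A1", "A2", "B1"]
categories = ["A", "A", "B"]
matrix = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.85],
    [0.2, 0.85, 1.0],
]

lines = find_category_lines(categories)
pairs = generate_similarity_pairs(codes, matrix)
confused = filter_similarity_pairs(pairs)
print(lines)     # [2]
print(confused)  # [('A1', 'A2', 0.9), ('A2', 'B1', 0.85)]
```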

5. Data Documentation

| Database Name | Type | Description |
| --- | --- | --- |
| confusion_matrix.xlsx | Excel Spreadsheet | A 303x303 matrix that stores the cosine similarity score between the embeddings of each pair of hazards. |
| scores.xlsx | Excel Spreadsheet | A 303x303 matrix that stores the association probability score of each pair of hazards. |
| hazard_definitions.xlsx | Excel Spreadsheet | The full database containing comprehensive information on all hazards, including upstream hazards and easily confused hazards. |
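
For reference, the cosine similarity behind confusion_matrix.xlsx can be computed as in the sketch below. The toy two-dimensional vectors are placeholders: the real embeddings come from the model passed to ConfusionMatrixGenerator, and the helper names here are illustrative rather than taken from the codebase.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_similarity_matrix(embeddings: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return the square matrix of pairwise cosine similarities."""
    return [[cosine_similarity(a, b) for b in embeddings] for a in embeddings]


# Toy embeddings, one per hazard (illustrative only).
embeddings = [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]]
similarity = build_similarity_matrix(embeddings)

print(similarity[0][1])  # 0.0 (orthogonal vectors)
print(similarity[0][2])  # 0.6
```

With 303 hazards, the same procedure yields the 303x303 matrix described above. The matrix is symmetric, and each diagonal entry is 1 up to floating-point rounding.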