Identify
Generate Checksums & Check for Duplicates
This module implements checksum generation and duplicate detection.
-
check_collisions
(checksums: List[str]) → Set[str]
Checks a list of checksum strings for collisions. Returns the set of colliding checksums, which is empty if none are found.
Parameters: checksums (List[str]) – List of checksums that must be checked for collisions.
Returns: A set of colliding checksums. Empty if none are found.
Return type: Set[str]
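As a rough illustration of the behaviour described above, collisions can be found by counting how often each checksum occurs. This is a minimal sketch, not digiarch's actual implementation; the name `find_collisions` is hypothetical, and it assumes a collision means the same checksum string appearing more than once.

```python
from collections import Counter
from typing import List, Set


def find_collisions(checksums: List[str]) -> Set[str]:
    """Return every checksum that appears more than once in the input.

    Hypothetical helper for illustration; not part of digiarch.
    """
    counts = Counter(checksums)
    return {checksum for checksum, count in counts.items() if count > 1}


# "a1" occurs twice, so it is the only collision reported.
collisions = find_collisions(["a1", "b2", "a1", "c3"])
print(collisions)
```

Returning a set means each colliding checksum is reported once, however many files share it.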
-
check_duplicates
(files: List[digiarch.internals.FileInfo], save_path: pathlib.Path) → None
Generates a file listing checksum collisions, which indicate that duplicate files are present.
Parameters:
- files (List[FileInfo]) – Files for which duplicates should be checked.
- save_path (Path) – Path to which the checksum collision information should be saved.
-
checksum_worker
(file_info: digiarch.internals.FileInfo) → digiarch.internals.FileInfo
Worker used when multiprocessing checksums of FileInfo objects.
Parameters: file_info (FileInfo) – The FileInfo object that must be updated with a new checksum value.
Returns: The FileInfo object with an updated checksum value.
Return type: FileInfo
-
file_checksum
(file: pathlib.Path) → str
Calculates the checksum of an input file using BLAKE2.
Parameters: file (Path) – The file for which to calculate the checksum. Expects a pathlib.Path object.
Returns: The hex checksum of the input file.
Return type: str
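A BLAKE2 file checksum of this kind can be computed with Python's standard `hashlib` module. The sketch below is an assumption-laden illustration rather than digiarch's own code: the documentation only says "BLAKE2", so the choice of the `blake2b` variant with its default digest size, the chunk size, and the name `blake2_checksum` are all assumptions.

```python
import hashlib
from pathlib import Path


def blake2_checksum(file: Path, chunk_size: int = 65536) -> str:
    """Return the hex BLAKE2b digest of a file, read in chunks.

    Hypothetical helper; the variant (blake2b) and chunk size are assumptions.
    """
    hasher = hashlib.blake2b()
    with file.open("rb") as stream:
        # Read fixed-size chunks so large files are never loaded whole.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Reading in chunks keeps memory use constant regardless of file size, which matters when checksumming large archive files.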
-
generate_checksums
(files: List[digiarch.internals.FileInfo]) → List[digiarch.internals.FileInfo]
Multiprocesses a list of FileInfo objects in order to assign new checksums.
Parameters: files (List[FileInfo]) – List of FileInfo objects that need checksums.
Returns: The updated list of FileInfo objects.
Return type: List[FileInfo]
Identify Files
Identify files using siegfried.
-
identify
(files: List[digiarch.internals.FileInfo], path: pathlib.Path) → List[digiarch.internals.FileInfo]
Identifies all files in a list and returns the updated list.
Parameters: files (List[FileInfo]) – Files to identify.
Returns: Input files with updated Identification information.
Return type: List[FileInfo]
-
sf_id
(path: pathlib.Path) → Dict[pathlib.Path, digiarch.internals.Identification]
Identifies files using siegfried and collects the PUID, signature name, and warning (if applicable) for each file, for use in updating FileInfo objects.
Parameters: path (pathlib.Path) – Path in which to identify files.
Returns: Dictionary containing file path and associated identification information obtained from siegfried’s stdout.
Return type: Dict[Path, Identification]
Raises: IdentificationError – If running siegfried or loading the resulting JSON output fails.
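The flow described above (run siegfried, load its JSON output, map each file path to its identification) might look roughly like the following. This is a hedged sketch, not digiarch's implementation: `Identification` and `IdentificationError` here are stand-in definitions whose fields are assumed, `parse_sf_json` and `run_siegfried` are hypothetical names, and taking only the first match per file is an assumption. It relies on siegfried's `sf -json` output having a `files` list whose entries carry `filename` and `matches` (with `id`, `format`, and `warning`).

```python
import json
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Optional


class IdentificationError(Exception):
    """Stand-in for the error type named in the documentation."""


@dataclass
class Identification:
    """Hypothetical stand-in; the real class's fields are assumed here."""

    puid: Optional[str]
    signature: Optional[str]
    warning: Optional[str]


def parse_sf_json(raw: str) -> Dict[Path, Identification]:
    """Map each file path in siegfried's JSON output to its first match."""
    try:
        report = json.loads(raw)
    except json.JSONDecodeError as error:
        raise IdentificationError(f"Could not load siegfried output: {error}")
    results: Dict[Path, Identification] = {}
    for entry in report.get("files", []):
        matches = entry.get("matches", [])
        first = matches[0] if matches else {}
        results[Path(entry["filename"])] = Identification(
            puid=first.get("id"),
            # Empty strings in the output are normalised to None.
            signature=first.get("format") or None,
            warning=first.get("warning") or None,
        )
    return results


def run_siegfried(path: Path) -> Dict[Path, Identification]:
    """Run `sf -json` on a path and parse its stdout (requires siegfried)."""
    try:
        completed = subprocess.run(
            ["sf", "-json", str(path)],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError) as error:
        raise IdentificationError(f"Running siegfried failed: {error}")
    return parse_sf_json(completed.stdout)
```

Keeping the parsing separate from the subprocess call means the JSON handling can be exercised without siegfried installed.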
-
update_file_info
(file_info: digiarch.internals.FileInfo, id_info: Dict[pathlib.Path, digiarch.internals.Identification]) → digiarch.internals.FileInfo
Generate Reports
Reporting utilities for file discovery.
-
report_results
(files: List[digiarch.internals.FileInfo], save_path: pathlib.Path) → None
Generates reports of explore_dir() results.
Parameters:
- files (List[FileInfo]) – The files to report on.
- save_path (Path) – The path in which to save the reports.