Identify

Generate Checksums & Check for duplicates

This module implements checksum generation and duplicate detection.

check_collisions(checksums: List[str]) → Set[str]

Checks a list of checksum strings for collisions and returns the set of colliding checksums.

Parameters: checksums (List[str]) – List of checksums that must be checked for collisions.
Returns: A set of colliding checksums. Empty if none are found.
Return type: Set[str]
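
The collision check described above can be sketched as follows. This is a minimal illustration and not the module's actual implementation: the name find_collisions is hypothetical, and it assumes that a "collision" means a checksum value that appears more than once in the input list.

```python
from collections import Counter
from typing import List, Set


def find_collisions(checksums: List[str]) -> Set[str]:
    """Return every checksum that occurs more than once (hypothetical helper)."""
    counts = Counter(checksums)
    # Assumption: a collision is any checksum value seen at least twice.
    return {checksum for checksum, count in counts.items() if count > 1}


print(find_collisions(["aa", "bb", "aa", "cc"]))  # prints {'aa'}
```

Counting occurrences in a single pass keeps the check linear in the number of checksums, which matters when a large directory tree has been hashed.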
check_duplicates(files: List[digiarch.internals.FileInfo], save_path: pathlib.Path) → None

Generates a file listing checksum collisions. A collision indicates that duplicate files are present.

Parameters:
  • files (List[FileInfo]) – Files for which duplicates should be checked.
  • save_path (Path) – Path to which the checksum collision information should be saved.
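
One way such a duplicates file could be produced is sketched below. Everything specific here is an assumption made for illustration: FileInfoSketch is a hypothetical stand-in for digiarch.internals.FileInfo (whose real fields are not documented in this section), and both the JSON layout and the file name duplicate_files.json are invented for the example.

```python
import json
from collections import defaultdict
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List


@dataclass
class FileInfoSketch:
    """Hypothetical stand-in for digiarch.internals.FileInfo."""

    path: Path
    checksum: str


def write_duplicates(files: List[FileInfoSketch], save_path: Path) -> Path:
    """Group files by checksum and save the groups with more than one member."""
    by_checksum: Dict[str, List[str]] = defaultdict(list)
    for file in files:
        by_checksum[file.checksum].append(str(file.path))
    # Keep only checksums shared by at least two files.
    duplicates = {c: paths for c, paths in by_checksum.items() if len(paths) > 1}
    save_path.mkdir(parents=True, exist_ok=True)
    out_file = save_path / "duplicate_files.json"  # assumed file name
    out_file.write_text(json.dumps(duplicates, indent=2), encoding="utf-8")
    return out_file
```

Mapping each colliding checksum to the paths that share it makes the saved file directly usable for manual review.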
checksum_worker(file_info: digiarch.internals.FileInfo) → digiarch.internals.FileInfo

Worker used when multiprocessing checksums of FileInfo objects.

Parameters: file_info (FileInfo) – The FileInfo object that must be updated with a new checksum value.
Returns: The FileInfo object with an updated checksum value.
Return type: FileInfo
file_checksum(file: pathlib.Path) → str

Calculate the checksum of an input file using BLAKE2.

Parameters: file (Path) – The file for which to calculate the checksum. Expects a pathlib.Path object.
Returns: The hex checksum of the input file.
Return type: str
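
A BLAKE2 file checksum can be computed with the standard library roughly as shown below. This is a sketch under stated assumptions: the documentation does not say which BLAKE2 variant or digest size is used, so the example picks hashlib.blake2b with its default 64-byte digest, and the chunked reading and the name blake2_checksum are illustrative choices.

```python
import hashlib
from pathlib import Path


def blake2_checksum(file: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex BLAKE2b digest of a file (hypothetical helper)."""
    hasher = hashlib.blake2b()  # assumed variant; blake2s is the alternative
    with file.open("rb") as stream:
        # Read in fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Calling blake2_checksum(Path("some_file.bin")) returns a 128-character hex string under these assumptions.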
generate_checksums(files: List[digiarch.internals.FileInfo]) → List[digiarch.internals.FileInfo]

Processes a list of FileInfo objects in parallel in order to assign new checksums.

Parameters: files (List[FileInfo]) – List of FileInfo objects that need checksums.
Returns: The updated list of FileInfo objects.
Return type: List[FileInfo]
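
The worker-plus-pool pattern that checksum_worker and generate_checksums describe might look roughly like this. The sketch is not the module's code: FileInfoSketch is a hypothetical stand-in for FileInfo, the functions carry a _sketch suffix to mark them as illustrative, and the use of multiprocessing.Pool with BLAKE2b is an assumption.

```python
import hashlib
from dataclasses import dataclass, replace
from multiprocessing import Pool
from pathlib import Path
from typing import List, Optional


@dataclass(frozen=True)
class FileInfoSketch:
    """Hypothetical stand-in for digiarch.internals.FileInfo."""

    path: Path
    checksum: Optional[str] = None


def checksum_worker_sketch(file_info: FileInfoSketch) -> FileInfoSketch:
    """Return a copy of file_info carrying the checksum of its file."""
    digest = hashlib.blake2b(file_info.path.read_bytes()).hexdigest()
    return replace(file_info, checksum=digest)


def generate_checksums_sketch(files: List[FileInfoSketch]) -> List[FileInfoSketch]:
    """Hash all files in worker processes; the result keeps the input order."""
    with Pool() as pool:
        return pool.map(checksum_worker_sketch, files)
```

The worker must be a module-level function so that it can be pickled and sent to the worker processes. On platforms that start processes with "spawn" (Windows, and macOS by default), the call to generate_checksums_sketch should sit behind an `if __name__ == "__main__":` guard.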

Identify Files

Identify files using siegfried.

identify(files: List[digiarch.internals.FileInfo], path: pathlib.Path) → List[digiarch.internals.FileInfo]

Identify all files in a list, and return the updated list.

Parameters: files (List[FileInfo]) – Files to identify.
Returns: Input files with updated Identification information.
Return type: List[FileInfo]
sf_id(path: pathlib.Path) → Dict[pathlib.Path, digiarch.internals.Identification]

Identify files using siegfried and update FileInfo with obtained PUID, signature name, and warning if applicable.

Parameters: path (pathlib.Path) – Path in which to identify files.
Returns: Dictionary containing file path and associated identification information obtained from siegfried’s stdout.
Return type: Dict[Path, Identification]
Raises: IdentificationError – If running siegfried or loading the resulting JSON output fails.
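
Running siegfried and turning its JSON output into a path-to-identification mapping could be sketched as below. Several details are assumptions and should be checked against real siegfried output: that the sf executable is on PATH and accepts -json, that the output has a top-level "files" list whose entries carry "filename" and "matches", and that each match exposes "id", "format", and "warning". IdentificationSketch and IdentificationErrorSketch are hypothetical stand-ins for the digiarch classes.

```python
import json
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Optional


class IdentificationErrorSketch(Exception):
    """Hypothetical stand-in for digiarch's IdentificationError."""


@dataclass
class IdentificationSketch:
    """Hypothetical stand-in for digiarch.internals.Identification."""

    puid: Optional[str]
    signature: Optional[str]
    warning: Optional[str]


def parse_sf_json(raw: str) -> Dict[Path, IdentificationSketch]:
    """Map each reported file path to its first match (assumed JSON layout)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as error:
        raise IdentificationErrorSketch(f"Could not parse output: {error}") from error
    results: Dict[Path, IdentificationSketch] = {}
    for entry in data.get("files", []):
        # Fall back to an empty match if siegfried reported none.
        match = (entry.get("matches") or [{}])[0]
        results[Path(entry["filename"])] = IdentificationSketch(
            puid=match.get("id"),
            signature=match.get("format") or None,
            warning=match.get("warning") or None,
        )
    return results


def sf_id_sketch(path: Path) -> Dict[Path, IdentificationSketch]:
    """Run siegfried on path and parse its stdout (assumes `sf` is installed)."""
    try:
        completed = subprocess.run(
            ["sf", "-json", str(path)], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError) as error:
        raise IdentificationErrorSketch(f"Running siegfried failed: {error}") from error
    return parse_sf_json(completed.stdout)
```

Keeping the parsing separate from the subprocess call lets the JSON handling be exercised without siegfried installed, and both failure modes surface as the same error type, in line with the Raises entry above.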
update_file_info(file_info: digiarch.internals.FileInfo, id_info: Dict[pathlib.Path, digiarch.internals.Identification]) → digiarch.internals.FileInfo

Generate Reports

Reporting utilities for file discovery.

report_results(files: List[digiarch.internals.FileInfo], save_path: pathlib.Path) → None

Generates reports of explore_dir() results.

Parameters:
  • files (List[FileInfo]) – The files to report on.
  • save_path (Path) – The path in which to save the reports.