Researchers have developed a synthetic dataset for training machine learning models to identify vessels likely to engage in illegal activities. The dataset, a product of the CONNECTOR and FARADAI projects, was recently published in the journal ‘Data in Brief’ under the title “SIAP: Synthetic dataset for maritime vessel risk profiling and illegal activity prediction.”
The lead author, Spyridon Karamolegkos from the Multimedia Knowledge and Social Media Analytics Lab at the Centre for Research and Technology Hellas (CERTH) in Thessaloniki, Greece, explains that the dataset was created using extensive expert knowledge. “We translated operational insights into a set of probabilistic and rule-based simulation criteria,” Karamolegkos said. These criteria model various aspects of vessel behavior, including crew attributes, compliance history, cargo-related information, and operational patterns.
The dataset comprises 100,000 simulated vessel profiles, each described by features such as crew criminal record, abnormal routing, frequency of port calls, inspection history, prior violations, insurance claims, ship condition, and cargo characteristics. A synthetic binary target variable indicates whether the vessel is likely to be involved in illegal activity, with probability values derived from cumulative risk factors.
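To make the idea of a target derived from cumulative risk factors concrete, here is a minimal sketch in Python. It is not the authors' generator: the feature names, the weights, the 15% activation rate, and the baseline probability are all assumptions chosen for illustration, since the article does not state the actual simulation criteria.

```python
import random

# Hypothetical per-factor risk weights; the real criteria are not given in the article.
RISK_WEIGHTS = {
    "crew_criminal_record": 0.25,
    "abnormal_routing": 0.20,
    "prior_violations": 0.20,
    "failed_inspection": 0.15,
    "insurance_claims": 0.10,
    "poor_ship_condition": 0.10,
}
BASE_RATE = 0.02  # assumed baseline probability for a vessel with no risk factors


def simulate_vessel(rng):
    """Draw one synthetic vessel profile with boolean risk factors."""
    # Each factor is switched on independently with an assumed 15% chance.
    return {name: rng.random() < 0.15 for name in RISK_WEIGHTS}


def illegal_activity_probability(profile):
    """Add the weights of the active risk factors to the baseline, capped at 1.0."""
    score = BASE_RATE + sum(w for name, w in RISK_WEIGHTS.items() if profile[name])
    return min(score, 1.0)


def label_vessel(profile, rng):
    """Sample the binary target from the cumulative probability."""
    p = illegal_activity_probability(profile)
    return p, int(rng.random() < p)


rng = random.Random(42)  # fixed seed so the simulation is repeatable
profiles = [simulate_vessel(rng) for _ in range(1000)]
labels = [label_vessel(p, rng)[1] for p in profiles]
print(f"share flagged as likely illegal: {sum(labels) / len(labels):.3f}")
```

In this sketch each active factor simply adds its weight to the probability. A real generator could combine factors differently, for example through interaction rules or a logistic transform.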
To ensure realism, the researchers extracted intelligence from anonymized real-world vessel behavior reports provided by Lloyd’s List Intelligence. “These real-world examples served as qualitative baselines for simulating typical and edge-case activity patterns,” Karamolegkos noted. This approach ensures that the dataset remains relevant for operational risk modeling while preserving ethical safeguards.
The dataset is provided in CSV format, ready for immediate ingestion into analytics pipelines, machine learning workflows, or maritime surveillance tools. It is particularly suited for researchers, enforcement agencies, and developers of maritime AI systems seeking high-quality, realistic training data for binary classification tasks.
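Ingesting such a CSV for a binary classification task can be sketched as follows, using only the Python standard library. The column names and the six sample rows are placeholders invented for illustration and are not taken from the published dataset; the real file's header should be checked against its documentation.

```python
import csv
import io

# Placeholder rows standing in for the real CSV; column names and values are
# invented for illustration and are not taken from the published dataset.
SAMPLE_CSV = """crew_criminal_record,abnormal_routing,port_calls,prior_violations,illegal_activity
0,0,12,0,0
1,1,3,2,1
0,1,8,0,0
1,0,5,1,1
0,0,15,0,0
1,1,2,3,1
"""


def load_profiles(handle):
    """Parse CSV rows into dicts of ints, one dict per vessel."""
    return [{k: int(v) for k, v in row.items()} for row in csv.DictReader(handle)]


def split_features_target(rows, target="illegal_activity"):
    """Separate the binary target column from the feature columns."""
    X = [[v for k, v in row.items() if k != target] for row in rows]
    y = [row[target] for row in rows]
    return X, y


# With the downloaded file, the StringIO object would be replaced by an
# open file handle, e.g. open(path_to_csv, newline="").
rows = load_profiles(io.StringIO(SAMPLE_CSV))
X, y = split_features_target(rows)
print(len(X), "profiles,", len(X[0]), "features,", sum(y), "positive labels")
```

The resulting `X` and `y` lists have the shape most classifiers expect, so they could be passed to, for instance, the `fit(X, y)` method of a scikit-learn estimator. Real columns that are not integers, such as categorical cargo descriptions, would need their own parsing step.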
The commercial opportunities for the maritime sector are substantial. Maritime professionals can use the dataset to enhance their anomaly detection systems, predictive analytics, risk profiling tools, and decision-support frameworks. This can make maritime surveillance more efficient and effective, helping to reduce illegal activity and improve overall maritime security.
In summary, the dataset offers a realistic and ethically sound training ground for machine learning models aimed at identifying and mitigating risks in the maritime domain. As Karamolegkos puts it, “This dataset is designed for reuse in the development of maritime anomaly detection systems, predictive analytics, risk profiling tools, and decision-support frameworks.” With its detailed documentation and ready-to-use format, it could make a substantial contribution to maritime security efforts worldwide.

