Methods and Tools
Open Source Tools and Methods for Open Source Investigations
Investigating and reporting on human rights violations in Sudan can be challenging and dangerous for journalists, human rights workers, and international organisations. Activists and human rights workers have been detained, arrested, or even killed for their work. This is the main reason Sudanese Archive and other documentation groups depend on verified user-generated content to assist in criminal case-building as well as human rights research. Sudanese Archive relies on an infrastructure and methodology developed and maintained by the Syrian Archive, through a partnership with Mnemonic.
Mnemonic and the Sudanese Archive strive for transparency in their tools, findings, and methodologies, and for making verified content publicly available and accessible to journalists, human rights defenders, and lawyers for reporting, advocacy, and accountability purposes.
To achieve transparency, software developed by the Syrian Archive is released in free and open-source formats. This is done to build and maintain trust with our partners and collaborators, and to allow the software to be reused and customised by other groups outside of the Syrian Archive. Technical integration with existing open-source investigative tools ensures that work is not duplicated. The Syrian Archive works alongside technologists to develop our open-source tools. Our methodology is developed in collaboration with other archival groups, as well as lawyers and journalists.
The Syrian Archive’s Data and Operational Model is based on the Electronic Discovery Reference Model developed by Duke University School of Law. Below is a detailed description of every step of this model.
Our collection process involved establishing a standardised metadata schema alongside a database of credible sources for digital content. Sources can be direct submissions from individuals and organisations, publicly available social media accounts and channels, as well as other publicly available information.
1) Establish database of credible sources for content
Before any collection, archival, or verification of digital materials was possible, we had to establish a database of credible sources for visual content. We have identified over 3,500 credible sources, including individual journalists and field reporters, larger media houses (e.g. local and international news agencies), human rights organisations, and local medical workers and hospitals. Many of these sources began publishing or providing visual content in late 2018 and early 2019 and also publish work in other credible media outlets.
Visual content is primarily accessed through social media channels and websites (Twitter, Facebook, YouTube, Telegram), submitted files (videos, photos, PDFs), and external and collaborators’ data sets. Changes in these data sets are tracked, meaning that all versions are saved. Credibility is determined by analysing whether the source is familiar to us or to our existing network, and by checking that the source’s content and reportage have been reliable in the past. This might include evaluating how long the source has been reporting and how active they are. To identify where the source is based, social media channels might be evaluated to determine whether uploaded videos are consistently from a specific location, or whether locations differ significantly. Channels and accounts might be analysed to determine whether they use a logo and whether this logo is used consistently across videos. Channels and accounts might additionally be analysed for original content to determine whether the uploader aggregates videos from other news organisations or accounts, or whether the source appears to be the primary uploader.
2) Establish database of credible sources for verification
The Sudanese Archive established a database of credible sources for verification. These sources provide additional information used to verify content on social media platforms or received from sources directly. Content verifiers include citizen journalists, human rights defenders and humanitarian workers based in Sudan and abroad. To preserve data integrity, sources used for content acquisition do not comprise part of the database for verification.
3) Establish standardised metadata scheme
Before we can preserve or verify any content, we must define a system through which content can be managed and organised; this is done through metadata. This system was developed by the Syrian Archive and is being used by the Sudanese Archive through our partnership with Mnemonic. Establishing a data ontology or metadata scheme is necessary to assist us in organising and managing content, as well as helping users to identify and understand what happened, and when and where.
Whilst recognising the need for a data ontology, or standardised metadata scheme, we also recognise that the implementation of any metadata scheme is a highly political choice. Given that there are no universally accepted, legally admissible metadata standards, efforts were made to develop a framework in consultation with a variety of international investigative bodies. These include consultations with members of the Office of the United Nations High Commissioner for Human Rights, the International, Impartial and Independent Mechanism on international crimes committed in Syria (IIIM), and with other archival institutes, and human rights and research organisations.
Adding metadata happens after content is preserved, but it is crucial to define a metadata scheme before collecting and processing content.
The Sudanese Archive’s collection is maintained through the Syrian Archive’s secure preservation workflow, which ensures that original content is not lost due to its removal from corporate platforms. This is achieved by collecting and securely storing digital content on external backend servers before it is taken offline and prior to basic verification procedures. Content is then backed up securely on servers throughout the world. The Syrian Archive uses Sugarcube for this process, free and open-source software developed for human rights investigations using online user-generated content.
Sugarcube is a tool designed to support journalists, non-profits, academic researchers, human rights organisations and others with investigations using online, publicly available sources (e.g. tweets, videos, public databases, websites, online databases).
In this preservation pipeline we detect the spoken language and standardise the data format (whilst preserving the old format). We screenshot and download the web page hosting the content. Files that are in our database are both hashed with SHA-256 and time-stamped with Enigio Time, a third-party collaborator. We hash and timestamp in order to ensure and prove data integrity, which means that data has not been changed or manipulated since it was archived.
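The hashing step can be illustrated with a minimal Python sketch. It is not the Archive's actual pipeline code: the function names `sha256_of_file` and `is_unchanged` are hypothetical, and the third-party timestamping with Enigio Time is not modelled here. It only shows how a SHA-256 digest recorded at archiving time lets anyone later check that a file has not been altered.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks
    so that large video files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, recorded_digest: str) -> bool:
    """Compare a file's current digest with the one recorded at archiving time."""
    return sha256_of_file(path) == recorded_digest
```

Any change to the file, even a single byte, produces a different digest, which is why a stored hash together with a trusted timestamp can support claims about data integrity.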
Once content has been safely preserved, metadata is extracted from the visual content, then parsed and aggregated automatically using the Syrian Archive’s predefined and standardised metadata schema. Location and source details might be included in the parsed metadata, which can be useful for geolocating where content originates.
Metadata is added both automatically and manually, depending on how it was collected, for example through open source or closed source methods. Metadata we collect includes a description of the visual object as given (e.g. YouTube title); the source of the visual content; the original link where the footage was first published; identifiable landmarks; weather (which may be useful for geolocation or time identification); specific languages or regional dialects spoken; identifiable clothes or uniforms; weapons or munitions; device used to record the footage; and media content type.
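As an illustration only, the metadata fields listed above could be represented as a simple record type. The sketch below is an assumption, not the Syrian Archive's real schema: the class name `MediaRecord`, every field name, and the helper `missing_fields` are hypothetical and chosen to mirror the list in the text.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class MediaRecord:
    """Hypothetical metadata record mirroring the fields described in the text."""
    description: str          # description as given, e.g. a YouTube title
    source: str               # source of the visual content
    original_url: str         # link where the footage was first published
    media_type: str           # media content type, e.g. "video" or "photo"
    landmarks: List[str] = field(default_factory=list)
    weather: Optional[str] = None
    languages: List[str] = field(default_factory=list)   # languages or dialects spoken
    clothing_or_uniforms: List[str] = field(default_factory=list)
    weapons_or_munitions: List[str] = field(default_factory=list)
    recording_device: Optional[str] = None

    def missing_fields(self) -> List[str]:
        """Names of fields that are still empty and may need manual annotation."""
        return [name for name, value in asdict(self).items()
                if value in (None, [], "")]
```

Fields that can be filled automatically at collection time are required here, while those that usually need a human reviewer default to empty, so `missing_fields()` gives a quick list of what remains to be annotated.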
The processing pipeline also parses video files into keyframes and uses the machine learning software VFRAME.
VFRAME is a collection of open-source computer vision software tools designed for human rights investigations relying on large datasets of visual media. It utilises object detection algorithms that can automatically flag video content depicting predefined objects, such as cluster munitions.
Our data pipeline prepares visual content for initial verification. All possible additional tags and chain of custody information are recorded. This is done to assist users in identifying and understanding what happened in a specific incident, and when and where.
Verification consists of three steps: 1) Verify the source of the video uploader or publisher; 2) Verify the location where the video was filmed; 3) Verify the dates and times on which the video was filmed and uploaded.
- Verify the source of the video publisher
Firstly, we establish whether the source of the video is on our list of credible sources. If not, we determine the new source’s credibility by going through the above procedure.
In some cases, near-duplicate content may be published. For example, if a 10-minute video includes all of a second, 30-second video, both videos would be preserved as long as they can be verified. Similarly, videos from news organisations or media houses featuring parts of other videos are also preserved, as long as verification is possible. We also preserve duplicates if they are from different sources and the original uploader is unidentifiable.
The video-upload source may differ from the camera operator. In most of the video footage which we verify, only the video uploader and not the camera operator can be identified. In advanced verification of priority cases, the analysis phase includes identifying the camera operator.
- Verify the location where the video was filmed
Each video goes through basic geolocation to verify that it has been captured in Sudan. A more precise geolocation process is applied to priority content in order to pinpoint where it was captured. This is done by comparing visual references (e.g. buildings, mountain ranges, trees, minarets) with satellite imagery from Google Earth and Maxar, as well as geolocated photographs from Google Maps. Satellite imagery is also used to assess damage and destruction whilst investigating attacks targeting civilians and civilian infrastructure.
In addition to this, the Sudanese Archive compares the Arabic spoken in videos against known regional accents and dialects within Sudan to further verify the location of videos. When possible, we contact sources directly, and consult our existing network of journalists operating inside and outside Sudan to confirm the locations of specific incidents.
- Verify the dates and times on which the video was filmed and uploaded
We use time and date metadata embedded in videos we directly receive in order to corroborate the date and time of a specific incident. Date and time are extracted using ExifTool (https://exiftool.org/).
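A minimal sketch of how embedded dates might be read with ExifTool from Python is shown here. It assumes the `exiftool` command-line program is installed and on the PATH, and it is not the Archive's actual code: the function names and the choice of the `DateTimeOriginal` and `CreateDate` tags are assumptions made for illustration.

```python
import json
import subprocess
from datetime import datetime
from typing import Optional

EXIF_DATE_FORMAT = "%Y:%m:%d %H:%M:%S"
# Assumed tags of interest; which tags are present depends on the file type.
DATE_TAGS = ("DateTimeOriginal", "CreateDate")


def parse_exiftool_json(raw: str) -> Optional[datetime]:
    """Pick the first usable date from ExifTool's JSON output for one file."""
    records = json.loads(raw)
    if not records:
        return None
    for tag in DATE_TAGS:
        value = records[0].get(tag)
        if not value:
            continue
        try:
            # Keep only "YYYY:MM:DD HH:MM:SS"; sub-seconds or a zone offset may follow.
            return datetime.strptime(str(value)[:19], EXIF_DATE_FORMAT)
        except ValueError:
            continue  # e.g. an unset value such as "0000:00:00 00:00:00"
    return None


def embedded_capture_time(path: str) -> Optional[datetime]:
    """Run ExifTool on a file and return its embedded capture time, if any."""
    result = subprocess.run(
        ["exiftool", "-json", "-DateTimeOriginal", "-CreateDate", path],
        capture_output=True, text=True, check=True,
    )
    return parse_exiftool_json(result.stdout)
```

Embedded timestamps depend on the recording device's clock and may be missing or wrong, so a value obtained this way is a lead to corroborate rather than proof on its own.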
We verify the capture date of videos by cross-referencing their publishing date on social media platforms (e.g. YouTube, Twitter, Facebook, and Telegram) with dates from reports concerning the same incident. Visual content collected directly from sources is also cross-referenced with reports concerning the incident featured in the video.
Those reports include:
- Reports from international and local media outlets, including Reuters, Associated Press, Al Jazeera, BBC;
- Human rights reports published by international and local organisations, including Human Rights Watch and Amnesty International;
- Daily monitoring records gathered by the Sudanese Archive’s monitoring team;
- Incident reports shared by the Sudanese Archive’s network of citizen reporters on Twitter, Facebook, and Telegram.
Additional techniques such as reverse image search and chronolocation can also be used to confirm the capture time and date of the visual content.
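The date cross-referencing described above can be sketched as a simple consistency check. This is an illustrative assumption rather than the Archive's documented rule: the function name `is_date_corroborated` and the three-day default for `max_lag_days` are hypothetical. The sketch treats a publishing date as corroborated when at least one report places the incident on or shortly before that date, since footage cannot be published before the event it shows.

```python
from datetime import date, timedelta
from typing import Iterable


def is_date_corroborated(
    published: date,
    reported_incident_dates: Iterable[date],
    max_lag_days: int = 3,  # hypothetical tolerance between incident and upload
) -> bool:
    """True if any reported incident date falls on or up to max_lag_days
    before the date the footage was published."""
    max_lag = timedelta(days=max_lag_days)
    return any(
        timedelta(0) <= published - incident <= max_lag
        for incident in reported_incident_dates
    )
```

A check like this only flags candidates for a human reviewer. Re-uploads of older footage, for instance, can fail it while still being genuine.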
In some cases, we conduct in-depth open-source investigations. Time and capacity limitations mean not all incidents can be analysed in depth. However, by developing a replicable workflow, we hope that others can assist in these efforts and investigate other incidents using similar methods.
For some incidents, our team of researchers collect witness statements or partner with organisations that do. This can include organisations whose role is collecting accounts of survivors, the injured, family members, or eyewitnesses (e.g. medical staff, managers of hospitals).
Once content has been processed, verified, and analysed, it is reviewed for accuracy. In the event of a discrepancy, content is fed back into the digital evidence workflow for further verification.
If content is deemed accurate it moves to the publishing stage of the digital evidence workflow.
For more information on the tools or methods we are using, please reach out to firstname.lastname@example.org