
The project was inspired by limitations faced by the Human Rights Data Science Lab at the James Madison College of Michigan State University. The lab currently focuses primarily on evidence available in English and relies on in-lab translators and translation tools for non-English material. This biases its work toward regions and reporters who can communicate in English and makes non-English evidence hard to find. Translating text once it has been found is straightforward; the team found that the significant barrier is identifying key terms in other languages to locate that evidence in the first place.
CrossLing is a multilingual open-source intelligence (OSINT) pipeline designed to democratize access to human rights evidence.
Query Expansion: It aims to take English search terms and translate them into multiple languages (such as Arabic, Russian, or Spanish) to find localized reports.
Automated Gathering: It aims to scrape news articles and PDFs from the web based on those translated queries.
Synthesis: The tool aims to detect the language of the findings, translate them back into English, and provide a summary alongside original citations for verification.
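The three stages above can be sketched as one function that wires them together. This is a minimal sketch under stated assumptions, not the team's implementation: `Finding`, `run_pipeline`, and the `translate`, `search`, and `detect_language` callables are hypothetical names, and real versions would wrap the translation service, search API, and language detector. The summary step is left out.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Finding:
    url: str            # original citation, kept for verification
    source_lang: str    # language detected in the gathered text
    original_text: str  # text as found
    english_text: str   # English version shown to the researcher


def run_pipeline(
    query: str,
    target_langs: Iterable[str],
    translate: Callable[[str, str], str],             # (text, target_lang) -> text
    search: Callable[[str], list[tuple[str, str]]],   # query -> [(url, text)]
    detect_language: Callable[[str], str],            # text -> language code
) -> list[Finding]:
    """Hypothetical wiring of query expansion, gathering, and synthesis."""
    findings = []
    for lang in target_langs:
        # Query expansion: English search terms -> localized query.
        localized_query = translate(query, lang)
        # Automated gathering: the search callable returns (url, text) pairs.
        for url, text in search(localized_query):
            # Synthesis: detect the language, translate back only if needed.
            detected = detect_language(text)
            english = text if detected == "en" else translate(text, "en")
            findings.append(Finding(url, detected, text, english))
    return findings
```

Passing the stages in as callables is one possible design choice: each stage can be swapped or stubbed out for testing without touching the others.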
The team designed a modular pipeline using a specific technical stack:
Frontend: Built with Streamlit for the MVP.
Search & Logic: Uses the Brave Search API to source evidence.
Translation & Detection: Employs DeepL for translation and fastText or langdetect for language identification.
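As an illustration of the search stage, the sketch below builds (but does not send) a request to the Brave Search API using only the standard library. The endpoint URL, the `X-Subscription-Token` header, and the `q` and `count` parameters reflect Brave's public documentation as best understood here and should be checked against the current docs; `build_brave_request` is a hypothetical helper name, not part of the project.

```python
import urllib.parse
import urllib.request

# Assumed endpoint for Brave web search; verify against the current API docs.
BRAVE_ENDPOINT = "https://api.search.brave.com/res/v1/web/search"


def build_brave_request(query: str, api_key: str, count: int = 5) -> urllib.request.Request:
    """Build a GET request for a (possibly non-English) search query."""
    # urlencode percent-encodes non-ASCII query text such as Arabic or Cyrillic.
    params = urllib.parse.urlencode({"q": query, "count": count})
    return urllib.request.Request(
        f"{BRAVE_ENDPOINT}?{params}",
        headers={
            "Accept": "application/json",
            "X-Subscription-Token": api_key,  # API key issued by Brave
        },
    )
```

The request could then be sent with `urllib.request.urlopen(req)` and the JSON body parsed with `json.load`; keeping construction separate from sending makes the search step easy to test without spending API quota.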
The project faces several scope and technical constraints:
API Limitations: The team is staying within free-tier limits or a small budget, which restricts each run to approximately 5 sources.
Technical Complexity: Coordinating four distinct stages (Search, Scraping, Translation, and Frontend) to "talk to each other cleanly" requires significant integration effort.
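One simple way to respect the per-run source budget described above is to deduplicate candidate URLs and stop once the limit is reached, before any scraping or translation calls are made. This is a sketch only; `cap_sources` and its default of 5 are illustrative assumptions.

```python
def cap_sources(urls, limit=5):
    """Return at most `limit` unique URLs, preserving the original order."""
    seen = set()
    kept = []
    for url in urls:
        if len(kept) >= limit:
            break
        # Treat "https://x.org/a" and "https://x.org/a/" as the same source.
        key = url.rstrip("/")
        if key in seen:
            continue
        seen.add(key)
        kept.append(url)
    return kept
```

Applying the cap before the scraping and translation stages means paid or rate-limited calls are only made for sources that will actually be shown.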
We recognized the importance of human-in-the-loop workflows. By visiting an actual lab work session, we are learning how the tool must fit into the real research process of human rights investigators rather than operating in a vacuum. We also learned to prioritize core functionality (providing citations and web addresses) before moving on to advanced features like live translation.
Language Expansion: Finalizing the MVP with at least one Latin script and one non-Latin script language (from a target list of Arabic, Russian, Ukrainian, Spanish, and French).
Refined Packaging: Potentially moving beyond Streamlit to a React/FastAPI web app, CLI, or web extension.
Other sources: Integrating computer vision and quickly expanding into social media sourcing.