Research Projects

Projects in which I am the Principal Investigator

Computing Applied to Health Safety

2022 - Present

Improving health safety helps prevent harm to people and reduces costs for health systems. In this context, this project applies data mining, natural language processing, and information retrieval techniques on three distinct fronts united by a common thread: health safety. The first front deals with safety in the context of compiling evidence on the effectiveness of health treatments. The second deals with safety in the context of information retrieved by search engines. The third deals with medications, the most common clinical intervention in health care. According to the World Health Organization (WHO), unsafe medication practices and medication-related errors are among the leading causes of preventable harm in health systems worldwide. The project's practical contribution is a machine learning-based system to automatically detect prescriptions with potential errors; its scientific contributions include new algorithms and methodologies for the problems above.
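As an illustration of the medication front, the sketch below shows how a detector of potentially erroneous prescriptions could look. Everything here is a toy assumption (the features, labels, and model are mine, not the project's): prescriptions are reduced to categorical features and a tiny Naive Bayes classifier is trained to flag likely errors.

```python
from collections import defaultdict
import math

def train(examples):
    """examples: list of (features_dict, label) pairs; counts labels and feature values."""
    label_counts = defaultdict(int)
    feat_counts = defaultdict(lambda: defaultdict(int))
    for feats, label in examples:
        label_counts[label] += 1
        for k, v in feats.items():
            feat_counts[label][(k, v)] += 1
    return label_counts, feat_counts

def predict(model, feats):
    """Return the most likely label under a Naive Bayes model with Laplace smoothing."""
    label_counts, feat_counts = model
    total = sum(label_counts.values())
    best, best_lp = None, float("-inf")
    for label, n in label_counts.items():
        lp = math.log(n / total)  # log prior
        for k, v in feats.items():
            c = feat_counts[label].get((k, v), 0)
            lp += math.log((c + 1) / (n + 2))  # smoothed log likelihood
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Toy training data: dose bands and routes are illustrative, not real clinical rules.
examples = [
    ({"drug": "amoxicillin", "dose_band": "normal", "route": "oral"}, "ok"),
    ({"drug": "amoxicillin", "dose_band": "high", "route": "oral"}, "error"),
    ({"drug": "insulin", "dose_band": "normal", "route": "subcutaneous"}, "ok"),
    ({"drug": "insulin", "dose_band": "high", "route": "oral"}, "error"),
]
model = train(examples)
print(predict(model, {"drug": "insulin", "dose_band": "high", "route": "oral"}))  # → error
```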

Funded by CNPq

MD2 – Data Mining Applied to Medical Data

2018 - Present

The growing availability of data in the medical field, coupled with the maturation of data mining techniques, continues to foster collaboration between Medicine and Computing. This collaboration is motivated by the potential positive impact that knowledge discovered from medical data can have on people's health. However, several challenges need to be addressed before mining algorithms can yield results of practical value to the medical area. Medical data is typically heterogeneous, comprising both structured and unstructured data (text, images, and vital signs). Furthermore, its volume and the need for confidentiality pose practical challenges. The focus of this project is on applying data mining and natural language processing to analyze electronic health records.

Funded by CNPq

Geodigital: Integrated Search for Heterogeneous Geoscientific Data

2019 - 2021

E&P data management specialists agree that the most valuable information for an organization is represented in the form of unstructured data. Textual documents, emails, images, and diagrams are typical examples. Estimates indicate that employees spend almost ten hours a week searching for and gathering information. Aiming to reduce this effort, the overall goal of our project is to provide a solution for multimodal information retrieval (MIR). MIR is the process of organizing and enabling search over different types of data, i.e., modalities, such as text, image, audio, video, or 3D models. Here, our focus is on unstructured texts and figures.

Funded by CENPES/Petrobras

MOML - Multilingual Opinion Mining

2016 - Present

Opinion Mining is the computational study of opinions, sentiments, and emotions expressed in texts. Its goal is to automatically identify, classify, and aggregate the sentiment about a target from a collection of documents. The area has been gaining attention with the increase in the number of users who publish their opinions and experiences on the Web. The focus of this project is on Multilingual Opinion Mining (MOML), which aims to address opinions expressed in several languages, allowing the analysis of a greater number of reviews as well as a greater diversity and density of people's views. Our motivation comes from observing three limitations of MOML. The first is that the habit of writing comments evaluating products and/or services is much more frequent in some regions of the world than in others. To remedy this deficiency, we propose methods that use customer reviews written in foreign languages to provide product summaries on Brazilian websites. The second limitation concerns the scarcity of resources for MOML involving languages other than English, especially for the classification of emotions expressed in text (and not just polarity). In this context, the objective is to develop methods for classifying emotions in multilingual texts. The third limitation refers to the treatment of contradictions in review texts, i.e., identifying sentences that express contradictory feelings and how they impact Opinion Mining methods.
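A minimal sketch of the multilingual setting, under toy assumptions (the lexicons and aggregation rule are illustrative, not the project's methods): reviews in different languages are classified with per-language sentiment lexicons and then aggregated into a single summary statistic.

```python
# Toy per-language sentiment lexicons (illustrative; real resources are far larger).
LEXICONS = {
    "en": {"great": 1, "good": 1, "bad": -1, "awful": -1},
    "pt": {"otimo": 1, "bom": 1, "ruim": -1, "pessimo": -1},
}

def polarity(text, lang):
    """Classify a review's polarity by summing lexicon scores of its tokens."""
    lex = LEXICONS[lang]
    score = sum(lex.get(tok, 0) for tok in text.lower().split())
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def aggregate(reviews):
    """reviews: list of (text, lang); return the share of positive opinions."""
    labels = [polarity(text, lang) for text, lang in reviews]
    return labels.count("positive") / len(labels)

reviews = [
    ("great battery, good screen", "en"),
    ("produto otimo e bom", "pt"),
    ("awful support", "en"),
]
print(aggregate(reviews))  # 2 of 3 reviews are positive
```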

Funded by CNPq

Multi-Match – Multilingual Matching

2012 - 2015

Multilingual information is available in many different sources and formats. This fact has motivated research aimed at finding mappings between data represented in different languages in the areas of Information Retrieval, Natural Language Processing, and, more recently, Databases. This project focuses on researching and proposing new methods for matching multilingual data in different scenarios. Our goals are: (i) collecting parallel corpora on the web, (ii) finding correspondences in the multilingual Wikipedia, and (iii) detecting multilingual plagiarism. The results of this project will contribute to Information Retrieval and Natural Language Processing by providing parallel corpora, which are important resources for the advancement of these areas. The wide availability of multilingual data also facilitates plagiarism; in this topic, our main contribution will be the analysis of citations and references, which is essential for the confirmation of plagiarism.
Funded by CNPq (Edital Universal)

DP-ML Cross-Language Plagiarism Detection

2010 - 2012

With the dissemination of the Web, millions of people gained access to information from several areas of knowledge. The amount of digital information available has grown immensely. However, despite its many benefits, the Web is also one of the easiest means of enabling plagiarism. Plagiarism is one of the most serious forms of academic misconduct. It is defined as “the use of another person's written work without acknowledging the source”. Recent research has shown that this type of misconduct is increasingly frequent in the academic world, motivating the development of techniques to automate plagiarism detection. This project focuses on cross-language plagiarism, in which the contents of a document are translated without any reference to the source. Our aim is to develop an efficient method for detecting cross-language plagiarism.
Funded by CNPq (Edital Universal)

Cross-Language Information Retrieval

2008 - 2010

The aim of this project is to contribute to the development of Cross-Language Information Retrieval (CLIR) involving Portuguese. The motivation is the growing need, experienced by more and more people, to explore documents in foreign languages. CLIR research has been developing quickly since the late 90's. Despite recent advances, many aspects remain unexplored, especially regarding the use of Portuguese. In this project, we aim to develop a system that accepts queries in Portuguese and searches for documents in English. In addition, the following aspects will be investigated: (i) development of stemming algorithms for Portuguese; (ii) proposal of new techniques for mapping concepts between languages through the analysis of parallel and comparable corpora; (iii) study of the relevance feedback process in a CLIR environment; and (iv) development of techniques for the identification of multi-word expressions.
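One classical approach to CLIR, dictionary-based query translation, can be sketched as follows. The bilingual dictionary, documents, and weighting below are toy assumptions for illustration: a Portuguese query is mapped to English terms and the English documents are ranked by a simple TF-IDF score.

```python
import math

# Toy Portuguese→English dictionary; ambiguous terms map to several translations.
PT_EN = {"cachorro": ["dog"], "doenca": ["disease", "illness"], "coracao": ["heart"]}

DOCS = {
    "d1": "heart disease is a leading cause of death",
    "d2": "my dog likes to run",
    "d3": "the illness affected the heart of the patient",
}

def translate(query_pt):
    """Expand each Portuguese token into its English translations (or keep it as-is)."""
    terms = []
    for tok in query_pt.lower().split():
        terms.extend(PT_EN.get(tok, [tok]))
    return terms

def rank(query_pt):
    """Rank English documents by TF-IDF score for the translated query."""
    terms = translate(query_pt)
    tokenized = {d: text.split() for d, text in DOCS.items()}
    n = len(DOCS)
    scores = {}
    for d, toks in tokenized.items():
        score = 0.0
        for t in terms:
            df = sum(1 for tk in tokenized.values() if t in tk)  # document frequency
            if df:
                score += toks.count(t) * math.log(n / df)
        scores[d] = score
    return sorted(scores, key=scores.get, reverse=True)

print(rank("doenca coracao"))  # the two heart/illness documents outrank the dog one
```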
Funded by CNPq (Edital Universal)

Integrating Information Retrieval Techniques into Database Systems

2005 - 2009

In the classical view, the areas of Information Retrieval (IR) and Databases (DB) have little in common. DB systems normally deal with structured data, while IR deals with unstructured documents, typically in the form of free text. Considering that the data stored by most organisations is both structured and unstructured, and that users frequently have to query data in both formats, there is a growing need to integrate the two areas. The aim of this project is to apply IR concepts to DB systems to facilitate the processing of imprecise queries. The challenge is to modify query processing to include the notions of similarity and relevance.
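The idea of replacing a boolean predicate with similarity and relevance can be sketched as below. The table, weights, and scoring function are illustrative assumptions, not the project's design: instead of returning only exact matches, the query ranks all rows by their similarity to the target values.

```python
# Toy relation; in a real system these rows would come from the DB engine.
CARS = [
    {"id": 1, "model": "sedan", "price": 1450},
    {"id": 2, "model": "sedan", "price": 2900},
    {"id": 3, "model": "coupe", "price": 1500},
]

def similarity(row, target_model, target_price, price_scale=1000):
    """Score a row's closeness to the query; equal attribute weights are an assumption."""
    model_sim = 1.0 if row["model"] == target_model else 0.0
    price_sim = max(0.0, 1.0 - abs(row["price"] - target_price) / price_scale)
    return 0.5 * model_sim + 0.5 * price_sim

def imprecise_query(target_model, target_price, k=2):
    """Top-k rows by relevance instead of a strict WHERE model = ... AND price = ..."""
    ranked = sorted(
        CARS,
        key=lambda r: similarity(r, target_model, target_price),
        reverse=True,
    )
    return [r["id"] for r in ranked[:k]]

print(imprecise_query("sedan", 1500))  # the 1450-priced sedan ranks first
```

A strict equality query for a sedan at exactly 1500 would return nothing here; the ranked version surfaces the near-match instead.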

Projects in which I took part as Faculty Participant


2020 - Present

In the context of COVID-19, the speed with which the disease has spread throughout the world demands agile solutions to accelerate the diagnosis of patients with suspicious symptoms and to estimate the evolution of the disease. In view of the limitations associated with laboratory tests and the sharp increase in demand for health services in a pandemic scenario, it is essential to develop strategies that leverage the analysis of clinical data, epidemiological history, and imaging tests in order to diagnose COVID-19 more quickly among suspected cases, especially in the early stages of the disease. The goal of this project is to apply state-of-the-art machine learning techniques to aid in the diagnosis and understanding of COVID-19. The project's contributions are divided into five goals: (i) automatic classification of chest tomographies; (ii) development of predictive models and identification of new risk factors through the mining of clinical data; (iii) integration of clinical data and images using multimodal approaches to improve diagnosis; (iv) semantic search for scientific articles related to COVID-19; and (v) monitoring and predicting the evolution of COVID-19 through information visualization.

Funded by FAPERGS

Principal Investigator: João Comba

Formation and Analysis of Groups in Big Data Using Visualization Techniques

2018 - Present

Data acquisition has never been so vast, diverse, and accessible. Today, almost all computer-based devices offer a wide range of options for collecting multi-dimensional data. Social networks, government data, application logs, and many other sources generate immense amounts of raw data. The term big data refers to this collection of data, which is usually heterogeneous and large. Exploring and analyzing these data sets through aggregation is essential, often using histograms, heat maps, and other information visualization techniques. Traditional tools, such as relational databases and business intelligence software, have trouble supporting these visualizations in low-latency interactive scenarios. Current specialized solutions can require prohibitively large amounts of memory as the number of dimensions increases. Thus, research on specialized data structures that reduce query latency in these scenarios remains necessary. In this project, we will develop several research lines related to the analysis and efficient processing of big data.


Principal Investigator: João Comba


2010 - 2015

The goal of this project is to investigate, propose, experiment with, apply, and validate automatic and collaborative techniques for the development of lexical and ontological resources that can be useful in the context of multilingual applications, particularly for French, Portuguese, and English.
Funded by CAPES
Coordinated by Aline Villavicencio


2009 - 2014

INWeb was created to study the various phenomena related to the Web. It is an institute composed of a network of researchers from four Brazilian universities. Its mission is to develop models, algorithms, and technologies that contribute to the integration of the Web with society. As a result, we expect more effective and secure distribution of information and more efficient and useful applications, so that the Web can become a vector for social and economic change in our country. The institute's activities include research, training of human resources, and knowledge transfer to society and companies. Our research proposal plans to advance the state of the art across the three layers of networks. Specifically, we aim to develop solutions for three great challenges defined on the unified view of the Web: (i) identification, characterization, and modeling of people's interests and behavior patterns on the Web and the networks established among them; (ii) treatment of the information that circulates through the Web layers, considering the activities of crawling, extracting, and processing information; and (iii) delivery of information in a satisfactory way, regardless of time and place.
Coordinated by Virgílio Almeida (UFMG).
Funded by CNPq, MCT, and Fapemig

GPU Cluster

2008 - 2012

The aim of this project is to build a computer cluster based on Graphics Processing Units (GPUs) at the Institute of Informatics of UFRGS. The cluster consists of 6 computers with quad-core processors (4 CPUs each), each connected via PCI-X to an external unit containing 4 GPUs. Thus, the cluster has 24 CPUs and 24 GPUs connected by high-speed InfiniBand switches. Given that each GPU is internally composed of 128 processors, there are effectively 3072 internal processors, with a computational power of approximately 12 TFLOPS. The computational resources provided by this cluster will allow the processing of computationally complex tasks and will be vital for the research to be developed at the university in the next few years.

Funded by CNPq (Edital Jovens Pesquisadores)

Principal Investigator: João Comba


2007 - 2009

Approximate Data Matching aims at deciding whether two data instances represent the same real-world entity. This technique is employed in many data management applications, such as record deduplication, similarity querying, similarity joining, and schema integration. This project addresses three open problems in Approximate Data Matching: (i) defining adequate similarity functions for complex objects such as XML trees; (ii) developing quantitative measures to compare the quality of similarity functions; and (iii) studying how query decomposition methods should behave in environments where schema matching happens at query time.
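Problem (i) can be illustrated with a simple similarity function over flat records (the project targets more complex objects such as XML trees; the threshold and records below are illustrative assumptions): two records are declared the same entity when the normalized edit distance between their name fields is small.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def name_similarity(a, b):
    """Normalize edit distance into a [0, 1] similarity score."""
    a, b = a.lower(), b.lower()
    return 1.0 - edit_distance(a, b) / max(len(a), len(b), 1)

def same_entity(rec1, rec2, threshold=0.8):
    """The 0.8 threshold is an arbitrary illustrative choice."""
    return name_similarity(rec1["name"], rec2["name"]) >= threshold

# Hypothetical records: small spelling variations should still match.
r1 = {"name": "Carlos A. Heuser"}
r2 = {"name": "Carlos A Heuser"}
print(same_entity(r1, r2))  # → True
```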

Funded by CNPq (Edital Universal)
Principal Investigator: Carlos A. Heuser



Managing Large Volumes of Textual Data

2008 - 2010

This project falls within the scope of the challenges set by the Brazilian Computer Society, namely the management of large volumes of distributed multimedia data. In this context, the project deals specifically with the management of textual data, such as web pages or electronic documents, created by public or private organisations. One of the central problems is establishing relations and associations between documents. In this project, two types of relationships are considered: (i) versioning, aiming at determining groups of documents that can be considered different versions of the same information; and (ii) content similarity, aiming at clustering documents that deal with the same subject.

Funded by CNPq (Edital Grandes Desafios)
Principal Investigator: J. Palazzo M. de Oliveira

Cell Assemblies

2003 - 2003

Reverberating circuits of neurons can explain many psychological phenomena; as the neural representation of concepts, they may be the basis of thought. While evidence exists for neural Cell Assemblies (CAs), there has been very little work on their computational modelling. The goal of this project is to explore models, metrics, and uses of CAs. My contribution was to adapt the CA model to perform Information Retrieval.
Funded by EPSRC (England)
Coordinated by Christian Huyck
