Project Description
Our team, NetMine, has two members: Dorna Heidari and Majid Zarrinkolah (Ario). We participated at Deakin University, entered the challenge "Making public archives more accessible", and developed a project named DataTrans.
Our goal was to make public archives more accessible. On the technical side, DataTrans extracts information from various documents using OCR and transforms it into vectors. We then use a large language model (LLM) to ask questions of the data and gain insights from it.
We both contributed to the code for this part of the project, which extracts information from a large number of documents and covers tasks such as tagging and title retrieval. For this purpose we implemented an Optical Character Recognition (OCR) function that extracts data from various document types, including PDFs that contain both images and text. The extracted text is organised into documents, which are then divided into chunks. Each chunk is transformed into a vector that serves as a retrievable data point. These vectors are how the document content is passed to the large language model.
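The chunking and vectorising steps described above can be sketched as follows. This is a minimal illustration and not the DataTrans code: the function names `chunk_text` and `embed`, the chunk size, the overlap, and the vector dimension are all assumptions, and the hashed bag-of-words vector is only a stand-in for whatever embedding model the project actually used. It assumes the OCR step has already produced plain text.

```python
import hashlib
import math
import re


def chunk_text(text, chunk_size=50, overlap=10):
    """Split text into overlapping windows of words (sizes are illustrative)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


def embed(text, dim=64):
    """Toy hashed bag-of-words vector, L2-normalised.

    A stand-in for a real embedding model, used only to show the data flow.
    """
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # Hash each token to a stable bucket index.
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def vectorise_document(text):
    """Return (chunk, vector) pairs ready to be stored for retrieval."""
    return [(chunk, embed(chunk)) for chunk in chunk_text(text)]
```

Overlapping the chunks is a common choice because it keeps a sentence that straddles a boundary visible in both neighbouring chunks; whether DataTrans used overlap is not stated in the description.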
We had a variety of large language models to choose from, all of which require substantial memory. We decided on Llama 2, which has billions of parameters. Where higher output quality is needed, we can switch to a larger variant, such as the 70-billion-parameter model. With the vectorised data stored in our database, we feed it into Llama 2 and ask the model questions. These can request titles, brief overviews, summaries, terms, and keywords for a document. The results offer insights comparable to those of a human reader.
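One common way to connect stored vectors to a model such as Llama 2 is to retrieve the chunks most similar to the question and place them in the prompt. The sketch below shows that idea only; it is an assumption about how the pieces fit together, not the project's implementation. The names `cosine`, `top_k`, and `build_prompt` and the prompt wording are hypothetical, and the call to the model itself is left out.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, store, k=2):
    """Return the k chunk texts whose vectors are closest to the query.

    `store` is a list of (chunk_text, vector) pairs.
    """
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, context_chunks):
    """Assemble a prompt that asks the model to answer from the retrieved text."""
    context = "\n\n".join(context_chunks)
    return (
        "Use only the context below to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The resulting string would then be passed to the language model; asking for a title, a summary, or keywords only changes the `question` argument.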
Furthermore, the information gathered this way can itself be stored in a database. This makes searching and retrieval efficient and streamlines access to the associated documents.
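As an illustration of archiving the generated information, the sketch below stores a title, summary, and keywords per document in SQLite and searches them with a simple substring match. The table layout, column names, and function names are assumptions made for this example; the description does not say which database or schema DataTrans used.

```python
import sqlite3


def create_archive(conn):
    """Create the (hypothetical) table that holds model-generated metadata."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "id INTEGER PRIMARY KEY, path TEXT NOT NULL, "
        "title TEXT, summary TEXT, keywords TEXT)"
    )


def add_document(conn, path, title, summary, keywords):
    """Insert one document's metadata and return its row id."""
    cur = conn.execute(
        "INSERT INTO documents (path, title, summary, keywords) VALUES (?, ?, ?, ?)",
        (path, title, summary, ", ".join(keywords)),
    )
    conn.commit()
    return cur.lastrowid


def search(conn, term):
    """Return (path, title) for documents whose metadata mentions the term."""
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT path, title FROM documents "
        "WHERE title LIKE ? OR summary LIKE ? OR keywords LIKE ? ORDER BY id",
        (pattern, pattern, pattern),
    )
    return rows.fetchall()
```

A search returns the stored path, so a matching record leads straight back to the source document.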