Data Ingestion
In Ragrank, you can add data in several ways. There are two types of data objects.
DataNode: Contains a single data point.

from ragrank.dataset import DataNode

example_datanode = DataNode(
    question="What is the tallest mountain in the world?",
    context=[
        "Mount Everest is the tallest mountain above sea level.",
        "It is located in the Himalayas.",
    ],
    response="The tallest mountain in the world is Mount Everest.",
)
Dataset: Contains multiple data points. Each field holds one entry per data point, in the same order.

from ragrank.dataset import Dataset

example_dataset = Dataset(
    question=[
        "What is the tallest mountain in the world?",
        "Who wrote the Harry Potter series?",
    ],
    context=[
        [
            "Mount Everest is the tallest mountain above sea level.",
            "It is located in the Himalayas.",
        ],
        [
            "J.K. Rowling wrote the Harry Potter series.",
            "The series became extremely popular worldwide.",
        ],
    ],
    response=[
        "The tallest mountain in the world is Mount Everest.",
        "The Harry Potter series was written by J.K. Rowling.",
    ],
)
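The two shapes carry the same information: a Dataset stores each field as a column (one list per field), while a DataNode stores a single row. The sketch below illustrates that relationship with plain dictionaries only. `split_columns` is a hypothetical helper written for this illustration and is not part of the Ragrank API.

```python
from typing import Any, Dict, List


def split_columns(columns: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Turn column-wise data (Dataset-like) into row-wise records (DataNode-like).

    Hypothetical helper for illustration; not part of Ragrank.
    """
    # Every column must describe the same number of data points.
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same number of entries")
    keys = list(columns)
    # zip(*...) walks the columns in parallel, yielding one row at a time.
    return [dict(zip(keys, row)) for row in zip(*columns.values())]


columns = {
    "question": [
        "What is the tallest mountain in the world?",
        "Who wrote the Harry Potter series?",
    ],
    "context": [
        ["Mount Everest is the tallest mountain above sea level."],
        ["J.K. Rowling wrote the Harry Potter series."],
    ],
    "response": [
        "The tallest mountain in the world is Mount Everest.",
        "The Harry Potter series was written by J.K. Rowling.",
    ],
}

rows = split_columns(columns)
for row in rows:
    print(row["question"])
```

Each resulting record has a string `question`, a list-of-strings `context`, and a string `response`, which is the same layout the DataNode example uses.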
Column Map
A data point should have question, context, and response fields. If your data columns have different names, you can use ColumnMap to map them to these fields.
from ragrank.dataset import from_dict, ColumnMap

# The source data uses custom column names.
example_datanode = {
    "query": "What is the largest mammal on Earth?",
    "related_context": [
        "The blue whale holds the title of the largest mammal.",
        "It is a marine mammal found in oceans around the world.",
    ],
    "answer": "The largest mammal on Earth is the blue whale.",
}

# Map each standard field to the matching source column.
data = from_dict(
    example_datanode,
    column_map=ColumnMap(
        question="query", context="related_context", response="answer"
    ),
)
In all reader methods, you can use ColumnMap to map columns.
Caution
Internally, the data object stores the data in the question, context, and response fields. After the data is read, the original field names are not preserved, so you can't access the data by those names.
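To make the caution concrete, here is a minimal sketch of what a column mapping amounts to, using plain dictionaries rather than Ragrank objects. `apply_column_map` is a hypothetical helper, and how Ragrank performs the renaming internally is an assumption here. The point is only that the result keeps the standard names and drops the original ones.

```python
from typing import Any, Dict


def apply_column_map(
    record: Dict[str, Any], column_map: Dict[str, str]
) -> Dict[str, Any]:
    """Rename source columns to the standard field names.

    `column_map` maps standard field name -> source column name, mirroring
    the keyword arguments passed to ColumnMap. Hypothetical helper for
    illustration; not part of Ragrank.
    """
    return {field: record[source] for field, source in column_map.items()}


raw = {
    "query": "What is the largest mammal on Earth?",
    "related_context": [
        "The blue whale holds the title of the largest mammal.",
    ],
    "answer": "The largest mammal on Earth is the blue whale.",
}

mapped = apply_column_map(
    raw,
    {"question": "query", "context": "related_context", "response": "answer"},
)

print(sorted(mapped))     # ['context', 'question', 'response']
print("query" in mapped)  # False: the original name is gone
```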
Data Readers
There are multiple data readers available in Ragrank.
from_dict: Ingests data from a dict. Converts it to a DataNode or a Dataset according to the type of data.

from ragrank.dataset import from_dict

data = from_dict(
    {
        "question": "What is the largest mammal on Earth?",
        "context": [
            "The blue whale holds the title of the largest mammal.",
            "It is a marine mammal found in oceans around the world.",
        ],
        "response": "The largest mammal on Earth is the blue whale.",
    },
    return_as_dataset=False,
    column_map=None,  # specify if any
)
from_csv: Ingests data from a CSV file.

from ragrank.dataset import from_csv

data = from_csv(
    path="data.csv",
    column_map=None,  # specify if any
)
from_dataframe: Ingests data from a Pandas DataFrame.

from pandas import DataFrame
from ragrank.dataset import from_dataframe

# Each row is one data point; the context cell holds a list of strings.
dataframe = DataFrame(
    {
        "question": ["What is the largest mammal on Earth?"],
        "context": [
            [
                "The blue whale holds the title of the largest mammal.",
                "It is a marine mammal found in oceans around the world.",
            ]
        ],
        "response": ["The largest mammal on Earth is the blue whale."],
    }
)

data = from_dataframe(
    data=dataframe,
    column_map=None,  # specify if any
)
from_hfdataset: Ingests data from Hugging Face datasets.

from ragrank.dataset import from_hfdataset

data = from_hfdataset(
    url="izammohammed/engineering_qa",
    split="train",
    column_map=None,  # specify if any
)
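The from_dict reader is described as choosing between a DataNode and a Dataset according to the type of data. How Ragrank makes that decision internally is not shown here. One plausible rule, sketched below purely as an assumption, is to treat a string question as a single data point and a list of questions as multiple data points. `infer_kind` is a hypothetical function and not part of the Ragrank API.

```python
from typing import Any, Dict


def infer_kind(data: Dict[str, Any]) -> str:
    """Guess whether a dict describes one data point or several.

    Assumed rule for illustration only; Ragrank's own logic may differ.
    """
    question = data["question"]
    if isinstance(question, str):
        return "datanode"  # one question -> a single data point
    if isinstance(question, list):
        return "dataset"  # a list of questions -> multiple data points
    raise TypeError("question must be a str or a list of str")


single = {
    "question": "What is the largest mammal on Earth?",
    "context": ["The blue whale holds the title of the largest mammal."],
    "response": "The largest mammal on Earth is the blue whale.",
}
multiple = {
    "question": ["What is the largest mammal on Earth?"],
    "context": [["The blue whale holds the title of the largest mammal."]],
    "response": ["The largest mammal on Earth is the blue whale."],
}

print(infer_kind(single))    # datanode
print(infer_kind(multiple))  # dataset
```

Note that the context field cannot be used for this check on its own, since it is a list in both shapes: a list of strings for one data point, and a list of lists for several.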