Data Ingestion#
In Ragrank you can add data in multiple ways. 2 types of data is there.
DataNode
: Contain single data point.from ragrank.dataset import DataNode example_datanode = DataNode( question="What is the tallest mountain in the world?", context=[ "Mount Everest is the tallest mountain above sea level.", "It is located in the Himalayas.", ], response="The tallest mountain in the world is Mount Everest." )
Dataset
: Contain multiple data points.from ragrank.dataset import Dataset example_dataset = Dataset( question=[ "What is the tallest mountain in the world?", "Who wrote the Harry Potter series?", ], context=[ [ "Mount Everest is the tallest mountain above sea level.", "It is located in the Himalayas.", ], [ "J.K. Rowling wrote the Harry Potter series.", "The series became extremely popular worldwide.", ], ], response=[ "The tallest mountain in the world is Mount Everest.", "The Harry Potter series was written by J.K. Rowling.", ] )
Column Map#
In a datapoint, there should If your data columns have different names you can use Column Map
from ragrank.dataset import from_csv, ColumnMap
example_datanode = {
"question": "What is the largest mammal on Earth?",
"context": [
"The blue whale holds the title of the largest mammal.",
"It is a marine mammal found in oceans around the world.",
],
"response": "The largest mammal on Earth is the blue whale.",
}
data = from_csv(
example_datanode,
column_map=ColumnMap(
question="query", context="related_context", response="answer"
),
)
In all reader methods, you can use ColumnMap
to map columns.
Caution
Internally, the data object is saving the data in the question
, context
, and response
fields. After reading the data, the previous field names are not preserved. You canβt access the data with the previous field names either.
Data Readers#
There are multiple data readers availble in Ragrank.
from_dict: ingest data from a dict. Will convert
DataNode
andDataset
according to the type of data.from ragrank.dataset import from_dict data = from_dict( { "question": "What is the largest mammal on Earth?", "context": [ "The blue whale holds the title of the largest mammal.", "It is a marine mammal found in oceans around the world.", ], "response": "The largest mammal on Earth is the blue whale.", }, return_as_dataset=False, column_map=None # specify if any )
from_csv: ingesting data from csv file
from ragrank.dataset import from_csv data = from_csv( path="data.csv", column_map=None, # specify if any )
from_dataframe: Ingesting data from Pandas DataFrame.
from ragrank.dataset import from_dataframe from pandas import DataFrame dataframe = DataFrame( { "question": "What is the largest mammal on Earth?", "context": [ "The blue whale holds the title of the largest mammal.", "It is a marine mammal found in oceans around the world.", ], "response": "The largest mammal on Earth is the blue whale.", } ) data = from_dataframe( data=dataframe, column_map=None # specify if any )
from_hfdataset: Ingesting data from Huggingface datasets
from ragrank.dataset import from_hfdataset data = from_hfdataset( url="izammohammed/engineering_qa", split="train", column_map=None # specify if any )