4.2 [SFT] Instruction SFT
Datasets

S - Download raw datasets

T The dataset is stored in a relatively small JSON file (only 204 KB). JSON (JavaScript Object Notation) is a data-interchange format that is both easy for humans to read and suitable for machine processing, somewhat similar to a Python dictionary. The instruction dataset contains 1,100 instruction-response pairs.

A

    import json
    import os
    import urllib.request  # plain "import urllib" does not reliably expose urllib.request


    def download_and_load_file(file_path, url):
        # Download the file only if it is not already cached locally
        if not os.path.exists(file_path):
            with urllib.request.urlopen(url) as response:
                text_data = response.read().decode("utf-8")
            with open(file_path, "w", encoding="utf-8") as file:
                file.write(text_data)

        # Parse the local JSON file into Python objects
        with open(file_path, "r", encoding="utf-8") as file:
            data = json.load(file)
        return data


    file_path = "instruction-data.json"
    url = (
        "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch"
        "/main/ch07/01_main-chapter-code/instruction-data.json"
    )

    data = download_and_load_file(file_path, url)
    print("Number of entries:", len(data))
    print("Example entry:\n", data[50])

    # The data list loaded from the JSON file contains the 1,100 samples of the instruction dataset
    >>> type(data)
    <class 'list'>
    >>> len(data)
    1100
    >>>
    >>> from rich.pretty import pprint
    >>> pprint(data[0])
    {
    │   'instruction': 'Evaluate the following phrase by transforming it into the spelling given.',
    │   'input': 'freind --> friend',
    │   'output': 'The spelling of the given phrase "freind" is incorrect, the correct spelling is "friend".'
    }

***S - Data preprocessing before any encapsulation split by portion TA ...
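Before any preprocessing, it may be worth checking that the loaded entries have the shape shown in the `pprint(data[0])` output: a dict with the keys `instruction`, `input`, and `output`. The sketch below is not part of the original notes. The helper name `summarize_entries` and the second sample entry are hypothetical, and the sample list stands in for `data` so the snippet runs without a download.

```python
# Keys observed in the pprint(data[0]) output above
REQUIRED_KEYS = ("instruction", "input", "output")


def summarize_entries(entries):
    """Hypothetical helper: report basic structural facts about a list of entries."""
    # Indices of entries that lack at least one of the expected keys
    missing_keys = [
        i for i, entry in enumerate(entries)
        if not all(key in entry for key in REQUIRED_KEYS)
    ]
    # Entries whose 'input' key is present but holds an empty string
    empty_input = sum(1 for entry in entries if entry.get("input") == "")
    return {
        "total": len(entries),
        "missing_keys": missing_keys,
        "empty_input": empty_input,
    }


# Stand-in for the downloaded `data` list
sample = [
    {
        "instruction": "Evaluate the following phrase by transforming it into the spelling given.",
        "input": "freind --> friend",
        "output": 'The spelling of the given phrase "freind" is incorrect, the correct spelling is "friend".',
    },
    # Hypothetical entry, made up here to illustrate an empty 'input' field
    {"instruction": "Name a primary color.", "input": "", "output": "Red."},
]

summary = summarize_entries(sample)
print(summary)
```

With the real dataset, the same call would be `summarize_entries(data)`. An empty `missing_keys` list would suggest that every entry can be formatted the same way in later steps.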