docs.sheetjs.com/06-loader.md at ffbe7af307e67737ad3887e578c1f32cb132d674 (2024)

| title | pagination_prev | pagination_next | sidebar_position |
| --- | --- | --- | --- |
| Loader Tutorial | getting-started/installation/index | getting-started/roadmap | 6 |

import current from '/version.js';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import CodeBlock from '@theme/CodeBlock';

Many existing systems and platforms include support for loading data from CSV files. Many users prefer to work in spreadsheet software and multi-sheet file formats including XLSX. SheetJS libraries help bridge the gap by translating complex workbooks to simple CSV data.

The goal of this example is to load spreadsheet data into a vector store and use a large language model to generate queries based on English language input. The existing tooling supports CSV but does not support real spreadsheets.

In "SheetJS Conversion", we will use SheetJS libraries to generate CSV files for the LangChain CSV loader. These conversions can be run in a preprocessing step without disrupting existing CSV workflows.

In "SheetJS Loader", we will use SheetJS libraries in a custom `LoadOfSheet` data loader to directly generate documents and metadata.

"SheetJS Loader Demo" is a complete demo that uses the SheetJS Loader to answer questions based on data from an XLS workbook.

:::note Tested Deployments

This demo was tested in the following configurations:

| Date | Platform |
| --- | --- |
| 2024-07-15 | Apple M2 Max 12-Core CPU + 30-Core GPU (32 GB unified memory) |
| 2024-07-14 | NVIDIA RTX 4090 (24 GB VRAM) + i9-10910 (128 GB RAM) |
| 2024-07-14 | NVIDIA RTX 4080 SUPER (16 GB VRAM) + i9-10910 (128 GB RAM) |

SheetJS users have verified this demo in other configurations:

| Demo | Platform |
| --- | --- |
| LangChainJS | NVIDIA RTX 4070 Ti (12 GB VRAM) + Ryzen 7 5800x (64 GB RAM) |
| LangChainJS | NVIDIA RTX 4060 (8 GB VRAM) + Ryzen 7 5700g (32 GB RAM) |
| LangChainJS | NVIDIA RTX 3090 (24 GB VRAM) + Ryzen 9 3900XT (128 GB RAM) |
| LangChainJS | NVIDIA RTX 3080 (12 GB VRAM) + Ryzen 7 5800X (32 GB RAM) |
| LangChainJS | NVIDIA RTX 3060 (12 GB VRAM) + i5-11400 (32 GB RAM) |
| LangChainJS | NVIDIA RTX 2080 (12 GB VRAM) + i7-9700K (16 GB RAM) |
| LangChainJS | NVIDIA RTX 2060 (6 GB VRAM) + Ryzen 5 3600 (32 GB RAM) |
| LangChainJS | NVIDIA GTX 1080 (8 GB VRAM) + Ryzen 7 5800x (64 GB RAM) |
| LangChainJS | NVIDIA GTX 1070 (8 GB VRAM) + Ryzen 7 7700x (32 GB RAM) |

Special thanks to:

:::

CSV Loader

:::note pass

This explanation was verified against LangChain 0.2.

:::

Document loaders generate data objects ("documents") and associated metadata from data sources.

LangChain offers a `CSVLoader`[^1] component for loading CSV data from a file:

```js
import { CSVLoader } from "@langchain/community/document_loaders/fs/csv";

const loader = new CSVLoader("pres.csv");
const docs = await loader.load();
console.log(docs);
```

The CSV loader uses the first row to determine column headers and generates one document per data row. For example, the following CSV holds Presidential data:

```
Name,Index
Bill Clinton,42
GeorgeW Bush,43
Barack Obama,44
Donald Trump,45
Joseph Biden,46
```

Each data row is translated to a document whose content is a list of attributes and values. For example, the third data row is shown below:

| CSV Row | Document Content |
| --- | --- |
| `Name,Index`<br/>`Barack Obama,44` | `Name: Barack Obama`<br/>`Index: 44` |

The LangChain CSV loader will include source metadata in the document:

```js
Document {
  pageContent: 'Name: Barack Obama\nIndex: 44',
  metadata: { source: 'pres.csv', line: 3 }
}
```
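The row-to-document mapping described above can be sketched in plain JavaScript. This is an illustration only, not LangChain's implementation: `csv_to_docs` is a hypothetical helper, and the naive `split(",")` parsing assumes that no field contains quoted commas or embedded newlines.

```javascript
// Hypothetical helper: mimic the shape of the CSVLoader output for simple CSV.
// Assumption: no quoted fields, embedded commas, or embedded newlines.
function csv_to_docs(csv, source) {
  const [header, ...rows] = csv.trim().split(/\r?\n/);
  const keys = header.split(",");
  return rows.map((row, idx) => {
    const values = row.split(",");
    return {
      // one "key: value" line per column
      pageContent: keys.map((k, i) => `${k}: ${values[i]}`).join("\n"),
      // `line` counts data rows starting from 1, as in the sample output above
      metadata: { source, line: idx + 1 }
    };
  });
}

const csv = `Name,Index
Bill Clinton,42
GeorgeW Bush,43
Barack Obama,44
Donald Trump,45
Joseph Biden,46`;

const docs = csv_to_docs(csv, "pres.csv");
console.log(docs[2]);
```

The third element has the same `pageContent` and `metadata` values as the `Document` shown above.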

SheetJS Conversion

The SheetJS NodeJS module can be imported in NodeJS scripts that use LangChain and other JavaScript libraries.

A simple pre-processing step can convert workbooks to CSV files that can be processed by the existing CSV tooling:

```mermaid
flowchart LR
  file[(Workbook\nXLSX/XLS)]
  subgraph SheetJS Structures
    wb(((SheetJS\nWorkbook)))
    ws((SheetJS\nWorksheet))
  end
  csv(CSV\nstring)
  docs[[Documents\nArray]]
  file --> |readFile\n\n| wb
  wb --> |wb.Sheets\nselect sheet| ws
  ws --> |sheet_to_csv\n\n| csv
  csv --> |CSVLoader\n\n| docs
  linkStyle 0,1,2 color:blue,stroke:blue;
```

The SheetJS `readFile` method[^2] can read general workbooks. The method returns a workbook object that conforms to the SheetJS data model[^3].

Workbook objects represent multi-sheet workbook files. They store individual worksheet objects and other metadata.

Each worksheet in the workbook can be written to CSV text using the SheetJS `sheet_to_csv`[^4] method.

For example, the following NodeJS script reads `pres.xlsx` and displays CSV rows from the first worksheet:

```js
/* Load SheetJS Libraries */
import { readFile, set_fs, utils } from 'xlsx';

/* Load 'fs' for readFile support */
import * as fs from 'fs';
set_fs(fs);

/* Parse `pres.xlsx` */
const wb = readFile("pres.xlsx");

/* Print CSV rows from first worksheet */
const first_ws = wb.Sheets[wb.SheetNames[0]];
const csv = utils.sheet_to_csv(first_ws);
console.log(csv);
```

:::note pass

A number of demos cover spiritually similar workflows:

  • Stata, MATLAB and Maple support XLSX data import. The SheetJS integrations generate clean XLSX workbooks from user-supplied spreadsheets.

  • TensorFlow.js, Pandas and Mathematica support CSV. The SheetJS integrations generate clean CSVs and use built-in CSV processors.

  • The "Command-Line Tools" demo covers techniques for making standalone command-line tools for file conversion.

:::

Single Worksheet

For a single worksheet, a SheetJS pre-processing step can write the CSV rows to file and the `CSVLoader` can load the newly written file.

```js
import { CSVLoader } from "@langchain/community/document_loaders/fs/csv";
import { readFile, set_fs, utils } from 'xlsx';

/* Load 'fs' for readFile support */
import * as fs from 'fs';
set_fs(fs);

/* Parse `pres.xlsx` */
const wb = readFile("pres.xlsx");

/* Generate CSV and write to `pres.xlsx.csv` */
const first_ws = wb.Sheets[wb.SheetNames[0]];
const csv = utils.sheet_to_csv(first_ws);
fs.writeFileSync("pres.xlsx.csv", csv);

/* Create documents with CSVLoader */
const loader = new CSVLoader("pres.xlsx.csv");
const docs = await loader.load();

console.log(docs);
// ...
```

Workbook

A workbook is a collection of worksheets. Each worksheet can be exported to a separate CSV. If the CSVs are written to a subfolder, a `DirectoryLoader`[^5] can process the files in one step.

In this example, the script creates a subfolder named `csv`. Each worksheet in the workbook will be processed and the generated CSV will be stored to numbered files. The first worksheet will be stored to `csv/0.csv`.

```js
import { CSVLoader } from "@langchain/community/document_loaders/fs/csv";
import { DirectoryLoader } from "langchain/document_loaders/fs/directory";
import { readFile, set_fs, utils } from 'xlsx';

/* Load 'fs' for readFile support */
import * as fs from 'fs';
set_fs(fs);

/* Parse `pres.xlsx` */
const wb = readFile("pres.xlsx");

/* Create a folder `csv` (ignore the error if it already exists) */
try { fs.mkdirSync("csv"); } catch(e) {}

/* Generate CSV data for each worksheet */
wb.SheetNames.forEach((name, idx) => {
  const ws = wb.Sheets[name];
  const csv = utils.sheet_to_csv(ws);
  fs.writeFileSync(`csv/${idx}.csv`, csv);
});

/* Create documents with DirectoryLoader */
const loader = new DirectoryLoader("csv", {
  ".csv": (path) => new CSVLoader(path)
});
const docs = await loader.load();

console.log(docs);
// ...
```

SheetJS Loader

The `CSVLoader` that ships with LangChain adds only source metadata (file name and line number) to each Document and does not generate any attributes. A custom loader can work around limitations in the CSV tooling and potentially include metadata that has no CSV equivalent.

```mermaid
flowchart LR
  file[(Workbook\nXLSX/XLS)]
  subgraph SheetJS Structures
    wb(((SheetJS\nWorkbook)))
    ws((SheetJS\nWorksheet))
  end
  aoo[(Array of\nObjects)]
  docs[[Documents\nArray]]
  file --> |readFile\n\n| wb
  wb --> |wb.Sheets\nEach worksheet| ws
  ws --> |sheet_to_json\n\n| aoo
  aoo --> |new Document\nEach Row| docs
  linkStyle 0,1,2 color:blue,stroke:blue;
```

The demo `LoadOfSheet` loader will generate one Document per data row across all worksheets. It will also attempt to build metadata and attributes for use in self-querying retrievers.

```js
/* read and parse `data.xlsb` */
const loader = new LoadOfSheet("./data.xlsb");

/* generate documents */
const docs = await loader.load();

/* synthesized attributes for the SelfQueryRetriever */
const attributes = loader.attributes;
```

Sample SheetJS Loader

This example loader pulls data from each worksheet. It assumes each worksheet includes one header row and a number of data rows.

```js
import { Document } from "@langchain/core/documents";
import { BufferLoader } from "langchain/document_loaders/fs/buffer";
import { read, utils } from "xlsx";

/**
 * Document loader that uses SheetJS to load documents.
 *
 * Each worksheet is parsed into an array of row objects using the SheetJS
 * `sheet_to_json` method and projected to a `Document`. Metadata includes
 * original sheet name, row data, and row index
 */
export default class LoadOfSheet extends BufferLoader {
  /** @type {import("langchain/chains/query_constructor").AttributeInfo[]} */
  attributes = [];

  /**
   * Document loader that uses SheetJS to load documents.
   *
   * @param {string|Blob} filePathOrBlob Source Data
   */
  constructor(filePathOrBlob) {
    super(filePathOrBlob);
    this.attributes = [];
  }

  /**
   * Parse document
   *
   * NOTE: column labels in multiple sheets are not disambiguated!
   *
   * @param {Buffer} raw Raw data Buffer
   * @param {Document["metadata"]} metadata Document metadata
   * @returns {Promise<Document[]>} Array of Documents
   */
  async parse(raw, metadata) {
    /** @type {Document[]} */
    const result = [];

    this.attributes = [
      { name: "worksheet", description: "Sheet or Worksheet Name", type: "string" },
      { name: "rowNum", description: "Row index", type: "number" }
    ];

    const wb = read(raw, {type: "buffer", WTF:1});
    for(let name of wb.SheetNames) {
      const fields = {};
      const ws = wb.Sheets[name];
      /* skip a missing worksheet instead of aborting the whole parse */
      if(!ws) continue;

      const aoo = utils.sheet_to_json(ws);
      aoo.forEach((row, idx) => {
        result.push({
          pageContent: "Row " + (idx + 1) + " has the following content: \n" + Object.entries(row).map(kv => `- ${kv[0]}: ${kv[1]}`).join("\n") + "\n",
          metadata: {
            worksheet: name,
            rowNum: row["__rowNum__"],
            ...metadata,
            ...row
          }
        });
        /* record the observed data type(s) of each column */
        Object.entries(row).forEach(([k,v]) => { if(v != null) (fields[k] || (fields[k] = {}))[v instanceof Date ? "date" : typeof v] = true } );
      });
      Object.entries(fields).forEach(([k,v]) => this.attributes.push({
        name: k, description: k, type: Object.keys(v).join(" or ")
      }));
    }

    return result;
  }
};
```

From Text to Binary

Many libraries and platforms offer generic "text" loaders that process files assuming the UTF8 encoding. This corrupts many spreadsheet formats including XLSX, XLSB, XLSM and XLS.
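The corruption can be reproduced with a few bytes and the NodeJS `Buffer` API. The following sketch is not taken from any loader; it decodes the eight-byte signature found at the start of legacy XLS files as UTF8 text and shows that re-encoding does not recover the original bytes.

```javascript
// First eight bytes of a legacy XLS file (the CFB container signature)
const original = Buffer.from([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1]);

// A "text" loader decodes the raw bytes as UTF8 ...
const text = original.toString("utf8");

// ... and invalid byte sequences are replaced with U+FFFD while decoding,
// so encoding the text again yields different bytes.
const roundTrip = Buffer.from(text, "utf8");

console.log(original.equals(roundTrip)); // false
console.log(text.includes("\uFFFD"));    // true
```

Since the replacement is lossy, a parser that receives the decoded text has no way to reconstruct the workbook.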

:::note pass

This issue affects many JavaScript tools. Various demos cover workarounds:

  • ViteJS plugins receive the relative path to the workbook file and can read the file directly.

  • Webpack Plugins have a special option to instruct the library to pass raw binary data rather than text.

:::

The CSVLoader extends a special TextLoader that forces UTF8 text parsing.

There is a separate `BufferLoader` class, used by the PDF loader, that passes the raw data using NodeJS `Buffer` objects.

Binary:

```ts
export class PDFLoader extends BufferLoader {
  // ...
  public async parse(
    raw: Buffer,
    metadata: Document["metadata"]
  ): Promise<Document[]> {
    // ...
  }
  // ...
}
```

Text:

```ts
export class CSVLoader extends TextLoader {
  // ...
  protected async parse(
    raw: string
  ): Promise<string[]> {
    // ...
  }
  // ...
}
```

NodeJS Buffers

The SheetJS `read` method supports NodeJS `Buffer` objects directly[^6]:

```js
import { BufferLoader } from "langchain/document_loaders/fs/buffer";
import { read, utils } from "xlsx";

export default class LoadOfSheet extends BufferLoader {
  // ...
  async parse(raw, metadata) {
    // highlight-next-line
    const wb = read(raw, {type: "buffer"});
    // At this point, `wb` is a SheetJS workbook object
    // ...
  }
}
```

The `read` method returns a SheetJS workbook object[^7].

Generating Content

The SheetJS `sheet_to_json` method[^8] returns an array of data objects whose keys are drawn from the first row of the worksheet.

Array of Objects for the Presidential data spreadsheet:

```js
[
  { Name: "Bill Clinton", Index: 42 },
  { Name: "GeorgeW Bush", Index: 43 },
  { Name: "Barack Obama", Index: 44 },
  { Name: "Donald Trump", Index: 45 },
  { Name: "Joseph Biden", Index: 46 }
]
```
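The header-to-key mapping can be modeled without any library. The sketch below is a simplified stand-in, not the SheetJS implementation: `rows_to_objects` is a hypothetical helper, and it assumes that every cell in the table is populated.

```javascript
// Hypothetical helper: build row objects from an array of arrays,
// using the first row as the list of keys.
// Assumption: every data row has a value for every header.
function rows_to_objects(aoa) {
  const [header, ...rows] = aoa;
  return rows.map(row =>
    Object.fromEntries(header.map((key, i) => [key, row[i]]))
  );
}

const aoa = [
  ["Name", "Index"],        // header row
  ["Bill Clinton", 42],
  ["GeorgeW Bush", 43],
  ["Barack Obama", 44]
];

const aoo = rows_to_objects(aoa);
console.log(aoo[2]); // { Name: 'Barack Obama', Index: 44 }
```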

The original `CSVLoader` wrote one line for each key-value pair. This text can be generated by looping over the keys and values of the data row object. The `Object.entries` helper function simplifies the conversion:

```js
function make_csvloader_doc_from_row_object(row) {
  return Object.entries(row).map(([k,v]) => `${k}: ${v}`).join("\n");
}
```

Generating Documents

The loader must generate row objects for each worksheet in the workbook.

In the SheetJS data model, the workbook object has two relevant fields:

  • SheetNames is an array of sheet names
  • Sheets is an object whose keys are sheet names and values are sheet objects.

A for..of loop can iterate across the worksheets:

```js
  const wb = read(raw, {type: "buffer", WTF:1});
  for(let name of wb.SheetNames) {
    const ws = wb.Sheets[name];
    const aoo = utils.sheet_to_json(ws);
    // at this point, `aoo` is an array of objects
  }
```

This simplified parse function uses the snippet from the previous section:

```js
  async parse(raw, metadata) {
    /* array to hold generated documents */
    const result = [];

    /* read workbook */
    const wb = read(raw, {type: "buffer", WTF:1});

    /* loop over worksheets */
    for(let name of wb.SheetNames) {
      const ws = wb.Sheets[name];
      const aoo = utils.sheet_to_json(ws);

      /* loop over data rows */
      aoo.forEach((row, idx) => {
        /* generate a new document and add to the result array */
        result.push({
          pageContent: Object.entries(row).map(([k,v]) => `${k}: ${v}`).join("\n")
        });
      });
    }

    return result;
  }
```

Metadata and Attributes

It is strongly recommended to generate additional metadata and attributes for self-query retrieval applications.


Metadata

Metadata is attached to each document object. The following example appends the raw row data to the document metadata:

```js
        /* generate a new document and add to the result array */
        result.push({
          pageContent: Object.entries(row).map(([k,v]) => `${k}: ${v}`).join("\n"),
          metadata: {
            worksheet: name, // name of the worksheet
            rowNum: idx,     // data row index
            ...row           // raw row data
          }
        });
```
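One detail of the snippet above deserves care: object spread applies properties from left to right, so a column whose header happens to be `worksheet` or `rowNum` replaces the loader-supplied value. A minimal sketch, using a made-up row with a hypothetical `worksheet` column:

```javascript
const name = "Sheet1"; // worksheet name supplied by the loader
const idx = 0;         // data row index

// Hypothetical row whose header row includes a "worksheet" column
const row = { Name: "Bill Clinton", worksheet: "draft" };

// Same ordering as the snippet above: row data is spread last
const metadata = { worksheet: name, rowNum: idx, ...row };
console.log(metadata.worksheet); // "draft" (the column value wins)

// Spreading the row first keeps the loader-supplied fields intact
const safer = { ...row, worksheet: name, rowNum: idx };
console.log(safer.worksheet);    // "Sheet1"
```

Which ordering is preferable depends on whether the column data or the loader fields should take priority in a given application.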

Attributes

Each attribute object specifies three properties:

  • name corresponds to the field in the document metadata
  • description is a description of the field
  • type is a description of the data type.

While looping through data rows, a simple type check can keep track of the data type for each column:

```js
    for(let name of wb.SheetNames) {
      /* track column types */
      const fields = {};
      // ...
      aoo.forEach((row, idx) => {
        result.push({/* ... */});
        /* Check each property */
        Object.entries(row).forEach(([k,v]) => {
          /* Update fields entry to reflect the new data point */
          if(v != null) (fields[k] || (fields[k] = {}))[v instanceof Date ? "date" : typeof v] = true
        });
      });
      // ...
    }
```

Attributes can be generated after processing the worksheet data. Storing the attributes in a loader property makes them accessible to scripts that use the loader.

```js
export default class LoadOfSheet extends BufferLoader {
  // highlight-next-line
  attributes = [];
  // ...
  async parse(raw, metadata) {
    // Add the worksheet name and row index attributes
    // highlight-start
    this.attributes = [
      { name: "worksheet", description: "Sheet or Worksheet Name", type: "string" },
      { name: "rowNum", description: "Row index", type: "number" }
    ];
    // highlight-end

    const wb = read(raw, {type: "buffer", WTF:1});
    for(let name of wb.SheetNames) {
      // highlight-next-line
      const fields = {};
      // ...
      const aoo = utils.sheet_to_json(ws);
      aoo.forEach((row, idx) => {
        result.push({/* ... */});
        /* Check each property */
        Object.entries(row).forEach(([k,v]) => {
          /* Update fields entry to reflect the new data point */
          if(v != null) (fields[k] || (fields[k] = {}))[v instanceof Date ? "date" : typeof v] = true
        });
      });

      /* Add one attribute per metadata field */
      // highlight-start
      Object.entries(fields).forEach(([k,v]) => this.attributes.push({
        name: k, description: k,
        /* { number: true, string: true } -> "number or string" */
        type: Object.keys(v).join(" or ")
      }));
      // highlight-end
    }
    // ...
  }
  // ...
}
```
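The type tracking can be tried in isolation on plain row objects. The following sketch applies the same check to sample data outside of the loader; the string-valued `Index` and the `Updated` date column are hypothetical values added to show the "or" and "date" cases.

```javascript
// Sample row objects shaped like `sheet_to_json` output.
// Hypothetical data: the string "43" and the `Updated` column are made up.
const rows = [
  { Name: "Bill Clinton", Index: 42 },
  { Name: "GeorgeW Bush", Index: "43" },
  { Name: "Barack Obama", Index: 44, Updated: new Date(2024, 6, 15) }
];

/* track the data types observed in each column */
const fields = {};
for (const row of rows) {
  for (const [k, v] of Object.entries(row)) {
    if (v == null) continue; // ignore null and undefined values
    const type = v instanceof Date ? "date" : typeof v;
    (fields[k] || (fields[k] = {}))[type] = true;
  }
}

/* generate one attribute per column */
const attributes = Object.entries(fields).map(([k, v]) => ({
  name: k,
  description: k,
  type: Object.keys(v).join(" or ")
}));

console.log(attributes);
```

With this input, the `Index` attribute has the type `"number or string"` because both types were observed in the column.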

SheetJS Loader Demo

The demo performs the query "Which rows have over 40 miles per gallon?" against a sample cars dataset and displays the results.

:::note pass

SheetJS team members have tested this demo on Windows 10 and Windows 11 using PowerShell and Ollama for Windows.

SheetJS users have also tested this demo within Windows Subsystem for Linux.

:::

:::caution pass

This demo was tested using the ChatQA-1.5 model[^9] in Ollama[^10].

The tested model used up to 9.2 GB VRAM. It is strongly recommended to run the demo on a newer Apple Silicon Mac or a PC with an NVIDIA GPU with at least 12 GB VRAM. SheetJS users have tested the demo on systems with as little as 6 GB VRAM.

:::

  1. Create a new project:

```bash
mkdir sheetjs-loader
cd sheetjs-loader
npm init -y
```

  2. Download the demo scripts:

  • loadofsheet.mjs
  • query.mjs

```bash
curl -LO https://docs.sheetjs.com/loadofsheet/query.mjs
curl -LO https://docs.sheetjs.com/loadofsheet/loadofsheet.mjs
```

:::note pass

In PowerShell, the command may fail with a parameter error:

```
Invoke-WebRequest : A parameter cannot be found that matches parameter name 'LO'.
```

`curl.exe` must be invoked directly:

```bash
curl.exe -LO https://docs.sheetjs.com/loadofsheet/query.mjs
curl.exe -LO https://docs.sheetjs.com/loadofsheet/loadofsheet.mjs
```

:::

  3. Install the SheetJS NodeJS module:

<CodeBlock language="bash">{`\
npm i --save https://sheet.lol/balls/xlsx-${current}.tgz`}
</CodeBlock>

  4. Install dependencies:

```bash
npm i --save @langchain/community@0.2.18 @langchain/core@0.2.15 langchain@0.2.9 peggy@4.0.3 --force
```

:::note pass

When this demo was last tested, there were error messages relating to dependency and peer dependency versions. The `--force` flag was required.

:::

  5. Download the cars dataset:

```bash
curl -LO https://docs.sheetjs.com/cd.xls
```

:::note pass

In PowerShell, the command may fail with a parameter error:

```
Invoke-WebRequest : A parameter cannot be found that matches parameter name 'LO'.
```

`curl.exe` must be invoked directly:

```bash
curl.exe -LO https://docs.sheetjs.com/cd.xls
```

:::

  6. Install the `llama3-chatqa:8b-v1.5-q8_0` model using Ollama:

```bash
ollama pull llama3-chatqa:8b-v1.5-q8_0
```

:::note pass

If the command cannot be found, install Ollama[^10] and run the command in a new terminal window.

:::

  7. Run the demo script:

```bash
node query.mjs
```

The demo performs the query "Which rows have over 40 miles per gallon?". It will print the following nine results:

```js
{ Name: 'volkswagen rabbit custom diesel', MPG: 43.1 }
{ Name: 'vw rabbit c (diesel)', MPG: 44.3 }
{ Name: 'renault lecar deluxe', MPG: 40.9 }
{ Name: 'honda civic 1500 gl', MPG: 44.6 }
{ Name: 'datsun 210', MPG: 40.8 }
{ Name: 'vw pickup', MPG: 44 }
{ Name: 'mazda glc', MPG: 46.6 }
{ Name: 'vw dasher (diesel)', MPG: 43.4 }
{ Name: 'vw rabbit', MPG: 41.5 }
```

:::caution pass

Some SheetJS users with older GPUs have reported errors.

If the command fails, please try running the script a second time.

:::

To find the expected results:

  • Open the cd.xls spreadsheet in Excel
  • Select Home > Sort & Filter > Filter in the Ribbon
  • Select the filter option for column B (Miles_per_Gallon)
  • In the popup, select "Greater Than" in the Filter dropdown and type 40

The filtered results should match the following screenshot:

:::tip pass

The SheetJS model exposes formulae and other features.

SheetJS Pro builds expose cell styling, images, charts, tables, and other features.

:::

[^1]: See "How to load CSV data" in the LangChain documentation
[^2]: See `readFile` in "Reading Files"
[^3]: See "SheetJS Data Model"
[^4]: See `sheet_to_csv` in "CSV and Text"
[^5]: See "Folders with multiple files" in the LangChain documentation
[^6]: See "Supported Output Formats" type in "Writing Files"
[^7]: See "Workbook Object"
[^8]: See `sheet_to_json` in "Utilities"
[^9]: See the official ChatQA website for the ChatQA paper and other model details.
[^10]: See the official Ollama website for installation instructions.
