Name		Name	Last commit message	Last commit date
parent directory ..
auth_init.ps1		auth_init.ps1
auth_init.py		auth_init.py
auth_init.sh		auth_init.sh
auth_update.ps1		auth_update.ps1
auth_update.py		auth_update.py
auth_update.sh		auth_update.sh
config.json		config.json
data_preparation.py		data_preparation.py
data_utils.py		data_utils.py
loadenv.ps1		loadenv.ps1
loadenv.sh		loadenv.sh
prepdocs.ps1		prepdocs.ps1
prepdocs.py		prepdocs.py
prepdocs.sh		prepdocs.sh
readme.md		readme.md

readme.md

Data Preparation

Setup

Install the necessary packages listed in requirements.txt, e.g. pip install --user -r requirements.txt

Configure

Create a config file like config.json. The format should be a list of JSON objects, with each object specifying a configuration of local data path and target search service and index.

[
    {
        "data_path": "<path to data>",
        "location": "<azure region, e.g. 'westus2'>", 
        "subscription_id": "<subscription id>",
        "resource_group": "<resource group name>",
        "search_service_name": "<search service name to use or create>",
        "index_name": "<index name to use or create>",
        "chunk_size": 1024, // set to null to disable chunking before ingestion
        "token_overlap": 128 // number of tokens to overlap between chunks
        "semantic_config_name": "default",
        "language": "en" // setting to set language of your documents. Change if your documents are not in English. Look in data_preparation.py for SUPPORTED_LANGUAGE_CODES,
        "vector_config_name": "default" // used if adding vectors to index
    }
]

Create Indexes and Ingest Data

Disclaimer: Make sure there are no duplicate pages in your data. That could impact the quality of the responses you get in a negative way.

Run the data preparation script, passing in your config file. You can set njobs for parallel parsing of your files.

python data_preparation.py --config config.json --njobs=4

Optional: Add vector embeddings

Azure Cognitive Search supports vector search in public preview. See the docs for more information.

To add vectors to your index, you will first need an Azure OpenAI resource with an Ada embedding model deployment. The text-embedding-ada-002 model is supported.

Get the endpoint for embedding model deployment. The endpoint will generally be of the format https://<azure openai resource name>.openai.azure.com/openai/deployments/<ada deployment name>/embeddings?api-version=2023-06-01-preview.
Run the data preparation script, passing in your config file and the embedding endpoint and key as extra arguments:
```
`python data_preparation.py --config config.json --embedding-model-endpoint "<embedding endpoint>"`
```

Optional: Crack PDFs to Text

If your data is in PDF format, you'll first need to convert from PDF to .txt format. You can use your own script for this, or use the provided conversion code here.

Setup for PDF Cracking

Create a Form Recognizer resource in your subscription
Make sure you have the Form Recognizer SDK: pip install azure-ai-formrecognizer
Run the following command to get an access key for your Form Recognizer resource: az cognitiveservices account keys list --name "<form-rec-resource-name>" --resource-group "<resource-group-name>"

Copy one of the keys returned by this command.

Create Indexes and Ingest Data from PDF with Form Recognizer

Pass in your Form Recognizer resource name and key when running the data preparation script:

python data_preparation.py --config config.json --njobs=4 --form-rec-resource <form-rec-resource-name> --form-rec-key <form-rec-key>

This will use the Form Recognizer Read model by default.

If your documents have a lot of tables and relevant layout information, you can use the Form Recognizer Layout model, which is more costly and slower to run but will preserve table information with better quality. The Layout model will also help preserve some of the formatting information in your document such as titles and sub-headings, which will make the citations more readable. To use the Layout model instead of the default Read model, pass in the argument --form-rec-use-layout.

python data_preparation.py --config config.json --njobs=4 --form-rec-resource <form-rec-resource-name> --form-rec-key <form-rec-key> --form-rec-use-layout

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scripts

scripts

readme.md

Data Preparation

Setup

Configure

Create Indexes and Ingest Data

Optional: Add vector embeddings

Optional: Crack PDFs to Text

Setup for PDF Cracking

Create Indexes and Ingest Data from PDF with Form Recognizer

Files

scripts

Directory actions

More options

Directory actions

More options

Latest commit

History

scripts

Folders and files

parent directory

readme.md

Data Preparation

Setup

Configure

Create Indexes and Ingest Data

Optional: Add vector embeddings

Optional: Crack PDFs to Text

Setup for PDF Cracking

Create Indexes and Ingest Data from PDF with Form Recognizer