Refining data on Instabase Platform Documentation

https://platform.instabase.com/docs/26.04/extract-classify/refiner/index.html

About Refiner
https://platform.instabase.com/docs/26.04/extract-classify/refiner/refiner5/index.html
Table of Contents: Supported extraction forms When to use Refiner Getting started Creating Refiner programs Navigating Refiner Select records to run Refiner functions Extracting text fields Extracting visual fields Using text fields to process visual fields Field execution View options Integrating a completed Refiner program Keyboard shortcuts Fixed structure documents Variable structure documents Advanced extraction Provenance tracking Adding a UDF Troubleshooting What to do when files don’t load? What to do when the page goes blank?

Refiner language grammar
https://platform.instabase.com/docs/26.04/extract-classify/refiner/refiner-grammar/index.html
Refiner functions process data and extract text in the Refiner step of a Flow. Refiner functions are one part of the Refiner grammar that makes up the Instabase Refiner language. The Refiner language grammar requires strict adherence to these syntax rules. In-product documentation and examples are available for each Refiner function.
Uppercase and lowercase: Functions are case-insensitive. INPUT_COL is a case-sensitive reserved keyword.
Operators: The Refiner language supports the following types of operators: Boolean, binary, unary.
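
The case rules in the grammar entry can be illustrated with a small Python sketch. This is not the Refiner parser: the registry, `resolve_function`, and `is_reserved_keyword` are hypothetical names, and the two stand-in functions only approximate the documented behavior of `clean` and `concat`. It shows function names resolving regardless of case while the reserved keyword must match exactly.

```python
import re

# Hypothetical stand-ins for two Refiner functions (approximations only).
def _clean(text):
    # Collapse runs of whitespace to one space, then strip the ends.
    return re.sub(r"\s+", " ", text).strip()

def _concat(*args):
    return "".join(str(a) for a in args)

# Assumed registry keyed by lowercase name.
FUNCTIONS = {"clean": _clean, "concat": _concat}
RESERVED_KEYWORDS = {"INPUT_COL"}

def resolve_function(name):
    """Look up a function case-insensitively; returns None if unknown."""
    return FUNCTIONS.get(name.lower())

def is_reserved_keyword(token):
    """Reserved keywords are matched case-sensitively."""
    return token in RESERVED_KEYWORDS

print(resolve_function("CLEAN")("  ab   cd "))  # ab cd
print(is_reserved_keyword("INPUT_COL"))          # True
print(is_reserved_keyword("input_col"))          # False
```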

Measure accuracy with Target Comparison
https://platform.instabase.com/docs/26.04/extract-classify/refiner/compare-targets/index.html
Enable the Target Comparison feature for a Refiner run to measure text field extraction accuracy as you build and make incremental changes in your Refiner program. As you build out an extraction program in Refiner, you might wonder “How accurately is my extraction program extracting against the labeled data?”
Accuracy and progress metrics: Target Comparison is applied only to eligible mapped target fields that are present in the selected targets file.

Scan Box
https://platform.instabase.com/docs/26.04/extract-classify/refiner/scan-box/index.html
Table of Contents: How to use Scan Box Example of basic usage Accepted arguments pixel_tolerance exclude_label_line Enable line detection and OCR Config settings
The scan_box Refiner function extracts text from a rectangle (box) in the image domain based on a label within that rectangle.
How to use Scan Box: In the image above, you can use scan_box to extract the employer’s name and address from the rectangle by using the label 'c Employer\'s name'.

TokenMatchers and Tokenizers (Legacy)
https://platform.instabase.com/docs/26.04/extract-classify/refiner/tokenmatchers-and-tokenizers/index.html
Table of Contents: TokenMatcher and Tokenizer usage Available TokenMatchers Available Tokenizers Creating custom TokenMatchers and Tokenizers
You can use TokenMatchers and Tokenizers with some Refiner functions.
Tokenizers provide a way to break text into multiple pieces, while TokenMatchers provide a way to score and clean each piece of text using knowledge of the semantic category it belongs to. For example, without context, it is difficult for a computer to interpret the value 1o Novembr 200B, but knowing that this value is supposed to be a date makes the intended reading clear: 10 November 2008.

Provenance Tracking
https://platform.instabase.com/docs/26.04/extract-classify/refiner/provenance-tracking/index.html
Table of Contents: What is provenance tracking Writing a provenance-tracked UDF Compatibility with previously written UDFs instabase.provenance.tracking: Provenance APIs instabase.provenance.tracking.Value class tracker set_tracker image_tracker set_image_tracker value get_copy freeze_tracker instabase.provenance.tracking.ProvenanceTracker class convert_to_informational insert_information_from deepcopy instabase.provenance.tracking.ImageProvenanceTracker class string Value objects substring length delete concatenate replace lstrip, rstrip, strip insert split join Regex-based helper functions regex_search regex_findall regex_finditer regex_sub regex_split TrackedMatchProxy Collection Value objects Modifying provenance-tracked values Advanced provenance tracking Freezing Accessing provenance information Auto provenance tracking Switching between OCR and INPUT_COL domain Troubleshooting provenance tracking The return value of my UDF shows different provenance information than I would expect.
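
The date example in the TokenMatchers and Tokenizers entry (1o Novembr 200B read as 10 November 2008) can be sketched in plain Python. This is an illustration of the idea only, not the Instabase TokenMatcher API: `tokenize`, `clean_date_token`, `clean_date`, and the table of OCR confusions are all assumptions made for the example. The tokenizer splits on whitespace, and the matcher uses the knowledge that the value is a date to repair each piece.

```python
import difflib

MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]

# Assumed OCR confusions where a letter was read in place of a digit.
DIGIT_FIXES = str.maketrans(
    {"o": "0", "O": "0", "l": "1", "I": "1", "B": "8", "S": "5"}
)

def tokenize(text):
    """Tokenizer step: break the text into whitespace-separated pieces."""
    return text.split()

def clean_date_token(token):
    """Matcher step: repair one piece, knowing it belongs to a date."""
    if any(ch.isdigit() for ch in token):
        # Day or year: map letter look-alikes back to digits.
        return token.translate(DIGIT_FIXES)
    # Otherwise treat the piece as a month name and pick the closest match.
    matches = difflib.get_close_matches(token.capitalize(), MONTHS, n=1, cutoff=0.6)
    return matches[0] if matches else token

def clean_date(text):
    return " ".join(clean_date_token(t) for t in tokenize(text))

print(clean_date("1o Novembr 200B"))  # 10 November 2008
```

The same string cleaned without the date assumption would have no reason to turn `o` into `0`, which is the point the entry makes about semantic context.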

Provenance Tracking - Extracted Tables
https://platform.instabase.com/docs/26.04/extract-classify/refiner/provenance-tracking-extracted-tables/index.html
Table of Contents: About Extracted Tables instabase.provenance.table.ExtractedTablesList class Accessing the list Iterating through the list instabase.provenance.table.ExtractedTable class Getting the dimensions of a table Getting a table cell’s value Slicing the table Iterating through the cells of a table Combining tables together Adding rows and columns to a table Getting a copy of a table instabase.provenance.table.ExtractedTableCell class Getting a copy of a cell
About Extracted Tables: Extracted tables are tables extracted by table extraction models.

Model confidence metrics
https://platform.instabase.com/docs/26.04/extract-classify/refiner/model-confidence/index.html
Model confidence metrics in Refiner functions and Refiner UDFs indicate how confident the system is about the information it extracts. Understanding the model’s confidence in its predictions can help you prioritize refining data sources, or take measures to validate output that doesn’t meet required confidence levels. The Value object includes integrated model confidence metrics.
Data types: WordConfidence – A dictionary type capturing the actual word (val) and its corresponding model confidence (confidence).
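
As a rough sketch of how a WordConfidence-shaped dictionary might be used to flag output for validation, the Python below models the two documented keys (`val` and `confidence`) with a `TypedDict`. The 0-to-1 confidence scale, the `words_below` helper, and the sample data are assumptions for illustration; they are not taken from the Instabase API.

```python
from typing import List, TypedDict

class WordConfidence(TypedDict):
    val: str           # the extracted word
    confidence: float  # model confidence; a 0..1 scale is assumed here

def words_below(words: List[WordConfidence], threshold: float) -> List[str]:
    """Return the words whose confidence falls below the threshold."""
    return [w["val"] for w in words if w["confidence"] < threshold]

# Made-up sample values, for illustration only.
words: List[WordConfidence] = [
    {"val": "Invoice", "confidence": 0.98},
    {"val": "t0tal", "confidence": 0.41},
    {"val": "2008", "confidence": 0.87},
]

print(words_below(words, 0.9))  # ['t0tal', '2008']
```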

Confidence functions
https://platform.instabase.com/docs/26.04/extract-classify/refiner/confidence-functions/index.html
extracted_is_sure(value_obj)
Returns true if the given Value does not contain characters that are marked as “unsure” by OCR within the original text.
Args:
  value_obj (Value): The provenance-tracked result from some extraction process.
Returns:
  True if the given Value does not contain characters that are marked as "unsure" by OCR within the original text, and False otherwise (as a Value-wrapped object).
Examples:
  extracted_is_sure(value) -> Value(True)
extracted_sureness_above
extracted_sureness_above(value_obj, percentage)
Returns True if the given Value contains a percentage of

Excel functions
https://platform.instabase.com/docs/26.04/extract-classify/refiner/excel-functions/index.html
get_cell_from_sheet_index
get_cell_from_sheet_index(record, sheet_index, row_index, col_index)
Returns the dict representation of a cell from an Excel sheet referenced by sheet index
Args:
  record (IBOCRRecord): original IBOCRRecord passed in using INPUT_IBOCR_RECORD
  sheet_index (int): index of the desired sheet in the record
  row_index (int): row index of the cell in the sheet
  col_index (int): column index of the cell in the sheet
Returns:
  Returns a dictionary representation of the cell in the format {'type': <cell_type>, 'value': <value>}.

List functions
https://platform.instabase.com/docs/26.04/extract-classify/refiner/list-functions/index.html
filter
filter(input_list, fn, tolerate_errors=false)
Filter a list based on whether a refiner function outputs true or false for each value.
Args:
  input_list (list): list or list json-encoded as a string
  fn (str): name of function to be used for mapping.
  tolerate_errors (bool): should errors be allowed (with elements causing errors filtered out)?
Returns:
  Returns a list of the filtered elements
Examples:
  filter([' a', ' b '], 'contains(x, \'a\')') -> [' a']
first
first(input_list)

Logical functions
https://platform.instabase.com/docs/26.04/extract-classify/refiner/logical-functions/index.html
col_index_from_letters
col_index_from_letters(letter_name)
Translates a letter-style column name (commonly seen in Excel) to its equivalent numerical index value.
Args:
  letter_name (str): the letter-style column name. Should be a string consisting of letters A-Z without spaces.
Returns:
  An integer value
Examples:
  col_index_from_letters('A') -> 1
  col_index_from_letters('AA') -> 27
equals
equals(val1, val2)
Returns true if val1 == val2
Args:
  val1: The first value to compare
  val2: The second value to compare
Returns:
  True if val1 == val2, and false otherwise

Map functions
https://platform.instabase.com/docs/26.04/extract-classify/refiner/map-functions/index.html
map_copy
map_copy(input_map)
Returns a deep copy of the map
Args:
  input_map (dict): the input map
Returns:
  Returns a deep copy of the map.
All map values must be primitive types
Examples:
  map_copy({'key1': 'val1', 'key2': 'val2'}) -> {'key1': 'val1', 'key2': 'val2'}
map_create
map_create(list_of_tuples)
Creates a map given a list of 2-tuples, where each tuple is [key, val]
Args:
  list_of_tuples (list): the list of 2-tuple key-value pairs
Returns:
  Returns the created map
Examples:
  map_create(list()) -> {}
  map_create([['key1', 'val1'], ['key2', 'val2']]) -> {'key1': 'val1', 'key2': 'val2'}
map_delete_key
map_delete_key(input_map, key)

NLP functions
https://platform.instabase.com/docs/26.04/extract-classify/refiner/nlp-functions/index.html
nlp_get_entities
nlp_get_entities(text, label=None)
Extracts entities from natural language text.
Args:
  text (str): the text of interest
  label (str): filters for a specific kind of entity, such as PERSON or ORG. Defaults to None, which gets all entity types.
Returns:
  Returns a dictionary containing entities extracted from the text
Examples:
  nlp_get_entities('The Massachusetts Institute of Technology is a private research university in Cambridge, Massachusetts, United States.') -> {
    'entities': [
      {'char_pos': {'end': 41, 'start': 0}, 'entity': u'The Massachusetts Institute of Technology', 'label': u'ORG', 'word_pos': {'end': 5, 'start': 0}},
      {'char_pos': {'end': 87, 'start': 78}, 'entity': u'Cambridge', 'label': u'GPE', 'word_pos': {'end': 12, 'start': 11}},
      {'char_pos': {'end': 102, 'start': 89}, 'entity': u'Massachusetts', 'label': u'GPE', 'word_pos': {'end': 14, 'start': 13}},
      {'char_pos': {'end': 117, 'start': 104}, 'entity': u'United States', 'label': u'GPE', 'word_pos': {'end': 17, 'start': 15}}
    ],
    'status': 'OK'
  }
nlp_token_clean
nlp_token_clean(text, model=None, model_config=None)

Numerical functions
https://platform.instabase.com/docs/26.04/extract-classify/refiner/numerical-functions/index.html
abs
abs(val)
Get the absolute value of a given number
Args:
  val (str/int/float): a number, a number as a single-quoted string, or a field name without quotes
Returns:
  (float): The absolute value of the number
Examples:
  abs(-2) -> 2.0
  abs(INPUT_COL) -> 2.0
ceil
ceil(val)
Get the smallest integer greater than or equal to the given number
Args:
  val (str/int/float): a number, a number as a single-quoted string, or a field name without quotes
Returns:
  (float): The smallest integer greater than or equal to the given number
Examples:
  ceil(2.

OCR functions
https://platform.instabase.com/docs/26.04/extract-classify/refiner/ocr-functions/index.html
get_ocr_confidence
get_ocr_confidence(ibocr, skip_missing_confidence_scores=false)
Get OCR confidences associated with provided input
Args:
  ibocr (IBOCRRecord): The INPUT_IBOCR_RECORD to get OCR Confidence of
  skip_missing_confidence_scores (bool): if true, don't raise an error if no word/char-level confidence scores are found for a particular word
Returns:
  Returns the average confidence associated with the input. Confidence is reported as a percentage (70 means 70%)
Examples:
  get_ocr_confidence(INPUT_IBOCR_RECORD) -> 70.2
  get_ocr_confidence(INPUT_IBOCR_RECORD, skip_missing_confidence_scores=true) -> 70.2
is_ocr_required
is_ocr_required(ibocr)
Returns whether or not the record was generated using OCR.
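
The averaging and skip-missing behavior described for get_ocr_confidence can be sketched in Python as follows. This is not the Instabase implementation and does not read an IBOCRRecord: `average_confidence` is a hypothetical helper that takes per-word scores already expressed as percentages, with `None` standing in for a word that has no score.

```python
from typing import Iterable, List, Optional

def average_confidence(
    scores: Iterable[Optional[float]], skip_missing: bool = False
) -> float:
    """Average per-word confidence percentages.

    A None entry represents a word with no confidence score. It raises
    an error unless skip_missing is true, in which case it is ignored.
    """
    present: List[float] = []
    for score in scores:
        if score is None:
            if not skip_missing:
                raise ValueError("missing confidence score for a word")
            continue
        present.append(score)
    if not present:
        raise ValueError("no confidence scores available")
    return sum(present) / len(present)

print(average_confidence([80.0, 60.0]))                           # 70.0
print(average_confidence([80.0, None, 60.0], skip_missing=True))  # 70.0
```

Raising by default and skipping only on request mirrors the documented flag: a missing score is treated as a problem unless the caller explicitly tolerates it.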

Parsing functions
https://platform.instabase.com/docs/26.04/extract-classify/refiner/parsing-functions/index.html
left_pos
left_pos(text, label=None, label_any=None, e=0, ignorecase=false, default=None)
Finds leftmost character position of a word in a given text
Args:
  text (str): original text
  label (str, optional): string whose leftmost character will be used for determining left position
  label_any (List<str>, optional): will search for each label in order and return the position of the first matching label.
  e (int, optional): number of errors allowed in the match.
  ignorecase (bool, optional): Whether casing should be ignored.

Path functions
https://platform.instabase.com/docs/26.04/extract-classify/refiner/path-functions/index.html
filename
filename(path)
Return the filename in a path
Args:
  path (Text): The path to the file or directory
Returns:
  Returns the base name of a path
Examples:
  filename('/foo/bar/baz.pdf') -> 'baz.pdf'
  filename('/foo/bar/baz') -> 'baz'
  filename('/foo/bar/') -> ''

PDF functions
https://platform.instabase.com/docs/26.04/extract-classify/refiner/pdf-functions/index.html
get_pdf_fonts
get_pdf_fonts(ibocr)
Get PDF Fonts associated with provided input
NOTE: The flavour of the function that takes INPUT_IBOCR will be deprecated after September 30th 2019. Please use INPUT_IBOCR_RECORD.
Args:
  ibocr (Union[IBOCRRecordDict, IBOCRRecord]): Could be either a:
    - Dictionary with info about one ibocr record
    - The IBOCRRecord itself
Returns:
  Returns pdf fonts used across this entire document
Examples:
  get_pdf_fonts(INPUT_IBOCR) -> [{'name': 'TimesNewRoman', 'type': 'Type1', 'encoding': 'PDFEncoding'}]
  get_pdf_fonts(INPUT_IBOCR_RECORD) -> [{'name': 'TimesNewRoman', 'type': 'Type1', 'encoding': 'PDFEncoding'}]
get_pdf_metadata
get_pdf_metadata(ibocr, field_name)

Provenance functions
https://platform.instabase.com/docs/26.04/extract-classify/refiner/provenance-functions/index.html
freeze
freeze(val)
Freezes the tracker so that no tracker operations can affect this particular tracker.
Args:
  val (any): any provenance-tracked (i.e. value-wrapped) Value
Returns:
  the same value with the provenance frozen
Examples:
  my_udf(freeze(field_1)) -> the output of your udf with the same provenance as field_1
provenance_get
provenance_get(val)
Get provenance information for a provenance-tracked value. Returns a dictionary of provenance information for the given Value.
Args:
  val (any): any provenance-tracked (i.e.
value-wrapped) Value
Returns:
  a dictionary of provenance information for the given value
Examples:
  provenance_get(field_1)

String functions
https://platform.instabase.com/docs/26.04/extract-classify/refiner/string-functions/index.html
clean
clean(text, strip=true, match_pattern=None, replacement=None)
Removes extra whitespace from a string
Args:
  text (str): string to be cleaned
  strip (bool, optional): strips leading and trailing space on the cleaned up text
  match_pattern (str, optional): override default match pattern ('\s+')
  replacement (str, optional): override default replacement pattern (' ')
Returns:
  Returns a string with trimmed whitespace
Examples:
  clean(' ab cd e ') -> 'ab cd e'
concat
concat(*args: Any)
Concatenate strings from fields or raw values
Args:
  *args: Variable length argument list containing strings (with single quotes) or field names (without quotes)
Returns:
  Returns a concatenated string
Examples:
  concat('hello ', 'world') -> 'hello world'
contains
contains(text, q)

Table functions
https://platform.instabase.com/docs/26.04/extract-classify/refiner/table-functions/index.html
merge_tables
merge_tables(table_list, as_row=true)
Not-provenance-tracked version of merge_tables_fn_v
table_get_range
table_get_range(table, row_range, col_range)
Not-provenance-tracked version of table_get_range
table_list_get
table_list_get(table_list, i)
Not-provenance-tracked version of table_list_get_fn_v

Validation functions
https://platform.instabase.com/docs/26.04/extract-classify/refiner/validation-functions/index.html
assert_not_blurry
assert_not_blurry(input_val)
Determines whether the
input image is not blurry. To use this function, enable the flag in the Process Files step by setting "detect_blurry_files" to true in the OCR Config box.
Args:
  input_val (dict): Any IBOCR Record dictionary.
Returns:
  True if validation is successful. Otherwise, throws an Exception and reports the blur factor.
assert_not_empty
assert_not_empty(input_val)
Determines whether the input value exists. If the input is a string, validates that the string has a length > 0.

Visual functions
https://platform.instabase.com/docs/26.04/extract-classify/refiner/visual-functions/index.html
clean_signature
clean_signature(image)
Gets an image patch of the input value as a base64-encoded string.
Args:
  image (Value[Text]): A provenance-tracked value of extracted image to clean
Returns:
  Returns a base64-encoded string of the extracted and cleaned image patch
Examples:
  clean_signature(list_get(match(INPUT_COL, '[SIGN]'), 0)) -> '/9j/4AAQSkZ...'
detect_checkbox
detect_checkbox(anchors, relative_positions)
See docs for detect_checkbox_fn_v
detect_signature
detect_signature(anchors, relative_positions)
See docs for detect_signature_fn_v
extract_image_crop
extract_image_crop(value)
Gets an image patch of the input value as a base64-encoded string.

Text extraction with Refiner
https://platform.instabase.com/docs/26.04/extract-classify/refiner/refiner-text-extraction/index.html
Table of Contents: Prerequisites 1. Setting up your workspace 2. A brief UI tour A paystub sheet overview 3. Scan functions Scan function gotchas Scan below 4. Refiner - a mental model Cutting up our paystubs 5.
Refiner outputs From formulas to CSV Conclusion
Instabase’s Refiner App helps you “refine” (or “extract”) specific data from similar documents. Instabase helps you create formulas to extract fields in a set of similar documents.

Visual extraction in Refiner
https://platform.instabase.com/docs/26.04/extract-classify/refiner/refiner-visual-extraction/index.html
Table of Contents: 1. Setting up your workspace The Refiner project template 2. The Refiner UI Navigating the .ibrefiner view 3. Defining anchors Discovering repeated text 4. Image extraction Checkbox extraction 5. Decoding an image Discovering a checkbox output via image functions Discovering a checkbox output via text functions 6. On your own - extracting a signature Conclusion
This feature of Refiner will help you extract images from portions of similar documents and detect their contents, such as if a checkbox is checked or a signature line is signed, even if these regions aren’t always in the same locations across multiple documents.