Documentation for `ToMeDa`

tomeda.t11_create_dataverse_tsv_from_mapping

This module takes a metadata schema as JSON. From that schema it derives a Dataverse TSV file and a mapping table, and writes both as output.

Input

JSON schema file (no data content, just structure)

Output

2 files:
    1. dataverse tsv file with all keys that are not in citation.tsv
    2. mapping table for all keys that are not in citation.tsv with depth 2
       at most (schema<->dataverse)

logger `module-attribute`

logger: TraceLogger = getLogger(__name__)

types `module-attribute`

types = []

InfoElement `dataclass`

InfoElement(
    name: str,
    title: str,
    description: str,
    allow_multiples: bool,
    required: bool,
    type: str,
    dataverse_name: str,
    displayFormat: str | None = None,
    controlledVocabulary: list[str] | None = None,
)

allow_multiples `instance-attribute`

allow_multiples: bool

controlledVocabulary `class-attribute` `instance-attribute`

controlledVocabulary: list[str] | None = None

dataverse_name `instance-attribute`

dataverse_name: str

description `instance-attribute`

description: str

displayFormat `class-attribute` `instance-attribute`

displayFormat: str | None = None

name `instance-attribute`

name: str

required `instance-attribute`

required: bool

title `instance-attribute`

title: str

type `instance-attribute`

type: str

NotLeafError

NotLeafError(message='Not a leaf Error')

Bases: Exception

Source code in tomeda/t11_create_dataverse_tsv_from_mapping.py

def __init__(self, message="Not a leaf Error"):
    super().__init__(message)
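
As a minimal sketch of how the default message behaves, the class can be re-declared locally and raised with and without an explicit message. The `describe` helper and the example message are hypothetical and exist only for illustration.

```python
class NotLeafError(Exception):
    """Local copy of the exception shown above, for a standalone demo."""

    def __init__(self, message="Not a leaf Error"):
        super().__init__(message)


def describe(custom_message=None) -> str:
    """Raise NotLeafError and return the text a handler would see."""
    try:
        if custom_message is None:
            raise NotLeafError()  # falls back to the default message
        raise NotLeafError(custom_message)
    except NotLeafError as err:
        return str(err)


default_text = describe()
custom_text = describe("Parent Eng_contact is not empty.")
```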

assemble_dataset_field_line

assemble_dataset_field_line(
    line_information: InfoElement, display_order: int
) -> DatasetFieldLine

Create a DatasetFieldLine instance using provided line information and order.

Done for https://guides.dataverse.org/en/latest/admin/metadatacustomization.html

Parameters:

Name	Type	Description	Default
`line_information`	`InfoElement`	InfoElement instance describing the line.	required
`display_order`	`int`	Integer representing the order of display.	required

Returns:

Type	Description
`DatasetFieldLine`	DatasetFieldLine instance.

Source code in tomeda/t11_create_dataverse_tsv_from_mapping.py

def assemble_dataset_field_line(
    line_information: InfoElement, display_order: int
) -> DatasetFieldLine:
    """
    Create a DatasetFieldLine instance using provided line information and
    order.

    Done for
    https://guides.dataverse.org/en/latest/admin/metadatacustomization.html

    Parameters
    ----------
    line_information : InfoElement
        InfoElement instance describing the line.
    display_order : int
        Integer representing the order of display.

    Returns
    -------
    DatasetFieldLine
        DatasetFieldLine instance.
    """
    global types

    name = line_information.dataverse_name
    title = line_information.title
    description = line_information.description
    field_type = line_information.type
    types = list(set(types + [field_type]))

    to_text = ["str", "bool"]
    if field_type in to_text:
        field_type = "text"
    to_int = ["int", "PositiveInt", "PositiveSmallInt", "SmallInt", "ByteSize"]
    if field_type in to_int:
        field_type = "int"
    # those are the accepted types in dataverse
    accepted_types = ["date", "email", "text", "textbox", "url", "int", "float"]
    if field_type not in accepted_types:
        field_type = "none"

    display_format = (
        line_information.displayFormat
        if line_information.displayFormat is not None
        else "#VALUE"
    )  # ToDo: Fix
    allow_multiples = line_information.allow_multiples
    required = line_information.required

    parent = (
        (
            "Eng_" + line_information.name.split(".")[0]
        )  # ToDo: Fix the prefix generation 'Eng'
        if len(line_information.name.split(".")) > 1
        else ""
    )
    if parent and field_type in ["", "none"]:
        raise NotLeafError(
            f"Parent {parent} is not empty but field type is empty. "
            f"This is not allowed."
        )

    line = DatasetFieldLine(
        prefix="",
        name=name,
        title=title,
        description=description,
        watermark="",
        field_type=field_type,
        display_order=display_order,
        display_format=display_format,
        advanced_search_field=False,
        allow_controlled_vocabulary=False,
        allow_multiples=allow_multiples,
        facetable=False,
        display_on_create=True,
        required=required,
        parent=parent,
        metadata_block_id="engmeta",
        term_uri="",
    )
    return line
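
The type normalisation and parent derivation in the function above can be isolated into two small helpers. This is a sketch, not part of ToMeDa: the names `normalise_field_type` and `derive_parent` are hypothetical, and only the type lists and the `Eng_` prefix are taken from the source.

```python
# Type lists copied from the source above.
TO_TEXT = {"str", "bool"}
TO_INT = {"int", "PositiveInt", "PositiveSmallInt", "SmallInt", "ByteSize"}
ACCEPTED_TYPES = {"date", "email", "text", "textbox", "url", "int", "float"}


def normalise_field_type(field_type: str) -> str:
    """Map a schema type name to a Dataverse field type, or "none"."""
    if field_type in TO_TEXT:
        return "text"
    if field_type in TO_INT:
        return "int"
    # Anything Dataverse does not accept is marked as "none".
    return field_type if field_type in ACCEPTED_TYPES else "none"


def derive_parent(name: str, prefix: str = "Eng_") -> str:
    """Return the prefixed first segment of a dotted name, else ""."""
    parts = name.split(".")
    return prefix + parts[0] if len(parts) > 1 else ""
```

A nested name whose type normalises to `"none"` is exactly the case in which the original function raises `NotLeafError`.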

create_InfoElement_structure

create_InfoElement_structure(
    entry_description_raw: dict[str, dict[str, str]]
) -> dict[str, InfoElement]

Create a dictionary with InfoElement instances.

Source code in tomeda/t11_create_dataverse_tsv_from_mapping.py

def create_InfoElement_structure(
    entry_description_raw: dict[str, dict[str, str]]
) -> dict[str, InfoElement]:
    """
    Create a dictionary with InfoElement instances.
    """
    entry_description: dict[str, InfoElement] = {}
    for key, value in entry_description_raw.items():
        entry_description[key] = InfoElement(
            name=value["name"],
            title=value["title"],
            description=value["description"],
            allow_multiples=value["allow_multiples"] == "True",
            required=value["required"] == "True",
            type=value["type"],
            controlledVocabulary=value.get("controlledVocabulary"),
            dataverse_name=value["dataverse_name"],
            displayFormat=value.get("displayFormat"),
        )
    return entry_description
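
Note that the two boolean fields are parsed by comparing against the exact string `"True"`, so any other spelling (for example `"true"`) becomes `False`. The sketch below shows this with a local stand-in for `InfoElement` that mirrors the documented fields. The raw entry and the `to_info_element` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InfoElement:
    """Local stand-in mirroring the documented InfoElement fields."""

    name: str
    title: str
    description: str
    allow_multiples: bool
    required: bool
    type: str
    dataverse_name: str
    displayFormat: Optional[str] = None
    controlledVocabulary: Optional[list] = None


def to_info_element(value: dict) -> InfoElement:
    """Convert one raw string-valued entry, as the function above does."""
    return InfoElement(
        name=value["name"],
        title=value["title"],
        description=value["description"],
        allow_multiples=value["allow_multiples"] == "True",
        required=value["required"] == "True",
        type=value["type"],
        controlledVocabulary=value.get("controlledVocabulary"),
        dataverse_name=value["dataverse_name"],
        displayFormat=value.get("displayFormat"),
    )


# Hypothetical raw entry; optional keys are left out on purpose.
raw_entry = {
    "name": "contact.email",
    "title": "Email",
    "description": "Contact email address",
    "allow_multiples": "True",
    "required": "true",  # lower-case, so it is parsed as False
    "type": "email",
    "dataverse_name": "Eng_contact_email",
}
element = to_info_element(raw_entry)
```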

generate_controlled_vocabulary_block

generate_controlled_vocabulary_block(
    entry_description: dict[str, InfoElement]
) -> ControlledVocabulary

Generate a controlled vocabulary block.

Parameters:

Name	Type	Description	Default
`entry_description`	`dict[str, InfoElement]`	Description of the entries.	required

Returns:

Type	Description
`ControlledVocabulary`	ControlledVocabulary instance.

Source code in tomeda/t11_create_dataverse_tsv_from_mapping.py

def generate_controlled_vocabulary_block(
    entry_description: dict[str, InfoElement],
) -> ControlledVocabulary:
    """
    Generate a controlled vocabulary block.

    Parameters
    ----------
    entry_description : dict[str, InfoElement]
        Description of the entries.

    Returns
    -------
    ControlledVocabulary
        ControlledVocabulary instance.
    """
    tsv_lines: list[ControlledVocabularyLine] = []

    for name, value in entry_description.items():
        if controlled_vocab := value.controlledVocabulary:
            related_dataset_field = value.dataverse_name
            for display_order, vocab in enumerate(controlled_vocab):
                tsv_lines.append(
                    ControlledVocabularyLine(
                        prefix="",
                        related_dataset_field=related_dataset_field,
                        value=vocab,
                        display_order=display_order,
                        identifier="",
                    )
                )

    return ControlledVocabulary(body=tsv_lines)
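
The loop above emits one line per vocabulary term, numbering the terms from 0 within each field. A minimal sketch with plain dictionaries in place of `ControlledVocabularyLine` (the `vocabulary_rows` helper and the sample data are hypothetical):

```python
def vocabulary_rows(vocab_by_field: dict) -> list:
    """Build one row per vocabulary term, ordered within each field."""
    rows = []
    for field_name, vocab in vocab_by_field.items():
        if not vocab:  # None or an empty list: nothing to emit
            continue
        for display_order, term in enumerate(vocab):
            rows.append(
                {
                    "related_dataset_field": field_name,
                    "value": term,
                    "display_order": display_order,
                }
            )
    return rows


rows = vocabulary_rows(
    {"Eng_license": ["MIT", "GPL"], "Eng_title": None, "Eng_mode": ["fast"]}
)
```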

generate_dataset_field_block

generate_dataset_field_block(
    auto_mapping: dict | list,
    entry_description: dict[str, InfoElement],
) -> DatasetField

Generate a dataset field block.

Parameters:

Name	Type	Description	Default
`auto_mapping`	`dict or list`	Auto-generated mapping.	required
`entry_description`	`dict[str, InfoElement]`	Description of the entries.	required

Returns:

Type	Description
`DatasetField`	DatasetField instance.

Source code in tomeda/t11_create_dataverse_tsv_from_mapping.py

def generate_dataset_field_block(
    auto_mapping: dict | list, entry_description: dict[str, InfoElement]
) -> DatasetField:
    """
    Generate a dataset field block.

    Parameters
    ----------
    auto_mapping : dict or list
        Auto-generated mapping.
    entry_description : dict[str, InfoElement]
        Description of the entries.

    Returns
    -------
    DatasetField
        DatasetField instance.
    """
    tsv_lines: list[DatasetFieldLine] = []

    for display_order, (key, line_information) in enumerate(
        entry_description.items()
    ):
        if line_information.dataverse_name not in auto_mapping:
            continue
        try:
            line: DatasetFieldLine = assemble_dataset_field_line(
                line_information,
                display_order=display_order,
            )
        except NotLeafError:
            logger.warning(
                "Skipping %s because it is not a leaf.",
                line_information.name,
            )
            continue
        tsv_lines.append(line)

    logger.debug(types)

    return DatasetField(body=tsv_lines)
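
Because `display_order` comes from enumerating all entries before the filter is applied, skipped entries leave gaps in the resulting order values. The sketch below reproduces only that filtering pattern. The `select_with_order` helper and the sample names are hypothetical.

```python
def select_with_order(entries: dict, auto_mapping) -> list:
    """Keep entries whose Dataverse name is mapped, with their position."""
    selected = []
    for display_order, (key, dataverse_name) in enumerate(entries.items()):
        if dataverse_name not in auto_mapping:
            continue  # the counter still advances for skipped entries
        selected.append((dataverse_name, display_order))
    return selected


entries = {"a": "Eng_a", "b": "Eng_b", "c": "Eng_c"}
selected = select_with_order(entries, auto_mapping=["Eng_a", "Eng_c"])
```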

generate_metadata_block

generate_metadata_block(schema_name: str) -> MetadataBlock

Generate a metadata block with the provided schema name.

Parameters:

Name	Type	Description	Default
`schema_name`	`str`	Name of the schema.	required

Returns:

Type	Description
`MetadataBlock`	MetadataBlock instance.

Source code in tomeda/t11_create_dataverse_tsv_from_mapping.py

def generate_metadata_block(schema_name: str) -> MetadataBlock:
    """
    Generate a metadata block with the provided schema name.

    Parameters
    ----------
    schema_name : str
        Name of the schema.

    Returns
    -------
    MetadataBlock
        MetadataBlock instance.
    """
    return MetadataBlock(
        MetadataBlockLine(
            prefix="",
            name=schema_name.lower(),
            dataverse_alias="",
            display_name=schema_name.title(),
        )
    )
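
The block name and display name are derived with `str.lower()` and `str.title()`. A short sketch of what these produce, using the schema name from `main` plus one hypothetical name containing an underscore:

```python
schema_name = "Engmeta"
block_name = schema_name.lower()
display_name = schema_name.title()

# str.title() capitalises the first letter after every non-letter,
# so a hypothetical name with an underscore changes shape.
other_display_name = "eng_meta".title()
```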

generate_tsv

generate_tsv(
    auto_mapping: list,
    entry_description: dict[str, InfoElement],
    schema_name,
) -> dict

Generate a TSV format dictionary.

Parameters:

Name	Type	Description	Default
`auto_mapping`	`list`	Auto-generated mapping.	required
`entry_description`	`dict[str, InfoElement]`	Description of the entries.	required
`schema_name`	`str`	Name of the schema.	required

Returns:

Type	Description
`dict`	Dictionary in TSV format.

Source code in tomeda/t11_create_dataverse_tsv_from_mapping.py

def generate_tsv(
    auto_mapping: list, entry_description: dict[str, InfoElement], schema_name
) -> dict:
    """
    Generate a TSV format dictionary.

    Parameters
    ----------
    auto_mapping : list
        Auto-generated mapping.
    entry_description : dict[str, InfoElement]
        Description of the entries.
    schema_name : str
        Name of the schema.

    Returns
    -------
    dict
        Dictionary in TSV format.
    """
    metadata_block: MetadataBlock = generate_metadata_block(
        schema_name=schema_name,
    )
    dataset_field_block: DatasetField = generate_dataset_field_block(
        auto_mapping=auto_mapping,
        entry_description=entry_description,
    )
    controlled_vocabulary_block: ControlledVocabulary = (
        generate_controlled_vocabulary_block(
            entry_description=entry_description,
        )
    )
    my_custom_file = CustomMetadataBlock(
        metadata_block=metadata_block,
        dataset_field=dataset_field_block,
        controlled_vocabulary=controlled_vocabulary_block,
    )

    tsv_dict = dataclasses.asdict(my_custom_file)

    return tsv_dict
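
`dataclasses.asdict` converts nested dataclasses recursively, including those held in lists, which is what turns the assembled block object into a plain dictionary. The classes below are hypothetical stand-ins. The real `CustomMetadataBlock` and its parts are not shown in this page, so their exact fields are an assumption.

```python
from dataclasses import asdict, dataclass


@dataclass
class Line:
    """Hypothetical stand-in for one TSV row."""

    name: str
    title: str


@dataclass
class Block:
    """Hypothetical stand-in for one TSV section."""

    header: list
    body: list


@dataclass
class CustomFile:
    """Hypothetical stand-in for the combined metadata block file."""

    dataset_field: Block


custom_file = CustomFile(
    dataset_field=Block(
        header=["name", "title"],
        body=[Line(name="Eng_a", title="A")],
    )
)
tsv_dict = asdict(custom_file)
```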

main

main(param: TomedaParameter) -> None

Main function of the script. This function triggers the whole process of creating a mapping table and dataverse TSV file. It gets the paths of required files from the command-line arguments, generates a mapping table and finally creates a TSV file which can be uploaded to dataverse.

Source code in tomeda/t11_create_dataverse_tsv_from_mapping.py

def main(param: TomedaParameter) -> None:
    """
    Main function of the script. This function triggers the whole process of
    creating a mapping table and dataverse TSV file. It gets the paths of
    required files from the command-line arguments, generates a mapping table
    and finally creates a TSV file which can be uploaded to dataverse.
    """
    logger.info("Generating dataverse TSV file.")

    new_keys_file_handle = TomedaFileHandler(param.new_keys[0])
    new_dataverse_keys = new_keys_file_handle.read()

    schema_info_table_file_handle = TomedaFileHandler(param.schema_info_table)
    entry_description_raw = nt.loads(
        schema_info_table_file_handle.read(raw=True)[0]
    )

    entry_description: dict[str, InfoElement] = create_InfoElement_structure(
        entry_description_raw
    )

    schema_name = "Engmeta"

    tsv_dict = generate_tsv(
        new_dataverse_keys, entry_description, schema_name=schema_name
    )

    output_file = param.new_keys[0].parent / f"{schema_name}_dataverse.tsv"

    logger.info(f"Writing dataverse TSV file to {output_file}.")

    write_tsv(tsv_dict, output_file, param.force_overwrite)
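
The output file is placed next to the first `new_keys` file and named after the schema. A small sketch of that path construction, using a hypothetical input path and `PurePosixPath` so the result does not depend on the operating system:

```python
from pathlib import PurePosixPath

schema_name = "Engmeta"
# Hypothetical location of the first new-keys file.
new_keys_path = PurePosixPath("work/run1/new_keys.json")

# Same expression as in main(): sibling file named after the schema.
output_file = new_keys_path.parent / f"{schema_name}_dataverse.tsv"
```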

write_tsv

write_tsv(
    tsv_dict: dict, output_path: Path, overwrite: bool
) -> None

Write the TSV data into a file.

Parameters:

Name	Type	Description	Default
`tsv_dict`	`dict`	A dictionary containing the TSV data.	required
`output_path`	`Path`	File path to the output file.	required
`overwrite`	`bool`	If True, the output file will be overwritten if it already exists.	required

Source code in tomeda/t11_create_dataverse_tsv_from_mapping.py

def write_tsv(tsv_dict: dict, output_path: Path, overwrite: bool) -> None:
    """
    Write the TSV data into a file.

    Parameters
    ----------
    tsv_dict : dict
        A dictionary containing the TSV data.
    output_path : Path
        File path to the output file.
    overwrite : bool
        If True, the output file will be overwritten if it already exists.
    """
    output_path_file_handle = TomedaFileHandler(
        output_path, overwrite=overwrite
    )

    content = []
    for part, data in tsv_dict.items():
        header = data["header"]
        body = data["body"]

        output = StringIO()
        writer = csv.DictWriter(
            output,
            delimiter="\t",
            lineterminator="\n",
            fieldnames=header,
        )
        writer.writerow(header)
        body = [body] if isinstance(body, dict) else body
        writer.writerows(body)

        content.append(output.getvalue())
        output.close()

    output_path_file_handle.write(content)
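
The per-section rendering can be sketched with the standard library alone. This version assumes the header is a plain list of column names and uses `writeheader()` in place of the `writerow(header)` call above. The actual shape of `header` in ToMeDa is not shown in this page, and `render_block` is a hypothetical helper.

```python
import csv
from io import StringIO


def render_block(fieldnames: list, rows: list) -> str:
    """Render one TSV section (header line plus rows) as a string."""
    output = StringIO()
    writer = csv.DictWriter(
        output,
        delimiter="\t",
        lineterminator="\n",
        fieldnames=fieldnames,
    )
    writer.writeheader()  # writes the column names as the first line
    writer.writerows(rows)
    return output.getvalue()


text = render_block(
    ["name", "title"],
    [{"name": "Eng_a", "title": "A"}, {"name": "Eng_b", "title": "B"}],
)
```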
