Migrate MLFF geometry optimization files to JSON Lines format for fast partial loading #231

janosh · 2025-03-20T12:46:30Z

Does not fix #230 (yet). The symmetry analysis rerun for all models is pushed back to a subsequent PR. The current lineup of models was tested with a mixture of spglib==2.5.0 and several moyopy versions, so a rerun would have been in order anyway for consistency. The code to calculate the RMSD metric is fixed in this PR, but the existing metrics are not updated: my machine keeps crashing when trying to rerun all models, so I may need to rerun them one by one overnight.

This PR makes debugging and iterating on future geometry optimization much faster by making it possible to load just ~100 DFT/model-relaxed structures on test runs instead of all 257k in WBM.
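A minimal sketch of why JSON Lines helps here, assuming a gzipped file with one structure record per line. The helper name `read_first_n_records`, the toy IDs, and the `n_sites` field are hypothetical and not part of this PR; only the standard library is used:

```python
import gzip
import itertools
import json
import os
import tempfile


def read_first_n_records(path: str, n: int) -> list[dict]:
    """Parse only the first n lines of a gzipped JSON Lines file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as file:
        # islice stops after n lines, so the remaining lines are never parsed
        return [json.loads(line) for line in itertools.islice(file, n)]


# Toy data standing in for relaxed structures (hypothetical IDs and fields)
records = [{"material_id": f"wbm-1-{idx}", "n_sites": idx + 1} for idx in range(1000)]
toy_path = os.path.join(tempfile.mkdtemp(), "toy-structures.jsonl.gz")
with gzip.open(toy_path, mode="wt", encoding="utf-8") as file:
    for record in records:
        file.write(json.dumps(record) + "\n")

first_100 = read_first_n_records(toy_path, 100)
print(len(first_100), first_100[0]["material_id"])
```

A single JSON document, by contrast, generally has to be parsed in full before any one record can be used, which is what makes debug runs on the full file slow.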

Unpolished conversion script for ML-relaxed structures from JSON to JSON Lines

"""
Convert geometry optimization JSON files to JSON lines format for better debug performance.

JSON lines format allows for reading specific number of lines without loading the entire file
into memory, which is especially useful for large structure files when only a subset is needed.

Example usage:
    python scripts/convert_to_jsonl.py  # Convert all model geo-opt files
    python scripts/convert_to_jsonl.py --models mace_mp_0 m3gnet  # Convert specific models
    python scripts/convert_to_jsonl.py --force  # Overwrite existing output files
"""

import argparse
import glob
import os
import sys

import pandas as pd
from tqdm import tqdm

from matbench_discovery.enums import Model


def convert_to_jsonl(
    input_file: str,
    output_file: str | None = None,
    suffix: str = ".jsonl.gz",
    force: bool = False,
) -> bool:
    """Convert a JSON file to JSON lines format.

    Args:
        input_file: Path to input JSON file
        output_file: Path to output JSON lines file (default: input_file with .jsonl.gz extension)
        suffix: Suffix to use for output file if output_file is not specified
        force: Whether to overwrite existing output file

    Returns:
        bool: True if the file was processed, False if it was skipped
    """
    if not os.path.exists(input_file):
        print(f"Error: Input file {input_file} does not exist")
        return False

    try:
        print(f"Reading {input_file}...")
        df = pd.read_json(input_file)

        # Get row count for filename
        row_count = len(df)

        # Check if index is already a material ID
        if df.index.name is None or df.index.name != "material_id":
            # If not indexed by material_id, try to find material_id column
            if "material_id" in df.columns:
                df = df.set_index("material_id")
            elif (
                len(df) > 0  # guard against IndexError on empty frames
                and df.index.dtype == "object"
                and str(df.index[0]).startswith("wbm-")
            ):
                # Index already contains material IDs but not named
                df.index.name = "material_id"

        # Ensure index name is material_id for proper loading later
        if df.index.name is None:
            df.index.name = "material_id"

        if output_file is None:
            # By default, replace extension with the line count and .jsonl.gz
            basename = os.path.splitext(input_file)[0]
            basename = basename.removesuffix(".json")  # Remove .json from .json.gz
            output_file = f"{basename}-n={row_count}{suffix}"

        # Silently skip if file exists and not forcing overwrite
        if os.path.exists(output_file) and not force:
            return False

        print(
            f"Writing {row_count} structures to {output_file} in JSON lines format..."
        )
        df.reset_index().to_json(output_file, orient="records", lines=True)
        print(f"Successfully converted {input_file} to {output_file}")

        # Print info on how to verify the conversion
        print("\nTo verify conversion:")
        print(
            f"  python -c \"import pandas as pd; print(pd.read_json('{output_file}', lines=True).head())\""
        )
        return True

    except Exception as exc:
        print(f"Error converting {input_file}: {exc}")
        return False


def main():
    parser = argparse.ArgumentParser(
        description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter
    )
    parser.add_argument(
        "--models",
        nargs="*",
        type=str,
        help="Specific models to convert (default: all models with geo_opt_path)",
    )
    parser.add_argument(
        "--suffix",
        default=".jsonl.gz",
        help="Suffix for output file (default: .jsonl.gz)",
    )
    parser.add_argument(
        "--force", action="store_true", help="Overwrite existing output files"
    )
    args = parser.parse_args()

    # Get list of models to process
    models_to_process = []
    if args.models:
        # Convert model names to actual Model enum values
        for model_name in args.models:
            try:
                model = getattr(Model, model_name)
                models_to_process.append(model)
            except AttributeError:
                print(f"Warning: Unknown model '{model_name}', skipping")
    else:
        # Process all models with geo_opt_path
        models_to_process = list(Model)

    # Filter to models with geometry optimization data
    geo_opt_models = []
    for model in models_to_process:
        try:
            if model.geo_opt_path:  # Returns None if model doesn't have geo_opt
                geo_opt_models.append(model)
        except ValueError as exc:
            print(f"Warning: {model.label} - {exc}")
        except Exception as exc:
            print(f"Error with {model.label}: {exc}")

    if not geo_opt_models:
        print("No models with geometry optimization data found.")
        sys.exit(1)

    print(f"Found {len(geo_opt_models)} models with geo_opt paths")

    # Process each model's geo_opt file
    processed_count = 0
    skipped_count = 0
    already_exists_count = 0

    for model in tqdm(geo_opt_models, desc="Converting files"):
        try:
            input_file = model.geo_opt_path
            if input_file and os.path.exists(input_file):
                # Check if output file would exist without conversion attempt
                basename = os.path.splitext(input_file)[0]
                basename = basename.removesuffix(".json")  # Remove .json from .json.gz

                # Check if any file matching this pattern exists
                potential_output_files = glob.glob(f"{basename}-n=*{args.suffix}")
                if potential_output_files and not args.force:
                    already_exists_count += 1
                    continue

                # Process file and track results
                if convert_to_jsonl(input_file, None, args.suffix, args.force):
                    processed_count += 1
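                else:
                    # Existing outputs were caught by the glob check above,
                    # so False here means the conversion itself failed
                    skipped_count += 1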
            else:
                print(
                    f"Warning: geo_opt_path for {model.label} doesn't exist: {input_file}"
                )
                skipped_count += 1
        except Exception as exc:
            print(f"Error processing {model.label}: {exc}")
            skipped_count += 1

    print(
        f"Processed {processed_count} file(s), skipped {skipped_count} file(s), {already_exists_count} already existed"
    )


if __name__ == "__main__":
    main()
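A short round-trip sketch on toy data, assuming pandas is installed. It reuses the same `to_json(..., orient="records", lines=True)` call as the script above and then reads back only a few rows via the `nrows` argument of `pd.read_json`. The column names and the file name are made up for illustration:

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for a model's relaxed structures (hypothetical columns)
df_toy = pd.DataFrame(
    {
        "material_id": [f"wbm-1-{idx}" for idx in range(50)],
        "energy": [float(idx) for idx in range(50)],
    }
)
out_path = os.path.join(tempfile.mkdtemp(), f"toy-n={len(df_toy)}.jsonl.gz")

# Same write call as the script: one JSON record per line,
# gzip compression inferred from the .gz extension
df_toy.to_json(out_path, orient="records", lines=True)

# nrows only works together with lines=True and limits how many lines are read
df_head = pd.read_json(out_path, lines=True, nrows=5).set_index("material_id")
print(df_head)
```

Restoring `material_id` as the index after loading mirrors the `reset_index()` call the script makes before writing.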

- after moving spacegroup comparison out of pred_vs_ref_struct_symmetry to analyze_geo_opt.py - calc_structure_distances now only calculates distance metrics between predicted and reference structures - fix tests to cover new functionality and ensure robustness against mismatched IDs

…aster loading on debug runs uploaded to new figshare article: https://figshare.com/articles/dataset/28642406 old one is now deprecated: https://figshare.com/articles/dataset/28187999 symmetry and distance analysis files to be added to new article next

…IDS` - Modified `calc_structure_distances` in `symmetry.py` to print a warning instead of raising an error when no shared IDs between predicted and reference structures - new test cases in `test_symmetry.py` to verify the new warning behavior and ensure proper handling of NaN values in distance calculations

…pdate figshare URLs in data-files.yml - remove DataFiles.wbm_cses_plus_init_structs altogether, usually you just need one or the other, not both initial and relaxed structures - change all references from `wbm_cses_plus_init_structs` to `wbm_initial_structures` and `wbm_computed_structure_entries` in scripts and models - enhance upload script with argparse for file selection

janosh added 2 commits March 20, 2025 08:37

fix: changed RMSD units from Ångström to unitless across all model YAML files

89db10b

- fix calc_geo_opt_metrics to handle NaN values appropriately by filling them with 1.0

23a78af

- scripts/evals/geo_opt.py add note explaining RMSD values are unitless
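As an illustration of the NaN handling described in commit 23a78af, here is a minimal sketch of what filling missing RMSD values with 1.0 before averaging could look like. The actual `calc_geo_opt_metrics` implementation is not shown in this PR page, so the function name `mean_rmsd_with_nan_penalty` and the toy values are assumptions:

```python
import numpy as np
import pandas as pd


def mean_rmsd_with_nan_penalty(rmsd: pd.Series, fill_value: float = 1.0) -> float:
    """Mean RMSD where missing values count as fill_value instead of being dropped."""
    return float(rmsd.fillna(fill_value).mean())


# Toy per-structure RMSD values with one failed match (hypothetical IDs)
rmsd_toy = pd.Series([0.1, np.nan, 0.3], index=["wbm-1-1", "wbm-1-2", "wbm-1-3"])

print(mean_rmsd_with_nan_penalty(rmsd_toy))  # NaN counted as 1.0
print(rmsd_toy.mean())  # pandas skips NaN by default
```

Under this reading, filling with 1.0 keeps a model from getting a better mean just because some structures failed to match and were silently dropped.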

janosh added the fix (Bug fix) and geo opt (Geometry optimization) labels Mar 20, 2025

janosh temporarily deployed to github-pages March 20, 2025 12:49 — with GitHub Actions Inactive

janosh temporarily deployed to github-pages March 20, 2025 13:07 — with GitHub Actions Inactive

janosh added 2 commits March 20, 2025 09:24

analyze_geo_opt.py add CLI flag `analysis_type`: 'all', 'symmetry', or 'distance'

c4ca284

- Conditional execution of symmetry analysis and structure distance calculations based on the specified analysis type

janosh temporarily deployed to github-pages March 22, 2025 15:56 — with GitHub Actions Inactive

janosh added 3 commits March 22, 2025 12:22

add read-only mode to update_yaml_at_path by passing data=None

655c71f

- update_yaml_at_path now allows reading from a YAML file when `data` is set to None, returning the value at the specified dotted path without modifying the file - tests to verify read-only functionality
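A rough sketch of the read-only lookup this commit describes, operating on an already-parsed dict rather than a YAML file to stay dependency-free. The helper `get_at_dotted_path` and the toy metadata are hypothetical and are not the actual `update_yaml_at_path` implementation (which was also reverted later in this PR):

```python
from typing import Any


def get_at_dotted_path(data: dict[str, Any], dotted_path: str) -> Any:
    """Return the value at a dotted path like 'metrics.geo_opt.rmsd'."""
    value: Any = data
    for key in dotted_path.split("."):
        # Fail loudly if any intermediate key is missing or not a mapping
        if not isinstance(value, dict) or key not in value:
            raise KeyError(f"{dotted_path!r} not found (stopped at {key!r})")
        value = value[key]
    return value


model_metadata = {"metrics": {"geo_opt": {"rmsd": 0.05}}}  # toy data
print(get_at_dotted_path(model_metadata, "metrics.geo_opt.rmsd"))
```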

write_metrics_to_yaml now accepts metrics as either a DataFrame or a dictionary

cf800ab

- Refined `calc_geo_opt_metrics` to handle NaN values - more test coverage for both functions

janosh temporarily deployed to github-pages March 22, 2025 18:11 — with GitHub Actions Inactive

add tests/metrics/test_analyze_geo_opt.py unit tests for geometry optimization analysis in `test_analyze_geo_opt.py`

ac14a20

- bump pre-commit hooks for ruff, eslint, and pyright

janosh temporarily deployed to github-pages March 22, 2025 21:15 — with GitHub Actions Inactive

janosh added 4 commits March 22, 2025 17:35

removed n_structures field from YAML files since already included in analysis filename

a83d8ec

- updated analysis file paths to include structure counts in filenames for consistency - modified RMSD values to reflect updated config in several models

change all WBM initial/relaxed structure pd.read_json calls to use `lines=True` for new JSON files in line-delimited format

b6f35ea

- Updated relevant scripts and models to ensure compatibility

rename all df_cse variables to df_wbm_cse or df_mp_cse for readability

fa389a0

janosh temporarily deployed to github-pages March 22, 2025 23:51 — with GitHub Actions Inactive

janosh added 2 commits March 23, 2025 12:32

specify JSON Lines format for model-relaxed structures in contributing.md and PR template.md

95f1a2e

update data-files.yml to reflect changes in file paths from .json.bz2 to .jsonl.gz for WBM computed and initial structures, removal of wbm_cses_plus_init_structs

temp revert to previous metrics.geo_opt format for now

2cef9cf

janosh changed the title ~~Fix geo_opt RMSD metric~~ Migrate MLFF geometry optimization files to JSON Lines format for fast partial loading Mar 23, 2025

janosh added 6 commits March 23, 2025 14:43

fix pytest

66061df

revert half-baked update_yaml_at_path() and write_metrics_to_yaml() changes for now

437d65c

add models/mattersim/extract_final_structs_from_relax_traj_take2.py

3eb07ea

The original trajectories can be found at: https://figshare.com/s/a629acbf3bed6a04b3ce?file=53060504

migrate model scripts for geo_opt test or pred joining to write relaxed structures in JSON Lines format

15c8b8f

revert code changes in scripts/analyze_geo_opt.py + matbench_discovery/structure/symmetry.py + matbench_discovery/data.py + tests/structure/test_symmetry.py

72c4cd9

keep only new paragraph in module doc str

reapply minimal RMSD fixes as proposed in #230

7c9aaf2

janosh merged commit 85892aa into main Mar 23, 2025
6 checks passed

janosh deleted the fix-geo-opt-rmsd branch March 23, 2025 19:31

janosh mentioned this pull request Mar 24, 2025

Parameters for StructureMatcher impact on geo-opt metrics #230

Closed
