• Home
  • Species
  • Sets
  • Create New Job
  • Tutorial
  • Scaffolder Tutorial
  • Help
In [ ]:
%%bash

# get the results
curl -L -o out.tar.gz https://anchored.bioinf.uni-leipzig.de/sets/04cb956d-66a6-4740-9e28-eb807e4c6e89/download-small/
# extract the webserver output archive and see whats in there
tar -xzf out.tar.gz
ls out/
# to run this notebook completely you can access the showcase data on Zenodo as well as this notebook:
# place the genomes into out/utils/
# place the halos,trnas and syn_out_nucl_preloaded directories into out/stable/synthology
    
# install the requirements.txt, e.g.:
# conda create -n ancst_downstream python=3.11
# conda activate ancst_downstream
# pip install -r requirements.txt

# you will also need MCScanX, blast and clasp
In [ ]:
%%bash

# if not available install MCScanX from:
# https://github.com/wyp1125/MCScanX
# first, run MCScanX on theanchor alignments
cd out/stable/MCScanX
MCScanX_h MCScanX
In [3]:
%%bash

# now draw the alignments globally for all "chromosomes" >= 1000000 nucleotides
cd out/stable/MCScanX
python3 mcdraw.py --mcscanx_file MCScanX.collinearity --threshold_chr 1000000
In [4]:
# it is drawn greedily, so the next track down is always the ones with most alignments out of the species left
# blue and green are forward and reverse alignments which map to the chromsome their chromosome mostly maps to in general
# magenta and orange alignments are forward and reverse ones to chromosomes which do not map to their chromosome primarily,
# indicating rearrangements
# here you can already see extensive conservation between the first two species but also small rearrangements
# the next pair shows significantly more global rearrangements

from IPython.display import SVG, display
display(SVG('out/stable/MCScanX/mcscx_blocks.svg'))
No description has been provided for this image
In [9]:
# lets run a comprehensive synteny-enabled orthology analysis for a multi-copy gene family in Drosophilas
%cd out/stable/synthology
/Users/schmackeroodle/test_out_dir2/out/stable/synthology
In [ ]:
%%bash
ls
# for the following analysis you need these genes (their annotations) and also the genome sequences of the Drosophilas
# get this data most easily from Zenodo (including the jnotebook): 
# otherwise the Drosophilas are just NCBI accessions with which you can find this data
# you need the genomes. put them in out/utils/genomes/
# and the annotations which i put in halos here in out/stable/synthology
# I used blastp to find the candidates in the other (not D.mel) Drosophilas
# (the tRNA data is used below)

# you need the NCBI blast suite (blastn, tblastn, blastp for everything in our output)
# make sure clasp.x is working. there is compiled version in out/utils. if its not working on your machine, simply recompile from:
# http://legacy.bioinf.uni-leipzig.de/Software/clasp/

# first, lets run the run_synthology_prot.py pipeline. what it does (roughly):
# 1. place the elements of interest from the gff files in the MCScanX collinearity chains
# 2. make orthology graphs from all elements which are in the same pairwise chains and edit to cograph because of:
# https://pubmed.ncbi.nlm.nih.gov/22456957/
# 3. use dynamic programming to align them allowing duplications and other rearrangements as first decribed in:
# https://pmc.ncbi.nlm.nih.gov/articles/PMC4981973/#CR42
# 4. make union of pairwise graphs and edit to cograph again to get multi-species orthology graphs
# the results of this are already quite heavy in the sense that you can now technically check out the list alignments and orthology graphs
# to further investigate your orthology relationships of interest but that is too much for this tutorial

# most arguments are self-explanatory and just regard the locations of stuff
# technically with --use-clasp n you can avoid chaining blast hits but you'd usually want to do that
# --blast-mode chains means to only consider the elements of interest which can actually be placed into an MCScanX chain
# --blast-mode all-vs-all addittionally performs an all-vs-all blastp which was originally developed to investigate the discrepancy between
# orthology determination solely on reciprocal best hits vs. synteny-aided

  python3 run_synthology_prot.py \
    --cores 8 \
    --col ../MCScanX/MCScanX.collinearity \
    --homology_threshold 30 \
    --annotation_dir halos \
    --new_blast y \
    --mapping_file ../../utils/mapping \
    --pairwise_alignments ../../utils/pairwise_alignments_table \
    --use-clasp y \
    --blast-mode chains \
    --blastp-path blastp \
    --clasp-path ../../utils/clasp.x \
    --output_dir synthology_out_prot
In [20]:
%%bash

# to make this heavy output usable we implemented the extraction of "SynOrthoGroup" akin to OrthoFinders OrthoGroups
# they are defined as the connected components in the final to-cograph-edited union graph of the elements which could be placed into
# MCScanX synteny chains and whose connection is also supported by a respective list alignment
# cograph editing has to be contrained in our use case, firstly in general to only make between-species not within-species edges and 
# secondly, we create to final graphs, one in which the editing is allowed to introduce new edges which amounts to transitively closing some 
# orthology relationships ("with_new_edges") and not allowing new edges ("no_new_edges"). i recommend to conservatively use the version 
# without any new edges because the transitive closures can basically create big connected components due to sparse phylogenetic coverage, 
# assembly artefacts and genuine genome rearrangements. but anyway, try both
# to get them run:

python3 export_synorthogroups.py synthology_out_prot --graph-version both
================================================================================
EXPORTING SYNORTHOGROUPS (no_new_edges)
================================================================================

Loading synthology_out_prot/final_union_graph_no_new_edges.pickle...
Loading synthology_out_prot/parsed_faa_gff...
  Graph nodes: 100
  Graph edges: 447
  Total input genes: 104
  Multi-species SynOrthogroups: 4
  Single-species singletons: 0
  Filtered genes: 4
  ✓ Complete gene accounting verified

Writing output files to synthology_out_prot/synorthogroups/no_new_edges/
  Writing SynOrthogroups.tsv...
  Writing SynOrthogroups.GeneCount.tsv...
  Writing SynOrthogroups_SingleCopyOrthologues.txt...
  Writing SynOrthogroups_UnassignedGenes.tsv...
  Writing Statistics_Overall.tsv...
  Writing Statistics_PerSpecies.tsv...
  Writing SynOrthogroups_WithCoords.tsv...
  Writing synorthogroups.pickle...

✓ Export complete: synthology_out_prot/synorthogroups/no_new_edges/

================================================================================
EXPORTING SYNORTHOGROUPS (with_new_edges)
================================================================================

Loading synthology_out_prot/final_union_graph_with_new_edges.pickle...
Loading synthology_out_prot/parsed_faa_gff...
  Graph nodes: 100
  Graph edges: 472
  Total input genes: 104
  Multi-species SynOrthogroups: 4
  Single-species singletons: 0
  Filtered genes: 4
  ✓ Complete gene accounting verified

Writing output files to synthology_out_prot/synorthogroups/with_new_edges/
  Writing SynOrthogroups.tsv...
  Writing SynOrthogroups.GeneCount.tsv...
  Writing SynOrthogroups_SingleCopyOrthologues.txt...
  Writing SynOrthogroups_UnassignedGenes.tsv...
  Writing Statistics_Overall.tsv...
  Writing Statistics_PerSpecies.tsv...
  Writing SynOrthogroups_WithCoords.tsv...
  Writing synorthogroups.pickle...

�� Export complete: synthology_out_prot/synorthogroups/with_new_edges/

================================================================================
ALL EXPORTS COMPLETE
================================================================================

In [27]:
# e.g. look at the main halo protein from D. mel

import pandas as pd

# Read the synorthogroups table
sog_file = "synthology_out_prot/synorthogroups/with_new_edges/SynOrthogroups.tsv"
sog_df = pd.read_csv(sog_file, sep='\t')

# Read gene counts
counts_file = "synthology_out_prot/synorthogroups/with_new_edges/SynOrthogroups.GeneCount.tsv"
counts_df = pd.read_csv(counts_file, sep='\t')

# Read statistics
stats_file = "synthology_out_prot/synorthogroups/with_new_edges/Statistics_Overall.tsv"
stats_df = pd.read_csv(stats_file, sep='\t')

print("=" * 80)
print("OVERALL STATISTICS (WITH NEW EDGES)")
print("=" * 80)
for _, row in stats_df.iterrows():
    val = row['Value']
    # Round if it's a numeric value that should be an integer
    if isinstance(val, float) and val == int(val):
        val = int(val)
    print(f"{row['Metric']}: {val}")

print("\n" + "=" * 80)
print("ALL SYNORTHOGROUPS")
print("=" * 80)

for idx, row in sog_df.iterrows():
    sog_id = row['SynOrthogroup']
    print(f"\n{sog_id}")
    print("-" * 80)

    # Get gene counts for this SOG
    sog_counts = counts_df[counts_df['SynOrthogroup'] == sog_id]

    print("\nMembers per species:")
    for col in sog_df.columns:
        if col != 'SynOrthogroup':
            members = row[col]
            count = sog_counts[col].values[0] if len(sog_counts) > 0 else 0
            if isinstance(count, float):
                count = int(count)
            if pd.notna(members) and members != '':
                print(f"  {col} ({count}): {members}")
            else:
                print(f"  {col} ({count}): -")

    total = sog_counts['Total'].values[0] if len(sog_counts) > 0 else 0
    if isinstance(total, float):
        total = int(total)
    print(f"\nTotal genes in {sog_id}: {total}")
================================================================================
OVERALL STATISTICS (WITH NEW EDGES)
================================================================================
Number of species: 11
Total input genes: 104
Genes in SynOrthogroups: 100
Genes unassigned (filtered): 4
Genes unassigned (singletons): 0
Percentage in SynOrthogroups: 96.2
Number of SynOrthogroups: 4
Number of single-copy SynOrthogroups: 1
Mean SynOrthogroup size: 25
Median SynOrthogroup size: 27

================================================================================
ALL SYNORTHOGROUPS
================================================================================

SOG0000000
--------------------------------------------------------------------------------

Members per species:
  GCF_000001215.4 (5): NP_001097507.1, NP_647925.1, NP_647926.1, NP_647928.1, NP_652709.1
  GCF_004382195.2 (5): XP_002035381.1, XP_002035384.1, XP_002035386.1, XP_002035388.1, XP_032573789.1
  GCF_009870125.1 (5): XP_001352657.3, XP_002135424.3, XP_033238392.1, XP_033238393.1, XP_033238394.1
  GCF_016746365.2 (5): XP_002093738.1, XP_002093741.1, XP_002093742.1, XP_039229380.1, XP_039229381.1
  GCF_016746395.2 (5): XP_002083669.1, XP_016030473.1, XP_016030475.1, XP_039149189.1, XP_039150257.1
  GCF_017639315.1 (5): XP_001956198.1, XP_001956199.1, XP_001956200.1, XP_001956202.1, XP_032310569.1
  GCF_018153835.1 (5): XP_017070999.1, XP_017071001.1, XP_017071002.1, XP_017071003.1, XP_017071329.1
  GCF_018902025.1 (5): XP_002062546.1, XP_002062547.1, XP_002062549.2, XP_023034409.1, XP_023037390.1
  GCF_030179895.1 (5): XP_017028404.1, XP_017028410.1, XP_017028411.1, XP_017028412.1, XP_070141710.1
  GCF_030179915.1 (5): XP_017014258.1, XP_017014261.2, XP_017014262.1, XP_017014265.1, XP_070071522.1
  GCF_030788295.1 (1): XP_002060039.2

Total genes in SOG0000000: 51

SOG0000001
--------------------------------------------------------------------------------

Members per species:
  GCF_000001215.4 (2): NP_647920.1, NP_648635.1
  GCF_004382195.2 (2): XP_002030387.1, XP_032575457.1
  GCF_009870125.1 (2): XP_001352650.2, XP_001353594.4
  GCF_016746365.2 (2): XP_002094686.1, XP_039229810.1
  GCF_016746395.2 (2): XP_002084759.1, XP_016030468.1
  GCF_017639315.1 (2): XP_032310568.1, XP_032310840.1
  GCF_018153835.1 (2): XP_017071328.1, XP_017080036.2
  GCF_018902025.1 (2): XP_002062545.1, XP_046868055.1
  GCF_030179895.1 (2): XP_017028402.1, XP_070141894.1
  GCF_030179915.1 (2): XP_017004446.2, XP_017014257.1
  GCF_030788295.1 (7): XP_002047487.1, XP_002060043.1, XP_002060051.1, XP_032291392.1, XP_032291415.1, XP_032291459.1, XP_032291515.1

Total genes in SOG0000001: 27

SOG0000002
--------------------------------------------------------------------------------

Members per species:
  GCF_000001215.4 (2): NP_001137765.1, NP_608940.1
  GCF_004382195.2 (1): XP_032583477.1
  GCF_009870125.1 (1): XP_001357034.2
  GCF_016746365.2 (1): XP_002087550.1
  GCF_016746395.2 (1): XP_039154542.1
  GCF_017639315.1 (1): XP_032311464.1
  GCF_018153835.1 (1): XP_017075608.1
  GCF_018902025.1 (1): XP_002064938.1
  GCF_030179895.1 (1): XP_017021791.1
  GCF_030179915.1 (1): XP_016994922.1
  GCF_030788295.1 (1): XP_002051725.1

Total genes in SOG0000002: 12

SOG0000003
--------------------------------------------------------------------------------

Members per species:
  GCF_000001215.4 (0): -
  GCF_004382195.2 (1): XP_002037965.1
  GCF_009870125.1 (1): XP_033235652.1
  GCF_016746365.2 (1): XP_015054458.1
  GCF_016746395.2 (1): XP_016023623.1
  GCF_017639315.1 (1): XP_001961803.1
  GCF_018153835.1 (1): XP_017082362.2
  GCF_018902025.1 (1): XP_002069270.2
  GCF_030179895.1 (1): XP_017032011.1
  GCF_030179915.1 (1): XP_017006947.2
  GCF_030788295.1 (1): XP_002052907.1

Total genes in SOG0000003: 10
In [28]:
import pandas as pd

# Read the synorthogroups table
sog_file = "synthology_out_prot/synorthogroups/no_new_edges/SynOrthogroups.tsv"
sog_df = pd.read_csv(sog_file, sep='\t')

# Read gene counts
counts_file = "synthology_out_prot/synorthogroups/no_new_edges/SynOrthogroups.GeneCount.tsv"
counts_df = pd.read_csv(counts_file, sep='\t')

# Read statistics
stats_file = "synthology_out_prot/synorthogroups/no_new_edges/Statistics_Overall.tsv"
stats_df = pd.read_csv(stats_file, sep='\t')

print("=" * 80)
print("OVERALL STATISTICS (NO NEW EDGES)")
print("=" * 80)
for _, row in stats_df.iterrows():
    val = row['Value']
    # Round if it's a numeric value that should be an integer
    if isinstance(val, float) and val == int(val):
        val = int(val)
    print(f"{row['Metric']}: {val}")

print("\n" + "=" * 80)
print("ALL SYNORTHOGROUPS")
print("=" * 80)

for idx, row in sog_df.iterrows():
    sog_id = row['SynOrthogroup']
    print(f"\n{sog_id}")
    print("-" * 80)

    # Get gene counts for this SOG
    sog_counts = counts_df[counts_df['SynOrthogroup'] == sog_id]

    print("\nMembers per species:")
    for col in sog_df.columns:
        if col != 'SynOrthogroup':
            members = row[col]
            count = sog_counts[col].values[0] if len(sog_counts) > 0 else 0
            if isinstance(count, float):
                count = int(count)
            if pd.notna(members) and members != '':
                print(f"  {col} ({count}): {members}")
            else:
                print(f"  {col} ({count}): -")

    total = sog_counts['Total'].values[0] if len(sog_counts) > 0 else 0
    if isinstance(total, float):
        total = int(total)
    print(f"\nTotal genes in {sog_id}: {total}")
================================================================================
OVERALL STATISTICS (NO NEW EDGES)
================================================================================
Number of species: 11
Total input genes: 104
Genes in SynOrthogroups: 100
Genes unassigned (filtered): 4
Genes unassigned (singletons): 0
Percentage in SynOrthogroups: 96.2
Number of SynOrthogroups: 4
Number of single-copy SynOrthogroups: 1
Mean SynOrthogroup size: 25
Median SynOrthogroup size: 27

================================================================================
ALL SYNORTHOGROUPS
================================================================================

SOG0000000
--------------------------------------------------------------------------------

Members per species:
  GCF_000001215.4 (5): NP_001097507.1, NP_647925.1, NP_647926.1, NP_647928.1, NP_652709.1
  GCF_004382195.2 (5): XP_002035381.1, XP_002035384.1, XP_002035386.1, XP_002035388.1, XP_032573789.1
  GCF_009870125.1 (5): XP_001352657.3, XP_002135424.3, XP_033238392.1, XP_033238393.1, XP_033238394.1
  GCF_016746365.2 (5): XP_002093738.1, XP_002093741.1, XP_002093742.1, XP_039229380.1, XP_039229381.1
  GCF_016746395.2 (5): XP_002083669.1, XP_016030473.1, XP_016030475.1, XP_039149189.1, XP_039150257.1
  GCF_017639315.1 (5): XP_001956198.1, XP_001956199.1, XP_001956200.1, XP_001956202.1, XP_032310569.1
  GCF_018153835.1 (5): XP_017070999.1, XP_017071001.1, XP_017071002.1, XP_017071003.1, XP_017071329.1
  GCF_018902025.1 (5): XP_002062546.1, XP_002062547.1, XP_002062549.2, XP_023034409.1, XP_023037390.1
  GCF_030179895.1 (5): XP_017028404.1, XP_017028410.1, XP_017028411.1, XP_017028412.1, XP_070141710.1
  GCF_030179915.1 (5): XP_017014258.1, XP_017014261.2, XP_017014262.1, XP_017014265.1, XP_070071522.1
  GCF_030788295.1 (1): XP_002060039.2

Total genes in SOG0000000: 51

SOG0000001
--------------------------------------------------------------------------------

Members per species:
  GCF_000001215.4 (2): NP_647920.1, NP_648635.1
  GCF_004382195.2 (2): XP_002030387.1, XP_032575457.1
  GCF_009870125.1 (2): XP_001352650.2, XP_001353594.4
  GCF_016746365.2 (2): XP_002094686.1, XP_039229810.1
  GCF_016746395.2 (2): XP_002084759.1, XP_016030468.1
  GCF_017639315.1 (2): XP_032310568.1, XP_032310840.1
  GCF_018153835.1 (2): XP_017071328.1, XP_017080036.2
  GCF_018902025.1 (2): XP_002062545.1, XP_046868055.1
  GCF_030179895.1 (2): XP_017028402.1, XP_070141894.1
  GCF_030179915.1 (2): XP_017004446.2, XP_017014257.1
  GCF_030788295.1 (7): XP_002047487.1, XP_002060043.1, XP_002060051.1, XP_032291392.1, XP_032291415.1, XP_032291459.1, XP_032291515.1

Total genes in SOG0000001: 27

SOG0000002
--------------------------------------------------------------------------------

Members per species:
  GCF_000001215.4 (2): NP_001137765.1, NP_608940.1
  GCF_004382195.2 (1): XP_032583477.1
  GCF_009870125.1 (1): XP_001357034.2
  GCF_016746365.2 (1): XP_002087550.1
  GCF_016746395.2 (1): XP_039154542.1
  GCF_017639315.1 (1): XP_032311464.1
  GCF_018153835.1 (1): XP_017075608.1
  GCF_018902025.1 (1): XP_002064938.1
  GCF_030179895.1 (1): XP_017021791.1
  GCF_030179915.1 (1): XP_016994922.1
  GCF_030788295.1 (1): XP_002051725.1

Total genes in SOG0000002: 12

SOG0000003
--------------------------------------------------------------------------------

Members per species:
  GCF_000001215.4 (0): -
  GCF_004382195.2 (1): XP_002037965.1
  GCF_009870125.1 (1): XP_033235652.1
  GCF_016746365.2 (1): XP_015054458.1
  GCF_016746395.2 (1): XP_016023623.1
  GCF_017639315.1 (1): XP_001961803.1
  GCF_018153835.1 (1): XP_017082362.2
  GCF_018902025.1 (1): XP_002069270.2
  GCF_030179895.1 (1): XP_017032011.1
  GCF_030179915.1 (1): XP_017006947.2
  GCF_030788295.1 (1): XP_002052907.1

Total genes in SOG0000003: 10
In [32]:
%%bash

# so for this example, the two graphs represent the same information
# since i know about this example, i know that NP_608940.1 should probably be allocated into SOG0000003
# thats a good example - always check this stuff from multiple angles:
# lets first draw an image of the synorthogroups and synteny anchor alignments with 250000 margin to check for anchors around
# in the next cell we will draw the raw data

python3 draw_synorthogroups.py 250000 synthology_out_prot/synorthogroups/no_new_edges/
Loading SynOrthogroups from synthology_out_prot/synorthogroups/no_new_edges//SynOrthogroups.tsv...
  Found 4 SynOrthogroups across 11 species
Loading gene coordinates from synthology_out_prot/synorthogroups/no_new_edges//SynOrthogroups_WithCoords.tsv...
  Loaded coordinates for 100 genes
  After filtering (min 2 species): 4 SOGs to draw
Loaded 679879 alignment fragments

=== Drawing SOG0000000 (1/4) ===
  Species: 11, Genes: 51
  Saved to images_synorthogroups/no_new_edges/250000/SOG0000000_11species_51genes.svg

=== Drawing SOG0000001 (2/4) ===
  Species: 11, Genes: 27
  Saved to images_synorthogroups/no_new_edges/250000/SOG0000001_11species_27genes.svg

=== Drawing SOG0000002 (3/4) ===
  Species: 11, Genes: 12
  Saved to images_synorthogroups/no_new_edges/250000/SOG0000002_11species_12genes.svg

=== Drawing SOG0000003 (4/4) ===
  Species: 10, Genes: 10
  Saved to images_synorthogroups/no_new_edges/250000/SOG0000003_10species_10genes.svg

=== Done! Figures saved to images_synorthogroups/no_new_edges/250000/ ===
In [35]:
from IPython.display import SVG, display, Markdown
import glob
import os

svg_files = sorted(glob.glob('images_synorthogroups/no_new_edges/250000/*.svg'))

for svg_file in svg_files:
    display(Markdown(f"### {os.path.basename(svg_file)}"))
    display(SVG(svg_file))

SOG0000000_11species_51genes.svg¶

No description has been provided for this image

SOG0000001_11species_27genes.svg¶

No description has been provided for this image

SOG0000002_11species_12genes.svg¶

No description has been provided for this image

SOG0000003_10species_10genes.svg¶

No description has been provided for this image
In [ ]:
%%bash

# now lets double check - this is a recommended workflow
# with the draw_verbose.py script one can draw such images only given gffs directory or an alternative "coords" file
# and the anchor alignments (without synthology inference)
# (the output warnings about tracks not being adjacent are expected and the negative coordinates too)

python3 draw_verbose.py 250000 \
    --gff-dir halos/gff \
    --alignments-file ../../utils/pairwise_alignments_table \
    --auto-order \
    --draw-all-elements
In [53]:
from IPython.display import SVG, display, Markdown
import glob
import os

svg_files = sorted(glob.glob('images_draw_verbose/250000/*.svg'))

# so this makes 3 images...and at that margin therefore 3, not 4 synorthogroups.
# in component_2_of_3_ref_GCF_018902025.1_NW_025814050.1_23821953.svg one can see why: the elements of SOG0000002 and SOG0000003 are on
# the same chromosome and soemtimes have larger and sometimes smaller gaps between each other in the genomes which is why depending 
# on the margin used, they get thrown into the same "group"

for svg_file in svg_files:
    display(Markdown(f"### {os.path.basename(svg_file)}"))
    display(SVG(svg_file))

component_1_of_3_ref_GCF_016746395.2_NC_052522.2_4643833.svg¶

No description has been provided for this image

component_2_of_3_ref_GCF_000001215.4_NT_037436.4_13378622.svg¶

No description has been provided for this image

component_3_of_3_ref_GCF_018902025.1_NW_025814050.1_23821179.svg¶

No description has been provided for this image
In [ ]:
%%bash

# lets try with a smaller margin

python3 draw_verbose.py 50000 \
    --gff-dir halos/gff \
    --alignments-file ../../utils/pairwise_alignments_table \
    --auto-order \
    --draw-all-elements
In [54]:
from IPython.display import SVG, display, Markdown
import glob
import os

svg_files = sorted(glob.glob('images_draw_verbose/50000/*.svg'))

# now it makes 4 images corresponding largely to the synorthogroups but with the D. mel element actually probably in the right group
# so the image for halo is: component_3_of_4_ref_GCF_016746395.2_NC_052520.2_1548430.svg
# and the other element: component_1_of_4_ref_GCF_016746365.2_NC_052527.2_13771800.svg

# IMPORTANT: for any synteny analysis like this, always try with different margins that you look at. if the scenario you are looking at 
# changes drastically between using different margin within the same order of magnitude, its a sign of the unreliability of the synteny 
# resolution. conservely, if increasing the margin only adds more elements but does not change the overall picture, the synteny evidence 
# for orthology claims is good

for svg_file in svg_files:
    display(Markdown(f"### {os.path.basename(svg_file)}"))
    display(SVG(svg_file))

component_1_of_4_ref_GCF_018902025.1_NW_025814057.1_811382.svg¶

No description has been provided for this image

component_2_of_4_ref_GCF_009870125.1_NC_046681.1_28002573.svg¶

No description has been provided for this image

component_3_of_4_ref_GCF_004382195.2_NC_045951.1_13070038.svg¶

No description has been provided for this image

component_4_of_4_ref_GCF_009870125.1_NC_046681.1_9910077.svg¶

No description has been provided for this image
In [82]:
# one perk of the synteny-based orthology detection is that tandem arrays can potentially be disentangled
# for that we can use list alignments with synteny anchors. we already did that for the synthology pipeline
# here, lets extract some of this info to show that the synthology output already contains the information
# to further investigate these relationship
# we have not been able to automize this approach in a sensible fashion until writing this tutorial
# that being said, its good to investigate this "rawer" data as well and should probably be advised in most cases

import re

print("=" * 80)
print("TANDEM DUPLICATION EXAMPLES FROM SYNORTHOGROUP ALIGNMENTS")
print("=" * 80)

dmel_tandem = ['NP_001097507.1', 'NP_647925.1', 'NP_647926.1', 'NP_647928.1', 'NP_652709.1']

print(f"\nD. melanogaster SOG0000000 tandem array (5 genes):")
for gene in dmel_tandem:
    print(f"  - {gene}")

# Example 1: All matches (conserved 5→5)
species1 = 'GCF_018153835.1'
print(f"\n{'=' * 80}")
print(f"EXAMPLE 1: D. mel → {species1} (all matches)")
print("=" * 80)

alignment_file = f'synthology_out_prot/alignments/GCF_000001215.4___{species1}.txt'

with open(alignment_file) as f:
    content = f.read()

edges_match = re.search(r'Edges \(\d+\):(.*?)(?=\n-{10,}|\Z)', content, re.DOTALL)
if edges_match:
    edges_text = edges_match.group(1)

    print("\nCorrespondences:")
    for line in edges_text.split('\n'):
        line = line.strip()
        if '<->' not in line:
            continue

        match = re.match(r'(\S+)\s+<->\s+(\S+)\s+\[event=(\w+)', line)
        if not match:
            continue

        g1, g2, event = match.groups()
        if g1 in dmel_tandem:
            print(f"  {g1} <-> {g2} [event={event}]")
        elif g2 in dmel_tandem:
            print(f"  {g2} <-> {g1} [event={event}]")

# Example 2: Some tandem duplications (many-to-one)
species2 = 'GCF_016746395.2'
print(f"\n{'=' * 80}")
print(f"EXAMPLE 2: D. mel → {species2} (with tandem duplications)")
print("=" * 80)

alignment_file = f'synthology_out_prot/alignments/GCF_000001215.4___{species2}.txt'

with open(alignment_file) as f:
    content = f.read()

edges_match = re.search(r'Edges \(\d+\):(.*?)(?=\n-{10,}|\Z)', content, re.DOTALL)
if edges_match:
    edges_text = edges_match.group(1)

    # Build reverse mapping
    correspondences = {gene: [] for gene in dmel_tandem}
    for line in edges_text.split('\n'):
        line = line.strip()
        if '<->' not in line:
            continue

        match = re.match(r'(\S+)\s+<->\s+(\S+)\s+\[event=(\w+)', line)
        if not match:
            continue

        g1, g2, event = match.groups()
        if g1 in dmel_tandem:
            correspondences[g1].append((g2, event))
        elif g2 in dmel_tandem:
            correspondences[g2].append((g1, event))

    # Build target mapping
    target_to_dmel = {}
    for dmel_gene, hits in correspondences.items():
        for target, event in hits:
            if target not in target_to_dmel:
                target_to_dmel[target] = []
            target_to_dmel[target].append((dmel_gene, event))

    print("\nCorrespondences:")
    displayed = set()

    # Show many-to-one first
    for target, sources in sorted(target_to_dmel.items()):
        if len(sources) > 1:
            dmel_genes = [s[0] for s in sources]
            events = set(s[1] for s in sources)
            print(f"\n  {' + '.join(dmel_genes)} → {target}")
            print(f"    [{len(sources)}:1 many-to-one, event: {', '.join(events)}]")
            displayed.update(dmel_genes)

    # Then show 1:1
    for dmel_gene in dmel_tandem:
        if dmel_gene in displayed:
            continue
        hits = correspondences[dmel_gene]
        if len(hits) == 1:
            target, event = hits[0]
            print(f"\n  {dmel_gene} → {target} [event={event}]")

print("\n" + "=" * 80)
================================================================================
TANDEM DUPLICATION EXAMPLES FROM SYNORTHOGROUP ALIGNMENTS
================================================================================

D. melanogaster SOG0000000 tandem array (5 genes):
  - NP_001097507.1
  - NP_647925.1
  - NP_647926.1
  - NP_647928.1
  - NP_652709.1

================================================================================
EXAMPLE 1: D. mel → GCF_018153835.1 (all matches)
================================================================================

Correspondences:
  NP_001097507.1 <-> XP_017071329.1 [event=match]
  NP_647925.1 <-> XP_017070999.1 [event=match]
  NP_652709.1 <-> XP_017071003.1 [event=match]
  NP_647926.1 <-> XP_017071002.1 [event=match]
  NP_647928.1 <-> XP_017071001.1 [event=match]

================================================================================
EXAMPLE 2: D. mel → GCF_016746395.2 (with tandem duplications)
================================================================================

Correspondences:

  NP_647926.1 + NP_652709.1 → XP_002083669.1
    [2:1 many-to-one, event: tandem_dup]

  NP_001097507.1 → XP_016030475.1 [event=match]

  NP_647928.1 → XP_039149189.1 [event=match]

================================================================================
In [61]:
%%bash

# now lets introduce another interesting approach of how to use synteny anchors:
# we can use the alignments to iteratively find all regions in a set of species which show synteny with regions of interest
# e.g. lets run this for the halos. again, sequence margin/tolerances matter and different should be tried. im using a large 
# one because (1) the species are realtively closely related which reduces the risk of rearragements causing the too many regions 
# to be detected and (2) to actually include as much aas possible. this is the tradeoff
# the script is roughly doing the following:
# 1. extract the elements of interest (from the gff directories in our case) and the anchors around them
# 2. the DBScan algorithm is used to find clusters of anchors to define syntenic regions
# 3. iteratively extract new anchor matches based on the anchors extracted until no new regions are found

python3 get_syn_regions.py --cores 8 --iter 20 --threshold_new_region 1000000 --anchor_scope 500000 --gff halos/gff
Loaded 122 elements from GFF files
Total elements to process: 122
NT_033779.5 in genome GCF_000001215.4
NT_033779.5 in genome GCF_000001215.4
NT_037436.4 in genome GCF_000001215.4
NT_037436.4 in genome GCF_000001215.4
NT_037436.4 in genome GCF_000001215.4
NT_037436.4 in genome GCF_000001215.4
NT_037436.4 in genome GCF_000001215.4
NT_037436.4 in genome GCF_000001215.4
NT_037436.4 in genome GCF_000001215.4
NC_052527.2 in genome GCF_016746365.2
NC_052527.2 in genome GCF_016746365.2
NC_052529.2 in genome GCF_016746365.2
NC_052529.2 in genome GCF_016746365.2
NC_052529.2 in genome GCF_016746365.2
NC_052529.2 in genome GCF_016746365.2
NC_052529.2 in genome GCF_016746365.2
NC_052529.2 in genome GCF_016746365.2
NC_052529.2 in genome GCF_016746365.2
NW_024571981.1 in genome GCF_018153835.1
NW_024572964.1 in genome GCF_018153835.1
NW_024572964.1 in genome GCF_018153835.1
NW_024572964.1 in genome GCF_018153835.1
NW_024572964.1 in genome GCF_018153835.1
NW_024572964.1 in genome GCF_018153835.1
NW_024572964.1 in genome GCF_018153835.1
NW_024572964.1 in genome GCF_018153835.1
NW_024573038.1 in genome GCF_018153835.1
NW_024573454.1 in genome GCF_018153835.1
NC_045949.1 in genome GCF_004382195.2
NC_045949.1 in genome GCF_004382195.2
NC_045951.1 in genome GCF_004382195.2
NC_045951.1 in genome GCF_004382195.2
NC_045951.1 in genome GCF_004382195.2
NC_045951.1 in genome GCF_004382195.2
NC_045951.1 in genome GCF_004382195.2
NC_045951.1 in genome GCF_004382195.2
NC_045951.1 in genome GCF_004382195.2
NW_025814050.1 in genome GCF_018902025.1
NW_025814050.1 in genome GCF_018902025.1
NW_025814050.1 in genome GCF_018902025.1
NW_025814050.1 in genome GCF_018902025.1
NW_025814050.1 in genome GCF_018902025.1
NW_025814057.1 in genome GCF_018902025.1
NW_025814057.1 in genome GCF_018902025.1
NW_025814057.1 in genome GCF_018902025.1
NW_025814057.1 in genome GCF_018902025.1
NW_025814057.1 in genome GCF_018902025.1
NW_025814057.1 in genome GCF_018902025.1
NW_025814057.1 in genome GCF_018902025.1
NW_025814057.1 in genome GCF_018902025.1
NW_025814057.1 in genome GCF_018902025.1
NW_025814057.1 in genome GCF_018902025.1
NW_025814057.1 in genome GCF_018902025.1
NC_057928.1 in genome GCF_017639315.1
NC_057928.1 in genome GCF_017639315.1
NC_057928.1 in genome GCF_017639315.1
NC_057928.1 in genome GCF_017639315.1
NC_057928.1 in genome GCF_017639315.1
NC_057928.1 in genome GCF_017639315.1
NC_057928.1 in genome GCF_017639315.1
NC_057930.1 in genome GCF_017639315.1
NC_057930.1 in genome GCF_017639315.1
NC_046681.1 in genome GCF_009870125.1
NC_046681.1 in genome GCF_009870125.1
NC_046681.1 in genome GCF_009870125.1
NC_046683.1 in genome GCF_009870125.1
NC_046683.1 in genome GCF_009870125.1
NC_046683.1 in genome GCF_009870125.1
NC_046683.1 in genome GCF_009870125.1
NC_046683.1 in genome GCF_009870125.1
NC_046683.1 in genome GCF_009870125.1
NC_046683.1 in genome GCF_009870125.1
NC_046683.1 in genome GCF_009870125.1
NC_046683.1 in genome GCF_009870125.1
NC_052520.2 in genome GCF_016746395.2
NC_052520.2 in genome GCF_016746395.2
NC_052522.2 in genome GCF_016746395.2
NC_052522.2 in genome GCF_016746395.2
NC_052522.2 in genome GCF_016746395.2
NC_052522.2 in genome GCF_016746395.2
NC_052522.2 in genome GCF_016746395.2
NC_052522.2 in genome GCF_016746395.2
NC_052522.2 in genome GCF_016746395.2
NC_091545.1 in genome GCF_030788295.1
NC_091545.1 in genome GCF_030788295.1
NC_091545.1 in genome GCF_030788295.1
NC_091545.1 in genome GCF_030788295.1
NC_091545.1 in genome GCF_030788295.1
NC_091545.1 in genome GCF_030788295.1
NC_091545.1 in genome GCF_030788295.1
NC_091545.1 in genome GCF_030788295.1
NC_091545.1 in genome GCF_030788295.1
NC_091546.1 in genome GCF_030788295.1
NC_091546.1 in genome GCF_030788295.1
NW_027212798.1 in genome GCF_030788295.1
NW_027212798.1 in genome GCF_030788295.1
NW_027212798.1 in genome GCF_030788295.1
NW_027212798.1 in genome GCF_030788295.1
NW_027212798.1 in genome GCF_030788295.1
NC_091728.1 in genome GCF_030179895.1
NC_091728.1 in genome GCF_030179895.1
NC_091730.1 in genome GCF_030179895.1
NC_091730.1 in genome GCF_030179895.1
NC_091730.1 in genome GCF_030179895.1
NC_091730.1 in genome GCF_030179895.1
NC_091730.1 in genome GCF_030179895.1
NC_091730.1 in genome GCF_030179895.1
NC_091730.1 in genome GCF_030179895.1
NC_091730.1 in genome GCF_030179895.1
NC_091730.1 in genome GCF_030179895.1
NC_091730.1 in genome GCF_030179895.1
NC_091730.1 in genome GCF_030179895.1
NC_091730.1 in genome GCF_030179895.1
NC_091678.1 in genome GCF_030179915.1
NC_091678.1 in genome GCF_030179915.1
NC_091680.1 in genome GCF_030179915.1
NC_091680.1 in genome GCF_030179915.1
NC_091680.1 in genome GCF_030179915.1
NC_091680.1 in genome GCF_030179915.1
NC_091680.1 in genome GCF_030179915.1
NC_091680.1 in genome GCF_030179915.1
NC_091680.1 in genome GCF_030179915.1
iteration no 1
found 415 new regions after iteration 1
iteration no 2
found 0 new regions after iteration 2
exiting before iteration no 3 since no new regions found
writing final output file
In [ ]:
 
In [62]:
%%bash

# the output defines regions where potential orthologs to our elements of interest could lie
# BP == breakpoint. it just indicates that a region might be broken into two pieces while its syntenic partner is 
# connected in another species
cat syntenic_regions_succinct
GCF_000001215.4	NT_033779.5	17050	285850	NO BP
GCF_000001215.4	NT_033779.5	695500	2143484	NO BP
GCF_000001215.4	NT_033779.5	3004200	3376834	NO BP
GCF_000001215.4	NT_033779.5	4998800	6080850	NO BP
GCF_000001215.4	NT_033779.5	7195784	7492634	NO BP
GCF_000001215.4	NT_033779.5	7695400	7830050	NO BP
GCF_000001215.4	NT_033779.5	8033584	8142584	NO BP
GCF_000001215.4	NT_033779.5	10478900	11521934	NO BP
GCF_000001215.4	NT_033779.5	11979000	12118150	NO BP
GCF_000001215.4	NT_033779.5	13153300	13195450	NO BP
GCF_000001215.4	NT_033779.5	13904150	14057600	NO BP
GCF_000001215.4	NT_033779.5	15240500	15443434	NO BP
GCF_000001215.4	NT_033779.5	16279100	16338150	NO BP
GCF_000001215.4	NT_033779.5	16615100	17241700	NO BP
GCF_000001215.4	NT_033779.5	17792950	18150350	NO BP
GCF_000001215.4	NT_033779.5	18696400	18721950	NO BP
GCF_000001215.4	NT_033779.5	19419400	19453600	NO BP
GCF_000001215.4	NT_033779.5	19918500	20025250	NO BP
GCF_000001215.4	NT_033779.5	20704700	21206850	NO BP
GCF_000001215.4	NT_037436.4	201786	336586	NO BP
GCF_000001215.4	NT_037436.4	613950	840700	NO BP
GCF_000001215.4	NT_037436.4	1546350	1637450	NO BP
GCF_000001215.4	NT_037436.4	2770886	3309450	NO BP
GCF_000001215.4	NT_037436.4	4100400	5308216	NO BP
GCF_000001215.4	NT_037436.4	5903066	5940886	NO BP
GCF_000001215.4	NT_037436.4	7240836	7381236	NO BP
GCF_000001215.4	NT_037436.4	7968916	8611566	NO BP
GCF_000001215.4	NT_037436.4	9350216	9520966	NO BP
GCF_000001215.4	NT_037436.4	9694916	9913066	NO BP
GCF_000001215.4	NT_037436.4	10951466	11071816	NO BP
GCF_000001215.4	NT_037436.4	11948366	12135236	NO BP
GCF_000001215.4	NT_037436.4	12463116	14040166	NO BP
GCF_000001215.4	NT_037436.4	15559586	15586666	NO BP
GCF_000001215.4	NT_037436.4	16547066	16618566	NO BP
GCF_000001215.4	NT_037436.4	16778236	16897166	NO BP
GCF_000001215.4	NT_037436.4	17243066	17476116	NO BP
GCF_000001215.4	NT_037436.4	19245266	19878316	NO BP
GCF_000001215.4	NT_037436.4	20119316	20417586	NO BP
GCF_000001215.4	NT_037436.4	21814836	21959836	NO BP
GCF_000001215.4	NT_037436.4	22396816	22724686	NO BP
GCF_016746365.2	NC_052527.2	64444	335700	NO BP
GCF_016746365.2	NC_052527.2	733544	2176044	NO BP
GCF_016746365.2	NC_052527.2	3060300	3428400	NO BP
GCF_016746365.2	NC_052527.2	5158794	5284950	NO BP
GCF_016746365.2	NC_052527.2	5494700	5796394	NO BP
GCF_016746365.2	NC_052527.2	6967494	8017350	NO BP
GCF_016746365.2	NC_052527.2	8482900	8616700	NO BP
GCF_016746365.2	NC_052527.2	9016200	9482000	NO BP
GCF_016746365.2	NC_052527.2	10173344	10218144	NO BP
GCF_016746365.2	NC_052527.2	13238694	14279094	NO BP
GCF_016746365.2	NC_052527.2	19168194	19228744	NO BP
GCF_016746365.2	NC_052527.2	20073994	20283338	NO BP
GCF_016746365.2	NC_052528.2	4127714	4493414	NO BP
GCF_016746365.2	NC_052528.2	5538513	5880063	NO BP
GCF_016746365.2	NC_052528.2	6435213	7239313	NO BP
GCF_016746365.2	NC_052528.2	8501063	8539213	NO BP
GCF_016746365.2	NC_052528.2	8973563	9077214	NO BP
GCF_016746365.2	NC_052528.2	12315414	12444514	NO BP
GCF_016746365.2	NC_052529.2	222400	356350	NO BP
GCF_016746365.2	NC_052529.2	693200	876729	NO BP
GCF_016746365.2	NC_052529.2	1537600	1640000	NO BP
GCF_016746365.2	NC_052529.2	2269850	2302729	NO BP
GCF_016746365.2	NC_052529.2	2467279	2630600	NO BP
GCF_016746365.2	NC_052529.2	3365429	3914200	NO BP
GCF_016746365.2	NC_052529.2	4728379	5891450	NO BP
GCF_016746365.2	NC_052529.2	6544400	6575129	NO BP
GCF_016746365.2	NC_052529.2	7896179	8039579	NO BP
GCF_016746365.2	NC_052529.2	8618350	9231300	NO BP
GCF_016746365.2	NC_052529.2	9883000	9948329	NO BP
GCF_016746365.2	NC_052529.2	11040629	11171000	NO BP
GCF_016746365.2	NC_052529.2	12074950	12266729	NO BP
GCF_016746365.2	NC_052529.2	12604879	14218700	NO BP
GCF_016746365.2	NC_052529.2	16703879	16945800	NO BP
GCF_016746365.2	NC_052529.2	17291929	17402229	NO BP
GCF_016746365.2	NC_052529.2	17557129	17621600	NO BP
GCF_016746365.2	NC_052529.2	20183979	20455779	NO BP
GCF_016746365.2	NC_052529.2	20692529	21287529	NO BP
GCF_016746365.2	NC_052529.2	21998279	22148779	NO BP
GCF_016746365.2	NC_052529.2	22610000	22942829	NO BP
GCF_018153835.1	NW_024571881.1	92398	720050	NO BP
GCF_018153835.1	NW_024571981.1	258850	1374700	NO BP
GCF_018153835.1	NW_024571981.1	1585469	1789819	NO BP
GCF_018153835.1	NW_024571981.1	1985300	2687350	NO BP
GCF_018153835.1	NW_024571981.1	3317800	3569900	NO BP
GCF_018153835.1	NW_024572000.1	0	173700	NO BP
GCF_018153835.1	NW_024572672.1	639620	748920	NO BP
GCF_018153835.1	NW_024572675.1	443913	922313	NO BP
GCF_018153835.1	NW_024572700.1	2200	121180	NO BP
GCF_018153835.1	NW_024572939.1	0	37050	NO BP
GCF_018153835.1	NW_024572964.1	7461	136761	NO BP
GCF_018153835.1	NW_024572964.1	654650	891100	NO BP
GCF_018153835.1	NW_024572964.1	1560811	1665700	NO BP
GCF_018153835.1	NW_024572964.1	2862911	3398550	NO BP
GCF_018153835.1	NW_024572964.1	4235511	5450400	NO BP
GCF_018153835.1	NW_024572964.1	6054300	6092711	NO BP
GCF_018153835.1	NW_024572964.1	6422600	6615311	NO BP
GCF_018153835.1	NW_024572964.1	6950911	7244211	NO BP
GCF_018153835.1	NW_024572964.1	7490950	7557900	NO BP
GCF_018153835.1	NW_024572964.1	7820350	7909450	NO BP
GCF_018153835.1	NW_024573036.1	0	139608	NO BP
GCF_018153835.1	NW_024573036.1	784400	1169658	NO BP
GCF_018153835.1	NW_024573036.1	2168450	2465658	NO BP
GCF_018153835.1	NW_024573038.1	416753	741850	NO BP
GCF_018153835.1	NW_024573038.1	1187403	1339600	NO BP
GCF_018153835.1	NW_024573038.1	1720303	2215253	NO BP
GCF_018153835.1	NW_024573038.1	4089753	4340350	NO BP
GCF_018153835.1	NW_024573038.1	4683750	4805603	NO BP
GCF_018153835.1	NW_024573038.1	4965553	5031650	NO BP
GCF_018153835.1	NW_024573038.1	5550650	5636900	NO BP
GCF_018153835.1	NW_024573038.1	5785203	6005450	NO BP
GCF_018153835.1	NW_024573038.1	7058500	7183100	NO BP
GCF_018153835.1	NW_024573038.1	9828003	11449100	NO BP
GCF_018153835.1	NW_024573038.1	13133653	13162153	NO BP
GCF_018153835.1	NW_024573209.1	1454800	1613802	NO BP
GCF_018153835.1	NW_024573245.1	19225	309625	NO BP
GCF_018153835.1	NW_024573245.1	709125	815025	NO BP
GCF_018153835.1	NW_024573253.1	0	227800	NO BP
GCF_018153835.1	NW_024573287.1	146350	347889	NO BP
GCF_018153835.1	NW_024573311.1	131610	404310	NO BP
GCF_018153835.1	NW_024573311.1	1708710	1742310	NO BP
GCF_018153835.1	NW_024573311.1	2757310	2885210	NO BP
GCF_018153835.1	NW_024573387.1	775850	883080	NO BP
GCF_018153835.1	NW_024573410.1	0	401664	NO BP
GCF_018153835.1	NW_024573454.1	0	932300	NO BP
GCF_018153835.1	NW_024573532.1	0	117113	NO BP
GCF_018153835.1	NW_024573566.1	8000	120581	NO BP
GCF_018153835.1	NW_024573661.1	638278	723978	NO BP
GCF_018153835.1	NW_024573793.1	0	114350	NO BP
GCF_018153835.1	NW_024573870.1	678035	715450	NO BP
GCF_004382195.2	NC_045949.1	82050	344100	NO BP
GCF_004382195.2	NC_045949.1	727400	2108100	NO BP
GCF_004382195.2	NC_045949.1	2949750	3312150	NO BP
GCF_004382195.2	NC_045949.1	4909800	5981000	NO BP
GCF_004382195.2	NC_045949.1	7068050	7357200	NO BP
GCF_004382195.2	NC_045949.1	7556100	7691750	NO BP
GCF_004382195.2	NC_045949.1	7877850	7985250	NO BP
GCF_004382195.2	NC_045949.1	10268650	11275200	NO BP
GCF_004382195.2	NC_045949.1	11733300	11861000	NO BP
GCF_004382195.2	NC_045949.1	12860800	12903650	NO BP
GCF_004382195.2	NC_045949.1	13588350	13744450	NO BP
GCF_004382195.2	NC_045949.1	14930200	15130500	NO BP
GCF_004382195.2	NC_045949.1	15919000	15977200	NO BP
GCF_004382195.2	NC_045949.1	16209600	16875650	NO BP
GCF_004382195.2	NC_045949.1	17511600	17773350	NO BP
GCF_004382195.2	NC_045949.1	18315900	18342950	NO BP
GCF_004382195.2	NC_045949.1	19030850	19067100	NO BP
GCF_004382195.2	NC_045949.1	19497250	19602750	NO BP
GCF_004382195.2	NC_045949.1	20279650	20759500	NO BP
GCF_004382195.2	NC_045951.1	277450	415100	NO BP
GCF_004382195.2	NC_045951.1	683550	915300	NO BP
GCF_004382195.2	NC_045951.1	1554900	1656550	NO BP
GCF_004382195.2	NC_045951.1	2778650	3288050	NO BP
GCF_004382195.2	NC_045951.1	4076600	5238800	NO BP
GCF_004382195.2	NC_045951.1	5811150	5852450	NO BP
GCF_004382195.2	NC_045951.1	7094750	7158050	NO BP
GCF_004382195.2	NC_045951.1	7807400	8423600	NO BP
GCF_004382195.2	NC_045951.1	9131650	9293650	NO BP
GCF_004382195.2	NC_045951.1	9454000	9670850	NO BP
GCF_004382195.2	NC_045951.1	10678750	10810500	NO BP
GCF_004382195.2	NC_045951.1	11645700	11832400	NO BP
GCF_004382195.2	NC_045951.1	12159200	13704000	NO BP
GCF_004382195.2	NC_045951.1	16158200	16210150	NO BP
GCF_004382195.2	NC_045951.1	16368150	16479350	NO BP
GCF_004382195.2	NC_045951.1	16811900	17046400	NO BP
GCF_004382195.2	NC_045951.1	18745000	19323000	NO BP
GCF_004382195.2	NC_045951.1	19626500	19909900	NO BP
GCF_004382195.2	NC_045951.1	21260400	21411350	NO BP
GCF_004382195.2	NC_045951.1	21861100	22173900	NO BP
GCF_018902025.1	NW_025814050.1	2344750	2492100	NO BP
GCF_018902025.1	NW_025814050.1	3367850	4076750	NO BP
GCF_018902025.1	NW_025814050.1	4801850	4897545	NO BP
GCF_018902025.1	NW_025814050.1	5464950	5724145	NO BP
GCF_018902025.1	NW_025814050.1	6128450	6864800	NO BP
GCF_018902025.1	NW_025814050.1	7140250	7179645	NO BP
GCF_018902025.1	NW_025814050.1	8964150	9053350	NO BP
GCF_018902025.1	NW_025814050.1	9507045	9724150	NO BP
GCF_018902025.1	NW_025814050.1	10660850	11301600	NO BP
GCF_018902025.1	NW_025814050.1	11695445	12027350	NO BP
GCF_018902025.1	NW_025814050.1	12922300	14013450	NO BP
GCF_018902025.1	NW_025814050.1	14539450	14986500	NO BP
GCF_018902025.1	NW_025814050.1	16499045	16637200	NO BP
GCF_018902025.1	NW_025814050.1	18086650	19316050	NO BP
GCF_018902025.1	NW_025814050.1	20859350	21103050	NO BP
GCF_018902025.1	NW_025814050.1	21983100	22271945	NO BP
GCF_018902025.1	NW_025814050.1	23015895	24296366	POT BP1
GCF_018902025.1	NW_025814051.1	0	41800	POT BP2
GCF_018902025.1	NW_025814051.1	776178	975600	NO BP
GCF_018902025.1	NW_025814051.1	1154950	1331950	NO BP
GCF_018902025.1	NW_025814051.1	3115700	3616450	NO BP
GCF_018902025.1	NW_025814051.1	4479750	5061650	NO BP
GCF_018902025.1	NW_025814051.1	6101600	6418200	NO BP
GCF_018902025.1	NW_025814051.1	6607150	6682700	NO BP
GCF_018902025.1	NW_025814054.1	1541550	1614694	NO BP
GCF_018902025.1	NW_025814054.1	2301900	2615394	NO BP
GCF_018902025.1	NW_025814054.1	3338100	3409894	NO BP
GCF_018902025.1	NW_025814055.1	844250	924850	NO BP
GCF_018902025.1	NW_025814055.1	1372700	1486471	NO BP
GCF_018902025.1	NW_025814056.1	244486	309186	NO BP
GCF_018902025.1	NW_025814056.1	2963250	3073350	NO BP
GCF_018902025.1	NW_025814056.1	3894200	4275100	NO BP
GCF_018902025.1	NW_025814056.1	4786200	4980136	NO BP
GCF_018902025.1	NW_025814056.1	5519686	5912600	NO BP
GCF_018902025.1	NW_025814056.1	6837600	6951750	NO BP
GCF_018902025.1	NW_025814056.1	7134150	7214100	NO BP
GCF_018902025.1	NW_025814056.1	7643586	7940500	NO BP
GCF_018902025.1	NW_025814056.1	8100300	8267950	NO BP
GCF_018902025.1	NW_025814057.1	92600	1405550	NO BP
GCF_018902025.1	NW_025814057.1	1618200	1711550	NO BP
GCF_018902025.1	NW_025814057.1	2223950	2684050	NO BP
GCF_018902025.1	NW_025814057.1	2870500	2922050	NO BP
GCF_018902025.1	NW_025814057.1	4043400	5264711	NO BP
GCF_018902025.1	NW_025814057.1	6079411	6245300	NO BP
GCF_018902025.1	NW_025814057.1	6916700	7091050	NO BP
GCF_018902025.1	NW_025814057.1	8244500	8772400	NO BP
GCF_018902025.1	NW_025814057.1	9249700	9376611	NO BP
GCF_018902025.1	NW_025814057.1	10671811	11306361	NO BP
GCF_018902025.1	NW_025814058.1	41600	268413	NO BP
GCF_018902025.1	NW_025814058.1	500950	1042313	NO BP
GCF_018902025.1	NW_025814058.1	1198200	1309250	NO BP
GCF_018902025.1	NW_025814058.1	1799750	1939900	NO BP
GCF_018902025.1	NW_025814059.1	1009463	1038900	NO BP
GCF_018902025.1	NW_025814059.1	1908350	2311013	NO BP
GCF_018902025.1	NW_025814059.1	2678550	2878663	NO BP
GCF_017639315.1	NC_057928.1	4557850	5632300	NO BP
GCF_017639315.1	NC_057928.1	5969800	6152576	NO BP
GCF_017639315.1	NC_057928.1	7579350	7662250	NO BP
GCF_017639315.1	NC_057928.1	8853850	9106950	NO BP
GCF_017639315.1	NC_057928.1	9599100	9747450	NO BP
GCF_017639315.1	NC_057928.1	9949600	10190500	NO BP
GCF_017639315.1	NC_057928.1	10936050	11272050	NO BP
GCF_017639315.1	NC_057928.1	12131700	12475400	NO BP
GCF_017639315.1	NC_057928.1	12958350	13775250	NO BP
GCF_017639315.1	NC_057928.1	14391350	15458850	NO BP
GCF_017639315.1	NC_057928.1	16279450	16362676	NO BP
GCF_017639315.1	NC_057928.1	17746650	18088226	NO BP
GCF_017639315.1	NC_057928.1	18657850	19031650	NO BP
GCF_017639315.1	NC_057928.1	19450500	19553526	NO BP
GCF_017639315.1	NC_057928.1	20500800	20896600	NO BP
GCF_017639315.1	NC_057928.1	22948976	22986000	NO BP
GCF_017639315.1	NC_057928.1	23333526	23580800	NO BP
GCF_017639315.1	NC_057928.1	23748000	24903800	NO BP
GCF_017639315.1	NC_057930.1	7069023	7182400	NO BP
GCF_017639315.1	NC_057930.1	8852700	8921950	NO BP
GCF_017639315.1	NC_057930.1	9890400	9963450	NO BP
GCF_017639315.1	NC_057930.1	10890450	11369623	NO BP
GCF_017639315.1	NC_057930.1	11522550	11606223	NO BP
GCF_017639315.1	NC_057930.1	12603800	12992300	NO BP
GCF_017639315.1	NC_057930.1	13215150	14319273	NO BP
GCF_017639315.1	NC_057930.1	14902050	15023150	NO BP
GCF_017639315.1	NC_057930.1	15214900	15303650	NO BP
GCF_017639315.1	NC_057930.1	16366023	16543300	NO BP
GCF_017639315.1	NC_057930.1	16793400	16952073	NO BP
GCF_017639315.1	NC_057930.1	17213473	17696400	NO BP
GCF_017639315.1	NC_057930.1	18662723	19143300	NO BP
GCF_017639315.1	NC_057930.1	19855650	19946050	NO BP
GCF_017639315.1	NC_057930.1	20580750	20647150	NO BP
GCF_017639315.1	NC_057930.1	21009223	21206623	NO BP
GCF_017639315.1	NC_057930.1	21467800	21557600	NO BP
GCF_017639315.1	NC_057930.1	22112873	22437473	NO BP
GCF_017639315.1	NC_057930.1	23766200	25160250	NO BP
GCF_017639315.1	NC_057930.1	25754473	26189150	NO BP
GCF_017639315.1	NC_057930.1	26677700	26768773	NO BP
GCF_017639315.1	NC_057930.1	26976600	27103150	NO BP
GCF_017639315.1	NC_057932.1	20486550	20511099	NO BP
GCF_009870125.1	NC_046681.1	3437092	3929600	NO BP
GCF_009870125.1	NC_046681.1	5656600	5816300	NO BP
GCF_009870125.1	NC_046681.1	6207300	7162800	NO BP
GCF_009870125.1	NC_046681.1	7925450	8087950	NO BP
GCF_009870125.1	NC_046681.1	8500942	8674250	NO BP
GCF_009870125.1	NC_046681.1	9418150	10659850	NO BP
GCF_009870125.1	NC_046681.1	16510700	17240550	NO BP
GCF_009870125.1	NC_046681.1	18526842	18762750	NO BP
GCF_009870125.1	NC_046681.1	19409792	19461800	NO BP
GCF_009870125.1	NC_046681.1	20522250	20683150	NO BP
GCF_009870125.1	NC_046681.1	20881050	20963042	NO BP
GCF_009870125.1	NC_046681.1	23444842	23520400	NO BP
GCF_009870125.1	NC_046681.1	24066850	25438950	NO BP
GCF_009870125.1	NC_046681.1	26210400	26912550	NO BP
GCF_009870125.1	NC_046681.1	27311050	28521850	NO BP
GCF_009870125.1	NC_046681.1	29339350	29744042	NO BP
GCF_009870125.1	NC_046681.1	29936100	30212500	NO BP
GCF_009870125.1	NC_046681.1	30450342	30566350	NO BP
GCF_009870125.1	NC_046683.1	42680055	42708268	NO BP
GCF_009870125.1	NC_046683.1	44464655	44731555	NO BP
GCF_009870125.1	NC_046683.1	45761618	46086755	NO BP
GCF_009870125.1	NC_046683.1	46760505	47058855	NO BP
GCF_009870125.1	NC_046683.1	47674055	48673768	NO BP
GCF_009870125.1	NC_046683.1	50793118	50849018	NO BP
GCF_009870125.1	NC_046683.1	51673205	52136905	NO BP
GCF_009870125.1	NC_046683.1	52458918	52637905	NO BP
GCF_009870125.1	NC_046683.1	55936055	55992705	NO BP
GCF_009870125.1	NC_046683.1	56242868	56600855	NO BP
GCF_009870125.1	NC_046683.1	56993905	57043768	NO BP
GCF_009870125.1	NC_046683.1	57607318	57862668	NO BP
GCF_009870125.1	NC_046683.1	58593768	59818018	NO BP
GCF_009870125.1	NC_046683.1	60070868	60174655	NO BP
GCF_009870125.1	NC_046683.1	60698655	60897718	NO BP
GCF_009870125.1	NC_046683.1	61374918	61817005	NO BP
GCF_009870125.1	NC_046683.1	62201468	62452668	NO BP
GCF_009870125.1	NC_046683.1	62741968	62813418	NO BP
GCF_009870125.1	NC_046683.1	63033068	63214168	NO BP
GCF_009870125.1	NC_046683.1	64545605	64888518	NO BP
GCF_009870125.1	NC_046683.1	65041305	65706718	NO BP
GCF_009870125.1	NC_046683.1	65888268	66031105	NO BP
GCF_009870125.1	NC_046683.1	67013655	67193305	NO BP
GCF_016746395.2	NC_052520.2	82000	351450	NO BP
GCF_016746395.2	NC_052520.2	740300	2142800	NO BP
GCF_016746395.2	NC_052520.2	3004350	3362650	NO BP
GCF_016746395.2	NC_052520.2	5001650	6055650	NO BP
GCF_016746395.2	NC_052520.2	7154300	7452750	NO BP
GCF_016746395.2	NC_052520.2	7654300	7785800	NO BP
GCF_016746395.2	NC_052520.2	7977950	8086550	NO BP
GCF_016746395.2	NC_052520.2	10386100	11425750	NO BP
GCF_016746395.2	NC_052520.2	11865800	11993950	NO BP
GCF_016746395.2	NC_052520.2	12998350	13041200	NO BP
GCF_016746395.2	NC_052520.2	13633800	13880600	NO BP
GCF_016746395.2	NC_052520.2	15047800	15249600	NO BP
GCF_016746395.2	NC_052520.2	16018400	16077300	NO BP
GCF_016746395.2	NC_052520.2	16288850	16976100	NO BP
GCF_016746395.2	NC_052520.2	17549400	17896150	NO BP
GCF_016746395.2	NC_052520.2	18444450	18469350	NO BP
GCF_016746395.2	NC_052520.2	19149550	19182200	NO BP
GCF_016746395.2	NC_052520.2	19617250	19718300	NO BP
GCF_016746395.2	NC_052520.2	20381213	20861163	NO BP
GCF_016746395.2	NC_052522.2	209600	350100	NO BP
GCF_016746395.2	NC_052522.2	684650	864050	NO BP
GCF_016746395.2	NC_052522.2	1504550	1604800	NO BP
GCF_016746395.2	NC_052522.2	2746600	3261500	NO BP
GCF_016746395.2	NC_052522.2	4049080	5163400	NO BP
GCF_016746395.2	NC_052522.2	5800650	5833300	NO BP
GCF_016746395.2	NC_052522.2	7116950	7158050	NO BP
GCF_016746395.2	NC_052522.2	7822750	8451200	NO BP
GCF_016746395.2	NC_052522.2	9166600	9331450	NO BP
GCF_016746395.2	NC_052522.2	9495600	9711080	NO BP
GCF_016746395.2	NC_052522.2	10744850	10862530	NO BP
GCF_016746395.2	NC_052522.2	11727650	11913030	NO BP
GCF_016746395.2	NC_052522.2	12244050	13807900	NO BP
GCF_016746395.2	NC_052522.2	15311600	15338050	NO BP
GCF_016746395.2	NC_052522.2	16286300	16347350	NO BP
GCF_016746395.2	NC_052522.2	16498200	16612000	NO BP
GCF_016746395.2	NC_052522.2	16953480	17188100	NO BP
GCF_016746395.2	NC_052522.2	18926100	19541200	NO BP
GCF_016746395.2	NC_052522.2	19795880	20074480	NO BP
GCF_016746395.2	NC_052522.2	21428030	21573630	NO BP
GCF_016746395.2	NC_052522.2	22025550	22347730	NO BP
GCF_030788295.1	NC_091544.1	20937561	20994600	NO BP
GCF_030788295.1	NC_091545.1	32750	159112	NO BP
GCF_030788295.1	NC_091545.1	592150	1849650	NO BP
GCF_030788295.1	NC_091545.1	3223300	3295350	NO BP
GCF_030788295.1	NC_091545.1	3649300	3923450	NO BP
GCF_030788295.1	NC_091545.1	4077562	4394162	NO BP
GCF_030788295.1	NC_091545.1	7087512	7147650	NO BP
GCF_030788295.1	NC_091545.1	7534450	8117512	NO BP
GCF_030788295.1	NC_091545.1	9676462	9896912	NO BP
GCF_030788295.1	NC_091545.1	10085462	10452162	NO BP
GCF_030788295.1	NC_091545.1	10681162	10987900	NO BP
GCF_030788295.1	NC_091545.1	11213212	11323100	NO BP
GCF_030788295.1	NC_091545.1	11577150	11610412	NO BP
GCF_030788295.1	NC_091545.1	12046700	12105050	NO BP
GCF_030788295.1	NC_091545.1	12417400	12461250	NO BP
GCF_030788295.1	NC_091545.1	13204500	13305400	NO BP
GCF_030788295.1	NC_091545.1	14200950	15327712	NO BP
GCF_030788295.1	NC_091545.1	15602262	15666562	NO BP
GCF_030788295.1	NC_091545.1	15867300	15961512	NO BP
GCF_030788295.1	NC_091545.1	16914212	17292150	NO BP
GCF_030788295.1	NC_091545.1	17713200	17851150	NO BP
GCF_030788295.1	NC_091545.1	19154062	19520150	NO BP
GCF_030788295.1	NC_091545.1	19687512	19773350	NO BP
GCF_030788295.1	NC_091545.1	20201162	20229262	NO BP
GCF_030788295.1	NC_091545.1	20600162	20706150	NO BP
GCF_030788295.1	NC_091545.1	21127262	21302200	NO BP
GCF_030788295.1	NC_091545.1	21470912	22085800	NO BP
GCF_030788295.1	NC_091545.1	22960500	23087112	NO BP
GCF_030788295.1	NC_091545.1	24802650	24922700	NO BP
GCF_030788295.1	NC_091545.1	25142862	25433762	NO BP
GCF_030788295.1	NC_091545.1	26305400	26371450	NO BP
GCF_030788295.1	NC_091546.1	5900	129553	NO BP
GCF_030788295.1	NC_091546.1	742103	1347550	NO BP
GCF_030788295.1	NC_091546.1	2150653	2217100	NO BP
GCF_030788295.1	NC_091546.1	2558653	3753753	NO BP
GCF_030788295.1	NC_091546.1	5758100	6208950	NO BP
GCF_030788295.1	NC_091546.1	6634600	6789150	NO BP
GCF_030788295.1	NC_091546.1	7088853	7718253	NO BP
GCF_030788295.1	NC_091546.1	8450950	8696903	NO BP
GCF_030788295.1	NC_091546.1	9907100	9952253	NO BP
GCF_030788295.1	NC_091546.1	10170100	10688103	NO BP
GCF_030788295.1	NC_091546.1	11018603	11234153	NO BP
GCF_030788295.1	NC_091546.1	12289550	13133150	NO BP
GCF_030788295.1	NC_091546.1	14216753	14358350	NO BP
GCF_030788295.1	NC_091546.1	14550350	14772053	NO BP
GCF_030788295.1	NC_091546.1	14985300	16151403	NO BP
GCF_030788295.1	NC_091546.1	16625050	16809400	NO BP
GCF_030788295.1	NC_091546.1	17316253	17376550	NO BP
GCF_030788295.1	NC_091546.1	18918250	20064103	NO BP
GCF_030788295.1	NC_091546.1	20865003	21209653	NO BP
GCF_030788295.1	NC_091546.1	21922950	22040650	NO BP
GCF_030788295.1	NC_091546.1	22285603	22341950	NO BP
GCF_030788295.1	NC_091546.1	24698803	24756000	NO BP
GCF_030788295.1	NC_091546.1	25707100	25739903	NO BP
GCF_030788295.1	NC_091546.1	27994750	28264053	NO BP
GCF_030788295.1	NW_027212798.1	0	59877	POT BP1
GCF_030788295.1	NW_027212799.1	0	19807	POT BP2
GCF_030179895.1	NC_091728.1	639676	817500	NO BP
GCF_030179895.1	NC_091728.1	1270400	1601350	NO BP
GCF_030179895.1	NC_091728.1	2448900	2626676	NO BP
GCF_030179895.1	NC_091728.1	5336100	5877472	NO BP
GCF_030179895.1	NC_091728.1	6342822	6374372	NO BP
GCF_030179895.1	NC_091728.1	6745822	7317472	NO BP
GCF_030179895.1	NC_091728.1	7788850	7843922	NO BP
GCF_030179895.1	NC_091728.1	8863222	9083672	NO BP
GCF_030179895.1	NC_091728.1	9510172	9911650	NO BP
GCF_030179895.1	NC_091728.1	10433772	10714322	NO BP
GCF_030179895.1	NC_091728.1	11453300	11605772	NO BP
GCF_030179895.1	NC_091728.1	12294372	14166822	NO BP
GCF_030179895.1	NC_091728.1	14437350	14525822	NO BP
GCF_030179895.1	NC_091728.1	15420950	15993250	NO BP
GCF_030179895.1	NC_091728.1	16549500	16701100	NO BP
GCF_030179895.1	NC_091728.1	17747772	18122350	NO BP
GCF_030179895.1	NC_091728.1	18277622	18334422	NO BP
GCF_030179895.1	NC_091728.1	18912750	18999022	NO BP
GCF_030179895.1	NC_091728.1	20079572	20136372	NO BP
GCF_030179895.1	NC_091728.1	20362250	20696138	NO BP
GCF_030179895.1	NC_091728.1	20925550	20986038	NO BP
GCF_030179895.1	NC_091728.1	21198600	21572850	NO BP
GCF_030179895.1	NC_091728.1	21969411	22199161	NO BP
GCF_030179895.1	NC_091728.1	22441461	22476911	NO BP
GCF_030179895.1	NC_091729.1	8712161	8896311	NO BP
GCF_030179895.1	NC_091729.1	9242574	9738511	NO BP
GCF_030179895.1	NC_091730.1	886036	1038536	NO BP
GCF_030179895.1	NC_091730.1	3097689	3406789	NO BP
GCF_030179895.1	NC_091730.1	3878939	3933039	NO BP
GCF_030179895.1	NC_091730.1	4780581	4823631	NO BP
GCF_030179895.1	NC_091730.1	4980189	5295739	NO BP
GCF_030179895.1	NC_091730.1	6160489	6673989	NO BP
GCF_030179895.1	NC_091730.1	7063939	7156489	NO BP
GCF_030179895.1	NC_091730.1	7895581	8008589	NO BP
GCF_030179895.1	NC_091730.1	8433889	9204589	NO BP
GCF_030179895.1	NC_091730.1	10291489	10320181	NO BP
GCF_030179895.1	NC_091730.1	10588781	10634231	NO BP
GCF_030179895.1	NC_091730.1	12011089	12480889	NO BP
GCF_030179895.1	NC_091730.1	13554281	14588639	NO BP
GCF_030179895.1	NC_091730.1	15798639	16124039	NO BP
GCF_030179895.1	NC_091730.1	18156481	18198831	NO BP
GCF_030179895.1	NC_091730.1	18376931	19000031	NO BP
GCF_030179895.1	NC_091730.1	19210639	19243089	NO BP
GCF_030179895.1	NC_091730.1	19844739	20182639	NO BP
GCF_030179895.1	NC_091730.1	20446439	20496539	NO BP
GCF_030179895.1	NC_091730.1	21659289	22842331	NO BP
GCF_030179895.1	NC_091730.1	24032731	24559250	NO BP
GCF_030179895.1	NC_091730.1	24824431	25829750	NO BP
GCF_030179915.1	NC_091678.1	108100	433600	NO BP
GCF_030179915.1	NC_091678.1	909200	1302050	NO BP
GCF_030179915.1	NC_091678.1	3076050	3333650	NO BP
GCF_030179915.1	NC_091678.1	3666950	3869850	NO BP
GCF_030179915.1	NC_091678.1	5112300	6503000	NO BP
GCF_030179915.1	NC_091678.1	7137350	7774646	NO BP
GCF_030179915.1	NC_091678.1	8531550	8563150	NO BP
GCF_030179915.1	NC_091678.1	9120100	9233396	NO BP
GCF_030179915.1	NC_091678.1	9591496	9616646	NO BP
GCF_030179915.1	NC_091678.1	10153646	10344100	NO BP
GCF_030179915.1	NC_091678.1	11980650	12047546	NO BP
GCF_030179915.1	NC_091678.1	12327646	12578700	NO BP
GCF_030179915.1	NC_091678.1	12828650	13084650	NO BP
GCF_030179915.1	NC_091678.1	15372651	15448400	NO BP
GCF_030179915.1	NC_091678.1	15710601	16076351	NO BP
GCF_030179915.1	NC_091678.1	16809001	16978900	NO BP
GCF_030179915.1	NC_091678.1	17716850	18071801	NO BP
GCF_030179915.1	NC_091678.1	18255801	18746551	NO BP
GCF_030179915.1	NC_091678.1	19364850	19529650	NO BP
GCF_030179915.1	NC_091678.1	20044701	20283651	NO BP
GCF_030179915.1	NC_091678.1	21193001	21223801	NO BP
GCF_030179915.1	NC_091678.1	21865500	22213550	NO BP
GCF_030179915.1	NC_091678.1	22995501	24095601	NO BP
GCF_030179915.1	NC_091678.1	25477950	26323650	NO BP
GCF_030179915.1	NC_091678.1	27351651	28180150	NO BP
GCF_030179915.1	NC_091680.1	102680	354480	NO BP
GCF_030179915.1	NC_091680.1	975900	1281430	NO BP
GCF_030179915.1	NC_091680.1	2147130	2276080	NO BP
GCF_030179915.1	NC_091680.1	3768750	4459800	NO BP
GCF_030179915.1	NC_091680.1	5507350	6952350	NO BP
GCF_030179915.1	NC_091680.1	7732100	7797180	NO BP
GCF_030179915.1	NC_091680.1	7956930	8286700	NO BP
GCF_030179915.1	NC_091680.1	8559700	8717150	NO BP
GCF_030179915.1	NC_091680.1	9541000	9663350	NO BP
GCF_030179915.1	NC_091680.1	9824430	9910550	NO BP
GCF_030179915.1	NC_091680.1	11008730	11187900	NO BP
GCF_030179915.1	NC_091680.1	11998550	12174880	NO BP
GCF_030179915.1	NC_091680.1	13126230	13260400	NO BP
GCF_030179915.1	NC_091680.1	13436430	13694350	NO BP
GCF_030179915.1	NC_091680.1	15196050	15236330	NO BP
GCF_030179915.1	NC_091680.1	15609780	15802530	NO BP
GCF_030179915.1	NC_091680.1	16172650	16560950	NO BP
GCF_030179915.1	NC_091680.1	17733707	18615476	NO BP
GCF_030179915.1	NC_091680.1	19045680	19139630	NO BP
GCF_030179915.1	NC_091680.1	19388830	19426549	NO BP
GCF_030179915.1	NC_091680.1	20312549	20376580	NO BP
GCF_030179915.1	NC_091680.1	20533649	20674649	NO BP
GCF_030179915.1	NC_091680.1	22861199	23238449	NO BP
GCF_030179915.1	NC_091680.1	24532999	24565180	NO BP
GCF_030179915.1	NC_091680.1	25104949	25245280	NO BP
GCF_030179915.1	NC_091680.1	25459680	26165080	NO BP
GCF_030179915.1	NC_091680.1	26317880	26505930	NO BP
GCF_030179915.1	NC_091680.1	27099499	27917530	NO BP
In [65]:
%%bash

# this output is arguably very useful: you can use it as a "synteny lense" and then run resource-intensive programs
# only on those regions to find or not find potential homologs/orthologs of interest instead of the whole genome
# e.g. lets discover more potential orthologs for the halo elements

# here we will: 
# 1. write those sequences using the genomes which i put in out/utils
python3 extract_syntenic_fastas.py syntenic_regions_succinct ../../utils/genomes syn_regions
# 2. run a blast search (+ clasp chaining) with low word size and high e-value on them
cat halos/proteins/* > all_halos.faa
python3 run_blast_on_regions.py \
    --fastas_dir syn_regions/ \
    --output_dir syn_regions_blast_out \
    --clasp_path ../../utils/clasp.x \
    --cores 8 \
    --evalue 0.1 \
    --word_size 2 \
    all_halos.faa tblastn 
# 3. convert that into input for the synthology script
python3 blast_results_to_synthology.py \
    --deduplicate_overlaps \
    --genomes_dir ../../utils/genomes \
    --output_dir halos_from_syn_regions \
    --mode protein \
    syn_regions_blast_out
GCF_000001215.4: 40 regions extracted
GCF_016746365.2: 39 regions extracted
GCF_018153835.1: 50 regions extracted
GCF_004382195.2: 39 regions extracted
GCF_018902025.1: 55 regions extracted
GCF_017639315.1: 41 regions extracted
GCF_009870125.1: 41 regions extracted
GCF_016746395.2: 40 regions extracted
GCF_030788295.1: 57 regions extracted
GCF_030179895.1: 48 regions extracted
GCF_030179915.1: 53 regions extracted
11 organisms with FASTA files
Organisms: ['GCF_000001215.4', 'GCF_030788295.1', 'GCF_030179895.1', 'GCF_009870125.1', 'GCF_018902025.1', 'GCF_016746395.2', 'GCF_017639315.1', 'GCF_030179915.1', 'GCF_018153835.1', 'GCF_016746365.2', 'GCF_004382195.2']
Done
11 organisms with temp_coords files
Deduplication of overlapping hits is ENABLED
  GCF_009870125.1: Deduplicated 721 hits to 16 (705 merged)
GCF_009870125.1: 16 sequences extracted
  GCF_018902025.1: Deduplicated 677 hits to 14 (663 merged)
GCF_018902025.1: 14 sequences extracted
  GCF_030179895.1: Deduplicated 705 hits to 15 (690 merged)
GCF_030179895.1: 15 sequences extracted
  GCF_030788295.1: Deduplicated 1100 hits to 19 (1081 merged)
GCF_030788295.1: 19 sequences extracted
  GCF_000001215.4: Deduplicated 747 hits to 13 (734 merged)
GCF_000001215.4: 13 sequences extracted
  GCF_016746365.2: Deduplicated 748 hits to 12 (736 merged)
GCF_016746365.2: 12 sequences extracted
  GCF_004382195.2: Deduplicated 748 hits to 12 (736 merged)
GCF_004382195.2: 12 sequences extracted
  GCF_016746395.2: Deduplicated 748 hits to 12 (736 merged)
GCF_016746395.2: 12 sequences extracted
  GCF_017639315.1: Deduplicated 752 hits to 13 (739 merged)
GCF_017639315.1: 13 sequences extracted
  GCF_030179915.1: Deduplicated 689 hits to 15 (674 merged)
GCF_030179915.1: 15 sequences extracted
  GCF_018153835.1: Deduplicated 699 hits to 13 (686 merged)
GCF_018153835.1: 13 sequences extracted
Done
In [69]:
%%bash 

# now here we have input for run_synthology_prot.py
# those are all chained blast hits within those regions

ls halos_from_syn_regions/*

# you could run:
#python3 run_synthology_prot.py \
#    --annotation_dir halos_from_syn_regions \
# with some other parameters like above defined to now investigate whether the official annotations missed any potential halo genes or pseudogenes
halos_from_syn_regions/gff:
GCF_000001215.4.gff
GCF_004382195.2.gff
GCF_009870125.1.gff
GCF_016746365.2.gff
GCF_016746395.2.gff
GCF_017639315.1.gff
GCF_018153835.1.gff
GCF_018902025.1.gff
GCF_030179895.1.gff
GCF_030179915.1.gff
GCF_030788295.1.gff

halos_from_syn_regions/proteins:
GCF_000001215.4.faa
GCF_004382195.2.faa
GCF_009870125.1.faa
GCF_016746365.2.faa
GCF_016746395.2.faa
GCF_017639315.1.faa
GCF_018153835.1.faa
GCF_018902025.1.faa
GCF_030179895.1.faa
GCF_030179915.1.faa
GCF_030788295.1.faa
In [ ]:
%%bash

# lets turn to run_synthology_nucl.py
# this was developed for comparing nucleotide instead of protein sequences
# e.g. for RNAs this is very useful
# in trnas we have annotations of tRNAs for the Drosophilas
# those are typically hard to differentiate basd on sequence similarity alone and often appear in tandem arrays
# this will take ~1-2 hours on a normal desktop computer with 8 cores
# if you dont want to wait like me, the Zenodo archive contains the results as:
# syn_out_nucl_preloaded. i replaced the actual output "synthology_out_nucl" with that directory for the analysis below

python3 run_synthology_nucl.py \
      --cores 8 \
      --col ../MCScanX/MCScanX.collinearity \
      --homology_threshold 40 \
      --annotation_dir trnas \
      --new_blast y \
      --mapping_file ../../utils/mapping \
      --pairwise_alignments ../../utils/pairwise_alignments_table \
      --use-clasp y \
      --blast-mode chains \
      --blastn-path blastn \
      --clasp-path ../../utils/clasp.x \
      --output_dir synthology_out_nucl
In [93]:
%%bash

# there are hundreds of tRNAs per genome, so lets leave the detailed analysis to the experts
# here we can plot some overall statistics answering general questions like:
# how many genes are placed into a multi- or single-copy group and how many are singletons globally and per species
# (these scripts work generically also with run_synthology_prot.py output)

# 1. Compute element statistics (genes in synorthogroups)
python3 compute_element_stats.py syn_out_nucl_preloaded

# 2. Generate SVG plots (singletons, single-copy, multi-copy distributions)
python3 plot_element_stats.py syn_out_nucl_preloaded

# 3. Generate LaTeX tables
python3 make_element_tables.py syn_out_nucl_preloaded
cd syn_out_nucl_preloaded/element_tables
for tex_file in *.tex; do
    pdflatex -interaction=nonstopmode "$tex_file"
done
Loading final union graph...
Loaded graph with 3970 nodes and 12828 edges
Gene-species mapping: 4477 genes
Genes filtered out during pipeline: 507
  Run trace_filtered_genes.py first for detailed filtering analysis
Species: 11
Connected components: 366

============================================================
GLOBAL ELEMENT STATISTICS
============================================================
Total elements in input: 4477
Filtered during pipeline: 507 (11.3%)
In final graph: 3970 (88.7%)

Classification of 3970 elements in final graph:
  Singletons: 59 (1.5%)
  In single-copy groups: 945 (23.8%)
  In multi-copy groups: 2966 (74.7%)
  Within-species duplicates*: 2256

Orthology groups:
  Single-copy groups: 133
  Multi-copy groups: 174

============================================================
PER-SPECIES ELEMENT STATISTICS
============================================================

GCF_000001215.4:
  Total in input: 374
  Filtered: 22
  In final graph: 352
    Singletons: 3 (0.9%)
    In single-copy groups: 81 (23.0%)
    In multi-copy groups: 268 (76.1%)
      - Self-duplicated*: 204
  Top ortholog partners:
    - GCF_030179915.1: 309 edges
    - GCF_030179895.1: 260 edges
    - GCF_004382195.2: 245 edges

GCF_004382195.2:
  Total in input: 384
  In final graph: 384
    Singletons: 2 (0.5%)
    In single-copy groups: 94 (24.5%)
    In multi-copy groups: 288 (75.0%)
      - Self-duplicated*: 219
  Top ortholog partners:
    - GCF_030179915.1: 337 edges
    - GCF_030179895.1: 271 edges
    - GCF_009870125.1: 261 edges

GCF_009870125.1:
  Total in input: 424
  Filtered: 63
  In final graph: 361
    Singletons: 6 (1.7%)
    In single-copy groups: 83 (23.0%)
    In multi-copy groups: 272 (75.3%)
      - Self-duplicated*: 199
  Top ortholog partners:
    - GCF_030179895.1: 271 edges
    - GCF_016746365.2: 264 edges
    - GCF_004382195.2: 261 edges

GCF_016746365.2:
  Total in input: 417
  Filtered: 24
  In final graph: 393
    Singletons: 5 (1.3%)
    In single-copy groups: 88 (22.4%)
    In multi-copy groups: 300 (76.3%)
      - Self-duplicated*: 235
  Top ortholog partners:
    - GCF_030179915.1: 330 edges
    - GCF_016746395.2: 292 edges
    - GCF_030179895.1: 279 edges

GCF_016746395.2:
  Total in input: 383
  In final graph: 383
    Singletons: 0 (0.0%)
    In single-copy groups: 94 (24.5%)
    In multi-copy groups: 289 (75.5%)
      - Self-duplicated*: 229
  Top ortholog partners:
    - GCF_030179915.1: 325 edges
    - GCF_016746365.2: 292 edges
    - GCF_030179895.1: 267 edges

GCF_017639315.1:
  Total in input: 509
  Filtered: 138
  In final graph: 371
    Singletons: 5 (1.3%)
    In single-copy groups: 97 (26.1%)
    In multi-copy groups: 269 (72.5%)
      - Self-duplicated*: 201
  Top ortholog partners:
    - GCF_030179895.1: 291 edges
    - GCF_030179915.1: 285 edges
    - GCF_016746365.2: 252 edges

GCF_018153835.1:
  Total in input: 377
  Filtered: 13
  In final graph: 364
    Singletons: 3 (0.8%)
    In single-copy groups: 91 (25.0%)
    In multi-copy groups: 270 (74.2%)
      - Self-duplicated*: 212
  Top ortholog partners:
    - GCF_030179915.1: 293 edges
    - GCF_016746365.2: 251 edges
    - GCF_004382195.2: 238 edges

GCF_018902025.1:
  Total in input: 392
  Filtered: 115
  In final graph: 277
    Singletons: 13 (4.7%)
    In single-copy groups: 54 (19.5%)
    In multi-copy groups: 210 (75.8%)
      - Self-duplicated*: 157
  Top ortholog partners:
    - GCF_030179915.1: 214 edges
    - GCF_004382195.2: 200 edges
    - GCF_016746395.2: 197 edges

GCF_030179895.1:
  Total in input: 404
  Filtered: 22
  In final graph: 382
    Singletons: 8 (2.1%)
    In single-copy groups: 92 (24.1%)
    In multi-copy groups: 282 (73.8%)
      - Self-duplicated*: 213
  Top ortholog partners:
    - GCF_017639315.1: 291 edges
    - GCF_030179915.1: 285 edges
    - GCF_016746365.2: 279 edges

GCF_030179915.1:
  Total in input: 422
  Filtered: 12
  In final graph: 410
    Singletons: 3 (0.7%)
    In single-copy groups: 96 (23.4%)
    In multi-copy groups: 311 (75.9%)
      - Self-duplicated*: 254
  Top ortholog partners:
    - GCF_004382195.2: 337 edges
    - GCF_016746365.2: 330 edges
    - GCF_016746395.2: 325 edges

GCF_030788295.1:
  Total in input: 391
  Filtered: 98
  In final graph: 293
    Singletons: 11 (3.8%)
    In single-copy groups: 75 (25.6%)
    In multi-copy groups: 207 (70.6%)
      - Self-duplicated*: 133
  Top ortholog partners:
    - GCF_030179915.1: 223 edges
    - GCF_030179895.1: 213 edges
    - GCF_016746365.2: 206 edges

Saved element statistics to syn_out_nucl_preloaded/genetic_elements_stats/element_statistics.pickle
Loading element statistics...

Generating global element classification plot...
  Saved global classification plot

Generating per-species stacked bar plot...
  Saved per-species stacked bar plot

Generating per-species subplot grid...
  Saved per-species grid plot

Generating species connectivity histograms...
  Saved species connectivity histogram

Generating top ortholog partners plot...
  Saved top partners plot

All plots saved to syn_out_nucl_preloaded/genetic_elements_stats/plots/
Loading element statistics...
Saved global summary table: syn_out_nucl_preloaded/genetic_elements_stats/tables/global_element_summary.tex
Saved per-species summary table: syn_out_nucl_preloaded/genetic_elements_stats/tables/per_species_element_summary.tex
Saved top partners table: syn_out_nucl_preloaded/genetic_elements_stats/tables/per_species_top_partners.tex

All tables saved to syn_out_nucl_preloaded/genetic_elements_stats/tables/
This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) (preloaded format=pdflatex)
 restricted \write18 enabled.
entering extended mode
(./global_element_summary.tex
LaTeX2e <2023-11-01>
L3 programming layer <2023-11-01>
(/usr/local/texlive/2022/texmf-dist/tex/latex/base/article.cls
Document Class: article 2023/05/17 v1.4n Standard LaTeX document class
(/usr/local/texlive/2022/texmf-dist/tex/latex/base/size11.clo))
(/usr/local/texlive/2022/texmf-dist/tex/latex/base/inputenc.sty)
(/usr/local/texlive/2022/texmf-dist/tex/latex/booktabs/booktabs.sty)
(/usr/local/texlive/2022/texmf-dist/tex/latex/geometry/geometry.sty
(/usr/local/texlive/2022/texmf-dist/tex/latex/graphics/keyval.sty)
(/usr/local/texlive/2022/texmf-dist/tex/generic/iftex/ifvtex.sty
(/usr/local/texlive/2022/texmf-dist/tex/generic/iftex/iftex.sty)))
(/usr/local/texlive/2022/texmf-dist/tex/latex/l3backend/l3backend-pdftex.def)
(./global_element_summary.aux)
*geometry* driver: auto-detecting
*geometry* detected driver: pdftex
[1{/usr/local/texlive/2022/texmf-var/fonts/map/pdftex/updmap/pdftex.map}]
(./global_element_summary.aux) )</usr/local/texlive/2022/texmf-dist/fonts/type1
/public/amsfonts/cm/cmr10.pfb>
Output written on global_element_summary.pdf (1 page, 19069 bytes).
Transcript written on global_element_summary.log.
This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) (preloaded format=pdflatex)
 restricted \write18 enabled.
entering extended mode
(./per_species_element_summary.tex
LaTeX2e <2023-11-01>
L3 programming layer <2023-11-01>
(/usr/local/texlive/2022/texmf-dist/tex/latex/base/article.cls
Document Class: article 2023/05/17 v1.4n Standard LaTeX document class
(/usr/local/texlive/2022/texmf-dist/tex/latex/base/size11.clo))
(/usr/local/texlive/2022/texmf-dist/tex/latex/base/inputenc.sty)
(/usr/local/texlive/2022/texmf-dist/tex/latex/booktabs/booktabs.sty)
(/usr/local/texlive/2022/texmf-dist/tex/latex/geometry/geometry.sty
(/usr/local/texlive/2022/texmf-dist/tex/latex/graphics/keyval.sty)
(/usr/local/texlive/2022/texmf-dist/tex/generic/iftex/ifvtex.sty
(/usr/local/texlive/2022/texmf-dist/tex/generic/iftex/iftex.sty)))
(/usr/local/texlive/2022/texmf-dist/tex/latex/l3backend/l3backend-pdftex.def)
(./per_species_element_summary.aux)
*geometry* driver: auto-detecting
*geometry* detected driver: pdftex
[1{/usr/local/texlive/2022/texmf-var/fonts/map/pdftex/updmap/pdftex.map}]
(./per_species_element_summary.aux) )</usr/local/texlive/2022/texmf-dist/fonts/
type1/public/amsfonts/cm/cmr10.pfb>
Output written on per_species_element_summary.pdf (1 page, 19430 bytes).
Transcript written on per_species_element_summary.log.
This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) (preloaded format=pdflatex)
 restricted \write18 enabled.
entering extended mode
(./per_species_top_partners.tex
LaTeX2e <2023-11-01>
L3 programming layer <2023-11-01>
(/usr/local/texlive/2022/texmf-dist/tex/latex/base/article.cls
Document Class: article 2023/05/17 v1.4n Standard LaTeX document class
(/usr/local/texlive/2022/texmf-dist/tex/latex/base/size11.clo))
(/usr/local/texlive/2022/texmf-dist/tex/latex/base/inputenc.sty)
(/usr/local/texlive/2022/texmf-dist/tex/latex/booktabs/booktabs.sty)
(/usr/local/texlive/2022/texmf-dist/tex/latex/geometry/geometry.sty
(/usr/local/texlive/2022/texmf-dist/tex/latex/graphics/keyval.sty)
(/usr/local/texlive/2022/texmf-dist/tex/generic/iftex/ifvtex.sty
(/usr/local/texlive/2022/texmf-dist/tex/generic/iftex/iftex.sty)))
(/usr/local/texlive/2022/texmf-dist/tex/latex/l3backend/l3backend-pdftex.def)
(./per_species_top_partners.aux)
*geometry* driver: auto-detecting
*geometry* detected driver: pdftex
[1{/usr/local/texlive/2022/texmf-var/fonts/map/pdftex/updmap/pdftex.map}]
(./per_species_top_partners.aux) )</usr/local/texlive/2022/texmf-dist/fonts/typ
e1/public/amsfonts/cm/cmr10.pfb>
Output written on per_species_top_partners.pdf (1 page, 17962 bytes).
Transcript written on per_species_top_partners.log.
In [94]:
# as can be seen in the plots and tables below, a significant portion of tRNAs can be placed into
# synteny chains and putative synorthogroups

from IPython.display import SVG, display, IFrame
import glob

# Display SVG plots
plot_dir = 'syn_out_nucl_preloaded/element_plots'
for svg_file in sorted(glob.glob(f'{plot_dir}/*.svg')):
    print(f"\n{'='*80}")
    print(svg_file.split('/')[-1])
    print('='*80)
    display(SVG(filename=svg_file))
================================================================================
global_element_classification.svg
================================================================================
No description has been provided for this image
================================================================================
per_species_classification_grid.svg
================================================================================
No description has been provided for this image
================================================================================
per_species_classification_stacked.svg
================================================================================
No description has been provided for this image
================================================================================
per_species_connectivity_histogram.svg
================================================================================
No description has been provided for this image
================================================================================
per_species_top_partners.svg
================================================================================
No description has been provided for this image
In [ ]: