In [ ]:
%%bash
# get the results
curl -L -o out.tar.gz https://anchored.bioinf.uni-leipzig.de/sets/04cb956d-66a6-4740-9e28-eb807e4c6e89/download-small/
# extract the webserver output archive and see whats in there
tar -xzf out.tar.gz
ls out/
# to run this notebook completely you can access the showcase data on Zenodo as well as this notebook:
# place the genomes into out/utils/
# place the halos,trnas and syn_out_nucl_preloaded directories into out/stable/synthology
# install the requirements.txt, e.g.:
# conda create -n ancst_downstream python=3.11
# conda activate ancst_downstream
# pip install -r requirements.txt
# you will also need MCScanX, blast and clasp
In [ ]:
%%bash
# if not available install MCScanX from:
# https://github.com/wyp1125/MCScanX
# first, run MCScanX on theanchor alignments
cd out/stable/MCScanX
MCScanX_h MCScanX
In [3]:
%%bash
# now draw the alignments globally for all "chromosomes" >= 1000000 nucleotides
cd out/stable/MCScanX
python3 mcdraw.py --mcscanx_file MCScanX.collinearity --threshold_chr 1000000
In [4]:
# it is drawn greedily, so the next track down is always the ones with most alignments out of the species left
# blue and green are forward and reverse alignments which map to the chromsome their chromosome mostly maps to in general
# magenta and orange alignments are forward and reverse ones to chromosomes which do not map to their chromosome primarily,
# indicating rearrangements
# here you can already see extensive conservation between the first two species but also small rearrangements
# the next pair shows significantly more global rearrangements
from IPython.display import SVG, display
display(SVG('out/stable/MCScanX/mcscx_blocks.svg'))
In [9]:
# lets run a comprehensive synteny-enabled orthology analysis for a multi-copy gene family in Drosophilas
%cd out/stable/synthology
/Users/schmackeroodle/test_out_dir2/out/stable/synthology
In [ ]:
%%bash
ls
# for the following analysis you need these genes (their annotations) and also the genome sequences of the Drosophilas
# get this data most easily from Zenodo (including the jnotebook):
# otherwise the Drosophilas are just NCBI accessions with which you can find this data
# you need the genomes. put them in out/utils/genomes/
# and the annotations which i put in halos here in out/stable/synthology
# I used blastp to find the candidates in the other (not D.mel) Drosophilas
# (the tRNA data is used below)
# you need the NCBI blast suite (blastn, tblastn, blastp for everything in our output)
# make sure clasp.x is working. there is compiled version in out/utils. if its not working on your machine, simply recompile from:
# http://legacy.bioinf.uni-leipzig.de/Software/clasp/
# first, lets run the run_synthology_prot.py pipeline. what it does (roughly):
# 1. place the elements of interest from the gff files in the MCScanX collinearity chains
# 2. make orthology graphs from all elements which are in the same pairwise chains and edit to cograph because of:
# https://pubmed.ncbi.nlm.nih.gov/22456957/
# 3. use dynamic programming to align them allowing duplications and other rearrangements as first decribed in:
# https://pmc.ncbi.nlm.nih.gov/articles/PMC4981973/#CR42
# 4. make union of pairwise graphs and edit to cograph again to get multi-species orthology graphs
# the results of this are already quite heavy in the sense that you can now technically check out the list alignments and orthology graphs
# to further investigate your orthology relationships of interest but that is too much for this tutorial
# most arguments are self-explanatory and just regard the locations of stuff
# technically with --use-clasp n you can avoid chaining blast hits but you'd usually want to do that
# --blast-mode chains means to only consider the elements of interest which can actually be placed into an MCScanX chain
# --blast-mode all-vs-all addittionally performs an all-vs-all blastp which was originally developed to investigate the discrepancy between
# orthology determination solely on reciprocal best hits vs. synteny-aided
python3 run_synthology_prot.py \
--cores 8 \
--col ../MCScanX/MCScanX.collinearity \
--homology_threshold 30 \
--annotation_dir halos \
--new_blast y \
--mapping_file ../../utils/mapping \
--pairwise_alignments ../../utils/pairwise_alignments_table \
--use-clasp y \
--blast-mode chains \
--blastp-path blastp \
--clasp-path ../../utils/clasp.x \
--output_dir synthology_out_prot
In [20]:
%%bash
# to make this heavy output usable we implemented the extraction of "SynOrthoGroup" akin to OrthoFinders OrthoGroups
# they are defined as the connected components in the final to-cograph-edited union graph of the elements which could be placed into
# MCScanX synteny chains and whose connection is also supported by a respective list alignment
# cograph editing has to be contrained in our use case, firstly in general to only make between-species not within-species edges and
# secondly, we create to final graphs, one in which the editing is allowed to introduce new edges which amounts to transitively closing some
# orthology relationships ("with_new_edges") and not allowing new edges ("no_new_edges"). i recommend to conservatively use the version
# without any new edges because the transitive closures can basically create big connected components due to sparse phylogenetic coverage,
# assembly artefacts and genuine genome rearrangements. but anyway, try both
# to get them run:
python3 export_synorthogroups.py synthology_out_prot --graph-version both
================================================================================ EXPORTING SYNORTHOGROUPS (no_new_edges) ================================================================================ Loading synthology_out_prot/final_union_graph_no_new_edges.pickle... Loading synthology_out_prot/parsed_faa_gff... Graph nodes: 100 Graph edges: 447 Total input genes: 104 Multi-species SynOrthogroups: 4 Single-species singletons: 0 Filtered genes: 4 ✓ Complete gene accounting verified Writing output files to synthology_out_prot/synorthogroups/no_new_edges/ Writing SynOrthogroups.tsv... Writing SynOrthogroups.GeneCount.tsv... Writing SynOrthogroups_SingleCopyOrthologues.txt... Writing SynOrthogroups_UnassignedGenes.tsv... Writing Statistics_Overall.tsv... Writing Statistics_PerSpecies.tsv... Writing SynOrthogroups_WithCoords.tsv... Writing synorthogroups.pickle... ✓ Export complete: synthology_out_prot/synorthogroups/no_new_edges/ ================================================================================ EXPORTING SYNORTHOGROUPS (with_new_edges) ================================================================================ Loading synthology_out_prot/final_union_graph_with_new_edges.pickle... Loading synthology_out_prot/parsed_faa_gff... Graph nodes: 100 Graph edges: 472 Total input genes: 104 Multi-species SynOrthogroups: 4 Single-species singletons: 0 Filtered genes: 4 ✓ Complete gene accounting verified Writing output files to synthology_out_prot/synorthogroups/with_new_edges/ Writing SynOrthogroups.tsv... Writing SynOrthogroups.GeneCount.tsv... Writing SynOrthogroups_SingleCopyOrthologues.txt... Writing SynOrthogroups_UnassignedGenes.tsv... Writing Statistics_Overall.tsv... Writing Statistics_PerSpecies.tsv... Writing SynOrthogroups_WithCoords.tsv... Writing synorthogroups.pickle... �� Export complete: synthology_out_prot/synorthogroups/with_new_edges/ ================================================================================ ALL EXPORTS COMPLETE ================================================================================
In [27]:
# e.g. look at the main halo protein from D. mel
import pandas as pd
# Read the synorthogroups table
sog_file = "synthology_out_prot/synorthogroups/with_new_edges/SynOrthogroups.tsv"
sog_df = pd.read_csv(sog_file, sep='\t')
# Read gene counts
counts_file = "synthology_out_prot/synorthogroups/with_new_edges/SynOrthogroups.GeneCount.tsv"
counts_df = pd.read_csv(counts_file, sep='\t')
# Read statistics
stats_file = "synthology_out_prot/synorthogroups/with_new_edges/Statistics_Overall.tsv"
stats_df = pd.read_csv(stats_file, sep='\t')
print("=" * 80)
print("OVERALL STATISTICS (WITH NEW EDGES)")
print("=" * 80)
for _, row in stats_df.iterrows():
val = row['Value']
# Round if it's a numeric value that should be an integer
if isinstance(val, float) and val == int(val):
val = int(val)
print(f"{row['Metric']}: {val}")
print("\n" + "=" * 80)
print("ALL SYNORTHOGROUPS")
print("=" * 80)
for idx, row in sog_df.iterrows():
sog_id = row['SynOrthogroup']
print(f"\n{sog_id}")
print("-" * 80)
# Get gene counts for this SOG
sog_counts = counts_df[counts_df['SynOrthogroup'] == sog_id]
print("\nMembers per species:")
for col in sog_df.columns:
if col != 'SynOrthogroup':
members = row[col]
count = sog_counts[col].values[0] if len(sog_counts) > 0 else 0
if isinstance(count, float):
count = int(count)
if pd.notna(members) and members != '':
print(f" {col} ({count}): {members}")
else:
print(f" {col} ({count}): -")
total = sog_counts['Total'].values[0] if len(sog_counts) > 0 else 0
if isinstance(total, float):
total = int(total)
print(f"\nTotal genes in {sog_id}: {total}")
================================================================================ OVERALL STATISTICS (WITH NEW EDGES) ================================================================================ Number of species: 11 Total input genes: 104 Genes in SynOrthogroups: 100 Genes unassigned (filtered): 4 Genes unassigned (singletons): 0 Percentage in SynOrthogroups: 96.2 Number of SynOrthogroups: 4 Number of single-copy SynOrthogroups: 1 Mean SynOrthogroup size: 25 Median SynOrthogroup size: 27 ================================================================================ ALL SYNORTHOGROUPS ================================================================================ SOG0000000 -------------------------------------------------------------------------------- Members per species: GCF_000001215.4 (5): NP_001097507.1, NP_647925.1, NP_647926.1, NP_647928.1, NP_652709.1 GCF_004382195.2 (5): XP_002035381.1, XP_002035384.1, XP_002035386.1, XP_002035388.1, XP_032573789.1 GCF_009870125.1 (5): XP_001352657.3, XP_002135424.3, XP_033238392.1, XP_033238393.1, XP_033238394.1 GCF_016746365.2 (5): XP_002093738.1, XP_002093741.1, XP_002093742.1, XP_039229380.1, XP_039229381.1 GCF_016746395.2 (5): XP_002083669.1, XP_016030473.1, XP_016030475.1, XP_039149189.1, XP_039150257.1 GCF_017639315.1 (5): XP_001956198.1, XP_001956199.1, XP_001956200.1, XP_001956202.1, XP_032310569.1 GCF_018153835.1 (5): XP_017070999.1, XP_017071001.1, XP_017071002.1, XP_017071003.1, XP_017071329.1 GCF_018902025.1 (5): XP_002062546.1, XP_002062547.1, XP_002062549.2, XP_023034409.1, XP_023037390.1 GCF_030179895.1 (5): XP_017028404.1, XP_017028410.1, XP_017028411.1, XP_017028412.1, XP_070141710.1 GCF_030179915.1 (5): XP_017014258.1, XP_017014261.2, XP_017014262.1, XP_017014265.1, XP_070071522.1 GCF_030788295.1 (1): XP_002060039.2 Total genes in SOG0000000: 51 SOG0000001 -------------------------------------------------------------------------------- Members per species: GCF_000001215.4 (2): NP_647920.1, NP_648635.1 GCF_004382195.2 (2): XP_002030387.1, XP_032575457.1 GCF_009870125.1 (2): XP_001352650.2, XP_001353594.4 GCF_016746365.2 (2): XP_002094686.1, XP_039229810.1 GCF_016746395.2 (2): XP_002084759.1, XP_016030468.1 GCF_017639315.1 (2): XP_032310568.1, XP_032310840.1 GCF_018153835.1 (2): XP_017071328.1, XP_017080036.2 GCF_018902025.1 (2): XP_002062545.1, XP_046868055.1 GCF_030179895.1 (2): XP_017028402.1, XP_070141894.1 GCF_030179915.1 (2): XP_017004446.2, XP_017014257.1 GCF_030788295.1 (7): XP_002047487.1, XP_002060043.1, XP_002060051.1, XP_032291392.1, XP_032291415.1, XP_032291459.1, XP_032291515.1 Total genes in SOG0000001: 27 SOG0000002 -------------------------------------------------------------------------------- Members per species: GCF_000001215.4 (2): NP_001137765.1, NP_608940.1 GCF_004382195.2 (1): XP_032583477.1 GCF_009870125.1 (1): XP_001357034.2 GCF_016746365.2 (1): XP_002087550.1 GCF_016746395.2 (1): XP_039154542.1 GCF_017639315.1 (1): XP_032311464.1 GCF_018153835.1 (1): XP_017075608.1 GCF_018902025.1 (1): XP_002064938.1 GCF_030179895.1 (1): XP_017021791.1 GCF_030179915.1 (1): XP_016994922.1 GCF_030788295.1 (1): XP_002051725.1 Total genes in SOG0000002: 12 SOG0000003 -------------------------------------------------------------------------------- Members per species: GCF_000001215.4 (0): - GCF_004382195.2 (1): XP_002037965.1 GCF_009870125.1 (1): XP_033235652.1 GCF_016746365.2 (1): XP_015054458.1 GCF_016746395.2 (1): XP_016023623.1 GCF_017639315.1 (1): XP_001961803.1 GCF_018153835.1 (1): XP_017082362.2 GCF_018902025.1 (1): XP_002069270.2 GCF_030179895.1 (1): XP_017032011.1 GCF_030179915.1 (1): XP_017006947.2 GCF_030788295.1 (1): XP_002052907.1 Total genes in SOG0000003: 10
In [28]:
import pandas as pd
# Read the synorthogroups table
sog_file = "synthology_out_prot/synorthogroups/no_new_edges/SynOrthogroups.tsv"
sog_df = pd.read_csv(sog_file, sep='\t')
# Read gene counts
counts_file = "synthology_out_prot/synorthogroups/no_new_edges/SynOrthogroups.GeneCount.tsv"
counts_df = pd.read_csv(counts_file, sep='\t')
# Read statistics
stats_file = "synthology_out_prot/synorthogroups/no_new_edges/Statistics_Overall.tsv"
stats_df = pd.read_csv(stats_file, sep='\t')
print("=" * 80)
print("OVERALL STATISTICS (NO NEW EDGES)")
print("=" * 80)
for _, row in stats_df.iterrows():
val = row['Value']
# Round if it's a numeric value that should be an integer
if isinstance(val, float) and val == int(val):
val = int(val)
print(f"{row['Metric']}: {val}")
print("\n" + "=" * 80)
print("ALL SYNORTHOGROUPS")
print("=" * 80)
for idx, row in sog_df.iterrows():
sog_id = row['SynOrthogroup']
print(f"\n{sog_id}")
print("-" * 80)
# Get gene counts for this SOG
sog_counts = counts_df[counts_df['SynOrthogroup'] == sog_id]
print("\nMembers per species:")
for col in sog_df.columns:
if col != 'SynOrthogroup':
members = row[col]
count = sog_counts[col].values[0] if len(sog_counts) > 0 else 0
if isinstance(count, float):
count = int(count)
if pd.notna(members) and members != '':
print(f" {col} ({count}): {members}")
else:
print(f" {col} ({count}): -")
total = sog_counts['Total'].values[0] if len(sog_counts) > 0 else 0
if isinstance(total, float):
total = int(total)
print(f"\nTotal genes in {sog_id}: {total}")
================================================================================ OVERALL STATISTICS (NO NEW EDGES) ================================================================================ Number of species: 11 Total input genes: 104 Genes in SynOrthogroups: 100 Genes unassigned (filtered): 4 Genes unassigned (singletons): 0 Percentage in SynOrthogroups: 96.2 Number of SynOrthogroups: 4 Number of single-copy SynOrthogroups: 1 Mean SynOrthogroup size: 25 Median SynOrthogroup size: 27 ================================================================================ ALL SYNORTHOGROUPS ================================================================================ SOG0000000 -------------------------------------------------------------------------------- Members per species: GCF_000001215.4 (5): NP_001097507.1, NP_647925.1, NP_647926.1, NP_647928.1, NP_652709.1 GCF_004382195.2 (5): XP_002035381.1, XP_002035384.1, XP_002035386.1, XP_002035388.1, XP_032573789.1 GCF_009870125.1 (5): XP_001352657.3, XP_002135424.3, XP_033238392.1, XP_033238393.1, XP_033238394.1 GCF_016746365.2 (5): XP_002093738.1, XP_002093741.1, XP_002093742.1, XP_039229380.1, XP_039229381.1 GCF_016746395.2 (5): XP_002083669.1, XP_016030473.1, XP_016030475.1, XP_039149189.1, XP_039150257.1 GCF_017639315.1 (5): XP_001956198.1, XP_001956199.1, XP_001956200.1, XP_001956202.1, XP_032310569.1 GCF_018153835.1 (5): XP_017070999.1, XP_017071001.1, XP_017071002.1, XP_017071003.1, XP_017071329.1 GCF_018902025.1 (5): XP_002062546.1, XP_002062547.1, XP_002062549.2, XP_023034409.1, XP_023037390.1 GCF_030179895.1 (5): XP_017028404.1, XP_017028410.1, XP_017028411.1, XP_017028412.1, XP_070141710.1 GCF_030179915.1 (5): XP_017014258.1, XP_017014261.2, XP_017014262.1, XP_017014265.1, XP_070071522.1 GCF_030788295.1 (1): XP_002060039.2 Total genes in SOG0000000: 51 SOG0000001 -------------------------------------------------------------------------------- Members per species: GCF_000001215.4 (2): NP_647920.1, NP_648635.1 GCF_004382195.2 (2): XP_002030387.1, XP_032575457.1 GCF_009870125.1 (2): XP_001352650.2, XP_001353594.4 GCF_016746365.2 (2): XP_002094686.1, XP_039229810.1 GCF_016746395.2 (2): XP_002084759.1, XP_016030468.1 GCF_017639315.1 (2): XP_032310568.1, XP_032310840.1 GCF_018153835.1 (2): XP_017071328.1, XP_017080036.2 GCF_018902025.1 (2): XP_002062545.1, XP_046868055.1 GCF_030179895.1 (2): XP_017028402.1, XP_070141894.1 GCF_030179915.1 (2): XP_017004446.2, XP_017014257.1 GCF_030788295.1 (7): XP_002047487.1, XP_002060043.1, XP_002060051.1, XP_032291392.1, XP_032291415.1, XP_032291459.1, XP_032291515.1 Total genes in SOG0000001: 27 SOG0000002 -------------------------------------------------------------------------------- Members per species: GCF_000001215.4 (2): NP_001137765.1, NP_608940.1 GCF_004382195.2 (1): XP_032583477.1 GCF_009870125.1 (1): XP_001357034.2 GCF_016746365.2 (1): XP_002087550.1 GCF_016746395.2 (1): XP_039154542.1 GCF_017639315.1 (1): XP_032311464.1 GCF_018153835.1 (1): XP_017075608.1 GCF_018902025.1 (1): XP_002064938.1 GCF_030179895.1 (1): XP_017021791.1 GCF_030179915.1 (1): XP_016994922.1 GCF_030788295.1 (1): XP_002051725.1 Total genes in SOG0000002: 12 SOG0000003 -------------------------------------------------------------------------------- Members per species: GCF_000001215.4 (0): - GCF_004382195.2 (1): XP_002037965.1 GCF_009870125.1 (1): XP_033235652.1 GCF_016746365.2 (1): XP_015054458.1 GCF_016746395.2 (1): XP_016023623.1 GCF_017639315.1 (1): XP_001961803.1 GCF_018153835.1 (1): XP_017082362.2 GCF_018902025.1 (1): XP_002069270.2 GCF_030179895.1 (1): XP_017032011.1 GCF_030179915.1 (1): XP_017006947.2 GCF_030788295.1 (1): XP_002052907.1 Total genes in SOG0000003: 10
In [32]:
%%bash
# so for this example, the two graphs represent the same information
# since i know about this example, i know that NP_608940.1 should probably be allocated into SOG0000003
# thats a good example - always check this stuff from multiple angles:
# lets first draw an image of the synorthogroups and synteny anchor alignments with 250000 margin to check for anchors around
# in the next cell we will draw the raw data
python3 draw_synorthogroups.py 250000 synthology_out_prot/synorthogroups/no_new_edges/
Loading SynOrthogroups from synthology_out_prot/synorthogroups/no_new_edges//SynOrthogroups.tsv... Found 4 SynOrthogroups across 11 species Loading gene coordinates from synthology_out_prot/synorthogroups/no_new_edges//SynOrthogroups_WithCoords.tsv... Loaded coordinates for 100 genes After filtering (min 2 species): 4 SOGs to draw Loaded 679879 alignment fragments === Drawing SOG0000000 (1/4) === Species: 11, Genes: 51 Saved to images_synorthogroups/no_new_edges/250000/SOG0000000_11species_51genes.svg === Drawing SOG0000001 (2/4) === Species: 11, Genes: 27 Saved to images_synorthogroups/no_new_edges/250000/SOG0000001_11species_27genes.svg === Drawing SOG0000002 (3/4) === Species: 11, Genes: 12 Saved to images_synorthogroups/no_new_edges/250000/SOG0000002_11species_12genes.svg === Drawing SOG0000003 (4/4) === Species: 10, Genes: 10 Saved to images_synorthogroups/no_new_edges/250000/SOG0000003_10species_10genes.svg === Done! Figures saved to images_synorthogroups/no_new_edges/250000/ ===
In [35]:
from IPython.display import SVG, display, Markdown
import glob
import os
svg_files = sorted(glob.glob('images_synorthogroups/no_new_edges/250000/*.svg'))
for svg_file in svg_files:
display(Markdown(f"### {os.path.basename(svg_file)}"))
display(SVG(svg_file))
In [ ]:
%%bash
# now lets double check - this is a recommended workflow
# with the draw_verbose.py script one can draw such images only given gffs directory or an alternative "coords" file
# and the anchor alignments (without synthology inference)
# (the output warnings about tracks not being adjacent are expected and the negative coordinates too)
python3 draw_verbose.py 250000 \
--gff-dir halos/gff \
--alignments-file ../../utils/pairwise_alignments_table \
--auto-order \
--draw-all-elements
In [53]:
from IPython.display import SVG, display, Markdown
import glob
import os
svg_files = sorted(glob.glob('images_draw_verbose/250000/*.svg'))
# so this makes 3 images...and at that margin therefore 3, not 4 synorthogroups.
# in component_2_of_3_ref_GCF_018902025.1_NW_025814050.1_23821953.svg one can see why: the elements of SOG0000002 and SOG0000003 are on
# the same chromosome and soemtimes have larger and sometimes smaller gaps between each other in the genomes which is why depending
# on the margin used, they get thrown into the same "group"
for svg_file in svg_files:
display(Markdown(f"### {os.path.basename(svg_file)}"))
display(SVG(svg_file))
In [ ]:
%%bash
# lets try with a smaller margin
python3 draw_verbose.py 50000 \
--gff-dir halos/gff \
--alignments-file ../../utils/pairwise_alignments_table \
--auto-order \
--draw-all-elements
In [54]:
from IPython.display import SVG, display, Markdown
import glob
import os
svg_files = sorted(glob.glob('images_draw_verbose/50000/*.svg'))
# now it makes 4 images corresponding largely to the synorthogroups but with the D. mel element actually probably in the right group
# so the image for halo is: component_3_of_4_ref_GCF_016746395.2_NC_052520.2_1548430.svg
# and the other element: component_1_of_4_ref_GCF_016746365.2_NC_052527.2_13771800.svg
# IMPORTANT: for any synteny analysis like this, always try with different margins that you look at. if the scenario you are looking at
# changes drastically between using different margin within the same order of magnitude, its a sign of the unreliability of the synteny
# resolution. conservely, if increasing the margin only adds more elements but does not change the overall picture, the synteny evidence
# for orthology claims is good
for svg_file in svg_files:
display(Markdown(f"### {os.path.basename(svg_file)}"))
display(SVG(svg_file))
In [82]:
# one perk of the synteny-based orthology detection is that tandem arrays can potentially be disentangled
# for that we can use list alignments with synteny anchors. we already did that for the synthology pipeline
# here, lets extract some of this info to show that the synthology output already contains the information
# to further investigate these relationship
# we have not been able to automize this approach in a sensible fashion until writing this tutorial
# that being said, its good to investigate this "rawer" data as well and should probably be advised in most cases
import re
print("=" * 80)
print("TANDEM DUPLICATION EXAMPLES FROM SYNORTHOGROUP ALIGNMENTS")
print("=" * 80)
dmel_tandem = ['NP_001097507.1', 'NP_647925.1', 'NP_647926.1', 'NP_647928.1', 'NP_652709.1']
print(f"\nD. melanogaster SOG0000000 tandem array (5 genes):")
for gene in dmel_tandem:
print(f" - {gene}")
# Example 1: All matches (conserved 5→5)
species1 = 'GCF_018153835.1'
print(f"\n{'=' * 80}")
print(f"EXAMPLE 1: D. mel → {species1} (all matches)")
print("=" * 80)
alignment_file = f'synthology_out_prot/alignments/GCF_000001215.4___{species1}.txt'
with open(alignment_file) as f:
content = f.read()
edges_match = re.search(r'Edges \(\d+\):(.*?)(?=\n-{10,}|\Z)', content, re.DOTALL)
if edges_match:
edges_text = edges_match.group(1)
print("\nCorrespondences:")
for line in edges_text.split('\n'):
line = line.strip()
if '<->' not in line:
continue
match = re.match(r'(\S+)\s+<->\s+(\S+)\s+\[event=(\w+)', line)
if not match:
continue
g1, g2, event = match.groups()
if g1 in dmel_tandem:
print(f" {g1} <-> {g2} [event={event}]")
elif g2 in dmel_tandem:
print(f" {g2} <-> {g1} [event={event}]")
# Example 2: Some tandem duplications (many-to-one)
species2 = 'GCF_016746395.2'
print(f"\n{'=' * 80}")
print(f"EXAMPLE 2: D. mel → {species2} (with tandem duplications)")
print("=" * 80)
alignment_file = f'synthology_out_prot/alignments/GCF_000001215.4___{species2}.txt'
with open(alignment_file) as f:
content = f.read()
edges_match = re.search(r'Edges \(\d+\):(.*?)(?=\n-{10,}|\Z)', content, re.DOTALL)
if edges_match:
edges_text = edges_match.group(1)
# Build reverse mapping
correspondences = {gene: [] for gene in dmel_tandem}
for line in edges_text.split('\n'):
line = line.strip()
if '<->' not in line:
continue
match = re.match(r'(\S+)\s+<->\s+(\S+)\s+\[event=(\w+)', line)
if not match:
continue
g1, g2, event = match.groups()
if g1 in dmel_tandem:
correspondences[g1].append((g2, event))
elif g2 in dmel_tandem:
correspondences[g2].append((g1, event))
# Build target mapping
target_to_dmel = {}
for dmel_gene, hits in correspondences.items():
for target, event in hits:
if target not in target_to_dmel:
target_to_dmel[target] = []
target_to_dmel[target].append((dmel_gene, event))
print("\nCorrespondences:")
displayed = set()
# Show many-to-one first
for target, sources in sorted(target_to_dmel.items()):
if len(sources) > 1:
dmel_genes = [s[0] for s in sources]
events = set(s[1] for s in sources)
print(f"\n {' + '.join(dmel_genes)} → {target}")
print(f" [{len(sources)}:1 many-to-one, event: {', '.join(events)}]")
displayed.update(dmel_genes)
# Then show 1:1
for dmel_gene in dmel_tandem:
if dmel_gene in displayed:
continue
hits = correspondences[dmel_gene]
if len(hits) == 1:
target, event = hits[0]
print(f"\n {dmel_gene} → {target} [event={event}]")
print("\n" + "=" * 80)
================================================================================
TANDEM DUPLICATION EXAMPLES FROM SYNORTHOGROUP ALIGNMENTS
================================================================================
D. melanogaster SOG0000000 tandem array (5 genes):
- NP_001097507.1
- NP_647925.1
- NP_647926.1
- NP_647928.1
- NP_652709.1
================================================================================
EXAMPLE 1: D. mel → GCF_018153835.1 (all matches)
================================================================================
Correspondences:
NP_001097507.1 <-> XP_017071329.1 [event=match]
NP_647925.1 <-> XP_017070999.1 [event=match]
NP_652709.1 <-> XP_017071003.1 [event=match]
NP_647926.1 <-> XP_017071002.1 [event=match]
NP_647928.1 <-> XP_017071001.1 [event=match]
================================================================================
EXAMPLE 2: D. mel → GCF_016746395.2 (with tandem duplications)
================================================================================
Correspondences:
NP_647926.1 + NP_652709.1 → XP_002083669.1
[2:1 many-to-one, event: tandem_dup]
NP_001097507.1 → XP_016030475.1 [event=match]
NP_647928.1 → XP_039149189.1 [event=match]
================================================================================
In [61]:
%%bash
# now lets introduce another interesting approach of how to use synteny anchors:
# we can use the alignments to iteratively find all regions in a set of species which show synteny with regions of interest
# e.g. lets run this for the halos. again, sequence margin/tolerances matter and different should be tried. im using a large
# one because (1) the species are realtively closely related which reduces the risk of rearragements causing the too many regions
# to be detected and (2) to actually include as much aas possible. this is the tradeoff
# the script is roughly doing the following:
# 1. extract the elements of interest (from the gff directories in our case) and the anchors around them
# 2. the DBScan algorithm is used to find clusters of anchors to define syntenic regions
# 3. iteratively extract new anchor matches based on the anchors extracted until no new regions are found
python3 get_syn_regions.py --cores 8 --iter 20 --threshold_new_region 1000000 --anchor_scope 500000 --gff halos/gff
Loaded 122 elements from GFF files Total elements to process: 122 NT_033779.5 in genome GCF_000001215.4 NT_033779.5 in genome GCF_000001215.4 NT_037436.4 in genome GCF_000001215.4 NT_037436.4 in genome GCF_000001215.4 NT_037436.4 in genome GCF_000001215.4 NT_037436.4 in genome GCF_000001215.4 NT_037436.4 in genome GCF_000001215.4 NT_037436.4 in genome GCF_000001215.4 NT_037436.4 in genome GCF_000001215.4 NC_052527.2 in genome GCF_016746365.2 NC_052527.2 in genome GCF_016746365.2 NC_052529.2 in genome GCF_016746365.2 NC_052529.2 in genome GCF_016746365.2 NC_052529.2 in genome GCF_016746365.2 NC_052529.2 in genome GCF_016746365.2 NC_052529.2 in genome GCF_016746365.2 NC_052529.2 in genome GCF_016746365.2 NC_052529.2 in genome GCF_016746365.2 NW_024571981.1 in genome GCF_018153835.1 NW_024572964.1 in genome GCF_018153835.1 NW_024572964.1 in genome GCF_018153835.1 NW_024572964.1 in genome GCF_018153835.1 NW_024572964.1 in genome GCF_018153835.1 NW_024572964.1 in genome GCF_018153835.1 NW_024572964.1 in genome GCF_018153835.1 NW_024572964.1 in genome GCF_018153835.1 NW_024573038.1 in genome GCF_018153835.1 NW_024573454.1 in genome GCF_018153835.1 NC_045949.1 in genome GCF_004382195.2 NC_045949.1 in genome GCF_004382195.2 NC_045951.1 in genome GCF_004382195.2 NC_045951.1 in genome GCF_004382195.2 NC_045951.1 in genome GCF_004382195.2 NC_045951.1 in genome GCF_004382195.2 NC_045951.1 in genome GCF_004382195.2 NC_045951.1 in genome GCF_004382195.2 NC_045951.1 in genome GCF_004382195.2 NW_025814050.1 in genome GCF_018902025.1 NW_025814050.1 in genome GCF_018902025.1 NW_025814050.1 in genome GCF_018902025.1 NW_025814050.1 in genome GCF_018902025.1 NW_025814050.1 in genome GCF_018902025.1 NW_025814057.1 in genome GCF_018902025.1 NW_025814057.1 in genome GCF_018902025.1 NW_025814057.1 in genome GCF_018902025.1 NW_025814057.1 in genome GCF_018902025.1 NW_025814057.1 in genome GCF_018902025.1 NW_025814057.1 in genome GCF_018902025.1 NW_025814057.1 in genome GCF_018902025.1 NW_025814057.1 in genome GCF_018902025.1 NW_025814057.1 in genome GCF_018902025.1 NW_025814057.1 in genome GCF_018902025.1 NW_025814057.1 in genome GCF_018902025.1 NC_057928.1 in genome GCF_017639315.1 NC_057928.1 in genome GCF_017639315.1 NC_057928.1 in genome GCF_017639315.1 NC_057928.1 in genome GCF_017639315.1 NC_057928.1 in genome GCF_017639315.1 NC_057928.1 in genome GCF_017639315.1 NC_057928.1 in genome GCF_017639315.1 NC_057930.1 in genome GCF_017639315.1 NC_057930.1 in genome GCF_017639315.1 NC_046681.1 in genome GCF_009870125.1 NC_046681.1 in genome GCF_009870125.1 NC_046681.1 in genome GCF_009870125.1 NC_046683.1 in genome GCF_009870125.1 NC_046683.1 in genome GCF_009870125.1 NC_046683.1 in genome GCF_009870125.1 NC_046683.1 in genome GCF_009870125.1 NC_046683.1 in genome GCF_009870125.1 NC_046683.1 in genome GCF_009870125.1 NC_046683.1 in genome GCF_009870125.1 NC_046683.1 in genome GCF_009870125.1 NC_046683.1 in genome GCF_009870125.1 NC_052520.2 in genome GCF_016746395.2 NC_052520.2 in genome GCF_016746395.2 NC_052522.2 in genome GCF_016746395.2 NC_052522.2 in genome GCF_016746395.2 NC_052522.2 in genome GCF_016746395.2 NC_052522.2 in genome GCF_016746395.2 NC_052522.2 in genome GCF_016746395.2 NC_052522.2 in genome GCF_016746395.2 NC_052522.2 in genome GCF_016746395.2 NC_091545.1 in genome GCF_030788295.1 NC_091545.1 in genome GCF_030788295.1 NC_091545.1 in genome GCF_030788295.1 NC_091545.1 in genome GCF_030788295.1 NC_091545.1 in genome GCF_030788295.1 NC_091545.1 in genome GCF_030788295.1 NC_091545.1 in genome GCF_030788295.1 NC_091545.1 in genome GCF_030788295.1 NC_091545.1 in genome GCF_030788295.1 NC_091546.1 in genome GCF_030788295.1 NC_091546.1 in genome GCF_030788295.1 NW_027212798.1 in genome GCF_030788295.1 NW_027212798.1 in genome GCF_030788295.1 NW_027212798.1 in genome GCF_030788295.1 NW_027212798.1 in genome GCF_030788295.1 NW_027212798.1 in genome GCF_030788295.1 NC_091728.1 in genome GCF_030179895.1 NC_091728.1 in genome GCF_030179895.1 NC_091730.1 in genome GCF_030179895.1 NC_091730.1 in genome GCF_030179895.1 NC_091730.1 in genome GCF_030179895.1 NC_091730.1 in genome GCF_030179895.1 NC_091730.1 in genome GCF_030179895.1 NC_091730.1 in genome GCF_030179895.1 NC_091730.1 in genome GCF_030179895.1 NC_091730.1 in genome GCF_030179895.1 NC_091730.1 in genome GCF_030179895.1 NC_091730.1 in genome GCF_030179895.1 NC_091730.1 in genome GCF_030179895.1 NC_091730.1 in genome GCF_030179895.1 NC_091678.1 in genome GCF_030179915.1 NC_091678.1 in genome GCF_030179915.1 NC_091680.1 in genome GCF_030179915.1 NC_091680.1 in genome GCF_030179915.1 NC_091680.1 in genome GCF_030179915.1 NC_091680.1 in genome GCF_030179915.1 NC_091680.1 in genome GCF_030179915.1 NC_091680.1 in genome GCF_030179915.1 NC_091680.1 in genome GCF_030179915.1 iteration no 1 found 415 new regions after iteration 1 iteration no 2 found 0 new regions after iteration 2 exiting before iteration no 3 since no new regions found writing final output file
In [ ]:
In [62]:
%%bash
# the output defines regions where potential orthologs to our elements of interest could lie
# BP == breakpoint. it just indicates that a region might be broken into two pieces while its syntenic partner is
# connected in another species
cat syntenic_regions_succinct
GCF_000001215.4 NT_033779.5 17050 285850 NO BP GCF_000001215.4 NT_033779.5 695500 2143484 NO BP GCF_000001215.4 NT_033779.5 3004200 3376834 NO BP GCF_000001215.4 NT_033779.5 4998800 6080850 NO BP GCF_000001215.4 NT_033779.5 7195784 7492634 NO BP GCF_000001215.4 NT_033779.5 7695400 7830050 NO BP GCF_000001215.4 NT_033779.5 8033584 8142584 NO BP GCF_000001215.4 NT_033779.5 10478900 11521934 NO BP GCF_000001215.4 NT_033779.5 11979000 12118150 NO BP GCF_000001215.4 NT_033779.5 13153300 13195450 NO BP GCF_000001215.4 NT_033779.5 13904150 14057600 NO BP GCF_000001215.4 NT_033779.5 15240500 15443434 NO BP GCF_000001215.4 NT_033779.5 16279100 16338150 NO BP GCF_000001215.4 NT_033779.5 16615100 17241700 NO BP GCF_000001215.4 NT_033779.5 17792950 18150350 NO BP GCF_000001215.4 NT_033779.5 18696400 18721950 NO BP GCF_000001215.4 NT_033779.5 19419400 19453600 NO BP GCF_000001215.4 NT_033779.5 19918500 20025250 NO BP GCF_000001215.4 NT_033779.5 20704700 21206850 NO BP GCF_000001215.4 NT_037436.4 201786 336586 NO BP GCF_000001215.4 NT_037436.4 613950 840700 NO BP GCF_000001215.4 NT_037436.4 1546350 1637450 NO BP GCF_000001215.4 NT_037436.4 2770886 3309450 NO BP GCF_000001215.4 NT_037436.4 4100400 5308216 NO BP GCF_000001215.4 NT_037436.4 5903066 5940886 NO BP GCF_000001215.4 NT_037436.4 7240836 7381236 NO BP GCF_000001215.4 NT_037436.4 7968916 8611566 NO BP GCF_000001215.4 NT_037436.4 9350216 9520966 NO BP GCF_000001215.4 NT_037436.4 9694916 9913066 NO BP GCF_000001215.4 NT_037436.4 10951466 11071816 NO BP GCF_000001215.4 NT_037436.4 11948366 12135236 NO BP GCF_000001215.4 NT_037436.4 12463116 14040166 NO BP GCF_000001215.4 NT_037436.4 15559586 15586666 NO BP GCF_000001215.4 NT_037436.4 16547066 16618566 NO BP GCF_000001215.4 NT_037436.4 16778236 16897166 NO BP GCF_000001215.4 NT_037436.4 17243066 17476116 NO BP GCF_000001215.4 NT_037436.4 19245266 19878316 NO BP GCF_000001215.4 NT_037436.4 20119316 20417586 NO BP GCF_000001215.4 NT_037436.4 21814836 21959836 NO BP GCF_000001215.4 NT_037436.4 22396816 22724686 NO BP GCF_016746365.2 NC_052527.2 64444 335700 NO BP GCF_016746365.2 NC_052527.2 733544 2176044 NO BP GCF_016746365.2 NC_052527.2 3060300 3428400 NO BP GCF_016746365.2 NC_052527.2 5158794 5284950 NO BP GCF_016746365.2 NC_052527.2 5494700 5796394 NO BP GCF_016746365.2 NC_052527.2 6967494 8017350 NO BP GCF_016746365.2 NC_052527.2 8482900 8616700 NO BP GCF_016746365.2 NC_052527.2 9016200 9482000 NO BP GCF_016746365.2 NC_052527.2 10173344 10218144 NO BP GCF_016746365.2 NC_052527.2 13238694 14279094 NO BP GCF_016746365.2 NC_052527.2 19168194 19228744 NO BP GCF_016746365.2 NC_052527.2 20073994 20283338 NO BP GCF_016746365.2 NC_052528.2 4127714 4493414 NO BP GCF_016746365.2 NC_052528.2 5538513 5880063 NO BP GCF_016746365.2 NC_052528.2 6435213 7239313 NO BP GCF_016746365.2 NC_052528.2 8501063 8539213 NO BP GCF_016746365.2 NC_052528.2 8973563 9077214 NO BP GCF_016746365.2 NC_052528.2 12315414 12444514 NO BP GCF_016746365.2 NC_052529.2 222400 356350 NO BP GCF_016746365.2 NC_052529.2 693200 876729 NO BP GCF_016746365.2 NC_052529.2 1537600 1640000 NO BP GCF_016746365.2 NC_052529.2 2269850 2302729 NO BP GCF_016746365.2 NC_052529.2 2467279 2630600 NO BP GCF_016746365.2 NC_052529.2 3365429 3914200 NO BP GCF_016746365.2 NC_052529.2 4728379 5891450 NO BP GCF_016746365.2 NC_052529.2 6544400 6575129 NO BP GCF_016746365.2 NC_052529.2 7896179 8039579 NO BP GCF_016746365.2 NC_052529.2 8618350 9231300 NO BP GCF_016746365.2 NC_052529.2 9883000 9948329 NO BP GCF_016746365.2 NC_052529.2 11040629 11171000 NO BP GCF_016746365.2 NC_052529.2 12074950 12266729 NO BP GCF_016746365.2 NC_052529.2 12604879 14218700 NO BP GCF_016746365.2 NC_052529.2 16703879 16945800 NO BP GCF_016746365.2 NC_052529.2 17291929 17402229 NO BP GCF_016746365.2 NC_052529.2 17557129 17621600 NO BP GCF_016746365.2 NC_052529.2 20183979 20455779 NO BP GCF_016746365.2 NC_052529.2 20692529 21287529 NO BP GCF_016746365.2 NC_052529.2 21998279 22148779 NO BP GCF_016746365.2 NC_052529.2 22610000 22942829 NO BP GCF_018153835.1 NW_024571881.1 92398 720050 NO BP GCF_018153835.1 NW_024571981.1 258850 1374700 NO BP GCF_018153835.1 NW_024571981.1 1585469 1789819 NO BP GCF_018153835.1 NW_024571981.1 1985300 2687350 NO BP GCF_018153835.1 NW_024571981.1 3317800 3569900 NO BP GCF_018153835.1 NW_024572000.1 0 173700 NO BP GCF_018153835.1 NW_024572672.1 639620 748920 NO BP GCF_018153835.1 NW_024572675.1 443913 922313 NO BP GCF_018153835.1 NW_024572700.1 2200 121180 NO BP GCF_018153835.1 NW_024572939.1 0 37050 NO BP GCF_018153835.1 NW_024572964.1 7461 136761 NO BP GCF_018153835.1 NW_024572964.1 654650 891100 NO BP GCF_018153835.1 NW_024572964.1 1560811 1665700 NO BP GCF_018153835.1 NW_024572964.1 2862911 3398550 NO BP GCF_018153835.1 NW_024572964.1 4235511 5450400 NO BP GCF_018153835.1 NW_024572964.1 6054300 6092711 NO BP GCF_018153835.1 NW_024572964.1 6422600 6615311 NO BP GCF_018153835.1 NW_024572964.1 6950911 7244211 NO BP GCF_018153835.1 NW_024572964.1 7490950 7557900 NO BP GCF_018153835.1 NW_024572964.1 7820350 7909450 NO BP GCF_018153835.1 NW_024573036.1 0 139608 NO BP GCF_018153835.1 NW_024573036.1 784400 1169658 NO BP GCF_018153835.1 NW_024573036.1 2168450 2465658 NO BP GCF_018153835.1 NW_024573038.1 416753 741850 NO BP GCF_018153835.1 NW_024573038.1 1187403 1339600 NO BP GCF_018153835.1 NW_024573038.1 1720303 2215253 NO BP GCF_018153835.1 NW_024573038.1 4089753 4340350 NO BP GCF_018153835.1 NW_024573038.1 4683750 4805603 NO BP GCF_018153835.1 NW_024573038.1 4965553 5031650 NO BP GCF_018153835.1 NW_024573038.1 5550650 5636900 NO BP GCF_018153835.1 NW_024573038.1 5785203 6005450 NO BP GCF_018153835.1 NW_024573038.1 7058500 7183100 NO BP GCF_018153835.1 NW_024573038.1 9828003 11449100 NO BP GCF_018153835.1 NW_024573038.1 13133653 13162153 NO BP GCF_018153835.1 NW_024573209.1 1454800 1613802 NO BP GCF_018153835.1 NW_024573245.1 19225 309625 NO BP GCF_018153835.1 NW_024573245.1 709125 815025 NO BP GCF_018153835.1 NW_024573253.1 0 227800 NO BP GCF_018153835.1 NW_024573287.1 146350 347889 NO BP GCF_018153835.1 NW_024573311.1 131610 404310 NO BP GCF_018153835.1 NW_024573311.1 1708710 1742310 NO BP GCF_018153835.1 NW_024573311.1 2757310 2885210 NO BP GCF_018153835.1 NW_024573387.1 775850 883080 NO BP GCF_018153835.1 NW_024573410.1 0 401664 NO BP GCF_018153835.1 NW_024573454.1 0 932300 NO BP GCF_018153835.1 NW_024573532.1 0 117113 NO BP GCF_018153835.1 NW_024573566.1 8000 120581 NO BP GCF_018153835.1 NW_024573661.1 638278 723978 NO BP GCF_018153835.1 NW_024573793.1 0 114350 NO BP GCF_018153835.1 NW_024573870.1 678035 715450 NO BP GCF_004382195.2 NC_045949.1 82050 344100 NO BP GCF_004382195.2 NC_045949.1 727400 2108100 NO BP GCF_004382195.2 NC_045949.1 2949750 3312150 NO BP GCF_004382195.2 NC_045949.1 4909800 5981000 NO BP GCF_004382195.2 NC_045949.1 7068050 7357200 NO BP GCF_004382195.2 NC_045949.1 7556100 7691750 NO BP GCF_004382195.2 NC_045949.1 7877850 7985250 NO BP GCF_004382195.2 NC_045949.1 10268650 11275200 NO BP GCF_004382195.2 NC_045949.1 11733300 11861000 NO BP GCF_004382195.2 NC_045949.1 12860800 12903650 NO BP GCF_004382195.2 NC_045949.1 13588350 13744450 NO BP GCF_004382195.2 NC_045949.1 14930200 15130500 NO BP GCF_004382195.2 NC_045949.1 15919000 15977200 NO BP GCF_004382195.2 NC_045949.1 16209600 16875650 NO BP GCF_004382195.2 NC_045949.1 17511600 17773350 NO BP GCF_004382195.2 NC_045949.1 18315900 18342950 NO BP GCF_004382195.2 NC_045949.1 19030850 19067100 NO BP GCF_004382195.2 NC_045949.1 19497250 19602750 NO BP GCF_004382195.2 NC_045949.1 20279650 20759500 NO BP GCF_004382195.2 NC_045951.1 277450 415100 NO BP GCF_004382195.2 NC_045951.1 683550 915300 NO BP GCF_004382195.2 NC_045951.1 1554900 1656550 NO BP GCF_004382195.2 NC_045951.1 2778650 3288050 NO BP GCF_004382195.2 NC_045951.1 4076600 5238800 NO BP GCF_004382195.2 NC_045951.1 5811150 5852450 NO BP GCF_004382195.2 NC_045951.1 7094750 7158050 NO BP GCF_004382195.2 NC_045951.1 7807400 8423600 NO BP GCF_004382195.2 NC_045951.1 9131650 9293650 NO BP GCF_004382195.2 NC_045951.1 9454000 9670850 NO BP GCF_004382195.2 NC_045951.1 10678750 10810500 NO BP GCF_004382195.2 NC_045951.1 11645700 11832400 NO BP GCF_004382195.2 NC_045951.1 12159200 13704000 NO BP GCF_004382195.2 NC_045951.1 16158200 16210150 NO BP GCF_004382195.2 NC_045951.1 16368150 16479350 NO BP GCF_004382195.2 NC_045951.1 16811900 17046400 NO BP GCF_004382195.2 NC_045951.1 18745000 19323000 NO BP GCF_004382195.2 NC_045951.1 19626500 19909900 NO BP GCF_004382195.2 NC_045951.1 21260400 21411350 NO BP GCF_004382195.2 NC_045951.1 21861100 22173900 NO BP GCF_018902025.1 NW_025814050.1 2344750 2492100 NO BP GCF_018902025.1 NW_025814050.1 3367850 4076750 NO BP GCF_018902025.1 NW_025814050.1 4801850 4897545 NO BP GCF_018902025.1 NW_025814050.1 5464950 5724145 NO BP GCF_018902025.1 NW_025814050.1 6128450 6864800 NO BP GCF_018902025.1 NW_025814050.1 7140250 7179645 NO BP GCF_018902025.1 NW_025814050.1 8964150 9053350 NO BP GCF_018902025.1 NW_025814050.1 9507045 9724150 NO BP GCF_018902025.1 NW_025814050.1 10660850 11301600 NO BP GCF_018902025.1 NW_025814050.1 11695445 12027350 NO BP GCF_018902025.1 NW_025814050.1 12922300 14013450 NO BP GCF_018902025.1 NW_025814050.1 14539450 14986500 NO BP GCF_018902025.1 NW_025814050.1 16499045 16637200 NO BP GCF_018902025.1 NW_025814050.1 18086650 19316050 NO BP GCF_018902025.1 NW_025814050.1 20859350 21103050 NO BP GCF_018902025.1 NW_025814050.1 21983100 22271945 NO BP GCF_018902025.1 NW_025814050.1 23015895 24296366 POT BP1 GCF_018902025.1 NW_025814051.1 0 41800 POT BP2 GCF_018902025.1 NW_025814051.1 776178 975600 NO BP GCF_018902025.1 NW_025814051.1 1154950 1331950 NO BP GCF_018902025.1 NW_025814051.1 3115700 3616450 NO BP GCF_018902025.1 NW_025814051.1 4479750 5061650 NO BP GCF_018902025.1 NW_025814051.1 6101600 6418200 NO BP GCF_018902025.1 NW_025814051.1 6607150 6682700 NO BP GCF_018902025.1 NW_025814054.1 1541550 1614694 NO BP GCF_018902025.1 NW_025814054.1 2301900 2615394 NO BP GCF_018902025.1 NW_025814054.1 3338100 3409894 NO BP GCF_018902025.1 NW_025814055.1 844250 924850 NO BP GCF_018902025.1 NW_025814055.1 1372700 1486471 NO BP GCF_018902025.1 NW_025814056.1 244486 309186 NO BP GCF_018902025.1 NW_025814056.1 2963250 3073350 NO BP GCF_018902025.1 NW_025814056.1 3894200 4275100 NO BP GCF_018902025.1 NW_025814056.1 4786200 4980136 NO BP GCF_018902025.1 NW_025814056.1 5519686 5912600 NO BP GCF_018902025.1 NW_025814056.1 6837600 6951750 NO BP GCF_018902025.1 NW_025814056.1 7134150 7214100 NO BP GCF_018902025.1 NW_025814056.1 7643586 7940500 NO BP GCF_018902025.1 NW_025814056.1 8100300 8267950 NO BP GCF_018902025.1 NW_025814057.1 92600 1405550 NO BP GCF_018902025.1 NW_025814057.1 1618200 1711550 NO BP GCF_018902025.1 NW_025814057.1 2223950 2684050 NO BP GCF_018902025.1 NW_025814057.1 2870500 2922050 NO BP GCF_018902025.1 NW_025814057.1 4043400 5264711 NO BP GCF_018902025.1 NW_025814057.1 6079411 6245300 NO BP GCF_018902025.1 NW_025814057.1 6916700 7091050 NO BP GCF_018902025.1 NW_025814057.1 8244500 8772400 NO BP GCF_018902025.1 NW_025814057.1 9249700 9376611 NO BP GCF_018902025.1 NW_025814057.1 10671811 11306361 NO BP GCF_018902025.1 NW_025814058.1 41600 268413 NO BP GCF_018902025.1 NW_025814058.1 500950 1042313 NO BP GCF_018902025.1 NW_025814058.1 1198200 1309250 NO BP GCF_018902025.1 NW_025814058.1 1799750 1939900 NO BP GCF_018902025.1 NW_025814059.1 1009463 1038900 NO BP GCF_018902025.1 NW_025814059.1 1908350 2311013 NO BP GCF_018902025.1 NW_025814059.1 2678550 2878663 NO BP GCF_017639315.1 NC_057928.1 4557850 5632300 NO BP GCF_017639315.1 NC_057928.1 5969800 6152576 NO BP GCF_017639315.1 NC_057928.1 7579350 7662250 NO BP GCF_017639315.1 NC_057928.1 8853850 9106950 NO BP GCF_017639315.1 NC_057928.1 9599100 9747450 NO BP GCF_017639315.1 NC_057928.1 9949600 10190500 NO BP GCF_017639315.1 NC_057928.1 10936050 11272050 NO BP GCF_017639315.1 NC_057928.1 12131700 12475400 NO BP GCF_017639315.1 NC_057928.1 12958350 13775250 NO BP GCF_017639315.1 NC_057928.1 14391350 15458850 NO BP GCF_017639315.1 NC_057928.1 16279450 16362676 NO BP GCF_017639315.1 NC_057928.1 17746650 18088226 NO BP GCF_017639315.1 NC_057928.1 18657850 19031650 NO BP GCF_017639315.1 NC_057928.1 19450500 19553526 NO BP GCF_017639315.1 NC_057928.1 20500800 20896600 NO BP GCF_017639315.1 NC_057928.1 22948976 22986000 NO BP GCF_017639315.1 NC_057928.1 23333526 23580800 NO BP GCF_017639315.1 NC_057928.1 23748000 24903800 NO BP GCF_017639315.1 NC_057930.1 7069023 7182400 NO BP GCF_017639315.1 NC_057930.1 8852700 8921950 NO BP GCF_017639315.1 NC_057930.1 9890400 9963450 NO BP GCF_017639315.1 NC_057930.1 10890450 11369623 NO BP GCF_017639315.1 NC_057930.1 11522550 11606223 NO BP GCF_017639315.1 NC_057930.1 12603800 12992300 NO BP GCF_017639315.1 NC_057930.1 13215150 14319273 NO BP GCF_017639315.1 NC_057930.1 14902050 15023150 NO BP GCF_017639315.1 NC_057930.1 15214900 15303650 NO BP GCF_017639315.1 NC_057930.1 16366023 16543300 NO BP GCF_017639315.1 NC_057930.1 16793400 16952073 NO BP GCF_017639315.1 NC_057930.1 17213473 17696400 NO BP GCF_017639315.1 NC_057930.1 18662723 19143300 NO BP GCF_017639315.1 NC_057930.1 19855650 19946050 NO BP GCF_017639315.1 NC_057930.1 20580750 20647150 NO BP GCF_017639315.1 NC_057930.1 21009223 21206623 NO BP GCF_017639315.1 NC_057930.1 21467800 21557600 NO BP GCF_017639315.1 NC_057930.1 22112873 22437473 NO BP GCF_017639315.1 NC_057930.1 23766200 25160250 NO BP GCF_017639315.1 NC_057930.1 25754473 26189150 NO BP GCF_017639315.1 NC_057930.1 26677700 26768773 NO BP GCF_017639315.1 NC_057930.1 26976600 27103150 NO BP GCF_017639315.1 NC_057932.1 20486550 20511099 NO BP GCF_009870125.1 NC_046681.1 3437092 3929600 NO BP GCF_009870125.1 NC_046681.1 5656600 5816300 NO BP GCF_009870125.1 NC_046681.1 6207300 7162800 NO BP GCF_009870125.1 NC_046681.1 7925450 8087950 NO BP GCF_009870125.1 NC_046681.1 8500942 8674250 NO BP GCF_009870125.1 NC_046681.1 9418150 10659850 NO BP GCF_009870125.1 NC_046681.1 16510700 17240550 NO BP GCF_009870125.1 NC_046681.1 18526842 18762750 NO BP GCF_009870125.1 NC_046681.1 19409792 19461800 NO BP GCF_009870125.1 NC_046681.1 20522250 20683150 NO BP GCF_009870125.1 NC_046681.1 20881050 20963042 NO BP GCF_009870125.1 NC_046681.1 23444842 23520400 NO BP GCF_009870125.1 NC_046681.1 24066850 25438950 NO BP GCF_009870125.1 NC_046681.1 26210400 26912550 NO BP GCF_009870125.1 NC_046681.1 27311050 28521850 NO BP GCF_009870125.1 NC_046681.1 29339350 29744042 NO BP GCF_009870125.1 NC_046681.1 29936100 30212500 NO BP GCF_009870125.1 NC_046681.1 30450342 30566350 NO BP GCF_009870125.1 NC_046683.1 42680055 42708268 NO BP GCF_009870125.1 NC_046683.1 44464655 44731555 NO BP GCF_009870125.1 NC_046683.1 45761618 46086755 NO BP GCF_009870125.1 NC_046683.1 46760505 47058855 NO BP GCF_009870125.1 NC_046683.1 47674055 48673768 NO BP GCF_009870125.1 NC_046683.1 50793118 50849018 NO BP GCF_009870125.1 NC_046683.1 51673205 52136905 NO BP GCF_009870125.1 NC_046683.1 52458918 52637905 NO BP GCF_009870125.1 NC_046683.1 55936055 55992705 NO BP GCF_009870125.1 NC_046683.1 56242868 56600855 NO BP GCF_009870125.1 NC_046683.1 56993905 57043768 NO BP GCF_009870125.1 NC_046683.1 57607318 57862668 NO BP GCF_009870125.1 NC_046683.1 58593768 59818018 NO BP GCF_009870125.1 NC_046683.1 60070868 60174655 NO BP GCF_009870125.1 NC_046683.1 60698655 60897718 NO BP GCF_009870125.1 NC_046683.1 61374918 61817005 NO BP GCF_009870125.1 NC_046683.1 62201468 62452668 NO BP GCF_009870125.1 NC_046683.1 62741968 62813418 NO BP GCF_009870125.1 NC_046683.1 63033068 63214168 NO BP GCF_009870125.1 NC_046683.1 64545605 64888518 NO BP GCF_009870125.1 NC_046683.1 65041305 65706718 NO BP GCF_009870125.1 NC_046683.1 65888268 66031105 NO BP GCF_009870125.1 NC_046683.1 67013655 67193305 NO BP GCF_016746395.2 NC_052520.2 82000 351450 NO BP GCF_016746395.2 NC_052520.2 740300 2142800 NO BP GCF_016746395.2 NC_052520.2 3004350 3362650 NO BP GCF_016746395.2 NC_052520.2 5001650 6055650 NO BP GCF_016746395.2 NC_052520.2 7154300 7452750 NO BP GCF_016746395.2 NC_052520.2 7654300 7785800 NO BP GCF_016746395.2 NC_052520.2 7977950 8086550 NO BP GCF_016746395.2 NC_052520.2 10386100 11425750 NO BP GCF_016746395.2 NC_052520.2 11865800 11993950 NO BP GCF_016746395.2 NC_052520.2 12998350 13041200 NO BP GCF_016746395.2 NC_052520.2 13633800 13880600 NO BP GCF_016746395.2 NC_052520.2 15047800 15249600 NO BP GCF_016746395.2 NC_052520.2 16018400 16077300 NO BP GCF_016746395.2 NC_052520.2 16288850 16976100 NO BP GCF_016746395.2 NC_052520.2 17549400 17896150 NO BP GCF_016746395.2 NC_052520.2 18444450 18469350 NO BP GCF_016746395.2 NC_052520.2 19149550 19182200 NO BP GCF_016746395.2 NC_052520.2 19617250 19718300 NO BP GCF_016746395.2 NC_052520.2 20381213 20861163 NO BP GCF_016746395.2 NC_052522.2 209600 350100 NO BP GCF_016746395.2 NC_052522.2 684650 864050 NO BP GCF_016746395.2 NC_052522.2 1504550 1604800 NO BP GCF_016746395.2 NC_052522.2 2746600 3261500 NO BP GCF_016746395.2 NC_052522.2 4049080 5163400 NO BP GCF_016746395.2 NC_052522.2 5800650 5833300 NO BP GCF_016746395.2 NC_052522.2 7116950 7158050 NO BP GCF_016746395.2 NC_052522.2 7822750 8451200 NO BP GCF_016746395.2 NC_052522.2 9166600 9331450 NO BP GCF_016746395.2 NC_052522.2 9495600 9711080 NO BP GCF_016746395.2 NC_052522.2 10744850 10862530 NO BP GCF_016746395.2 NC_052522.2 11727650 11913030 NO BP GCF_016746395.2 NC_052522.2 12244050 13807900 NO BP GCF_016746395.2 NC_052522.2 15311600 15338050 NO BP GCF_016746395.2 NC_052522.2 16286300 16347350 NO BP GCF_016746395.2 NC_052522.2 16498200 16612000 NO BP GCF_016746395.2 NC_052522.2 16953480 17188100 NO BP GCF_016746395.2 NC_052522.2 18926100 19541200 NO BP GCF_016746395.2 NC_052522.2 19795880 20074480 NO BP GCF_016746395.2 NC_052522.2 21428030 21573630 NO BP GCF_016746395.2 NC_052522.2 22025550 22347730 NO BP GCF_030788295.1 NC_091544.1 20937561 20994600 NO BP GCF_030788295.1 NC_091545.1 32750 159112 NO BP GCF_030788295.1 NC_091545.1 592150 1849650 NO BP GCF_030788295.1 NC_091545.1 3223300 3295350 NO BP GCF_030788295.1 NC_091545.1 3649300 3923450 NO BP GCF_030788295.1 NC_091545.1 4077562 4394162 NO BP GCF_030788295.1 NC_091545.1 7087512 7147650 NO BP GCF_030788295.1 NC_091545.1 7534450 8117512 NO BP GCF_030788295.1 NC_091545.1 9676462 9896912 NO BP GCF_030788295.1 NC_091545.1 10085462 10452162 NO BP GCF_030788295.1 NC_091545.1 10681162 10987900 NO BP GCF_030788295.1 NC_091545.1 11213212 11323100 NO BP GCF_030788295.1 NC_091545.1 11577150 11610412 NO BP GCF_030788295.1 NC_091545.1 12046700 12105050 NO BP GCF_030788295.1 NC_091545.1 12417400 12461250 NO BP GCF_030788295.1 NC_091545.1 13204500 13305400 NO BP GCF_030788295.1 NC_091545.1 14200950 15327712 NO BP GCF_030788295.1 NC_091545.1 15602262 15666562 NO BP GCF_030788295.1 NC_091545.1 15867300 15961512 NO BP GCF_030788295.1 NC_091545.1 16914212 17292150 NO BP GCF_030788295.1 NC_091545.1 17713200 17851150 NO BP GCF_030788295.1 NC_091545.1 19154062 19520150 NO BP GCF_030788295.1 NC_091545.1 19687512 19773350 NO BP GCF_030788295.1 NC_091545.1 20201162 20229262 NO BP GCF_030788295.1 NC_091545.1 20600162 20706150 NO BP GCF_030788295.1 NC_091545.1 21127262 21302200 NO BP GCF_030788295.1 NC_091545.1 21470912 22085800 NO BP GCF_030788295.1 NC_091545.1 22960500 23087112 NO BP GCF_030788295.1 NC_091545.1 24802650 24922700 NO BP GCF_030788295.1 NC_091545.1 25142862 25433762 NO BP GCF_030788295.1 NC_091545.1 26305400 26371450 NO BP GCF_030788295.1 NC_091546.1 5900 129553 NO BP GCF_030788295.1 NC_091546.1 742103 1347550 NO BP GCF_030788295.1 NC_091546.1 2150653 2217100 NO BP GCF_030788295.1 NC_091546.1 2558653 3753753 NO BP GCF_030788295.1 NC_091546.1 5758100 6208950 NO BP GCF_030788295.1 NC_091546.1 6634600 6789150 NO BP GCF_030788295.1 NC_091546.1 7088853 7718253 NO BP GCF_030788295.1 NC_091546.1 8450950 8696903 NO BP GCF_030788295.1 NC_091546.1 9907100 9952253 NO BP GCF_030788295.1 NC_091546.1 10170100 10688103 NO BP GCF_030788295.1 NC_091546.1 11018603 11234153 NO BP GCF_030788295.1 NC_091546.1 12289550 13133150 NO BP GCF_030788295.1 NC_091546.1 14216753 14358350 NO BP GCF_030788295.1 NC_091546.1 14550350 14772053 NO BP GCF_030788295.1 NC_091546.1 14985300 16151403 NO BP GCF_030788295.1 NC_091546.1 16625050 16809400 NO BP GCF_030788295.1 NC_091546.1 17316253 17376550 NO BP GCF_030788295.1 NC_091546.1 18918250 20064103 NO BP GCF_030788295.1 NC_091546.1 20865003 21209653 NO BP GCF_030788295.1 NC_091546.1 21922950 22040650 NO BP GCF_030788295.1 NC_091546.1 22285603 22341950 NO BP GCF_030788295.1 NC_091546.1 24698803 24756000 NO BP GCF_030788295.1 NC_091546.1 25707100 25739903 NO BP GCF_030788295.1 NC_091546.1 27994750 28264053 NO BP GCF_030788295.1 NW_027212798.1 0 59877 POT BP1 GCF_030788295.1 NW_027212799.1 0 19807 POT BP2 GCF_030179895.1 NC_091728.1 639676 817500 NO BP GCF_030179895.1 NC_091728.1 1270400 1601350 NO BP GCF_030179895.1 NC_091728.1 2448900 2626676 NO BP GCF_030179895.1 NC_091728.1 5336100 5877472 NO BP GCF_030179895.1 NC_091728.1 6342822 6374372 NO BP GCF_030179895.1 NC_091728.1 6745822 7317472 NO BP GCF_030179895.1 NC_091728.1 7788850 7843922 NO BP GCF_030179895.1 NC_091728.1 8863222 9083672 NO BP GCF_030179895.1 NC_091728.1 9510172 9911650 NO BP GCF_030179895.1 NC_091728.1 10433772 10714322 NO BP GCF_030179895.1 NC_091728.1 11453300 11605772 NO BP GCF_030179895.1 NC_091728.1 12294372 14166822 NO BP GCF_030179895.1 NC_091728.1 14437350 14525822 NO BP GCF_030179895.1 NC_091728.1 15420950 15993250 NO BP GCF_030179895.1 NC_091728.1 16549500 16701100 NO BP GCF_030179895.1 NC_091728.1 17747772 18122350 NO BP GCF_030179895.1 NC_091728.1 18277622 18334422 NO BP GCF_030179895.1 NC_091728.1 18912750 18999022 NO BP GCF_030179895.1 NC_091728.1 20079572 20136372 NO BP GCF_030179895.1 NC_091728.1 20362250 20696138 NO BP GCF_030179895.1 NC_091728.1 20925550 20986038 NO BP GCF_030179895.1 NC_091728.1 21198600 21572850 NO BP GCF_030179895.1 NC_091728.1 21969411 22199161 NO BP GCF_030179895.1 NC_091728.1 22441461 22476911 NO BP GCF_030179895.1 NC_091729.1 8712161 8896311 NO BP GCF_030179895.1 NC_091729.1 9242574 9738511 NO BP GCF_030179895.1 NC_091730.1 886036 1038536 NO BP GCF_030179895.1 NC_091730.1 3097689 3406789 NO BP GCF_030179895.1 NC_091730.1 3878939 3933039 NO BP GCF_030179895.1 NC_091730.1 4780581 4823631 NO BP GCF_030179895.1 NC_091730.1 4980189 5295739 NO BP GCF_030179895.1 NC_091730.1 6160489 6673989 NO BP GCF_030179895.1 NC_091730.1 7063939 7156489 NO BP GCF_030179895.1 NC_091730.1 7895581 8008589 NO BP GCF_030179895.1 NC_091730.1 8433889 9204589 NO BP GCF_030179895.1 NC_091730.1 10291489 10320181 NO BP GCF_030179895.1 NC_091730.1 10588781 10634231 NO BP GCF_030179895.1 NC_091730.1 12011089 12480889 NO BP GCF_030179895.1 NC_091730.1 13554281 14588639 NO BP GCF_030179895.1 NC_091730.1 15798639 16124039 NO BP GCF_030179895.1 NC_091730.1 18156481 18198831 NO BP GCF_030179895.1 NC_091730.1 18376931 19000031 NO BP GCF_030179895.1 NC_091730.1 19210639 19243089 NO BP GCF_030179895.1 NC_091730.1 19844739 20182639 NO BP GCF_030179895.1 NC_091730.1 20446439 20496539 NO BP GCF_030179895.1 NC_091730.1 21659289 22842331 NO BP GCF_030179895.1 NC_091730.1 24032731 24559250 NO BP GCF_030179895.1 NC_091730.1 24824431 25829750 NO BP GCF_030179915.1 NC_091678.1 108100 433600 NO BP GCF_030179915.1 NC_091678.1 909200 1302050 NO BP GCF_030179915.1 NC_091678.1 3076050 3333650 NO BP GCF_030179915.1 NC_091678.1 3666950 3869850 NO BP GCF_030179915.1 NC_091678.1 5112300 6503000 NO BP GCF_030179915.1 NC_091678.1 7137350 7774646 NO BP GCF_030179915.1 NC_091678.1 8531550 8563150 NO BP GCF_030179915.1 NC_091678.1 9120100 9233396 NO BP GCF_030179915.1 NC_091678.1 9591496 9616646 NO BP GCF_030179915.1 NC_091678.1 10153646 10344100 NO BP GCF_030179915.1 NC_091678.1 11980650 12047546 NO BP GCF_030179915.1 NC_091678.1 12327646 12578700 NO BP GCF_030179915.1 NC_091678.1 12828650 13084650 NO BP GCF_030179915.1 NC_091678.1 15372651 15448400 NO BP GCF_030179915.1 NC_091678.1 15710601 16076351 NO BP GCF_030179915.1 NC_091678.1 16809001 16978900 NO BP GCF_030179915.1 NC_091678.1 17716850 18071801 NO BP GCF_030179915.1 NC_091678.1 18255801 18746551 NO BP GCF_030179915.1 NC_091678.1 19364850 19529650 NO BP GCF_030179915.1 NC_091678.1 20044701 20283651 NO BP GCF_030179915.1 NC_091678.1 21193001 21223801 NO BP GCF_030179915.1 NC_091678.1 21865500 22213550 NO BP GCF_030179915.1 NC_091678.1 22995501 24095601 NO BP GCF_030179915.1 NC_091678.1 25477950 26323650 NO BP GCF_030179915.1 NC_091678.1 27351651 28180150 NO BP GCF_030179915.1 NC_091680.1 102680 354480 NO BP GCF_030179915.1 NC_091680.1 975900 1281430 NO BP GCF_030179915.1 NC_091680.1 2147130 2276080 NO BP GCF_030179915.1 NC_091680.1 3768750 4459800 NO BP GCF_030179915.1 NC_091680.1 5507350 6952350 NO BP GCF_030179915.1 NC_091680.1 7732100 7797180 NO BP GCF_030179915.1 NC_091680.1 7956930 8286700 NO BP GCF_030179915.1 NC_091680.1 8559700 8717150 NO BP GCF_030179915.1 NC_091680.1 9541000 9663350 NO BP GCF_030179915.1 NC_091680.1 9824430 9910550 NO BP GCF_030179915.1 NC_091680.1 11008730 11187900 NO BP GCF_030179915.1 NC_091680.1 11998550 12174880 NO BP GCF_030179915.1 NC_091680.1 13126230 13260400 NO BP GCF_030179915.1 NC_091680.1 13436430 13694350 NO BP GCF_030179915.1 NC_091680.1 15196050 15236330 NO BP GCF_030179915.1 NC_091680.1 15609780 15802530 NO BP GCF_030179915.1 NC_091680.1 16172650 16560950 NO BP GCF_030179915.1 NC_091680.1 17733707 18615476 NO BP GCF_030179915.1 NC_091680.1 19045680 19139630 NO BP GCF_030179915.1 NC_091680.1 19388830 19426549 NO BP GCF_030179915.1 NC_091680.1 20312549 20376580 NO BP GCF_030179915.1 NC_091680.1 20533649 20674649 NO BP GCF_030179915.1 NC_091680.1 22861199 23238449 NO BP GCF_030179915.1 NC_091680.1 24532999 24565180 NO BP GCF_030179915.1 NC_091680.1 25104949 25245280 NO BP GCF_030179915.1 NC_091680.1 25459680 26165080 NO BP GCF_030179915.1 NC_091680.1 26317880 26505930 NO BP GCF_030179915.1 NC_091680.1 27099499 27917530 NO BP
In [65]:
%%bash
# this output is arguably very useful: you can use it as a "synteny lense" and then run resource-intensive programs
# only on those regions to find or not find potential homologs/orthologs of interest instead of the whole genome
# e.g. lets discover more potential orthologs for the halo elements
# here we will:
# 1. write those sequences using the genomes which i put in out/utils
python3 extract_syntenic_fastas.py syntenic_regions_succinct ../../utils/genomes syn_regions
# 2. run a blast search (+ clasp chaining) with low word size and high e-value on them
cat halos/proteins/* > all_halos.faa
python3 run_blast_on_regions.py \
--fastas_dir syn_regions/ \
--output_dir syn_regions_blast_out \
--clasp_path ../../utils/clasp.x \
--cores 8 \
--evalue 0.1 \
--word_size 2 \
all_halos.faa tblastn
# 3. convert that into input for the synthology script
python3 blast_results_to_synthology.py \
--deduplicate_overlaps \
--genomes_dir ../../utils/genomes \
--output_dir halos_from_syn_regions \
--mode protein \
syn_regions_blast_out
GCF_000001215.4: 40 regions extracted GCF_016746365.2: 39 regions extracted GCF_018153835.1: 50 regions extracted GCF_004382195.2: 39 regions extracted GCF_018902025.1: 55 regions extracted GCF_017639315.1: 41 regions extracted GCF_009870125.1: 41 regions extracted GCF_016746395.2: 40 regions extracted GCF_030788295.1: 57 regions extracted GCF_030179895.1: 48 regions extracted GCF_030179915.1: 53 regions extracted 11 organisms with FASTA files Organisms: ['GCF_000001215.4', 'GCF_030788295.1', 'GCF_030179895.1', 'GCF_009870125.1', 'GCF_018902025.1', 'GCF_016746395.2', 'GCF_017639315.1', 'GCF_030179915.1', 'GCF_018153835.1', 'GCF_016746365.2', 'GCF_004382195.2'] Done 11 organisms with temp_coords files Deduplication of overlapping hits is ENABLED GCF_009870125.1: Deduplicated 721 hits to 16 (705 merged) GCF_009870125.1: 16 sequences extracted GCF_018902025.1: Deduplicated 677 hits to 14 (663 merged) GCF_018902025.1: 14 sequences extracted GCF_030179895.1: Deduplicated 705 hits to 15 (690 merged) GCF_030179895.1: 15 sequences extracted GCF_030788295.1: Deduplicated 1100 hits to 19 (1081 merged) GCF_030788295.1: 19 sequences extracted GCF_000001215.4: Deduplicated 747 hits to 13 (734 merged) GCF_000001215.4: 13 sequences extracted GCF_016746365.2: Deduplicated 748 hits to 12 (736 merged) GCF_016746365.2: 12 sequences extracted GCF_004382195.2: Deduplicated 748 hits to 12 (736 merged) GCF_004382195.2: 12 sequences extracted GCF_016746395.2: Deduplicated 748 hits to 12 (736 merged) GCF_016746395.2: 12 sequences extracted GCF_017639315.1: Deduplicated 752 hits to 13 (739 merged) GCF_017639315.1: 13 sequences extracted GCF_030179915.1: Deduplicated 689 hits to 15 (674 merged) GCF_030179915.1: 15 sequences extracted GCF_018153835.1: Deduplicated 699 hits to 13 (686 merged) GCF_018153835.1: 13 sequences extracted Done
In [69]:
%%bash
# now here we have input for run_synthology_prot.py
# those are all chained blast hits within those regions
ls halos_from_syn_regions/*
# you could run:
#python3 run_synthology_prot.py \
# --annotation_dir halos_from_syn_regions \
# with some other parameters like above defined to now investigate whether the official annotations missed any potential halo genes or pseudogenes
halos_from_syn_regions/gff: GCF_000001215.4.gff GCF_004382195.2.gff GCF_009870125.1.gff GCF_016746365.2.gff GCF_016746395.2.gff GCF_017639315.1.gff GCF_018153835.1.gff GCF_018902025.1.gff GCF_030179895.1.gff GCF_030179915.1.gff GCF_030788295.1.gff halos_from_syn_regions/proteins: GCF_000001215.4.faa GCF_004382195.2.faa GCF_009870125.1.faa GCF_016746365.2.faa GCF_016746395.2.faa GCF_017639315.1.faa GCF_018153835.1.faa GCF_018902025.1.faa GCF_030179895.1.faa GCF_030179915.1.faa GCF_030788295.1.faa
In [ ]:
%%bash
# lets turn to run_synthology_nucl.py
# this was developed for comparing nucleotide instead of protein sequences
# e.g. for RNAs this is very useful
# in trnas we have annotations of tRNAs for the Drosophilas
# those are typically hard to differentiate basd on sequence similarity alone and often appear in tandem arrays
# this will take ~1-2 hours on a normal desktop computer with 8 cores
# if you dont want to wait like me, the Zenodo archive contains the results as:
# syn_out_nucl_preloaded. i replaced the actual output "synthology_out_nucl" with that directory for the analysis below
python3 run_synthology_nucl.py \
--cores 8 \
--col ../MCScanX/MCScanX.collinearity \
--homology_threshold 40 \
--annotation_dir trnas \
--new_blast y \
--mapping_file ../../utils/mapping \
--pairwise_alignments ../../utils/pairwise_alignments_table \
--use-clasp y \
--blast-mode chains \
--blastn-path blastn \
--clasp-path ../../utils/clasp.x \
--output_dir synthology_out_nucl
In [93]:
%%bash
# there are hundreds of tRNAs per genome, so lets leave the detailed analysis to the experts
# here we can plot some overall statistics answering general questions like:
# how many genes are placed into a multi- or single-copy group and how many are singletons globally and per species
# (these scripts work generically also with run_synthology_prot.py output)
# 1. Compute element statistics (genes in synorthogroups)
python3 compute_element_stats.py syn_out_nucl_preloaded
# 2. Generate SVG plots (singletons, single-copy, multi-copy distributions)
python3 plot_element_stats.py syn_out_nucl_preloaded
# 3. Generate LaTeX tables
python3 make_element_tables.py syn_out_nucl_preloaded
cd syn_out_nucl_preloaded/element_tables
for tex_file in *.tex; do
pdflatex -interaction=nonstopmode "$tex_file"
done
Loading final union graph...
Loaded graph with 3970 nodes and 12828 edges
Gene-species mapping: 4477 genes
Genes filtered out during pipeline: 507
Run trace_filtered_genes.py first for detailed filtering analysis
Species: 11
Connected components: 366
============================================================
GLOBAL ELEMENT STATISTICS
============================================================
Total elements in input: 4477
Filtered during pipeline: 507 (11.3%)
In final graph: 3970 (88.7%)
Classification of 3970 elements in final graph:
Singletons: 59 (1.5%)
In single-copy groups: 945 (23.8%)
In multi-copy groups: 2966 (74.7%)
Within-species duplicates*: 2256
Orthology groups:
Single-copy groups: 133
Multi-copy groups: 174
============================================================
PER-SPECIES ELEMENT STATISTICS
============================================================
GCF_000001215.4:
Total in input: 374
Filtered: 22
In final graph: 352
Singletons: 3 (0.9%)
In single-copy groups: 81 (23.0%)
In multi-copy groups: 268 (76.1%)
- Self-duplicated*: 204
Top ortholog partners:
- GCF_030179915.1: 309 edges
- GCF_030179895.1: 260 edges
- GCF_004382195.2: 245 edges
GCF_004382195.2:
Total in input: 384
In final graph: 384
Singletons: 2 (0.5%)
In single-copy groups: 94 (24.5%)
In multi-copy groups: 288 (75.0%)
- Self-duplicated*: 219
Top ortholog partners:
- GCF_030179915.1: 337 edges
- GCF_030179895.1: 271 edges
- GCF_009870125.1: 261 edges
GCF_009870125.1:
Total in input: 424
Filtered: 63
In final graph: 361
Singletons: 6 (1.7%)
In single-copy groups: 83 (23.0%)
In multi-copy groups: 272 (75.3%)
- Self-duplicated*: 199
Top ortholog partners:
- GCF_030179895.1: 271 edges
- GCF_016746365.2: 264 edges
- GCF_004382195.2: 261 edges
GCF_016746365.2:
Total in input: 417
Filtered: 24
In final graph: 393
Singletons: 5 (1.3%)
In single-copy groups: 88 (22.4%)
In multi-copy groups: 300 (76.3%)
- Self-duplicated*: 235
Top ortholog partners:
- GCF_030179915.1: 330 edges
- GCF_016746395.2: 292 edges
- GCF_030179895.1: 279 edges
GCF_016746395.2:
Total in input: 383
In final graph: 383
Singletons: 0 (0.0%)
In single-copy groups: 94 (24.5%)
In multi-copy groups: 289 (75.5%)
- Self-duplicated*: 229
Top ortholog partners:
- GCF_030179915.1: 325 edges
- GCF_016746365.2: 292 edges
- GCF_030179895.1: 267 edges
GCF_017639315.1:
Total in input: 509
Filtered: 138
In final graph: 371
Singletons: 5 (1.3%)
In single-copy groups: 97 (26.1%)
In multi-copy groups: 269 (72.5%)
- Self-duplicated*: 201
Top ortholog partners:
- GCF_030179895.1: 291 edges
- GCF_030179915.1: 285 edges
- GCF_016746365.2: 252 edges
GCF_018153835.1:
Total in input: 377
Filtered: 13
In final graph: 364
Singletons: 3 (0.8%)
In single-copy groups: 91 (25.0%)
In multi-copy groups: 270 (74.2%)
- Self-duplicated*: 212
Top ortholog partners:
- GCF_030179915.1: 293 edges
- GCF_016746365.2: 251 edges
- GCF_004382195.2: 238 edges
GCF_018902025.1:
Total in input: 392
Filtered: 115
In final graph: 277
Singletons: 13 (4.7%)
In single-copy groups: 54 (19.5%)
In multi-copy groups: 210 (75.8%)
- Self-duplicated*: 157
Top ortholog partners:
- GCF_030179915.1: 214 edges
- GCF_004382195.2: 200 edges
- GCF_016746395.2: 197 edges
GCF_030179895.1:
Total in input: 404
Filtered: 22
In final graph: 382
Singletons: 8 (2.1%)
In single-copy groups: 92 (24.1%)
In multi-copy groups: 282 (73.8%)
- Self-duplicated*: 213
Top ortholog partners:
- GCF_017639315.1: 291 edges
- GCF_030179915.1: 285 edges
- GCF_016746365.2: 279 edges
GCF_030179915.1:
Total in input: 422
Filtered: 12
In final graph: 410
Singletons: 3 (0.7%)
In single-copy groups: 96 (23.4%)
In multi-copy groups: 311 (75.9%)
- Self-duplicated*: 254
Top ortholog partners:
- GCF_004382195.2: 337 edges
- GCF_016746365.2: 330 edges
- GCF_016746395.2: 325 edges
GCF_030788295.1:
Total in input: 391
Filtered: 98
In final graph: 293
Singletons: 11 (3.8%)
In single-copy groups: 75 (25.6%)
In multi-copy groups: 207 (70.6%)
- Self-duplicated*: 133
Top ortholog partners:
- GCF_030179915.1: 223 edges
- GCF_030179895.1: 213 edges
- GCF_016746365.2: 206 edges
Saved element statistics to syn_out_nucl_preloaded/genetic_elements_stats/element_statistics.pickle
Loading element statistics...
Generating global element classification plot...
Saved global classification plot
Generating per-species stacked bar plot...
Saved per-species stacked bar plot
Generating per-species subplot grid...
Saved per-species grid plot
Generating species connectivity histograms...
Saved species connectivity histogram
Generating top ortholog partners plot...
Saved top partners plot
All plots saved to syn_out_nucl_preloaded/genetic_elements_stats/plots/
Loading element statistics...
Saved global summary table: syn_out_nucl_preloaded/genetic_elements_stats/tables/global_element_summary.tex
Saved per-species summary table: syn_out_nucl_preloaded/genetic_elements_stats/tables/per_species_element_summary.tex
Saved top partners table: syn_out_nucl_preloaded/genetic_elements_stats/tables/per_species_top_partners.tex
All tables saved to syn_out_nucl_preloaded/genetic_elements_stats/tables/
This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) (preloaded format=pdflatex)
restricted \write18 enabled.
entering extended mode
(./global_element_summary.tex
LaTeX2e <2023-11-01>
L3 programming layer <2023-11-01>
(/usr/local/texlive/2022/texmf-dist/tex/latex/base/article.cls
Document Class: article 2023/05/17 v1.4n Standard LaTeX document class
(/usr/local/texlive/2022/texmf-dist/tex/latex/base/size11.clo))
(/usr/local/texlive/2022/texmf-dist/tex/latex/base/inputenc.sty)
(/usr/local/texlive/2022/texmf-dist/tex/latex/booktabs/booktabs.sty)
(/usr/local/texlive/2022/texmf-dist/tex/latex/geometry/geometry.sty
(/usr/local/texlive/2022/texmf-dist/tex/latex/graphics/keyval.sty)
(/usr/local/texlive/2022/texmf-dist/tex/generic/iftex/ifvtex.sty
(/usr/local/texlive/2022/texmf-dist/tex/generic/iftex/iftex.sty)))
(/usr/local/texlive/2022/texmf-dist/tex/latex/l3backend/l3backend-pdftex.def)
(./global_element_summary.aux)
*geometry* driver: auto-detecting
*geometry* detected driver: pdftex
[1{/usr/local/texlive/2022/texmf-var/fonts/map/pdftex/updmap/pdftex.map}]
(./global_element_summary.aux) )</usr/local/texlive/2022/texmf-dist/fonts/type1
/public/amsfonts/cm/cmr10.pfb>
Output written on global_element_summary.pdf (1 page, 19069 bytes).
Transcript written on global_element_summary.log.
This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) (preloaded format=pdflatex)
restricted \write18 enabled.
entering extended mode
(./per_species_element_summary.tex
LaTeX2e <2023-11-01>
L3 programming layer <2023-11-01>
(/usr/local/texlive/2022/texmf-dist/tex/latex/base/article.cls
Document Class: article 2023/05/17 v1.4n Standard LaTeX document class
(/usr/local/texlive/2022/texmf-dist/tex/latex/base/size11.clo))
(/usr/local/texlive/2022/texmf-dist/tex/latex/base/inputenc.sty)
(/usr/local/texlive/2022/texmf-dist/tex/latex/booktabs/booktabs.sty)
(/usr/local/texlive/2022/texmf-dist/tex/latex/geometry/geometry.sty
(/usr/local/texlive/2022/texmf-dist/tex/latex/graphics/keyval.sty)
(/usr/local/texlive/2022/texmf-dist/tex/generic/iftex/ifvtex.sty
(/usr/local/texlive/2022/texmf-dist/tex/generic/iftex/iftex.sty)))
(/usr/local/texlive/2022/texmf-dist/tex/latex/l3backend/l3backend-pdftex.def)
(./per_species_element_summary.aux)
*geometry* driver: auto-detecting
*geometry* detected driver: pdftex
[1{/usr/local/texlive/2022/texmf-var/fonts/map/pdftex/updmap/pdftex.map}]
(./per_species_element_summary.aux) )</usr/local/texlive/2022/texmf-dist/fonts/
type1/public/amsfonts/cm/cmr10.pfb>
Output written on per_species_element_summary.pdf (1 page, 19430 bytes).
Transcript written on per_species_element_summary.log.
This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) (preloaded format=pdflatex)
restricted \write18 enabled.
entering extended mode
(./per_species_top_partners.tex
LaTeX2e <2023-11-01>
L3 programming layer <2023-11-01>
(/usr/local/texlive/2022/texmf-dist/tex/latex/base/article.cls
Document Class: article 2023/05/17 v1.4n Standard LaTeX document class
(/usr/local/texlive/2022/texmf-dist/tex/latex/base/size11.clo))
(/usr/local/texlive/2022/texmf-dist/tex/latex/base/inputenc.sty)
(/usr/local/texlive/2022/texmf-dist/tex/latex/booktabs/booktabs.sty)
(/usr/local/texlive/2022/texmf-dist/tex/latex/geometry/geometry.sty
(/usr/local/texlive/2022/texmf-dist/tex/latex/graphics/keyval.sty)
(/usr/local/texlive/2022/texmf-dist/tex/generic/iftex/ifvtex.sty
(/usr/local/texlive/2022/texmf-dist/tex/generic/iftex/iftex.sty)))
(/usr/local/texlive/2022/texmf-dist/tex/latex/l3backend/l3backend-pdftex.def)
(./per_species_top_partners.aux)
*geometry* driver: auto-detecting
*geometry* detected driver: pdftex
[1{/usr/local/texlive/2022/texmf-var/fonts/map/pdftex/updmap/pdftex.map}]
(./per_species_top_partners.aux) )</usr/local/texlive/2022/texmf-dist/fonts/typ
e1/public/amsfonts/cm/cmr10.pfb>
Output written on per_species_top_partners.pdf (1 page, 17962 bytes).
Transcript written on per_species_top_partners.log.
In [94]:
# as can be seen in the plots and tables below, a significant portion of tRNAs can be placed into
# synteny chains and putative synorthogroups
from IPython.display import SVG, display, IFrame
import glob
# Display SVG plots
plot_dir = 'syn_out_nucl_preloaded/element_plots'
for svg_file in sorted(glob.glob(f'{plot_dir}/*.svg')):
print(f"\n{'='*80}")
print(svg_file.split('/')[-1])
print('='*80)
display(SVG(filename=svg_file))
================================================================================ global_element_classification.svg ================================================================================
================================================================================ per_species_classification_grid.svg ================================================================================
================================================================================ per_species_classification_stacked.svg ================================================================================
================================================================================ per_species_connectivity_histogram.svg ================================================================================
================================================================================ per_species_top_partners.svg ================================================================================
In [ ]: