<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Wilians Conde</title>
    <description>The latest articles on Forem by Wilians Conde (@wilians_conde_6d4bbc5eed2).</description>
    <link>https://forem.com/wilians_conde_6d4bbc5eed2</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3881374%2F40ab2911-6de7-4298-bcec-279704dd3108.png</url>
      <title>Forem: Wilians Conde</title>
      <link>https://forem.com/wilians_conde_6d4bbc5eed2</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/wilians_conde_6d4bbc5eed2"/>
    <language>en</language>
    <item>
      <title>Processing High Frequency Solar Data Without HPC: Real Constraints and Design Decisions in MackSun</title>
      <dc:creator>Wilians Conde</dc:creator>
      <pubDate>Thu, 16 Apr 2026 01:58:29 +0000</pubDate>
      <link>https://forem.com/wilians_conde_6d4bbc5eed2/processing-high-frequency-solar-data-without-hpc-real-constraints-and-design-decisions-in-macksun-3ikf</link>
      <guid>https://forem.com/wilians_conde_6d4bbc5eed2/processing-high-frequency-solar-data-without-hpc-real-constraints-and-design-decisions-in-macksun-3ikf</guid>
      <description>&lt;p&gt;Solar activity directly impacts Earth, from GPS accuracy to power systems.&lt;/p&gt;

&lt;p&gt;MackSun was designed to process billions of high frequency solar data points under strict hardware constraints, without relying on HPC infrastructure.&lt;/p&gt;

&lt;p&gt;The platform is available at:&lt;br&gt;
&lt;a href="https://www.macksun.org" rel="noopener noreferrer"&gt;https://www.macksun.org&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The problem&lt;/p&gt;

&lt;p&gt;Instruments such as POEMAS (&lt;a href="https://www.macksun.org/pages/wiki/arquivos-telescopios.html" rel="noopener noreferrer"&gt;https://www.macksun.org/pages/wiki/arquivos-telescopios.html&lt;/a&gt;) operate with acquisition intervals around 10 milliseconds. This enables detailed analysis of solar activity, but also produces a continuous stream of data.&lt;/p&gt;

&lt;p&gt;This creates a set of concrete challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;continuous ingestion under load&lt;/li&gt;
&lt;li&gt;long term storage of billions of records&lt;/li&gt;
&lt;li&gt;memory and IO limitations&lt;/li&gt;
&lt;li&gt;processing under constant pressure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In most scenarios, this would require distributed systems or HPC clusters. Here, the system had to work without that.&lt;/p&gt;

&lt;p&gt;Data origin&lt;/p&gt;

&lt;p&gt;The data used in MackSun is not synthetic.&lt;/p&gt;

&lt;p&gt;It comes from real solar observation instruments located in South America, operated at the CASLEO observatory in Argentina.&lt;/p&gt;

&lt;p&gt;These instruments are managed by CRAAM, part of Mackenzie Presbyterian University in Brazil.&lt;/p&gt;

&lt;p&gt;This matters because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;data is generated under real observational conditions&lt;/li&gt;
&lt;li&gt;acquisition is continuous and subject to physical constraints&lt;/li&gt;
&lt;li&gt;system behavior is influenced by real hardware&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not a controlled environment. It is a live acquisition scenario.&lt;/p&gt;

&lt;p&gt;Infrastructure limits&lt;/p&gt;

&lt;p&gt;The system runs under a constrained but well defined setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;single Linux server&lt;/li&gt;
&lt;li&gt;16 vCPU&lt;/li&gt;
&lt;li&gt;32 GB of RAM in total&lt;/li&gt;
&lt;li&gt;4 GB reserved for the operating system&lt;/li&gt;
&lt;li&gt;16 GB allocated to MongoDB running in sharded mode&lt;/li&gt;
&lt;li&gt;12 GB allocated to the ingestion pipeline container&lt;/li&gt;
&lt;/ul&gt;
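&lt;p&gt;As a rough illustration of running a sharded deployment on one machine (ports, paths, replica set names and the cache cap below are invented for the sketch, not MackSun's actual configuration), the layout looks like this:&lt;/p&gt;

```shell
# Illustrative single-server sharded MongoDB layout (not MackSun's real config).
# Since MongoDB 3.4, config servers and shards must run as replica sets,
# even when every member lives on the same host.

# Config server replica set
mongod --configsvr --replSet cfgrs --port 27019 \
       --dbpath /data/cfg --fork --logpath /data/cfg/mongod.log

# One shard, with the WiredTiger cache capped so the total stays
# inside the memory budget reserved for MongoDB
mongod --shardsvr --replSet shard0 --port 27018 \
       --wiredTigerCacheSizeGB 8 \
       --dbpath /data/shard0 --fork --logpath /data/shard0/mongod.log

# Query router in front of both
mongos --configdb cfgrs/localhost:27019 --port 27017 \
       --fork --logpath /data/mongos.log

# Each replica set still needs rs.initiate() via mongosh, and the shard
# must be registered with sh.addShard("shard0/localhost:27018").
```

&lt;p&gt;Capping the WiredTiger cache per shard is what makes a fixed MongoDB memory budget enforceable in practice.&lt;/p&gt;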

&lt;p&gt;The MongoDB allocation is not arbitrary. It was defined based on limits observed during experimental validation.&lt;/p&gt;

&lt;p&gt;Even on a single machine, MongoDB showed better performance in sharded mode. This is not an assumption: it was experimentally validated, and the results were published in Astronomy and Computing:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.sciencedirect.com/science/article/pii/S221313372500126X" rel="noopener noreferrer"&gt;https://www.sciencedirect.com/science/article/pii/S221313372500126X&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These limits are enforced. The system is designed to operate within them.&lt;/p&gt;

&lt;p&gt;Data scale&lt;/p&gt;

&lt;p&gt;The current volume is around:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3 billion data points&lt;/li&gt;
&lt;li&gt;continuous ingestion from solar instruments&lt;/li&gt;
&lt;li&gt;original data at high frequency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this scale, uncontrolled growth leads to instability.&lt;/p&gt;

&lt;p&gt;The system must control:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;memory usage&lt;/li&gt;
&lt;li&gt;write patterns&lt;/li&gt;
&lt;li&gt;data organization&lt;/li&gt;
&lt;li&gt;query behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Partitioning strategy&lt;/p&gt;

&lt;p&gt;The system enforces a strict limit:&lt;/p&gt;

&lt;p&gt;about 150 million data points per collection&lt;/p&gt;

&lt;p&gt;Beyond this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;performance degrades&lt;/li&gt;
&lt;li&gt;queries slow down&lt;/li&gt;
&lt;li&gt;memory pressure increases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data is therefore split across multiple collections.&lt;/p&gt;

&lt;p&gt;This is required for stability.&lt;/p&gt;
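&lt;p&gt;The routing rule above can be sketched in a few lines. The collection naming scheme is hypothetical; only the 150 million cap comes from the article.&lt;/p&gt;

```python
# Hypothetical sketch: route each record to a bounded collection so no
# collection ever exceeds the cap. Collection names are illustrative.

CAP = 150_000_000  # max data points per collection, per the article


def collection_for(global_index: int, prefix: str = "poemas", cap: int = CAP) -> str:
    """Return the name of the collection that should hold the record
    with the given global sequence index."""
    return f"{prefix}_{global_index // cap:04d}"
```

&lt;p&gt;With this scheme a writer never needs to check collection sizes at runtime; the target collection is a pure function of the record's position in the stream.&lt;/p&gt;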

&lt;p&gt;Ingestion model&lt;/p&gt;

&lt;p&gt;The ingestion process is not real time.&lt;/p&gt;

&lt;p&gt;It runs as a sequential pipeline with five stages, executed once per day.&lt;/p&gt;

&lt;p&gt;This approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;avoids continuous load pressure&lt;/li&gt;
&lt;li&gt;keeps resource usage predictable&lt;/li&gt;
&lt;li&gt;simplifies failure handling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We chose batch processing over real time ingestion. This sacrifices latency, but guarantees stability.&lt;/p&gt;
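&lt;p&gt;A sequential daily pipeline of this kind can be sketched as follows. The article only states that there are five stages run once per day; the stage names here are hypothetical.&lt;/p&gt;

```python
# Illustrative sketch of a sequential daily batch pipeline.
# Stage names are hypothetical; the article only specifies five stages
# executed once per day, strictly in order.
from datetime import date


def run_daily_pipeline(day: date, stages) -> list[str]:
    """Run (name, callable) stages strictly in order; stop at the first
    failure so a partial day can be re-run from scratch."""
    completed = []
    for name, stage in stages:
        try:
            stage(day)
        except Exception as err:
            raise RuntimeError(f"stage {name!r} failed for {day}: {err}") from err
        completed.append(name)
    return completed


# Hypothetical stage list for one observation day:
STAGES = [
    ("fetch", lambda day: None),       # pull raw instrument files
    ("decode", lambda day: None),      # parse binary records
    ("validate", lambda day: None),    # drop corrupt samples
    ("partition", lambda day: None),   # write to bounded collections
    ("precompute", lambda day: None),  # build the daily dataset
]
```

&lt;p&gt;Because the stages run sequentially, peak resource usage is that of one stage at a time, which is what keeps the daily run predictable.&lt;/p&gt;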

&lt;p&gt;Precomputed datasets&lt;/p&gt;

&lt;p&gt;On demand processing is not viable under these constraints.&lt;/p&gt;

&lt;p&gt;One day of observation generates around 5 million data points.&lt;/p&gt;

&lt;p&gt;Processing this during a request would:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;increase latency&lt;/li&gt;
&lt;li&gt;consume too much memory&lt;/li&gt;
&lt;li&gt;destabilize the system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system generates daily datasets in advance.&lt;/p&gt;

&lt;p&gt;Each dataset is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;processed&lt;/li&gt;
&lt;li&gt;consolidated&lt;/li&gt;
&lt;li&gt;stored in a ready to serve format&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Datasets are available at:&lt;br&gt;
&lt;a href="https://www.macksun.org" rel="noopener noreferrer"&gt;https://www.macksun.org&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Structure and format are documented here:&lt;br&gt;
&lt;a href="https://www.macksun.org/pages/wiki/arquivos-telescopios.html" rel="noopener noreferrer"&gt;https://www.macksun.org/pages/wiki/arquivos-telescopios.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We chose precomputed datasets instead of on demand processing. This reduces flexibility, but ensures consistent performance.&lt;/p&gt;
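&lt;p&gt;The core of precomputation is collapsing high-frequency samples into ready-to-serve rows ahead of any request. A minimal sketch, assuming simple fixed-width time bins (the bin width, cadence and output layout are illustrative, not MackSun's actual format):&lt;/p&gt;

```python
# Hedged sketch of precomputation: average raw high-frequency samples
# into fixed time bins once, offline, so requests only read ready rows.
# Bin width and cadence below are illustrative.


def precompute(samples, bin_ms: int = 1000, cadence_ms: int = 10):
    """Average raw samples (one value every `cadence_ms`) into
    `bin_ms`-wide bins. Returns a list of (bin_start_ms, mean) rows."""
    per_bin = bin_ms // cadence_ms
    rows = []
    for i in range(0, len(samples), per_bin):
        chunk = samples[i:i + per_bin]
        rows.append((i * cadence_ms, sum(chunk) / len(chunk)))
    return rows
```

&lt;p&gt;A request then reads the precomputed rows directly, so its cost is independent of the raw acquisition rate.&lt;/p&gt;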

&lt;p&gt;Trade offs&lt;/p&gt;

&lt;p&gt;This architecture makes explicit decisions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real time vs stability: no real time processing, in exchange for predictable execution&lt;/li&gt;
&lt;li&gt;Flexibility vs predictability: no arbitrary queries over raw data, in exchange for structured access through prepared datasets&lt;/li&gt;
&lt;li&gt;Infrastructure vs engineering: no hardware scaling, in exchange for more control over data and processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We chose sharding on a single server. This is not the typical approach, but it was experimentally validated.&lt;/p&gt;

&lt;p&gt;We chose precomputation instead of real time processing. This reduces flexibility, but guarantees stability.&lt;/p&gt;

&lt;p&gt;Why this works&lt;/p&gt;

&lt;p&gt;The system works because it enforces limits.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;collections are bounded&lt;/li&gt;
&lt;li&gt;memory usage is controlled&lt;/li&gt;
&lt;li&gt;ingestion and access are separated&lt;/li&gt;
&lt;li&gt;heavy processing is done in advance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of relying on infrastructure scaling, the system relies on controlled behavior.&lt;/p&gt;
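&lt;p&gt;One concrete form of controlled behavior is never materializing a full collection in memory. A sketch of bounded, batch-wise iteration (the batch size is illustrative, and the function works over any iterator, including a database cursor):&lt;/p&gt;

```python
# Sketch of bounded iteration: stream records in fixed-size batches so
# peak memory is proportional to the batch size, not the collection size.
from itertools import islice


def bounded_batches(cursor, batch_size: int = 50_000):
    """Yield records from any iterator in fixed-size batches."""
    it = iter(cursor)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

&lt;p&gt;Every stage that scans data can be written against this pattern, which is how memory usage stays inside a fixed budget regardless of how many records accumulate.&lt;/p&gt;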

&lt;p&gt;Final thoughts&lt;/p&gt;

&lt;p&gt;MackSun shows that it is possible to process billions of records without HPC, but only if constraints are treated as part of the design.&lt;/p&gt;

&lt;p&gt;This requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;strict partitioning&lt;/li&gt;
&lt;li&gt;controlled ingestion&lt;/li&gt;
&lt;li&gt;precomputed outputs&lt;/li&gt;
&lt;li&gt;disciplined resource usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Explore the datasets and see how MackSun handles billions of records under constrained hardware:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.macksun.org" rel="noopener noreferrer"&gt;https://www.macksun.org&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>mongodb</category>
      <category>systemdesign</category>
      <category>bigdata</category>
    </item>
  </channel>
</rss>
